Biomedical concept association and clustering using word embeddings

dc.contributor.advisor: Luo, Xiao
dc.contributor.advisor: El-Sharkawy, Mohamed
dc.contributor.author: Shah, Setu
dc.contributor.other: King, Brian
dc.date.accessioned: 2018-12-05T21:22:53Z
dc.date.available: 2018-12-05T21:22:53Z
dc.date.issued: 2018-12
dc.degree.date: 2018 [en_US]
dc.degree.discipline: Electrical & Computer Engineering [en]
dc.degree.grantor: Purdue University [en_US]
dc.degree.level: M.S. [en_US]
dc.description: Indiana University-Purdue University Indianapolis (IUPUI) [en_US]
dc.description.abstract: Biomedical data exists in the form of journal articles, research studies, electronic health records, care guidelines, etc. While text mining and natural language processing tools have been widely employed across various domains, they are only beginning to gain traction in healthcare. A primary hurdle in building artificial intelligence models that use biomedical data is the limited amount of labelled data available. Since most models rely on supervised or semi-supervised methods, generating large amounts of pre-processed labelled data for training becomes extremely costly. Even for datasets that are labelled, the lack of normalization of biomedical concepts further degrades the quality of the results and limits application to a restricted dataset. This affects the reproducibility of results and techniques across datasets, making it difficult to deploy research solutions to improve healthcare services. The research presented in this thesis focuses on reducing the need to create labels for biomedical text mining by using unsupervised recurrent neural networks. The proposed method utilizes word embeddings to generate vector representations of biomedical concepts based on semantics and context. Experiments with unsupervised clustering of these biomedical concepts show that concepts that are similar to each other are clustered together. This clustering captures not only different synonyms of the same concept but also the similarities between various diseases and the symptoms associated with those diseases. To test the performance of the concept vectors on corpora of documents, a document vector generation method that utilizes these concept vectors is also proposed.
The document vectors thus generated are used as input to clustering algorithms, and the results show that, across multiple corpora, the proposed methods of concept and document vector generation outperform the baselines and provide more meaningful clustering. This document clustering has broad applications, especially in search and retrieval, where it can give clinicians, researchers and patients more holistic and comprehensive results than relying only on the exact term they search for. Finally, a framework is presented for extracting clinical information from preventive care guidelines that can be mapped to electronic health records. The extracted information can be integrated with the clinical decision support system of an electronic health record. A visualization tool to better understand and observe patient trajectories is also explored. Both of these methods have the potential to improve the preventive care services provided to patients. [en_US]
dc.identifier.uri: https://hdl.handle.net/1805/17918
dc.identifier.uri: http://dx.doi.org/10.7912/C2/2468
dc.language.iso: en_US [en_US]
dc.rights: Attribution 3.0 United States
dc.rights.uri: http://creativecommons.org/licenses/by/3.0/us/
dc.subject: Artificial intelligence [en_US]
dc.subject: Natural language processing [en_US]
dc.subject: Document clustering [en_US]
dc.subject: Preventive care [en_US]
dc.subject: Word embeddings [en_US]
dc.subject: Biomedical science [en_US]
dc.title: Biomedical concept association and clustering using word embeddings [en_US]
dc.type: Thesis [en]
Files
Original bundle
Name: Thesis v4.4.pdf
Size: 7.8 MB
Format: Adobe Portable Document Format
Description: Thesis
License bundle
Name: license.txt
Size: 1.99 KB
Format: Item-specific license agreed upon to submission