Identification of Publications on Disordered Proteins from PubMed

dc.contributor.advisor: Xia, Yuni
dc.contributor.author: Sirisha, Peyyeti
dc.contributor.other: Dunker, A. Keith
dc.contributor.other: Chen, Jake
dc.date.accessioned: 2012-08-07T14:54:58Z
dc.date.available: 2012-08-07T14:54:58Z
dc.date.issued: 2012-08-07
dc.degree.date: 2011 (en_US)
dc.degree.discipline: Computer & Information Science (en)
dc.degree.grantor: Purdue University (en_US)
dc.degree.level: M.S. (en_US)
dc.description: Indiana University-Purdue University Indianapolis (IUPUI) (en_US)
dc.description.abstract: The literature on disordered proteins has been on the rise. As the number of publications increases, the time and effort needed to manually identify the relevant publications and the protein information to add to the centralized repository (called DisProt) are becoming arduous, and the task is becoming critical. Existing search facilities on PubMed can retrieve a large number of publications based on keywords, but they do not support ranking those publications by the probability that the protein names mentioned in a given abstract will be added to DisProt. This thesis explores a novel system that uses disorder predictors and context-based dictionary methods to quickly identify publications on disordered proteins in the PubMed database. NLProt, which is built around Support Vector Machines, is used to identify protein names, and PONDR-FIT, an Artificial Neural Network based meta-predictor, is used to identify protein disorder. The work done in this thesis is of immediate significance for identifying disordered protein names. We tested the new system on 100 abstracts from DisProt (abstracts that annotators had found to be relevant to disordered proteins and had added to DisProt manually). The system had an accuracy of 87% on this test set. We then took another 100 recently added abstracts from PubMed and ran our algorithm on them; this time it had an accuracy of 68%. We suggest improvements to increase the accuracy and believe that this system can be applied to identifying disordered proteins from the literature. (en_US)
dc.identifier.uri: https://hdl.handle.net/1805/2885
dc.identifier.uri: http://dx.doi.org/10.7912/C2/2292
dc.language.iso: en_US (en_US)
dc.subject: DisProt, Database, Software Tool (en_US)
dc.subject.lcsh: Proteins -- Analysis (en_US)
dc.subject.lcsh: Bioinformatics (en_US)
dc.subject.lcsh: Database searching (en_US)
dc.subject.lcsh: Artificial intelligence -- Biological applications (en_US)
dc.subject.lcsh: Genomics -- Data processing (en_US)
dc.subject.lcsh: Health -- Computer network resources -- Directories (en_US)
dc.subject.lcsh: National Institutes of Health (U.S.). PubMed Central (en_US)
dc.title: Identification of Publications on Disordered Proteins from PubMed (en_US)
Files
Original bundle
Name: Sirisha_Thesis.pdf
Size: 1.83 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.88 KB
Format: Item-specific license agreed upon to submission