Roles for Text Mining in Protein Function Prediction

Verspoor, Karin M.

doi:10.1007/978-1-4939-0709-0_6

Part of the book series: Methods in Molecular Biology (MIMB, volume 1159)

Abstract

The Human Genome Project has provided science with a hugely valuable resource: the blueprints for life; the specification of all of the genes that make up a human. While the genes have all been identified and deciphered, it is proteins that are the workhorses of the human body: they are essential to virtually all cell functions and are the primary mechanism through which biological function is carried out. Hence, in order to fully understand what happens at a molecular level in biological organisms, and eventually to enable development of treatments for diseases where some aspect of a biological system goes awry, we must understand the functions of proteins. However, experimental characterization of protein function cannot scale to the vast amount of DNA sequence data now available. Computational protein function prediction has therefore emerged as a problem at the forefront of modern biology (Radivojac et al., Nat Methods 10(3):221–227, 2013).

Within the varied approaches to computational protein function prediction that have been explored, there are several that make use of biomedical literature mining. These methods take advantage of information in the published literature to associate specific proteins with specific protein functions. In this chapter, we introduce two main strategies for doing this: association of function terms, represented as Gene Ontology terms (Ashburner et al., Nat Genet 25(1):25–29, 2000), to proteins based on information in published articles, and a paradigm called LEAP-FS (Literature-Enhanced Automated Prediction of Functional Sites) in which literature mining is used to validate the predictions of an orthogonal computational protein function prediction method.
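As a rough illustration of the first strategy, the sketch below counts sentence-level co-mentions of protein names and Gene Ontology term labels. It is a minimal toy under stated assumptions, not the method described in the chapter: the two-entry GO lexicon, the protein list, the example text, and the function names `split_sentences` and `co_mentions` are all hypothetical, and matching is plain string lookup rather than a real concept recognizer.

```python
import re
from collections import Counter

# Toy lexicons, assumed for illustration only (not taken from the chapter).
GO_TERMS = {
    "GO:0016301": "kinase activity",
    "GO:0003677": "DNA binding",
}
PROTEINS = ["CDK2", "TP53"]


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def co_mentions(text, proteins=PROTEINS, go_terms=GO_TERMS):
    """Count (protein, GO id) pairs that occur together in the same sentence."""
    counts = Counter()
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        # Whole-word match for protein names, substring match for GO labels.
        found_proteins = [
            p for p in proteins
            if re.search(r"\b" + re.escape(p.lower()) + r"\b", lowered)
        ]
        found_terms = [
            go_id for go_id, label in go_terms.items()
            if label.lower() in lowered
        ]
        for protein in found_proteins:
            for go_id in found_terms:
                counts[(protein, go_id)] += 1
    return counts


# Hypothetical text standing in for a published abstract.
abstract = (
    "CDK2 shows strong kinase activity in dividing cells. "
    "TP53 is known for DNA binding. "
    "Loss of TP53 DNA binding was observed, while CDK2 was unaffected."
)
result = co_mentions(abstract)
for (protein, go_id), n in sorted(result.items()):
    print(protein, go_id, n)
```

In this toy input the third sentence also pairs CDK2 with the DNA binding term, although the sentence does not assert that function for CDK2. Raw co-mention counts are therefore noisy evidence, and they would normally be filtered or combined with other signals before being treated as a functional annotation.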

References

Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, Graim K, Funk C, Verspoor K, Ben-Hur A et al (2013) A large-scale evaluation of computational protein function prediction. Nat Methods 10(3):221–227
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al (2000) Gene ontology: tool for the unification of biology. Nat Genet 25(1):25–29
Blaschke C, Valencia A (2013) The Functional Genomics Network in the evolution of biological text mining over the past decade. N Biotechnol 30(3):278–285
Valencia A (2005) Automatic annotation of protein function. Curr Opin Struct Biol 15(3):267–274
Baumgartner WA Jr, Cohen KB, Fox L, Acquaah-Mensah GK, Hunter L (2007) Manual curation is not sufficient for annotation of genomic databases. Bioinformatics 23:i41–i48
Friedberg I (2006) Automated protein function prediction—the genomic challenge. Brief Bioinform 7(3):225–242
Blaschke C, Leon E, Krallinger M, Valencia A (2005) Evaluation of BioCreAtIvE assessment of task 2. BMC Bioinformatics 6(Suppl 1):S16
Maguitman AG, Rechtsteiner A, Verspoor K, Strauss CE, Rocha LM (2006) Large-scale testing of bibliome informatics using Pfam protein families. Pac Symp Biocomput 76–87
Shatkay H, Hoglund A, Brady S, Blum T, Donnes P, Kohlbacher O (2007) SherLoc: high-accuracy prediction of protein subcellular localization by integrating text and protein sequence data. Bioinformatics 23(11):1410–1417
Verspoor CM, Joslyn C, Papcun GJ (2003) The gene ontology as a source of lexical semantic knowledge for a biological natural language processing application. In: SIGIR workshop on Text Analysis and Search for Bioinformatics, 51–56
Funk C, Baumgartner W, Garcia B, Roeder C, Bada M, Cohen KB, Hunter LE, Verspoor K (2014) Large-scale biomedical concept recognition: an evaluation of current automatic annotators and their parameters. BMC Bioinformatics 15:59. doi: 10.1186/1471-2105-15-59
Wong A, Shatkay H (2013) Protein function prediction using text-based features extracted from the biomedical literature: the CAFA challenge. BMC Bioinformatics 14(Suppl 3):S14
Krallinger M, Padron M, Valencia A (2005) A sentence sliding window approach to extract protein annotations from biomedical articles. BMC Bioinformatics 6(Suppl 1)
Couto FM, Silva MJ, Coutinho PM (2005) Finding genomic ontology terms in text using evidence content. BMC Bioinformatics 6(Suppl 1)
Ray S, Craven M (2005) Learning statistical models for annotating proteins with function information using biomedical text. BMC Bioinformatics 6(Suppl 1):S18
Verspoor K, Cohn J, Joslyn C, Mniszewski S, Rechtsteiner A, Rocha LM, Simas T (2005) Protein annotation as term categorization in the gene ontology using word proximity networks. BMC Bioinformatics 6(Suppl 1)
Martin D, Berriman M, Barton G (2004) GOtcha: a new method for prediction of protein function assessed by the annotation of seven genomes. BMC Bioinformatics 5:178
Conesa A, Gotz S, Garcia-Gomez JM, Terol J, Talon M, Robles M (2005) Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 21(18):3674–3676
Sokolov A, Funk C, Graim K, Verspoor K, Ben-Hur A (2013) Combining heterogeneous data sources for accurate functional annotation of proteins. BMC Bioinformatics 14(Suppl 3):S10
Sokolov A, Ben-Hur A (2010) Hierarchical classification of Gene Ontology terms using the GOstruct method. J Bioinform Comput Biol 8(2):357–376
Gabow AP, Leach SM, Baumgartner WA Jr, Hunter L, Goldberg DS (2008) Improving protein function prediction methods with integrated literature data. BMC Bioinformatics 9:198
Verspoor KM, Cohn JD, Ravikumar KE, Wall ME (2012) Text mining improves prediction of protein functional sites. PLoS One 7(2):e32171
Jaeger S, Gaudan S, Leser U, Rebholz-Schuhmann D (2008) Integrating protein-protein interactions and text mining for protein function prediction. BMC Bioinformatics 9(Suppl 8):S2
Andreeva A, Howorth D, Chandonia J-M, Brenner SE, Hubbard TJP, Chothia C, Murzin AG (2008) Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res 36(Database issue):D419–D425
Porter CT, Bartlett GJ, Thornton JM (2004) The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Res 32(Database issue):D129–133
Benson ML, Smith RD, Khazanov NA, Dimcheff B, Beaver J, Dresslar P, Nerothin J, Carlson HA (2008) Binding MOAD, a high-quality protein–ligand database. Nucleic Acids Res 36(suppl 1):D674–D678
Verspoor K, MacKinlay A, Cohn JD, Wall ME (2013) Detection of protein catalytic sites in the biomedical literature. Pac Symp Biocomput 18:433–444
Card GL, Peterson NA, Smith CA, Rupp B, Schick BM, Baker EN (2005) The crystal structure of Rv1347c, a putative antibiotic resistance protein from Mycobacterium tuberculosis, reveals a GCN5-related fold and suggests an alternative function in siderophore biosynthesis. J Biol Chem 280(14):13978–13986
Verspoor K, Cohn J, Mniszewski S, Joslyn C (2006) A categorization approach to automated ontological function annotation. Protein Sci 15(6):1544–1549
Verspoor K, Cohen KB, Lanfranchi A, Warner C, Johnson HL, Roeder C, Choi JD, Funk C, Malenkiy Y, Eckert M et al (2012) A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools. BMC Bioinformatics 13:207
Bada M, Eckert M, Evans D, Garcia K, Shipley K, Sitnikov D, Baumgartner WA Jr, Cohen KB, Verspoor K, Blake JA et al (2012) Concept annotation in the CRAFT corpus. BMC Bioinformatics 13:161
Campos D, Matos S, Oliveira JL (2013) Neji: a tool for heterogeneous biomedical concept identification. In: Proceedings of BioLINK SIG 2013; ISMB/ECCB 2013, Berlin, Germany, pp 28–31, See: http://biolinksig.org/past-meetings/biolink-2013/
Jacob C, Thomas P, Leser U (2013) Comprehensive Benchmark of gene ontology concept recognition tools. In: Proceedings of BioLINK SIG 2013; ISMB/ECCB 2013, Berlin, Germany, pp 20–26, See: http://biolinksig.org/past-meetings/biolink-2013/
Li X, Ling C, Wang H (2013) Effective top-down active learning for hierarchical text classification. In: Pei J, Tseng V, Cao L, Motoda H, Xu G (eds) Advances in knowledge discovery and data mining, vol 7819. Springer, Berlin, pp 233–244
Silla CN, Freitas AA (2011) A survey of hierarchical classification across different application domains. Data Min Knowl Discov 22(1):31–72
Clark WT, Radivojac P (2013) Information-theoretic evaluation of predicted ontological annotations. Bioinformatics 29(13):i53–i61

Author information

Authors and Affiliations

Department of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, 3010, Australia
Karin M. Verspoor

Corresponding author

Correspondence to Karin M. Verspoor.

Editor information

Editors and Affiliations

GlaxoSmithKline, King of Prussia, Pennsylvania, USA
Vinod D. Kumar
GlaxoSmithKline, Hitchin, Hertfordshire, United Kingdom
Hannah Jane Tipney

About this protocol

Cite this protocol

Verspoor, K.M. (2014). Roles for Text Mining in Protein Function Prediction. In: Kumar, V., Tipney, H. (eds) Biomedical Literature Mining. Methods in Molecular Biology, vol 1159. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-0709-0_6

DOI: https://doi.org/10.1007/978-1-4939-0709-0_6
Published: 10 April 2014
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-0708-3
Online ISBN: 978-1-4939-0709-0
eBook Packages: Springer Protocols
