Roles for Text Mining in Protein Function Prediction

Part of the book series: Methods in Molecular Biology (MIMB, volume 1159)

Abstract

The Human Genome Project has provided science with a hugely valuable resource: the blueprint for life, a specification of all of the genes that make up a human. While the genes have all been identified and deciphered, it is proteins that are the workhorses of the human body: they are essential to virtually all cell functions and are the primary mechanism through which biological function is carried out. Hence, to fully understand what happens at a molecular level in biological organisms, and eventually to enable the development of treatments for diseases in which some aspect of a biological system goes awry, we must understand the functions of proteins. However, experimental characterization of protein function cannot scale to the vast amount of DNA sequence data now available. Computational protein function prediction has therefore emerged as a problem at the forefront of modern biology (Radivojac et al., Nat Methods 10(3):221–227, 2013).

Several of the varied approaches to computational protein function prediction that have been explored make use of biomedical literature mining. These methods take advantage of information in the published literature to associate specific proteins with specific protein functions. In this chapter, we introduce two main strategies for doing this: the association of function terms, represented as Gene Ontology terms (Ashburner et al., Nat Genet 25(1):25–29, 2000), with proteins based on information in published articles; and a paradigm called LEAP-FS (Literature-Enhanced Automated Prediction of Functional Sites), in which literature mining is used to validate the predictions of an orthogonal computational protein function prediction method.
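As a rough illustration of the two strategies named in the abstract, the following toy sketch may help. It is not the chapter's implementation. It assumes a simple sentence-level co-occurrence rule for the Gene Ontology strategy, and a regular-expression residue matcher for the literature-validation strategy. The protein names (`ProtA`, `ProtB`), the example sentences, the function names, and the two-term lexicon are all hypothetical; only the two GO identifiers and their names are taken from the Gene Ontology itself.

```python
import re
from collections import defaultdict

# Toy lexicon (assumption): a real system would load the full Gene Ontology.
GO_TERMS = {
    "GO:0016301": "kinase activity",
    "GO:0003677": "DNA binding",
}

# Hypothetical sentences standing in for text from published articles.
SENTENCES = [
    "ProtA shows kinase activity in vitro.",
    "ProtB is implicated in DNA binding.",
    "Mutation of His130 abolished catalysis in ProtA.",
]

# Matches residue mentions such as "His130" or "Asp 168" (three-letter codes only).
RESIDUE_MENTION = re.compile(
    r"\b(Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val)"
    r"[- ]?(\d+)\b"
)


def associate_go_terms(sentences, proteins, go_terms):
    """Strategy 1 (sketch): link a protein to a GO term when both
    are mentioned in the same sentence (case-insensitive substring match)."""
    hits = defaultdict(set)
    for sentence in sentences:
        lowered = sentence.lower()
        for protein in proteins:
            if protein.lower() not in lowered:
                continue
            for go_id, name in go_terms.items():
                if name.lower() in lowered:
                    hits[protein].add(go_id)
    return {protein: sorted(ids) for protein, ids in hits.items()}


def validate_predicted_sites(predicted_positions, sentences):
    """Strategy 2 (sketch): mark each predicted residue position as supported
    if any sentence mentions a residue at that position."""
    mentioned = set()
    for sentence in sentences:
        for _residue, position in RESIDUE_MENTION.findall(sentence):
            mentioned.add(int(position))
    return {pos: pos in mentioned for pos in predicted_positions}


if __name__ == "__main__":
    print(associate_go_terms(SENTENCES, ["ProtA", "ProtB"], GO_TERMS))
    print(validate_predicted_sites([130, 45], SENTENCES))
```

In this sketch the predicted positions (130 and 45) stand in for the output of an independent prediction method; the literature only confirms or fails to confirm them. Realistic systems would need protein name normalization, ontology-aware term matching, and a check that a residue mention refers to the protein in question, none of which is attempted here.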

References

  1. Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, Graim K, Funk C, Verspoor K, Ben-Hur A et al (2013) A large-scale evaluation of computational protein function prediction. Nat Methods 10(3):221–227

  2. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al (2000) Gene ontology: tool for the unification of biology. Nat Genet 25(1):25–29

  3. Blaschke C, Valencia A (2013) The Functional Genomics Network in the evolution of biological text mining over the past decade. N Biotechnol 30(3):278–285

  4. Valencia A (2005) Automatic annotation of protein function. Curr Opin Struct Biol 15(3):267–274

  5. Baumgartner WA Jr, Cohen KB, Fox L, Acquaah-Mensah GK, Hunter L (2007) Manual curation is not sufficient for annotation of genomic databases. Bioinformatics 23:i41–i48

  6. Friedberg I (2006) Automated protein function prediction—the genomic challenge. Brief Bioinform 7(3):225–242

  7. Blaschke C, Leon E, Krallinger M, Valencia A (2005) Evaluation of BioCreAtIvE assessment of task 2. BMC Bioinformatics 6(Suppl 1):S16

  8. Maguitman AG, Rechtsteiner A, Verspoor K, Strauss CE, Rocha LM (2006) Large-scale testing of bibliome informatics using Pfam protein families. Pac Symp Biocomput 76–87

  9. Shatkay H, Hoglund A, Brady S, Blum T, Donnes P, Kohlbacher O (2007) SherLoc: high-accuracy prediction of protein subcellular localization by integrating text and protein sequence data. Bioinformatics 23(11):1410–1417

  10. Verspoor CM, Joslyn C, Papcun GJ (2003) The gene ontology as a source of lexical semantic knowledge for a biological natural language processing application. In: SIGIR workshop on Text Analysis and Search for Bioinformatics, pp 51–56

  11. Funk C, Baumgartner W, Garcia B, Roeder C, Bada M, Cohen KB, Hunter LE, Verspoor K (2014) Large-scale biomedical concept recognition: an evaluation of current automatic annotators and their parameters. BMC Bioinformatics 15:59. doi: 10.1186/1471-2105-15-59

  12. Wong A, Shatkay H (2013) Protein function prediction using text-based features extracted from the biomedical literature: the CAFA challenge. BMC Bioinformatics 14(Suppl 3):S14

  13. Krallinger M, Padron M, Valencia A (2005) A sentence sliding window approach to extract protein annotations from biomedical articles. BMC Bioinformatics 6(Suppl 1)

  14. Couto FM, Silva MJ, Coutinho PM (2005) Finding genomic ontology terms in text using evidence content. BMC Bioinformatics 6(Suppl 1)

  15. Ray S, Craven M (2005) Learning statistical models for annotating proteins with function information using biomedical text. BMC Bioinformatics 6(Suppl 1):S18

  16. Verspoor K, Cohn J, Joslyn C, Mniszewski S, Rechtsteiner A, Rocha LM, Simas T (2005) Protein annotation as term categorization in the gene ontology using word proximity networks. BMC Bioinformatics 6(Suppl 1)

  17. Martin D, Berriman M, Barton G (2004) GOtcha: a new method for prediction of protein function assessed by the annotation of seven genomes. BMC Bioinformatics 5:178

  18. Conesa A, Gotz S, Garcia-Gomez JM, Terol J, Talon M, Robles M (2005) Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 21(18):3674–3676

  19. Sokolov A, Funk C, Graim K, Verspoor K, Ben-Hur A (2013) Combining heterogeneous data sources for accurate functional annotation of proteins. BMC Bioinformatics 14(Suppl 3):S10

  20. Sokolov A, Ben-Hur A (2010) Hierarchical classification of Gene Ontology terms using the GOstruct method. J Bioinform Comput Biol 8(2):357–376

  21. Gabow AP, Leach SM, Baumgartner WA Jr, Hunter L, Goldberg DS (2008) Improving protein function prediction methods with integrated literature data. BMC Bioinformatics 9:198

  22. Verspoor KM, Cohn JD, Ravikumar KE, Wall ME (2012) Text mining improves prediction of protein functional sites. PLoS One 7(2):e32171

  23. Jaeger S, Gaudan S, Leser U, Rebholz-Schuhmann D (2008) Integrating protein-protein interactions and text mining for protein function prediction. BMC Bioinformatics 9(Suppl 8):S2

  24. Andreeva A, Howorth D, Chandonia J-M, Brenner SE, Hubbard TJP, Chothia C, Murzin AG (2008) Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res 36(Database issue):D419–D425

  25. Porter CT, Bartlett GJ, Thornton JM (2004) The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Res 32(Database issue):D129–D133

  26. Benson ML, Smith RD, Khazanov NA, Dimcheff B, Beaver J, Dresslar P, Nerothin J, Carlson HA (2008) Binding MOAD, a high-quality protein–ligand database. Nucleic Acids Res 36(suppl 1):D674–D678

  27. Verspoor K, MacKinlay A, Cohn JD, Wall ME (2013) Detection of protein catalytic sites in the biomedical literature. Pac Symp Biocomput 18:433–444

  28. Card GL, Peterson NA, Smith CA, Rupp B, Schick BM, Baker EN (2005) The crystal structure of Rv1347c, a putative antibiotic resistance protein from Mycobacterium tuberculosis, reveals a GCN5-related fold and suggests an alternative function in siderophore biosynthesis. J Biol Chem 280(14):13978–13986

  29. Verspoor K, Cohn J, Mniszewski S, Joslyn C (2006) A categorization approach to automated ontological function annotation. Protein Sci 15(6):1544–1549

  30. Verspoor K, Cohen KB, Lanfranchi A, Warner C, Johnson HL, Roeder C, Choi JD, Funk C, Malenkiy Y, Eckert M et al (2012) A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools. BMC Bioinformatics 13:207

  31. Bada M, Eckert M, Evans D, Garcia K, Shipley K, Sitnikov D, Baumgartner WA Jr, Cohen KB, Verspoor K, Blake JA et al (2012) Concept annotation in the CRAFT corpus. BMC Bioinformatics 13:161

  32. Campos D, Matos S, Oliveira JL (2013) Neji: a tool for heterogeneous biomedical concept identification. In: Proceedings of BioLINK SIG 2013; ISMB/ECCB 2013, Berlin, Germany, pp 28–31. See: http://biolinksig.org/past-meetings/biolink-2013/

  33. Jacob C, Thomas P, Leser U (2013) Comprehensive benchmark of gene ontology concept recognition tools. In: Proceedings of BioLINK SIG 2013; ISMB/ECCB 2013, Berlin, Germany, pp 20–26. See: http://biolinksig.org/past-meetings/biolink-2013/

  34. Li X, Ling C, Wang H (2013) Effective top-down active learning for hierarchical text classification. In: Pei J, Tseng V, Cao L, Motoda H, Xu G (eds) Advances in knowledge discovery and data mining, vol 7819. Springer, Berlin, pp 233–244

  35. Silla CN, Freitas AA (2011) A survey of hierarchical classification across different application domains. Data Min Knowl Discov 22(1):31–72

  36. Clark WT, Radivojac P (2013) Information-theoretic evaluation of predicted ontological annotations. Bioinformatics 29(13):i53–i61

Corresponding author

Correspondence to Karin M. Verspoor.

Copyright information

© 2014 Springer Science+Business Media New York

Cite this protocol

Verspoor, K.M. (2014). Roles for Text Mining in Protein Function Prediction. In: Kumar, V., Tipney, H. (eds) Biomedical Literature Mining. Methods in Molecular Biology, vol 1159. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-0709-0_6

  • DOI: https://doi.org/10.1007/978-1-4939-0709-0_6

  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-0708-3

  • Online ISBN: 978-1-4939-0709-0

  • eBook Packages: Springer Protocols
