Information Extraction from Text

Jiang, Jing

doi:10.1007/978-1-4614-3223-4_2

Abstract

Information extraction is the task of finding structured information in unstructured or semi-structured text. It is an important task in text mining and has been extensively studied in various research communities, including natural language processing, information retrieval and Web mining. It has a wide range of applications in domains such as biomedical literature mining and business intelligence. Two fundamental tasks of information extraction are named entity recognition and relation extraction. The former refers to finding names of entities such as people, organizations and locations. The latter refers to finding semantic relations between entities, such as FounderOf and HeadquarteredIn. In this chapter we provide a survey of the major work on named entity recognition and relation extraction in the past few decades, with a focus on work from the natural language processing community.
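
To make the two tasks concrete, the following is a minimal sketch of what their outputs might look like as data structures. It is an illustration only and not a method from the chapter: the gazetteer lookup and cue-phrase rules are deliberately naive stand-ins, and every name in it (`Entity`, `Relation`, `GAZETTEER`, `RULES`, the example sentence and the entities in it) is a hypothetical chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str    # surface string of the mention
    label: str   # entity type, e.g. PERSON, ORGANIZATION, LOCATION
    start: int   # character offset where the mention begins
    end: int     # character offset just past the mention


@dataclass(frozen=True)
class Relation:
    name: str    # relation type, e.g. FounderOf
    head: Entity
    tail: Entity


# Hypothetical toy gazetteer: known names mapped to entity types.
GAZETTEER = {
    "Ada Lopez": "PERSON",
    "Acme Corp": "ORGANIZATION",
    "Springfield": "LOCATION",
}

# Hypothetical rules: (relation, head type, tail type, cue phrase between them).
RULES = [
    ("FounderOf", "PERSON", "ORGANIZATION", "founded"),
    ("HeadquarteredIn", "ORGANIZATION", "LOCATION", "headquartered in"),
]


def recognize_entities(text):
    """Named entity recognition by exact gazetteer lookup (toy version)."""
    entities = []
    for name, label in GAZETTEER.items():
        start = text.find(name)
        while start != -1:
            entities.append(Entity(name, label, start, start + len(name)))
            start = text.find(name, start + 1)
    return sorted(entities, key=lambda e: e.start)


def extract_relations(text, entities):
    """Relation extraction by matching a cue phrase between two typed entities."""
    relations = []
    for head in entities:
        for tail in entities:
            if head.end > tail.start:
                continue  # only consider pairs where the head precedes the tail
            between = text[head.end:tail.start]
            for name, head_label, tail_label, cue in RULES:
                if (head.label, tail.label) == (head_label, tail_label) and cue in between:
                    relations.append(Relation(name, head, tail))
    return relations


TEXT = "Ada Lopez founded Acme Corp, which is headquartered in Springfield."
ENTITIES = recognize_entities(TEXT)
RELATIONS = extract_relations(TEXT, ENTITIES)

for relation in RELATIONS:
    print(f"{relation.name}({relation.head.text}, {relation.tail.text})")
```

On the example sentence this yields three typed entity mentions and two relation instances, one FounderOf and one HeadquarteredIn. Systems of the kind the chapter surveys would replace both the lookup and the cue-phrase rules with learned or more robust components; only the shape of the structured output is meant to carry over.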

Author information

Authors and Affiliations

Singapore Management University, Singapore, Singapore
Jing Jiang

Corresponding author

Correspondence to Jing Jiang.

Editor information

Editors and Affiliations

IBM Thomas J. Watson Research Center, 19 Skyline Drive, Hawthorne, New York 10532, USA
Charu C. Aggarwal
University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
ChengXiang Zhai

About this chapter

Cite this chapter

Jiang, J. (2012). Information Extraction from Text. In: Aggarwal, C., Zhai, C. (eds) Mining Text Data. Springer, Boston, MA. https://doi.org/10.1007/978-1-4614-3223-4_2

DOI: https://doi.org/10.1007/978-1-4614-3223-4_2
Published: 07 January 2012
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4614-3222-7
Online ISBN: 978-1-4614-3223-4
eBook Packages: Computer Science, Computer Science (R0)