Fact distribution in Information Extraction

Stevenson, Mark

doi:10.1007/s10579-006-9014-4

Original Paper
Published: 09 February 2007

Language Resources and Evaluation, volume 40, pages 183–201 (2006)

Mark Stevenson¹


Abstract

Several recent Information Extraction (IE) systems have been restricted to the identification of facts which are described within a single sentence. It is not clear what effect this has on the difficulty of the extraction task or how the performance of systems which consider only single sentences should be compared with those which consider multiple sentences. This paper compares three IE evaluation corpora, from the Message Understanding Conferences, and finds that a significant proportion of the facts mentioned therein are not described within a single sentence. Therefore, systems which are evaluated only on facts described within single sentences are being tested against a limited portion of the relevant information in the text, and it is difficult to compare their performance with other systems. Further analysis demonstrates that anaphora resolution and world knowledge are required to combine information described across multiple sentences. This result has implications for the development and evaluation of IE systems.

Notes

The MUC6 and MUC7 texts were split into sentences using the Edinburgh University LT-TTT tool (Grover, Matheson, Mikheev, & Moens, 2000). The MUC4 texts are written entirely in upper case and were split using a version of the OpenNLP tools sentence detector (http://www.opennlp.sourceforge.net) which had been retrained on a capitalised version of the Penn Treebank (Marcus, Santorini, & Marcinkiewicz, 1993).
Converting potential answer keys into a regular expression involves escaping characters, such as punctuation, that are also metacharacters in the regex language used; allowing variable whitespace between tokens; and concatenating each possible variation of a filler into a set of disjunctions.
In the MUC6 corpus the movement of the executive is often encoded in the text using a predicate-argument structure, e.g. “named” in the above example, although alternative structures may also be used, e.g. “Mr. Keller’s resignation ......”. It is difficult to identify these comprehensively and reliably, and therefore attention is restricted to string slots.
In addition to the IE task, the MUC6 and MUC7 evaluations included a number of other language processing tasks, including coreference resolution.
The corpora used for the various MUC evaluations contain a mixture of relevant documents (which contain facts) and non-relevant documents (which do not).
In the annotation format used for this corpus, anaphoric expressions and their antecedents are enclosed in <COREF> ... </COREF> SGML tags. The unique identifier of each expression is denoted by the ID attribute and the antecedent of an anaphoric expression by REF.
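
Two of the notes above describe mechanical steps that a short example may make more concrete: turning the variants of an answer key filler into a regular expression, and reading the ID and REF attributes of <COREF> tags. The following Python is a minimal sketch under stated assumptions, not the code used in the paper. The function names, the choice of Python's re module, the double-quoted attribute syntax and the sample strings are all assumptions made for illustration.

```python
import re


def filler_to_pattern(filler):
    """Escape regex metacharacters in each token and allow
    variable whitespace between tokens."""
    return r"\s+".join(re.escape(token) for token in filler.split())


def answer_key_to_regex(variants):
    """Concatenate every variant of a filler into one set of disjunctions."""
    alternatives = [filler_to_pattern(v) for v in variants if v.strip()]
    if not alternatives:
        raise ValueError("at least one non-empty variant is required")
    return re.compile("(?:" + "|".join(alternatives) + ")")


# Matches an opening <COREF ...> tag (group 1 holds its attributes)
# or a closing </COREF> tag. Assumes attribute values are double-quoted.
COREF_TAG = re.compile(r"<COREF\b([^>]*)>|</COREF>")
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')


def coref_mentions(sgml):
    """Map each ID to (expression text, REF of its antecedent or None).
    A stack is used so that nested <COREF> elements are handled."""
    mentions, open_tags, position = {}, [], 0
    for tag in COREF_TAG.finditer(sgml):
        chunk = sgml[position:tag.start()]
        for entry in open_tags:
            # Text belongs to every expression that currently encloses it.
            entry["text"].append(chunk)
        position = tag.end()
        if tag.group(1) is not None:  # opening tag
            attrs = dict(ATTRIBUTE.findall(tag.group(1)))
            open_tags.append({"attrs": attrs, "text": []})
        elif open_tags:  # closing tag
            entry = open_tags.pop()
            attrs = entry["attrs"]
            mentions[attrs.get("ID")] = ("".join(entry["text"]), attrs.get("REF"))
    return mentions


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    key = answer_key_to_regex(["Mr. Keller", "Keller"])
    print(key.search("The board named Mr.   Keller as chairman."))
    print(coref_mentions('<COREF ID="1">Mr. Keller</COREF> said '
                         '<COREF ID="2" REF="1">he</COREF> would resign.'))
```

In this sketch an expression whose entry has a REF value is anaphoric, and following REF back to the matching ID gives its antecedent. Variants are tried in the order supplied, so listing longer variants first is one way to prefer the fullest form of a filler.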

References

Bagga, A., & Biermann, A. (1997). Analyzing the Complexity of a Domain with Respect to an Information Extraction Task. In Proceedings of the Tenth International Conference on Research on Computational Linguistics (ROCLING-X) (pp. 174–194). Taipei, Taiwan.
Chieu, H., & Ng, H. (2002). A Maximum Entropy Approach to Information Extraction from Semi-structured and Free Text. In Proceedings of the Eighteenth International Conference on Artificial Intelligence (AAAI-02) (pp. 768–791). Edmonton, Canada.
Culotta, A., & Sorensen, J. (2004). Dependency Tree Kernels for Relation Extraction. In 42nd Annual Meeting of the Association for Computational Linguistics (pp. 423–429). Barcelona, Spain.
Grishman, R. (2003). Information Extraction. In R. Mitkov (Ed.), The Oxford Handbook of Computational Linguistics (pp. 545–559). Oxford University Press.
Grover, C., Matheson, C., Mikheev, A., & Moens, M. (2000). LT TTT - A Flexible Tokenisation Tool. In Proceedings of Second International Conference on Language Resources and Evaluation (LREC 2000). Athens, Greece.
Hirschman, L. (1992). An Adjunct Test for Discourse Processing in MUC-4. In Proceedings of the Fourth Message Understanding Conference (MUC-4) (pp. 67–77). San Francisco, CA.
Huttunen, S., Yangarber, R., & Grishman, R. (2002). Complexity of Event Structures in IE Scenarios. In Proceedings of the 19th International Conference on Computational Linguistics (COLING-2002) (pp. 376–382). Taipei, Taiwan.
Marcus, M., Santorini, B., & Marcinkiewicz, M. (1993). Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330.
Mitkov, R. (2003). Anaphora Resolution. In R. Mitkov (Ed.), The Oxford Handbook of Computational Linguistics (pp. 266–283). Oxford University Press.
Sekine, S. (2006). On-Demand Information Extraction. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions (pp. 731–738). Sydney, Australia.
Soderland, S. (1999). Learning Information Extraction Rules for Semi-structured and Free Text. Machine Learning, 31(1–3), 233–272.
Stevenson, M. (2004). Information Extraction from Single and Multiple Sentences. In Proceedings of the Twentieth International Conference on Computational Linguistics (COLING-04) (pp. 875–881). Geneva, Switzerland.
Stevenson, M., & Greenwood, M. (2005). A Semantic Approach to IE Pattern Induction. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (pp. 379–386). Ann Arbor, MI.
Sundheim, B. (1991). Overview of the Third Message Understanding Evaluation and Conference. In Proceedings of the Third Message Understanding Conference (MUC-3) (pp. 3–16). San Diego, CA.
Yangarber, R., Grishman, R., Tapanainen, P., & Huttunen, S. (2000). Automatic Acquisition of Domain Knowledge for Information Extraction. In Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000) (pp. 940–946). Saarbrücken, Germany.
Zelenko, D., Aone, C., & Richardella, A. (2003). Kernel Methods for Relation Extraction. Journal of Machine Learning Research, 3, 1083–1106.
Acknowledgements

This work was carried out as part of the Result project, funded by the UK EPSRC (GR/T06391). I am grateful to Mark Hepple, Mark Greenwood, David Martinez and Paul Clough for providing feedback on earlier versions of this paper. Any mistakes are my own.

Author information

Authors and Affiliations

Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello Street, S1 4DP, Sheffield, UK
Mark Stevenson

Corresponding author

Correspondence to Mark Stevenson.

About this article

Cite this article

Stevenson, M. Fact distribution in Information Extraction. Lang Resources & Evaluation 40, 183–201 (2006). https://doi.org/10.1007/s10579-006-9014-4

Received: 13 January 2006
Accepted: 12 December 2006
Published: 09 February 2007
Issue Date: May 2006
DOI: https://doi.org/10.1007/s10579-006-9014-4
