Abstract
Several recent Information Extraction (IE) systems have been restricted to the identification of facts which are described within a single sentence. It is not clear what effect this has on the difficulty of the extraction task or how the performance of systems which consider only single sentences should be compared with those which consider multiple sentences. This paper compares three IE evaluation corpora, from the Message Understanding Conferences, and finds that a significant proportion of the facts mentioned therein are not described within a single sentence. Therefore, systems which are evaluated only on facts described within single sentences are being tested against a limited portion of the relevant information in the text, and it is difficult to compare their performance with other systems. Further analysis demonstrates that anaphora resolution and world knowledge are required to combine information described across multiple sentences. This result has implications for the development and evaluation of IE systems.
Notes
The MUC6 and MUC7 texts were split into sentences using the Edinburgh University LT-TTT tool (Grover, Matheson, Mikheev, & Moens, 2000). The MUC4 texts are written entirely in upper case and were split using a version of the OpenNLP tools sentence detector (http://www.opennlp.sourceforge.net) which had been retrained on a capitalised version of the Penn Treebank (Marcus, Santorini, & Marcinkiewicz, 1993).
The process of converting potential answer keys into a regular expression includes escaping characters such as punctuation which are also metacharacters in the regex language used, allowing variable whitespace between tokens and concatenating each possible variation for a filler into a set of disjunctions.
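The conversion described in this note can be illustrated with a minimal Python sketch. It is not the code used in the study: the function names `filler_to_regex` and `sentences_containing` are hypothetical, and sorting the alternatives longest-first is an added assumption so that longer variants are preferred when several could match.

```python
import re


def filler_to_regex(variants):
    """Build one compiled pattern that matches any variant of a filler.

    Hypothetical helper: escapes metacharacters, allows variable
    whitespace between tokens, and joins the variants as a disjunction.
    """
    alternatives = []
    for variant in variants:
        tokens = variant.split()
        if not tokens:
            continue  # ignore empty variants
        # Escape punctuation that is also a regex metacharacter, e.g. "." in "Mr."
        escaped = [re.escape(token) for token in tokens]
        # Allow any run of whitespace (including line breaks) between tokens
        alternatives.append(r"\s+".join(escaped))
    # Assumption: try longer variants first so they win over shorter ones
    alternatives.sort(key=len, reverse=True)
    return re.compile("(?:" + "|".join(alternatives) + ")")


def sentences_containing(pattern, sentences):
    """Return the indices of the sentences in which the filler occurs."""
    return [i for i, sentence in enumerate(sentences) if pattern.search(sentence)]
```

For example, `filler_to_regex(["Mr. Keller", "Keller"])` matches "Mr.   Keller" despite the extra spaces, while a pattern built from "Mr. Keller" alone does not match "MrX Keller", because the full stop has been escaped.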
In the MUC6 corpus the movement of the executive is often encoded in the text using a predicate-argument structure, e.g. “named” in the above example, although alternative structures may also be used, e.g. “Mr. Keller’s resignation ......”. It is difficult to identify these comprehensively in a reliable way and therefore attention is restricted to string slots.
In addition to the IE task the MUC6 and MUC7 evaluations included a number of other language processing tasks, including coreference resolution.
The corpora used for the various MUC evaluations contain a mixture of relevant documents (which contain facts) and non-relevant documents (which do not).
In the annotation format used for this corpus, anaphoric expressions and their antecedents are enclosed in <COREF> ... </COREF> SGML tags. The unique identifier of each expression is denoted by the ID attribute and the antecedent of an anaphoric expression by REF.
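As an illustration of how such annotation could be read, the following Python sketch collects each COREF expression and follows its REF links back to earlier antecedents. It is a sketch under stated assumptions, not the tooling used in the study: attribute values are assumed to be double-quoted, tag names are assumed to be upper case, and the names `parse_coref` and `antecedent_chain` are hypothetical.

```python
import re

# Assumption: tags look like <COREF ID="2" REF="1"> ... </COREF>
TAG = re.compile(r"<COREF\b([^>]*)>|</COREF>")
ATTR = re.compile(r'(\w+)="([^"]*)"')


def parse_coref(text):
    """Map each ID to its surface text and the ID of its antecedent (or None)."""
    mentions = {}
    stack = []  # currently open expressions: (attributes, collected text pieces)
    pos = 0
    for match in TAG.finditer(text):
        chunk = text[pos:match.start()]
        # Text belongs to every expression that is still open (handles nesting)
        for _, pieces in stack:
            pieces.append(chunk)
        pos = match.end()
        if match.group(0).startswith("</"):
            if stack:
                attrs, pieces = stack.pop()
                if "ID" in attrs:
                    mentions[attrs["ID"]] = {
                        "text": "".join(pieces),
                        "ref": attrs.get("REF"),
                    }
        else:
            stack.append((dict(ATTR.findall(match.group(1))), []))
    return mentions


def antecedent_chain(mentions, mention_id):
    """Follow REF links from an expression back through its antecedents."""
    chain, seen = [], set()
    current = mentions.get(mention_id, {}).get("ref")
    while current is not None and current in mentions and current not in seen:
        seen.add(current)  # guard against circular REF links
        chain.append(mentions[current]["text"])
        current = mentions[current]["ref"]
    return chain
```

Given an expression "He" whose REF points to the ID of "Mr. Keller", `antecedent_chain` returns the text of "Mr. Keller", which is the kind of link needed to combine facts described across sentences.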
References
Bagga, A., & Biermann, A. (1997). Analyzing the Complexity of a Domain with Respect to an Information Extraction Task. In Proceedings of the Tenth International Conference on Research on Computational Linguistics (ROCLING-X) (pp. 174–194). Taipei, Taiwan.
Chieu, H., & Ng, H. (2002). A Maximum Entropy Approach to Information Extraction from Semi-structured and Free Text. In Proceedings of the Eighteenth International Conference on Artificial Intelligence (AAAI-02) (pp. 768–791). Edmonton, Canada.
Culotta, A., & Sorensen, J. (2004). Dependency Tree Kernels for Relation Extraction. In 42nd Annual Meeting of the Association for Computational Linguistics (pp. 423–429). Barcelona, Spain.
Grishman, R. (2003). Information Extraction. In R. Mitkov (Ed.), The Oxford Handbook of Computational Linguistics (pp. 545–559). Oxford University Press.
Grover, C., Matheson, C., Mikheev, A., & Moens, M. (2000). LT TTT - A Flexible Tokenisation Tool. In Proceedings of Second International Conference on Language Resources and Evaluation (LREC 2000). Athens, Greece.
Hirschman, L. (1992). An Adjunct Test for Discourse Processing in MUC-4. In Proceedings of the Fourth Message Understanding Conference (MUC-4) (pp. 67–77). San Francisco, CA.
Huttunen, S., Yangarber, R., & Grishman, R. (2002). Complexity of Event Structures in IE Scenarios. In Proceedings of the 19th International Conference on Computational Linguistics (COLING-2002) (pp. 376–382). Taipei, Taiwan.
Marcus, M., Santorini, B., & Marcinkiewicz, M. (1993). Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330.
Mitkov, R. (2003). Anaphora Resolution. In R. Mitkov (Ed.), The Oxford Handbook of Computational Linguistics (pp. 266–283). Oxford University Press.
Sekine, S. (2006). On-Demand Information Extraction. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions (pp. 731–738). Sydney, Australia.
Soderland, S. (1999). Learning Information Extraction Rules for Semi-structured and Free Text. Machine Learning, 31(1–3), 233–272.
Stevenson, M. (2004). Information Extraction from Single and Multiple Sentences. In Proceedings of the Twentieth International Conference on Computational Linguistics (COLING-04) (pp. 875–881). Geneva, Switzerland.
Stevenson, M., & Greenwood, M. (2005). A Semantic Approach to IE Pattern Induction. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (pp. 379–386). Ann Arbor, MI.
Sundheim, B. (1991). Overview of the Third Message Understanding Evaluation and Conference. In Proceedings of the Third Message Understanding Conference (MUC-3) (pp. 3–16). San Diego, CA.
Yangarber, R., Grishman, R., Tapanainen, P., & Huttunen, S. (2000). Automatic Acquisition of Domain Knowledge for Information Extraction. In Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000) (pp. 940–946). Saarbrücken, Germany.
Zelenko, D., Aone, C., & Richardella, A. (2003). Kernel Methods for Relation Extraction. Journal of Machine Learning Research, 3, 1083–1106.
Acknowledgements
This work was carried out as part of the Result project, funded by the UK EPSRC (GR/T06391). I am grateful to Mark Hepple, Mark Greenwood, David Martinez and Paul Clough for providing feedback on earlier versions of this paper. Any mistakes are my own.
Cite this article
Stevenson, M. Fact distribution in Information Extraction. Lang Resources & Evaluation 40, 183–201 (2006). https://doi.org/10.1007/s10579-006-9014-4
DOI: https://doi.org/10.1007/s10579-006-9014-4