Fact distribution in Information Extraction

  • Original Paper
Language Resources and Evaluation

Abstract

Several recent Information Extraction (IE) systems have been restricted to the identification of facts which are described within a single sentence. It is not clear what effect this has on the difficulty of the extraction task, or how the performance of systems which consider only single sentences should be compared with those which consider multiple sentences. This paper compares three IE evaluation corpora, from the Message Understanding Conferences, and finds that a significant proportion of the facts mentioned therein are not described within a single sentence. Therefore, systems which are evaluated only on facts described within single sentences are being tested against a limited portion of the relevant information in the text, and it is difficult to compare their performance with other systems. Further analysis demonstrates that anaphora resolution and world knowledge are required to combine information described across multiple sentences. This result has implications for the development and evaluation of IE systems.

Notes

  1. The MUC6 and MUC7 texts were split into sentences using the Edinburgh University LT-TTT tool (Grover, Matheson, Mikheev, & Moens, 2000). The MUC4 texts are written entirely in upper case and were split using a version of the OpenNLP tools sentence detector (http://www.opennlp.sourceforge.net) which had been retrained on a capitalised version of the Penn Treebank (Marcus, Santorini, & Marcinkiewicz, 1993).

  2. The process of converting potential answer keys into a regular expression includes escaping characters such as punctuation which are also metacharacters in the regex language used, allowing variable whitespace between tokens and concatenating each possible variation for a filler into a set of disjunctions.

  3. In the MUC6 corpus the movement of the executive is often encoded in the text using a predicate-argument structure, e.g. “named” in the above example, although alternative structures may also be used, e.g. “Mr. Keller’s resignation ...”. It is difficult to identify these comprehensively in a reliable way and therefore attention is restricted to string slots.

  4. In addition to the IE task the MUC6 and MUC7 evaluations included a number of other language processing tasks, including coreference resolution.

  5. The corpora used for the various MUC evaluations contain a mixture of relevant documents (which contain facts) and non-relevant documents (which do not).

  6. In the annotation format used for this corpus, anaphoric expressions and their antecedents are enclosed in <COREF> ... </COREF> SGML tags. The unique identifier of each expression is denoted by the ID attribute and the antecedent of an anaphoric expression by REF.
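
The annotation format described in note 6 can be read with a small stack-based scanner. The following is a minimal sketch, not the tooling used in the paper: the function names `parse_coref` and `resolve`, the assumption that attribute values are double-quoted, and the sample sentence are all illustrative and do not come from the MUC corpora.

```python
import re

# Matches an opening <COREF ...> tag (group 1 = attribute string) or a closing </COREF>.
TAG = re.compile(r"<COREF\b([^>]*)>|</COREF>", re.IGNORECASE)
# Assumption: attributes are written as NAME="value".
ATTR = re.compile(r'(\w+)\s*=\s*"([^"]*)"')


def parse_coref(sgml):
    """Return {ID: {"text": mention text, "ref": antecedent ID or None}}."""
    mentions = {}
    stack = []  # currently open mentions as (attrs, list of text pieces)
    pos = 0
    for m in TAG.finditer(sgml):
        text = sgml[pos:m.start()]
        for _, pieces in stack:
            pieces.append(text)  # text belongs to every enclosing mention
        pos = m.end()
        if m.group(1) is not None:  # opening tag
            attrs = {k.upper(): v for k, v in ATTR.findall(m.group(1))}
            stack.append((attrs, []))
        elif stack:  # closing tag
            attrs, pieces = stack.pop()
            if "ID" in attrs:
                mentions[attrs["ID"]] = {
                    "text": "".join(pieces),
                    "ref": attrs.get("REF"),
                }
    return mentions


def resolve(mentions, mention_id):
    """Follow REF links back to the first mention in the chain."""
    seen = set()
    while mentions[mention_id]["ref"] in mentions and mention_id not in seen:
        seen.add(mention_id)  # guards against accidental cycles
        mention_id = mentions[mention_id]["ref"]
    return mentions[mention_id]["text"]


# Invented example sentence, for illustration only.
sample = ('The board named <COREF ID="1">Mr. Keller</COREF> chairman. '
          '<COREF ID="2" REF="1">He</COREF> starts in May.')
mentions = parse_coref(sample)
print(resolve(mentions, "2"))
```

Following the REF chain in this way links an anaphoric expression in one sentence to an antecedent in another, which is the kind of cross-sentence combination the abstract refers to.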

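The conversion described in note 2 can be sketched as follows. This is a hedged illustration, assuming Python's `re` module rather than the regex language actually used in the paper; the function name `filler_to_pattern` and the example fillers are hypothetical.

```python
import re


def filler_to_pattern(variations):
    """Build one regex that matches any listed variation of a slot filler."""
    alternatives = []
    for variation in variations:
        tokens = variation.split()
        if not tokens:
            continue  # ignore empty variations
        # Escape punctuation that is also a regex metacharacter (e.g. ".").
        escaped = [re.escape(token) for token in tokens]
        # Allow variable whitespace (including line breaks) between tokens.
        alternatives.append(r"\s+".join(escaped))
    if not alternatives:
        raise ValueError("no usable filler variations")
    # Concatenate the variations into a single set of disjunctions.
    return re.compile("(?:" + "|".join(alternatives) + ")")


# Illustrative fillers only; not taken from an actual answer key.
pattern = filler_to_pattern(["Mr. Keller", "John A. Keller"])
match = pattern.search("The board named Mr.\n  Keller as chairman.")
print(match is not None)
```

A filler can then be counted as described within a sentence when the compiled pattern finds a match in that sentence's text.
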
References

  • Bagga, A., & Biermann, A. (1997). Analyzing the Complexity of a Domain with Respect to an Information Extraction Task. In Proceedings of the Tenth International Conference on Research on Computational Linguistics (ROCLING-X) (pp. 174–194). Taipei, Taiwan.

  • Chieu, H., & Ng, H. (2002). A Maximum Entropy Approach to Information Extraction from Semi-structured and Free Text. In Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI-02) (pp. 768–791). Edmonton, Canada.

  • Culotta, A., & Sorensen, J. (2004). Dependency Tree Kernels for Relation Extraction. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (pp. 423–429). Barcelona, Spain.

  • Grishman, R. (2003). Information Extraction. In R. Mitkov (Ed.), The Oxford Handbook of Computational Linguistics (pp. 545–559). Oxford University Press.

  • Grover, C., Matheson, C., Mikheev, A., & Moens, M. (2000). LT TTT - A Flexible Tokenisation Tool. In Proceedings of Second International Conference on Language Resources and Evaluation (LREC 2000). Athens, Greece.

  • Hirschman, L. (1992). An Adjunct Test for Discourse Processing in MUC-4. In Proceedings of the Fourth Message Understanding Conference (MUC-4) (pp. 67–77). San Francisco, CA.

  • Huttunen, S., Yangarber, R., & Grishman, R. (2002). Complexity of Event Structures in IE Scenarios. In Proceedings of the 19th International Conference on Computational Linguistics (COLING-2002) (pp. 376–382). Taipei, Taiwan.

  • Marcus, M., Santorini, B., & Marcinkiewicz, M. (1993). Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330.

  • Mitkov, R. (2003). Anaphora Resolution. In R. Mitkov (Ed.), The Oxford Handbook of Computational Linguistics (pp. 266–283). Oxford University Press.

  • Sekine, S. (2006). On-Demand Information Extraction. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions (pp. 731–738). Sydney, Australia.

  • Soderland, S. (1999). Learning Information Extraction Rules for Semi-structured and Free Text. Machine Learning, 31(1–3), 233–272.

  • Stevenson, M. (2004). Information Extraction from Single and Multiple Sentences. In Proceedings of the Twentieth International Conference on Computational Linguistics (COLING-04) (pp. 875–881). Geneva, Switzerland.

  • Stevenson, M., & Greenwood, M. (2005). A Semantic Approach to IE Pattern Induction. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (pp. 379–386). Ann Arbor, MI.

  • Sundheim, B. (1991). Overview of the Third Message Understanding Evaluation and Conference. In Proceedings of the Third Message Understanding Conference (MUC-3) (pp. 3–16). San Diego, CA.

  • Yangarber, R., Grishman, R., Tapanainen, P., & Huttunen, S. (2000). Automatic Acquisition of Domain Knowledge for Information Extraction. In Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000) (pp. 940–946). Saarbrücken, Germany.

  • Zelenko, D., Aone, C., & Richardella, A. (2003). Kernel Methods for Relation Extraction. Journal of Machine Learning Research, 3, 1083–1106.

Acknowledgements

This work was carried out as part of the Result project, funded by the UK EPSRC (GR/T06391). I am grateful to Mark Hepple, Mark Greenwood, David Martinez and Paul Clough for providing feedback on earlier versions of this paper. Any mistakes are my own.

Author information

Correspondence to Mark Stevenson.

About this article

Cite this article

Stevenson, M. Fact distribution in Information Extraction. Lang Resources & Evaluation 40, 183–201 (2006). https://doi.org/10.1007/s10579-006-9014-4

  • DOI: https://doi.org/10.1007/s10579-006-9014-4
