Abstract
This paper gives an overview of the International Standards Organization–Linguistic Annotation Framework (ISO–LAF) developed in ISO TC37 SC4. We describe the XML serialization of ISO–LAF, the Graph Annotation Format (GrAF), and discuss the rationale behind the decisions made in determining the standard. We describe the structure of the GrAF headers in detail and provide multiple examples of GrAF representation for text and multi-media. Finally, we discuss the next steps for standardization of interchange formats for linguistic annotations.
Notes
The Lexical Markup Framework (LMF) (Francopoulo 2013).
AG was subsequently augmented with ad hoc mechanisms to accommodate hierarchical relations, but these were never part of the underlying AG data model.
Annotation Graphs allow nodes to be associated with locations in primary data, but not with other nodes in the graphs defined over the data.
See Neumann et al. (2013) for a description of the query and visualization tool ANNIS, which enables such queries over MASC data.
The term “document” is applied broadly here to include physical artifacts other than text, and to allow for the possibility that a logical unit of primary data is distributed over multiple computer files.
The @attribute-name notation is used for XML attributes throughout the paper.
Note that all anchor types are associated with one or more media, but a medium is not necessarily associated with an anchor type—in particular, media types associated with documents other than primary data documents (notably, annotation documents) are not associated with an anchor type.
XPath is the XML Path Language defined by W3C; see http://www.w3.org/TR/xpath/.
The annotation documentation would be referenced in the annotation type declaration in the resource header.
Note that the @type attribute on the region element specifies the anchor type and not the region type.
Note that anchors into character data refer to locations between characters, not to the position of the characters themselves.
Sentences may also be represented as annotations defined over tokens, but for some purposes it is less desirable to consider a sentence as an ordered set of tokens than as a single span of characters.
Some detail concerning the HTML display has been omitted for brevity.
The ANNIS implementation for accessing MASC annotations is available from http://www.anc.org/software/annis.
Note that the names of the objects and features are much less important than the types of the objects and associated features.
For more information see http://lapps.anc.org/web-service-exchange-vocabulary/.
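The note on character anchors above can be illustrated with a minimal sketch. It assumes that primary data is held as a plain Python string and that a region is given by a pair of integer anchors; the function name `span_text` and the sample text are hypothetical and are not part of GrAF or of any ISO–LAF tooling. Because anchors name locations between characters, a region from anchor 4 to anchor 7 covers exactly the three characters lying between those two locations, and a region whose two anchors coincide covers nothing.

```python
def span_text(primary_data: str, start: int, end: int) -> str:
    """Return the characters lying between two anchors.

    Anchors are offsets *between* characters: anchor 0 precedes the first
    character and anchor len(primary_data) follows the last one.
    (Hypothetical helper for illustration only.)
    """
    if not 0 <= start <= end <= len(primary_data):
        raise ValueError("anchors must satisfy 0 <= start <= end <= len(text)")
    return primary_data[start:end]


text = "The cat sat."

# A token region: the three characters between anchors 4 and 7.
token = span_text(text, 4, 7)

# A zero-width region: both anchors name the same between-character location.
empty = span_text(text, 4, 4)

# A sentence treated as a single span of characters rather than a token list.
sentence = span_text(text, 0, len(text))
```

Under this reading, adjacent regions can share an anchor without overlapping: a region ending at anchor 7 and one starting at anchor 7 have no character in common.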
References
Baker, C. F., Fillmore, C. J., & Lowe, J. B. (1998). The Berkeley FrameNet project. In COLING-ACL ’98: Proceedings of the conference (pp. 86–90).
Bird, S., & Liberman, M. (2001). A formal framework for linguistic annotation. Speech Communication, 33(1–2), 23–60.
Blumtritt, J., Bouda, P., & Rau, F. (2013). Poio API and GrAF-XML: A radical stand-off approach in language documentation and language typology. In Proceedings of Balisage: The markup conference 2013, Montreal, Canada, Balisage Series on Markup Technologies (vol. 10). doi:10.4242/BalisageVol10.Bouda01.
Cunningham, H., Maynard, D., Bontcheva, K., & Tablan, V. (2002). GATE: A framework and graphical development environment for robust NLP tools and applications. In Proceedings of ACL’02.
Ferrucci, D., & Lally, A. (2004). UIMA: An architectural approach to unstructured information processing in the corporate research environment. Natural Language Engineering, 10(3–4), 327–348.
Francopoulo, G. (Ed.). (2013). LMF: Lexical Markup Framework. London: Wiley-ISTE.
Ide, N., & Bunt, H. (2010). Anatomy of annotation schemes: Mapping to GrAF. In Proceedings of the fourth linguistic annotation workshop (LAW IV) (pp. 247–255). Uppsala: Association for Computational Linguistics.
Ide, N., & Romary, L. (2001). Standards for language resources. In Proceedings of IRCS workshop on linguistic databases.
Ide, N., & Romary, L. (2003). Outline of the International Standard Linguistic Annotation Framework. In Proceedings of ACL’03 workshop on linguistic annotation: Getting the model right (pp. 1–5).
Ide, N., & Romary, L. (2004a). A registry of standard data categories for linguistic annotation. In Proceedings of the fourth international language resources and evaluation conference (LREC’04), Lisbon, Portugal (pp. 135–138).
Ide, N., & Romary, L. (2004b). International Standard for a Linguistic Annotation Framework. Journal of Natural Language Engineering, 10(3–4), 211–225.
Ide, N., & Romary, L. (2007). Towards international standards for language resources. In L. Dybkjaer, H. Hemsen, & W. Minker (Eds.), Evaluation of text and speech systems (pp. 263–284). Berlin: Springer.
Ide, N., & Suderman, K. (2007). GrAF: A graph-based format for linguistic annotations. In Proceedings of the linguistic annotation workshop (LAW), association for computational linguistics (pp. 1–8).
Ide, N., Bonhomme, P., & Romary, L. (2000). XCES: An XML-based encoding standard for linguistic corpora. In Proceedings of the second international language resources and evaluation conference (LREC’00).
Ide, N., Baker, C., Fellbaum, C., & Passonneau, R. (2010a). The manually annotated sub-corpus: A community resource for and by the people. In Proceedings of the ACL 2010 conference short papers (pp. 68–73). Uppsala: Association for Computational Linguistics.
Ide, N., Suderman, K., & Simms, B. (2010b). ANC2Go: A web application for customized corpus creation. In Proceedings of the seventh international conference on language resources and evaluation (LREC), Valletta, Malta.
Ide, N., Prasad, R., & Joshi, A. (2011). Towards interoperability for the Penn Discourse Treebank. In Proceedings of the sixth joint ISO–ACL SIGSEM workshop on interoperable semantic annotation (pp. 49–55).
ISO. (2005). Language Resource Management–Feature Structures, Part 1: Feature structure representation. ISO Document ISO/DIS 24610–1.
ISO. (2012). Language Resource Management–Linguistic Annotation Framework. ISO 24612.
Kemps-Snijders, M., Windhouwer, M., Wittenburg, P., & Wright, S. E. (2009). ISOCat: Remodelling metadata for language resources. International Journal of Metadata, Semantics and Ontologies, 4, 261–276.
Kipp, M. (2001). ANVIL: A generic annotation tool for multimodal dialogue. In INTERSPEECH’01 (pp. 1367–1370).
Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330.
Neumann, A., Ide, N., & Stede, M. (2013). Importing MASC into the ANNIS linguistic database: A case study of mapping GrAF. In Proceedings of the seventh linguistic annotation workshop (LAW) (pp. 98–102). Sofia, Bulgaria.
Pustejovsky, J., Lee, K., Bunt, H., & Romary, L. (2010). ISO-TimeML: An international standard for semantic annotation. In Proceedings of the seventh international language resources and evaluation conference (LREC’10).
Thompson, H. S., & McKelvie, D. (1997). Hyperlink semantics for standoff markup of read-only documents. In Proceedings of SGML Europe 97: The next decade-pushing the envelope (pp. 227–229).
Zeldes, A., Ritz, J., Lüdeling, A., & Chiarcos, C. (2009). ANNIS: A search tool for multi-layer annotated corpora. In Proceedings of corpus linguistics.
Acknowledgments
This work was supported by National Science Foundation Grant INT-0753069.
Cite this article
Ide, N., Suderman, K. The Linguistic Annotation Framework: a standard for annotation interchange and merging. Lang Resources & Evaluation 48, 395–418 (2014). https://doi.org/10.1007/s10579-014-9268-1
DOI: https://doi.org/10.1007/s10579-014-9268-1