Editorial
Speech annotation and corpus tools

https://doi.org/10.1016/S0167-6393(00)00066-2

Introduction

Over the last 20 years, there has been a pressing need to develop speech and language corpora as training and testing material for a wide range of speech technology applications. This has been coupled with a growing interest within the speech community in developing models of spoken language that are based on corpora that are increasingly representative of natural, spontaneous speech.

The growth in the use of speech corpora has benefited in the last 10 years from the establishment of data centres, such as the Linguistic Data Consortium (LDC), the European Language Resources Association (ELRA), the Japanese Language Resource Consortium (GSK: Gengo Shigen Kyouyuukikou), and multisite annotation initiatives, such as the ToBI system for prosodic annotation and the DAMSL system of discourse annotation. Today hundreds of annotated speech corpora exist and are used worldwide, and the demand for richly annotated corpora is growing.

The growth in the use of corpora has, however, not been matched by the development of a standard set of tools for creating, editing, annotating and querying corpora. As a result, many laboratories have developed their own systems for corpus annotation and analysis, precisely because existing tools are ill-equipped to cope with the increasing size of corpora and the widening range of applications for which they are constructed. A wealth of formats and tools has sprung up around this enterprise, a diversity which at once facilitates and frustrates progress. The linguistic annotation page (www.ldc.upenn.edu/annotation) and a series of international workshops have drawn attention to the scale of ongoing activity, and to the existence of diverse approaches to similar problems and of similar approaches to diverse problems. Despite the explicit formats and well-documented user interfaces, insights about the structure of the annotations themselves are often buried in coding manuals, internal data structures and file formats.

There are pressing needs to document data models and tool requirements, to identify notational and functional equivalences among different approaches, to report on new approaches to core representational problems, and to describe new domains and empirical problems which stretch our conceptions of the models. These needs are the focus of this special issue. The papers gathered here address a broad range of theoretical and practical issues concerning the representation of annotations, the structure of annotated speech corpora, and the design, analysis and implementation of tools for creating, browsing, searching, manipulating and transforming annotations and annotated speech corpora.

Scalability

To what extent are annotation tools and formats adapted to dealing with very large corpora? A number of papers touch on this issue. One approach is to use the Extensible Markup Language (XML) as the data model (McKelvie et al.; Jacobson et al.), which allows data annotation to benefit directly from new developments in indexing, storage, query and transformation of large XML databases. Another approach is to provide a relational representation for annotation data (Bird and Liberman; Cassidy and

Future directions

In the light of the papers in this collection, and the state of the field more generally, we can see a number of key areas where work is underway and where we can expect to see intensive activity in the near future.

A number of powerful general-purpose frameworks have been developed, which often include explicit XML formats for data storage and interchange, and application programming interfaces (APIs). Analysis of the formats and APIs, as well as identification of the substantive differences

Acknowledgements

The editors are grateful to over 50 people who undertook timely and insightful reviews of the submissions. The papers by Bird and Liberman, and Cassidy and Harrington, were reviewed independently by the Speech Communication editors and timed to appear with this issue.
