Anaphoric reference in clinical reports: Characteristics of an annotated corpus

https://doi.org/10.1016/j.jbi.2012.01.010
Under an Elsevier user license
open archive

Abstract

Motivation

Expressions that refer to a real-world entity already mentioned in a narrative are often considered anaphoric. For example, in the sentence “The pain comes and goes,” the expression “the pain” is probably referring to a previous mention of pain. Interpretation of meaning involves resolving the anaphoric reference: deciding which expression in the text is the correct antecedent of the referring expression, also called an anaphor. We annotated a set of 180 clinical reports (surgical pathology, radiology, discharge summaries, and emergency department) from two institutions to indicate all anaphor–antecedent pairs.

Objective

The objective of this study is to describe the characteristics of the corpus in terms of the frequency of anaphoric relations, the syntactic and semantic nature of the members of the pairs, and the types of anaphoric relations that occur. Understanding how anaphoric reference is exhibited in clinical reports is critical to developing reference resolution algorithms and to identifying peculiarities of clinical text that may alter the features and methodologies that will be successful for automated anaphora resolution.

Results

We found that anaphoric reference is prevalent in all types of clinical reports, that annotations of noun phrases, semantic type, and section headings may be especially important for automated resolution of anaphoric reference, and that separate modules for reference resolution may be required for different report types, different institutions, and different types of anaphors. Accurate resolution will probably require extensive domain knowledge, especially for pathology and radiology reports, which contain more part/whole and set/subset relations.

Conclusion

We hope researchers will leverage the annotations in this corpus to develop automated algorithms and will add to the annotations to generate a more extensive corpus.

Highlights

► Annotated 180 clinical reports to indicate anaphor–antecedent pairs.
► Identity was the most frequent relation; set/subset and part/whole relations also occurred.
► Accurate resolution will require extensive domain knowledge.
► Annotations can be used to develop anaphora resolution algorithms.

Keywords

Natural language processing
Clinical reports
