1 Introduction

The huge amount of biomedical data available today often consists of plain text fields, e.g. clinical trial descriptions, adverse event reports, electronic health records, emails or notes expressed by patients within forums (Murdoch and Detsky 2013). These texts are often written in the specific language (expressions and terms) of the associated community. There is therefore a need to formalize and catalog these technical terms or concepts via the construction of terminologies and ontologies (Rubin et al. 2008). These technical terms are also important for information retrieval (IR), for instance when indexing documents or formulating queries. However, as manually extracting the terms of a domain is long and cumbersome, researchers have striven to design automatic methods that assist knowledge experts in cataloging the terms and concepts of a domain in the form of vocabularies, thesauri, terminologies or ontologies.

Automatic term extraction (ATE), or automatic term recognition (ATR), is a domain which aims to automatically extract technical terminology from a given text corpus. We define technical terminology as the set of terms used in a domain. Term extraction is an essential task in domain knowledge acquisition because the technical terminology can be used for lexicon updating, domain ontology construction, summarization, named entity recognition or, as previously mentioned, IR.

In the biomedical domain, there is a substantial difference between the resources (hereafter called terminologies or ontologies) existing in English, French, and Spanish. In English, there are about 9,919,000 terms associated with about 8,864,000 concepts, such as those in UMLSFootnote 1 or BioPortal (Noy et al. 2009), whereas in French there are only about 330,000 terms associated with about 160,000 concepts (Névéol et al. 2014), and in Spanish about 1,172,000 terms associated with about 1,140,000 concepts. Note the strong difference in the number of ontologies and terminologies available in French or Spanish, which makes ATE even more important for these languages.

In biomedical ontologies, different terms may be linked to the same concept; they are semantically similar but written differently, for instance “neoplasm” and “cancer” in MeSH or SNOMED-CT. Ontologies also contain terms with morphosyntactic variants, for instance plurals like “external fistula” and “external fistulas”, and each group of variants is linked to a preferred term. As one of our goals is to extract new terms to enrich ontologies, our approach does not normalize variant terms, mainly because normalization would penalize the extraction of new variant terms. Technical terms are useful to gain further insight into the conceptual structure of a domain. They may be: (i) single-word terms (simple), or (ii) multi-word terms (complex). The proposed study focuses on both cases.

Term extraction methods usually involve two main steps. The first step extracts candidate terms by unithood calculation to qualify a string as a valid term, while the second step verifies them through termhood measures to validate their domain specificity. Formally, unithood refers to the degree of strength or stability of syntagmatic combinations and collocations, and termhood is defined as the degree to which a linguistic unit is related to domain-specific concepts (Kageura and Umino 1996). ATE has been applied to several domains, e.g. biomedical (Lossio-Ventura et al. 2014c; Frantzi et al. 2000; Zhang et al. 2008; Newman et al. 2012), ecological (Conrado et al. 2013), mathematical (Stoykova and Petkova 2012), social networks (Lossio-Ventura et al. 2012), banking (Dobrov and Loukachevitch 2011), natural sciences (Dobrov and Loukachevitch 2011), information technology (Newman et al. 2012; Yang et al. 2009), legal (Yang et al. 2009), as well as post-graduate school websites (Qureshi et al. 2012).

The main issues in ATE are: (i) extraction of non-valid terms (noise) or omission of terms with low frequency (silence), (ii) extraction of multi-word terms with complex and varied structures, (iii) the manual validation effort for the candidate terms (Conrado et al. 2013), and (iv) management of large-scale corpora. Inspired by our previously published results and in response to the above issues, we propose a cutting-edge methodology to extract biomedical terms. We propose new measures and some modifications of existing baseline measures. These measures are divided into: (1) ranking measures, and (2) re-ranking measures. Our ranking measures are statistical- and linguistic-based and address issues (i), (ii) and (iv). Of our two re-ranking measures, the first, called TeRGraph, is a graph-based measure that deals with issues (i), (ii) and (iii); the second, called WAHI, is a web-based measure that also deals with issues (i), (ii) and (iii). The novelty of WAHI is that it is web-based, which has, to the best of our knowledge, never been applied within ATE approaches.

The main contributions of our article are: (1) enhanced consideration of term unithood, by computing a degree of quality for the unithood, and (2) consideration of term dependence in the ATE process. The quality of the proposed methodology is highlighted by comparing the results obtained with the most commonly used baseline measures. Our evaluation experiments were conducted despite difficulties in comparing ATE measures, mainly because of the size of the corpora used and the lack of available libraries associated with previous studies. Our three measures improve the process of automatic extraction of domain-specific terms from text collections that do not offer reliable statistical evidence (i.e. low frequency).

The paper is organized as follows. We first discuss related work in Sect. 2. Then the methodology to extract biomedical terms is detailed in Sect. 3. The results are presented in Sect. 4, followed by discussions in Sect. 5, and finally, the conclusions in Sect. 6.

2 Related work

Recent studies have focused on multi-word (n-gram) and single-word (unigram) term extraction. Term extraction techniques can be divided into four broad categories: (i) Linguistic, (ii) Statistical, (iii) Machine Learning, and (iv) Hybrid. All of these techniques are encompassed in Text Mining approaches. Graph-based approaches have not yet been applied to ATE, although they have been successfully adopted in other information retrieval fields and could be suitable for our purpose. Existing web techniques have likewise not been applied to ATE but, as we will see, they can be adapted for such purposes.

2.1 Text mining approaches

2.1.1 Linguistic approaches

These techniques attempt to recover terms via linguistic pattern formation. This involves building rules to describe naming structures for different classes based on orthographic, lexical, or morphosyntactic characteristics, e.g. Gaizauskas et al. (2000). The main approach is to develop rules (typically manually) describing common naming structures for certain term classes using orthographic or lexical clues, or more complex morpho-syntactic features. Moreover, in many cases, dictionaries of typical term constituents (e.g. terminological heads, affixes, and specific acronyms) are used to facilitate term recognition (Krauthammer and Nenadic 2004). A recent study on biomedical term extraction (Golik et al. 2013) is based on linguistic patterns plus additional context-based rules to extract candidate terms; the candidates are not scored, and the authors leave the term relevance decision to experts.

2.1.2 Statistical methods

Statistical techniques chiefly rely on external evidence presented through surrounding (contextual) information. Such approaches are mainly focused on the recognition of general terms (Eck et al. 2010). The most basic measures are based on frequency. For instance, term frequency (tf) counts the occurrences of a term in the corpus, document frequency (df) counts the number of documents in which a term occurs, and average term frequency (atf) is \(\frac{tf}{df}\).
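As a minimal illustration (our own, using naive substring matching rather than proper tokenization), these frequency measures can be computed as follows:

```python
from typing import List

def tf(term: str, corpus: List[str]) -> int:
    """Total number of occurrences of the term in the whole corpus."""
    return sum(doc.count(term) for doc in corpus)

def df(term: str, corpus: List[str]) -> int:
    """Number of documents in which the term occurs at least once."""
    return sum(1 for doc in corpus if term in doc)

def atf(term: str, corpus: List[str]) -> float:
    """Average term frequency: tf / df."""
    d = df(term, corpus)
    return tf(term, corpus) / d if d else 0.0

corpus = ["blood cell count and red blood cell", "red blood cell morphology", "liver function test"]
print(tf("blood cell", corpus), df("blood cell", corpus), atf("blood cell", corpus))  # 3 2 1.5
```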

A similar research topic, called automatic keyword extraction (AKE), aims to extract the most relevant words or phrases in a document using automatic indexation. Keywords, which we define as sequences of one or more words, provide a compact representation of a document’s content. Such measures can be adapted to extract terms from a corpus, just like ATE measures. We take two popular AKE measures as baseline measures, i.e. Term Frequency Inverse Document Frequency (TF-IDF) (Salton and Buckley 1988) and Okapi BM25 (Robertson et al. 1999) (hereafter Okapi), which weight word frequency according to its distribution across the corpus. Residual inverse document frequency (RIDF) compares the document frequency to a chance model in which terms with a particular term frequency are distributed randomly throughout the collection. Chi-square (Matsuo and Ishizuka 2004) assesses how selectively words and phrases co-occur within the same sentences as a particular subset of frequent terms in the document text; this is applied to determine the bias of word co-occurrences in the document text, which is then used to rank words and phrases as keywords of the document. RAKE (Rose et al. 2010) hypothesises that keywords usually consist of multiple words and do not contain punctuation or stop words, and uses word co-occurrence information to determine the keywords.

2.1.3 Machine learning

Machine Learning (ML) systems are often designed for specific entity classes and thus integrate term extraction and term classification. They use training data to learn features useful for term extraction and classification, but the availability of reliable training resources is one of the main problems. Some proposed ATE approaches use machine learning (Conrado et al. 2013; Zhang et al. 2010; Newman et al. 2012). However, ML may also generate noise and silence. The main challenge is how to select a set of discriminating features that can be used for accurate recognition (and classification) of term instances. Another challenge concerns the detection of term boundaries, which are the most difficult to learn.

2.1.4 Hybrid methods

Most approaches combine several methods (typically linguistic and statistical) for the term extraction task. GlossEx (Kozakov et al. 2007) considers the probability of a word in the domain corpus divided by the probability of the same word appearing in a general corpus; moreover, the importance of the word is increased according to its frequency in the domain corpus. Weirdness (Ahmad et al. 1999) considers that the distribution of words in a specific domain corpus differs from that in a general corpus. C/NC-value (Frantzi et al. 2000) combines statistical and linguistic information for the extraction of multi-word and nested terms; it is the best-known measure in the literature. While most studies address specific types of entities, C/NC-value is a domain-independent method. It has also been used for recognizing terms in the biomedical literature (Hliaoutakis et al. 2009; Hamon et al. 2014). In Zhang et al. (2008), the authors showed that C-value obtains the best results compared to the other measures cited above. C-value has also been modified to extract single-word terms (Nakagawa and Mori 2002); in that work, the authors extract only terms composed of nouns. Moreover, C-value has been applied to languages other than English, e.g. Japanese, Serbian, Slovenian, Polish, Chinese (Ji et al. 2007), Spanish (Barrón-Cedeño et al. 2009), Arabic, and French. We have thus chosen C-value as one of our baseline measures. These baseline measures will be modified and evaluated together with the newly proposed measures.

Terminology extraction from parallel and comparable corpora Another kind of approach suggests that terminology may be extracted from parallel and/or comparable corpora. Parallel corpora contain texts and their translations into one or more languages, but such corpora are scarce (Bowker and Pearson 2002), especially for specialized domains. Comparable corpora are collections of similar texts in more than one language or variety (Déjean and Gaussier 2002); they are easier to build than parallel corpora. They are often used for machine translation, and the related approaches are based on linguistics, statistics, machine learning, and hybrid methods. The main objective of these approaches is to extract translation pairs from parallel/comparable corpora. Different studies propose the translation of biomedical terms for English–French by alignment techniques (Deléger et al. 2009). English–Greek and English–Romanian bilingual medical dictionaries have also been constructed with a hybrid approach that combines semantic information and term alignments (Kontonatsios et al. 2014b). Other approaches are applied to single- and multi-word terms with English–French comparable corpora (Daille and Morin 2005); the authors use statistical methods to align elements by exploiting contextual information. Another study proposes graph-based label propagation (Tamura et al. 2012); this approach builds a graph for each language (English and Japanese) and applies a similarity calculation between two words in each graph. Moreover, some machine learning algorithms can be used, e.g. a logistic regression classifier (Kontonatsios et al. 2014a). There are also approaches that combine both kinds of corpora (Morin and Prochasson 2011) (i.e. parallel and comparable) to reinforce extraction. Note that our corpora are not parallel and are far from being comparable because of the difference in their sizes; therefore, these approaches are not evaluated in our study.

2.1.5 Tools and applications for biomedical term extraction

Several applications implement some of the measures previously mentioned, especially C-value, for biomedical term extraction. Our study of related tools revealed that most existing systems, especially those implementing statistical methods, are designed to extract keywords and, to a lesser extent, to extract terminology from a text corpus. Indeed, most systems take a single text document as input, not a set of documents (a corpus), for which the IDF can be computed. Most systems are available only in English; the most relevant for the biomedical domain are:

  • TerMine Footnote 2, developed by the authors of the C-value method, only for English term extraction;

  • Java Automatic Term Extraction Footnote 3 (Zhang et al. 2008), a toolkit which implements several extraction methods, including C-value, GlossEx and TermEx, and offers other measures such as frequency, average term frequency, IDF, TF-IDF and RIDF;

  • FlexiTerm Footnote 4 (Spasic et al. 2013), a tool explicitly evaluated on biomedical corpora which offers more flexibility than C-value when comparing term candidates (treating them as bags of words and ignoring word order);

  • BioYaTeA Footnote 5 (Golik et al. 2013), a biomedical version of the YaTeA term extractor (Aubin and Hamon 2006); both are available as Perl modules, and the method is based only on linguistic aspects;

  • BioTex Footnote 6 (Lossio-Ventura et al. 2014a), dedicated to biomedical terminology extraction. It is available for online testing and assessment, and can also be used in any program as a Java library (POS tagger not included). In contrast to other existing systems, it allows users to analyze French and Spanish corpora, manually validate extracted terms and export the list of extracted terms.

2.2 Graph-based approaches

Graph modeling is an alternative way of representing information that clearly highlights the relationships among nodes (vertices). It also groups related information in a specific way, and centrality algorithms can be applied to exploit this structure. Centrality in a graph is the identification of the most important vertices within the graph. A host of measures have been proposed to analyze complex networks, especially in the social network domain (Borgatti 2005; Borgatti et al. 2009; Banerjee et al. 2014). Freeman (1979) formalized three different measures of node centrality: degree, closeness and betweenness. Degree is the number of neighbors a node is connected to. Closeness is the inverse of the sum of shortest distances from a node to all other nodes. Betweenness is the number of shortest paths from all vertices to all others that pass through the node. One study proposes to take the number of edges and their weights into account (Opsahl et al. 2010), since the three previous measures do not. Another well-known measure is PageRank (Page et al. 1999), which ranks websites. Boldi and Vigna (2014) evaluated the behavior of ten measures and associated centrality with the node of largest degree. Our approach proposes the opposite, i.e. we focus on nodes with a lower degree. An increasingly popular recent application of graph approaches to IR concerns social or collaborative networks and recommender systems (Noh et al. 2009; Banerjee et al. 2014).

Graph representations of text and scoring function definition are two widely explored research topics, but few studies have focused on graph-based IR in terms of both document representation and weighting models (Rousseau and Vazirgiannis 2015). First, text is modeled as a graph where nodes represent words and edges represent relations between words, defined on the basis of any meaningful statistical or linguistic relation (Blanco and Lioma 2012). In Blanco and Lioma (2012), the authors developed a graph-based word weighting model that represents each document as a graph. The importance of a word within a document is estimated by the number of related words and their importance, in the same way that PageRank (Page et al. 1999) estimates the importance of a page via the pages that link to it. Another study introduces a different representation of documents that captures relationships between words by using an unweighted directed graph of words with a novel scoring function (Rousseau and Vazirgiannis 2015).

In the above approaches, graphs are used to measure the influence of words in documents, as in automatic keyword extraction (AKE) methods, while ranking documents against queries. These approaches differ from ours: their graphs focus on extracting relevant words within a document and computing relations between words, whereas in our proposal a graph is built whose vertices are multi-word terms and whose edges are relations between multi-word terms. Moreover, we focus on a scoring function for multi-word terms that are relevant in a domain rather than in a single document.

2.3 Web mining approaches

Different web mining studies focus on semantic similarity and semantic relatedness, i.e. quantifying the degree to which words are related, considering not only similarity but also any possible semantic relationship among them. Word association measures can be divided into three categories (Chaudhari et al. 2011): (i) co-occurrence measures, which rely on the co-occurrence frequencies of both words in a corpus, (ii) distributional similarity-based measures, which characterize a word by the distribution of other words around it, and (iii) knowledge-based measures, which use knowledge sources like thesauri, semantic networks, or taxonomies (Harispe et al. 2014). In this paper, we focus on co-occurrence measures because our goal is to extract multi-word terms, and we suggest computing a degree of association between the words composing a term. Word association measures such as Dice, Jaccard, Overlap, and Cosine are used in several domains like ecology, psychology, medicine, and language processing, and were recently studied in (Pantel et al. 2009; Zadeh and Goel 2013). Another measure that computes the association between words using web search engine results is the Normalized Google Distance (Cilibrasi and Vitanyi 2007), which relies on the number of times words co-occur in the documents indexed by an information retrieval system. In this study, experimental results with our web-based measure will be compared with these basic measures (Dice, Jaccard, Overlap, Cosine).

3 Methodology

This section describes the baseline measures, their modifications, as well as the new measures that we propose for the biomedical term extraction task. The principle of our approach is to assign a weight to each term, representing its appropriateness as a relevant biomedical term. This makes it possible to output a list of terms ranked by their appropriateness. Our methodology for automatic term extraction has three main steps plus an additional step (a), described in Fig. 1 and in the sections hereafter:

(a) Pattern construction,

(1) Candidate term extraction,

(2) Ranking of candidate terms,

(3) Re-ranking.

Fig. 1 Workflow methodology for biomedical term extraction

3.1 Pattern construction (step a)

As previously mentioned, we assume that biomedical terms share similar syntactic structures (a linguistic aspect). Therefore, we built a list of the most common linguistic patterns according to the syntactic structure of the terms present in the UMLSFootnote 7 (for English and Spanish), and, for French, the French version of MeSH,Footnote 8 SNOMED International and the rest of the French content in the UMLS.

Part-of-Speech (POS) tagging is the process of assigning each word in a text to its grammatical category (e.g. noun, adjective). This process is performed based on the definition of the word or on the context in which it appears. Doing this manually is highly time-consuming, so we conducted automatic part-of-speech tagging.

We evaluated three tools (TreeTagger,Footnote 9 Stanford Tagger,Footnote 10 and Brill’s taggerFootnote 11). This evaluation was carried out over the entire workflow with the three tools, and we assessed the precision of the extracted terms. We noted that, in general, TreeTagger gave the best results for Spanish and French, while for English the Stanford tagger and TreeTagger gave similar results. We finally chose TreeTagger, which gave the best overall results and can be used for English, French and Spanish. Moreover, our choice is supported by a recent comparison study (Tian and Lo 2015), in which the authors showed that TreeTagger generally gives the best results, particularly for nouns and verbs.

Therefore, we carried out automatic part-of-speech tagging of the biomedical terms using TreeTagger, and then computed the frequency of the resulting syntactic structures. The 200 most frequent patterns were selected to build the list of patterns for each language. From this list, we also computed the weight (probability) associated with each pattern, i.e. the frequency of the pattern over the sum of frequencies (see Algorithm 1), but this weight will only be used for one measure. The number of terms used to build these pattern lists was 3,000,000 for English, 300,000 for French, and 500,000 for Spanish, taken from the previously mentioned terminologies. Table 1 illustrates the computation of the linguistic patterns and their weights for English.
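The following sketch illustrates this pattern-construction step (our reading of Algorithm 1, not the authors' code). It assumes the dictionary terms have already been POS-tagged, and it normalizes each selected pattern's frequency by the sum of frequencies of the selected patterns; the toy tag sequences are illustrative only:

```python
from collections import Counter

def build_pattern_list(tagged_terms, top_k=200):
    """Build weighted linguistic patterns from POS-tagged dictionary terms.

    tagged_terms: list of POS-tag sequences, one per dictionary term,
                  e.g. [("NN", "IN", "NN"), ("JJ", "NN"), ...]
    Returns a dict mapping each of the top_k most frequent patterns to its
    weight (pattern frequency / sum of frequencies of the selected patterns).
    """
    counts = Counter(tuple(tags) for tags in tagged_terms)
    top = counts.most_common(top_k)
    total = sum(freq for _, freq in top)
    return {pattern: freq / total for pattern, freq in top}

# Toy example with a handful of tagged UMLS-like terms
tagged = [("NN", "NN"), ("JJ", "NN"), ("NN", "NN"), ("NN", "IN", "NN"), ("JJ", "NN")]
print(build_pattern_list(tagged, top_k=3))
# e.g. {('NN', 'NN'): 0.4, ('JJ', 'NN'): 0.4, ('NN', 'IN', 'NN'): 0.2}
```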

Table 1 Example of pattern construction (where NN is a noun, IN a preposition or subordinating conjunction, JJ an adjective, and CD a cardinal number)

Different terminology extraction studies are based on the use of regular expressions to extract candidate terms, for instance (Frantzi et al. 2000). Generally, these regular expressions are manually built for a specific language and/or domain (Daille et al. 1994). In our setting, we prefer to (i) construct and (ii) apply patterns in order to extract terms from texts. These patterns have the advantage of being generic because they are based on defined PoS tags; at the same time, they are very specific because they are (automatically) built from specialized biomedical resources. On this last point, our approach is close to the use of regular expressions. There are two main reasons why we use specific linguistic patterns. First, we want to restrict the patterns to the biomedical domain. For instance, biomedical terms often contain numbers in their syntactic structure, which is very specific to this domain, e.g. “epididymal protein 9”, “pargyline 10 mg”. General patterns do not enable extraction of such terms. Our methodology is based on 200 significant patterns for each of English, French, and Spanish, different for each language. For instance, 55 of the English patterns contain numbers in their linguistic structure, so this kind of pattern seems quite relevant for this domain. The second reason for using specific linguistic patterns is that we assign a probability of occurrence to each pattern, which would not be possible with classical patterns and regular expressions.

Algorithm 1

3.2 Candidate term extraction (step 1)

The first main step is to extract the candidate terms. We apply part-of-speech tagging to the whole corpus using TreeTagger, then filter the content of the input corpus using the previously computed patterns: we keep only the terms whose syntactic structure appears in the pattern list. Pattern filtering is done on a per-language basis (i.e. when the text is in French, only the French list of patterns is used).
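A minimal sketch of this filtering step is shown below, assuming the corpus has already been POS-tagged and the weighted pattern list from step (a) is available; the tag names and the max_len limit are illustrative assumptions, not values from the paper:

```python
def extract_candidates(tagged_sentence, patterns, max_len=6):
    """Extract candidate terms from one POS-tagged sentence.

    tagged_sentence: list of (word, tag) pairs, e.g. [("external", "JJ"), ("fistula", "NN")]
    patterns: dict of tag-tuple -> weight, built in step (a)
    Returns (candidate_term, pattern) pairs whose tag sequence matches a pattern.
    """
    candidates = []
    n = len(tagged_sentence)
    for i in range(n):
        for j in range(i + 1, min(i + max_len, n) + 1):
            tags = tuple(tag for _, tag in tagged_sentence[i:j])
            if tags in patterns:
                term = " ".join(word for word, _ in tagged_sentence[i:j])
                candidates.append((term.lower(), tags))
    return candidates

sentence = [("external", "JJ"), ("fistula", "NN"), ("of", "IN"), ("the", "DT"), ("pancreas", "NN")]
print(extract_candidates(sentence, {("JJ", "NN"): 0.4, ("NN",): 0.3}))
# [('external fistula', ('JJ', 'NN')), ('fistula', ('NN',)), ('pancreas', ('NN',))]
```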

3.3 Ranking of candidate terms (step 2)

We need to select the terms that are most appropriate for the biomedical domain, so ranking the candidate terms is essential. For this purpose, several measures are proposed, and Fig. 1(2) shows the set of available measures. We propose some modifications of the best-known measures in the literature (i.e. C-value, TF-IDF, Okapi) and propose new ones (i.e. F-TFIDF-C, F-OCapi, LIDF-value, L-value). These measures are linguistic- and statistics-based, and they are not very time-consuming. In this step, only one measure is selected to perform the ranking. The measures in this section take as input the list of candidate terms previously filtered by linguistic patterns, which reduces the number of invalid terms to assess and thus deals with the noise problem; the linguistic patterns also alleviate the problem of extracting multi-word terms with complex and varied structures. Moreover, the frequency information further decreases the number of invalid terms to evaluate (noise). The measures mentioned above are effective on large amounts of data (Lv and Zhai 2011a, b; Singhal et al. 1996), which overcomes the problem of large-scale corpora. Hereafter we describe all the measures.

3.3.1 C-value

The C-value method combines linguistic and statistical information (Frantzi et al. 2000). The linguistic information is the use of general regular expressions as linguistic patterns, and the statistical information is the value assigned by the C-value measure, based on term frequency, to compute the termhood (i.e. the strength of association of a term with domain concepts). The C-value method aims to improve the extraction of long terms and was specially built for extracting multi-word terms.

$$C\hbox {-}value(A) = \left\{ \begin{array}{ll} w(A) \times f(A) & \quad \hbox {if } A \notin nested \\ w(A) \times \left( f(A) - \frac{1}{|S_A|} \times \sum _{b\in S_A}{f(b)} \right) & \quad \hbox {otherwise} \end{array} \right.$$
(1)

where A is the candidate term, \(w(A) = \log _2(|A|)\), |A| the number of words in A, f(A) the frequency of A in the unique document, \(S_A\) the set of terms that contain A and \(|S_A|\) the number of terms in \(S_A\). In a nutshell, C-value uses either the frequency of the term if the term is not included in other terms (first line), or decreases this frequency if the term appears in other terms, based on the frequency of those other terms (second line).

We modified the measure in order to extract all terms (single-word + multi-words terms), as also suggested in (Barrón-Cedeño et al. 2009), but in a different manner.

The original C-value defines \(w(A) = \log _2(|A|)\), and we modified \(w(A) = \log _2(|A|+ {1} )\) in order to avoid null values for single-word terms, as illustrated in Table 2. Note that we do not use a stop word list or a frequency threshold as was originally proposed.
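A minimal sketch of the modified C-value (formula 1 with \(w(A) = \log _2(|A|+1)\)) is given below; the freq and nested_in structures, and the toy frequencies, are our own illustrative assumptions, not data from the paper:

```python
import math

def c_value(term, freq, nested_in):
    """Modified C-value with w(A) = log2(|A| + 1), so single-word terms get a non-zero weight.

    term:      candidate term (space-separated words)
    freq:      dict mapping each candidate term to its frequency in the corpus
    nested_in: dict mapping a term A to the set S_A of longer candidate terms containing A
    """
    w = math.log2(len(term.split()) + 1)
    s_a = nested_in.get(term, set())
    if not s_a:
        return w * freq[term]
    return w * (freq[term] - sum(freq[b] for b in s_a) / len(s_a))

freq = {"soft contact lens": 5, "soft contact": 6, "lens": 9}
nested_in = {"soft contact": {"soft contact lens"}, "lens": {"soft contact lens"}}
for t in freq:
    print(t, round(c_value(t, freq, nested_in), 2))
# soft contact lens 10.0, soft contact 1.58, lens 4.0
```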

Table 2 Calculation of w(A)

3.3.2 TF-IDF and Okapi

These measures are used to associate a weight to each term in a document (Salton and Buckley 1988). This weight represents the term relevance for the document. The output is a ranked list of terms for each document, which is often used in information retrieval so as to order documents by their importance for a given query (Robertson et al. 1999). Okapi can be seen as an improvement of the TF-IDF measure, while taking the document length into account.

$$\textit{TF}\hbox {-}\textit{IDF}(A,d,D)= {} tf(A,d) \times idf(A,d)$$
(2)
$$\begin{aligned} tf(A,d)&= {} \frac{f(A,d)}{max\{f(w,d): w \in d \}} \\ idf(A,D)&= {} \log \frac{|D|}{|\{d \in D: A \in d \}|} \\ Okapi(A,d,D)&= {} {tf}_{BM25}(A,d) \times idf_{BM25}(A,D)\\ {tf}_{BM25}(A,d)&= {} \frac{tf(A,d) \times (k_1 + 1)}{tf(A,d) + k_1 \times (1-b+b \times \frac{dl(d)}{dl_{avg}} )} \\ idf_{BM25}(A,D)&= {} \log \frac{|D| - dc(A) + 0.5}{dc(A) + 0.5}\end{aligned}$$
(3)

where A is a term, d a document, D the collection of documents, f(A, d) the frequency of A in d, tf(A, d) the term frequency of A in d, idf(A, D) the inverse document frequency of A in D, dc(A) the number of documents containing the term A, i.e. \(|\{d \in D: A \in d \}|\), dl(d) the length of document d in number of words, and \(dl_{avg}\) the average document length in the collection.

As the output is a ranked list of terms per document, the same term can appear in different documents with different weights, so we need to merge these lists into a single list of terms. For this, we propose to merge them according to three functions, which respectively compute the sum (S), maximum (M) and average (A) of the weights of a term. At the end of this task, we have three lists from Okapi and three lists from TF-IDF. The notation for these lists is \(Okapi_X(A)\) and \(\textit{TF}\hbox {-}\textit{IDF}_X(A)\), where A is the term and X the factor \(\in \{M,S,A\}\). For example, \(Okapi_M(A)\) is the value obtained by taking the maximum Okapi value for a term A over the whole corpus. Figure 2 shows the merging process.
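The merging step can be sketched as follows (our own illustration, not the authors' code); the per-document scores are assumed to be already computed by TF-IDF or Okapi, and the numeric values are invented:

```python
def merge_scores(per_document_scores, factor="M"):
    """Merge per-document term scores (e.g. TF-IDF or Okapi values) into one ranked list.

    per_document_scores: list of dicts, one per document, mapping term -> weight
    factor: "M" (max), "S" (sum) or "A" (average), as in Okapi_M, Okapi_S, Okapi_A
    """
    merged = {}
    for doc_scores in per_document_scores:
        for term, weight in doc_scores.items():
            merged.setdefault(term, []).append(weight)
    if factor == "M":
        agg = {t: max(ws) for t, ws in merged.items()}
    elif factor == "S":
        agg = {t: sum(ws) for t, ws in merged.items()}
    else:  # "A"
        agg = {t: sum(ws) / len(ws) for t, ws in merged.items()}
    return sorted(agg.items(), key=lambda kv: kv[1], reverse=True)

docs = [{"blood cell": 0.8, "virus production": 0.3},
        {"blood cell": 0.2, "t cell": 0.6}]
print(merge_scores(docs, factor="M"))
# [('blood cell', 0.8), ('t cell', 0.6), ('virus production', 0.3)]
```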

Fig. 2 Merging lists

With the aim of improving the term extraction precision, we designed two new combined measures that take the values obtained in the above steps into account. Both are based on the harmonic mean of two values.

3.3.3 Combinations: F-OCapi and F-TFIDF-C

Each combined measure is the harmonic mean of the two values used, which has the advantage of using all values of the distribution.

$$F\hbox {-}OCapi_X(A)= {} 2 \times \frac{Okapi_X(A) \times C\hbox {-}value(A)}{Okapi_X(A) + C\hbox {-}value(A)}$$
(4)
$$F\hbox {-}\textit{TFIDF}\hbox {-}C_X(A)= {} 2 \times \frac{\textit{TFIDF}_X(A) \times C\hbox {-}value(A) }{\textit{TFIDF}_X(A) + C\hbox {-}value(A)}$$
(5)
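Both combinations reduce to the harmonic mean of two scores, as in this small sketch (the input values are illustrative, e.g. an Okapi_M score and a C-value score for the same term):

```python
def harmonic_combination(score_a, score_b):
    """Harmonic mean of two term scores, as used by F-OCapi and F-TFIDF-C."""
    if score_a + score_b == 0:
        return 0.0
    return 2 * (score_a * score_b) / (score_a + score_b)

print(round(harmonic_combination(0.8, 10.0), 2))  # 1.48
```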

3.3.4 LIDF-value and L-value

In this section, we present two new measures. The first one, called LIDF-value (Linguistic patterns, IDF, and C-value information), was partially presented in Lossio-Ventura et al. (2014c). It is a new ranking measure based on linguistic and statistical information.

Our LIDF-value method is aimed at computing the termhood of each term, using the linguistic information calculated as described below, the idf, and the C-value of each term. The linguistic information gives greater importance to the term unithood in order to detect low frequency terms. We therefore associate the pattern weight (see Table 1) with the candidate term as its probability, i.e. the probability of the candidate term being a relevant biomedical term. This probability is assigned only if the syntactic structure of the term appears in the linguistic pattern list.

The inverse document frequency (idf) is a measure indicating the extent to which a term is common or rare across all documents. It is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient. The probability and idf improve low frequency term extraction. The objective of these two components is to tackle the silence problem, allowing extraction of discriminant terms; for instance, in a biomedical corpus, “virus production”, despite its low frequency, is ranked better than “human monocytic cell”, which has a higher frequency. This means that the score of a low frequency candidate term can be favored if its linguistic pattern has a high probability and/or its idf value is high. The C-value measure is based on term frequency and favors a candidate term that does not often appear within a longer term. For instance, in a specialized corpus (ophthalmology), the authors of Frantzi et al. (2000) found the irrelevant term “soft contact” while the frequent and longer term “soft contact lens” is relevant.

Algorithm 2 describes the applied process. These different information items (i.e. probability of linguistic patterns, C-value, idf) are combined to define the global ranking measure LIDF-value (see formula 6), where \({\rm {P}}(A_{LP})\) is the probability of a term A whose linguistic structure matches pattern LP, i.e. the weight of the linguistic pattern LP computed in the Pattern Construction subsection.

$$\textit{LIDF}\hbox {-}value(A) = {\rm {P}}(A_{LP}) \times idf(A) \times C\hbox {-}value(A)$$
(6)
Algorithm 2

Note that LIDF-value works only for a set of documents, mainly because the idf measure can only be computed on a set of documents (see formula 2). Therefore, for datasets composed of one document, we propose a new measure, L-value, as explained in the following paragraphs.

L-value is a variant of LIDF-value for single-document corpora that still benefits from the probability of linguistic patterns computed for LIDF-value; it does not include the idf (see formula 7). L-value is interesting for highlighting the most representative terms of a single corpus without considering discriminative aspects such as the idf. It gives another point of view and is complementary to the idf-based measures.

A single document can be considered as free text without delimitation, for instance a scientific article, a book, or a document created from titles/abstracts of a library database. L-value becomes interesting when there is not a considerable amount of data for a new subject, i.e. an emergent term in the community. For instance, the term “Ataxia Neuropathy Spectrum” appears in only four titles/abstracts of scientific articles from PubMedFootnote 12 between 2009 and 2015. PubMed is a free search engine primarily accessing the MEDLINE database of references and abstracts on life sciences and biomedical topics.

$$L\hbox {-}value(A) = {\rm {P}}(A_{LP}) \times C\hbox {-}value(A)$$
(7)
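As a minimal sketch of formulas (6) and (7) (not the paper's implementation), the final scores are simple products of the components computed earlier; the pattern weight, document counts and C-value below are illustrative numbers only, and the C-value is assumed to have been computed beforehand (e.g. as in the earlier sketch):

```python
import math

def lidf_value(pattern_weight, idf, c_value_score):
    """LIDF-value(A) = P(A_LP) x idf(A) x C-value(A), formula (6)."""
    return pattern_weight * idf * c_value_score

def l_value(pattern_weight, c_value_score):
    """L-value(A) = P(A_LP) x C-value(A), the single-document variant, formula (7)."""
    return pattern_weight * c_value_score

# Toy term: pattern weight 0.102, found in 3 of 2000 documents, C-value of 4.0
idf = math.log(2000 / 3)
print(round(lidf_value(0.102, idf, 4.0), 3), l_value(0.102, 4.0))  # ~2.653 and 0.408
```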

3.4 Re-ranking (step 3)

After the term extraction, we propose new measures to re-rank the candidate terms in order to increase the top-k term precision. The re-ranking measures aim to improve the term extraction results of the ranking measures by positioning the most relevant biomedical terms at the top of the list, which gives more confidence that the terms appearing there are true biomedical terms.

These re-ranking functions are an extension of the measures presented in Lossio-Ventura et al. (2014b). As improvements, we propose to take graph-theoretic information into account to highlight relevant terms, as well as web information, as explained in the following subsections. These measures can be executed separately, but the graph construction is time-consuming and the number of search engine queries is limited. Therefore, we apply these measures only to a group of selected terms given by a ranking measure, since the ranking measures have proved more efficient when applied before TeRGraph and the web-based measures.

As these measures are applied to the list of terms obtained with a ranking measure, which tackles the noise, silence and multi-word term extraction problems, they also take those problems into account. As mentioned, the objective of the re-ranking measures is to re-rank terms, so the manual validation effort for candidate terms decreases because the relevant biomedical terms are placed at the top of the list.

3.4.1 A new graph-based ranking measure: “TeRGraph” (terminology ranking based on graph information)

This approach aims to improve the ranking (and therefore the precision results) of extracted terms. In contrast to the studies cited above, the graph is built from the list of terms obtained with a ranking measure described in Sect. 3.3, where vertices denote terms linked by their co-occurrence in sentences of the corpus. Moreover, we make the hypothesis that the representativeness of a term in the graph, for a specific domain, depends on its number of neighbors and on the number of neighbors of its neighbors. We assume that a term with more neighbors is less representative of the specific domain, i.e. that such a term is also used in the general domain. Figure 3 illustrates our hypothesis.

Fig. 3 Importance of a term in a domain

The graph-based approach is divided into two steps:

  1. (i)

    Graph construction: a graph (see Fig. 5) is built where vertices denote terms and edges denote co-occurrence relations between terms; the co-occurrence of two terms in the initial corpus is measured as the weight of their relation. This approach is statistical because it links all co-occurring terms without considering their meaning or function in the text. The graph is undirected, as the edges only imply that terms co-occur, without any further distinction regarding their role. We take the Dice coefficient, a basic measure of the co-occurrence between two terms x and y, as defined by the following formula:

    $$D(x,y)= {} \frac{2 \times P(x,y)}{P(x)+P(y)}$$
    (8)
  2. (ii)

    Representativeness computation on the term graph: a principled graph-based measure to compute term weights (representativeness) is defined. The aim of this new graph-based ranking measure, TeRGraph (see Eq. 9), is to derive a weight for each vertex (i.e. each multi-word term) in order to re-rank the list of extracted terms.

$$TeRGraph(A) = \log _2\left( k + \frac{1}{1 + |{\rm {N}}(A)| + \sum \nolimits _{T_i\in {\rm {N}}(A)}{|{\rm {N}}(T_i)|}} \right)$$
(9)

where A represents a vertex (term), \({\rm {N}}(A)\) the neighborhood of A, \(|{\rm {N}}(A)|\) the number of neighbors of A, \(T_i\) the neighbor i of A, and k a constant. The intuition for Eq. 9 is as follows: the more neighbors a term A has (directly with \({\rm {N}}(A)\) or by transitivity with \({\rm {N}}(T_i)\)), the more its weight decreases. Indeed, a term A with many neighbors is considered too general for the domain (i.e. not salient), so it is penalized via its score.

The k constant affects the TeRGraph value, i.e. the set of values that TeRGraph takes when k changes. For instance, when \(k = 0.5\), the TeRGraph values lie between −1 and 0 (i.e. \(TeRGraph \in [-1,0]\)), and when \(k = 1\), \(TeRGraph \in [0,0.6]\). As the values taken by TeRGraph differ, the slope of the curve also differs. Figure 4 shows the behavior of TeRGraph when k changes. Based on the experiments, we chose \(k=1.5\), mainly because the slope of the curve is low and the TeRGraph values then range from 0.6 to 1.

Fig. 4 TeRGraph’s value for \(k =\{ 0.5; 1; 1.5; 2 \}\)

See Algorithm 3 for more details; it describes the entire process: (1) co-occurrence graph construction, and (2) computation of the representativeness of each vertex.

Algorithm 3
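To make the two steps concrete, here is a minimal sketch (our own reading, not the authors' implementation) of building the co-occurrence graph with the Dice coefficient (formula 8) and computing TeRGraph scores (formula 9); the occurrence sets, threshold and k value below are illustrative:

```python
import math
from itertools import combinations

def dice(occ_x, occ_y):
    """Dice coefficient between two terms from their sets of sentence ids (formula 8)."""
    if not occ_x or not occ_y:
        return 0.0
    return 2 * len(occ_x & occ_y) / (len(occ_x) + len(occ_y))

def build_term_graph(term_occurrences, threshold=0.6):
    """Link two terms when the Dice value of their co-occurrence exceeds the threshold (delta)."""
    graph = {t: set() for t in term_occurrences}
    for a, b in combinations(term_occurrences, 2):
        if dice(term_occurrences[a], term_occurrences[b]) > threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def tergraph(term, graph, k=1.5):
    """TeRGraph(A) = log2(k + 1 / (1 + |N(A)| + sum over neighbours T_i of |N(T_i)|)), formula (9)."""
    neighbours = graph[term]
    denom = 1 + len(neighbours) + sum(len(graph[t]) for t in neighbours)
    return math.log2(k + 1.0 / denom)

# Toy occurrence sets: term 'a' co-occurs with 'b' and 'c'; 'd' is isolated
occ = {"a": {1, 2, 3}, "b": {1, 2}, "c": {2, 3}, "d": {7}}
g = build_term_graph(occ, threshold=0.6)
print({t: round(tergraph(t, g), 3) for t in g})
# {'a': 0.766, 'b': 0.807, 'c': 0.807, 'd': 1.322} -> the term with fewest neighbours scores highest
```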

Figure 5 shows an example of calculating the TeRGraph value for a term in different graphs, built with different co-occurrence thresholds (i.e. Dice values between two terms). In this example, \(A_1\) and \(A_2\) represent the term chloramphenicol acetyltransferase reporter in Graphs 1 and 2, respectively.

Fig. 5 TeRGraph’s value for chloramphenicol acetyltransferase reporter

3.4.2 WebR

The aim of our web-based measure is to predict with better confidence whether a candidate term is a valid biomedical term. It is appropriate for multi-word terms, as it computes the dependence between the words of a term. In our case, we compute a “strict” dependence, meaning that the proximity of the words of a term (i.e. neighboring words) is required. In comparison to other web-based measures (Cilibrasi and Vitanyi 2007), WebR reduces the number of pages to consider by taking into account only web pages containing all the words of the term. In addition, our measure can easily be adapted to all types of multi-word terms.

$$WebR(A) = \frac{nb(\hbox{``}A\hbox{''})}{nb(A)}$$
(10)

where A is the multi-word candidate term and its words \(a_i \in A\) are nouns, adjectives or foreign words; \(nb(\hbox {``}A\hbox {''})\) is the number of hits returned by a web search engine for an exact match of the multi-word term A (query with quotation marks “A”), and nb(A) is the number of documents returned by the search engine including non-exact matches (query A without quotation marks), i.e. documents containing the words of the multi-word term A anywhere. For example, the multi-word term treponema pallidum generates two queries: the first, \(nb(\hbox {``} treponema \, pallidum \hbox {''})\), returns 1,100,000 documents with Yahoo, and the second, \(nb( treponema \, pallidum )\), returns 1,300,000 documents; then \(\textit{WebR(treponema pallidum)}= \frac{1100000}{1300000} = 0.85\).

In our workflow, we tested Yahoo and Bing. WebR re-ranks the list of candidate terms returned by the combined measures.
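A minimal sketch of WebR is shown below; hit_count is a hypothetical user-supplied function wrapping a search-engine API (no real API call or endpoint is shown), and the hard-coded hit counts simply reproduce the Yahoo example above:

```python
def webr(term, hit_count):
    """WebR(A) = nb("A") / nb(A), formula (10).

    hit_count: hypothetical function returning the number of hits a web search
    engine reports for a query string (real search APIs and their query quotas
    are not shown here).
    """
    exact = hit_count(f'"{term}"')   # nb("A"): exact-phrase query
    loose = hit_count(term)          # nb(A): pages containing the words anywhere
    return exact / loose if loose else 0.0

fake_hits = {'"treponema pallidum"': 1_100_000, 'treponema pallidum': 1_300_000}
print(round(webr("treponema pallidum", fake_hits.get), 2))  # 0.85
```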

3.4.3 A new web ranking measure: WAHI (Web Association based on Hits Information)

Previous studies of web mining approaches query the web via search engines to measure word associations. This enables measurement of the association of words composing a term (e.g. soft, contact, and lens that compose the relevant term soft contact lens). To measure this association, our web-mining approach takes the number of pages provided by search engines into account (i.e. number of hits).

Our web-based measure re-ranks the list obtained previously with TeRGraph. We will show that this improves the precision of the first k extracted terms (see Sect. 4) and that it is especially appropriate for multi-word term extraction.

Formula 8 leads directly to formula 11.Footnote 13 The nb function used in formula 11 represents the number of pages returned by search engines (i.e. Yahoo and Bing). With this measure, we compute a strict dependence (i.e. neighboring words, enforced with the quotation-mark operator of search engines). For instance, x might represent the word soft and y the word contact in order to calculate the association measure of the soft contact term.

$$\textit{Dice}(x,y)= {} \frac{2 \times nb({\hbox{``}}x\,y{\hbox{''}})}{nb(x)+nb(y)}$$
(11)

Then we extend this formula to n elements as follows:

$$\textit{Dice}(a_1,\ldots ,a_n)= {} \frac{n \times nb({\hbox{``}}a_1 \cdots a_n{\hbox{''}})}{nb(a_1)+ \cdots +nb(a_n)} = \frac{ n \times nb( {\hbox{``}}A{\hbox{''}})}{ \sum \nolimits _{i=1}^n nb( a_i )}$$
(12)

This measure enables us to calculate a score for all multi-word terms, such as soft contact lens.

To obtain WAHI, we propose to combine the Dice criterion with WebR (see formula 10). WebR only takes into account the number of web pages containing all the words of the term, by using the “ ” and AND operators.

For example, for soft contact lens, the numerator corresponds to the number of web pages returned for the query “soft contact lens”, while for the denominator we consider the query soft AND contact AND lens.

Finally, the global ranking approach combining Dice and WebR is given by the WAHI measure (Web Association based on Hits Information):

$$WAHI(A) =\frac{ n \times nb( \hbox{``}A{\hbox{''}})}{ \sum \nolimits _{i=1}^n nb( a_i )} \times \frac{nb({\hbox{``}}A\hbox{''})}{nb(A)}$$
(13)

Algorithm 4 details the global web mining process to rank terms. We show in the next section that open-domain (general) resources, such as the web, can be tapped to support domain-specific term extraction. They can thus be used to compensate for the unavailability of domain-specific resources.
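A minimal sketch of the WAHI computation (formula 13) is given below, reusing the same hypothetical hit_count function as in the WebR sketch; all hit counts are invented for illustration:

```python
def wahi(term, hit_count):
    """WAHI(A) = Dice(a_1..a_n) x WebR(A), formula (13)."""
    words = term.split()
    exact = hit_count(f'"{term}"')                  # nb("A"): exact-phrase query
    all_words = hit_count(" AND ".join(words))      # nb(A): pages containing all the words
    word_sum = sum(hit_count(w) for w in words)     # sum of nb(a_i)
    dice_n = len(words) * exact / word_sum if word_sum else 0.0
    webr_score = exact / all_words if all_words else 0.0
    return dice_n * webr_score

# Invented hit counts for the soft contact lens example
fake_hits = {'"soft contact lens"': 500_000, 'soft AND contact AND lens': 800_000,
             'soft': 90_000_000, 'contact': 120_000_000, 'lens': 40_000_000}
print(round(wahi("soft contact lens", fake_hits.get), 5))  # 0.00375
```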

Algorithm 4

4 Experiments and results

4.1 Data, protocol, and validation

4.1.1 Data

We used two corpora for our experiments. The first one is a set of biological laboratory tests extracted from LabTestsOnline.Footnote 14 This website provides information in several languages to patients and family caregivers about clinical lab tests. Each test forms a document in our corpus and includes the formal lab test name, some synonyms and possible alternate names, as well as a description of the test. The entire LabTestsOnline website was crawled for English, French, and Spanish with a crawler created specifically for this purpose. These documents are available online.Footnote 15 Table 3 shows the details of the LabTestsOnline corpus for the different languages.

Table 3 Details of LabTestsOnline corpus

The second corpus is GENIA,Footnote 16 which is made up of 2000 titles and abstracts of journal articles culled from the Medline database, with more than 400,000 words in English. The GENIA corpus contains linguistic expressions referring to entities of interest in molecular biology, such as proteins, genes and cells. GENIA is an annotated dataset, in which technical term annotation covers the identification of physical biological entities as well as other important terms. This is our gold standard corpus. Whereas Medline indexes a broad range of academic articles covering general and specific domains of the life sciences, GENIA is intended to cover a smaller subject domain: biological reactions concerning transcription factors in human blood cells.

4.1.2 Protocol

As the measures described in step 2 of our workflow (i.e. Ranking of candidate terms) are not very time-consuming and are easily applicable to large corpora, they were evaluated over the LabTestsOnline corpus for English, French, and Spanish, and over the gold standard corpus, GENIA. In contrast, as the measures described in step 3 (i.e. Re-ranking) are highly time-consuming and are used at the end of the process to enhance the results, we evaluated them only over the GENIA corpus.

4.1.3 Validation

In order to automatically validate and cover medical terms, we use UMLS for English and Spanish, and the French version of MeSH, SNOMED International and the rest of the French content in the UMLS. For instance, if an extracted candidate term is found in the UMLS dictionary, this term will be automatically validated. The results are evaluated in terms of precision obtained over the top k extracted terms (P@k).

Biomedical terminologies and ontologies (e.g. UMLS, SNOMED, MeSH) contain terms composed with punctuation signs. Therefore, we cleaned these terminologies by eliminating all terms containing (; , ? ! : { } [ ]) and kept only terms without such signs. Table 4 shows the n-gram distribution (an n-gram being a term of n words, with \(n \ge 1\)) of the biomedical resources for the three languages, as well as the number of terms kept after the cleaning task. For instance, the first cell means that 13.73 % of the terms in the English UMLS are composed of one word (1-gram).

Table 4 Details of available resources for validation

4.2 Multilingual comparison (LabTestsOnline)

In this section, we show the results obtained with the ranking measures only, i.e. step 2 (ranking) in Fig. 1. In addition, we tested the measures for single- plus multi-word terms, and for multi-word terms only, in English, French and Spanish. Tables 5, 6 and 7 show the results in English, French and Spanish, respectively. The top of each table presents the \(\hbox {single-word} + \hbox {multi-word}\) term extraction results, while the bottom presents the multi-word term extraction results.

These tables show that LIDF-value and L-value obtain the best results for both extraction cases and for the three languages. The combined measures based on the harmonic mean, and on the SUM and MAX (i.e. \(F \hbox {-} TFIDF \hbox {-} C _M\), \(F \hbox {-} TFIDF \hbox {-} C _S\)), also give interesting results.

The \(\hbox {single-word} + \hbox {multi-word}\) term extraction results are better than the multi-word-only extraction results. The main reason is that the extraction of single-word terms is more reliable due to their syntactic structure (usually a single noun), which also has fewer variations. Multi-word term extraction, in contrast, is more complicated and involves more variations, which lowers its results.

We observe that LIDF-value and L-value obtain very close results, with LIDF-value performing better than L-value in most cases. These two measures show that the probability associated with the linguistic patterns helps to improve the term extraction results. Note that the idf influences LIDF-value; for this reason, LIDF-value obtains better results than L-value.

Table 5 Biomedical term extraction for English
Table 6 Biomedical term extraction for French
Table 7 Biomedical term extraction for Spanish

4.3 Evaluation of the global process (GENIA)

Since GENIA is the gold standard corpus, we conduct a detailed assessment of the experiments in this subsection. We evaluated the entire workflow of our methodology, i.e. steps 2 (ranking) and 3 (re-ranking) in Fig. 1. As noted earlier, the multi-word term extraction results are influenced by the syntactic structures of terms and their variations, so our experimentation in this subsection focuses only on multi-word term extraction.

In the following paragraphs, we also narrow down the presented results by keeping only the first 8000 extracted terms for the graph-based measure and the first 1000 extracted terms for the web-based measure.

4.3.1 Ranking results (step 2 in Fig. 1)

Table 8 presents and compares the multi-word term extraction results with the best ranking measures, as shown earlier, i.e. C-value, \(F \hbox {-} TFIDF \hbox {-} C _M\), and LIDF-value. The best results were obtained with LIDF-value with an 11 % improvement in precision for the first hundred extracted multi-word terms. These precision results are also shown in Fig. 6. The precision of LIDF-value will be further improved with TeRGraph.

Table 8 Precision comparison of LIDF-value with baseline measures
Fig. 6 Precision comparison with LIDF-value and baseline measures

4.3.2 Results of n-gram terms

We also evaluated C-value, \(F \hbox {-} TFIDF \hbox {-} C _M\), and LIDF-value on n-gram terms (an n-gram term being a multi-word term of n words); for this, we require an index term to be an n-gram term of length \(n \ge 2\). We tested the performance of LIDF-value on n-gram term extraction, taking the first 1000 n-gram terms (\(n \ge 2\)).

Table 9 shows the precision comparison for the 2-gram, 3-gram and \(4+\) gram terms extracted with C-value, \(F \hbox {-} TFIDF \hbox {-} C _M\), and LIDF-value. We can see that LIDF-value obtains the best results for all intervals and for any \(n \ge 2\). These precision results are also shown in Fig. 7 for the 2-gram terms, Fig. 8 for the 3-gram terms, and finally Fig. 9 for the \(4+\) gram terms.

Table 10 shows the top-20 ranked 2-gram terms extracted with the baseline measures and LIDF-value. C-value obtained three irrelevant terms, F-TFIDF-C obtained five irrelevant terms while LIDF-value obtained only two irrelevant terms for the top-20 ranked 2-gram terms.

Similarly, Table 11 shows top-10 ranked 3-gram terms extracted with the baseline measures and LIDF-value. Finally, Table 12 shows the top-10 ranked \(4+\) gram terms extracted with the baseline measures and LIDF-value.

Note that in this context, “irrelevant” means that the terms are not in the above-mentioned resources. These candidate terms might be interesting for ontology extension or population; however, they must pass through polysemy detection in order to identify their possible meanings.

Table 9 Precision comparison of 2-gram terms, 3-gram terms, and \(4+\) gram terms
Fig. 7 Precision comparison of 2-gram terms

Fig. 8 Precision comparison of 3-gram terms

Fig. 9 Precision comparison of \(4+\) gram terms

Table 10 Comparison of top-20 ranked 2-gram terms (irrelevant terms are italicized and marked with *).
Table 11 Comparison of the top-10 ranked 3-gram terms (irrelevant terms are italicized and marked with *)
Table 12 Comparison of the top-10 ranked \(4+\) gram terms (irrelevant terms are italicized and marked with *)

4.3.3 Re-ranking results (step 3 in Fig. 1)

Graph-based results Our graph-based approach is applied to the first 8000 terms extracted by the best ranking measure. The objective is to re-rank these 8000 terms while trying to improve the precision by intervals. One parameter is involved in the computation of graph-based term weights, i.e. the threshold on the Dice value of a relation when building the term graph: two terms are linked only if the Dice value of their relation is higher than the threshold. We vary the threshold (\(\delta\)) within \(\delta = [0.25, 0.35, 0.50, 0.60, 0.70]\) and report the precision performance for each of these values. Table 13 gives the precision performance obtained by TeRGraph and shows that it is well adapted for ATE.

Table 13 Precision performance of TeRGraph when varying \(\delta\) (threshold parameter for Dice)

Web-based results Our web-based approach is applied at the end of the process, to only the first 1000 terms extracted by the previous linguistic, statistical and graph-based measures. For space reasons, we show only the results obtained with WAHI, which are higher than those of WebR.

We took the list obtained with TeRGraph and \(\delta \ge 0.60\). The main reason for this limitation is the limited number of automatic queries possible in search engines. At this step, the aim is to re-rank the 1000 terms to try to improve the precision by intervals. Each measure listed in Table 14 and Table 15 shows the precision obtained after re-ranking. We tested WAHI with Yahoo and Bing search engines.

Tables 14 and 15 show that WAHI (using either Yahoo or Bing) is well adapted for ATE and obtains better precision results than the baseline word association measures. Our measures thus rank true dictionary terms higher.

Table 14 Precision comparison of WAHI with YAHOO and word association measures
Table 15 Precision comparison of WAHI with BING and word association measures

4.3.4 Summary

LIDF-value obtains the best precision results for multi-word term extraction, for each index term extraction (n-gram) and for intervals.

Table 16 presents a precision comparison of LIDF-value and TeRGraph measures. In terms of overall precision, our experiments produce consistent results from the GENIA corpus. In most cases, TeRGraph obtains better precision with a \(\delta\) of 0.60 and 0.70 (i.e. better precision in most P@k intervals), which is very good because it helps alleviate the problem of manual validation of candidate terms. These precisions are also illustrated in Fig. 10.

The performance of our graph-based measure somewhat depends on the value of the co-occurrence relation between terms. Specifically, the value of the co-occurrence relation affects how the graph is built (which edges are kept), and hence it is critical for the computation of the graph-based term weight. Another performance factor of our graph-based measure is the quality of the results obtained with LIDF-value, because the list of terms extracted with LIDF-value is required as input for TeRGraph to construct the graph, where nodes denote terms and edges denote co-occurrence relations.

Table 16 Precision comparison of LIDF-value and TeRGraph
Fig. 10 Precision comparison of LIDF-value and TeRGraph

Table 17 presents the precision comparison of our three measures.

WAHI based on Yahoo obtains better precision for the first P@100 extracted terms, with 96 % precision, whereas WAHI based on Bing obtains 90 %. For the other intervals, Table 17 shows that WAHI based on Bing generally gives the best results. This is very encouraging because it also helps alleviate the problem of manual validation of candidate terms.

The performance of WAHI depends on the search engine because algorithms designed for searching information on the web are different, so the number of hits returned will differ in all cases. Another performance factor is the quality of the re-ranked list obtained with TeRGraph, because this list is required as input.

Moreover, Table 17 highlights that re-ranking with WAHI enables us to increase the precision of TeRGraph. For all cases, our re-ranking methods improve the precision obtained with LIDF-value. The purpose for which this web-mining measure was designed has thus been fulfilled.

Note that these measures do not normalize the possible variants. This could be a limitation for researchers looking for a preferred term for a group of variants.

Table 17 Precision comparison LIDF-value, TeRGraph, and WAHI

5 Discussion

We discuss the effects of some parameters of our workflow. In the next sections, we explain the impact of the biomedical pattern lists, the dictionary size, and extraction errors.

5.1 Impact of pattern list

In our methodology, we have shown that biomedical patterns directly affect the term extraction results. For instance, L-value, which combines C-value and the pattern probability, gives better results than C-value for the three languages, and LIDF-value outperforms L-value in most cases. These pattern lists are specific to the biomedical domain: if we use biomedical patterns in another domain instead of patterns specific to that domain, the term extraction results are affected. To show this, we extracted terms from an agronomic corpus for English and French using both biomedical patterns and agronomic patterns. We built the agronomic patterns using AGROVOC,Footnote 17 an agronomic dictionary containing 39,542 English and 37,382 French terms. Our corpus consists of titles plus abstracts extracted from the list of Cirad publications (French Agricultural Research Centre for International Development). Table 18 shows the details of the corpus formed for this comparison.

Table 19 presents a term extraction comparison using patterns built from the two domains. Again, LIDF-value obtains the best results. We also see that terms extracted with agronomic patterns give better results than those extracted with biomedical patterns for English and French.

Note that even if the term extraction results obtained with agronomic patterns are higher than those obtained with biomedical patterns, the results remain quite close. The main reason is that biomedical and agronomic terms overlap, meaning that identical patterns exist in both domains. The difference would likely be larger with patterns from two completely different domains.

Table 18 Details of Cirad corpus
Table 19 Precision comparison of term extraction with agronomic and biomedical Patterns

5.2 Effect of dictionary size

Dictionaries play an important role in term extraction, specifically during the construction of pattern lists. Table 19 shows that a reduction in dictionary size degrades the precision results in comparison to Tables 5, 6, and 8. For instance, for the agronomic and biomedical domains, Tables 19 and 5 show P@100 values of 0.92 and 1.00, respectively, and this difference increases as the number of extracted terms increases (i.e. with larger k in P@k).

5.3 Term extraction errors

As explained in Sect. 3 (step a), the term extraction results are influenced by the Part-of-Speech (PoS) tagging tools, which perform differently for different languages. Briefly, tool “A” can perform very well for English, while for French tool “B” gives the best results. For instance, the sentence “Red blood cells increase with ...” was tagged by the Stanford tool as “adjective noun noun verb preposition ...”, whereas TreeTagger tagged it as “adjective noun noun noun preposition ...”. Therefore, in order to show the generality of our approach, we chose a single PoS tool, i.e. TreeTagger, as a trade-off for the three languages (English, French, and Spanish), knowing that this may penalize the results for each of them.

6 Conclusions and future work

This paper defines and evaluates several measures for automatic multi-word term extraction. These measures are classified as ranking measures and re-ranking measures, and are based on linguistic, statistical, graph and web information. We modified some baseline measures (i.e. C-value, TF-IDF, Okapi) and proposed new ones.

All the ranking measures are linguistic- and statistics-based. The best ranking measure is LIDF-value, which compensates for the lack of frequency information with the linguistic pattern probability and the idf values.

We experimentally showed that LIDF-value, applied in the biomedical domain over two corpora (i.e. LabTestsOnline, GENIA), outperformed a state-of-the-art baseline for extracting terms (i.e. C-value), obtaining the best precision results in all intervals (i.e. \(P\hbox {@}k\)). The LIDF-value trends were similar across the three languages.

We have shown that multi-word term extraction is more complex than single-word term extraction. We detailed an evaluation over the GENIA corpus for multi-word term extraction. Moreover, in that case, LIDF-value improved the automatic term extraction precision in comparison to the most popular term extraction measure.

We also evaluated the re-ranking measures. The first re-ranking measure, TeRGraph, is a graph-based measure that decreases the human effort required to validate candidate terms. To our knowledge, graph-based measures had never been applied to automatic term extraction before. TeRGraph takes the neighborhood of a term into account to compute its representativeness in a specific domain.

The other re-ranking measures are web-based. The best one, called WAHI, takes the list of terms obtained with TeRGraph as input. WAHI enables us to further reduce the huge human effort required for validating candidate terms.

Our experimental evaluations revealed that TeRGraph had better precision than LIDF-value for all intervals. Moreover, our experimental assessments revealed that WAHI improved the results given with TeRGraph for all intervals.

As a future extension of this work, we intend to use the relation value within TeRGraph. We plan to include other graph ranking computations, e.g. PageRank, adapted for automatic term extraction. Moreover, future work includes using the web to extract more terms than those currently extracted.

One prospect could be the creation of a regular expression for the biomedical domain from the linguistic pattern list. We also plan to modify our measures to normalize the possible variants, aiming to identify a preferred term for each group of variants.