ChEMU: Named Entity Recognition and Event Extraction of Chemical Reactions from Patents

Nguyen, Dat Quoc; Zhai, Zenan; Yoshikawa, Hiyori; Fang, Biaoyan; Druckenbrodt, Christian; Thorne, Camilo; Hoessel, Ralph; Akhondi, Saber A.; Cohn, Trevor; Baldwin, Timothy; Verspoor, Karin

doi:10.1007/978-3-030-45442-5_74

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12036)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

We introduce a new evaluation lab named ChEMU (Cheminformatics Elsevier Melbourne University), part of the 11th Conference and Labs of the Evaluation Forum (CLEF-2020). ChEMU involves two key information extraction tasks over chemical reactions from patents. Task 1—Named entity recognition—involves identifying chemical compounds as well as their types in context, i.e., to assign the label of a chemical compound according to the role which the compound plays within a chemical reaction. Task 2—Event extraction over chemical reactions—involves event trigger detection and argument recognition. We briefly present the motivations and goals of the ChEMU tasks, as well as resources and evaluation methodology.

1 Introduction

The chemical industry depends on the discovery of new chemical compounds. However, new chemical compounds are often first disclosed in patent documents, and only a small fraction of them are later published in journals, usually 1–3 years after the patent [13]. Therefore, most chemical compounds are only available through patent documents [3]. In addition, chemical patent documents contain unique information, such as reactions, experimental conditions and modes of action, which is essential for understanding compound prior art, providing a means for novelty checking and validation as well as pointers for chemical research in both academia and industry [1, 2]. As the number of new chemical patent applications has been increasing drastically [11], it is becoming crucial to develop natural language processing (NLP) approaches that enable automatic extraction of key information from chemical patents [2].

In this paper, we propose a new evaluation lab (called ChEMU) focusing on information extraction over chemical reactions from patents. In particular, we focus on two key information extraction tasks: chemical named entity recognition (NER) and chemical reaction event extraction. Previous related shared tasks on chemicals or drugs, such as CHEMDNER [7], have also included chemical named entity recognition as a task, but they have primarily focused on PubMed abstracts. The CHEMDNER patents task [8] was limited to entity mentions and chemical entity passage detection, and only considered titles and abstracts of patents. For our ChEMU lab, we extend the existing corpora in two directions: first, we go beyond chemical NER to require labeling of the role of a chemical with respect to a reaction; second, we consider complete chemical reactions in addition to entities. The ChEMU website is available at: http://chemu.eng.unimelb.edu.au.

2 Goals and Importance

What are the Goals of This Evaluation Lab? Our goals are: (1) To develop tasks that impact chemical research in both academia and industry, (2) To provide the community with a new dataset of chemical entities, enriched with relational links between chemical event triggers and arguments, and (3) To advance the state-of-the-art in information extraction over chemical patents.

Why is This Lab Needed? For evaluating information extraction developments in the scientific literature domain, there have been a large number of labs/shared tasks offered within previous i2b2/n2c2, SemEval, BioNLP, BioCreative, TREC and CLEF workshops. However, less attention has been paid to the chemical patent domain. In particular, there has previously been only one shared task on this domain, which is the CHEMDNER patents task at the BioCreative V workshop, involving detection of mentions of chemical compounds and genes/proteins in patent text [8].

Information extraction approaches developed for the scientific literature domain may not apply directly to the chemical patent domain. This is because patents, as legal documents, are written very differently from scientific literature. When writing scientific papers, authors strive to make their words as clear and straightforward as possible, whereas patent authors often seek to protect their knowledge from being fully disclosed [15]. In tension with this is the need to claim broad scope for intellectual property reasons, and hence patents typically contain more details and are more exhaustive than scientific papers [9].

There are also a number of characteristics of patent texts that create challenges for NLP in this context. Long sentences listing names of compounds are frequently used in chemical patents. The structure of sentences in patent claims is usually complex, and syntactic parsing in patents can be difficult [4]. A quantitative analysis from [16] showed that the average sentence length in a patent corpus is much longer than in general language use. That work also showed that the lexicon used in patents usually includes domain-specific and novel terms that are difficult to understand.

How Will the Community Benefit from the Lab? The ChEMU lab will provide a new and challenging set of tasks in an area of significant pharmacological importance. The lab will focus attention on more complex analysis of chemical patents, and will provide strong baselines as well as a useful resource for future research.

Table 1. Brief definitions of ChEMU chemical entity types, organised into chemical entity types, a reaction label introduced in the text, and reaction properties.

What are the Usage Scenarios? Automatically identifying compounds which serve as the starting material or are a product of a chemical reaction would allow more targeted extraction of chemical information from patents and can improve the usefulness of patent resources. Automatic extraction of chemical reaction events supports the construction of cheminformatics databases that capture key information about chemicals, and how they are produced, from patent resources.

Table 2. An example of a chemical reaction snippet and BRAT annotations in a standoff format [14] w.r.t. Task 1.

3 Tasks

The ChEMU lab at CLEF-2020 (see Note 1) offers two information extraction tasks over chemical reactions from patent documents: named entity recognition (Task 1) and event extraction (Task 2). Teams may participate in one or both tasks.

3.1 Task 1: Named Entity Recognition

In general, a chemical reaction is a process leading to the transformation of one set of chemical substances to another [10]. Task 1 involves identifying chemical compounds and their specific types, i.e., assigning each chemical compound a label according to the role it plays within a chemical reaction. In addition to chemical compounds, this task also requires identifying the temperatures and reaction times at which the chemical reaction is carried out, as well as the yields obtained for the final chemical product and the label of the reaction.

This task involves both entity boundary prediction and entity label classification. We define 10 different entity type labels as shown in Table 1. See examples of those entity types in Table 2.
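As an illustration of what entity boundary prediction and label classification produce, the following minimal sketch reads text-bound annotations in the BRAT standoff format [14], where each entity line has the shape `ID<TAB>LABEL START END<TAB>TEXT`. The sentence, the offsets and the label spellings `TEMPERATURE` and `TIME` are made up for this example and are not taken from the ChEMU corpus or from Table 1; the `Entity` class and `parse_brat_entities` function are hypothetical helpers.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entity:
    ann_id: str  # BRAT identifier, e.g. "T1"
    label: str   # entity type label
    start: int   # character offset where the span begins
    end: int     # character offset just past the end of the span
    text: str    # surface string covered by the span


def parse_brat_entities(ann_text: str) -> List[Entity]:
    """Collect text-bound ("T") annotations; other line types are skipped."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue
        ann_id, type_and_span, surface = line.split("\t", 2)
        label, span = type_and_span.split(" ", 1)
        # BRAT separates the fragments of a discontinuous span with ";".
        # This sketch keeps only the outermost boundaries.
        fragments = [frag.split() for frag in span.split(";")]
        start, end = int(fragments[0][0]), int(fragments[-1][1])
        entities.append(Entity(ann_id, label, start, end, surface))
    return entities


# Made-up snippet and annotations (label names are assumptions).
SAMPLE_TEXT = "The mixture was stirred at 80 °C for 2 h"
SAMPLE_ANN = (
    "T1\tTEMPERATURE 27 32\t80 °C\n"
    "T2\tTIME 37 40\t2 h\n"
)

for entity in parse_brat_entities(SAMPLE_ANN):
    # The offsets must reproduce the annotated surface string.
    assert SAMPLE_TEXT[entity.start:entity.end] == entity.text
    print(entity.ann_id, entity.label, entity.start, entity.end, entity.text)
```

A system for Task 1 would have to predict both the offsets (boundaries) and the label of every such line from the raw text alone.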

3.2 Task 2: Event Extraction

As illustrated in Figs. 1 and 2, a chemical reaction leading to an end product often consists of a sequence of individual event steps. Task 2 is to identify those steps which involve chemical entities recognized in Task 1. Unlike a conventional event extraction problem [6], which involves event trigger word detection, event typing and argument prediction, Task 2 requires identifying event trigger words (e.g. “added” and “stirred”), which all have the same type “EVENT_TRIGGER”, and then determining the chemical entity arguments of these events (see Note 2).

When predicting event arguments, we adapt semantic argument role labels Arg1 and ArgM from the Proposition Bank [12] to label the relations between the trigger words and the chemical entities: Arg1 is used to label the relation between an event trigger word and a chemical compound. Here, Arg1 represents argument roles of being causally affected by another participant in the event [5]. ArgM represents adjunct roles with respect to an event, used to label the relation between a trigger word and a temperature, time or yield entity.
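The role assignment described above can be sketched as a simple lookup from entity type to argument role. The label sets below are placeholders chosen for illustration and are not the label inventory of Table 1; `argument_role` and `label_arguments` are hypothetical helpers. The sketch also assumes that the entities attached to a trigger are already known, whereas deciding which entities belong to which trigger is the actual difficulty of Task 2.

```python
from typing import List, Optional, Tuple

# Placeholder label names (assumptions, not the official Table 1 labels).
COMPOUND_LABELS = {"STARTING_MATERIAL", "REACTION_PRODUCT", "SOLVENT"}
ADJUNCT_LABELS = {"TEMPERATURE", "TIME", "YIELD"}


def argument_role(entity_label: str) -> Optional[str]:
    """Map an entity type to the role it takes with respect to a trigger."""
    if entity_label in COMPOUND_LABELS:
        return "Arg1"  # chemical compound affected by the event
    if entity_label in ADJUNCT_LABELS:
        return "ArgM"  # adjunct: temperature, time or yield
    return None        # no role defined for this entity type in this sketch


def label_arguments(
    trigger: str, attached: List[Tuple[str, str]]
) -> List[Tuple[str, str, str]]:
    """Return (trigger, role, entity text) triples for the attached entities."""
    relations = []
    for entity_text, entity_label in attached:
        role = argument_role(entity_label)
        if role is not None:
            relations.append((trigger, role, entity_text))
    return relations


relations = label_arguments(
    "stirred",
    [("ethyl acetate", "SOLVENT"), ("80 °C", "TEMPERATURE"), ("2 h", "TIME")],
)
for relation in relations:
    print(relation)
```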

An end-to-end process incorporating both Task 1 and Task 2 can be equivalently viewed as a relation extraction task which identifies 11 entity types including 10 types defined in Table 1 plus “EVENT_TRIGGER”, and extracts relations between the “EVENT_TRIGGER” entities and the remaining entities.

4 Data and Evaluation

Data: For system development and evaluation, a new corpus of 1500 chemical reaction snippets will be provided for both tasks (an example of a chemical reaction snippet is shown in Table 2). These snippets are sampled from 170 English patent documents from the European Patent Office and the United States Patent and Trademark Office. We will mark up every chemical compound or event trigger with both text spans and IDs, and highlight relations and event arguments, as illustrated in Figs. 1 and 2. We have begun preparing the corpus and will make strong baselines available for the tasks. Initial publications related to the data and Task 1 appear at the 2019 ALTA and BioNLP workshops, respectively [18, 19].

The corpus will be split into 70%/10%/20% training/development/test. Gold annotations for the training and development sets will be provided to task participants in the BRAT standoff format [14] during the development phase. The raw test set will be provided for the final test phase.
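For a corpus of 1500 snippets, a 70%/10%/20% split corresponds to 1050, 150 and 300 snippets. The sketch below shows one way to produce such a split; the seeded random shuffle and the function name `split_snippets` are assumptions made for illustration, since the text does not state how snippets are assigned to the three sets.

```python
import random
from typing import List, Sequence, Tuple


def split_snippets(
    snippet_ids: Sequence[int], seed: int = 0
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle snippet ids reproducibly and cut them 70/10/20."""
    ids = list(snippet_ids)
    random.Random(seed).shuffle(ids)
    # Integer arithmetic avoids floating-point rounding in the set sizes.
    n_train = len(ids) * 70 // 100
    n_dev = len(ids) * 10 // 100
    train = ids[:n_train]
    dev = ids[n_train:n_train + n_dev]
    test = ids[n_train + n_dev:]  # the remainder, 20% here
    return train, dev, test


train, dev, test = split_snippets(range(1500))
print(len(train), len(dev), len(test))
```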

To support teams who are interested in Task 2 only, a pre-trained chemical NER tagger is provided as a resource [19].

Evaluation: Precision, recall and F1 scores will be used for evaluation, under both strict and relaxed span matching conditions. F1 will be the main metric for ranking the participating teams [17] (see Note 3).
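To make the two matching conditions concrete, the following sketch computes precision, recall and F1 over labeled spans. It assumes that a strict match requires identical label and identical offsets, and that a relaxed match requires identical label and any character overlap; the authoritative matching rules are those of the evaluation tool referenced in Note 3, and this code is not that tool. The function names and the example spans are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[str, int, int]  # (label, start offset, end offset)


def spans_match(gold: Span, pred: Span, relaxed: bool) -> bool:
    """Strict: same label and same offsets. Relaxed: same label and overlap."""
    if gold[0] != pred[0]:
        return False
    if relaxed:
        return gold[1] < pred[2] and pred[1] < gold[2]
    return gold[1:] == pred[1:]


def precision_recall_f1(
    gold: List[Span], pred: List[Span], relaxed: bool = False
) -> Tuple[float, float, float]:
    """Greedy one-to-one matching of predicted spans against gold spans."""
    unmatched = list(gold)
    true_positives = 0
    for p in pred:
        for g in unmatched:
            if spans_match(g, p, relaxed):
                unmatched.remove(g)  # each gold span can be matched only once
                true_positives += 1
                break
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example: the second prediction has the right label but covers
# only part of the gold span.
gold_spans = [("TIME", 37, 40), ("TEMPERATURE", 27, 32)]
pred_spans = [("TIME", 37, 40), ("TEMPERATURE", 30, 32)]

print(precision_recall_f1(gold_spans, pred_spans, relaxed=False))
print(precision_recall_f1(gold_spans, pred_spans, relaxed=True))
```

Under these assumptions the partially overlapping temperature span counts as an error in the strict setting and as a hit in the relaxed setting, which is why the two conditions are reported side by side.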

5 Conclusion

In this paper, we have presented a brief description of the upcoming ChEMU lab at CLEF-2020. ChEMU will focus on two new tasks of named entity recognition and event extraction over chemical reactions from patents. We expect participants from both academia and industry. We will advertise our ChEMU lab via social media as well as NLP-related mailing lists.

Notes

1. https://clef2020.clef-initiative.eu.
2. Note that those individual event steps are sequentially ordered, thus we do not consider cases where an event is an argument of another event, i.e. we do not label the relationship between two event triggers.
3. https://bitbucket.org/nicta_biomed/brateval/src/master/.

References

1. Akhondi, S.A., et al.: Annotated chemical patent corpus: a gold standard for text mining. PLoS ONE 9, 1–8 (2014)
2. Akhondi, S.A., et al.: Automatic identification of relevant chemical compounds from patents. Database 2019, baz001 (2019)
3. Bregonje, M.: Patents: a unique source for scientific technical information in chemistry related industry? World Pat. Inf. 27(4), 309–315 (2005)
4. Hu, M., Cinciruk, D., Walsh, J.M.: Improving automated patent claim parsing: dataset, system, and experiments. CoRR abs/1605.01744 (2016)
5. Jurafsky, D., Martin, J.H.: Semantic Role Labeling and Argument Structure. In: Speech and Language Processing, 3rd edn. (2019)
6. Kim, J.D., Ohta, T., Pyysalo, S., Kano, Y., Tsujii, J.: Overview of BioNLP’09 shared task on event extraction. In: Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task, pp. 1–9 (2009)
7. Krallinger, M., Leitner, F., Rabal, O., Vazquez, M., Oyarzabal, J., Valencia, A.: CHEMDNER: the drugs and chemical names extraction challenge. J. Cheminform. 7(1), S1 (2015)
8. Krallinger, M., et al.: Overview of the CHEMDNER patents task. In: Proceedings of the Fifth BioCreative Challenge Evaluation Workshop, pp. 63–75 (2015)
9. Lupu, M., Mayer, K., Tait, J., Trippe, A.J.: Current Challenges in Patent Information Retrieval, 1st edn. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-19231-9
10. Muller, P.: Glossary of terms used in physical organic chemistry (IUPAC Recommendations 1994). Pure Appl. Chem. 66(5), 1077–1184 (2009)
11. Muresan, S., et al.: Making every SAR point count: the development of chemistry connect for the large-scale integration of structure and bioactivity data. Drug Discovery Today 16(23), 1019–1030 (2011)
12. Palmer, M., Gildea, D., Kingsbury, P.: The proposition bank: an annotated corpus of semantic roles. Comput. Linguist. 31(1), 71–106 (2005)
13. Senger, S., Bartek, L., Papadatos, G., Gaulton, A.: Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents. J. Cheminform. 7, 49:1–49:12 (2015)
14. Stenetorp, P., Pyysalo, S., Topić, G., Ohta, T., Ananiadou, S., Tsujii, J.: brat: a web-based tool for NLP-assisted text annotation. In: Proceedings of the Demonstrations Session at EACL 2012 (2012)
15. Valentinuzzi, M.E.: Patents and scientific papers: quite different concepts: the reward is found in giving, not in keeping [Retrospectroscope]. IEEE Pulse 8(1), 49–53 (2017)
16. Verberne, S., D’hondt, E., Oostdijk, N., Koster, C.: Quantifying the challenges in parsing patent claims. In: Proceedings of the 1st International Workshop on Advances in Patent Information Retrieval at ECIR 2010, pp. 14–21 (2010)
17. Verspoor, K., et al.: Annotating the biomedical literature for the human variome. Database 2013, bat019 (2013)
18. Yoshikawa, H., et al.: Detecting chemical reactions in patents. In: Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, pp. 100–110 (2019)
19. Zhai, Z., et al.: Improving chemical named entity recognition in patents with contextualized word embeddings. In: Proceedings of the 18th BioNLP Workshop, pp. 328–338 (2019)

Acknowledgments

This work is supported by an Australian Research Council Linkage Project, LP160101469, and Elsevier. We would like to thank Estrid He, Zubair Afzal and Mark Sheehan for supporting this work, as well as the anonymous reviewers for their feedback.

Author information

Authors and Affiliations

The University of Melbourne, Melbourne, Australia
Dat Quoc Nguyen, Zenan Zhai, Hiyori Yoshikawa, Biaoyan Fang, Trevor Cohn, Timothy Baldwin & Karin Verspoor
Elsevier, Amsterdam, The Netherlands
Christian Druckenbrodt, Camilo Thorne, Ralph Hoessel & Saber A. Akhondi
VinAI Research, Hanoi, Vietnam
Dat Quoc Nguyen
Fujitsu Laboratories Ltd., Kanagawa, Japan
Hiyori Yoshikawa

Corresponding author

Correspondence to Karin Verspoor.

Editor information

Editors and Affiliations

University of Glasgow, Glasgow, UK
Joemon M. Jose
University College London, London, UK
Emine Yilmaz
Universidade NOVA de Lisboa, Lisbon, Portugal
João Magalhães
Universidad Autónoma de Madrid, Madrid, Spain
Pablo Castells
University of Padua, Padua, Italy
Nicola Ferro
Universidade de Lisboa, Lisbon, Portugal
Mário J. Silva
Universidade NOVA de Lisboa, Lisbon, Portugal
Flávio Martins

About this paper

Cite this paper

Nguyen, D.Q. et al. (2020). ChEMU: Named Entity Recognition and Event Extraction of Chemical Reactions from Patents. In: Jose, J., et al. Advances in Information Retrieval. ECIR 2020. Lecture Notes in Computer Science, vol 12036. Springer, Cham. https://doi.org/10.1007/978-3-030-45442-5_74

DOI: https://doi.org/10.1007/978-3-030-45442-5_74
Published: 08 April 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-45441-8
Online ISBN: 978-3-030-45442-5
eBook Packages: Computer Science, Computer Science (R0)