Overview of CLEF HIPE 2020: Named Entity Recognition and Linking on Historical Newspapers

Ehrmann, Maud; Romanello, Matteo; Flückiger, Alex; Clematide, Simon

doi:10.1007/978-3-030-58219-7_21

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12260)

Included in the following conference series:

International Conference of the Cross-Language Evaluation Forum for European Languages

Abstract

This paper presents an overview of the first edition of HIPE (Identifying Historical People, Places and other Entities), a pioneering shared task dedicated to the evaluation of named entity processing on historical newspapers in French, German and English. Since its introduction some twenty years ago, named entity (NE) processing has become an essential component of virtually any text mining application and has undergone major changes. Recently, two main trends characterise its developments: the adoption of deep learning architectures and the consideration of textual material originating from historical and cultural heritage collections. While the former opens up new opportunities, the latter introduces new challenges with heterogeneous, historical and noisy inputs. In this context, the objective of HIPE, run as part of the CLEF 2020 conference, is threefold: strengthening the robustness of existing approaches on non-standard inputs, enabling performance comparison of NE processing on historical texts, and, in the long run, fostering efficient semantic indexing of historical documents. Tasks, corpora, and results of 13 participating teams are presented.

Notes

1. https://impresso.github.io/CLEF-HIPE-2020/.
2. MUC, ACE, CoNLL, KBP, ESTER, HAREM, Quaero, GermEval, etc.
3. https://impresso-project.ch/.
4. For space reasons, the discussion of related work is included in the extended version of this overview [12].
5. From the Swiss National Library, the Luxembourgish National Library, and the Library of Congress (Chronicling America project), respectively. Original collections correspond to 4 Swiss and Luxembourgish titles, and a dozen for English. More details on original sources can be found in [12].
6. https://www.newseye.eu/.
7. The November 2019 dump used for annotation is available at https://files.ifi.uzh.ch/cl/impresso/clef-hipe.
8. https://universaldependencies.org/format.html.
9. https://creativecommons.org/licenses/by-nc/4.0/legalcode.
10. https://zenodo.org/deposit/3706857.
11. https://github.com/impresso/CLEF-HIPE-2020/tree/master/data.
12. https://github.com/flairNLP/flair.
13. https://creativecommons.org/licenses/by-sa/4.0/legalcode.
14. https://files.ifi.uzh.ch/cl/siclemat/impresso/clef-hipe-2020/flair/.
15. https://github.com/flairNLP/flair.
16. True positive, False positive, False negative.
17. https://github.com/impresso/CLEF-HIPE-2020-scorer.
18. https://github.com/jcklie/wikimapper.
19. https://huggingface.co/dbmdz.
20. https://github.com/eldams/mXS.
21. https://github.com/kermitt2/delft.
22. https://github.com/kermitt2/entity-fishing.
23. [38] for French, [32] for German.
24. SEM [10] is a CRF-based tool using Wapiti [28] as its linear CRF implementation.
25. https://www.elastic.co/.
26. https://impresso.github.io/CLEF-HIPE-2020/.

References

Akbik, A., Bergmann, T., Blythe, D., Rasul, K., Schweter, S., Vollgraf, R.: FLAIR: an easy-to-use framework for state-of-the-art NLP. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (demonstrations), pp. 54–59. Association for Computational Linguistics, Minneapolis, Minnesota, June 2019. https://www.aclweb.org/anthology/N19-4010
Akbik, A., Blythe, D., Vollgraf, R.: Contextual string embeddings for sequence labeling. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 1638–1649. Association for Computational Linguistics, Santa Fe, New Mexico, USA, August 2018. http://www.aclweb.org/anthology/C18-1139
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017). https://www.aclweb.org/anthology/Q17-1010
Bollmann, M.: A large-scale comparison of historical text normalization systems. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3885–3898. Association for Computational Linguistics, Minneapolis, Minnesota (2019). https://doi.org/10.18653/v1/N19-1389
Borin, L., Kokkinakis, D., Olsson, L.J.: Naming the past: named entity and animacy recognition in 19th century Swedish literature. In: Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007), pp. 1–8 (2007)
Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.): CLEF 2020 Working Notes. In: CEUR Workshop Proceedings Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum (2020)
Chiron, G., Doucet, A., Coustaty, M., Visani, M., Moreux, J.P.: Impact of OCR errors on the use of digital libraries: towards a better access to information. In: Proceedings of the 17th ACM/IEEE Joint Conference on Digital Libraries JCDL 2017, pp. 249–252. IEEE Press, Piscataway (2017), http://dl.acm.org/citation.cfm?id=3200334.3200364
Chiu, J.P., Nichols, E.: Named entity recognition with bidirectional LSTM-CNNs. Trans. Assoc. Comput. Linguist. 4, 357–370 (2016). https://doi.org/10.1162/tacl_a_00104
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018). http://arxiv.org/abs/1810.04805
Dupont, Y., Dinarelli, M., Tellier, I., Lautier, C.: Structured named entity recognition by cascading CRFs. In: Intelligent Text Processing and Computational Linguistics (CICling) (2017)
Ehrmann, M., Colavizza, G., Rochat, Y., Kaplan, F.: Diachronic evaluation of NER systems on old newspapers. In: Proceedings of the 13th Conference on Natural Language Processing KONVENS 2016, pp. 97–107. Bochumer Linguistische Arbeitsberichte (2016). https://infoscience.epfl.ch/record/221391?ln=en
Ehrmann, M., Romanello, M., Flückiger, A., Clematide, S.: Extended overview of CLEF HIPE 2020: named entity processing on historical newspapers. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum CEUR-WS (2020)
Ehrmann, M., Romanello, M., Flückiger, A., Clematide, S.: HIPE - shared task participation guidelines (v1.1) (2020). https://doi.org/10.5281/zenodo.3677171
Ehrmann, M., Romanello, M., Flückiger, A., Clematide, S.: Impresso named entity annotation guidelines (2020). https://doi.org/10.5281/zenodo.3604227
El Vaigh, C.B., Goasdoué, F., Gravier, G., Sébillot, P.: Using knowledge base semantics in context-aware entity linking. In: 2019 Proceedings of the ACM Symposium on Document Engineering DocEng 2019, pp. 1–10. Association for Computing Machinery, Berlin, Germany, September 2019. https://doi.org/10.1145/3342558.3345393
Galibert, O., Rosset, S., Grouin, C., Zweigenbaum, P., Quintard, L.: Extended named entity annotation on OCRed documents: from corpus constitution to evaluation campaign. In: Proceedings of the Eighth Conference on International Language Resources and Evaluation, pp. 3126–3131. Istanbul, Turkey (2012)
Ganea, O.E., Hofmann, T.: Deep joint entity disambiguation with local neural attention. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2619–2629 (2017)
Grishman, R., Sundheim, B.: Design of the MUC-6 evaluation. In: Proceedings of the Sixth Conference on Message Understanding Conference (MUC-6), Columbia, Maryland (1995)
Hoffart, J., et al.: Robust disambiguation of named entities in text. In: EMNLP (2011)
Hooland, S.V., Wilde, M.D., Verborgh, R., Steiner, T., Van de Walle, R.: Exploring entity recognition and disambiguation for cultural heritage collections. Digit. Sch. Humanit. 30(2), 262–279 (2015). https://doi.org/10.1093/llc/fqt067
van Hulst, J.M., Hasibi, F., Dercksen, K., Balog, K., de Vries, A.P.: REL: an entity linker standing on the shoulders of giants. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval SIGIR 2020. ACM (2020)
Kaplan, F., di Lenardo, I.: Big data of the past. Front. Digit. Humanit. 4 (2017). https://doi.org/10.3389/fdigh.2017.00012
Klie, J.C., Bugert, M., Boullosa, B., de Castilho, R.E., Gurevych, I.: The inception platform: machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp. 5–9 (2018)
Kolitsas, N., Ganea, O.E., Hofmann, T.: End-to-end neural entity linking. In: Proceedings of the 22nd Conference on Computational Natural Language Learning, pp. 519–529. Association for Computational Linguistics, Brussels, Belgium, October 2018. https://doi.org/10.18653/v1/K18-1050
Krippendorff, K.: Content Analysis: An Introduction to its Methodology. Sage Publications, Thousand Oaks (1980)
Labusch, K., Neudecker, C., Zellhöfer, D.: BERT for named entity recognition in contemporary and historic German. In: Preliminary proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019): Long Papers, pp. 1–9. German Society for Computational Linguistics & Language Technology, Erlangen, Germany (2019)
Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition, March 2016. arXiv:1603.01360. http://arxiv.org/abs/1603.01360
Lavergne, T., Cappé, O., Yvon, F.: Practical very large scale CRFs. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 504–513. Association for Computational Linguistics (2010)
Linhares Pontes, E., Hamdi, A., Sidere, N., Doucet, A.: Impact of OCR quality on named entity linking. In: Jatowt, A., Maeda, A., Syn, S.Y. (eds.) ICADL 2019. LNCS, vol. 11853, pp. 102–115. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-34058-2_11
Makhoul, J., Kubala, F., Schwartz, R., Weischedel, R.: Performance measures for information extraction. In: Proceedings of DARPA Broadcast News Workshop, pp. 249–252 (1999)
Martin, L., et al.: CamemBERT: a tasty French language model (2019)
May, P.: German ELMo model (2019). https://github.com/t-systems-on-site-services-gmbh/german-elmo-model
Nadeau, D., Sekine, S.: A survey of named entity recognition and classification. Lingvisticae Investigationes 30(1), 3–26 (2007)
Neudecker, C., Antonacopoulos, A.: Making Europe’s historical newspapers searchable. In: 2016 12th IAPR Workshop on Document Analysis Systems (DAS), pp. 405–410. IEEE, Santorini, Greece, April 2016. https://doi.org/10.1109/DAS.2016.83
Nguyen, D.B., Hoffart, J., Theobald, M., Weikum, G.: AIDA-light: high-throughput named-entity disambiguation. In: LDOW (2014)
Nouvel, D., Antoine, J.-Y., Friburger, N.: Pattern mining for named entity recognition. In: Vetulani, Z., Mariani, J. (eds.) LTC 2011. LNCS (LNAI), vol. 8387, pp. 226–237. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-08958-4_19
Okazaki, N.: CRFsuite: a fast implementation of Conditional Random Fields (CRFs) (2007). http://www.chokkan.org/software/crfsuite/
Ortiz Suárez, P.J., Dupont, Y., Muller, B., Romary, L., Sagot, B.: Establishing a new state-of-the-art for French named entity recognition. In: Proceedings of the 12th Language Resources and Evaluation Conference, pp. 4631–4638. European Language Resources Association, Marseille, France, May 2020. https://www.aclweb.org/anthology/2020.lrec-1.569
Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. EMNLP 14, 1532–1543 (2014)
Peters, M., et al.: Deep Contextualized Word Representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). pp. 2227–2237. Association for Computational Linguistics, New Orleans, Louisiana (2018). https://doi.org/10.18653/v1/N18-1202
Piotrowski, M.: Natural language processing for historical texts. Synth. Lect. Hum. Lang. Technol. 5(2), 1–157 (2012)
Plank, B.: What to do about non-standard (or non-canonical) language in NLP. In: Proceedings of the 13th Conference on Natural Language Processing KONVENS 2016. Bochumer Linguistische Arbeitsberichte (2016)
Rao, D., McNamee, P., Dredze, M.: Entity linking: finding extracted entities in a knowledge base. In: Poibeau, T., Saggion, H., Piskorski, J., Yangarber, R. (eds.) Multi-source, Multilingual Information Extraction and Summarization, pp. 93–115. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-28569-1_5
Rosset, S., Grouin, C., Fort, K., Galibert, O., Kahn, J., Zweigenbaum, P.: Structured named entities in two distinct press corpora: contemporary broadcast news and old newspapers. In: Proceedings of the 6th Linguistic Annotation Workshop, pp. 40–48. Association for Computational Linguistics (2012)
Rosset, S., Grouin, C., Zweigenbaum, P.: Entités nommées structurées : guide d’annotation Quaero. NOTES et DOCUMENTS 2011–04, LIMSI-CNRS (2011)
Smith, D.A., Cordell, R.: A research agenda for historical and multilingual optical character recognition. Technical report (2018). http://hdl.handle.net/2047/D20297452
Sporleder, C.: Natural language processing for cultural heritage domains. Lang. Linguist. Compass 4(9), 750–768 (2010). https://doi.org/10.1111/j.1749-818X.2010.00230.x
van Strien, D., Beelen, K., Ardanuy, M.C., Hosseini, K., McGillivray, B., Colavizza, G.: Assessing the impact of OCR quality on downstream NLP tasks. In: ICAART 2020 - Proceedings of the 12th International Conference on Agents and Artificial Intelligence. SCITEPRESS - Science and Technology Publications, January 2020. https://doi.org/10.17863/CAM.52068
van Strien, D., Beelen, K., Ardanuy, M., Hosseini, K., McGillivray, B., Colavizza, G.: Assessing the impact of OCR quality on downstream NLP tasks. In: Proceedings of the 12th International Conference on Agents and Artificial Intelligence, pp. 484–496. SCITEPRESS - Science and Technology Publications, Valletta, Malta (2020). https://doi.org/10.5220/0009169004840496
Terras, M.: The rise of digitization. In: Rikowski, R. (ed.) Digitisation Perspectives, pp. 3–20. Sense Publishers, Rotterdam (2011). https://doi.org/10.1007/978-94-6091-299-3_1
Vaswani, A., et al.: Attention is all you need. CoRR abs/1706.03762 (2017). http://arxiv.org/abs/1706.03762
Vilain, M., Su, J., Lubar, S.: Entity extraction is a boring solved problem: or is it? In: Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, NAACL-Short 2007, Rochester, New York, pp. 181–184. Association for Computational Linguistics (2007). http://dl.acm.org/citation.cfm?id=1614108.1614154

Acknowledgements

This HIPE evaluation lab would not have been possible without the interest and commitment of many. We express our warmest thanks to: the Swiss newspapers NZZ and Le Temps, and the Swiss and Luxembourg national libraries for sharing part of their data in the frame of the impresso project; Camille Watter and Gerold Schneider for their commitment and hard work on the construction of the data set; the INCEpTION project team for its valuable and efficient support with the annotation tool; Richard Eckart de Castilho, Clemens Neudecker, Sophie Rosset and David Smith for their encouragement and guidance as part of the HIPE advisory board; and, finally, the 13 teams who embarked on this first HIPE edition, for their patience and scientific involvement. HIPE is part of the research activities of the project “impresso – Media Monitoring of the Past”, for which we also gratefully acknowledge the financial support of the Swiss National Science Foundation under grant number CR-SII5_173719.

Author information

Authors and Affiliations

Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
Maud Ehrmann & Matteo Romanello
University of Zurich, Zurich, Switzerland
Alex Flückiger & Simon Clematide

Corresponding author

Correspondence to Maud Ehrmann.

Editor information

Editors and Affiliations

Department of Electrical and Computer Engineering, Democritus University of Thrace, Xanthi, Greece
Avi Arampatzis
University of Amsterdam, Amsterdam, The Netherlands
Evangelos Kanoulas
Information Technologies Institute, Centre for Research and Technology Hellas, Thessaloniki, Greece
Theodora Tsikrika
Information Technologies Institute, Centre for Research and Technology Hellas, Thessaloniki, Greece
Stefanos Vrochidis
Faculty of Library, Information and Media Science, University of Tsukuba, Ibaraki, Japan
Hideo Joho
Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
Christina Lioma
Brown University, Providence, RI, USA
Carsten Eickhoff
LIMSI-CNRS, Orsay, France
Aurélie Névéol
Department of Information Engineering, University of Padova, Padua, Italy
Linda Cappellato
Department of Information Engineering, University of Padova, Padua, Italy
Nicola Ferro

About this paper

Cite this paper

Ehrmann, M., Romanello, M., Flückiger, A., Clematide, S. (2020). Overview of CLEF HIPE 2020: Named Entity Recognition and Linking on Historical Newspapers. In: Arampatzis, A., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2020. Lecture Notes in Computer Science, vol 12260. Springer, Cham. https://doi.org/10.1007/978-3-030-58219-7_21

DOI: https://doi.org/10.1007/978-3-030-58219-7_21
Published: 15 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58218-0
Online ISBN: 978-3-030-58219-7
eBook Packages: Computer Science, Computer Science (R0)