Authors: Daniel van Strien 1; Kaspar Beelen 2; Mariona Coll Ardanuy 2; Kasra Hosseini 2; Barbara McGillivray 2,3 and Giovanni Colavizza 2,4

Affiliations: 1 The British Library, London, U.K.; 2 The Alan Turing Institute, London, U.K.; 3 University of Cambridge, Cambridge, U.K.; 4 University of Amsterdam, Amsterdam, The Netherlands

Keyword(s): Optical Character Recognition, OCR, Digital Humanities, Natural Language Processing, NLP, Information Retrieval.

Abstract: A growing volume of heritage data is being digitized and made available as text via optical character recognition (OCR). Scholars and libraries are increasingly using OCR-generated text for retrieval and analysis. However, creating text through OCR introduces varying degrees of error. The impact of these errors on natural language processing (NLP) tasks has only been partially studied. We perform a series of extrinsic assessment tasks (sentence segmentation, named entity recognition, dependency parsing, information retrieval, topic modelling and neural language model fine-tuning) using popular, out-of-the-box tools in order to quantify the impact of OCR quality on these tasks. We find that OCR errors consistently affect our downstream tasks, with some tasks more irredeemably harmed than others. Based on these results, we offer some preliminary guidelines for working with text produced through OCR.

CC BY-NC-ND 4.0

Paper citation in several formats:
van Strien, D.; Beelen, K.; Ardanuy, M.; Hosseini, K.; McGillivray, B. and Colavizza, G. (2020). Assessing the Impact of OCR Quality on Downstream NLP Tasks. In Proceedings of the 12th International Conference on Agents and Artificial Intelligence - Volume 1: ARTIDIGH; ISBN 978-989-758-395-7; ISSN 2184-433X, SciTePress, pages 484-496. DOI: 10.5220/0009169004840496

@conference{artidigh20,
author={Daniel {van Strien} and Kaspar Beelen and Mariona Coll Ardanuy and Kasra Hosseini and Barbara McGillivray and Giovanni Colavizza},
title={Assessing the Impact of OCR Quality on Downstream NLP Tasks},
booktitle={Proceedings of the 12th International Conference on Agents and Artificial Intelligence - Volume 1: ARTIDIGH},
year={2020},
pages={484--496},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0009169004840496},
isbn={978-989-758-395-7},
issn={2184-433X},
}

TY - CONF

JO - Proceedings of the 12th International Conference on Agents and Artificial Intelligence - Volume 1: ARTIDIGH
TI - Assessing the Impact of OCR Quality on Downstream NLP Tasks
SN - 978-989-758-395-7
IS - 2184-433X
AU - van Strien, D.
AU - Beelen, K.
AU - Ardanuy, M.
AU - Hosseini, K.
AU - McGillivray, B.
AU - Colavizza, G.
PY - 2020
SP - 484
EP - 496
DO - 10.5220/0009169004840496
PB - SciTePress