Increasing Recall for Text Re-use in Historical Documents to Support Research in the Humanities

Büchler, Marco; Crane, Gregory; Moritz, Maria; Babeu, Alison

doi:10.1007/978-3-642-33290-6_11

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7489)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

High-precision text re-use detection allows humanists to discover where and how particular authors are quoted (e.g., the different sections of Plato’s work that come in and out of vogue). This paper reports on ongoing work to provide the high-recall text re-use detection that humanists often demand. Using as our testbed an edition of one Greek work in which quotations and paraphrases from the Homeric epics are marked, we achieved a recall of at least 94% while maintaining a precision of 73%. This study is part of a larger effort to detect text re-use across 15 million words of Greek and 10 million words of Latin that are available, or under development, as openly licensed TEI XML.
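As a rough illustration of the two measures reported above, the following minimal sketch computes precision and recall for a set of detected re-use links against a gold standard of marked quotations. It is not the authors' evaluation code: the function name `precision_recall`, the pair representation, and the toy data are all assumptions made for this example.

```python
# Hypothetical evaluation sketch; names and data are illustrative
# assumptions, not taken from the paper.

def precision_recall(gold, detected):
    """Return (precision, recall) of detected re-use pairs against a gold standard."""
    gold, detected = set(gold), set(detected)
    true_positives = len(gold & detected)
    # Precision: share of detected pairs that are marked in the gold standard.
    precision = true_positives / len(detected) if detected else 0.0
    # Recall: share of gold-standard pairs that were detected.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy gold standard: (Homeric passage, segment of the quoting work) pairs.
gold_pairs = {
    ("Il. 1.1", "s1"), ("Il. 2.3", "s4"),
    ("Od. 1.1", "s7"), ("Od. 9.5", "s9"),
}
# Toy detector output: three correct links and two spurious ones.
detected_pairs = {
    ("Il. 1.1", "s1"), ("Il. 2.3", "s4"), ("Od. 1.1", "s7"),
    ("Il. 5.2", "s8"), ("Od. 3.3", "s2"),
}

precision, recall = precision_recall(gold_pairs, detected_pairs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy setting three of the five detected links are correct and three of the four marked quotations are found, so relaxing the matching criteria to find the fourth would raise recall, typically at the cost of more spurious links and lower precision.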

Author information

Authors and Affiliations

Institute for Computer Science, Leipzig University, Germany
Marco Büchler & Maria Moritz
Department of Classics, Tufts University, Boston, USA
Gregory Crane & Alison Babeu

Editor information

Editors and Affiliations

Department of Multimedia and Graphic Arts, Cyprus University of Technology, 3036, Limassol, Cyprus
Panayiotis Zaphiris & Fernando Loizides
School of Informatics, City University of London, Northampton Square, EC1V 0HB, London, UK
George Buchanan
School of Library, Archival and Information Studies, Irving K. Barber Learning Centre, The University of British Columbia, V6T 1Z3, Vancouver, BC, Canada
Edie Rasmussen

About this paper

Cite this paper

Büchler, M., Crane, G., Moritz, M., Babeu, A. (2012). Increasing Recall for Text Re-use in Historical Documents to Support Research in the Humanities. In: Zaphiris, P., Buchanan, G., Rasmussen, E., Loizides, F. (eds) Theory and Practice of Digital Libraries. TPDL 2012. Lecture Notes in Computer Science, vol 7489. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33290-6_11

DOI: https://doi.org/10.1007/978-3-642-33290-6_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33289-0
Online ISBN: 978-3-642-33290-6
eBook Packages: Computer Science, Computer Science (R0)
