A word spotting framework for historical machine-printed documents

Kesidis, A. L.; Galiotou, E.; Gatos, B.; Pratikakis, I.

doi:10.1007/s10032-010-0134-4

Original Paper
Published: 17 November 2010

Volume 14, pages 131–144, (2011)
International Journal on Document Analysis and Recognition (IJDAR)

A. L. Kesidis¹,², E. Galiotou³, B. Gatos¹ & I. Pratikakis¹,⁴

Abstract

In this paper, we propose a word spotting framework for accessing the content of historical machine-printed documents without the use of an optical character recognition engine. A preprocessing step improves the quality of the document images, and word segmentation is accomplished with two complementary segmentation methodologies. Synthetic word images are created from keywords and compared to all the words in the digitized documents. A user feedback process refines the search procedure. The methodology has been evaluated on early Modern Greek documents printed during the seventeenth and eighteenth centuries. To improve the efficiency of access and search, natural language processing techniques are employed: a morphological generator that enables searching with only a base word-form to locate all the corresponding inflected word-forms, and a synonym dictionary that further facilitates access to the semantic context of the documents.
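
The search flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy inflection table and synonym dictionary, the function names `expand_query` and `rank_words`, the fixed-length feature vectors, and the L1 distance are all assumptions chosen for illustration, since the abstract does not specify the features or the matching measure. English placeholder words stand in for early Modern Greek word-forms.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy resources; a real system would use a morphological
# generator and a synonym dictionary for early Modern Greek.
INFLECTIONS: Dict[str, List[str]] = {"house": ["houses"], "home": ["homes"]}
SYNONYMS: Dict[str, List[str]] = {"house": ["home"]}


def expand_query(base_form: str,
                 inflections: Dict[str, List[str]],
                 synonyms: Dict[str, List[str]]) -> List[str]:
    """Expand a base word-form into its synonyms and all inflected forms.

    Order is stable and duplicates are dropped.
    """
    lemmas = [base_form] + synonyms.get(base_form, [])
    expanded: List[str] = []
    for lemma in lemmas:
        for form in [lemma] + inflections.get(lemma, []):
            if form not in expanded:
                expanded.append(form)
    return expanded


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute differences between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def rank_words(query_features: Sequence[float],
               word_features: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Rank segmented word images by distance to the synthetic query image.

    Both the query and the document words are assumed to be already
    reduced to feature vectors (placeholder for the real image comparison).
    """
    scored = [(word_id, l1_distance(query_features, feats))
              for word_id, feats in word_features.items()]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Each expanded form would be rendered as a synthetic word image.
    print(expand_query("house", INFLECTIONS, SYNONYMS))
    # Closest document words come first in the ranking.
    print(rank_words([0.0, 1.0], {"w1": [0.0, 1.0], "w2": [1.0, 1.0]}))
```

In such a setup, each expanded word-form would be rendered as a synthetic image and matched separately, and the user feedback step mentioned in the abstract could re-run `rank_words` with features taken from results the user marks as correct.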

Author information

Authors and Affiliations

Computational Intelligence Laboratory, Institute of Informatics and Telecommunications, National Center for Scientific Research “Demokritos”, 15310, Agia Paraskevi, Athens, Greece
A. L. Kesidis, B. Gatos & I. Pratikakis
Department of Surveying Engineering, Technological Educational Institution of Athens, Ag. Spyridona, 12210, Egaleo, Athens, Greece
A. L. Kesidis
Department of Informatics, Technological Educational Institution of Athens, Ag. Spyridona, 12210, Egaleo, Athens, Greece
E. Galiotou
Department of Electrical and Computer Engineering, Democritus University of Thrace, 67100, Xanthi, Greece
I. Pratikakis

Corresponding author

Correspondence to A. L. Kesidis.

About this article

Cite this article

Kesidis, A.L., Galiotou, E., Gatos, B. et al. A word spotting framework for historical machine-printed documents. IJDAR 14, 131–144 (2011). https://doi.org/10.1007/s10032-010-0134-4

Received: 10 January 2010
Revised: 28 July 2010
Accepted: 12 October 2010
Published: 17 November 2010
Issue Date: June 2011
DOI: https://doi.org/10.1007/s10032-010-0134-4
