A word spotting framework for historical machine-printed documents

  • Original Paper
International Journal on Document Analysis and Recognition (IJDAR)

Abstract

In this paper, we propose a word spotting framework for accessing the content of historical machine-printed documents without the use of an optical character recognition engine. A preprocessing step improves the quality of the document images, and word segmentation is performed with two complementary segmentation methodologies. Synthetic word images are created from keywords, and these images are compared to all the words in the digitized documents. A user feedback process refines the search procedure. The methodology has been evaluated on early Modern Greek documents printed during the seventeenth and eighteenth centuries. To make access and search more efficient, two natural language processing techniques are employed: a morphological generator, which allows searching with only a base word-form to locate all the corresponding inflected word-forms, and a synonym dictionary, which further facilitates access to the semantic context of the documents.
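
The query expansion idea in the abstract (one base word-form yielding all inflected forms, plus synonyms) can be illustrated with a minimal sketch. This is not the authors' implementation: the function `expand_query`, the lookup-table generator `toy_inflect`, and the English toy data are all hypothetical stand-ins, and a real system would plug in a morphological generator and synonym dictionary for early Modern Greek.

```python
from typing import Callable, Dict, List

# Hypothetical toy data standing in for a real morphological generator
# and synonym dictionary; the entries are illustrative only.
TOY_PARADIGMS: Dict[str, List[str]] = {
    "book": ["books"],
    "volume": ["volumes"],
}
TOY_SYNONYMS: Dict[str, List[str]] = {
    "book": ["volume"],
}


def toy_inflect(lemma: str) -> List[str]:
    """Return the inflected forms of a lemma from the toy lookup table."""
    return TOY_PARADIGMS.get(lemma, [])


def expand_query(
    base_form: str,
    inflect: Callable[[str], List[str]],
    synonyms: Dict[str, List[str]],
) -> List[str]:
    """Expand a base word-form into the list of word-forms to search for.

    The result contains the base form, its synonyms, and every inflected
    form of each, in a stable order and without duplicates.
    """
    lemmas = [base_form] + synonyms.get(base_form, [])
    expanded: List[str] = []
    seen = set()
    for lemma in lemmas:
        for form in [lemma] + inflect(lemma):
            if form not in seen:
                seen.add(form)
                expanded.append(form)
    return expanded


if __name__ == "__main__":
    # Each resulting word-form would then be rendered as a synthetic
    # word image and matched against the segmented document words.
    print(expand_query("book", toy_inflect, TOY_SYNONYMS))
```

In such a setup, each expanded word-form would become its own keyword for the image-matching stage, so the user types only the base form.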

Author information

Corresponding author

Correspondence to A. L. Kesidis.

About this article

Cite this article

Kesidis, A.L., Galiotou, E., Gatos, B. et al. A word spotting framework for historical machine-printed documents. IJDAR 14, 131–144 (2011). https://doi.org/10.1007/s10032-010-0134-4

  • DOI: https://doi.org/10.1007/s10032-010-0134-4
