A survey of keyword spotting techniques for printed document images

Artificial Intelligence Review

Abstract

This paper surveys past research on character-based and keyword-based approaches to retrieving information from document images. It also offers insight into the strengths and weaknesses of current techniques and the relationships among them, and gives guidance on the areas that future work on document image retrieval could address.

Corresponding author

Correspondence to Abirami Murugappan.

About this article

Cite this article

Murugappan, A., Ramachandran, B. & Dhavachelvan, P. A survey of keyword spotting techniques for printed document images. Artif Intell Rev 35, 119–136 (2011). https://doi.org/10.1007/s10462-010-9187-5

  • DOI: https://doi.org/10.1007/s10462-010-9187-5
