GFG-Based Compression and Retrieval of Document Images in Indian Scripts

Guide to OCR for Indic Scripts

Part of the book series: Advances in Pattern Recognition (ACVPR)

Abstract

Indexing and retrieving Indian language documents is an important problem. We present an interactive access scheme for Indian language document collections that uses word-image-based search. The proposed compression and retrieval paradigm applies even to Indian scripts for which reliable OCR technology is not available. Our word-spotting technique exploits the geometric features of the word image. These features are represented as a graph called the geometric feature graph (GFG). The GFG is encoded as a string, which serves as a compressed representation of the word image skeleton. We also augment GFG-based word image spotting with latent semantic analysis for more effective retrieval. The query is specified as a set of word images, and the documents that best match the query representation in the latent semantic space are retrieved. The retrieval paradigm is further extended to the conceptual level by using domain knowledge about document image content, specified in the form of an ontology.
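The latent semantic retrieval step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that GFG-based word spotting has already assigned every word image to a word-class index, so that a collection reduces to a word-class-by-document count matrix. The function names (`lsa_fit`, `fold_in`, `rank_documents`) and the toy matrix are hypothetical, and the standard truncated-SVD formulation of latent semantic analysis with cosine similarity is assumed.

```python
import numpy as np

def lsa_fit(term_doc, k):
    """Truncated SVD of a (word classes x documents) count matrix."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k, s_k = U[:, :k], s[:k]
    doc_vecs = Vt[:k, :].T  # one row per document in the k-dim latent space
    return U_k, s_k, doc_vecs

def fold_in(query_counts, U_k, s_k):
    """Project a query count vector into the latent space: q^T U_k S_k^{-1}."""
    return (query_counts @ U_k) / s_k

def rank_documents(query_counts, U_k, s_k, doc_vecs):
    """Return document indices sorted by cosine similarity to the query."""
    q = fold_in(query_counts, U_k, s_k)
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-12
    sims = (doc_vecs @ q) / norms
    return np.argsort(-sims), sims

# Toy data (made up): rows are word classes, columns are documents.
# Documents 0 and 1 share word classes 0-2; documents 2 and 3 share 3-4.
term_doc = np.array([
    [2, 0, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

U_k, s_k, doc_vecs = lsa_fit(term_doc, k=2)

# The query is a set of word images, here a single image of word class 0.
query = np.array([1, 0, 0, 0, 0], dtype=float)
order, sims = rank_documents(query, U_k, s_k, doc_vecs)
print(order, np.round(sims, 3))
```

In this toy collection, document 1 contains no occurrence of word class 0, yet it ranks alongside document 0 because both load on the same latent dimension. This is the kind of match that plain word spotting would miss.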



Author information

Correspondence to Gaurav Harit.

Copyright information

© 2009 Springer-Verlag London Limited

About this chapter

Cite this chapter

Harit, G., Chaudhury, S., Garg, R. (2009). GFG-Based Compression and Retrieval of Document Images in Indian Scripts. In: Govindaraju, V., Setlur, S. (eds) Guide to OCR for Indic Scripts. Advances in Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-84800-330-9_14

  • DOI: https://doi.org/10.1007/978-1-84800-330-9_14

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-84800-329-3

  • Online ISBN: 978-1-84800-330-9

  • eBook Packages: Computer Science, Computer Science (R0)