A search engine for historical manuscript images

Published: 25 July 2004
DOI: 10.1145/1008992.1009056

ABSTRACT

Many museum and library archives are digitizing their large collections of handwritten historical manuscripts to make them publicly accessible. These collections are available only as images, so searching them requires expensive manual annotation. Current handwriting recognizers have word error rates in excess of 50% and therefore cannot be used for such material. We describe two statistical models for retrieval in large collections of handwritten manuscripts given a text query. Both use a set of transcribed page images to learn a joint probability distribution between features computed from word images and their transcriptions. The models can then be used to retrieve unlabeled images of handwritten documents given a text query. In experiments with a training set of 100 transcribed pages and a test set of 987 handwritten page images from the George Washington collection, the precision at 20 documents is about 0.4 to 0.5, depending on the model. To the best of our knowledge, this is the first automatic retrieval system for historical manuscripts that uses text queries without manual transcription of the original corpus.
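
For readers unfamiliar with the measure quoted above, precision at 20 documents is the fraction of the 20 top-ranked results that are relevant to the query. The following is a minimal sketch of that computation, not the authors' evaluation code: the function name, the page identifiers, and the relevance judgments are all made up for illustration, and it assumes the common convention of dividing by k even when fewer than k items are returned.

```python
from typing import Sequence, Set


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 20) -> float:
    """Return the fraction of the top-k ranked items that are relevant.

    Assumption: the denominator is always k, so a ranking shorter
    than k is penalized for the missing positions.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


# Hypothetical ranking of 25 page images returned for one text query.
ranking = [f"page_{i:03d}" for i in range(25)]

# Hypothetical relevance judgments: 9 of the top 20 pages are relevant,
# plus one relevant page ranked below the cutoff (it does not count).
relevant = {f"page_{i:03d}" for i in (0, 1, 3, 4, 7, 10, 12, 15, 19, 22)}

p20 = precision_at_k(ranking, relevant, k=20)
print(f"P@20 = {p20:.2f}")
```

With these invented judgments, 9 of the first 20 pages are relevant, so the value falls inside the 0.4 to 0.5 range that the abstract reports.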


Published in

SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
July 2004
624 pages
ISBN: 1581138814
DOI: 10.1145/1008992

Copyright © 2004 ACM

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 792 of 3,983 submissions, 20%
