Article

A search engine for historical manuscript images

Authors:
Toni M. Rath, University of Massachusetts, Amherst, MA
R. Manmatha, University of Massachusetts, Amherst, MA
Victor Lavrenko, University of Massachusetts, Amherst, MA

SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, July 2004, Pages 369–376, https://doi.org/10.1145/1008992.1009056

Published: 25 July 2004

ABSTRACT

Many museum and library archives are digitizing their large collections of handwritten historical manuscripts to enable public access. These collections are available only as images and require expensive manual annotation to make them accessible. Current handwriting recognizers have word error rates in excess of 50% and therefore cannot be used for such material. We describe two statistical models for retrieval in large collections of handwritten manuscripts given a text query. Both use a set of transcribed page images to learn a joint probability distribution between features computed from word images and their transcriptions. The models can then be used to retrieve unlabeled images of handwritten documents given a text query. We report experiments with a training set of 100 transcribed pages and a test set of 987 handwritten page images from the George Washington collection. The precision at 20 documents is about 0.4 to 0.5, depending on the model. To the best of our knowledge, this is the first automatic retrieval system for historical manuscripts that uses text queries without manual transcription of the original corpus.
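
The precision at 20 documents reported above is the fraction of the 20 top-ranked results that are relevant to the query. The following is a minimal sketch of that measure, not the authors' evaluation code. The function name `precision_at_k`, the page identifiers, and the choice to divide by k even when fewer than k results are returned are all assumptions made for illustration.

```python
def precision_at_k(ranked_ids, relevant_ids, k=20):
    """Return the fraction of the top-k ranked items that are relevant.

    Assumption: the denominator is always k, so a ranking that returns
    fewer than k items is penalized for the missing positions.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


if __name__ == "__main__":
    # Hypothetical ranking of 40 page images for one text query.
    ranked = [f"page_{i}" for i in range(1, 41)]
    # Hypothetical relevance judgments: every odd-numbered page is relevant.
    relevant = {f"page_{i}" for i in range(1, 41, 2)}
    print(precision_at_k(ranked, relevant, k=20))  # prints 0.5
```

In this toy ranking, 10 of the first 20 pages are relevant, which gives 0.5, the upper end of the range quoted in the abstract.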

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Object recognition
2. Information systems
  1. Information retrieval
    1. Retrieval models and ranking


Published in
SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
July 2004
624 pages
ISBN:1581138814
DOI:10.1145/1008992
General Chair: Mark Sanderson, University of Sheffield (UK)
Program Chairs: Kalervo Järvelin, University of Tampere (Finland); James Allan, University of Massachusetts (USA); Peter Bruza, Distributed Systems Technology Centre (Australia)
Copyright © 2004 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
handwriting retrieval
historical manuscripts
relevance models
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%

Article Metrics
- Total Citations: 104
- Total Downloads: 1,076
- Downloads (last 12 months): 10
- Downloads (last 6 weeks): 1