Article

Extraction of text areas in printed document images

Authors:
Jean Duong, Ecole de Technologie Superieure (ETS), Montréal, Quebec, Canada
Myriam Côte, Ecole de Technologie Superieure (ETS), Montréal, Quebec, Canada
Hubert Emptoz, Institut National des Sciences Appliquees (INSA) de Lyon, Villeurbanne Cedex, France
Ching Y. Suen, Concordia University, Montréal, Quebec, Canada

DocEng '01: Proceedings of the 2001 ACM Symposium on Document engineering, November 2001, Pages 157–165
https://doi.org/10.1145/502187.502211

Published: 09 November 2001

ABSTRACT

In this paper, we present a document analysis system designed to extract regions of interest from greyscale document images. The collected areas are then clustered into text zones and non-text areas using geometric and texture features. The system works in two steps. First, regions of interest are retrieved via cumulative gradient considerations. Second, in the classification module, we introduce an entropic heuristic. Experiments on the MediaTeam Document Database show the relevance of this criterion.
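
The abstract does not give the exact formulas, so the following is only a minimal sketch of the two ideas it names, not the authors' implementation. It assumes the gradient is approximated by forward differences, that "cumulative" means summing gradient magnitudes over a horizontal window, and that the entropic measure is the Shannon entropy of grey levels in a region. All function names and the window parameter `half_width` are hypothetical.

```python
import math
from collections import Counter


def gradient_magnitude(img):
    """Approximate gradient magnitude as |dx| + |dy| using forward differences.

    `img` is a list of rows of grey levels (assumed representation).
    """
    h, w = len(img), len(img[0])
    grad = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx = img[y][x + 1] - img[y][x] if x + 1 < w else 0
            dy = img[y + 1][x] - img[y][x] if y + 1 < h else 0
            grad[y][x] = abs(dx) + abs(dy)
    return grad


def cumulative_gradient(grad, half_width):
    """Sum gradient magnitudes over a horizontal window centred on each pixel.

    The window shape is an assumption; it is clipped at the image borders.
    """
    h, w = len(grad), len(grad[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        # Row prefix sums make each window sum a constant-time lookup.
        prefix = [0]
        for value in grad[y]:
            prefix.append(prefix[-1] + value)
        for x in range(w):
            lo = max(0, x - half_width)
            hi = min(w, x + half_width + 1)
            out[y][x] = prefix[hi] - prefix[lo]
    return out


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Under these assumptions, pixels with a high cumulative gradient could be kept as candidate regions of interest, and the entropy of each candidate region could serve as one feature among the geometric and texture features used to separate text from non-text. The thresholds and the decision rule are left open here because the abstract does not state them.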

Published in
DocEng '01: Proceedings of the 2001 ACM Symposium on Document engineering
November 2001
174 pages
ISBN: 1581134320
DOI: 10.1145/502187
General Chair: Ethan V. Munson
Copyright © 2001 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
entropy
features
text extraction
Qualifiers
- Article

Acceptance Rates
DocEng '01 paper acceptance rate: 18 of 55 submissions, 33%
Overall acceptance rate: 178 of 537 submissions, 33%