Abstract
Optical character recognition (OCR) still introduces a considerable amount of information loss and noise into texts, so that many documents are unsuitable for information extraction systems. This paper introduces a statistical method for bootstrapping a lexicon from a very small number of “noisy,” domain-specific texts. The method detects regularities in grammatical forms as well as recurring ungrammatical forms in the input text. Through a combination of frequency lists and Levenshtein matrices, a language-independent, robust core lexicon is constructed that also supports the analysis of “noisy” texts.
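The combination of frequency lists and Levenshtein distance mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the function names `levenshtein` and `bootstrap_lexicon`, the `max_distance` threshold, and the greedy "most frequent form wins" rule are all assumptions made for illustration.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance, computed row by row over the dynamic-programming matrix."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def bootstrap_lexicon(tokens, max_distance=1):
    """Map each surface form to the most frequent form within max_distance edits.

    Hypothetical heuristic: frequent forms are assumed to be correct, and
    rarer forms close to them are assumed to be OCR variants.
    """
    frequencies = Counter(tokens)
    # Most frequent forms first; ties are broken alphabetically for determinism.
    ranked = sorted(frequencies, key=lambda w: (-frequencies[w], w))
    core = []      # forms accepted as lexicon entries
    lexicon = {}   # surface form -> core form
    for form in ranked:
        match = next((c for c in core if levenshtein(form, c) <= max_distance), None)
        if match is None:
            core.append(form)
            lexicon[form] = form
        else:
            lexicon[form] = match
    return lexicon


if __name__ == "__main__":
    noisy = "invoice invoice invoice invo1ce involce total total tota1".split()
    for surface, entry in bootstrap_lexicon(noisy).items():
        print(f"{surface} -> {entry}")
```

With a fixed threshold, short words that differ by one character would also be merged, so a practical system would likely scale the threshold with word length or use additional evidence.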
Copyright information
© 1998 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Schneider, R. (1998). Automatic acquisition of lexical knowledge from sparse and noisy data. In: Nédellec, C., Rouveirol, C. (eds) Machine Learning: ECML-98. ECML 1998. Lecture Notes in Computer Science, vol 1398. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0026670
DOI: https://doi.org/10.1007/BFb0026670
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-64417-0
Online ISBN: 978-3-540-69781-7
eBook Packages: Springer Book Archive