Detection of language (model) errors

Authors: K. Y. Hung, R. W. P. Luk, D. Yeung, K. F. L. Chung, W. Shu
Hong Kong Polytechnic University, Hong Kong

EMNLP '00: Proceedings of the 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics - Volume 13, October 2000, Pages 87–94, https://doi.org/10.3115/1117794.1117805

Published: 07 October 2000

ABSTRACT

Bigram language models are popular in many language processing applications, for both Indo-European and Asian languages. However, when a language model for Chinese is applied to a novel domain, its accuracy drops significantly, from 96% to 78% in our evaluation. We apply pattern recognition techniques (i.e., Bayesian, decision tree and neural network classifiers) to detect language model errors. We examine two general types of features: model-based and language-specific. In our evaluation, the Bayesian classifier produces the best recall (80%), but its precision is low (60%). The neural network produces good recall (75%) and precision (80%), but both the Bayesian and neural network classifiers have a low skip ratio (65%). The decision tree classifier produces the best precision (81%) and skip ratio (76%), but its recall is the lowest (73%).
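
To make the setup concrete, the following is a minimal sketch of how a bigram model can supply a model-based feature for error detection and how precision and recall are then computed. It is an illustration under stated assumptions, not the authors' method: the abstract does not specify the features, and a simple probability threshold stands in here for the Bayesian, decision tree and neural network classifiers. The names `BigramModel`, `flag_suspects`, `precision_recall` and the `threshold` parameter are hypothetical. The skip ratio is not defined in the abstract, so it is left out.

```python
from collections import Counter
from typing import Sequence, Set, Tuple


class BigramModel:
    """Add-one smoothed bigram model over tokens (illustrative only)."""

    def __init__(self, sentences: Sequence[Sequence[str]]) -> None:
        self.contexts: Counter = Counter()  # how often each token precedes another
        self.bigrams: Counter = Counter()   # counts of adjacent token pairs
        vocab: Set[str] = set()
        for sent in sentences:
            tokens = ["<s>"] + list(sent)   # "<s>" marks the sentence start
            vocab.update(tokens)
            self.contexts.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(vocab)

    def prob(self, prev: str, cur: str) -> float:
        # P(cur | prev) with add-one (Laplace) smoothing.
        return (self.bigrams[(prev, cur)] + 1) / (self.contexts[prev] + self.vocab_size)


def flag_suspects(model: BigramModel, sent: Sequence[str], threshold: float) -> Set[int]:
    """Return token positions whose bigram probability is below the threshold.

    Assumption: a low-probability transition is treated as a suspected error.
    """
    tokens = ["<s>"] + list(sent)
    return {
        i for i in range(len(sent))
        if model.prob(tokens[i], tokens[i + 1]) < threshold
    }


def precision_recall(flagged: Set[int], actual: Set[int]) -> Tuple[float, float]:
    """Precision and recall of flagged positions against true error positions."""
    hits = len(flagged & actual)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy training data; a real system would train on a large corpus.
    model = BigramModel([["a", "b", "c"], ["a", "b", "d"]])
    flagged = flag_suspects(model, ["a", "d", "c"], threshold=0.25)
    print(sorted(flagged), precision_recall(flagged, actual={1}))
```

Lowering `threshold` flags fewer positions, which tends to raise precision at the cost of recall. This is the same trade-off the abstract reports across the three classifiers.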



Published in
EMNLP '00 proceedings, Volume 13, October 2000, 233 pages
Conference Chairs: Hinrich Schütze (GroupFire Inc), Keh-Yih Su (Behavior Design Corporation)
Publisher: Association for Computational Linguistics, United States
Overall Acceptance Rate: 73 of 234 submissions, 31%