DOI: 10.3115/1117794.1117805

Detection of language (model) errors

Published: 07 October 2000

ABSTRACT

Bigram language models are popular in many language processing applications, for both Indo-European and Asian languages. However, when a language model for Chinese is applied in a novel domain, its accuracy drops significantly, from 96% to 78% in our evaluation. We apply pattern recognition techniques (i.e. Bayesian, decision tree and neural network classifiers) to detect language model errors. We examine two general types of features: model-based and language-specific features. In our evaluation, the Bayesian classifier produces the best recall (80%), but its precision is low (60%). The neural network classifier produces good recall (75%) and precision (80%), but both the Bayesian and neural network classifiers have a low skip ratio (65%). The decision tree classifier produces the best precision (81%) and skip ratio (76%), but its recall is the lowest (73%).
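The recall and precision figures quoted above can be read as standard detection metrics. The following is a minimal sketch, assuming each position in the output is labelled either "error" or "not an error" and that the classifier flags a subset of positions. The function name `precision_recall` and the toy labels are hypothetical and are not taken from the paper. The skip ratio is not defined in the abstract, so it is deliberately left out.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall for a binary error detector.

    predicted[i] is True when the classifier flags position i as an error;
    actual[i] is True when position i really is a language model error.
    Both names and the boolean encoding are assumptions for illustration.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)       # flagged and wrong
    false_pos = sum(1 for p, a in pairs if p and not a)  # flagged but correct
    false_neg = sum(1 for p, a in pairs if not p and a)  # missed errors

    # Guard against empty denominators (nothing flagged / no real errors).
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Toy data (invented): 4 real errors, 4 positions flagged, 3 of them correct.
actual = [True, True, True, True, False, False, False, False, False, False]
predicted = [True, True, True, False, True, False, False, False, False, False]

precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under this reading, a high-recall, low-precision detector such as the Bayesian classifier catches most errors but also flags many correct outputs, while the decision tree makes the opposite trade-off.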

Published in: EMNLP '00: Proceedings of the 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics - Volume 13, October 2000, 233 pages

Publisher: Association for Computational Linguistics, United States


Overall acceptance rate: 73 of 234 submissions (31%)
