ABSTRACT
Bigram language models are popular in many language processing applications, for both Indo-European and Asian languages. However, when a language model for Chinese is applied to a novel domain, its accuracy drops significantly, from 96% to 78% in our evaluation. We apply pattern recognition techniques (namely Bayesian, decision tree and neural network classifiers) to detect language model errors. We examine two general types of features: model-based and language-specific. In our evaluation, the Bayesian classifier achieved the best recall (80%), but its precision was low (60%). The neural network classifier achieved good recall (75%) and precision (80%), but both the Bayesian and neural network classifiers had a low skip ratio (65%). The decision tree classifier achieved the best precision (81%) and skip ratio (76%), but the lowest recall (73%).
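For readers unfamiliar with the recall and precision figures quoted above, the following is a minimal sketch of how the two measures are conventionally computed for an error detector. It is not the authors' evaluation code: the function name `precision_recall` and the toy position sets are hypothetical, chosen only for illustration. The skip ratio is not defined in the abstract, so it is left out here.

```python
def precision_recall(flagged, actual_errors):
    """Return (precision, recall) for an error detector.

    flagged       -- positions the classifier marked as language model errors
    actual_errors -- positions that really are errors (ground truth)
    Both arguments are hypothetical inputs used for illustration only.
    """
    flagged = set(flagged)
    actual_errors = set(actual_errors)
    true_positives = len(flagged & actual_errors)
    # Precision: fraction of flagged positions that are real errors.
    precision = true_positives / len(flagged) if flagged else 0.0
    # Recall: fraction of real errors that were flagged.
    recall = true_positives / len(actual_errors) if actual_errors else 0.0
    return precision, recall


# Toy data (assumed, not from the paper): 5 real errors, 6 flagged positions,
# 4 of which coincide.
actual = {2, 5, 7, 9, 11}
flagged = {2, 5, 7, 9, 3, 8}
precision, recall = precision_recall(flagged, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With these toy sets, a detector that flags more positions tends to raise recall at the cost of precision, which is the trade-off the reported figures reflect.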
- Detection of language (model) errors