Abstract
This article presents a unified approach to Chinese statistical language modeling (SLM). Applying SLM techniques like trigram language models to Chinese is challenging because (1) there is no standard definition of words in Chinese; (2) word boundaries are not marked by spaces; and (3) there is a dearth of training data. Our unified approach automatically and consistently gathers a high-quality training data set from the Web, creates a high-quality lexicon, segments the training data using this lexicon, and compresses the language model, all by using the maximum likelihood principle, which is consistent with trigram model training. We show that each of the methods leads to improvements over standard SLM, and that the combined method yields the best pinyin conversion result reported.
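As a point of reference for the maximum likelihood principle mentioned above, the following is a minimal sketch of unsmoothed trigram estimation by relative frequency, P(w | u, v) = C(u, v, w) / C(u, v), over text that has already been segmented into words. The function name `train_trigram_mle`, the padding symbols `<s>` and `</s>`, and the toy corpus are assumptions made for illustration. This is not the article's implementation, and a practical model would also need smoothing or backoff.

```python
from collections import Counter


def train_trigram_mle(sentences):
    """Return a function prob(w, u, v) giving the relative-frequency
    (maximum likelihood) estimate of P(w | u, v)."""
    trigram_counts = Counter()
    context_counts = Counter()
    for words in sentences:
        # Pad so the first real word also has a two-word history.
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for u, v, w in zip(padded, padded[1:], padded[2:]):
            trigram_counts[(u, v, w)] += 1
            context_counts[(u, v)] += 1

    def prob(w, u, v):
        context = context_counts[(u, v)]
        if context == 0:
            # Unsmoothed estimate: an unseen history receives no mass.
            return 0.0
        return trigram_counts[(u, v, w)] / context

    return prob


# Toy, already-segmented corpus (hypothetical data, illustration only).
corpus = [
    ["我", "喜欢", "语言", "模型"],
    ["我", "喜欢", "语言", "学"],
]
prob = train_trigram_mle(corpus)
```

Because the estimate depends on word counts, the same character string segmented under a different lexicon yields different trigram statistics, which is why the lexicon and segmentation choices matter for the resulting model.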