Toward a unified approach to statistical language modeling for Chinese

Published: 01 March 2002

Abstract

This article presents a unified approach to Chinese statistical language modeling (SLM). Applying SLM techniques like trigram language models to Chinese is challenging because (1) there is no standard definition of words in Chinese; (2) word boundaries are not marked by spaces; and (3) there is a dearth of training data. Our unified approach automatically and consistently gathers a high-quality training data set from the Web, creates a high-quality lexicon, segments the training data using this lexicon, and compresses the language model, all by using the maximum likelihood principle, which is consistent with trigram model training. We show that each of the methods leads to improvements over standard SLM, and that the combined method yields the best pinyin conversion result reported.
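As an illustration of what segmenting text with a lexicon under the maximum likelihood principle can look like, the sketch below picks the word sequence with the highest total log-probability by dynamic programming. It is a minimal sketch under stated assumptions and not the authors' implementation: it scores words with a unigram model rather than the trigram model the abstract mentions, and the names `segment` and `toy_lexicon`, the `max_word_len` parameter, and all probability values are hypothetical.

```python
import math


def segment(text, word_logprob, max_word_len=4):
    """Return the segmentation of `text` with the highest total log-probability.

    Assumption: words are scored independently (unigram model), which is a
    simplification of the trigram setting described in the abstract.
    """
    n = len(text)
    # best[i] = (best log-probability of text[:i], start index of its last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            # Only extend prefixes that are themselves segmentable.
            if word in word_logprob and best[start][0] > -math.inf:
                score = best[start][0] + word_logprob[word]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError("text cannot be segmented with this lexicon")
    # Follow the stored start indices back from the end of the text.
    words = []
    i = n
    while i > 0:
        start = best[i][1]
        words.append(text[start:i])
        i = start
    return words[::-1]


# Hypothetical lexicon; the probabilities are made up for illustration only.
toy_lexicon = {
    word: math.log(p)
    for word, p in {
        "研究": 0.2,
        "生命": 0.2,
        "起源": 0.1,
        "研究生": 0.1,
        "命": 0.01,
    }.items()
}

print(segment("研究生命起源", toy_lexicon))
```

With these toy values, the candidate "研究 / 生命 / 起源" scores 0.2 × 0.2 × 0.1 = 0.004, while "研究生 / 命 / 起源" scores 0.1 × 0.01 × 0.1 = 0.0001, so the first segmentation is selected. The example also shows why the lexicon and the segmentation interact: changing which words are in the lexicon, or their probabilities, changes which segmentation is most likely.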

