article

Toward a unified approach to statistical language modeling for Chinese

Authors:
Jianfeng Gao, Microsoft Research (Asia), Beijing, China
Joshua Goodman, Microsoft Research (Redmond), Washington
Mingjing Li, Microsoft Research (Asia), Beijing, China
Kai-Fu Lee, Microsoft Research (Asia), Beijing, China

ACM Transactions on Asian Language Information Processing, Volume 1, Issue 1, pp. 3–33. https://doi.org/10.1145/595576.595578

Published: 01 March 2002

Abstract

This article presents a unified approach to Chinese statistical language modeling (SLM). Applying SLM techniques like trigram language models to Chinese is challenging because (1) there is no standard definition of words in Chinese; (2) word boundaries are not marked by spaces; and (3) there is a dearth of training data. Our unified approach automatically and consistently gathers a high-quality training data set from the Web, creates a high-quality lexicon, segments the training data using this lexicon, and compresses the language model, all by using the maximum likelihood principle, which is consistent with trigram model training. We show that each of the methods leads to improvements over standard SLM, and that the combined method yields the best pinyin conversion result reported.
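To make the segmentation step concrete, here is a minimal sketch of maximum-likelihood word segmentation against a lexicon. The abstract does not give the authors' algorithm, so this is an assumption-laden illustration: it scores candidates with a unigram model rather than a trigram model, and the lexicon entries, their probabilities, the out-of-vocabulary penalty, and the function name `segment` are all hypothetical.

```python
import math

# Hypothetical toy lexicon: word -> unigram probability (made-up values,
# not taken from the article).
LEXICON = {
    "研究": 0.02,
    "研究生": 0.005,
    "生命": 0.01,
    "命": 0.002,
    "的": 0.05,
    "起源": 0.004,
    "生": 0.003,
}
MAX_WORD_LEN = max(len(w) for w in LEXICON)
# Assumed penalty for a single character that is not in the lexicon.
OOV_LOG_PROB = math.log(1e-8)


def segment(text):
    """Return the segmentation of `text` with the highest unigram likelihood."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word in text[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = text[start:end]
            if word in LEXICON:
                score = math.log(LEXICON[word])
            elif end - start == 1:
                score = OOV_LOG_PROB  # unknown single character
            else:
                continue  # unknown multi-character strings are not words
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    # Follow the back-pointers from the end of the string to recover the words.
    words = []
    pos = n
    while pos > 0:
        words.append(text[back[pos]:pos])
        pos = back[pos]
    return words[::-1]


print(segment("研究生命的起源"))
```

With this toy lexicon, the string 研究生命的起源 is split as 研究 / 生命 / 的 / 起源 rather than 研究生 / 命 / 的 / 起源, because the product of the word probabilities is larger for the first split. Because the same likelihood criterion is used for scoring segmentations and for estimating the n-gram model, the two steps stay consistent, which is the property the abstract emphasizes.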

Index Terms

Toward a unified approach to statistical language modeling for Chinese
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Speech recognition
    2. Philosophical/theoretical foundations of artificial intelligence
      1. Cognitive science
2. Human-centered computing
  1. Human computer interaction (HCI)
    1. Interaction paradigms
      1. Natural language interfaces

Published in

ACM Transactions on Asian Language Information Processing Volume 1, Issue 1
March 2002
102 pages
ISSN:1530-0226
EISSN:1558-3430
DOI:10.1145/595576

Copyright © 2002 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Chinese language
Chinese pinyin-to-character conversion
backoff
character error rate
domain adaptation
lexicon
n-gram model
perplexity
pruning
smoothing
statistical language modeling
word segmentation