research-article

A Unified Character-Based Tagging Framework for Chinese Word Segmentation

Published: 01 June 2010

Abstract

Chinese word segmentation is an active area of Chinese language processing, although it suffers from disagreement about what precisely constitutes a word in Chinese. This study adopts a corpus-based segmentation standard. Specifically, we treat Chinese word segmentation as a character-based tagging problem, and we show that there has been a strong trend toward character-based tagging approaches in this field. In particular, learning from a segmented corpus, with or without additional linguistic resources, is treated in a unified way in which the only difference lies in how the feature template set is selected. Our approach differs from existing work in that it considers both feature template selection and tag set selection, rather than focusing on feature templates alone as previous techniques did. We show that different tag sets lead to significantly different performance. This holds especially for a six-tag set, which is good enough for most current segmented corpora. The linguistic meaning of a tag set is also discussed. Our results show that a simple learning system with six n-gram feature templates and a six-tag set achieves competitive performance when learning only from a training corpus. When additional linguistic resources are available, we propose an ensemble learning technique, the assistant segmenter, and verify its effectiveness. The assistant segmenter also proves to be an effective method for segmentation standard adaptation, outperforming existing ones. Based on the proposed approach, our system provides state-of-the-art performance on all 12 corpora of three international Chinese word segmentation bakeoffs.
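To make the character-based tagging formulation concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not spell out the six-tag set or the six n-gram templates, so the sketch assumes the tags B, B2, B3, M, E, and S (first, second, and third character, middle, end, and single-character word) and the templates C-1, C0, C1, C-1C0, C0C1, and C-1C1. The function names and the `<pad>` boundary symbol are hypothetical.

```python
from typing import List

# Assumed six-tag set: B, B2, B3 mark the first three characters of a
# multi-character word, M any further middle character, E the last
# character, and S a single-character word.
HEAD_TAGS = ["B", "B2", "B3"]


def words_to_tags(words: List[str]) -> List[str]:
    """Convert a segmented sentence into one position tag per character."""
    tags: List[str] = []
    for word in words:
        if not word:
            continue
        if len(word) == 1:
            tags.append("S")
            continue
        # Every character except the last gets B, B2, B3, then M.
        for i in range(len(word) - 1):
            tags.append(HEAD_TAGS[i] if i < len(HEAD_TAGS) else "M")
        tags.append("E")
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Recover words from a tag sequence: a word closes at each S or E tag."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("S", "E"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that does not end in S or E
        words.append(current)
    return words


def ngram_features(chars: str, i: int) -> List[str]:
    """Six assumed n-gram templates for the character at position i."""
    def at(j: int) -> str:
        return chars[j] if 0 <= j < len(chars) else "<pad>"

    prev, cur, nxt = at(i - 1), at(i), at(i + 1)
    return [
        f"C-1={prev}", f"C0={cur}", f"C1={nxt}",             # unigrams
        f"C-1C0={prev}{cur}", f"C0C1={cur}{nxt}",            # adjacent bigrams
        f"C-1C1={prev}{nxt}",                                # skip bigram
    ]


if __name__ == "__main__":
    words = ["我", "喜欢", "自然语言处理"]
    tags = words_to_tags(words)
    print(tags)
    print(tags_to_words("".join(words), tags))
    print(ngram_features("".join(words), 1))
```

In a full system, the feature strings for each character would be fed to a sequence learner such as a conditional random field, and the predicted tag sequence would be decoded back into words with `tags_to_words`.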



Published in

ACM Transactions on Asian Language Information Processing, Volume 9, Issue 2
June 2010
90 pages
ISSN: 1530-0226
EISSN: 1558-3430
DOI: 10.1145/1781134

Copyright © 2010 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 1 June 2010
• Accepted: 1 February 2010
• Revised: 1 January 2010
• Received: 1 May 2009


Qualifiers

• research-article
• Research
• Refereed
