article

Aligning word senses using bilingual corpora

Authors:
Marine Carpuat

Hong Kong University of Science and Technology, Kowloon, Hong Kong

Pascale Fung

Hong Kong University of Science and Technology, Kowloon, Hong Kong

Grace Ngai

Hong Kong Polytechnic University, Kowloon, Hong Kong

ACM Transactions on Asian Language Information Processing, Volume 5, Issue 2, pp. 89–120. https://doi.org/10.1145/1165255.1165256

Published: 01 June 2006

Abstract

The growing importance of multilingual information retrieval and machine translation has made multilingual ontologies extremely valuable resources. Since constructing an ontology from scratch is expensive and time-consuming, it is attractive to consider ways of automatically aligning monolingual ontologies, which already exist for many of the world's major languages. Previous research either exploited structural similarity between the ontologies being aligned or relied on manually created bilingual resources. These approaches cannot align ontologies with vastly different structures, and they apply only to much-studied language pairs for which expensive resources are already available. In this paper, we propose a novel approach that aligns the ontologies at the node level: given a concept represented by a particular word sense in one ontology, our task is to find the best corresponding word sense in the second-language ontology. To this end, we present a language-independent, corpus-based method that borrows from techniques used in information retrieval and machine translation. We show its efficiency by applying it to two very different ontologies in very different languages: the Mandarin Chinese HowNet and the American English WordNet. Moreover, we propose a methodology for measuring the comparability of bilingual corpora and show that our method is robust enough to make efficient use of noisy nonparallel bilingual corpora when clean parallel corpora are not available.
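
The abstract does not spell out the algorithm, so the following is only a minimal sketch of one generic way a corpus-based, IR-style sense alignment could work, not the authors' method. It assumes each sense is represented by a small set of member words, builds sentence-level co-occurrence vectors in each language, maps the source-language vector into the target language through a seed lexicon, and ranks candidate senses by cosine similarity. All function names, the romanized tokens, the toy corpora, and the seed lexicon are hypothetical and chosen purely for illustration.

```python
import math
from collections import Counter

def context_vector(sense_words, sentences):
    """Count words that share a sentence with any member word of the sense."""
    members = set(sense_words)
    vec = Counter()
    for sent in sentences:
        if members & set(sent):
            vec.update(w for w in sent if w not in members)
    return vec

def project(vec, seed_lexicon):
    """Map source-language context words into the target language (assumed seed lexicon)."""
    out = Counter()
    for word, count in vec.items():
        for translation in seed_lexicon.get(word, ()):
            out[translation] += count
    return out

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def best_sense(src_sense_words, tgt_senses, src_corpus, tgt_corpus, seed_lexicon):
    """Return the best-scoring target sense label and all candidate scores."""
    src_vec = project(context_vector(src_sense_words, src_corpus), seed_lexicon)
    scores = {
        label: cosine(src_vec, context_vector(words, tgt_corpus))
        for label, words in tgt_senses.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical toy data: romanized source tokens, two candidate target senses.
src_corpus = [["yinhang", "cunkuan", "qian"], ["yinhang", "daikuan"]]
tgt_corpus = [
    ["bank", "deposit", "money"],
    ["bank", "loan", "interest"],
    ["shore", "river", "water"],
    ["riverbank", "river", "mud"],
]
seed_lexicon = {"cunkuan": ["deposit"], "qian": ["money"], "daikuan": ["loan"]}
tgt_senses = {"bank/financial": ["bank"], "bank/river": ["shore", "riverbank"]}

best, scores = best_sense(["yinhang"], tgt_senses, src_corpus, tgt_corpus, seed_lexicon)
print(best, scores)
```

Because the sketch only compares context distributions, it does not require sentence-aligned text, which is the property that makes this family of techniques applicable to nonparallel corpora; the quality of the seed lexicon and the choice of similarity measure are the obvious points where a real system would differ.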

Index Terms

Aligning word senses using bilingual corpora
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
2. Information systems
  1. Information retrieval
    1. Document representation

Published in

ACM Transactions on Asian Language Information Processing Volume 5, Issue 2
June 2006
93 pages
ISSN: 1530-0226
EISSN: 1558-3430
DOI: 10.1145/1165255

Copyright © 2006 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Machine translation
information retrieval
multilingual ontologies
nonparallel corpora

Article Metrics
- Total Citations: 4
- Total Downloads: 680
- Downloads (Last 12 months): 3
- Downloads (Last 6 weeks): 0