Abstract
The growing importance of multilingual information retrieval and machine translation has made multilingual ontologies extremely valuable resources. Since constructing an ontology from scratch is very expensive and time-consuming, it is attractive to consider ways of automatically aligning the monolingual ontologies that already exist for many of the world's major languages. Previous research has relied either on similarity in the structure of the ontologies to be aligned or on manually created bilingual resources. These approaches cannot align ontologies with vastly different structures, and they apply only to much-studied language pairs for which expensive resources are already available. In this paper, we propose a novel approach that aligns the ontologies at the node level: given a concept represented by a particular word sense in one ontology, our task is to find the best corresponding word sense in the second-language ontology. To this end, we present a language-independent, corpus-based method that borrows from techniques used in information retrieval and machine translation. We show its efficiency by applying it to two very different ontologies in very different languages: the Mandarin Chinese HowNet and the American English WordNet. Moreover, we propose a methodology for measuring the comparability of bilingual corpora and show that our method is robust enough to use noisy nonparallel bilingual corpora efficiently when clean parallel corpora are not available.
Index Terms
- Aligning word senses using bilingual corpora