ABSTRACT
In a multilingual scenario, the classical monolingual text categorization (TC) problem can be reformulated as a cross-language TC task, in which the system has to cope with two or more languages (e.g. English and Italian). In this setting, the system is trained on labeled examples in a source language (e.g. English) and classifies documents in a different target language (e.g. Italian).
In this paper we propose a novel approach to the cross-language text categorization problem, based on acquiring Multilingual Domain Models from comparable corpora in a totally unsupervised way and without using any external knowledge source (e.g. bilingual dictionaries). These Multilingual Domain Models are used to define a generalized similarity function (i.e. a kernel function) between documents in different languages, which is then plugged into a Support Vector Machine classification framework. The results show that our approach is a feasible and cheap solution that largely outperforms a baseline.
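To make the idea of a kernel defined through a domain model more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not say how the Multilingual Domain Models are acquired, so the term-to-domain weights below (`DOMAIN_MODEL`), the two domains, and the helper names `to_domain_space` and `domain_kernel` are all hypothetical and hard-coded for illustration. The sketch only shows how mapping documents from both languages into a shared domain space lets a similarity be computed between texts that have no words in common.

```python
import math
from collections import Counter

# Hypothetical toy domain model (an assumption, not taken from the paper):
# each English or Italian term gets a weight for two domains,
# ordered as (MEDICINE, SPORT). In the paper such weights are acquired
# from comparable corpora; here they are written by hand.
DOMAIN_MODEL = {
    "doctor":   (1.0, 0.0),
    "hospital": (1.0, 0.0),
    "medico":   (1.0, 0.0),
    "ospedale": (1.0, 0.0),
    "match":    (0.0, 1.0),
    "goal":     (0.0, 1.0),
    "partita":  (0.0, 1.0),
    "squadra":  (0.0, 1.0),
}
N_DOMAINS = 2


def to_domain_space(text):
    """Map a whitespace-tokenized text to a vector of domain weights."""
    counts = Counter(text.lower().split())
    vec = [0.0] * N_DOMAINS
    for term, freq in counts.items():
        weights = DOMAIN_MODEL.get(term)
        if weights is None:
            continue  # terms outside the toy model are ignored
        for d in range(N_DOMAINS):
            vec[d] += freq * weights[d]
    return vec


def domain_kernel(text_a, text_b):
    """Cosine similarity between two texts in the shared domain space."""
    a = to_domain_space(text_a)
    b = to_domain_space(text_b)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Under these toy weights, an English text about doctors and an Italian text about a "medico" land on the same domain axis even though they share no terms. A matrix of such kernel values could in principle be passed to any SVM implementation that accepts a precomputed kernel.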
- Cross language text categorization by acquiring multilingual domain models from comparable corpora