
Cross language text categorization by acquiring multilingual domain models from comparable corpora

Published: 29 June 2005

ABSTRACT

In a multilingual scenario, the classical monolingual text categorization (TC) problem can be reformulated as a cross language TC task, in which we have to cope with two or more languages (e.g., English and Italian). In this setting, the system is trained on labeled examples in a source language (e.g., English) and classifies documents in a different target language (e.g., Italian).

In this paper we propose a novel approach to solve the cross language text categorization problem based on acquiring Multilingual Domain Models from comparable corpora in a totally unsupervised way and without using any external knowledge source (e.g. bilingual dictionaries). These Multilingual Domain Models are exploited to define a generalized similarity function (i.e. a kernel function) among documents in different languages, which is used inside a Support Vector Machines classification framework. The results show that our approach is a feasible and cheap solution that largely outperforms a baseline.
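
As a rough illustration of the idea of a similarity function defined over a shared domain space, the sketch below maps documents in two languages onto the same domain vectors and compares them with a cosine. It is a minimal sketch under stated assumptions, not the method of the paper: the `DOMAIN_MODEL` table, its two domains, its weights, and the function names `to_domain_space` and `domain_kernel` are all hypothetical and hand-set, whereas the paper acquires its Multilingual Domain Models from comparable corpora without supervision.

```python
import math

# Hypothetical toy domain model: every term, English or Italian, maps to
# soft memberships in two made-up domains, ordered as [MEDICINE, SPORT].
# These weights are hand-set for illustration only; they are not acquired
# from a corpus and do not come from the paper.
DOMAIN_MODEL = {
    "doctor":   [0.9, 0.1],
    "hospital": [1.0, 0.0],
    "match":    [0.1, 0.9],
    "medico":   [0.9, 0.1],
    "ospedale": [1.0, 0.0],
    "partita":  [0.0, 1.0],
}
NUM_DOMAINS = 2


def to_domain_space(tokens):
    """Sum the domain vectors of all known tokens in a document."""
    vec = [0.0] * NUM_DOMAINS
    for tok in tokens:
        weights = DOMAIN_MODEL.get(tok)
        if weights is None:
            continue  # terms outside the toy model are ignored
        for i, w in enumerate(weights):
            vec[i] += w
    return vec


def domain_kernel(doc_a, doc_b):
    """Cosine similarity between two documents in the shared domain space."""
    a = to_domain_space(doc_a)
    b = to_domain_space(doc_b)
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    if norm == 0.0:
        return 0.0  # at least one document has no known terms
    return sum(x * y for x, y in zip(a, b)) / norm


english_medical = ["doctor", "hospital"]
italian_medical = ["medico", "ospedale"]
italian_sport = ["partita"]

same_topic = domain_kernel(english_medical, italian_medical)
different_topic = domain_kernel(english_medical, italian_sport)
print(same_topic > different_topic)
```

The two medical documents share no surface words, yet they score higher with each other than the English medical document does with the Italian sports document, because the comparison happens in domain space. A function of this shape could in principle be supplied to a Support Vector Machine as its kernel, which is the role the abstract assigns to the similarity function.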

Published in

ParaText '05: Proceedings of the ACL Workshop on Building and Using Parallel Texts
June 2005
233 pages

Publisher

Association for Computational Linguistics, United States
