Cross language text categorization by acquiring multilingual domain models from comparable corpora

Authors:
Alfio Gliozzo, ITC-Irst, Trento, Italy
Carlo Strapparava, ITC-Irst, Trento, Italy

ParaText '05: Proceedings of the ACL Workshop on Building and Using Parallel Texts, June 2005, Pages 9–16

Published: 29 June 2005

ABSTRACT

In a multilingual scenario, the classical monolingual text categorization (TC) problem can be reformulated as a cross-language TC task, in which the system has to cope with two or more languages (e.g., English and Italian). In this setting, the system is trained on labeled examples in a source language (e.g., English) and classifies documents written in a different target language (e.g., Italian).

In this paper we propose a novel approach to the cross-language text categorization problem, based on acquiring Multilingual Domain Models from comparable corpora in a totally unsupervised way and without using any external knowledge source (e.g., bilingual dictionaries). These Multilingual Domain Models are used to define a generalized similarity function (i.e., a kernel function) between documents in different languages, which is then plugged into a Support Vector Machine classification framework. The results show that our approach is a feasible and cheap solution that largely outperforms a baseline.
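
To make the idea of a kernel over documents in different languages concrete, here is a minimal sketch under stated assumptions. The domain model below is a hand-written toy table (the names `DOMAIN_MODEL`, `to_domain_space`, `domain_kernel`, and `classify`, the two domains, and all weights are hypothetical), whereas the paper acquires its models from comparable corpora without supervision. A nearest-neighbour rule stands in for the Support Vector Machine purely to keep the example dependency-free.

```python
import math

# Hypothetical toy domain model: each term (English or Italian) maps to
# weights over two made-up domains, (MEDICINE, SPORT). The paper acquires
# such a model from comparable corpora; here it is written by hand.
DOMAIN_MODEL = {
    "doctor":   (1.0, 0.0),
    "hospital": (1.0, 0.0),
    "medico":   (1.0, 0.0),
    "ospedale": (1.0, 0.0),
    "match":    (0.0, 1.0),
    "goal":     (0.0, 1.0),
    "partita":  (0.0, 1.0),
    "squadra":  (0.0, 1.0),
}


def to_domain_space(tokens):
    """Sum the domain vectors of the known tokens in a document."""
    vec = [0.0, 0.0]
    for tok in tokens:
        weights = DOMAIN_MODEL.get(tok)
        if weights is None:
            continue  # terms outside the domain model are ignored
        for i, w in enumerate(weights):
            vec[i] += w
    return vec


def domain_kernel(doc_a, doc_b):
    """Cosine similarity of two documents after projection to domain space."""
    a, b = to_domain_space(doc_a), to_domain_space(doc_b)
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    if norm == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / norm


def classify(doc, labeled_docs):
    """Return the label of the most similar training document.

    This 1-nearest-neighbour rule is a stand-in for the SVM, not the
    classifier described in the abstract.
    """
    return max(labeled_docs, key=lambda pair: domain_kernel(doc, pair[0]))[1]


# Labeled training documents in the source language (English).
train_en = [
    (["doctor", "hospital"], "medicine"),
    (["match", "goal"], "sport"),
]
# Unlabeled document in the target language (Italian).
test_it = ["medico", "ospedale", "partita"]

print(classify(test_it, train_en))
```

The English and Italian documents share no surface terms, so a plain bag-of-words similarity between them would be zero; the similarity becomes non-zero only because both are compared in the shared domain space.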

Published in
ParaText '05: Proceedings of the ACL Workshop on Building and Using Parallel Texts
June 2005, 233 pages
Program Chairs: Philipp Koehn (University of Edinburgh), Joel Martin (National Research Council of Canada), Rada Mihalcea (University of North Texas), Christof Monz (University of Maryland), Ted Pedersen (University of Minnesota, Duluth)
Publisher: Association for Computational Linguistics, United States