ABSTRACT
We present a Hebrew-to-English transliteration method in the context of a machine translation system. Our method uses machine learning to decide which terms should be transliterated rather than translated. The classifier is trained on positive examples only, acquired semi-automatically, and it eliminates more than 38% of the errors made by a baseline method. The identified terms are then transliterated by a model based on statistical machine translation (SMT), trained on a parallel corpus extracted from Wikipedia with a fairly simple method that requires minimal knowledge. The correct transliteration is the top-ranked result in more than 76% of cases and is among the top five results in 92% of cases. We also demonstrate a small improvement in the performance of a Hebrew-to-English MT system that uses our transliteration module.
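The top-1 and top-5 figures quoted above are ranked-list accuracies. As a minimal sketch of how such a figure is typically computed (not the authors' evaluation code), the snippet below counts a term as correct when its reference transliteration appears among the first k ranked candidates. The function name `top_k_accuracy`, the placeholder term keys, and the candidate lists are all hypothetical, chosen only for illustration; none of the data comes from the paper.

```python
from typing import Dict, List


def top_k_accuracy(candidates: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Fraction of terms whose reference appears among the first k ranked candidates."""
    if not gold:
        raise ValueError("gold must contain at least one term")
    hits = sum(
        1 for term, ref in gold.items()
        if ref in candidates.get(term, [])[:k]  # a missing term counts as a miss
    )
    return hits / len(gold)


# Hypothetical ranked outputs (best candidate first); illustrative data only.
ranked = {
    "term_a": ["london", "lundun", "londen"],
    "term_b": ["vashington", "washington", "wasington"],
    "term_c": ["pariz", "paris", "pares"],
}
# Hypothetical reference transliterations; term_c's reference is never proposed.
reference = {"term_a": "london", "term_b": "washington", "term_c": "berlin"}

top1 = top_k_accuracy(ranked, reference, k=1)
top5 = top_k_accuracy(ranked, reference, k=5)
print(f"top-1: {top1:.2f}, top-5: {top5:.2f}")
```

Top-5 accuracy can never be lower than top-1 accuracy on the same data, which is consistent with the 76% and 92% figures reported above.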
Index Terms
- Lightly supervised transliteration for machine translation