Abstract
We have developed a new program called word_align for aligning parallel text, that is, text such as the Canadian Hansards that is available in two or more languages. The program takes the output of char_align (Church, 1993), a robust alternative to sentence-based alignment programs, and applies word-level constraints using a version of Brown et al.’s Model 2 (Brown et al., 1993), modified and extended to deal with robustness issues. Word_align was tested on a subset of the Canadian Hansards supplied by Simard (Simard et al., 1992). Combining word_align with char_align reduces the variance (average square error) by a factor of 5 relative to char_align alone. More importantly, because word_align and char_align were designed to work robustly on texts that are smaller and noisier than the Hansards, it has been possible to deploy the programs successfully at AT&T Language Line Services, a commercial translation service, to help with difficult terminology.
Part of this work was accomplished at AT&T.
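The abstract measures alignment quality as variance, glossed as the average square error. The following minimal sketch shows one way such a figure could be computed, assuming each alignment is represented as a list of predicted target positions compared against hand-checked reference positions. The function names and the toy numbers are hypothetical illustrations, not the authors' evaluation code or data.

```python
from typing import Sequence


def mean_square_error(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Average of squared differences between predicted and reference positions."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not predicted:
        raise ValueError("at least one aligned position is required")
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(predicted)


def reduction_factor(baseline_error: float, refined_error: float) -> float:
    """How many times smaller the refined error is than the baseline error."""
    if refined_error <= 0:
        raise ValueError("refined_error must be positive")
    return baseline_error / refined_error


# Synthetic positions, invented for illustration only (not results from the paper).
reference_positions = [10, 20, 30, 40]
coarse_positions = [14, 16, 36, 34]   # stands in for a coarse first-pass alignment
refined_positions = [12, 18, 31, 39]  # stands in for a word-level refinement

coarse_error = mean_square_error(coarse_positions, reference_positions)
refined_error = mean_square_error(refined_positions, reference_positions)
factor = reduction_factor(coarse_error, refined_error)
```

A ratio of this kind is what a statement such as "reduces the variance by a factor of 5" summarizes; the units of position (characters or words) are an assumption that has to be fixed before the two errors can be compared.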
References
Baum, L. E. 1972. An inequality and an associated maximization technique in statistical estimation of probabilistic functions of a Markov process. Inequalities, 3: 1–8.
Brown, P., Cocke, J., Della Pietra, S., Della Pietra, V., Jelinek, F., Mercer, R. L. and Roossin, P. S. 1990. A statistical approach to language translation. Computational Linguistics, 16 (2): 79–85.
Brown, P., Lai, J. and Mercer, R. 1991a. Aligning sentences in parallel corpora. In Proceedings of the 29th Annual Meeting of the ACL, pp. 169–176.
Brown, P., Della Pietra, S., Della Pietra, V. and Mercer, R. 1991b. Word sense disambiguation using statistical methods. In Proceedings of the 29th Annual Meeting of the ACL, pp. 264–270.
Brown, P., Della Pietra, S., Della Pietra, V. and Mercer, R. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19 (2): 263–311.
Church, K. W. 1993. Char_align: A program for aligning parallel texts at the character level. In Proceedings of the 31st Annual Meeting of the ACL, pp. 1–8.
Dempster, A. P., Laird, N. M. and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39 (B): 1–38.
Gale, W. and Church, K. 1991a. Identifying word correspondence in parallel text. In Proceedings of the DARPA Workshop on Speech and Natural Language.
Gale, W. and Church, K. 1991b. A program for aligning sentences in bilingual corpora. In Proceedings of the 29th Annual Meeting of the ACL, pp. 177–184.
Gale, W., Church, K. and Yarowsky, D. 1992. Using bilingual materials to develop word sense disambiguation methods. In Proceedings of the International Conference on Theoretical and Methodological Issues in Machine Translation, pp. 101–112.
Isabelle, P. 1992. Bi-textual aids for translators. In Proceedings of the Annual Conference of the UW Center for the New OED and Text Research.
Kay, M. and Roscheisen, M. 1993. Text-translation alignment. Computational Linguistics, 19 (1): 121–142.
Klavans, J. and Tzoukermann, E. 1990. The BICORD system. In Proceedings of COLING 1990, Helsinki, Finland, pp. 174–178.
Kupiec, J. 1993. An algorithm for finding noun phrase correspondences in bilingual corpora. In Proceedings of the 31st Annual Meeting of the ACL, pp. 17–22.
Landauer, T. K. and Littman, M. L. 1990. Fully automatic cross-language document retrieval using latent semantic indexing. In Proceedings of the Annual Conference of the UW Center for the New OED and Text Research, pp. 31–38.
Matsumoto, Y., Ishimoto, H., Utsuro, T. and Nagao, M. 1993. Structural matching of parallel texts. In Proceedings of the 31st Annual Meeting of the ACL, pp. 23–30.
Ogden, W. and Gonzales, M. 1993. Norm — a system for translators. Demonstration at ARPA Workshop on Human Language Technology.
Sadler, V. 1989. Working with analogical semantics: Disambiguation techniques in DLT. Foris Publications.
Simard, M., Foster, G. and Isabelle, P. 1992. Using cognates to align sentences in bilingual corpora. In Proceedings of the International Conference on Theoretical and Methodological Issues in Machine Translation, pp. 67–82.
Smadja, F. 1992. How to compile a bilingual collocational lexicon automatically. In AAAI Workshop on Statistically-based Natural Language Processing Techniques, July.
Warwick, S., Hajic, J. and Russell, G. 1990. Searching on tagged corpora: linguistically motivated concordance analysis. In Proceedings of the Annual Conference of the UW Center for the New OED and Text Research, pp. 10–18.
Copyright information
© 1999 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Dagan, I., Church, K., Gale, W. (1999). Robust Bilingual Word Alignment for Machine Aided Translation. In: Armstrong, S., Church, K., Isabelle, P., Manzi, S., Tzoukermann, E., Yarowsky, D. (eds) Natural Language Processing Using Very Large Corpora. Text, Speech and Language Technology, vol 11. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2390-9_13
DOI: https://doi.org/10.1007/978-94-017-2390-9_13
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-5349-7
Online ISBN: 978-94-017-2390-9
eBook Packages: Springer Book Archive