Robust Bilingual Word Alignment for Machine Aided Translation

Natural Language Processing Using Very Large Corpora

Part of the book series: Text, Speech and Language Technology (TLTB, volume 11)

Abstract

We have developed a new program called word_align for aligning parallel texts, that is, texts such as the Canadian Hansards that are available in two or more languages. The program takes the output of char_align (Church, 1993), a robust alternative to sentence-based alignment programs, and applies word-level constraints using a version of Brown et al.’s Model 2 (Brown et al., 1993), modified and extended to deal with robustness issues. Word_align was tested on a subset of the Canadian Hansards supplied by Simard (Simard et al., 1992). The combination of word_align plus char_align reduces the variance (average square error) by a factor of 5 over char_align alone. More importantly, because word_align and char_align were designed to work robustly on texts that are smaller and noisier than the Hansards, it has been possible to deploy the programs successfully at AT&T Language Line Services, a commercial translation service, to help them with difficult terminology.

Part of this work was accomplished at AT&T.
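
The abstract reports its result as a variance, glossed as average square error. As a minimal sketch of how such a figure could be computed, the Python below takes the average squared difference between predicted and reference positions and forms the ratio between two systems. The function name, the position lists, and the assumption that error is measured as a positional offset in the target text are all illustrative; none of the numbers come from the chapter.

```python
def mean_squared_error(predicted, reference):
    """Average squared difference between predicted and reference positions."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return total / len(predicted)


# Hypothetical target-text positions for five source words (illustrative only).
reference = [10, 25, 40, 55, 70]
coarse = [15, 20, 45, 50, 75]    # stand-in for a character-level alignment
refined = [12, 24, 42, 54, 71]   # stand-in for a word-level refinement

coarse_mse = mean_squared_error(coarse, reference)
refined_mse = mean_squared_error(refined, reference)

# Ratio of the two errors: how many times smaller the refined error is.
reduction_factor = coarse_mse / refined_mse
```

A ratio above 1 means the refined alignment has the smaller average square error; the abstract's "factor of 5" is a ratio of this kind, measured on the Hansard subset rather than on toy data.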

References

  • Baum, L. E. 1972. An inequality and an associated maximization technique in statistical estimation of probabilistic functions of a Markov process. Inequalities, 3: 1–8.

  • Brown, P., Cocke, J., Della Pietra, S., Della Pietra, V., Jelinek, F., Mercer, R. L. and Roossin, P.S. 1990. A statistical approach to language translation. Computational Linguistics, 16 (2): 79–85.

  • Brown, P., Lai, J. and Mercer, R. 1991a. Aligning sentences in parallel corpora. In Proceedings of the 29th Annual Meeting of the ACL, pp. 169–176.

  • Brown, P., Della Pietra, S., Della Pietra, V. and Mercer, R. 1991b. Word sense disambiguation using statistical methods. In Proceedings of the 29th Annual Meeting of the ACL, pp. 264–270.

  • Brown, P., Della Pietra, S., Della Pietra, V. and Mercer, R. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19 (2): 263–311.

  • Church, K. W. 1993. Char_align: A program for aligning parallel texts at the character level. In Proceedings of the 31st Annual Meeting of the ACL, pp. 1–8.

  • Dempster, A. P., Laird, N. M. and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39 (B): 1–38.

  • Gale, W. and Church, K. 1991a. Identifying word correspondence in parallel text. In Proceedings of the DARPA Workshop on Speech and Natural Language.

  • Gale, W. and Church, K. 1991b. A program for aligning sentences in bilingual corpora. In Proceedings of the 29th Annual Meeting of the ACL, pp. 177–184.

  • Gale, W., Church, K. and Yarowsky, D. 1992. Using bilingual materials to develop word sense disambiguation methods. In Proceedings of the International Conference on Theoretical and Methodological Issues in Machine Translation, pp. 101–112.

  • Isabelle, P. 1992. Bi-textual aids for translators. In Proceedings of the Annual Conference of the UW Center for the New OED and Text Research.

  • Kay, M. and Roscheisen, M. 1993. Text-translation alignment. Computational Linguistics, 19 (1): 121–142.

  • Klavans, J. and Tzoukermann, E. 1990. The BICORD system. In Proceedings of COLING 1990, Helsinki, Finland, pp. 174–178.

  • Kupiec, J. 1993. An algorithm for finding noun phrase correspondences in bilingual corpora. In Proceedings of the 31st Annual Meeting of the ACL, pp. 17–22.

  • Landauer, T. K. and Littman, M. L. 1990. Fully automatic cross-language document retrieval using latent semantic indexing. In Proceedings of the Annual Conference of the UW Center for the New OED and Text Research, pp. 31–38.

  • Matsumoto, Y., Ishimoto, H., Utsuro, T. and Nagao, M. 1993. Structural matching of parallel texts. In Proceedings of the 31st Annual Meeting of the ACL, pp. 23–30.

  • Ogden, W. and Gonzales, M. 1993. Norm — a system for translators. Demonstration at ARPA Workshop on Human Language Technology.

  • Sadler, V. 1989. Working with analogical semantics: Disambiguation techniques in DLT. Foris Publications.

  • Simard, M., Foster, G. and Isabelle, P. 1992. Using cognates to align sentences in bilingual corpora. In Proceedings of the International Conference on Theoretical and Methodological Issues in Machine Translation, pp. 67–82.

  • Smadja, F. 1992. How to compile a bilingual collocational lexicon automatically. In AAAI Workshop on Statistically-based Natural Language Processing Techniques, July.

  • Warwick, S., Hajic, J. and Russell, G. 1990. Searching on tagged corpora: linguistically motivated concordance analysis. In Proceedings of the Annual Conference of the UW Center for the New OED and Text Research, pp. 10–18.

Copyright information

© 1999 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Dagan, I., Church, K., Gale, W. (1999). Robust Bilingual Word Alignment for Machine Aided Translation. In: Armstrong, S., Church, K., Isabelle, P., Manzi, S., Tzoukermann, E., Yarowsky, D. (eds) Natural Language Processing Using Very Large Corpora. Text, Speech and Language Technology, vol 11. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2390-9_13

  • DOI: https://doi.org/10.1007/978-94-017-2390-9_13

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-90-481-5349-7

  • Online ISBN: 978-94-017-2390-9

  • eBook Packages: Springer Book Archive
