Research in speech recognition and machine translation is boosting the use of large-scale n-gram language models. We present an open source toolkit that makes it possible to efficiently handle language models with billions of n-grams on conventional machines. The IRSTLM toolkit supports distribution of n-gram collection and smoothing over a computer cluster, language model compression through probability quantization, and lazy loading of huge language models from disk. IRSTLM has so far been successfully deployed with the Moses toolkit for statistical machine translation and with the FBK-irst speech recognition system. The efficiency of the tool is reported on a speech transcription task of Italian political speeches, using a language model of 1.1 billion four-grams.
Cite as: Federico, M., Bertoldi, N., Cettolo, M. (2008) IRSTLM: an open source toolkit for handling large scale language models. Proc. Interspeech 2008, 1618-1621, doi: 10.21437/Interspeech.2008-271
@inproceedings{federico08_interspeech,
  author={Marcello Federico and Nicola Bertoldi and Mauro Cettolo},
  title={{IRSTLM: an open source toolkit for handling large scale language models}},
  year=2008,
  booktitle={Proc. Interspeech 2008},
  pages={1618--1621},
  doi={10.21437/Interspeech.2008-271}
}