Speeding Up HMM Decoding and Training by Exploiting Sequence Repetitions

Lifshits, Yury; Mozes, Shay; Weimann, Oren; Ziv-Ukelson, Michal

doi:10.1007/s00453-007-9128-0

Published: 28 November 2007

Algorithmica, Volume 54, pages 379–399 (2009)

Abstract

We present a method to speed up the dynamic programming algorithms used for solving the HMM decoding and training problems for discrete time-independent HMMs. We discuss the application of our method to Viterbi’s decoding and training algorithms (IEEE Trans. Inform. Theory IT-13:260–269, 1967), as well as to the forward-backward and Baum-Welch (Inequalities 3:1–8, 1972) algorithms. Our approach is based on identifying repeated substrings in the observed input sequence. Initially, we show how to exploit repetitions of all sufficiently small substrings (this is similar to the Four Russians method). Then, we describe four algorithms based alternatively on run length encoding (RLE), Lempel-Ziv (LZ78) parsing, grammar-based compression (SLP), and byte pair encoding (BPE). Compared to Viterbi’s algorithm, we achieve speedups of \(\Theta(\log n)\) using the Four Russians method, \(\Omega(\frac{r}{\log r})\) using RLE, \(\Omega(\frac{\log n}{k})\) using LZ78, \(\Omega(\frac{r}{k})\) using SLP, and \(\Omega(r)\) using BPE, where \(k\) is the number of hidden states, \(n\) is the length of the observed sequence and \(r\) is its compression ratio (under each compression scheme). Our experimental results demonstrate that our new algorithms are indeed faster in practice. We also discuss a parallel implementation of our algorithms.
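To make the idea of exploiting repetitions concrete, the following Python sketch illustrates one way the RLE case could work. It is not taken from the article: it assumes the Viterbi recurrence is written as a chain of (max, +) matrix-vector products in log space, so that a run of a repeated symbol can be handled with a single matrix power computed by repeated squaring. All names (`step_matrices`, `maxplus_power`, `viterbi_score_rle`) and the toy two-state model are hypothetical, and the sketch computes only the score of the best path, not the path itself.

```python
import math
from itertools import groupby

def maxplus_matmul(P, Q):
    """Product of two k x k matrices in the (max, +) semiring."""
    k = len(P)
    return [[max(P[i][m] + Q[m][j] for m in range(k)) for j in range(k)]
            for i in range(k)]

def maxplus_power(M, e):
    """M raised to the power e >= 1 in (max, +), by repeated squaring."""
    result, base = None, M
    while e > 0:
        if e & 1:
            result = base if result is None else maxplus_matmul(result, base)
        e >>= 1
        if e:
            base = maxplus_matmul(base, base)
    return result

def vec_times(v, M):
    """Row vector times matrix in (max, +): one Viterbi step."""
    k = len(v)
    return [max(v[i] + M[i][j] for i in range(k)) for j in range(k)]

def step_matrices(log_trans, log_emit, alphabet_size):
    """One matrix per symbol s: entry [i][j] = log P(i -> j) + log P(s | j)."""
    k = len(log_trans)
    return [[[log_trans[i][j] + log_emit[j][s] for j in range(k)]
             for i in range(k)] for s in range(alphabet_size)]

def viterbi_score_plain(log_init, mats, obs):
    """Baseline: one O(k^2) step per observed symbol."""
    v = list(log_init)
    for s in obs:
        v = vec_times(v, mats[s])
    return max(v)

def viterbi_score_rle(log_init, mats, obs):
    """One matrix power per run of equal symbols instead of one step per symbol."""
    v = list(log_init)
    for s, run in groupby(obs):
        length = sum(1 for _ in run)
        v = vec_times(v, maxplus_power(mats[s], length))
    return max(v)

# Toy model (assumed for illustration): 2 hidden states, 2 symbols.
# log_init is treated as the state distribution before the first observation.
log_trans = [[math.log(0.9), math.log(0.1)], [math.log(0.2), math.log(0.8)]]
log_emit = [[math.log(0.7), math.log(0.3)], [math.log(0.1), math.log(0.9)]]
log_init = [math.log(0.5), math.log(0.5)]
mats = step_matrices(log_trans, log_emit, 2)
obs = [0] * 50 + [1] * 30 + [0] * 20

plain = viterbi_score_plain(log_init, mats, obs)
rle = viterbi_score_rle(log_init, mats, obs)
```

In this sketch a run of length \(\ell\) costs roughly \(O(k^3 \log \ell)\) for the matrix power plus \(O(k^2)\) to apply it, versus \(O(k^2 \ell)\) for the symbol-by-symbol baseline, so it pays off only when runs are long relative to \(k \log \ell\). Recovering the most likely state sequence would need additional bookkeeping that is omitted here.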

References

Agazzi, O., Kuo, S.: HMM based optical character recognition in the presence of deterministic transformations. Pattern Recognit. 26, 1813–1826 (1993)
Apostolico, A., Landau, G.M., Skiena, S.: Matching for run length encoded strings. J. Complex. 15(1), 4–16 (1999)
Arbell, O., Landau, G.M., Mitchell, J.: Edit distance of run-length encoded strings. Inf. Process. Lett. 83(6), 307–314 (2002)
Arlazarov, V.L., Dinic, E.A., Kronrod, M.A., Faradzev, I.A.: On economic construction of the transitive closure of a directed graph. Sov. Math. Dokl. 11, 1209–1210 (1975)
Baum, L.E.: An inequality and associated maximization technique in statistical estimation for probabilistic functions of a Markov process. Inequalities 3, 1–8 (1972)
Amir, A., Benson, G., Farach, M.: Let sleeping files lie: pattern matching in Z-compressed files. J. Comput. Syst. Sci. 52(2), 299–307 (1996)
Bird, A.P.: CpG-rich islands as gene markers in the vertebrate nucleus. Trends Genet. 3, 342–347 (1987)
Brejova, B., Brown, D.G., Vinar, T.: Advances in hidden Markov models for sequence annotation. In: Mandoiu, I., Zelikowski, A. (eds.) Bioinformatics Algorithms: Techniques and Applications, pp. 3–42. Wiley, New York (2008)
Buchsbaum, A.L., Giancarlo, R.: Algorithmic aspects in speech recognition: an introduction. ACM J. Exp. Algorithms 2, 1 (1997)
Bunke, H., Csirik, J.: An improved algorithm for computing the edit distance of run length coded strings. Inf. Process. Lett. 54, 93–96 (1995)
Cégielski, P., Guessarian, I., Lifshits, Y., Matiyasevich, Y.: Window subsequence problems for compressed texts. In: Proceedings of the 1st International Symposium Computer Science in Russia (CSR), pp. 127–136, 2006
Chan, T.M.: All-pairs shortest paths with real weights in O(n ³/log n) time. In: Proceedings of the 9th Workshop on Algorithms and Data Structures (WADS), pp. 318–324, 2005
Chan, T.M.: More algorithms for all-pairs shortest paths in weighted graphs. In: Proceedings of the 39th ACM Symposium on Theory of Computing (STOC), pp. 590–598, 2007
Churchill, G.A.: Hidden Markov chains and the analysis of genome structure. Comput. Chem. 16, 107–115 (1992)
Coppersmith, D., Winograd, S.: Matrix multiplication via arithmetical progressions. J. Symb. Comput. 9, 251–280 (1990)
Crochemore, M., Landau, G., Ziv-Ukelson, M.: A sub-quadratic sequence alignment algorithm for unrestricted cost matrices. In: Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 679–688, 2002
Durbin, R., Eddy, S., Krogh, A., Mitchison, G.: Biological Sequence Analysis. Cambridge University Press, Cambridge (1998)
Fettweis, G., Meyr, H.: High-rate viterbi processor: a systolic array solution. IEEE J. Sel. Areas Commun. 8(8), 1520–1534 (1990)
Gage, P.: A new algorithm for data compression. C Users J. 12(2), 1994
Henderson, J., Salzberg, S., Fasman, K.H.: Finding genes in DNA with a hidden Markov model. J. Comput. Biol. 4(2), 127–142 (1997)
Karkkainen, J., Ukkonen, E.: Lempel-Ziv parsing and sublinear-size index structures for string matching. In: Proceedings of the 3rd South American Workshop on String Processing (WSP), pp. 141–155, 1996
Karkkainen, J., Navarro, G., Ukkonen, E.: Approximate string matching over Ziv-Lempel compressed text. In: Proceedings of the 11th Annual Symposium On Combinatorial Pattern Matching (CPM), pp. 195–209, 2000
Kleinberg, J.: Bursty and hierarchical structure in streams. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 91–101, 2002
Krogh, A., Mian, I.S., Haussler, D.: A hidden Markov model that finds genes in E. Coli DNA. Technical report, University of California Santa Cruz (1994)
Lifshits, Y.: Processing compressed texts: a tractability border. In: Proceedings of the 18th Annual Symposium On Combinatorial Pattern Matching (CPM), pp. 228–240, 2007
Lin, H.D., Messerschmitt, D.G.: Algorithms and architectures for concurrent viterbi decoding. IEEE Int. Conf. Commun. 2, 836–840 (1989)
Makinen, V., Navarro, G., Ukkonen, E.: Approximate matching of run-length compressed strings. In: Proceedings of the 12th Annual Symposium On Combinatorial Pattern Matching (CPM), pp. 1–13, 1999
Manber, U.: A text compression scheme that allows fast searching directly in the compressed file. In: Proceedings of the 5th Annual Symposium On Combinatorial Pattern Matching (CPM), pp. 31–49, 2001
Manning, C., Schutze, H.: Statistical Natural Language Processing. MIT Press, Cambridge (1999)
Mitchell, J.: A geometric shortest path problem, with application to computing a longest common subsequence in run-length encoded strings. Technical Report, Dept. of Applied Mathematics, SUNY Stony Brook (1997)
Mozes, S., Weimann, O., Ziv-Ukelson, M.: Speeding up HMM decoding and training by exploiting sequence repetitions. In: Proc. 18th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 4–15, 2007
Navarro, G., Kida, T., Takeda, M., Shinohara, A., Arikawa, S.: Faster approximate string matching over compressed text. In: Proceedings of the Data Compression Conference (DCC), pp. 459–468, 2001
Pachter, L., Alexandersson, M., Cawley, S.: Applications of generalized pair hidden Markov models to alignment and gene finding problems. In: Proceedings of the 5th annual international conference on Computational biology (RECOMB), pp. 241–248, 2001
Pedersen, J.S., Hein, J.: Gene finding with a hidden Markov model of genome structure and evolution. Bioinformatics 19(2), 219–227 (2003)
Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci. 302(1–3), 211–222 (2003)
Shibata, Y., Kida, T., Fukamachi, S., Takeda, M., Shinohara, A., Shinohara, T., Arikawa, S.: Byte Pair encoding: A text compression scheme that accelerates pattern matching. Technical Report DOI-TR-161, Department of Informatics, Kyushu University (1999)
Shibata, Y., Kida, T., Fukamachi, S., Takeda, M., Shinohara, A., Shinohara, T., Arikawa, S.: Speeding up pattern matching by text compression. In: Lecture Notes in Computer Science, vol. 1767, pp. 306–315. Springer, Berlin (2000)
Siepel, A., Haussler, D.: Computational identification of evolutionarily conserved exons. In: Proceedings of the 8th annual international conference on research in computational molecular biology (RECOMB), pp. 177–186, 2004
Strassen, V.: Gaussian elimination is not optimal. Numer. Math. 13, 354–356 (1969)
Viterbi, A.: Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Trans. Inform. Theory IT-13, 260–269 (1967)
Ziv, J., Lempel, A.: On the complexity of finite sequences. IEEE Trans. Inf. Theory 22(1), 75–81 (1976)
Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23(3), 337–343 (1977)

Author information

Authors and Affiliations

California Institute of Technology, 1200 E. California Blvd., Pasadena, CA, 91125, USA
Yury Lifshits
Department of Computer Science, Brown University, Providence, RI, 02912-1910, USA
Shay Mozes
MIT Computer Science and Artificial Intelligence Laboratory, 32 Vassar Street, Cambridge, MA, 02139, USA
Oren Weimann
Computer Science Department, Ben Gurion University of the Negev, Beer-Sheva, 84105, Israel
Michal Ziv-Ukelson

Corresponding author

Correspondence to Michal Ziv-Ukelson.

Additional information

A preliminary version of this paper appeared in Proc. 18th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 4–15, 2007.

Y. Lifshits’ research was supported by the Center for the Mathematics of Information and the Lee Center for Advanced Networking.

S. Mozes’ work was conducted while visiting MIT.

About this article

Cite this article

Lifshits, Y., Mozes, S., Weimann, O. et al. Speeding Up HMM Decoding and Training by Exploiting Sequence Repetitions. Algorithmica 54, 379–399 (2009). https://doi.org/10.1007/s00453-007-9128-0

Received: 10 June 2007
Accepted: 05 November 2007
Published: 28 November 2007
Issue Date: July 2009
DOI: https://doi.org/10.1007/s00453-007-9128-0
