Speeding Up HMM Decoding and Training by Exploiting Sequence Repetitions

Algorithmica

Abstract

We present a method to speed up the dynamic programming algorithms used for solving the HMM decoding and training problems for discrete time-independent HMMs. We discuss the application of our method to Viterbi’s decoding and training algorithms (IEEE Trans. Inform. Theory IT-13:260–269, 1967), as well as to the forward-backward and Baum-Welch (Inequalities 3:1–8, 1972) algorithms. Our approach is based on identifying repeated substrings in the observed input sequence. First, we show how to exploit repetitions of all sufficiently small substrings (this is similar to the Four Russians method). Then, we describe four algorithms, each based on a different compression scheme: run-length encoding (RLE), Lempel-Ziv (LZ78) parsing, grammar-based compression (SLP), and byte pair encoding (BPE). Compared to Viterbi’s algorithm, we achieve speedups of \(\Theta(\log n)\) using the Four Russians method, \(\Omega(\frac{r}{\log r})\) using RLE, \(\Omega(\frac{\log n}{k})\) using LZ78, \(\Omega(\frac{r}{k})\) using SLP, and \(\Omega(r)\) using BPE, where \(k\) is the number of hidden states, \(n\) is the length of the observed sequence, and \(r\) is its compression ratio (under each compression scheme). Our experimental results demonstrate that our new algorithms are indeed faster in practice. We also discuss a parallel implementation of our algorithms.
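
To illustrate how repetitions in the observed sequence can be exploited, the following is a minimal Python sketch, not the authors' implementation. It assumes the standard reformulation of Viterbi's recurrence as a chain of max-times matrix-vector products, one matrix per observed symbol, and handles only the run-length case: a run of identical symbols is applied as a single matrix power computed by repeated squaring. All function names, the toy two-state model, and the observation sequence are hypothetical, and the code returns only the probability of the best state path, not the path itself.

```python
import math
from itertools import groupby

def maxtimes_matvec(M, v):
    """(M (x) v)[i] = max_j M[i][j] * v[j]."""
    return [max(m * x for m, x in zip(row, v)) for row in M]

def maxtimes_matmul(A, B):
    """(A (x) B)[i][j] = max_l A[i][l] * B[l][j]."""
    k = len(A)
    return [[max(A[i][l] * B[l][j] for l in range(k)) for j in range(k)]
            for i in range(k)]

def maxtimes_power(M, p):
    """Max-times product of p >= 1 copies of M, by repeated squaring."""
    result, base = None, M
    while p:
        if p & 1:
            result = base if result is None else maxtimes_matmul(result, base)
        p >>= 1
        if p:
            base = maxtimes_matmul(base, base)
    return result

def symbol_matrix(trans, emit, sym):
    """M[i][j] = emit[i][sym] * trans[j][i]: step from state j to i, emit sym in i."""
    k = len(trans)
    return [[emit[i][sym] * trans[j][i] for j in range(k)] for i in range(k)]

def viterbi_prob(trans, emit, init, obs):
    """Plain Viterbi: one max-times matrix-vector product per observed symbol."""
    v = [init[i] * emit[i][obs[0]] for i in range(len(init))]
    for sym in obs[1:]:
        v = maxtimes_matvec(symbol_matrix(trans, emit, sym), v)
    return max(v)

def viterbi_prob_rle(trans, emit, init, obs):
    """Same value, but each run of equal symbols is applied as one matrix power."""
    v = [init[i] * emit[i][obs[0]] for i in range(len(init))]
    first = True
    for sym, group in groupby(obs):
        # The first symbol is already absorbed into the initial vector.
        steps = len(list(group)) - (1 if first else 0)
        first = False
        if steps:
            power = maxtimes_power(symbol_matrix(trans, emit, sym), steps)
            v = maxtimes_matvec(power, v)
    return max(v)

# Hypothetical toy model: 2 hidden states, 2 symbols.
trans = [[0.9, 0.1], [0.2, 0.8]]   # trans[j][i] = P(next state i | state j)
emit = [[0.7, 0.3], [0.1, 0.9]]    # emit[i][s] = P(symbol s | state i)
init = [0.5, 0.5]
obs = [0] * 6 + [1] * 5 + [0] * 3

p_plain = viterbi_prob(trans, emit, init, obs)
p_rle = viterbi_prob_rle(trans, emit, init, obs)
print(math.isclose(p_plain, p_rle, rel_tol=1e-9))
```

In this sketch a run of length \(\ell\) costs about \(\log \ell\) matrix-matrix products instead of \(\ell\) matrix-vector products, so it only pays off when runs are long relative to the number of states. The two functions agree because max-times matrix multiplication is associative; floating-point results may differ in the last digits, hence the tolerance-based comparison.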

References

  1. Agazzi, O., Kuo, S.: HMM based optical character recognition in the presence of deterministic transformations. Pattern Recognit. 26, 1813–1826 (1993)

  2. Apostolico, A., Landau, G.M., Skiena, S.: Matching for run length encoded strings. J. Complex. 15(1), 4–16 (1999)

  3. Arbell, O., Landau, G.M., Mitchell, J.: Edit distance of run-length encoded strings. Inf. Process. Lett. 83(6), 307–314 (2002)

  4. Arlazarov, V.L., Dinic, E.A., Kronrod, M.A., Faradzev, I.A.: On economic construction of the transitive closure of a directed graph. Sov. Math. Dokl. 11, 1209–1210 (1975)

  5. Baum, L.E.: An inequality and associated maximization technique in statistical estimation for probabilistic functions of a Markov process. Inequalities 3, 1–8 (1972)

  6. Benson, A., Amir, G., Farach, M.: Let sleeping files lie: pattern matching in Z-compressed files. J. Comput. Syst. Sci. 52(2), 299–307 (1996)

  7. Bird, A.P.: CpG-rich islands as gene markers in the vertebrate nucleus. Trends Genet. 3, 342–347 (1987)

  8. Brejova, B., Brown, D.G., Vinar, T.: Advances in hidden Markov models for sequence annotation. In: Mandoiu, I., Zelikovsky, A. (eds.) Bioinformatics Algorithms: Techniques and Applications, pp. 3–42. Wiley, New York (2008)

  9. Buchsbaum, A.L., Giancarlo, R.: Algorithmic aspects in speech recognition: an introduction. ACM J. Exp. Algorithms 2, 1 (1997)

  10. Bunke, H., Csirik, J.: An improved algorithm for computing the edit distance of run length coded strings. Inf. Process. Lett. 54, 93–96 (1995)

  11. Cégielski, P., Guessarian, I., Lifshits, Y., Matiyasevich, Y.: Window subsequence problems for compressed texts. In: Proceedings of the 1st International Symposium Computer Science in Russia (CSR), pp. 127–136, 2006

  12. Chan, T.M.: All-pairs shortest paths with real weights in O(n^3/log n) time. In: Proceedings of the 9th Workshop on Algorithms and Data Structures (WADS), pp. 318–324, 2005

  13. Chan, T.M.: More algorithms for all-pairs shortest paths in weighted graphs. In: Proceedings of the 39th ACM Symposium on Theory of Computing (STOC), pp. 590–598, 2007

  14. Churchill, G.A.: Hidden Markov chains and the analysis of genome structure. Comput. Chem. 16, 107–115 (1992)

  15. Coppersmith, D., Winograd, S.: Matrix multiplication via arithmetical progressions. J. Symb. Comput. 9, 251–280 (1990)

  16. Crochemore, M., Landau, G., Ziv-Ukelson, M.: A sub-quadratic sequence alignment algorithm for unrestricted cost matrices. In: Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 679–688, 2002

  17. Durbin, R., Eddy, S., Krogh, A., Mitchison, G.: Biological Sequence Analysis. Cambridge University Press, Cambridge (1998)

  18. Fettweis, G., Meyr, H.: High-rate Viterbi processor: a systolic array solution. IEEE J. Sel. Areas Commun. 8(8), 1520–1534 (1990)

  19. Gage, P.: A new algorithm for data compression. C Users J. 12(2) (1994)

  20. Henderson, J., Salzberg, S., Fasman, K.H.: Finding genes in DNA with a hidden Markov model. J. Comput. Biol. 4(2), 127–142 (1997)

  21. Karkkainen, J., Ukkonen, E.: Lempel-Ziv parsing and sublinear-size index structures for string matching. In: Proceedings of the 3rd South American Workshop on String Processing (WSP), pp. 141–155, 1996

  22. Karkkainen, J., Navarro, G., Ukkonen, E.: Approximate string matching over Ziv-Lempel compressed text. In: Proceedings of the 11th Annual Symposium On Combinatorial Pattern Matching (CPM), pp. 195–209, 2000

  23. Kleinberg, J.: Bursty and hierarchical structure in streams. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 91–101, 2002

  24. Krogh, A., Mian, I.S., Haussler, D.: A hidden Markov model that finds genes in E. coli DNA. Technical report, University of California Santa Cruz (1994)

  25. Lifshits, Y.: Processing compressed texts: a tractability border. In: Proceedings of the 18th Annual Symposium On Combinatorial Pattern Matching (CPM), pp. 228–240, 2007

  26. Lin, H.D., Messerschmitt, D.G.: Algorithms and architectures for concurrent Viterbi decoding. IEEE Int. Conf. Commun. 2, 836–840 (1989)

  27. Makinen, V., Navarro, G., Ukkonen, E.: Approximate matching of run-length compressed strings. In: Proceedings of the 12th Annual Symposium On Combinatorial Pattern Matching (CPM), pp. 1–13, 2001

  28. Manber, U.: A text compression scheme that allows fast searching directly in the compressed file. In: Proceedings of the 5th Annual Symposium On Combinatorial Pattern Matching (CPM), pp. 31–49, 1994

  29. Manning, C., Schutze, H.: Foundations of Statistical Natural Language Processing. MIT Press, Cambridge (1999)

  30. Mitchell, J.: A geometric shortest path problem, with application to computing a longest common subsequence in run-length encoded strings. Technical Report, Dept. of Applied Mathematics, SUNY Stony Brook (1997)

  31. Mozes, S., Weimann, O., Ziv-Ukelson, M.: Speeding up HMM decoding and training by exploiting sequence repetitions. In: Proc. 18th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 4–15, 2007

  32. Navarro, G., Kida, T., Takeda, M., Shinohara, A., Arikawa, S.: Faster approximate string matching over compressed text. In: Proceedings of the Data Compression Conference (DCC), pp. 459–468, 2001

  33. Pachter, L., Alexandersson, M., Cawley, S.: Applications of generalized pair hidden Markov models to alignment and gene finding problems. In: Proceedings of the 5th annual international conference on Computational biology (RECOMB), pp. 241–248, 2001

  34. Pedersen, J.S., Hein, J.: Gene finding with a hidden Markov model of genome structure and evolution. Bioinformatics 19(2), 219–227 (2003)

  35. Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci. 302(1–3), 211–222 (2003)

  36. Shibata, Y., Kida, T., Fukamachi, S., Takeda, M., Shinohara, A., Shinohara, T., Arikawa, S.: Byte Pair encoding: A text compression scheme that accelerates pattern matching. Technical Report DOI-TR-161, Department of Informatics, Kyushu University (1999)

  37. Shibata, Y., Kida, T., Fukamachi, S., Takeda, M., Shinohara, A., Shinohara, T., Arikawa, S.: Speeding up pattern matching by text compression. In: Lecture Notes in Computer Science, vol. 1767, pp. 306–315. Springer, Berlin (2000)

  38. Siepel, A., Haussler, D.: Computational identification of evolutionarily conserved exons. In: Proceedings of the 8th annual international conference on research in computational molecular biology (RECOMB), pp. 177–186, 2004

  39. Strassen, V.: Gaussian elimination is not optimal. Numer. Math. 13, 354–356 (1969)

  40. Viterbi, A.: Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Trans. Inform. Theory IT-13, 260–269 (1967)

  41. Ziv, J., Lempel, A.: On the complexity of finite sequences. IEEE Trans. Inf. Theory 22(1), 75–81 (1976)

  42. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23(3), 337–343 (1977)

Author information

Corresponding author

Correspondence to Michal Ziv-Ukelson.

Additional information

A preliminary version of this paper appeared in Proc. 18th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 4–15, 2007.

Y. Lifshits’ research was supported by the Center for the Mathematics of Information and the Lee Center for Advanced Networking.

S. Mozes’ work was conducted while visiting MIT.

About this article

Cite this article

Lifshits, Y., Mozes, S., Weimann, O. et al. Speeding Up HMM Decoding and Training by Exploiting Sequence Repetitions. Algorithmica 54, 379–399 (2009). https://doi.org/10.1007/s00453-007-9128-0

  • DOI: https://doi.org/10.1007/s00453-007-9128-0
