Abstract
We present a method to speed up the dynamic programming algorithms used for solving the HMM decoding and training problems for discrete time-independent HMMs. We discuss the application of our method to Viterbi’s decoding and training algorithms (IEEE Trans. Inform. Theory IT-13:260–269, 1967), as well as to the forward-backward and Baum-Welch (Inequalities 3:1–8, 1972) algorithms. Our approach is based on identifying repeated substrings in the observed input sequence. First, we show how to exploit repetitions of all sufficiently small substrings (this is similar to the Four Russians method). Then, we describe four algorithms based, respectively, on run length encoding (RLE), Lempel-Ziv (LZ78) parsing, grammar-based compression (SLP), and byte pair encoding (BPE). Compared to Viterbi’s algorithm, we achieve speedups of \(\Theta(\log n)\) using the Four Russians method, \(\Omega(\frac{r}{\log r})\) using RLE, \(\Omega(\frac{\log n}{k})\) using LZ78, \(\Omega(\frac{r}{k})\) using SLP, and \(\Omega(r)\) using BPE, where k is the number of hidden states, n is the length of the observed sequence, and r is its compression ratio (under each compression scheme). Our experimental results demonstrate that our new algorithms are indeed faster in practice. We also discuss a parallel implementation of our algorithms.
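The abstract does not spell out the mechanics, so the following is only a minimal sketch of the general idea of reusing work across repeated substrings, not the authors' implementation. It assumes the Viterbi recurrence is written as a chain of max-plus (log-space) matrix-vector products, cuts the observation sequence into fixed-length blocks, and builds one combined matrix per distinct block. All names (`step_matrix`, `viterbi_score_blocked`, `block_len`) and the toy model parameters are hypothetical, and only the optimal score is computed, not the state path.

```python
import math
from functools import reduce


def step_matrix(log_trans, log_emit, symbol):
    """M[i][j] = log P(j -> i) + log e_i(symbol): one Viterbi step as a matrix."""
    k = len(log_trans)
    return [[log_trans[j][i] + log_emit[i][symbol] for j in range(k)]
            for i in range(k)]


def maxplus_matmul(a, b):
    """Matrix product in the (max, +) semiring."""
    k = len(a)
    return [[max(a[i][m] + b[m][j] for m in range(k)) for j in range(k)]
            for i in range(k)]


def maxplus_matvec(a, v):
    """Matrix-vector product in the (max, +) semiring."""
    return [max(a[i][j] + v[j] for j in range(len(v))) for i in range(len(a))]


def viterbi_score(obs, log_trans, log_emit, log_init):
    """Plain Viterbi: one matrix-vector product per observed symbol."""
    v = list(log_init)
    for s in obs:
        v = maxplus_matvec(step_matrix(log_trans, log_emit, s), v)
    return max(v)


def viterbi_score_blocked(obs, log_trans, log_emit, log_init, block_len):
    """Same score, but each distinct block of symbols is multiplied out once."""
    cache = {}
    v = list(log_init)
    for start in range(0, len(obs), block_len):
        word = tuple(obs[start:start + block_len])
        if word not in cache:
            mats = [step_matrix(log_trans, log_emit, s) for s in word]
            # Later symbols are applied on the left: M(W) = M^{w_l} ... M^{w_1}.
            cache[word] = reduce(lambda acc, m: maxplus_matmul(m, acc), mats)
        v = maxplus_matvec(cache[word], v)
    return max(v), len(cache)


# Toy two-state model (made-up numbers, for illustration only).
trans = [[0.9, 0.1], [0.2, 0.8]]   # trans[j][i] = P(j -> i)
emit = [[0.7, 0.3], [0.1, 0.9]]    # emit[i][s] = P(symbol s | state i)
init = [0.5, 0.5]

log_trans = [[math.log(p) for p in row] for row in trans]
log_emit = [[math.log(p) for p in row] for row in emit]
log_init = [math.log(p) for p in init]

obs = [0, 1, 1] * 4                # a highly repetitive observation sequence
plain = viterbi_score(obs, log_trans, log_emit, log_init)
blocked, n_distinct = viterbi_score_blocked(obs, log_trans, log_emit, log_init, 3)
```

In this toy run the sequence contains a single distinct block of length 3, so the blocked variant builds one combined matrix and applies it four times. The two scores agree up to floating-point rounding because the max-plus product is associative. Whether this pays off depends on how often blocks repeat relative to the cost of building each combined matrix.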
References
Agazzi, O., Kuo, S.: HMM based optical character recognition in the presence of deterministic transformations. Pattern Recognit. 26, 1813–1826 (1993)
Apostolico, A., Landau, G.M., Skiena, S.: Matching for run length encoded strings. J. Complex. 15(1), 4–16 (1999)
Arbell, O., Landau, G.M., Mitchell, J.: Edit distance of run-length encoded strings. Inf. Process. Lett. 83(6), 307–314 (2002)
Arlazarov, V.L., Dinic, E.A., Kronrod, M.A., Faradzev, I.A.: On economic construction of the transitive closure of a directed graph. Sov. Math. Dokl. 11, 1209–1210 (1975)
Baum, L.E.: An inequality and associated maximization technique in statistical estimation for probabilistic functions of a Markov process. Inequalities 3, 1–8 (1972)
Amir, A., Benson, G., Farach, M.: Let sleeping files lie: pattern matching in Z-compressed files. J. Comput. Syst. Sci. 52(2), 299–307 (1996)
Bird, A.P.: CpG-rich islands as gene markers in the vertebrate nucleus. Trends Genet. 3, 342–347 (1987)
Brejova, B., Brown, D.G., Vinar, T.: Advances in hidden Markov models for sequence annotation. In: Mandoiu, I., Zelikovsky, A. (eds.) Bioinformatics Algorithms: Techniques and Applications, pp. 3–42. Wiley, New York (2008)
Buchsbaum, A.L., Giancarlo, R.: Algorithmic aspects in speech recognition: an introduction. ACM J. Exp. Algorithms 2, 1 (1997)
Bunke, H., Csirik, J.: An improved algorithm for computing the edit distance of run length coded strings. Inf. Process. Lett. 54, 93–96 (1995)
Cégielski, P., Guessarian, I., Lifshits, Y., Matiyasevich, Y.: Window subsequence problems for compressed texts. In: Proceedings of the 1st International Symposium Computer Science in Russia (CSR), pp. 127–136, 2006
Chan, T.M.: All-pairs shortest paths with real weights in O(n^3/log n) time. In: Proceedings of the 9th Workshop on Algorithms and Data Structures (WADS), pp. 318–324, 2005
Chan, T.M.: More algorithms for all-pairs shortest paths in weighted graphs. In: Proceedings of the 39th ACM Symposium on Theory of Computing (STOC), pp. 590–598, 2007
Churchill, G.A.: Hidden Markov chains and the analysis of genome structure. Comput. Chem. 16, 107–115 (1992)
Coppersmith, D., Winograd, S.: Matrix multiplication via arithmetical progressions. J. Symb. Comput. 9, 251–280 (1990)
Crochemore, M., Landau, G., Ziv-Ukelson, M.: A sub-quadratic sequence alignment algorithm for unrestricted cost matrices. In: Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 679–688, 2002
Durbin, R., Eddy, S., Krogh, A., Mitchison, G.: Biological Sequence Analysis. Cambridge University Press, Cambridge (1998)
Fettweis, G., Meyr, H.: High-rate Viterbi processor: a systolic array solution. IEEE J. Sel. Areas Commun. 8(8), 1520–1534 (1990)
Gage, P.: A new algorithm for data compression. C Users J. 12(2) (1994)
Henderson, J., Salzberg, S., Fasman, K.H.: Finding genes in DNA with a hidden Markov model. J. Comput. Biol. 4(2), 127–142 (1997)
Karkkainen, J., Ukkonen, E.: Lempel-Ziv parsing and sublinear-size index structures for string matching. In: Proceedings of the 3rd South American Workshop on String Processing (WSP), pp. 141–155, 1996
Karkkainen, J., Navarro, G., Ukkonen, E.: Approximate string matching over Ziv-Lempel compressed text. In: Proceedings of the 11th Annual Symposium On Combinatorial Pattern Matching (CPM), pp. 195–209, 2000
Kleinberg, J.: Bursty and hierarchical structure in streams. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 91–101, 2002
Krogh, A., Mian, I.S., Haussler, D.: A hidden Markov model that finds genes in E. coli DNA. Technical report, University of California Santa Cruz (1994)
Lifshits, Y.: Processing compressed texts: a tractability border. In: Proceedings of the 18th Annual Symposium On Combinatorial Pattern Matching (CPM), pp. 228–240, 2007
Lin, H.D., Messerschmitt, D.G.: Algorithms and architectures for concurrent Viterbi decoding. IEEE Int. Conf. Commun. 2, 836–840 (1989)
Makinen, V., Navarro, G., Ukkonen, E.: Approximate matching of run-length compressed strings. In: Proceedings of the 12th Annual Symposium On Combinatorial Pattern Matching (CPM), pp. 1–13, 1999
Manber, U.: A text compression scheme that allows fast searching directly in the compressed file. In: Proceedings of the 5th Annual Symposium On Combinatorial Pattern Matching (CPM), pp. 31–49, 2001
Manning, C., Schutze, H.: Statistical Natural Language Processing. MIT Press, Cambridge (1999)
Mitchell, J.: A geometric shortest path problem, with application to computing a longest common subsequence in run-length encoded strings. Technical Report, Dept. of Applied Mathematics, SUNY Stony Brook (1997)
Mozes, S., Weimann, O., Ziv-Ukelson, M.: Speeding up HMM decoding and training by exploiting sequence repetitions. In: Proc. 18th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 4–15, 2007
Navarro, G., Kida, T., Takeda, M., Shinohara, A., Arikawa, S.: Faster approximate string matching over compressed text. In: Proceedings of the Data Compression Conference (DCC), pp. 459–468, 2001
Pachter, L., Alexandersson, M., Cawley, S.: Applications of generalized pair hidden Markov models to alignment and gene finding problems. In: Proceedings of the 5th annual international conference on Computational biology (RECOMB), pp. 241–248, 2001
Pedersen, J.S., Hein, J.: Gene finding with a hidden Markov model of genome structure and evolution. Bioinformatics 19(2), 219–227 (2003)
Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci. 302(1–3), 211–222 (2003)
Shibata, Y., Kida, T., Fukamachi, S., Takeda, M., Shinohara, A., Shinohara, T., Arikawa, S.: Byte Pair encoding: A text compression scheme that accelerates pattern matching. Technical Report DOI-TR-161, Department of Informatics, Kyushu University (1999)
Shibata, Y., Kida, T., Fukamachi, S., Takeda, M., Shinohara, A., Shinohara, T., Arikawa, S.: Speeding up pattern matching by text compression. In: Lecture Notes in Computer Science, vol. 1767, pp. 306–315. Springer, Berlin (2000)
Siepel, A., Haussler, D.: Computational identification of evolutionarily conserved exons. In: Proceedings of the 8th annual international conference on research in computational molecular biology (RECOMB), pp. 177–186, 2004
Strassen, V.: Gaussian elimination is not optimal. Numer. Math. 13, 354–356 (1969)
Viterbi, A.: Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Trans. Inform. Theory IT-13, 260–269 (1967)
Ziv, J., Lempel, A.: On the complexity of finite sequences. IEEE Trans. Inf. Theory 22(1), 75–81 (1976)
Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23(3), 337–343 (1977)
Additional information
A preliminary version of this paper appeared in Proc. 18th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 4–15, 2007.
Y. Lifshits’ research was supported by the Center for the Mathematics of Information and the Lee Center for Advanced Networking.
S. Mozes’ work was conducted while visiting MIT.
About this article
Cite this article
Lifshits, Y., Mozes, S., Weimann, O. et al. Speeding Up HMM Decoding and Training by Exploiting Sequence Repetitions. Algorithmica 54, 379–399 (2009). https://doi.org/10.1007/s00453-007-9128-0
DOI: https://doi.org/10.1007/s00453-007-9128-0