A binary decision diagram based approach for mining frequent subsequences

  • Regular Paper
Knowledge and Information Systems

Abstract

Sequential pattern mining is an important problem in data mining. State-of-the-art techniques for mining sequential patterns, such as frequent subsequences, are often based on the pattern-growth approach, which recursively projects conditional databases. Explicitly creating database projections is thought to be a major computational bottleneck, but we show in this paper that it can be beneficial when an appropriate data structure is used. Our technique represents the sequence database as a canonical directed acyclic graph, which can be encoded as a binary decision diagram (BDD). We introduce a new type of BDD, the sequence BDD (SeqBDD), and show how it can be used to mine frequent subsequences efficiently. A novel feature of the SeqBDD is its ability to share results between similar intermediate computations and so avoid redundant computation. We perform an experimental study comparing the SeqBDD technique with existing pattern-growth techniques that are based on other data structures, such as prefix trees. Our results show that a SeqBDD can be half as large as a prefix tree, especially when many similar sequences exist. In terms of mining time, it can be substantially more efficient when the support threshold is low, the number of patterns is large, or the input sequences are long and highly similar.
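
To make the pattern-growth idea in the abstract concrete, the following is a minimal sketch of mining frequent subsequences by recursively projecting conditional databases. It is not the SeqBDD algorithm of the paper: it stores each projection as a plain Python list of strings, and the names `project` and `mine_frequent_subsequences`, along with the toy database, are assumptions introduced only for illustration.

```python
from collections import defaultdict


def project(database, item):
    """Build the item-projected database (hypothetical helper).

    For every sequence that contains `item`, keep the suffix that follows
    its first occurrence. Sequences without `item` are dropped.
    """
    return [seq[seq.index(item) + 1:] for seq in database if item in seq]


def mine_frequent_subsequences(database, min_support, prefix=""):
    """Return {pattern: support} for all frequent subsequences.

    `database` is a list of strings, each character being one item.
    Support is the number of input sequences containing the pattern
    as a (not necessarily contiguous) subsequence.
    """
    # Count, for each item, how many sequences in this projection contain it.
    counts = defaultdict(int)
    for seq in database:
        for item in set(seq):
            counts[item] += 1

    results = {}
    for item in sorted(counts):
        support = counts[item]
        if support < min_support:
            continue
        pattern = prefix + item
        results[pattern] = support
        # Grow the pattern by recursing on the explicitly built projection.
        results.update(
            mine_frequent_subsequences(project(database, item), min_support, pattern)
        )
    return results


if __name__ == "__main__":
    toy_database = ["abcb", "acb", "bc"]  # illustrative data, not from the paper
    for pattern, support in mine_frequent_subsequences(toy_database, 2).items():
        print(pattern, support)
```

In this sketch every recursive call copies suffixes into a fresh list, which is the explicit projection cost the abstract refers to. The abstract's claim is that a shared, canonical graph representation can make such projections worthwhile by reusing results across similar intermediate databases, something this list-based version does not attempt.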



Corresponding author

Correspondence to James Bailey.

Cite this article

Loekito, E., Bailey, J. & Pei, J. A binary decision diagram based approach for mining frequent subsequences. Knowl Inf Syst 24, 235–268 (2010). https://doi.org/10.1007/s10115-009-0252-9
