
Iterative sampling based frequent itemset mining for big data

  • Original Article
International Journal of Machine Learning and Cybernetics

Abstract

Frequent pattern mining has attracted extensive research interest over the past two decades, including mining frequent itemsets from transactions, extracting frequent sequences from bio-arrays, and detecting common subgraphs in molecular structures. In the era of big data, the explosive growth in data volume brings new challenges to frequent pattern mining: (1) Space complexity: the input data, the intermediate results, and the output patterns can each be too large to fit into memory, which prevents many algorithms from executing; (2) Time complexity: many existing approaches rely on exhaustive search or complicated data structures to mine frequent patterns, which proves inapplicable to big data. To deal with these two challenges, we propose ISbFIM, an Iterative Sampling based Frequent Itemset Mining method. Rather than processing the entire data set at once, ISbFIM samples computationally manageable subsets and extracts frequent itemsets from these subsets. By repeating this process a sufficient number of times, we can guarantee both theoretically and empirically that the frequent itemsets can be enumerated without running into a combinatorial explosion. ISbFIM can be easily parallelized and applied to mine itemsets, sequences, or structures. We implement a Map-Reduce version of ISbFIM to demonstrate its scalability on big data.
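The sample-mine-repeat idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' ISbFIM implementation: the function names, the brute-force per-sample miner, the `max_size` cap, and the final verification pass over the full data are all assumptions made for illustration, since the abstract does not specify these details.

```python
import random
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force miner (assumed stand-in for any in-memory miner):
    counts every itemset of up to max_size items and keeps those whose
    relative support in `transactions` is at least min_support."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))  # canonical order so tuples are comparable
        for k in range(1, max_size + 1):
            counts.update(combinations(items, k))
    threshold = min_support * len(transactions)
    return {s for s, c in counts.items() if c >= threshold}


def iterative_sampling_fim(transactions, min_support, sample_size,
                           iterations, seed=0):
    """Hypothetical sketch: mine small random samples repeatedly,
    pool the itemsets found, then verify them against the full data."""
    rng = random.Random(seed)
    candidates = set()
    for _ in range(iterations):
        # Each sample is small enough to mine in memory.
        sample = rng.sample(transactions, sample_size)
        candidates |= frequent_itemsets(sample, min_support)

    # Assumed verification step: one counting pass over the full data
    # removes itemsets that were frequent only by sampling chance.
    counts = Counter()
    for t in transactions:
        items = set(t)
        for c in candidates:
            if items.issuperset(c):
                counts[c] += 1
    threshold = min_support * len(transactions)
    return {c: counts[c] for c in candidates if counts[c] >= threshold}
```

Calling `iterative_sampling_fim(data, 0.5, sample_size=5, iterations=20)` on a list of transactions returns a dictionary mapping each verified itemset (a sorted tuple of items) to its support count in the full data. Because the iterations are independent of one another, each one could in principle run as a separate map task, with the pooling and verification done in a reduce step.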


Notes

  1. http://www.comscore.com/Insights/Press_Releases/2014/2/comScore_Releases_January_2014_US_Search_Engine_Rankings.

  2. http://fimi.ua.ac.be/data/.

  3. http://hadoop.apache.org/.


Corresponding author

Correspondence to Xian Wu.


Cite this article

Wu, X., Fan, W., Peng, J. et al. Iterative sampling based frequent itemset mining for big data. Int. J. Mach. Learn. & Cyber. 6, 875–882 (2015). https://doi.org/10.1007/s13042-015-0345-6


  • DOI: https://doi.org/10.1007/s13042-015-0345-6
