A fast and low idle time method for mining frequent patterns in distributed and many-task computing environments

Lin, Chun-Cheng; Chung, Sheng-Hao; Chen, Ju-Chin; Yu, Yuan-Tse; Lin, Kawuu W.

doi:10.1007/s10619-018-7221-9

Published: 26 March 2018

Volume 36, pages 613–641 (2018)

Distributed and Parallel Databases

Chun-Cheng Lin¹,
Sheng-Hao Chung¹,
Ju-Chin Chen²,
Yuan-Tse Yu³ &
Kawuu W. Lin ORCID: orcid.org/0000-0002-1669-1008²

Abstract

Association rule mining has attracted much attention among data mining topics because it has been successfully applied in various fields to find associations between purchased items by identifying frequent patterns (FPs). Databases are now huge, ranging in size from terabytes to petabytes. Although past studies can effectively discover FPs to deduce association rules, execution efficiency remains a critical problem, particularly for big data. Progressive size working set (PSWS) and parallel FP-growth (PFP) are state-of-the-art methods that have been applied successfully to parallel and distributed computing technology to improve mining processing time in many-task computing, thereby bridging the gap between high-throughput and high-performance computing. However, such methods cannot begin mining before obtaining a complete FP-tree or the corresponding subdatabase, which causes high idle time on computing nodes. We propose a method that can begin mining when only a small part of an FP-tree has been received. This reduces the idle time of computing nodes and thus effectively reduces the time required for mining. An empirical evaluation shows that the proposed method is faster than PSWS and PFP.
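For readers unfamiliar with the FP-tree that the abstract refers to, the following is a minimal background sketch of the standard single-machine construction (two passes over the transactions) and of the prefix paths that FP-growth mines for one item. It is not the method proposed in this article, nor PSWS or PFP. The names `FPNode`, `build_fp_tree`, `conditional_pattern_base`, and the toy transactions are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


class FPNode:
    """One node of an FP-tree (hypothetical helper class)."""

    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_support):
    """Build an FP-tree and a header table mapping item -> list of nodes."""
    # Pass 1: count the support of every item.
    support = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in support.items() if c >= min_support}

    root = FPNode(None, None)
    header = defaultdict(list)

    # Pass 2: insert each transaction, keeping only frequent items,
    # ordered by descending support (ties broken by item name).
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header


def conditional_pattern_base(header, item):
    """Return the prefix paths (with counts) that lead to `item`."""
    base = []
    for node in header[item]:
        path = []
        parent = node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        base.append((path[::-1], node.count))
    return base


# Toy data, assumed for illustration.
transactions = [
    ["a", "b"],
    ["b", "c", "d"],
    ["a", "c", "d", "e"],
    ["a", "d", "e"],
    ["a", "b", "c"],
]
root, header = build_fp_tree(transactions, min_support=2)
base_e = conditional_pattern_base(header, "e")
```

In this sketch the whole tree must exist before `conditional_pattern_base` is called. According to the abstract, PSWS and PFP likewise wait for a complete FP-tree or subdatabase, and that waiting is the source of idle time the proposed method aims to reduce.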

Acknowledgement

This work was supported by the Ministry of Science and Technology of Taiwan, R.O.C., under Grant Nos. MOST 104-2221-E-151-055 and 105-2221-E-151-056.

Author information

Authors and Affiliations

Department of Industrial Engineering and Management, National Chiao Tung University, Hsinchu, Taiwan
Chun-Cheng Lin & Sheng-Hao Chung
Department of Computer Science and Information Engineering, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan
Ju-Chin Chen & Kawuu W. Lin
Department of Software Engineering and Management, National Kaohsiung Normal University, Kaohsiung, Taiwan
Yuan-Tse Yu

Corresponding author

Correspondence to Kawuu W. Lin.

About this article

Cite this article

Lin, CC., Chung, SH., Chen, JC. et al. A fast and low idle time method for mining frequent patterns in distributed and many-task computing environments. Distrib Parallel Databases 36, 613–641 (2018). https://doi.org/10.1007/s10619-018-7221-9

Published: 26 March 2018
Issue Date: December 2018
DOI: https://doi.org/10.1007/s10619-018-7221-9
