Abstract
Association rule mining has attracted much attention in data mining because it has been applied successfully in various fields to find associations between purchased items by identifying frequent patterns (FPs). Databases are now huge, ranging in size from terabytes to petabytes. Although past studies can discover FPs effectively and deduce association rules from them, execution efficiency remains a critical problem, particularly for big data. Progressive size working set (PSWS) and parallel FP-growth (PFP) are state-of-the-art methods that use parallel and distributed computing to reduce mining time in many-task computing, thereby bridging the gap between high-throughput and high-performance computing. However, these methods cannot start mining until a complete FP-tree or the corresponding subdatabase has been obtained, which leaves computing nodes idle for a long time. We propose a method that can begin mining as soon as a small part of an FP-tree is received. This reduces the idle time of computing nodes and, in turn, the time required for mining. An empirical evaluation shows that the proposed method is faster than PSWS and PFP.
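To make the terms in the abstract concrete, the following minimal sketch shows what "identifying frequent patterns" and "deducing association rules" mean on a toy transaction database. It is a brute-force enumeration written only for illustration: it is not FP-growth, PSWS, PFP, or the proposed method, and the function names, item names, and support threshold are all assumptions introduced here.

```python
from collections import Counter
from itertools import combinations


def frequent_patterns(transactions, min_support):
    """Count every itemset and keep those in at least min_support transactions.

    Brute-force enumeration (exponential in transaction length); real miners
    such as FP-growth avoid this by compressing the database into an FP-tree.
    """
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # drop duplicate items within a transaction
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


def rule_confidence(patterns, antecedent, consequent):
    """confidence(A -> B) = support(A union B) / support(A)."""
    a = frozenset(antecedent)
    ab = a | frozenset(consequent)
    return patterns[ab] / patterns[a]


# Hypothetical toy database of purchased items.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk"],
]

patterns = frequent_patterns(transactions, min_support=2)
confidence = rule_confidence(patterns, ["butter"], ["bread"])
```

Here `{bread, milk}` and `{bread, butter}` each occur in two transactions and are therefore frequent at the assumed threshold of 2, while `{butter, milk}` occurs only once and is discarded. The rule butter → bread has confidence 2/2, because every transaction containing butter also contains bread.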
Acknowledgement
This work was supported by the Ministry of Science and Technology of Taiwan, R.O.C., under Grant Nos. MOST 104-2221-E-151-055 and 105-2221-E-151-056.
Cite this article
Lin, CC., Chung, SH., Chen, JC. et al. A fast and low idle time method for mining frequent patterns in distributed and many-task computing environments. Distrib Parallel Databases 36, 613–641 (2018). https://doi.org/10.1007/s10619-018-7221-9
DOI: https://doi.org/10.1007/s10619-018-7221-9