ABSTRACT
In the last decade, advances in data collection and storage technologies have led to increased interest in designing and implementing large-scale parallel algorithms for machine learning and data mining (ML-DM). Existing programming paradigms for expressing large-scale parallelism, such as MapReduce (MR) and the Message Passing Interface (MPI), have been the de facto choices for implementing these ML-DM algorithms. The MR paradigm has been of particular interest, as it gracefully handles large datasets and provides built-in resilience against failures. However, these parallel programming paradigms are too low-level and thus ill-suited for directly implementing ML-DM algorithms. To address this deficiency, we present NIMBLE, a portable infrastructure designed specifically to enable rapid implementation of parallel ML-DM algorithms. NIMBLE allows one to compose parallel ML-DM algorithms from reusable (serial and parallel) building blocks that can be executed efficiently using MR and other parallel programming models; it currently runs on top of Hadoop, an open-source MR implementation. We show how NIMBLE can be used to realize scalable implementations of ML-DM algorithms and present a performance evaluation.
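To make the composition idea concrete, the following is a minimal sketch (not NIMBLE's actual API, which the abstract does not detail) of how an ML-DM computation can be assembled from reusable building blocks that a MapReduce runtime could execute. All names (`run_mapreduce`, `nearest_centroid_mapper`, `mean_reducer`) are hypothetical; a toy serial executor stands in for a parallel runtime such as Hadoop. The example expresses one k-means iteration as a map-side assignment block and a reduce-side aggregation block.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy serial executor standing in for a parallel MR runtime (e.g. Hadoop)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):  # map phase: emit (key, value) pairs
            groups[key].append(value)
    # reduce phase: one reduce call per key, over all values grouped under it
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Building block 1 (map side): assign each point to its nearest centroid.
def nearest_centroid_mapper(centroids):
    def map_fn(point):
        dists = [sum((p - c) ** 2 for p, c in zip(point, centroid))
                 for centroid in centroids]
        yield dists.index(min(dists)), point
    return map_fn

# Building block 2 (reduce side): recompute a centroid as the mean of its points.
def mean_reducer(key, points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
new_centroids = run_mapreduce(points, nearest_centroid_mapper(centroids), mean_reducer)
# One k-means iteration: {0: (0.0, 0.5), 1: (10.0, 10.5)}
```

Because each building block is just a function with the map or reduce contract, the same blocks could in principle be handed to different backends (serial, MR, or another parallel model), which is the portability property the abstract claims for NIMBLE.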
Index Terms
- NIMBLE: a toolkit for the implementation of parallel data mining and machine learning algorithms on MapReduce