ABSTRACT
In the last decade, advances in data collection and storage technologies have led to increased interest in designing and implementing large-scale parallel algorithms for machine learning and data mining (ML-DM). Existing programming paradigms for expressing large-scale parallelism, such as MapReduce (MR) and the Message Passing Interface (MPI), have been the de facto choices for implementing these ML-DM algorithms. The MR paradigm has been of particular interest, as it gracefully handles large datasets and provides built-in resilience against failures. However, these parallel programming paradigms are too low-level and thus ill-suited for directly implementing ML-DM algorithms. To address this deficiency, we present NIMBLE, a portable infrastructure designed specifically to enable rapid implementation of parallel ML-DM algorithms. NIMBLE allows one to compose parallel ML-DM algorithms from reusable (serial and parallel) building blocks that can be executed efficiently using MR and other parallel programming models; it currently runs on top of Hadoop, an open-source MR implementation. We show how NIMBLE can be used to realize scalable implementations of ML-DM algorithms and present a performance evaluation.
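To make the composition idea concrete, the following is a minimal sketch (not NIMBLE's actual API, which the abstract does not detail) of how an ML-DM computation can be assembled from reusable building blocks that a MapReduce runtime could execute. All names (`run_mapreduce`, `nearest_centroid_mapper`, `mean_reducer`) are hypothetical; a toy serial executor stands in for a parallel runtime such as Hadoop. The example expresses one k-means iteration as a map-side assignment block and a reduce-side aggregation block.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy serial executor standing in for a parallel MR runtime (e.g. Hadoop)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):  # map phase: emit (key, value) pairs
            groups[key].append(value)
    # reduce phase: one reduce call per key, over all values grouped under it
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Building block 1 (map side): assign each point to its nearest centroid.
def nearest_centroid_mapper(centroids):
    def map_fn(point):
        dists = [sum((p - c) ** 2 for p, c in zip(point, centroid))
                 for centroid in centroids]
        yield dists.index(min(dists)), point
    return map_fn

# Building block 2 (reduce side): recompute a centroid as the mean of its points.
def mean_reducer(key, points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
new_centroids = run_mapreduce(points, nearest_centroid_mapper(centroids), mean_reducer)
# One k-means iteration: {0: (0.0, 0.5), 1: (10.0, 10.5)}
```

Because each building block is just a function with the map or reduce contract, the same blocks could in principle be handed to different backends (serial, MR, or another parallel model), which is the portability property the abstract claims for NIMBLE.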
Index Terms
- NIMBLE: a toolkit for the implementation of parallel data mining and machine learning algorithms on MapReduce