research-article

The MADlib analytics library: or MAD skills, the SQL

Authors:
Joseph M. Hellerstein (U.C. Berkeley)
Christopher Ré (U. Wisconsin)
Florian Schoppmann (Greenplum)
Daisy Zhe Wang (U. Florida)
Eugene Fratkin (Greenplum)
Aleksander Gorajek (Greenplum)
Kee Siong Ng (Greenplum)
Caleb Welton (Greenplum)
Xixuan Feng (U. Wisconsin)
Kun Li (U. Florida)
Arun Kumar (U. Wisconsin)

Proceedings of the VLDB Endowment, Volume 5, Issue 12, pp. 1700–1711. https://doi.org/10.14778/2367502.2367510

Published: 01 August 2012

Abstract

MADlib is a free, open-source library of in-database analytic methods. It provides an evolving suite of SQL-based algorithms for machine learning, data mining and statistics that run at scale within a database engine, with no need for data import/export to other tools. The goal is for MADlib to eventually serve a role for scalable database systems that is similar to the CRAN library for R: a community repository of statistical methods, this time written with scale and parallelism in mind.

In this paper we introduce the MADlib project, including the background that led to its beginnings and the motivation for its open-source nature. We give an overview of the library's architecture and design patterns, and describe various statistical methods in that context. We include performance and speedup results for a core design pattern from one of those methods, run on the Greenplum parallel DBMS on a modest-sized test cluster. We then report on two initial efforts at incorporating academic research into MADlib, which is one of the project's goals.

MADlib is freely available at http://madlib.net, and the project is open for contributions of both new methods and ports to additional database platforms.

Index Terms

Information systems → Data management systems → Database management system engines

Published in

Proceedings of the VLDB Endowment, Volume 5, Issue 12
August 2012
340 pages
ISSN: 2150-8097
Publisher: VLDB Endowment