Abstract
MADlib is a free, open-source library of in-database analytic methods. It provides an evolving suite of SQL-based algorithms for machine learning, data mining and statistics that run at scale within a database engine, with no need for data import/export to other tools. The goal is for MADlib to eventually serve a role for scalable database systems that is similar to the CRAN library for R: a community repository of statistical methods, this time written with scale and parallelism in mind.
In this paper we introduce the MADlib project, including the background that led to its beginnings and the motivation for its open-source nature. We give an overview of the library's architecture and design patterns, and describe various statistical methods in that context. We include performance and speedup results of a core design pattern from one of those methods over the Greenplum parallel DBMS on a modest-sized test cluster. We then report on two initial efforts at incorporating academic research into MADlib, which is one of the project's goals.
MADlib is freely available at http://madlib.net, and the project is open for contributions of both new methods and ports to additional database platforms.
Index Terms
- The MADlib analytics library: or MAD skills, the SQL