
The MADlib analytics library: or MAD skills, the SQL

Published: 01 August 2012

Abstract

MADlib is a free, open-source library of in-database analytic methods. It provides an evolving suite of SQL-based algorithms for machine learning, data mining and statistics that run at scale within a database engine, with no need for data import/export to other tools. The goal is for MADlib to eventually serve a role for scalable database systems that is similar to the CRAN library for R: a community repository of statistical methods, this time written with scale and parallelism in mind.
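
To make the idea of SQL-based, in-database methods concrete, the following is a minimal sketch of what invoking a MADlib routine looks like. The table and column names are hypothetical, and the exact signature of madlib.logregr_train may differ across MADlib versions.

    -- Fit a logistic regression model without exporting data from the DBMS.
    -- 'patients' is an illustrative input table; the fitted coefficients are
    -- written to the output table 'patients_logregr'.
    SELECT madlib.logregr_train(
        'patients',                            -- source table
        'patients_logregr',                    -- output table for the model
        'second_attack',                       -- boolean dependent column
        'ARRAY[1, treatment, trait_anxiety]'   -- independent variables (1 = intercept)
    );

    -- The fitted model is itself a table, so it can be inspected or joined
    -- against new data for scoring with ordinary SQL.
    SELECT coef FROM patients_logregr;

Because both the input data and the fitted model live in ordinary tables, training and scoring compose with the rest of SQL rather than requiring a round trip through an external tool.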

In this paper we introduce the MADlib project, including the background that led to its beginnings, and the motivation for its open-source nature. We provide an overview of the library's architecture and design patterns, and describe various statistical methods in that context. We include performance and speedup results for a core design pattern from one of those methods, running over the Greenplum parallel DBMS on a modest-sized test cluster. We then report on two initial efforts at incorporating academic research into MADlib, which is one of the project's goals.

MADlib is freely available at http://madlib.net, and the project is open to contributions of both new methods and ports to additional database platforms.

    • Published in

      Proceedings of the VLDB Endowment, Volume 5, Issue 12 (August 2012), 340 pages.
      Publisher: VLDB Endowment
      Published: 1 August 2012
