DOI: 10.1145/2465351.2465371

Presto: distributed machine learning and graph processing with sparse matrices

Published: 15 April 2013

ABSTRACT

It is cumbersome to write machine learning and graph algorithms in data-parallel models such as MapReduce and Dryad. We observe that these algorithms are based on matrix computations and, hence, are inefficient to implement with the restrictive programming and communication interface of such frameworks.

In this paper we show that array-based languages such as R [3] are suitable for implementing complex algorithms and can outperform current data-parallel solutions. Since R is single-threaded and does not scale to large datasets, we have built Presto, a distributed system that extends R and addresses many of its limitations. Presto efficiently shares sparse structured data, can leverage multi-cores, and dynamically partitions data to mitigate load imbalance. Our results show the promise of this approach: many important machine learning and graph algorithms can be expressed in a single framework and are substantially faster than those in Hadoop and Spark.
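The abstract's central claim is that these algorithms reduce to (sparse) matrix computations. PageRank [8] is the canonical example: each iteration is one sparse matrix-vector multiplication, which is awkward to express in MapReduce but natural in an array language. A minimal single-machine sketch in Python with SciPy (this is illustrative only, not Presto's R API; the toy graph and function names are assumptions for the example):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy link graph: entry (i, j) = 1 means page i links to page j.
links = csr_matrix(np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float))

# Column-stochastic transition matrix M: M[j, i] = 1/outdeg(i) if i links to j.
outdeg = np.asarray(links.sum(axis=1)).ravel()
M = links.multiply(1.0 / outdeg[:, None]).T.tocsr()

def pagerank(M, damping=0.85, tol=1e-10, max_iter=100):
    """Power iteration: each step is one sparse matrix-vector product."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)          # start from the uniform distribution
    for _ in range(max_iter):
        r_next = damping * (M @ r) + (1 - damping) / n
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

r = pagerank(M)
```

Because the per-iteration work is a single `M @ r`, a distributed runtime only has to partition the sparse matrix and the vector across machines; this is the kind of structure Presto exploits, whereas MapReduce forces the same computation through key-value shuffles.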

References

  1. Apache Mahout. http://mahout.apache.org.
  2. Netflix Prize. http://www.netflixprize.com/.
  3. The R project for statistical computing. http://www.r-project.org.
  4. Stanford Network Analysis Package. http://snap.stanford.edu/snap.
  5. G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris. Reining in the outliers in map-reduce clusters using Mantri. In OSDI '10, Vancouver, BC, Canada, 2010.
  6. R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. In SFCS '94, pages 356--368, Washington, DC, USA, 1994.
  7. U. Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25:163--177, 2001.
  8. S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. In WWW7, pages 107--117, 1998.
  9. Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop: Efficient iterative data processing on large clusters. Proc. VLDB Endow., 3:285--296, September 2010.
  10. P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: An object-oriented approach to non-uniform cluster computing. In OOPSLA '05, pages 519--538, 2005.
  11. S. Das, Y. Sismanis, K. S. Beyer, R. Gemulla, P. J. Haas, and J. McPherson. Ricardo: Integrating R and Hadoop. In SIGMOD '10, pages 987--998, 2010.
  12. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1), 2008.
  13. J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox. Twister: A runtime for iterative MapReduce. In HPDC '10, pages 810--818, 2010.
  14. J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI '12, Hollywood, CA, October 2012.
  15. V. Hernandez, J. E. Roman, and V. Vidal. SLEPc: A scalable and flexible toolkit for the solution of eigenvalue problems. ACM Trans. Math. Softw., 31(3):351--362, Sept. 2005.
  16. P. Hintjens. ZeroMQ: The Guide, 2010.
  17. M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In EuroSys '07, pages 59--72, 2007.
  18. U. Kang, B. Meeder, and C. Faloutsos. Spectral analysis for billion-scale graphs: Discoveries and implementation. In PAKDD (2), pages 13--25, 2011.
  19. J. Kepner and J. Gilbert. Graph Algorithms in the Language of Linear Algebra. Fundamentals of Algorithms. SIAM, 2011.
  20. M. Kulkarni, K. Pingali, B. Walter, G. Ramanarayanan, K. Bala, and L. P. Chew. Optimistic parallelism requires abstractions. In PLDI '07, pages 211--222, 2007.
  21. R. B. Lehoucq, D. C. Sorensen, and C. Yang. ARPACK Users' Guide: Solution of Large-Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods. Software, Environments, Tools. SIAM, 1998.
  22. D. Loveman. High Performance Fortran. IEEE Parallel & Distributed Technology: Systems & Applications, 1(1):25--42, 1993.
  23. Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A new framework for parallel machine learning. CoRR, 2010.
  24. G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In SIGMOD '10, pages 135--146, 2010.
  25. Q. E. McCallum and S. Weston. Parallel R. O'Reilly Media, Oct. 2011.
  26. D. G. Murray and S. Hand. CIEL: A universal execution engine for distributed data-flow computing. In NSDI '11, Boston, MA, USA, 2011.
  27. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In SIGMOD '08, pages 1099--1110, 2008.
  28. R. Power and J. Li. Piccolo: Building fast, distributed programs with partitioned tables. In OSDI '10, Vancouver, BC, Canada, 2010. USENIX Association.
  29. Z. Qian, X. Chen, N. Kang, M. Chen, Y. Yu, T. Moscibroda, and Z. Zhang. MadLINQ: Large-scale distributed matrix computation for the cloud. In EuroSys '12, pages 197--210, 2012.
  30. S. Seo, E. J. Yoon, J. Kim, S. Jin, J.-S. Kim, and S. Maeng. HAMA: An efficient matrix computation with the MapReduce framework. In CLOUDCOM '10, pages 721--726, 2010.
  31. G. L. Steele, Jr. Parallel programming and code selection in Fortress. In PPoPP '06, pages 1--1, 2006.
  32. G. Strang. Introduction to Linear Algebra, Third Edition. Wellesley-Cambridge Press, Mar. 2003.
  33. C. E. Tsourakakis. Fast counting of triangles in large real networks without counting: Algorithms and laws. In ICDM '08, pages 608--617. IEEE, 2008.
  34. L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33:103--111, August 1990.
  35. Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI '08, pages 1--14, 2008.
  36. M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI '12, San Jose, CA, 2012.
  37. Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the Netflix Prize. In AAIM '08, pages 337--348, Shanghai, China, 2008.

Published in

EuroSys '13: Proceedings of the 8th ACM European Conference on Computer Systems
April 2013, 401 pages
ISBN: 9781450319942
DOI: 10.1145/2465351

Copyright © 2013 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

EuroSys '13 paper acceptance rate: 28 of 143 submissions (20%). Overall acceptance rate: 241 of 1,308 submissions (18%).
