research-article

Presto: distributed machine learning and graph processing with sparse matrices

Authors:
Shivaram Venkataraman, University of Chicago
Erik Bodzsar, University of Chicago
Indrajit Roy, University of Chicago
Alvin AuYoung, University of Chicago
Robert S. Schreiber, University of Chicago

EuroSys '13: Proceedings of the 8th ACM European Conference on Computer Systems, April 2013, Pages 197–210. https://doi.org/10.1145/2465351.2465371

Published: 15 April 2013

ABSTRACT

It is cumbersome to write machine learning and graph algorithms in data-parallel models such as MapReduce and Dryad. We observe that these algorithms are based on matrix computations and, hence, are inefficient to implement with the restrictive programming and communication interface of such frameworks.
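As an illustration of the observation that graph algorithms are matrix computations, the sketch below writes PageRank as repeated sparse matrix-vector multiplication. It is a minimal single-machine example in plain Python, not code from the paper: the function names, the list-of-pairs sparse row layout, the damping factor of 0.85, the iteration count, and the tiny example graph are all assumptions made for illustration. It also assumes every node has at least one outgoing edge.

```python
# Hypothetical sketch: PageRank expressed as sparse matrix-vector products.
# All names, parameters, and the example graph are illustrative assumptions.

def build_transition(edges, n):
    """Build the column-stochastic transition matrix M in sparse row form.

    rows[i] holds (j, weight) pairs meaning M[i][j] = weight, where
    weight = 1 / out_degree(j) for every edge j -> i.
    """
    out_degree = [0] * n
    for src, _ in edges:
        out_degree[src] += 1
    rows = [[] for _ in range(n)]
    for src, dst in edges:
        rows[dst].append((src, 1.0 / out_degree[src]))
    return rows


def spmv(rows, x):
    """Sparse matrix-vector product: only stored nonzeros are visited."""
    return [sum(w * x[j] for j, w in row) for row in rows]


def pagerank(edges, n, damping=0.85, iterations=50):
    """Power iteration: rank <- (1 - d) / n + d * (M @ rank)."""
    rows = build_transition(edges, n)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        product = spmv(rows, rank)
        rank = [(1.0 - damping) / n + damping * p for p in product]
    return rank


# Tiny 3-node graph: 0 -> 1, 1 -> 2, 2 -> 0, 0 -> 2.
edges = [(0, 1), (1, 2), (2, 0), (0, 2)]
ranks = pagerank(edges, 3)
```

Each iteration is a single matrix-vector product followed by an element-wise update, which is the kind of array operation that is awkward to express as map and reduce stages but direct in an array language.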

In this paper we show that array-based languages such as R [3] are suitable for implementing complex algorithms and can outperform current data-parallel solutions. Since R is single-threaded and does not scale to large datasets, we have built Presto, a distributed system that extends R and addresses many of its limitations. Presto efficiently shares sparse structured data, can leverage multiple cores, and dynamically partitions data to mitigate load imbalance. Our results show the promise of this approach: many important machine learning and graph algorithms can be expressed in a single framework and are substantially faster than those in Hadoop and Spark.
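To see why partitioning matters for sparse data, the sketch below splits the rows of a sparse matrix into consecutive blocks with roughly equal numbers of nonzeros rather than equal numbers of rows. This is a static, simplified illustration only: the abstract does not describe Presto's actual repartitioning policy, and the function name, the greedy threshold rule, and the example row counts are assumptions.

```python
# Hypothetical sketch: balance work across partitions by nonzero count.
# This is NOT Presto's algorithm; it only illustrates the load-imbalance issue.

def partition_rows_by_nnz(row_nnz, num_parts):
    """Split consecutive rows into at most num_parts blocks.

    row_nnz[i] is the number of nonzeros in row i. Returns half-open
    (start, end) row ranges. A block is closed once the running nonzero
    count reaches the next multiple of total / num_parts; the last block
    takes whatever rows remain.
    """
    total = sum(row_nnz)
    target = total / num_parts
    bounds = []
    start = 0
    running = 0
    for i, nnz in enumerate(row_nnz):
        running += nnz
        parts_left = num_parts - len(bounds)
        if parts_left > 1 and running >= target * (len(bounds) + 1):
            bounds.append((start, i + 1))
            start = i + 1
    bounds.append((start, len(row_nnz)))
    return bounds


# Skewed rows, as in a graph with a few very high-degree vertices.
row_nnz = [120, 40, 20, 8, 4, 4, 2, 2]
blocks = partition_rows_by_nnz(row_nnz, 2)
work = [sum(row_nnz[s:e]) for s, e in blocks]
```

For these assumed counts, an equal-rows split would give the first half of the rows 188 nonzeros and the second half 12, whereas the split above assigns 120 and 80.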

References

[1] Apache Mahout. http://mahout.apache.org.
[2] Netflix prize. http://www.netflixprize.com/.
[3] The R project for statistical computing. http://www.r-project.org.
[4] Stanford network analysis package. http://snap.stanford.edu/snap.
[5] G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris. Reining in the outliers in map-reduce clusters using Mantri. In OSDI'10, Vancouver, BC, Canada, 2010.
[6] R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. In SFCS '94, pages 356–368, Washington, DC, USA, 1994.
[7] U. Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25:163–177, 2001.
[8] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. In WWW7, pages 107–117, 1998.
[9] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop: Efficient iterative data processing on large clusters. Proc. VLDB Endow., 3:285–296, September 2010.
[10] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: An object-oriented approach to non-uniform cluster computing. In OOPSLA'05, pages 519–538, 2005.
[11] S. Das, Y. Sismanis, K. S. Beyer, R. Gemulla, P. J. Haas, and J. McPherson. Ricardo: Integrating R and Hadoop. In SIGMOD Conference '10, pages 987–998, 2010.
[12] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1), 2008.
[13] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox. Twister: A runtime for iterative MapReduce. In HPDC '10, pages 810–818, 2010.
[14] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. In OSDI'12, Hollywood, CA, October 2012.
[15] V. Hernandez, J. E. Roman, and V. Vidal. Slepc: A scalable and flexible toolkit for the solution of eigenvalue problems. ACM Trans. Math. Softw., 31(3):351–362, Sept. 2005.
[16] P. Hintjens. ZeroMQ: The Guide, 2010.
[17] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In EuroSys '07, pages 59–72, 2007.
[18] U. Kang, B. Meeder, and C. Faloutsos. Spectral Analysis for Billion-Scale Graphs: Discoveries and Implementation. In PAKDD (2), pages 13–25, 2011.
[19] J. Kepner and J. Gilbert. Graph Algorithms in the Language of Linear Algebra. Fundamentals of Algorithms. SIAM, 2011.
[20] M. Kulkarni, K. Pingali, B. Walter, G. Ramanarayanan, K. Bala, and L. P. Chew. Optimistic parallelism requires abstractions. In PLDI '07, pages 211–222.
[21] R. B. Lehoucq, D. C. Sorensen, and C. Yang. ARPACK users' guide - solution of large-scale eigenvalue problems with implicitly restarted Arnoldi methods. Software, environments, tools. SIAM, 1998.
[22] D. Loveman. High performance Fortran. IEEE Parallel & Distributed Technology: Systems & Applications, 1(1):25–42, 1993.
[23] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A New Framework for Parallel Machine Learning. CoRR, pages 1–1, 2010.
[24] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In SIGMOD '10, pages 135–146, 2010.
[25] Q. E. McCallum and S. Weston. Parallel R. O'Reilly Media, Oct. 2011.
[26] D. G. Murray and S. Hand. Ciel: A universal execution engine for distributed data-flow computing. In NSDI '11, Boston, MA, USA, 2011.
[27] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: A not-so-foreign language for data processing. In SIGMOD'08, pages 1099–1110, 2008.
[28] R. Power and J. Li. Piccolo: Building fast, distributed programs with partitioned tables. In OSDI '10, Vancouver, BC, Canada, 2010. USENIX Association.
[29] Z. Qian, X. Chen, N. Kang, M. Chen, Y. Yu, T. Moscibroda, and Z. Zhang. MadLINQ: large-scale distributed matrix computation for the cloud. In EuroSys '12, pages 197–210, 2012.
[30] S. Seo, E. J. Yoon, J. Kim, S. Jin, J.-S. Kim, and S. Maeng. Hama: An efficient matrix computation with the mapreduce framework. In CLOUDCOM'10, pages 721–726.
[31] G. L. Steele, Jr. Parallel programming and code selection in fortress. In PPoPP '06, pages 1–1, 2006.
[32] G. Strang. Introduction to Linear Algebra, Third Edition. Wellesley Cambridge Pr, Mar. 2003.
[33] C. E. Tsourakakis. Fast counting of triangles in large real networks without counting: Algorithms and laws. In ICDM'08, pages 608–617. IEEE, 2008.
[34] L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33:103–111, August 1990.
[35] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI '08, pages 1–14, 2008.
[36] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In NSDI'12, San Jose, CA, 2012.
[37] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-Scale Parallel Collaborative Filtering for the Netflix Prize. In AAIM '08, pages 337–348, Shanghai, China, 2008.

Index Terms

Presto: distributed machine learning and graph processing with sparse matrices
1. Computing methodologies
  1. Artificial intelligence
    1. Search methodologies
      1. Discrete space search
      2. Game tree search
  2. Machine learning
2. Mathematics of computing
  1. Mathematical analysis
    1. Numerical analysis
      1. Computations on matrices


Published in
EuroSys '13: Proceedings of the 8th ACM European Conference on Computer Systems
April 2013
401 pages
ISBN:9781450319942
DOI:10.1145/2465351
General Chairs:
Zdenek Hanzálek
Czech Technical University Prague
,
Hermann Härtig
Technische Universität Dresden
,
Program Chairs:
Miguel Castro
Microsoft Research Cambridge
,
M. Frans Kaashoek
MIT
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
EuroSys '13 Paper Acceptance Rate: 28 of 143 submissions, 20%
Overall Acceptance Rate: 241 of 1,308 submissions, 18%
Article Metrics
- Total Citations: 56
- Total Downloads: 1,019
- Downloads (Last 12 months): 33
- Downloads (Last 6 weeks): 4
