ABSTRACT
It is cumbersome to write machine learning and graph algorithms in data-parallel models such as MapReduce and Dryad. We observe that these algorithms are based on matrix computations and, hence, are inefficient to implement with the restrictive programming and communication interface of such frameworks.
In this paper we show that array-based languages such as R [3] are suitable for implementing complex algorithms and can outperform current data-parallel solutions. Since R is single-threaded and does not scale to large datasets, we have built Presto, a distributed system that extends R and addresses many of its limitations. Presto efficiently shares sparse structured data, leverages multiple cores, and dynamically partitions data to mitigate load imbalance. Our results show the promise of this approach: many important machine learning and graph algorithms can be expressed in a single framework and are substantially faster than those in Hadoop and Spark.
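To make concrete the observation that graph algorithms reduce to matrix computations, the sketch below expresses PageRank as repeated sparse matrix-vector multiplication. It is an illustration only, not Presto's API or the authors' code: the function name `pagerank`, its parameters, and the toy edge list are hypothetical, and the sketch assumes every vertex has at least one outgoing edge.

```python
def pagerank(edges, n, damping=0.85, iters=50):
    """Power iteration: rank <- damping * M @ rank + (1 - damping) / n.

    M is the column-stochastic transition matrix of the graph, stored
    sparsely as (row, col, value) triples so that only edges cost work.
    Assumption: every vertex has at least one outgoing edge.
    """
    out_degree = [0] * n
    for src, _dst in edges:
        out_degree[src] += 1

    # One nonzero per edge: M[dst][src] = 1 / out_degree(src).
    triples = [(dst, src, 1.0 / out_degree[src]) for src, dst in edges]

    rank = [1.0 / n] * n
    for _ in range(iters):
        new_rank = [(1.0 - damping) / n] * n
        # Sparse matrix-vector product, accumulated into new_rank.
        for row, col, value in triples:
            new_rank[row] += damping * value * rank[col]
        rank = new_rank
    return rank


# Hypothetical toy graph: 0 -> 1, 1 -> 2, 2 -> 0, 2 -> 1.
edges = [(0, 1), (1, 2), (2, 0), (2, 1)]
ranks = pagerank(edges, n=3)
```

Each iteration touches every nonzero of the matrix exactly once, which is why the layout and partitioning of the sparse matrix dominate the cost once the data is spread over many machines.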
- Apache Mahout. http://mahout.apache.org.
- Netflix prize. http://www.netflixprize.com/.
- The R project for statistical computing. http://www.r-project.org.
- Stanford network analysis package. http://snap.stanford.edu/snap.
- G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris. Reining in the outliers in map-reduce clusters using Mantri. In OSDI '10, Vancouver, BC, Canada, 2010.
- R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. In SFCS '94, pages 356–368, Washington, DC, USA, 1994.
- U. Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25:163–177, 2001.
- S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. In WWW7, pages 107–117, 1998.
- Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop: Efficient iterative data processing on large clusters. Proc. VLDB Endow., 3:285–296, September 2010.
- P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: An object-oriented approach to non-uniform cluster computing. In OOPSLA '05, pages 519–538, 2005.
- S. Das, Y. Sismanis, K. S. Beyer, R. Gemulla, P. J. Haas, and J. McPherson. Ricardo: Integrating R and Hadoop. In SIGMOD Conference '10, pages 987–998, 2010.
- J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1), 2008.
- J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox. Twister: A runtime for iterative MapReduce. In HPDC '10, pages 810–818, 2010.
- J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI '12, Hollywood, CA, October 2012.
- V. Hernandez, J. E. Roman, and V. Vidal. SLEPc: A scalable and flexible toolkit for the solution of eigenvalue problems. ACM Trans. Math. Softw., 31(3):351–362, Sept. 2005.
- P. Hintjens. ZeroMQ: The Guide, 2010.
- M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In EuroSys '07, pages 59–72, 2007.
- U. Kang, B. Meeder, and C. Faloutsos. Spectral analysis for billion-scale graphs: Discoveries and implementation. In PAKDD (2), pages 13–25, 2011.
- J. Kepner and J. Gilbert. Graph Algorithms in the Language of Linear Algebra. Fundamentals of Algorithms. SIAM, 2011.
- M. Kulkarni, K. Pingali, B. Walter, G. Ramanarayanan, K. Bala, and L. P. Chew. Optimistic parallelism requires abstractions. In PLDI '07, pages 211–222.
- R. B. Lehoucq, D. C. Sorensen, and C. Yang. ARPACK users' guide: Solution of large-scale eigenvalue problems with implicitly restarted Arnoldi methods. Software, environments, tools. SIAM, 1998.
- D. Loveman. High performance Fortran. IEEE Parallel & Distributed Technology: Systems & Applications, 1(1):25–42, 1993.
- Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A new framework for parallel machine learning. CoRR, pages 1–1, 2010.
- G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In SIGMOD '10, pages 135–146, 2010.
- Q. E. McCallum and S. Weston. Parallel R. O'Reilly Media, Oct. 2011.
- D. G. Murray and S. Hand. CIEL: A universal execution engine for distributed data-flow computing. In NSDI '11, Boston, MA, USA, 2011.
- C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In SIGMOD '08, pages 1099–1110, 2008.
- R. Power and J. Li. Piccolo: Building fast, distributed programs with partitioned tables. In OSDI '10, Vancouver, BC, Canada, 2010. USENIX Association.
- Z. Qian, X. Chen, N. Kang, M. Chen, Y. Yu, T. Moscibroda, and Z. Zhang. MadLINQ: Large-scale distributed matrix computation for the cloud. In EuroSys '12, pages 197–210, 2012.
- S. Seo, E. J. Yoon, J. Kim, S. Jin, J.-S. Kim, and S. Maeng. Hama: An efficient matrix computation with the MapReduce framework. In CLOUDCOM '10, pages 721–726.
- G. L. Steele, Jr. Parallel programming and code selection in Fortress. In PPoPP '06, pages 1–1, 2006.
- G. Strang. Introduction to Linear Algebra, Third Edition. Wellesley Cambridge Pr, Mar. 2003.
- C. E. Tsourakakis. Fast counting of triangles in large real networks without counting: Algorithms and laws. In ICDM '08, pages 608–617. IEEE, 2008.
- L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33:103–111, August 1990.
- Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI '08, pages 1–14, 2008.
- M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI '12, San Jose, CA, 2012.
- Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the Netflix Prize. In AAIM '08, pages 337–348, Shanghai, China, 2008.
Index Terms
- Presto: distributed machine learning and graph processing with sparse matrices