research-article

Evaluating end-to-end optimization for data analytics applications in Weld

Authors:
Shoumik Palkar, Stanford University
James Thomas, Stanford University
Deepak Narayanan, Stanford University
Pratiksha Thaker, Stanford University
Rahul Palamuttam, Stanford University
Parimajan Negi, Stanford University
Anil Shanbhag, MIT CSAIL
Malte Schwarzkopf, MIT CSAIL
Holger Pirk, Imperial College London
Saman Amarasinghe, MIT CSAIL
Samuel Madden, MIT CSAIL
Matei Zaharia, Stanford University

Proceedings of the VLDB Endowment, Volume 11, Issue 9, pp. 1002–1015. https://doi.org/10.14778/3213880.3213890

Published: 01 May 2018

Abstract

Modern analytics applications use a diverse mix of libraries and functions. Unfortunately, there is no optimization across these libraries, resulting in performance penalties as high as an order of magnitude in many applications. To address this problem, we proposed Weld, a common runtime for existing data analytics libraries that performs key physical optimizations such as pipelining under existing, imperative library APIs. In this work, we further develop the Weld vision by designing an automatic adaptive optimizer for Weld applications and evaluating its impact on realistic data science workloads. Our optimizer eliminates multiple forms of overhead that arise when composing imperative libraries like Pandas and NumPy. It also uses lightweight measurements to make data-dependent decisions at run-time, with sub-second overhead, in ad-hoc workloads where no statistics are available. We also evaluate which optimizations have the largest impact in practice and whether Weld can be integrated into libraries incrementally. Our results are promising: using our optimizer, Weld accelerates data science workloads by up to 23X on one thread and 80X on eight threads, and its adaptive optimizations provide up to a 3.75X speedup over rule-based optimization. Moreover, Weld provides benefits even if only 4-5 operators in a library are ported to use it. Our results show that common runtime designs like Weld may be a viable approach to accelerate analytics.
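
As a rough illustration of the pipelining mentioned in the abstract, the sketch below contrasts a computation composed from separate operator calls, each of which scans its input and materializes an intermediate result, with a single fused loop over the data. It uses plain Python lists and is not Weld's API or intermediate representation; the function names `unfused` and `fused` and the sample inputs are hypothetical, chosen only for this example.

```python
import math


def unfused(prices, strikes):
    # Each step stands in for a separate library call: it reads its whole
    # input and materializes a full intermediate list before the next step.
    ratios = [p / k for p, k in zip(prices, strikes)]  # pass 1
    logs = [math.log(r) for r in ratios]               # pass 2
    return sum(logs)                                   # pass 3


def fused(prices, strikes):
    # Pipelined version: one pass over the inputs, no intermediate lists.
    total = 0.0
    for p, k in zip(prices, strikes):
        total += math.log(p / k)
    return total


if __name__ == "__main__":
    prices = [10.0, 20.0, 30.0]
    strikes = [5.0, 10.0, 15.0]
    # Both versions compute the same quantity, up to floating-point rounding.
    print(unfused(prices, strikes), fused(prices, strikes))
```

Both functions return the same value up to floating-point rounding; the fused form simply avoids the extra passes over memory that composing independent library calls would otherwise incur.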

Published in

Proceedings of the VLDB Endowment, Volume 11, Issue 9
Proceedings of the 44th International Conference on Very Large Data Bases, Rio de Janeiro, Brazil
May 2018
135 pages
ISSN: 2150-8097
Publisher: VLDB Endowment
