Evaluating end-to-end optimization for data analytics applications in Weld

Published: 01 May 2018

Abstract

Modern analytics applications use a diverse mix of libraries and functions. Unfortunately, there is no optimization across these libraries, resulting in performance penalties as high as an order of magnitude in many applications. To address this problem, we proposed Weld, a common runtime for existing data analytics libraries that performs key physical optimizations such as pipelining under existing, imperative library APIs. In this work, we further develop the Weld vision by designing an automatic adaptive optimizer for Weld applications, and evaluating its impact on realistic data science workloads. Our optimizer eliminates multiple forms of overhead that arise when composing imperative libraries like Pandas and NumPy, and uses lightweight measurements to make data-dependent decisions at run-time in ad-hoc workloads where no statistics are available, with sub-second overhead. We also evaluate which optimizations have the largest impact in practice and whether Weld can be integrated into libraries incrementally. Our results are promising: using our optimizer, Weld accelerates data science workloads by up to 23X on one thread and 80X on eight threads, and its adaptive optimizations provide up to a 3.75X speedup over rule-based optimization. Moreover, Weld provides benefits if even just 4--5 operators in a library are ported to use it. Our results show that common runtime designs like Weld may be a viable approach to accelerate analytics.
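As a rough illustration of the two ideas in the abstract, the sketch below contrasts an eager, library-style composition that materializes an intermediate after every step with a single fused pass, and then shows a sample-based measurement driving a data-dependent choice. It is plain Python and does not use Weld's API. All function names, the `cutoff` value, and the strategy labels are hypothetical and chosen only for illustration; the paper's actual optimizer and its decision rules are not reproduced here.

```python
import random


def unfused_total(prices, quantities, threshold):
    """Eager, library-style composition: every step builds a full intermediate."""
    totals = [p * q for p, q in zip(prices, quantities)]  # intermediate list 1
    kept = [t for t in totals if t > threshold]           # intermediate list 2
    return sum(kept)


def fused_total(prices, quantities, threshold):
    """Same computation in one pass over the inputs, with no intermediate lists.

    For float inputs the result may differ from unfused_total by rounding.
    """
    acc = 0
    for p, q in zip(prices, quantities):
        t = p * q
        if t > threshold:
            acc += t
    return acc


def estimate_selectivity(values, predicate, sample_size=100, seed=0):
    """Estimate the fraction of values that pass `predicate` from a random sample."""
    if not values:
        return 0.0
    rng = random.Random(seed)
    sample = rng.sample(values, min(sample_size, len(values)))
    return sum(1 for v in sample if predicate(v)) / len(sample)


def choose_filter_strategy(selectivity, cutoff=0.5):
    """Hypothetical decision rule: the cutoff and labels are assumptions."""
    return "branching" if selectivity < cutoff else "branch-free"
```

A caller could run `estimate_selectivity` on a small sample before executing a filter and pass the result to `choose_filter_strategy`. The measurement touches at most `sample_size` elements, which is the sense in which such a check stays lightweight when no statistics are available ahead of time.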

Published in

Proceedings of the VLDB Endowment, Volume 11, Issue 9
Proceedings of the 44th International Conference on Very Large Data Bases, Rio de Janeiro, Brazil
May 2018
135 pages

Publisher: VLDB Endowment

Qualifiers: research-article
