DOI: 10.1145/3318464.3389758
research-article

Lambada: Interactive Data Analytics on Cold Data Using Serverless Cloud Infrastructure

Published: 31 May 2020

ABSTRACT

Serverless computing has recently attracted a lot of attention from research and industry due to its promise of ultimate elasticity and operational simplicity. However, there is no consensus yet on whether or not the approach is suitable for data processing. In this paper, we present Lambada, a serverless distributed data processing framework designed to explore how to perform data analytics on serverless computing. In our analysis, supported with extensive experiments, we show in which scenarios serverless makes sense from an economic and performance perspective. We address several important technical questions that need to be solved to support data analytics and present examples from several domains where serverless offers a cost and performance advantage over existing solutions.

Supplemental Material

3318464.3389758.mp4 (mp4, 116.6 MB)

          Published in

          SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
          June 2020
          2925 pages
          ISBN: 9781450367356
          DOI: 10.1145/3318464

          Copyright © 2020 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Acceptance Rates

          Overall Acceptance Rate: 785 of 4,003 submissions, 20%
