research-article

Lambada: Interactive Data Analytics on Cold Data Using Serverless Cloud Infrastructure

Authors:
Ingo Müller, ETH Zürich, Zürich, Switzerland
Renato Marroquín, ETH Zürich, Zürich, Switzerland
Gustavo Alonso, ETH Zürich, Zürich, Switzerland

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, June 2020, Pages 115–130
https://doi.org/10.1145/3318464.3389758

Published: 31 May 2020

ABSTRACT

Serverless computing has recently attracted a lot of attention from research and industry due to its promise of ultimate elasticity and operational simplicity. However, there is no consensus yet on whether or not the approach is suitable for data processing. In this paper, we present Lambada, a serverless distributed data processing framework designed to explore how to perform data analytics on serverless computing. In our analysis, supported with extensive experiments, we show in which scenarios serverless makes sense from an economic and performance perspective. We address several important technical questions that need to be solved to support data analytics and present examples from several domains where serverless offers a cost and performance advantage over existing solutions.

Index Terms

Lambada: Interactive Data Analytics on Cold Data Using Serverless Cloud Infrastructure
1. Information systems
  1. Data management systems
    1. Database management system engines

Published in
SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
June 2020
2925 pages
ISBN: 9781450367356
DOI: 10.1145/3318464
General Chairs: David Maier (Portland State University, USA), Rachel Pottinger (University of British Columbia, Canada)
Program Chairs: AnHai Doan (University of Wisconsin, USA), Wang-Chiew Tan (Megagon Labs, USA)
Publications Chairs: Abdussalam Alawini (University of Illinois at Urbana-Champaign, USA), Hung Q. Ngo (RelationalAI, USA)
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
cloud computing
data lake
elasticity
interactive analytics
serverless computing
serverless functions

Acceptance Rates
Overall Acceptance Rate: 785 of 4,003 submissions, 20%