ABSTRACT
Serverless computing has recently attracted considerable attention from research and industry due to its promise of ultimate elasticity and operational simplicity. However, there is no consensus yet on whether the approach is suitable for data processing. In this paper, we present Lambada, a serverless distributed data processing framework designed to explore how to perform data analytics on serverless infrastructure. In our analysis, supported by extensive experiments, we show in which scenarios serverless makes sense from an economic and performance perspective. We address several important technical problems that need to be solved to support data analytics and present examples from several domains where serverless offers a cost and performance advantage over existing solutions.
Index Terms
- Lambada: Interactive Data Analytics on Cold Data Using Serverless Cloud Infrastructure