ABSTRACT
The High-Energy Physics experiments at CERN produce a high volume of data. No single machine can analyze large portions of it within a reasonable time. The ROOT framework was recently extended with distributed computing capabilities for massively parallelized RDataFrame applications. This approach, which uses the MapReduce pattern underneath, makes heavy computations much more approachable, even for newcomers.
This paper explores the possibility of running such analyses on serverless services in the public cloud using a purely stateless environment. So far, the distributed approaches used by RDataFrame have relied on stateful, fully managed computing frameworks such as Apache Spark. Here we show that our newly developed tool can run on fully stateless cloud functions, and our benchmarks demonstrate excellent speedup in the parallel stage of processing.
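The MapReduce pattern with stateless workers, as described above, can be illustrated with a minimal sketch. It is not the tool presented in this paper and uses neither ROOT nor any cloud API: the helper names (`split_ranges`, `mapper`, `reducer`, `run`) are hypothetical, the event data are synthetic, and a local thread pool stands in for cloud function invocations. The point it shows is only that each map task receives everything it needs as input, keeps no state between calls, and returns a partial histogram that is merged in the reduce step.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_ranges(n_entries, n_parts):
    """Split [0, n_entries) into n_parts contiguous, near-equal ranges."""
    step, extra = divmod(n_entries, n_parts)
    ranges, start = [], 0
    for i in range(n_parts):
        stop = start + step + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def mapper(task):
    """Stateless map step: build a partial histogram for one entry range.

    All inputs arrive in `task`; nothing is shared or kept between calls,
    which is the property a cloud function worker would rely on.
    """
    start, stop, n_bins, lo, hi = task
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for entry in range(start, stop):
        # Synthetic stand-in for reading one event from a dataset (assumption).
        value = (entry * 37) % 100
        if lo <= value < hi:
            counts[int((value - lo) / width)] += 1
    return counts


def reducer(a, b):
    """Reduce step: merge two partial histograms bin by bin."""
    return [x + y for x, y in zip(a, b)]


def run(n_entries=1000, n_parts=8, n_bins=10, lo=0.0, hi=100.0):
    """Map over entry ranges in parallel, then reduce the partial results."""
    tasks = [(s, e, n_bins, lo, hi) for s, e in split_ranges(n_entries, n_parts)]
    # A thread pool stands in for independent cloud function invocations.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(mapper, tasks))
    return reduce(reducer, partials)


if __name__ == "__main__":
    print(run())
```

Because the map tasks are independent, the merged histogram does not depend on how many ranges the entries are split into; only the parallel stage changes with the number of workers.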
Index Terms
- Distributed Parallel Analysis Engine for High Energy Physics Using AWS Lambda