Performance Assurance Model for Applications on SPARK Platform

Singhal, Rekha; Singh, Praveen

doi:10.1007/978-3-319-72401-0_10

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 10661)

Included in the following conference series:

Technology Conference on Performance Evaluation and Benchmarking

Abstract

The wide availability of open source big data processing frameworks, such as Spark, has increased the migration of existing applications and the deployment of new applications to these cost-effective platforms. One of the challenges is assuring the performance of an application as data size grows in the production system. We address this problem for the Spark platform using a performance prediction model built in the development environment. We propose a grey box approach that estimates an application's execution time on a Spark cluster for larger data sizes using measurements on low-volume data in a small cluster. The proposed model may also be used iteratively to estimate the cluster size needed to reach the desired application performance in the production environment. We discuss both machine learning and analytic techniques for building the model. The model is also flexible to different configurations of the Spark cluster. This flexibility enables the use of the prediction model with optimization techniques to obtain tuned values of Spark parameters for optimal performance of an application deployed on a Spark cluster. Our key innovations in building the Spark performance prediction model are support for different configurations of the Spark platform and a simulator that estimates Spark stage execution time, including task execution variability due to HDFS, data skew and cluster node heterogeneity. We show that our proposed approaches predict within a 20% error bound for Wordcount, Terasort, K-means and a few TPC-H SQL workloads.
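
The two ideas named in the abstract, simulating a stage's execution time from variable task times and using the estimate iteratively to size a cluster, can be illustrated with a minimal sketch. This is not the authors' model: it assumes a stage is a bag of independent tasks with known durations, that each task is assigned to whichever core frees up first, and that all cores are identical. The function names, the task durations and the `max_nodes` limit are hypothetical and chosen only for illustration.

```python
import heapq


def simulate_stage_time(task_durations, num_cores):
    """Estimate stage completion time with greedy scheduling.

    Assumption (not from the paper): each task is assigned, in order,
    to the core that becomes free earliest.
    """
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    # Min-heap holding the time at which each core becomes free.
    free_at = [0.0] * num_cores
    heapq.heapify(free_at)
    for duration in task_durations:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + duration)
    # The stage ends when the last core finishes its last task.
    return max(free_at)


def smallest_cluster(task_durations, cores_per_node, target_seconds, max_nodes=64):
    """Grow the (hypothetical) cluster node by node until the target is met."""
    for nodes in range(1, max_nodes + 1):
        estimate = simulate_stage_time(task_durations, nodes * cores_per_node)
        if estimate <= target_seconds:
            return nodes
    return None  # target not reachable within max_nodes


# Eight tasks of 4 s each, except one skewed task of 10 s (made-up numbers).
tasks = [4, 4, 4, 4, 10, 4, 4, 4]
print(simulate_stage_time(tasks, 2))
print(smallest_cluster(tasks, cores_per_node=2, target_seconds=15))
```

In this toy setting the skewed task puts a floor under the stage time: once every task has its own core, adding more cores no longer helps, which is why task-level variability matters when sizing a cluster.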

Author information

Authors and Affiliations

Tata Consultancy Services Research, Mumbai, India
Rekha Singhal & Praveen Singh

Corresponding author

Correspondence to Rekha Singhal.

Editor information

Editors and Affiliations

Cisco Systems, Inc., San Jose, California, USA
Raghunath Nambiar
Server Technologies, Oracle Corporation, Redwood Shores, California, USA
Meikel Poess

About this paper

Cite this paper

Singhal, R., Singh, P. (2018). Performance Assurance Model for Applications on SPARK Platform. In: Nambiar, R., Poess, M. (eds) Performance Evaluation and Benchmarking for the Analytics Era. TPCTC 2017. Lecture Notes in Computer Science, vol 10661. Springer, Cham. https://doi.org/10.1007/978-3-319-72401-0_10

DOI: https://doi.org/10.1007/978-3-319-72401-0_10
Published: 30 December 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-72400-3
Online ISBN: 978-3-319-72401-0
eBook Packages: Computer Science, Computer Science (R0)
