Spark Performance Optimization Analysis With Multi-Layer Parameter Using Shuffling and Scheduling With Data Serialization in Different Data Caching Options

Mesay Deleli, Deleli Mesay Adinew, Ayall Tewodros Alemu

Source Title: Journal of Technological Advancements (JTA), 1(1)

EISSN: 2767-3804 | EISBN13: 9781799883708 | DOI: 10.4018/JTA.290326

MLA

Deleli, Mesay, et al. "Spark Performance Optimization Analysis With Multi-Layer Parameter Using Shuffling and Scheduling With Data Serialization in Different Data Caching Options." JTA vol.1, no.1 2021: pp.1-17. http://doi.org/10.4018/JTA.290326

APA

Deleli, M., Adinew, D. M., & Alemu, A. T. (2021). Spark Performance Optimization Analysis With Multi-Layer Parameter Using Shuffling and Scheduling With Data Serialization in Different Data Caching Options. Journal of Technological Advancements (JTA), 1(1), 1-17. http://doi.org/10.4018/JTA.290326

Chicago

Deleli, Mesay, Deleli Mesay Adinew, and Ayall Tewodros Alemu. "Spark Performance Optimization Analysis With Multi-Layer Parameter Using Shuffling and Scheduling With Data Serialization in Different Data Caching Options," Journal of Technological Advancements (JTA) 1, no.1: 1-17. http://doi.org/10.4018/JTA.290326

Abstract

As social networking services and e-commerce grow rapidly, the number of online users is also growing, and these users contribute huge volumes of content to the digital world. In such a dynamic environment, meeting the demand for computing is very challenging, especially with existing computing models. Spark was introduced to alleviate these problems through in-memory computing for big data analytics, and it exposes many configuration parameters that can be tuned to improve its performance. It nevertheless still has performance bottlenecks, which calls for investigating performance improvement mechanisms that focus on combinations of the scheduling mode and shuffle manager with data serialization and intermediate data caching options. A standalone cluster was selected as the experimental setup, with the submit command line used for job submission. Three Spark applications, WordCount, TeraSort, and PageRank, were selected and developed for the experiments. As a result, performance improvements of 2.45% and 8.01% were achieved with the OFF_HEAP and MEMORY_ONLY_SER data caching options, respectively.
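
As a rough illustration of the kind of parameter combinations the abstract describes, the following minimal Python sketch enumerates `spark-submit` command lines over scheduler mode, shuffle manager, serializer, and caching option. It is not the authors' experimental script. The candidate values, the master URL `spark://master:7077`, the class name `example.WordCount`, the jar name `benchmarks.jar`, the off-heap size `1g`, and the convention of passing the storage level as an application argument are all assumptions made for illustration; the accepted values of `spark.shuffle.manager` also depend on the Spark version.

```python
from itertools import product

# Assumed candidate values; the abstract does not list the exact settings tested.
SCHEDULER_MODES = ["FIFO", "FAIR"]              # spark.scheduler.mode
SHUFFLE_MANAGERS = ["sort", "tungsten-sort"]    # spark.shuffle.manager (version dependent)
SERIALIZERS = [                                 # spark.serializer
    "org.apache.spark.serializer.JavaSerializer",
    "org.apache.spark.serializer.KryoSerializer",
]
STORAGE_LEVELS = ["OFF_HEAP", "MEMORY_ONLY_SER"]  # caching options named in the abstract


def build_submit_command(scheduler, shuffle, serializer, storage_level,
                         master="spark://master:7077",    # hypothetical standalone master
                         main_class="example.WordCount",  # hypothetical application class
                         app_jar="benchmarks.jar"):       # hypothetical application jar
    """Return one spark-submit command as a list of arguments."""
    conf = {
        "spark.scheduler.mode": scheduler,
        "spark.shuffle.manager": shuffle,
        "spark.serializer": serializer,
    }
    if storage_level == "OFF_HEAP":
        # Off-heap caching needs off-heap memory enabled and sized (size is an assumption).
        conf["spark.memory.offHeap.enabled"] = "true"
        conf["spark.memory.offHeap.size"] = "1g"

    cmd = ["spark-submit", "--master", master, "--class", main_class]
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]
    # Assumption: the application reads the storage level from its first argument
    # and passes it to persist() itself; it is not a Spark configuration key.
    cmd += [app_jar, storage_level]
    return cmd


def parameter_grid():
    """Enumerate every combination of the assumed parameter values."""
    combos = product(SCHEDULER_MODES, SHUFFLE_MANAGERS, SERIALIZERS, STORAGE_LEVELS)
    return [build_submit_command(*combo) for combo in combos]


if __name__ == "__main__":
    for command in parameter_grid():
        print(" ".join(command))
```

Each generated command could be timed separately per application, so that the effect of one parameter is compared while the others are held fixed.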