Abstract
In recent years, Big Data (BD), High Performance Computing (HPC), and Machine Learning (ML) systems have been converging. This convergence is driven by increasingly long and complex data analysis pipelines that today run on separate software stacks. As these pipelines grow in complexity, so does the need to evaluate the underlying systems in order to make informed decisions about technology selection and about the sizing and scoping of hardware. While many benchmarks exist for each of these domains, the benchmarking efforts themselves have not converged. As a first step toward such convergence, it is necessary to understand how the individual benchmark domains relate to one another.
In this work, we analyze some of the most recent and most expressive benchmarks for BD, HPC, and ML systems. We propose a taxonomy of these benchmarks based on domain-specific dimensions, such as accuracy metrics, and shared dimensions, such as workload type. Moreover, we aim to enable practitioners to use this taxonomy to identify suitable benchmarks for their BD, HPC, and ML systems. Finally, we identify challenges and research directions for the future of converged BD, HPC, and ML system benchmarking.
Acknowledgement
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 957407 (DAPHNE). This work has also been supported by the German Research Foundation (DFG) through FONDA.
Copyright information
© 2022 Springer Nature Switzerland AG
Cite this paper
Ihde, N. et al. (2022). A Survey of Big Data, High Performance Computing, and Machine Learning Benchmarks. In: Nambiar, R., Poess, M. (eds) Performance Evaluation and Benchmarking. TPCTC 2021. Lecture Notes in Computer Science, vol 13169. Springer, Cham. https://doi.org/10.1007/978-3-030-94437-7_7
DOI: https://doi.org/10.1007/978-3-030-94437-7_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-94436-0
Online ISBN: 978-3-030-94437-7