Efficient data persistence and data division for distributed computing in cloud data center networks

The Journal of Supercomputing

Abstract

Container-based Hadoop distributed file system (HDFS) storage is widely used in cloud data center networks, but traditional HDFS has a single point of failure that can make the whole cluster unavailable. In this paper, we study the storage reliability of a Docker container-based HDFS cluster with a single point of failure. First, we investigate a data-volume-based persistence solution for Hadoop, targeting an HDFS cluster with a single point of failure and a single-backup strategy. Second, we propose an HDFS-based replica placement algorithm for data storage that takes into account the performance of both the host and the container nodes. Third, we design the KADC-KNN data segmentation algorithm to store the persistent data of the Docker containers effectively. Extensive experimental results show that the proposed approach ensures stable storage and fast migration of cluster data. Compared with the state-of-the-art algorithm, the proposed data volume persistence algorithm, DVPS, improves data reliability by 19.8%. The data partitioning algorithm, KADC-KNN, improves partitioning accuracy by 20.2% and has lower time overhead.
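To make the idea of performance-aware replica placement concrete, the following is a minimal illustrative sketch and not the algorithm proposed in the paper. It assumes each container node carries a host performance score and a container performance score, blends them with a weight, and places replicas on the best-scoring containers while keeping them on distinct hosts. The names `ContainerNode`, `combined_score`, `place_replicas`, and `host_weight`, as well as the scoring formula and the default of two replicas (one primary plus one backup), are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContainerNode:
    name: str
    host: str
    host_score: float       # hypothetical host performance score in [0, 1]
    container_score: float  # hypothetical container performance score in [0, 1]


def combined_score(node, host_weight=0.5):
    """Blend host and container scores; the linear weighting is an assumption."""
    return host_weight * node.host_score + (1.0 - host_weight) * node.container_score


def place_replicas(nodes, replicas=2, host_weight=0.5):
    """Pick the best-scoring containers, using at most one container per host."""
    ranked = sorted(nodes, key=lambda n: combined_score(n, host_weight), reverse=True)
    chosen, used_hosts = [], set()
    for node in ranked:
        if node.host in used_hosts:
            continue  # skip: a replica already lives on this physical host
        chosen.append(node)
        used_hosts.add(node.host)
        if len(chosen) == replicas:
            break
    return chosen


nodes = [
    ContainerNode("dn1", "hostA", 0.9, 0.8),
    ContainerNode("dn2", "hostA", 0.9, 0.7),
    ContainerNode("dn3", "hostB", 0.6, 0.9),
    ContainerNode("dn4", "hostC", 0.4, 0.5),
]
print([n.name for n in place_replicas(nodes)])
```

In this sketch, `dn2` is skipped even though it scores higher than `dn3`, because it shares a host with `dn1`; a failure of that host would otherwise take out both copies. If fewer distinct hosts exist than requested replicas, the function returns as many placements as it can.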

Data availability

The data used to support the findings of this study are available from the corresponding author upon request.

Acknowledgements

We thank the editors and the anonymous reviewers for their useful feedback that improved this paper.

Funding

This work is supported by the Natural Science Foundation of China under grants No. 62172291, 62102196, and 62102195; the Natural Science Foundation of Jiangsu Province (No. BK20200753); the Jiangsu Postdoctoral Science Foundation Funded Project (No. 2021K096A); the Future Network Scientific Research Fund Project (FNSRFP-2021-YB-60); the Natural Science Fund for Colleges and Universities in Jiangsu Province (21KJB520026); the Fundamental Research Funds for the Central Universities JL (No. 93K172020K25, 93K172021K03); the Innovative Research Team Project of Suzhou Institute of Industrial Technology (2021KYTD003); and the Qing Lan Project of Jiangsu Province.

Author information

Contributions

XW and WF wrote the main manuscript text, and XH and RW prepared the experiments. All authors reviewed the manuscript.

Corresponding author

Correspondence to Weibei Fan.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Ethical approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, X., Hu, X., Fan, W. et al. Efficient data persistence and data division for distributed computing in cloud data center networks. J Supercomput 79, 16300–16327 (2023). https://doi.org/10.1007/s11227-023-05276-2

  • DOI: https://doi.org/10.1007/s11227-023-05276-2
