research-article

Design Considerations and Analysis of Multi-Level Erasure Coding in Large-Scale Data Centers

Authors:
Meng Wang, University of Chicago, Chicago, United States of America (https://orcid.org/0000-0002-0970-6558)
Jiajun Mao, University of Chicago, Chicago, United States of America (https://orcid.org/0000-0003-3087-0245)
Rajdeep Rana, University of Chicago, Chicago, United States of America (https://orcid.org/0009-0003-8381-3583)
John Bent, Los Alamos National Laboratory, Los Alamos, United States of America (https://orcid.org/0000-0001-7887-0047)
Serkay Olmez, Seagate Research, Longmont, United States of America (https://orcid.org/0000-0003-2344-6753)
Anjus George, Oak Ridge National Laboratory, Oak Ridge, United States of America (https://orcid.org/0000-0001-7973-7061)
Garrett Wilson Ransom, Los Alamos National Laboratory, Los Alamos, United States of America (https://orcid.org/0009-0005-9324-6070)
Jun Li, CUNY Queens College & Graduate Center, New York, United States of America (https://orcid.org/0000-0001-8266-7463)
Haryadi S. Gunawi, University of Chicago, Chicago, United States of America (https://orcid.org/0000-0003-3680-8450)

SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2023, Article No. 47, Pages 1–13. https://doi.org/10.1145/3581784.3607072

Published: 11 November 2023

ABSTRACT

Multi-level erasure coding (MLEC) has seen large deployments in the field, but there is no in-depth study of design considerations for MLEC at scale. In this paper, we provide comprehensive design considerations and analysis of MLEC at scale. We introduce the design space of MLEC along multiple dimensions, including code parameter selection, chunk placement schemes, and repair methods. We quantify their performance and durability, and show which MLEC schemes and repair methods provide the best tolerance against independent and correlated failures and reduce repair network traffic by orders of magnitude. To do so, we use several evaluation strategies, including simulation, splitting, dynamic programming, and mathematical modeling. We also compare the performance and durability of MLEC with other erasure coding (EC) schemes such as single-level erasure coding (SLEC) and locally repairable codes (LRC), and show that MLEC can provide high durability with higher encoding throughput and less repair network traffic than both SLEC and LRC.
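
As a rough illustration of what code parameter selection can involve, the sketch below assumes a two-level scheme in which a network-level (k_net + p_net) code is applied across enclosures and a local (k_loc + p_loc) code is applied inside each enclosure. The layering model, the class name `TwoLevelCode`, and the example parameters are assumptions made for illustration; they are not taken from the abstract and do not reproduce the paper's evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoLevelCode:
    """Hypothetical two-level erasure code: (k_net + p_net) over (k_loc + p_loc)."""
    k_net: int  # data chunks per network-level stripe
    p_net: int  # parity chunks per network-level stripe
    k_loc: int  # data chunks per local stripe
    p_loc: int  # parity chunks per local stripe

    def storage_overhead(self) -> float:
        """Raw capacity consumed per unit of user data."""
        raw = (self.k_net + self.p_net) * (self.k_loc + self.p_loc)
        return raw / (self.k_net * self.k_loc)

    def min_failures_for_loss(self) -> int:
        """Fewest chunk failures that can cause data loss.

        Assumes loss happens only when more than p_net local stripes of the
        same network stripe each lose more than p_loc chunks.
        """
        return (self.p_net + 1) * (self.p_loc + 1)


# Illustrative parameters only (not a configuration reported in the source).
code = TwoLevelCode(k_net=10, p_net=2, k_loc=17, p_loc=3)
print(f"storage overhead: {code.storage_overhead():.3f}x")
print(f"minimum failures for data loss: {code.min_failures_for_loss()}")
```

Under these assumptions, adding parity at either level multiplies the worst-case failure count while also multiplying the raw capacity needed, which is one of the trade-offs a parameter study has to weigh.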


Index Terms

1. Computer systems organization
  1. Architectures
    1. Distributed architectures
      1. n-tier architectures
  2. Dependable and fault-tolerant systems and networks
    1. Redundancy
    2. Reliability
2. Computing methodologies
  1. Modeling and simulation
    1. Simulation evaluation


Published in
SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2023
1428 pages
ISBN: 9798400701092
DOI: 10.1145/3581784
Chair: Dorian Arnold
Program Chair: Rosa M Badia
Program Co-chair: Kathryn Mohror
Copyright © 2023 ACM
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data centers
HPC storage
scalable storage
reliability
data protection
erasure coding
system-design tradeoffs
Acceptance Rates
Overall acceptance rate: 1,516 of 6,373 submissions (24%)