ABSTRACT
Multi-level erasure coding (MLEC) has seen large deployments in the field, but there has been no in-depth study of its design considerations at scale. In this paper, we provide a comprehensive analysis of MLEC design at scale. We introduce the MLEC design space along multiple dimensions, including code parameter selection, chunk placement schemes, and repair methods. We quantify their performance and durability, and show which MLEC schemes and repair methods provide the best tolerance against independent and correlated failures and reduce repair network traffic by orders of magnitude. To do so, we combine several evaluation strategies: simulation, splitting, dynamic programming, and mathematical modeling. We also compare the performance and durability of MLEC with other EC schemes such as SLEC and LRC, and show that MLEC can provide high durability with higher encoding throughput and less repair network traffic than both SLEC and LRC.
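To make the code parameter dimension concrete, the following minimal sketch computes two basic properties of a two-level code: its storage efficiency and the number of chunk failures within one stripe that it is guaranteed to survive. It assumes a simple product-style construction, in which a network-level (k_net + p_net) code is applied across local groups and each local group is further protected by a local-level (k_loc + p_loc) code. The function names and the example parameters are illustrative assumptions, not taken from the paper's artifact.

```python
from fractions import Fraction


def mlec_storage_efficiency(k_net, p_net, k_loc, p_loc):
    """Fraction of raw capacity that holds user data under a two-level code.

    Assumes the two levels compose multiplicatively (product-style MLEC).
    """
    return Fraction(k_net, k_net + p_net) * Fraction(k_loc, k_loc + p_loc)


def mlec_guaranteed_tolerance(p_net, p_loc):
    """Largest number of chunk failures in one stripe that can never lose data.

    A local group becomes unrecoverable only after p_loc + 1 chunk failures,
    and the network level is defeated only after p_net + 1 unrecoverable
    local groups, so the smallest fatal pattern has
    (p_net + 1) * (p_loc + 1) failed chunks.
    """
    return (p_net + 1) * (p_loc + 1) - 1


if __name__ == "__main__":
    # Hypothetical example parameters: (10+2) across groups, (17+3) within a group.
    eff = mlec_storage_efficiency(10, 2, 17, 3)
    tol = mlec_guaranteed_tolerance(2, 3)
    print(f"storage efficiency: {float(eff):.4f}")
    print(f"guaranteed tolerated chunk failures per stripe: {tol}")
```

This covers only the worst-case failure pattern. Durability under realistic independent and correlated failures also depends on chunk placement and repair method, which is why the study relies on simulation and modeling rather than a closed-form count.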
Index Terms
- Design Considerations and Analysis of Multi-Level Erasure Coding in Large-Scale Data Centers