ABSTRACT
Data de-duplication has become a commodity component in data-intensive systems, and these systems are expected to provide reliability comparable to that of other storage systems. Unfortunately, by storing each duplicate data chunk only once, a de-duplicated system improves storage utilization at the cost of error resilience. In this paper we propose R-ADMAD, a mechanism for providing high reliability in such systems. It packs variable-length data chunks into fixed-size objects, encodes the objects with error-correcting codes (ECC), and distributes them among the storage nodes of a redundancy group, which is generated dynamically according to the current system status and the actual failure domains. Upon failures, R-ADMAD performs a distributed and dynamic recovery process. Experimental results show that R-ADMAD provides the same storage utilization as RAID-like schemes, yet its reliability is comparable to that of replication-based schemes, which require much more redundancy. The average recovery time of R-ADMAD-based configurations is about 2-6 times shorter than that of RAID-like schemes. Moreover, R-ADMAD can provide dynamic load balancing even without the involvement of the overloaded storage nodes.
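The packing and encoding steps described above can be illustrated with a minimal sketch. It is not the R-ADMAD implementation: the abstract does not name the specific error-correcting code, so a single byte-wise XOR parity object stands in for it here, and the names `OBJECT_SIZE`, `pack_chunks`, `xor_parity`, and `recover` are hypothetical. The sketch only shows how variable-length chunks can be laid out in fixed-size objects and how one lost object in a group can be rebuilt from the survivors.

```python
from functools import reduce

OBJECT_SIZE = 16  # hypothetical fixed object size in bytes (tiny, for illustration)


def pack_chunks(chunks, object_size=OBJECT_SIZE):
    """Pack variable-length chunks back to back into fixed-size objects.

    Returns (objects, index), where index[i] is the (offset, length) of
    chunk i in the concatenated stream. The last object is zero-padded.
    """
    stream = b"".join(chunks)
    index, offset = [], 0
    for chunk in chunks:
        index.append((offset, len(chunk)))
        offset += len(chunk)
    padded_len = -(-len(stream) // object_size) * object_size  # round up
    stream = stream.ljust(padded_len, b"\0")
    objects = [stream[i:i + object_size]
               for i in range(0, padded_len, object_size)]
    return objects, index


def xor_parity(objects):
    """Byte-wise XOR of equally sized objects.

    Assumption: a stand-in for the ECC; it tolerates one lost object only.
    """
    return bytes(reduce(lambda a, b: a ^ b, column)
                 for column in zip(*objects))


def recover(surviving_objects, parity):
    """Rebuild the single missing object from the survivors and the parity."""
    return xor_parity(surviving_objects + [parity])


if __name__ == "__main__":
    chunks = [b"alpha", b"bravo-charlie", b"delta-echo-foxtrot", b"golf"]
    objects, index = pack_chunks(chunks)
    parity = xor_parity(objects)
    # Pretend the node holding objects[1] failed and rebuild its object.
    rebuilt = recover([objects[0], objects[2]], parity)
    print(rebuilt == objects[1])
```

In a deployment along the lines the abstract describes, each data object and the redundancy object would be placed on a different storage node of the redundancy group, so that a single node failure costs at most one object per group.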
Index Terms
- R-ADMAD: high reliability provision for large-scale de-duplication archival storage systems