research-article

Can We Group Storage? Statistical Techniques to Identify Predictive Groupings in Storage System Accesses

Authors:
Avani Wildani, The Salk Institute for Biological Studies, Atlanta, GA
Ethan L. Miller, University of California, MS, Santa Cruz

ACM Transactions on Storage, Volume 12, Issue 2, Article No. 7, pp. 1–33. https://doi.org/10.1145/2738042

Published: 01 February 2016

Abstract

Storing large amounts of data for different users has become the new normal in a modern distributed cloud storage environment. Storing data successfully requires a balance of availability, reliability, cost, and performance. Typically, systems are designed for this balance with minimal information about the data that will pass through them. We propose a series of methods to derive groupings with predictive value from data, which can then inform layout decisions for data on disk.

Unlike previous grouping work, we focus on dynamically identifying groupings using spatiotemporal locality, working from data that can be gathered from active systems in real time with minimal impact. We outline several techniques we have developed and discuss how we select particular techniques for particular workloads and application domains. Our statistical and machine-learning-based grouping algorithms answer questions such as “What can a grouping be based on?” and “Is a given grouping meaningful for a given application?” We design our models to be flexible and require minimal domain information so that our results are as broadly applicable as possible. We intend for this work to provide a launchpad for future specialized system design using groupings in combination with caching policies and architectural distinctions such as tiered storage to create the next generation of scalable storage systems.
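The abstract does not spell out a specific algorithm, so the following is only an illustrative sketch of what grouping accesses by spatiotemporal locality could look like, not the authors' method. It assumes a trace of (time, block offset) pairs and starts a new group whenever consecutive accesses are far apart in time or in offset. The `Access` type, the `group_accesses` function, and both thresholds are hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Access:
    time: float   # seconds since trace start (assumed unit)
    offset: int   # logical block offset (assumed unit)


def group_accesses(
    trace: Iterable[Access],
    max_gap_s: float = 1.0,      # hypothetical temporal threshold
    max_gap_blocks: int = 64,    # hypothetical spatial threshold
) -> List[List[Access]]:
    """Split a trace into groups of accesses that are close in time and space.

    An access joins the current group only if it is within both thresholds
    of the previous access; otherwise it starts a new group.
    """
    groups: List[List[Access]] = []
    for acc in sorted(trace, key=lambda a: a.time):
        if groups:
            prev = groups[-1][-1]
            close_in_time = acc.time - prev.time <= max_gap_s
            close_in_space = abs(acc.offset - prev.offset) <= max_gap_blocks
            if close_in_time and close_in_space:
                groups[-1].append(acc)
                continue
        groups.append([acc])
    return groups


if __name__ == "__main__":
    trace = [
        Access(0.0, 100), Access(0.2, 110),    # near each other in time and space
        Access(0.3, 5000), Access(0.4, 5010),  # far in space from the first pair
        Access(9.0, 5020),                     # far in time from the second pair
    ]
    for group in group_accesses(trace):
        print([a.offset for a in group])
```

A single-pass threshold rule like this is cheap enough to run against a live trace, which is in the spirit of the low-impact, real-time collection the abstract describes. Whether the resulting groups have predictive value for a given workload would still need to be evaluated separately.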

Published in
ACM Transactions on Storage Volume 12, Issue 2
February 2016
134 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/2888404
Editor: Darrell D. E. Long, University of California Santa Cruz, USA
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher: Association for Computing Machinery, New York, NY, United States
Publication History
- Published: 1 February 2016
- Accepted: 1 February 2015
- Revised: 1 December 2014
- Received: 1 April 2014
Author Tags
Data layout
predictive modeling
storage optimization
tiered storage
Qualifiers
- research-article
- Research
- Refereed
Article Metrics
- Total Citations: 3
- Total Downloads: 377
- Downloads (Last 12 months): 10
- Downloads (Last 6 weeks): 3