Abstract
Storing large amounts of data for many different users has become the norm in modern distributed cloud storage environments. Storing data successfully requires balancing availability, reliability, cost, and performance. Typically, systems are designed to strike this balance with minimal information about the data that will pass through them. We propose a series of methods for deriving groupings from data that have predictive value and can inform layout decisions for data on disk.
Unlike previous grouping work, we focus on dynamically identifying groupings using spatiotemporal locality, working from data that can be gathered from active systems in real time with minimal impact. We outline several techniques we have developed and discuss how we select particular techniques for particular workloads and application domains. Our statistical and machine-learning-based grouping algorithms answer questions such as “What can a grouping be based on?” and “Is a given grouping meaningful for a given application?” We design our models to be flexible and to require minimal domain information so that our results are as broadly applicable as possible. We intend for this work to provide a launchpad for future specialized system design that uses groupings in combination with caching policies and architectural distinctions such as tiered storage to create the next generation of scalable storage systems.
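To make the idea of grouping by spatiotemporal locality concrete, the following is a minimal sketch and not the article's algorithm. It assumes a trace of `(timestamp, block_offset)` pairs sorted by time, and links two accesses whenever they fall within a time window and an offset distance of each other. The function name `group_accesses`, the thresholds `max_dt` and `max_doffset`, and the union-find merging are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def group_accesses(trace, max_dt=1.0, max_doffset=1024):
    """Group block offsets whose accesses are close in both time and space.

    trace: list of (timestamp_seconds, block_offset), sorted by timestamp.
    max_dt, max_doffset: hypothetical thresholds, not taken from the article.
    """
    # Union-find over trace indices; linked accesses end up in one group.
    parent = list(range(len(trace)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    for i, (ti, oi) in enumerate(trace):
        for j in range(i + 1, len(trace)):
            tj, oj = trace[j]
            if tj - ti > max_dt:
                break  # trace is time-sorted, so later accesses are farther away
            if abs(oj - oi) <= max_doffset:
                union(i, j)

    groups = defaultdict(set)
    for i, (_, offset) in enumerate(trace):
        groups[find(i)].add(offset)
    return sorted(sorted(g) for g in groups.values())


# Illustrative, made-up trace: two bursts of nearby accesses plus one late access.
trace = [(0.0, 100), (0.1, 120), (0.2, 5000), (0.3, 5010), (5.0, 110)]
print(group_accesses(trace, max_dt=1.0, max_doffset=64))
# → [[100, 120], [110], [5000, 5010]]
```

The late access to offset 110 is spatially close to the first group but falls outside the time window, so it stays separate. A sketch like this only shows what a grouping can be based on; whether such a grouping is meaningful for a given application is a separate question.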
Index Terms
- Can We Group Storage? Statistical Techniques to Identify Predictive Groupings in Storage System Accesses