
Can We Group Storage? Statistical Techniques to Identify Predictive Groupings in Storage System Accesses

Published: 01 February 2016

Abstract

Storing large amounts of data for different users has become the new normal in modern distributed cloud storage environments. Storing data successfully requires a balance of availability, reliability, cost, and performance. Typically, systems are designed for this balance with minimal information about the data that will pass through them. We propose a series of methods to derive groupings from data that have predictive value, informing layout decisions for data on disk.

Unlike previous grouping work, we focus on using spatiotemporal locality to dynamically identify groupings in data that can be gathered from active systems in real time with minimal impact. We outline several techniques we have developed and discuss how we select particular techniques for particular workloads and application domains. Our statistical and machine-learning-based grouping algorithms answer questions such as “What can a grouping be based on?” and “Is a given grouping meaningful for a given application?” We design our models to be flexible and to require minimal domain information so that our results are as broadly applicable as possible. We intend for this work to provide a launchpad for future specialized system design using groupings in combination with caching policies and architectural distinctions such as tiered storage to create the next generation of scalable storage systems.
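To make the idea of grouping by spatiotemporal locality concrete, the following is a minimal illustrative sketch and not the authors' algorithm. It assumes a trace of (timestamp, block offset) pairs and links two accesses whenever their scaled distance in time and offset falls under a threshold. The function name `group_accesses`, the scaling parameters, the threshold, and the sample trace are all hypothetical choices made for illustration.

```python
from itertools import combinations
from math import hypot


def group_accesses(accesses, time_scale=1.0, offset_scale=8.0, threshold=1.0):
    """Group accesses that are close in both time and block offset.

    accesses: list of (timestamp_seconds, block_offset) tuples.
    The scales and threshold are illustrative assumptions, not values
    taken from the article.
    """
    parent = list(range(len(accesses)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair whose scaled spatiotemporal distance is small.
    # This is quadratic in the trace length; acceptable for a sketch only.
    for i, j in combinations(range(len(accesses)), 2):
        (t1, o1), (t2, o2) = accesses[i], accesses[j]
        distance = hypot((t1 - t2) / time_scale, (o1 - o2) / offset_scale)
        if distance <= threshold:
            parent[find(i)] = find(j)

    # Collect the block offsets belonging to each connected component.
    groups = {}
    for i, (_, offset) in enumerate(accesses):
        groups.setdefault(find(i), set()).add(offset)
    return sorted(groups.values(), key=min)


# Hypothetical trace: two bursts of nearby accesses, then a late revisit.
trace = [(0.0, 100), (0.2, 104), (0.3, 101), (5.0, 9000), (5.1, 9004), (9.0, 100)]
groups = group_accesses(trace)
```

In this toy trace, the late access to offset 100 lands in its own group because it is far away in time from the first burst, which shows why a grouping built on locality depends on how time and offset are weighted against each other.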



          • Published in

            ACM Transactions on Storage, Volume 12, Issue 2
            February 2016
            134 pages
            ISSN: 1553-3077
            EISSN: 1553-3093
            DOI: 10.1145/2888404

            Copyright © 2016 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 1 February 2016
            • Accepted: 1 February 2015
            • Revised: 1 December 2014
            • Received: 1 April 2014

            Qualifiers

            • research-article
            • Research
            • Refereed
