DOI: 10.1145/1551609.1551642
research-article

The quest for scalable support of data-intensive workloads in distributed systems

Published: 11 June 2009

ABSTRACT

Data-intensive applications that analyze large datasets often require large amounts of compute and storage resources, for which data locality can be crucial to high throughput and performance. We propose a "data diffusion" approach that acquires compute and storage resources dynamically, replicates data in response to demand, and schedules computations close to data. As demand increases, more resources are acquired, allowing faster response to subsequent requests that refer to the same data; when demand drops, resources are released. Depending on workload and resource characteristics, this approach can provide the benefits of dedicated hardware without the associated high costs. To explore the feasibility of data diffusion, we offer both a theoretical and an empirical analysis. We define an abstract model for data diffusion, introduce new scheduling policies with heuristics to optimize real-world performance, and develop a competitive online cache eviction policy. We also present empirical experiments that explore the benefits of dynamically expanding and contracting resources based on load, to improve system responsiveness while keeping wasted resources small. We show performance improvements of one to two orders of magnitude across three diverse workloads compared to parallel file systems, with data diffusion throughput approaching 80 Gb/s on a modest cluster of 200 processors. We also compare data diffusion with a best model for active storage, contrasting the pull model found in data diffusion with the push model found in active storage.
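
The idea of scheduling computations close to data, backed by a per-node cache with eviction, can be illustrated with a toy sketch. This is not the paper's scheduler or its eviction policy: the `Node`, `dispatch`, and `run` names are hypothetical, plain LRU stands in for the competitive online eviction policy mentioned in the abstract, and dynamic resource acquisition and demand-driven replication are not modeled.

```python
from collections import OrderedDict


class Node:
    """Hypothetical compute node with a small LRU cache of file names."""

    def __init__(self, name, cache_slots):
        self.name = name
        self.cache_slots = cache_slots
        self.cache = OrderedDict()  # file name -> True, least recently used first
        self.queued = 0             # number of tasks assigned so far

    def access(self, file_name):
        """Return True on a cache hit; on a miss, cache the file (evicting LRU)."""
        if file_name in self.cache:
            self.cache.move_to_end(file_name)  # mark as most recently used
            return True
        if len(self.cache) >= self.cache_slots:
            self.cache.popitem(last=False)     # evict the least recently used file
        self.cache[file_name] = True           # stands in for a fetch from shared storage
        return False


def dispatch(nodes, file_name):
    """Prefer nodes that already cache the file; break ties by lowest load."""
    holders = [n for n in nodes if file_name in n.cache]
    candidates = holders if holders else nodes
    node = min(candidates, key=lambda n: n.queued)
    hit = node.access(file_name)
    node.queued += 1
    return node.name, hit


def run(workload, num_nodes=2, cache_slots=2):
    """Dispatch one task per requested file; return (node name, cache hit) pairs."""
    nodes = [Node(f"node{i}", cache_slots) for i in range(num_nodes)]
    return [dispatch(nodes, f) for f in workload]


if __name__ == "__main__":
    for node_name, hit in run(["a", "b", "a", "a", "c", "b"]):
        print(node_name, "hit" if hit else "miss")
```

In this sketch, repeated requests for the same file land on the node that already holds it, so only the first access pays the cost of going to shared storage. A purely locality-first rule like this can overload a popular node; replicating hot files to additional nodes, as the abstract describes, is one way to relieve that pressure.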

    Published in

      HPDC '09: Proceedings of the 18th ACM international symposium on High performance distributed computing
      June 2009
      237 pages
      ISBN: 9781605585871
      DOI: 10.1145/1551609

      Copyright © 2009 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 166 of 966 submissions, 17%
