ABSTRACT
Data-intensive applications that analyze large datasets often require large amounts of compute and storage resources, and data locality can be crucial to their throughput and performance. We propose a "data diffusion" approach that acquires compute and storage resources dynamically, replicates data in response to demand, and schedules computations close to data. As demand increases, more resources are acquired, allowing faster responses to subsequent requests that refer to the same data; when demand drops, resources are released. Depending on workload and resource characteristics, this approach can provide the benefits of dedicated hardware without the associated high costs. To explore the feasibility of data diffusion, we offer both a theoretical and an empirical analysis. We define an abstract model for data diffusion, introduce new scheduling policies with heuristics to optimize real-world performance, and develop a competitive online cache eviction policy. We also report empirical experiments that explore the benefits of dynamically expanding and contracting resources based on load, improving system responsiveness while keeping wasted resources small. Across three diverse workloads, we show performance improvements of one to two orders of magnitude over parallel file systems, with throughputs approaching 80 Gb/s on a modest cluster of 200 processors. We also compare data diffusion with a best-case model for active storage, contrasting the pull model found in data diffusion with the push model found in active storage.
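The core idea of scheduling computations close to data, with replicas pulled into local caches on demand, can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the scheduling or eviction policy developed in the paper: the names `Worker`, `pick_worker`, and `simulate` are hypothetical, plain LRU stands in for the cache eviction policy, round-robin stands in for the fallback placement, and dynamic acquisition and release of resources is not modelled.

```python
from collections import OrderedDict


class Worker:
    """Hypothetical compute node with a small local cache (LRU assumed)."""

    def __init__(self, name, cache_slots):
        # Assumption: cache_slots >= 1.
        self.name = name
        self.cache_slots = cache_slots
        self.cache = OrderedDict()  # file name -> True, least recent first

    def run(self, file_name):
        """Process one task on file_name; return True on a local cache hit."""
        if file_name in self.cache:
            self.cache.move_to_end(file_name)  # mark as most recently used
            return True
        # Miss: the file is pulled from persistent storage (not modelled)
        # and a replica is kept locally, evicting the least recent entry.
        if len(self.cache) >= self.cache_slots:
            self.cache.popitem(last=False)
        self.cache[file_name] = True
        return False


def pick_worker(workers, file_name, next_index):
    """Prefer a worker that already caches the file, else round-robin."""
    for worker in workers:
        if file_name in worker.cache:
            return worker
    return workers[next_index % len(workers)]


def simulate(requests, n_workers=2, cache_slots=2):
    """Return (cache hits, total requests) for a sequence of file requests."""
    workers = [Worker(f"w{i}", cache_slots) for i in range(n_workers)]
    hits = 0
    for i, file_name in enumerate(requests):
        worker = pick_worker(workers, file_name, i)
        hits += worker.run(file_name)
    return hits, len(requests)


if __name__ == "__main__":
    print(simulate(["a", "b", "a", "b", "c", "a"]))
```

In this sketch, repeated requests for the same file are routed to the worker that already holds a replica, so the hit count grows as popular data spreads across caches. Each miss represents a pull from shared storage by the worker that needs the data.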