The quest for scalable support of data-intensive workloads in distributed systems

Authors:
Ioan Raicu, University of Chicago, Chicago, IL, USA
Ian T. Foster, University of Chicago & Argonne National Laboratory, Chicago, IL, USA
Yong Zhao, Microsoft Corporation, Redmond, WA, USA
Philip Little, University of Notre Dame, Notre Dame, IN, USA
Christopher M. Moretti, University of Notre Dame, Notre Dame, IN, USA
Amitabh Chaudhary, University of Notre Dame, Notre Dame, IN, USA
Douglas Thain, University of Notre Dame, Notre Dame, IN, USA

HPDC '09: Proceedings of the 18th ACM international symposium on High performance distributed computing, June 2009, Pages 207–216
https://doi.org/10.1145/1551609.1551642

Published: 11 June 2009

ABSTRACT

Data-intensive applications involving the analysis of large datasets often require large amounts of compute and storage resources, for which data locality can be crucial to high throughput and performance. We propose a "data diffusion" approach that acquires compute and storage resources dynamically, replicates data in response to demand, and schedules computations close to data. As demand increases, more resources are acquired, thus allowing faster response to subsequent requests that refer to the same data; when demand drops, resources are released. This approach can provide the benefits of dedicated hardware without the associated high costs, depending on workload and resource characteristics. To explore the feasibility of data diffusion, we offer both a theoretical and an empirical analysis. We define an abstract model for data diffusion, introduce new scheduling policies with heuristics to optimize real-world performance, and develop a competitive online cache eviction policy. We also report empirical experiments that explore the benefits of dynamically expanding and contracting resources based on load, to improve system responsiveness while keeping wasted resources small. We show performance improvements of one to two orders of magnitude across three diverse workloads compared to parallel file systems, with throughputs approaching 80 Gb/s on a modest cluster of 200 processors. We also compare data diffusion with a best model for active storage, contrasting the pull model found in data diffusion with the push model found in active storage.
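The loop described in the abstract (acquire resources on demand, cache data where it is used, send later tasks to the node that already holds the data, release resources when demand drops) can be illustrated with a small sketch. This is not the paper's scheduler or its eviction policy: the `Node` and `Scheduler` classes, the `capacity` and `max_nodes` parameters, and the use of plain LRU eviction are all assumptions made for illustration.

```python
from collections import OrderedDict


class Node:
    """A compute node with a small local cache (LRU eviction is an assumption)."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity      # maximum number of cached objects
        self.cache = OrderedDict()    # object id -> True, oldest entry first

    def access(self, obj):
        """Return True on a cache hit; on a miss, cache the object locally."""
        if obj in self.cache:
            self.cache.move_to_end(obj)     # mark as most recently used
            return True
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used
        self.cache[obj] = True
        return False


class Scheduler:
    """Hypothetical dispatcher that prefers a node already caching the data."""

    def __init__(self, capacity, max_nodes):
        self.capacity = capacity
        self.max_nodes = max_nodes
        self.nodes = []
        self.hits = 0
        self.misses = 0

    def dispatch(self, obj):
        # 1. Data-aware placement: look for a node that holds the object.
        node = next((n for n in self.nodes if obj in n.cache), None)
        if node is None:
            if len(self.nodes) < self.max_nodes:
                # 2. Demand-driven growth: acquire one more node.
                node = Node(f"node{len(self.nodes)}", self.capacity)
                self.nodes.append(node)
            else:
                # 3. At the limit: use the node with the fewest cached objects.
                node = min(self.nodes, key=lambda n: len(n.cache))
        if node.access(obj):
            self.hits += 1
        else:
            self.misses += 1
        return node.name

    def release_all(self):
        """Release every node, as when demand drops; cached data is lost."""
        self.nodes.clear()


if __name__ == "__main__":
    sched = Scheduler(capacity=2, max_nodes=2)
    for obj in ["a", "b", "a", "c", "b", "a"]:
        print(obj, "->", sched.dispatch(obj))
    print("hits:", sched.hits, "misses:", sched.misses)
```

In this toy model a repeated request for the same object is routed to the node that cached it, so the hit rate grows with reuse in the workload. A real system would also have to weigh node load against locality and account for object sizes and transfer costs, none of which is modeled here.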

Index Terms

1. Information systems
  1. Information storage systems
    1. Storage management
      1. Hierarchical storage management


Published in
HPDC '09: Proceedings of the 18th ACM international symposium on High performance distributed computing
June 2009
237 pages
ISBN: 9781605585871
DOI: 10.1145/1551609
General Chairs:
Dieter Kranzlmüller, Ludwig-Maximilians-Universität München, Leibniz Supercomputing Centre, Germany
Arndt Bode, Technische Universität München, Germany
Heinz-Gerd Hegering, Leibniz Supercomputing Centre, Germany
Program Chairs:
Henri Casanova, University of Hawaii at Manoa, USA
Michael Gerndt, Technische Universität München, Germany
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data diffusion
data management
data-aware scheduling
falkon
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 166 of 966 submissions, 17%