short-paper

Using Straggler Replication to Reduce Latency in Large-scale Parallel Computing

Authors:
Da Wang, Two Sigma Investments, New York, NY
Gauri Joshi, MIT, Cambridge, MA
Gregory Wornell, MIT, Cambridge, MA
ACM SIGMETRICS Performance Evaluation Review, Volume 43, Issue 3, December 2015, pp. 7–11. https://doi.org/10.1145/2847220.2847223

Published: 19 November 2015

Abstract

In cloud computing jobs consisting of many tasks run in parallel, the tasks on the slowest machines (straggling tasks) become the bottleneck in the completion of the job. One way to combat the variability in machine response time is to add replicas of straggling tasks and wait for the earliest copy to finish. Using the theory of extreme order statistics, we analyze how task replication reduces latency, and its impact on the cost of computing resources. We also propose a heuristic algorithm to search for the best replication strategies when it is difficult to model the empirical behavior of task execution time and use the proposed analysis techniques. Evaluation of the heuristic policies on Google Trace data shows a significant latency reduction compared to the replication strategy used in MapReduce.
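
The latency and cost trade-off described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's analysis or its heuristic algorithm. It assumes one simple policy: wait until all but a fraction `p` of the tasks have finished, launch `r` extra copies of each remaining (straggling) task, keep the original running, and cancel all copies of a task as soon as one of them finishes. The names `simulate_job` and `average`, the parameters `p` and `r`, and the Pareto execution-time model are all assumptions made for illustration.

```python
import random


def simulate_job(n_tasks, p, r, draw, rng):
    """Simulate one job under a simple one-round replication policy.

    draw(rng) returns one task execution time (assumed i.i.d.).
    Returns (latency, cost): latency is the time the last task finishes,
    cost is the total machine time summed over all copies.
    """
    times = sorted(draw(rng) for _ in range(n_tasks))
    n_stragglers = int(round(p * n_tasks))
    if n_stragglers == 0 or r == 0:
        # No replication: the slowest task determines the latency.
        return times[-1], sum(times)

    n_done = n_tasks - n_stragglers
    # Replicas are launched when the last non-straggling task finishes.
    t_fork = times[n_done - 1] if n_done > 0 else 0.0
    latency = t_fork
    cost = sum(times[:n_done])
    for original in times[n_done:]:
        replicas = [t_fork + draw(rng) for _ in range(r)]
        finish = min([original] + replicas)  # earliest copy wins
        latency = max(latency, finish)
        # The original ran from 0 to finish; each replica from t_fork to finish.
        cost += finish + r * (finish - t_fork)
    return latency, cost


def average(n_jobs, n_tasks, p, r, seed=0):
    """Average latency and per-task cost over n_jobs simulated jobs."""
    rng = random.Random(seed)
    # Assumed heavy-tailed execution time: Pareto with shape 2 (values >= 1).
    draw = lambda g: g.paretovariate(2.0)
    results = [simulate_job(n_tasks, p, r, draw, rng) for _ in range(n_jobs)]
    mean_latency = sum(lat for lat, _ in results) / n_jobs
    mean_cost = sum(c for _, c in results) / (n_jobs * n_tasks)
    return mean_latency, mean_cost


if __name__ == "__main__":
    for p, r in [(0.0, 0), (0.1, 1), (0.1, 2)]:
        lat, cost = average(n_jobs=500, n_tasks=100, p=p, r=r)
        print(f"p={p:.1f} r={r}: latency={lat:.2f} cost/task={cost:.2f}")
```

Running the script prints the average latency and per-task cost for the baseline (`p=0`, `r=0`) and for two replication settings. Under a heavy-tailed model such as the assumed Pareto, replicating a small fraction of stragglers typically shortens the job considerably. With a different execution-time distribution the balance between latency and cost can shift.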

Published in
ACM SIGMETRICS Performance Evaluation Review Volume 43, Issue 3
December 2015
89 pages
ISSN:0163-5999
DOI:10.1145/2847220
Editor:
Nidhi Hegde
Bell Labs France, Alcatel-Lucent Centre de Villarceaux Route de Villejust, Nozay, France
Copyright © 2015 Authors
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
parallel computation
scheduling
task replication
Qualifiers
- short-paper
Article Metrics
- Total Citations: 90
- Total Downloads: 436
- Downloads (last 12 months): 37
- Downloads (last 6 weeks): 7