Abstract
In cloud computing jobs that consist of many tasks run in parallel, the tasks on the slowest machines (straggling tasks) become the bottleneck in completing the job. One way to combat the variability in machine response time is to launch replicas of straggling tasks and wait for the earliest copy to finish. Using the theory of extreme order statistics, we analyze how task replication reduces latency and how it affects the cost of computing resources. We also propose a heuristic algorithm that searches for the best replication strategies when the empirical behavior of task execution time is difficult to model and the proposed analysis techniques are therefore hard to apply. Evaluation of the heuristic policies on Google Trace data shows a significant latency reduction compared to the replication strategy used in MapReduce.
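The latency-cost trade-off described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the paper's analysis or its heuristic algorithm. It assumes a hypothetical single-fork policy: run one copy of each task, wait until a fraction `fork_fraction` of the tasks finish, then launch `replicas` extra copies of every unfinished task and keep the earliest copy. The function names, the Pareto execution-time model, and the cost definition (total machine time, with all copies of a task cancelled once one finishes) are assumptions made for illustration.

```python
import random


def simulate_job(n_tasks, replicas, fork_fraction, sample_time, rng):
    """Simulate one job and return (latency, cost).

    Assumed policy (illustrative only): fork once, after `fork_fraction`
    of the tasks have finished, adding `replicas` fresh copies of each
    unfinished task while the original copy keeps running.
    """
    # Execution times of the first copy of every task, smallest first.
    first = sorted(sample_time(rng) for _ in range(n_tasks))
    n_done = min(n_tasks, int(fork_fraction * n_tasks))
    if n_done == n_tasks or replicas == 0:
        # No replication: latency is the slowest task, cost is total time.
        return first[-1], sum(first)

    fork_time = first[n_done - 1] if n_done > 0 else 0.0
    latency = fork_time
    cost = sum(first[:n_done])
    for t in first[n_done:]:
        # Replicas start at fork_time; the task ends when any copy ends.
        copies = [t] + [fork_time + sample_time(rng) for _ in range(replicas)]
        finish = min(copies)
        latency = max(latency, finish)
        # Original copy ran for `finish`; each replica ran since fork_time.
        cost += finish + replicas * (finish - fork_time)
    return latency, cost


def average(n_trials, **kwargs):
    """Average latency and cost over independent simulated jobs."""
    results = [simulate_job(**kwargs) for _ in range(n_trials)]
    mean_latency = sum(r[0] for r in results) / n_trials
    mean_cost = sum(r[1] for r in results) / n_trials
    return mean_latency, mean_cost


def pareto_time(rng):
    """Assumed heavy-tailed execution time (Pareto, shape 2, minimum 1)."""
    return rng.paretovariate(2.0)


if __name__ == "__main__":
    rng = random.Random(0)
    for r in (0, 1, 2):
        lat, cost = average(
            2000, n_tasks=100, replicas=r, fork_fraction=0.9,
            sample_time=pareto_time, rng=rng,
        )
        print(f"replicas={r}: mean latency={lat:.2f}, mean cost={cost:.2f}")
```

The job latency is the maximum over tasks of the minimum over copies, which is the kind of quantity extreme order statistics describe. Varying `replicas` and `fork_fraction` in this sketch shows how earlier or more aggressive replication trades machine time against latency under the assumed execution-time model.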