ABSTRACT
FlexRR provides a scalable, efficient solution to the straggler problem for iterative machine learning (ML). The frequent (e.g., per iteration) barriers used in traditional BSP-based distributed ML implementations cause every transient slowdown of any worker thread to delay all others. FlexRR combines a more flexible synchronization model with dynamic peer-to-peer re-assignment of work among workers to address straggler threads. Experiments with real straggler behavior observed on Amazon EC2 and Microsoft Azure, as well as injected straggler behavior stress tests, confirm the significance of the problem and the effectiveness of FlexRR's solution. Using FlexRR, we consistently observe near-ideal run-times (relative to no performance jitter) across all real and injected straggler behaviors tested.
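The two ideas the abstract combines can be illustrated with a minimal single-process simulation. This is an illustrative sketch, not the paper's actual system or API: the names `Worker`, `slack`, and `helping`, and the toy cost model (a worker processes `speed` items per time step) are all assumptions made here. It contrasts (1) a flexible, SSP-style synchronization bound, where a worker may run up to `slack` iterations ahead of the slowest worker, with (2) peer-to-peer re-assignment, where a worker that hits the bound processes items on behalf of the straggler instead of idling.

```python
class Worker:
    """One simulated worker; `speed` = items it can process per time step."""
    def __init__(self, speed, items_per_iter):
        self.speed = speed
        self.iter = 0                    # iterations completed
        self.items_per_iter = items_per_iter
        self.remaining = items_per_iter  # items left in the current iteration

def step(workers, slack, helping):
    for w in workers:
        slowest = min(x.iter for x in workers)
        if w.iter - slowest >= slack:
            if not helping:
                continue                 # plain bounded staleness: just wait
            # Helping (in the spirit of FlexRR's re-assignment): instead of
            # idling at the staleness bound, process items for the straggler.
            straggler = min(workers, key=lambda x: x.iter)
            done = min(w.speed, straggler.remaining)
            straggler.remaining -= done
            if straggler.remaining <= 0:
                straggler.iter += 1
                straggler.remaining = straggler.items_per_iter
        else:
            w.remaining -= w.speed       # make progress on own iteration
            if w.remaining <= 0:
                w.iter += 1
                w.remaining = w.items_per_iter

def run(speeds, slack, target_iters, helping, items_per_iter=20):
    """Time steps until every worker has completed `target_iters` iterations."""
    workers = [Worker(s, items_per_iter) for s in speeds]
    t = 0
    while min(w.iter for w in workers) < target_iters:
        step(workers, slack, helping)
        t += 1
    return t
```

With two fast workers and one straggler, e.g. `run([10, 10, 2], slack=1, target_iters=3, helping=True)`, helping finishes well before the no-helping run, since without re-assignment the straggler alone bounds overall progress, which is the effect the abstract describes.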