DOI: 10.1145/2987550.2987554

Addressing the straggler problem for iterative convergent parallel ML

Published: 05 October 2016

ABSTRACT

FlexRR provides a scalable, efficient solution to the straggler problem for iterative machine learning (ML). The frequent (e.g., per iteration) barriers used in traditional BSP-based distributed ML implementations cause every transient slowdown of any worker thread to delay all others. FlexRR combines a more flexible synchronization model with dynamic peer-to-peer re-assignment of work among workers to address straggler threads. Experiments with real straggler behavior observed on Amazon EC2 and Microsoft Azure, as well as injected straggler behavior stress tests, confirm the significance of the problem and the effectiveness of FlexRR's solution. Using FlexRR, we consistently observe near-ideal run-times (relative to no performance jitter) across all real and injected straggler behaviors tested.
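
As a rough illustration of the two ideas the abstract combines, the following Python sketch shows (1) a flexible, bounded-staleness progress check that replaces a strict per-iteration barrier and (2) a simple peer-to-peer re-assignment of pending work from a straggling worker to a faster helper. This is a minimal sketch, not FlexRR's implementation; the `Worker` class and the `STALENESS_BOUND` and `OFFLOAD_FRACTION` parameters are illustrative assumptions.

```python
# Minimal sketch (assumed names, not FlexRR's API) of flexible synchronization
# plus peer-to-peer work re-assignment for straggler mitigation.
from dataclasses import dataclass, field

STALENESS_BOUND = 2      # max iterations a worker may run ahead of the slowest peer
OFFLOAD_FRACTION = 0.25  # share of remaining work a straggler hands to a helper

@dataclass
class Worker:
    wid: int
    clock: int = 0                                   # completed iterations
    work_items: list = field(default_factory=list)   # pending work for the current iteration

    def may_advance(self, peer_clocks):
        # Flexible synchronization: proceed unless this worker is more than
        # STALENESS_BOUND iterations ahead of the slowest peer.
        return self.clock - min(peer_clocks) <= STALENESS_BOUND

    def maybe_offload(self, helper):
        # Peer-to-peer re-assignment: if this worker has noticeably more
        # unfinished work than the helper, move a slice of it to the helper.
        if len(self.work_items) > 2 * len(helper.work_items):
            n = max(1, int(len(self.work_items) * OFFLOAD_FRACTION))
            moved, self.work_items = self.work_items[-n:], self.work_items[:-n]
            helper.work_items.extend(moved)
            return n
        return 0

# Toy usage: worker 0 is a straggler with many pending items.
w0 = Worker(0, clock=3, work_items=list(range(40)))
w1 = Worker(1, clock=5, work_items=list(range(10)))
print(w1.may_advance([w0.clock, w1.clock]))          # True: within the staleness bound
print(w0.maybe_offload(w1), "items re-assigned to worker 1")
```

In spirit, the staleness check lets fast workers keep iterating instead of stalling at a barrier, while the offload step shortens the slowest worker's iteration so transient slowdowns do not delay the whole computation.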


Published in

SoCC '16: Proceedings of the Seventh ACM Symposium on Cloud Computing
October 2016, 534 pages
ISBN: 9781450345255
DOI: 10.1145/2987550
Copyright © 2016 ACM

Publisher: Association for Computing Machinery, New York, NY, United States



Acceptance Rates

SoCC '16 paper acceptance rate: 38 of 151 submissions, 25%
Overall acceptance rate: 169 of 722 submissions, 23%
