ABSTRACT
FlexRR provides a scalable, efficient solution to the straggler problem for iterative machine learning (ML). The frequent (e.g., per iteration) barriers used in traditional BSP-based distributed ML implementations cause every transient slowdown of any worker thread to delay all others. FlexRR combines a more flexible synchronization model with dynamic peer-to-peer re-assignment of work among workers to address straggler threads. Experiments with real straggler behavior observed on Amazon EC2 and Microsoft Azure, as well as injected straggler behavior stress tests, confirm the significance of the problem and the effectiveness of FlexRR's solution. Using FlexRR, we consistently observe near-ideal run-times (relative to no performance jitter) across all real and injected straggler behaviors tested.
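The two ideas the abstract combines can be illustrated with a minimal single-process simulation. This is an illustrative sketch, not the paper's actual system or API: the names `Worker`, `slack`, and `helping`, and the toy cost model (a worker processes `speed` items per time step) are all assumptions made here. It contrasts (1) a flexible, SSP-style synchronization bound, where a worker may run up to `slack` iterations ahead of the slowest worker, with (2) peer-to-peer re-assignment, where a worker that hits the bound processes items on behalf of the straggler instead of idling.

```python
class Worker:
    """One simulated worker; `speed` = items it can process per time step."""
    def __init__(self, speed, items_per_iter):
        self.speed = speed
        self.iter = 0                    # iterations completed
        self.items_per_iter = items_per_iter
        self.remaining = items_per_iter  # items left in the current iteration

def step(workers, slack, helping):
    for w in workers:
        slowest = min(x.iter for x in workers)
        if w.iter - slowest >= slack:
            if not helping:
                continue                 # plain bounded staleness: just wait
            # Helping (in the spirit of FlexRR's re-assignment): instead of
            # idling at the staleness bound, process items for the straggler.
            straggler = min(workers, key=lambda x: x.iter)
            done = min(w.speed, straggler.remaining)
            straggler.remaining -= done
            if straggler.remaining <= 0:
                straggler.iter += 1
                straggler.remaining = straggler.items_per_iter
        else:
            w.remaining -= w.speed       # make progress on own iteration
            if w.remaining <= 0:
                w.iter += 1
                w.remaining = w.items_per_iter

def run(speeds, slack, target_iters, helping, items_per_iter=20):
    """Time steps until every worker has completed `target_iters` iterations."""
    workers = [Worker(s, items_per_iter) for s in speeds]
    t = 0
    while min(w.iter for w in workers) < target_iters:
        step(workers, slack, helping)
        t += 1
    return t
```

With two fast workers and one straggler, e.g. `run([10, 10, 2], slack=1, target_iters=3, helping=True)`, helping finishes well before the no-helping run, since without re-assignment the straggler alone bounds overall progress, which is the effect the abstract describes.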