Research Article | Public Access
DOI: 10.1145/3302424.3303953

Supporting Very Large Models using Automatic Dataflow Graph Partitioning

Published: 25 March 2019

ABSTRACT

This paper presents Tofu, a system that partitions very large DNN models across multiple GPU devices to reduce per-GPU memory footprint. Tofu is designed to partition a dataflow graph of fine-grained tensor operators used by platforms like MXNet and TensorFlow. To automatically partition each operator, we propose describing the semantics of an operator in a simple language inspired by Halide. To optimally partition different operators in a dataflow graph, Tofu uses a recursive search algorithm that minimizes the total communication cost. Our experiments on an 8-GPU machine show that Tofu enables the training of very large CNN and RNN models. It also achieves a 25%-400% speedup over alternative approaches to training very large models.
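
A minimal sketch of the decision Tofu automates is given below, assuming a two-GPU split of a toy two-operator graph. This is not Tofu's actual API: the OpSpec class, the operator names, and the communication-cost numbers are hypothetical stand-ins. In the real system, the per-split costs are derived automatically from each operator's Halide-inspired description, and the search is recursive so a graph can be partitioned across many GPUs; the sketch simply brute-forces the split dimension for each operator and keeps the assignment with the lowest total communication cost.

# Toy illustration (hypothetical names and cost numbers, not Tofu's API):
# pick a partition dimension for each operator in a tiny graph so that
# the total communication between two GPU workers is minimized.

import itertools

class OpSpec:
    """One operator: its output shape and, for each way of splitting that
    output across 2 workers, how many elements must be fetched from the
    remote worker. Tofu derives such costs from an operator's per-element
    description; here they are hard-coded for illustration."""
    def __init__(self, name, out_shape, comm_cost_per_split):
        self.name = name
        self.out_shape = out_shape                       # (rows, cols)
        self.comm_cost_per_split = comm_cost_per_split   # split dim -> cost

# Tiny graph: Y = X @ W ; Z = relu(Y), with X of shape (1024, 4096) and
# W of shape (4096, 4096). Splitting the matmul output along the batch
# dimension means each worker needs all of W; splitting along the hidden
# dimension means each worker needs all of X.
matmul = OpSpec("matmul", (1024, 4096),
                {"batch": 4096 * 4096,     # fetch/replicate W
                 "hidden": 1024 * 4096})   # gather X
relu = OpSpec("relu", (1024, 4096),
              {"batch": 0, "hidden": 0})   # elementwise, no remote reads

def reshuffle_cost(producer, dim_prod, dim_cons):
    """If consecutive operators pick different split dimensions, half of
    the intermediate tensor must cross the worker boundary."""
    if dim_prod == dim_cons:
        return 0
    rows, cols = producer.out_shape
    return rows * cols // 2

def total_cost(assignment, ops):
    cost = sum(op.comm_cost_per_split[d] for op, d in zip(ops, assignment))
    for i in range(len(ops) - 1):
        cost += reshuffle_cost(ops[i], assignment[i], assignment[i + 1])
    return cost

ops = [matmul, relu]
best = min(itertools.product(["batch", "hidden"], repeat=len(ops)),
           key=lambda a: total_cost(a, ops))
print("chosen splits:", dict(zip((op.name for op in ops), best)))
print("communication cost:", total_cost(best, ops))

Even in this toy setting, the search prefers splitting along the hidden dimension because the weight matrix is the largest tensor; the paper's contribution is deriving such costs automatically from operator descriptions and applying the search recursively over the whole dataflow graph.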

      • Published in

        EuroSys '19: Proceedings of the Fourteenth EuroSys Conference 2019
        March 2019
        714 pages
        ISBN: 9781450362818
        DOI: 10.1145/3302424

        Copyright © 2019 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Acceptance Rates

        Overall Acceptance Rate: 241 of 1,308 submissions, 18%
