DOI: 10.1145/2908080.2908105

Latte: a language, compiler, and runtime for elegant and efficient deep neural networks

Published: 02 June 2016

ABSTRACT

Deep neural networks (DNNs) have undergone a surge in popularity with consistent advances in the state of the art for tasks including image recognition, natural language processing, and speech recognition. The computationally expensive nature of these networks has led to the proliferation of implementations that sacrifice abstraction for high performance. In this paper, we present Latte, a domain-specific language for DNNs that provides a natural abstraction for specifying new layers without sacrificing performance. Users of Latte express DNNs as ensembles of neurons with connections between them. The Latte compiler synthesizes a program based on the user specification, applies a suite of domain-specific and general optimizations, and emits efficient machine code for heterogeneous architectures. Latte also includes a communication runtime for distributed-memory data parallelism. Using networks described in Latte, we demonstrate a 3-6x speedup over Caffe (C++/MKL) on the three state-of-the-art ImageNet models executing on an Intel Xeon E5-2699 v3 x86 CPU.
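The Latte DSL itself is not shown on this page. As a rough illustration of the "ensembles of neurons with connections" abstraction described above, the following Python sketch models a fully connected layer as an ensemble whose neurons each hold explicit incoming connections. All names here (Ensemble, connect_all_to_all, forward) are hypothetical and are not Latte syntax; in Latte, the compiler synthesizes and optimizes the equivalent computation rather than interpreting it naively as done below.

    # Conceptual sketch only -- not Latte syntax. It models a layer as an
    # ensemble of neurons with explicit connections, the abstraction the
    # abstract describes; a compiler like Latte's would synthesize, fuse,
    # and vectorize this computation instead of interpreting it naively.
    import math
    import random

    class Ensemble:
        """A group of neurons; each neuron records its incoming connections."""
        def __init__(self, size):
            self.size = size
            self.value = [0.0] * size
            # incoming[j] is a list of (source_ensemble, source_index, weight)
            self.incoming = [[] for _ in range(size)]

    def connect_all_to_all(src, dst):
        """Fully connect src to dst with small random weights (hypothetical helper)."""
        for j in range(dst.size):
            for i in range(src.size):
                dst.incoming[j].append((src, i, random.uniform(-0.1, 0.1)))

    def forward(dst, activation=math.tanh):
        """Naive per-neuron forward pass over each neuron's incoming connections."""
        for j in range(dst.size):
            total = sum(w * src.value[i] for (src, i, w) in dst.incoming[j])
            dst.value[j] = activation(total)

    # Usage: a tiny two-ensemble network.
    inputs = Ensemble(4)
    hidden = Ensemble(3)
    connect_all_to_all(inputs, hidden)
    inputs.value = [0.5, -0.2, 0.1, 0.9]
    forward(hidden)
    print(hidden.value)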


Published in

PLDI '16: Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2016, 726 pages. ISBN: 978-1-4503-4261-2. DOI: 10.1145/2908080. General Chair: Chandra Krintz; Program Chair: Emery Berger.

Also published in ACM SIGPLAN Notices, Volume 51, Issue 6 (PLDI '16), June 2016, 726 pages. ISSN: 0362-1340; EISSN: 1558-1160. DOI: 10.1145/2980983. Editor: Andy Gill.

Copyright © 2016 ACM


Publisher

Association for Computing Machinery, New York, NY, United States
