ABSTRACT
Deep neural networks (DNNs) have undergone a surge in popularity with consistent advances in the state of the art for tasks including image recognition, natural language processing, and speech recognition. The computationally expensive nature of these networks has led to the proliferation of implementations that sacrifice abstraction for high performance. In this paper, we present Latte, a domain-specific language for DNNs that provides a natural abstraction for specifying new layers without sacrificing performance. Users of Latte express DNNs as ensembles of neurons with connections between them. The Latte compiler synthesizes a program based on the user specification, applies a suite of domain-specific and general optimizations, and emits efficient machine code for heterogeneous architectures. Latte also includes a communication runtime for distributed memory data-parallelism. Using networks written in Latte, we demonstrate a 3-6x speedup over Caffe (C++/MKL) on the three state-of-the-art ImageNet models executing on an Intel Xeon E5-2699 v3 x86 CPU.
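The programming model sketched in the abstract, DNNs expressed as ensembles of neurons with connections between them, which the compiler then lowers to optimized code, can be pictured with a small example. The following is a hypothetical Python sketch of that abstraction only; it is not Latte's actual syntax (Latte itself is a Julia-embedded DSL described in the paper), and the names Ensemble, Neuron, and connect are assumptions made for illustration.

```python
# Hypothetical sketch of an "ensembles of neurons with connections" abstraction.
# Not Latte's real API; Ensemble, Neuron, and connect are illustrative names.
from dataclasses import dataclass, field


@dataclass
class Neuron:
    # Per-neuron state; a real layer would also carry weights and gradients.
    value: float = 0.0
    inputs: list = field(default_factory=list)  # indices of connected source neurons


@dataclass
class Ensemble:
    # A named, fixed-size collection of identical neurons (one "layer").
    name: str
    size: int
    neurons: list = field(default_factory=list)

    def __post_init__(self):
        self.neurons = [Neuron() for _ in range(self.size)]


def connect(source: Ensemble, sink: Ensemble, mapping):
    """Record, for every sink neuron i, which source neurons feed it.

    mapping(i) returns the source indices for sink neuron i; a compiler could
    analyze such a mapping to choose a dense, convolutional, or sparse
    code-generation strategy.
    """
    for i, neuron in enumerate(sink.neurons):
        neuron.inputs.extend(mapping(i))


# Example: a fully connected mapping between a 4-neuron and a 2-neuron ensemble.
hidden = Ensemble("hidden", size=4)
output = Ensemble("output", size=2)
connect(hidden, output, mapping=lambda i: range(hidden.size))

print([n.inputs for n in output.neurons])  # [[0, 1, 2, 3], [0, 1, 2, 3]]
```

In a system like the one the abstract describes, the connection mapping supplied by the user is the specification the compiler analyzes in order to synthesize efficient code and apply its domain-specific and general optimizations.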
REFERENCES
- Effective Use of the Intel Compiler’s Offload Features. URL https://software.intel.com/en-us/articles/effective-use-of-the-intel-compilers-offload-features.
- Intel Math Kernel Library. Reference Manual. Intel Corporation, Santa Clara, USA, 2009. ISBN 630813-054US.
- Torch NN. https://github.com/torch/nn, 2015.
- M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/.
- A. Agarwal, E. Akchurin, C. Basoglu, G. Chen, S. Cyphers, J. Droppo, A. Eversole, B. Guenter, M. Hillebrand, R. Hoens, X. Huang, Z. Huang, V. Ivanov, A. Kamenev, P. Kranen, O. Kuchaiev, W. Manousek, A. May, B. Mitra, O. Nano, G. Navarro, A. Orlov, M. Padmilac, H. Parthasarathi, B. Peng, A. Reznichenko, F. Seide, M. L. Seltzer, M. Slaney, A. Stolcke, Y. Wang, H. Wang, K. Yao, D. Yu, Y. Zhang, and G. Zweig. An introduction to computational networks and the computational network toolkit. Technical Report MSR-TR-2014-112, August 2014. URL http://research.microsoft.com/apps/pubs/default.aspx?id=226641.
- S. Amari. Backpropagation and stochastic gradient descent method. Neurocomputing, 5(4):185–196, 1993. ISSN 0925-2312. doi: 10.1016/0925-2312(93)90006-O. URL http://www.sciencedirect.com/science/article/pii/092523129390006O.
- A. Ashari, S. Tatikonda, M. Boehm, B. Reinwald, K. Campbell, J. Keenleyside, and P. Sadayappan. On optimizing machine learning workloads via kernel fusion. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2015, pages 173–182, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3205-7. URL http://doi.acm.org/10.1145/2688500.2688521.
- T. Bekolay, J. Bergstra, E. Hunsberger, T. DeWolf, T. C. Stewart, D. Rasmussen, X. Choo, A. R. Voelker, and C. Eliasmith. Nengo: a Python tool for building large-scale functional brain models. Frontiers in Neuroinformatics, 7, 2013.
- G. Belter, E. R. Jessup, I. Karlin, and J. G. Siek. Automating the generation of composed linear algebra kernels. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, page 59. ACM, 2009.
- J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral Presentation.
- J. Bezanson, S. Karpinski, V. B. Shah, and A. Edelman. Julia: A Fast Dynamic Language for Technical Computing. CoRR, abs/1209.5145, 2012. URL http://arxiv.org/abs/1209.5145.
- B. Catanzaro, S. Kamil, Y. Lee, J. Demmel, K. Keutzer, J. Shalf, K. Yelick, and A. Fox. SEJITS: Getting productivity and performance with selective embedded JIT specialization.
- H. Chafi, A. K. Sujeeth, K. J. Brown, H. Lee, A. R. Atreya, and K. Olukotun. A domain-specific approach to heterogeneous parallelism. ACM SIGPLAN Notices, 46(8):35–46, 2011.
- S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer. cuDNN: Efficient Primitives for Deep Learning. CoRR, abs/1410.0759, 2014. URL http://arxiv.org/abs/1410.0759.
- T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. Project Adam: Building an Efficient and Scalable Deep Learning Training System. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 571–582, Broomfield, CO, Oct. 2014. USENIX Association. ISBN 978-1-931971-16-4. URL https://www.usenix.org/conference/osdi14/technical-sessions/presentation/chilimbi.
- S. Chintala. Convnet Benchmarks. https://github.com/soumith/convnet-benchmarks, 2015.
- R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A MATLAB-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.
- D. Das, S. Avancha, D. Mudigere, K. Vaidyanathan, S. Sridharan, D. D. Kalamkar, B. Kaul, and P. Dubey. Distributed deep learning using synchronous stochastic gradient descent. CoRR, abs/1602.06709, 2016. URL http://arxiv.org/abs/1602.06709.
- J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.
- J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.
- M. Dukhan. NNPACK. https://github.com/Maratyszcza/NNPACK, 2016.
- F. Gers. Long short-term memory in recurrent neural networks.
- X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
- I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. arXiv preprint arXiv:1302.4389, 2013.
- A. Graves, A.-r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6645–6649. IEEE, 2013.
- K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. CoRR, abs/1502.01852, 2015. URL http://arxiv.org/abs/1502.01852.
- S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
- Intel. Intel Data Analytics Acceleration Library (DAAL). https://software.intel.com/en-us/intel-daal, 2015.
- Intel Labs. ParallelAccelerator.jl. https://github.com/IntelLabs/ParallelAccelerator.jl, 2015.
- S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. CoRR, abs/1502.03167, 2015. URL http://arxiv.org/abs/1502.03167.
- Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093, 2014.
- K. Kennedy and J. R. Allen. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2002. ISBN 1-55860-286-0.
- A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. CoRR, abs/1404.5997, 2014. URL http://arxiv.org/abs/1404.5997.
- A. Krizhevsky. cuda-convnet2. https://github.com/akrizhevsky/cuda-convnet2, 2015.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
- Microsoft. CNTK. https://github.com/Microsoft/CNTK/tree/7d3e84e7733c1c965d995e28ff4bac60f166a03b, 2015.
- J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Notices, 48(6):519–530, 2013.
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
- P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. CoRR, abs/1312.6229, 2013. URL http://arxiv.org/abs/1312.6229.
- K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR, abs/1409.1556, 2014.
- Skymind. Deep Learning for Java (DL4J). http://deeplearning4j.org/architecture.html, 2015.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014. URL http://arxiv.org/abs/1409.4842.
- T. Tieleman and G. Hinton. Lecture 6.5-rmsprop. COURSERA: Neural Networks for Machine Learning, 2012.
- R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun. Deep Image: Scaling up Image Recognition. CoRR, abs/1501.02876, 2015. URL http://arxiv.org/abs/1501.02876.
- C. Zhang. Mocha.jl. https://github.com/pluskid/Mocha.jl, 2015.