Abstract
Deep convolutional neural networks (CNNs) obtain outstanding results in tasks that require human-level understanding of data, such as image or speech recognition. However, their computational load is significant, motivating the development of CNN-specialized accelerators. This work presents NEURAghe, a flexible and efficient hardware/software solution for the acceleration of CNNs on Zynq SoCs. NEURAghe leverages the synergistic use of the Zynq ARM cores and of a powerful, flexible Convolution-Specific Processor deployed on the reconfigurable logic. The Convolution-Specific Processor embeds both a convolution engine and a programmable soft core, relieving the ARM processors of most supervision duties and allowing the accelerator to be controlled by software at ultra-fine granularity. This approach opens the way for cooperative heterogeneous computing: while the accelerator takes care of the bulk of the CNN workload, the ARM cores can seamlessly execute the hard-to-accelerate parts of the computational graph, taking advantage of their NEON vector engines to further speed up computation. Through the companion NeuDNN software stack, NEURAghe supports end-to-end CNN-based classification with a peak performance of 169 GOps/s and an energy efficiency of 17 GOps/W. Thanks to our heterogeneous computing model, our platform improves upon the state of the art, achieving a frame rate of 5.5 frames per second (fps) on the end-to-end execution of VGG-16 and 6.6 fps on ResNet-18.
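The reported frame rates can be related to the 169 GOps/s peak with a quick back-of-envelope check. The per-frame workload figures below are commonly cited estimates for 224×224 inputs (roughly 30.7 GOps for VGG-16 and 3.6 GOps for ResNet-18), not numbers taken from the paper, so this is only an illustrative sketch:

```python
# Back-of-envelope check of the abstract's throughput numbers.
# Per-frame workloads are assumed (common estimates for 224x224 inputs),
# not figures reported by the paper.
GOPS_PER_FRAME = {"VGG-16": 30.7, "ResNet-18": 3.6}  # assumed workloads
REPORTED_FPS = {"VGG-16": 5.5, "ResNet-18": 6.6}     # from the abstract

def effective_gops(net):
    """Effective end-to-end throughput implied by the reported frame rate."""
    return GOPS_PER_FRAME[net] * REPORTED_FPS[net]

for net in REPORTED_FPS:
    print(f"{net}: ~{effective_gops(net):.0f} GOps/s effective "
          f"(peak: 169 GOps/s)")
```

Under these assumed workloads, VGG-16 at 5.5 fps implies an effective throughput close to the 169 GOps/s peak, while ResNet-18, with far fewer operations per frame, sustains a lower effective rate, consistent with its layers being less amenable to dense convolution acceleration.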
NEURAghe: Exploiting CPU-FPGA Synergies for Efficient and Flexible CNN Inference Acceleration on Zynq SoCs