Abstract
Deep convolutional neural networks (CNNs) obtain outstanding results in tasks that require human-level understanding of data, such as image or speech recognition. However, their computational load is significant, motivating the development of CNN-specialized accelerators. This work presents NEURAghe, a flexible and efficient hardware/software solution for the acceleration of CNNs on Zynq SoCs. NEURAghe leverages the synergistic use of the Zynq ARM cores and of a powerful, flexible Convolution-Specific Processor deployed on the reconfigurable logic. The Convolution-Specific Processor embeds both a convolution engine and a programmable soft core, relieving the ARM processors of most supervision duties and allowing the accelerator to be controlled by software at ultra-fine granularity. This approach opens the way for cooperative heterogeneous computing: while the accelerator takes care of the bulk of the CNN workload, the ARM cores can seamlessly execute the hard-to-accelerate parts of the computational graph, taking advantage of their NEON vector engines to further speed up computation. Through the companion NeuDNN software stack, NEURAghe supports end-to-end CNN-based classification with a peak performance of 169 GOps/s and an energy efficiency of 17 GOps/W. Thanks to our heterogeneous computing model, our platform improves upon the state of the art, achieving a frame rate of 5.5 frames per second (fps) on the end-to-end execution of VGG-16 and 6.6 fps on ResNet-18.
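The reported frame rates can be related to the 169 GOps/s peak with a quick back-of-envelope check. The per-frame workload figures below are commonly cited estimates for 224×224 inputs (roughly 30.7 GOps for VGG-16 and 3.6 GOps for ResNet-18), not numbers taken from the paper, so this is only an illustrative sketch:

```python
# Back-of-envelope check of the abstract's throughput numbers.
# Per-frame workloads are assumed (common estimates for 224x224 inputs),
# not figures reported by the paper.
GOPS_PER_FRAME = {"VGG-16": 30.7, "ResNet-18": 3.6}  # assumed workloads
REPORTED_FPS = {"VGG-16": 5.5, "ResNet-18": 6.6}     # from the abstract

def effective_gops(net):
    """Effective end-to-end throughput implied by the reported frame rate."""
    return GOPS_PER_FRAME[net] * REPORTED_FPS[net]

for net in REPORTED_FPS:
    print(f"{net}: ~{effective_gops(net):.0f} GOps/s effective "
          f"(peak: 169 GOps/s)")
```

Under these assumed workloads, VGG-16 at 5.5 fps implies an effective throughput close to the 169 GOps/s peak, while ResNet-18, with far fewer operations per frame, sustains a lower effective rate, consistent with its layers being less amenable to dense convolution acceleration.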
NEURAghe: Exploiting CPU-FPGA Synergies for Efficient and Flexible CNN Inference Acceleration on Zynq SoCs