
NEURAghe: Exploiting CPU-FPGA Synergies for Efficient and Flexible CNN Inference Acceleration on Zynq SoCs

Published: 12 December 2018

Abstract

Deep convolutional neural networks (CNNs) obtain outstanding results in tasks that require human-level understanding of data, like image or speech recognition. However, their computational load is significant, motivating the development of CNN-specialized accelerators. This work presents NEURAghe, a flexible and efficient hardware/software solution for the acceleration of CNNs on Zynq SoCs. NEURAghe leverages the synergistic usage of Zynq ARM cores and of a powerful and flexible Convolution-Specific Processor deployed on the reconfigurable logic. The Convolution-Specific Processor embeds both a convolution engine and a programmable soft core, releasing the ARM processors from most of the supervision duties and allowing the accelerator to be controlled by software at an ultra-fine granularity. This methodology opens the way for cooperative heterogeneous computing: while the accelerator takes care of the bulk of the CNN workload, the ARM cores can seamlessly execute hard-to-accelerate parts of the computational graph, taking advantage of the NEON vector engines to further speed up computation. Through the companion NeuDNN SW stack, NEURAghe supports end-to-end CNN-based classification with a peak performance of 169 GOps/s and an energy efficiency of 17 GOps/W. Thanks to our heterogeneous computing model, our platform improves upon the state-of-the-art, achieving a frame rate of 5.5 frames per second (fps) on the end-to-end execution of VGG-16 and 6.6 fps on ResNet-18.
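The headline numbers in the abstract can be sanity-checked with a simple compute-bound model: peak throughput divided by per-frame workload gives an upper bound on frame rate. The per-network workload figures below are commonly cited approximations (not taken from this paper), so treat this as a hedged back-of-envelope sketch rather than a reproduction of the authors' methodology.

```python
# Back-of-envelope frame-rate bound: peak throughput / per-frame workload.
# Workload figures are widely used approximations, not values from the paper.

def fps_compute_bound(peak_gops_per_s: float, workload_gops: float) -> float:
    """Upper bound on frame rate if the platform sustained peak on every op."""
    return peak_gops_per_s / workload_gops

PEAK = 169.0  # GOps/s, NEURAghe peak reported in the abstract

# ~30.9 GOps per VGG-16 inference at 224x224 (approx. 15.5 GMACs x 2)
vgg16_bound = fps_compute_bound(PEAK, 30.9)
# ~3.6 GOps per ResNet-18 inference (approx. 1.8 GMACs x 2)
resnet18_bound = fps_compute_bound(PEAK, 3.6)

print(f"VGG-16 compute bound:    {vgg16_bound:.1f} fps")
print(f"ResNet-18 compute bound: {resnet18_bound:.1f} fps")
```

Under these assumptions, the VGG-16 bound lands close to the reported 5.5 fps, suggesting the convolution-dominated VGG-16 runs near peak, while the reported 6.6 fps for ResNet-18 sits well below its compute bound, consistent with smaller layers and hard-to-accelerate operations (handled on the ARM cores) limiting utilization.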



Published in

ACM Transactions on Reconfigurable Technology and Systems, Volume 11, Issue 3
Special Issue on Deep Learning on FPGAs
September 2018, 187 pages
ISSN: 1936-7406
EISSN: 1936-7414
DOI: 10.1145/3299999
Editor: Steve Wilton

          Copyright © 2018 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

Publication History

• Published: 12 December 2018
• Accepted: 1 August 2018
• Revised: 1 June 2018
• Received: 1 December 2017


          Qualifiers

          • research-article
          • Research
          • Refereed
