DOI: 10.1145/3394885.3431534 · ASPDAC Conference Proceedings · research-article

Accelerate Non-unit Stride Convolutions with Winograd Algorithms

Published: 29 January 2021

ABSTRACT

While computer vision tasks target increasingly challenging scenarios, the demand for real-time image processing grows as well, requiring more efficient methods to accelerate convolutional neural networks. For unit-stride convolutions, FFT-based methods and Winograd algorithms are used to compute matrix convolutions, effectively lowering the computational complexity by reducing the number of multiplications. For non-unit-stride convolutions, those algorithms usually cannot be applied directly. In this work, we propose a novel universal approach that constructs non-unit-stride convolution algorithms from Winograd algorithms for any given stride and filter size. Specifically, we first demonstrate the steps to decompose an arbitrary convolutional kernel and apply Winograd algorithms separately to compute non-unit-stride convolutions. We then present the derivation of this method and a proof by construction to confirm its validity. Finally, we discuss the minimum numbers of multiplications and additions necessary for non-unit-stride convolutions and evaluate the performance of the decomposed Winograd algorithms. Our analysis of the computational complexity shows that the new approach requires 1.5x to 3x fewer multiplications. In experiments on real DNN layers, the Winograd algorithms achieve around a 1.3x speedup (Told/Tnew) over the conventional convolution algorithm across various experimental settings.
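To illustrate the decomposition idea summarized above, here is a minimal NumPy sketch (not the paper's implementation; function names and sizes are chosen for the example): a stride-2 1-D convolution with a 6-tap filter is split into even and odd phases, each of which is a unit-stride 3-tap convolution computable with the standard Winograd F(2, 3) algorithm, which uses 4 multiplications per 2 outputs instead of 6.

```python
import numpy as np

def stride2_conv_direct(x, w):
    # Reference: stride-2 valid cross-correlation, y[n] = sum_k x[2n+k] * w[k]
    out_len = (len(x) - len(w)) // 2 + 1
    return np.array([np.dot(x[2 * n:2 * n + len(w)], w) for n in range(out_len)])

def winograd_f23_conv(x, w):
    # Unit-stride 3-tap correlation via Winograd F(2, 3): each tile of 4 inputs
    # yields 2 outputs using 4 elementwise multiplications instead of 6.
    Bt = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], float)
    G = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], float)
    At = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], float)
    U = G @ w                          # transform the filter once, reused per tile
    out = []
    for i in range(0, len(x) - 3, 2):  # input tiles of 4, overlapping by 2
        V = Bt @ x[i:i + 4]            # transform the input tile (additions only)
        out.extend(At @ (U * V))       # 4 products, then inverse transform
    return np.array(out)               # assumes outputs come in whole tiles of 2

def stride2_conv_decomposed(x, w):
    # Decompose a stride-2, 6-tap convolution into even/odd phases; each phase
    # is a unit-stride 3-tap convolution, so Winograd F(2, 3) applies directly.
    ye = winograd_f23_conv(x[0::2], w[0::2])
    yo = winograd_f23_conv(x[1::2], w[1::2])
    n = min(len(ye), len(yo))
    return ye[:n] + yo[:n]
```

For an input of length 12 and a 6-tap filter, both routines produce the same four outputs, but the decomposed version spends 16 elementwise multiplications (2 phases x 2 tiles x 4 products) versus 24 for the direct method, consistent with the 1.5x end of the savings range reported above; the B and A transforms use only additions.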


Published in
ASPDAC '21: Proceedings of the 26th Asia and South Pacific Design Automation Conference, January 2021, 930 pages
ISBN: 9781450379991
DOI: 10.1145/3394885
Copyright © 2021 ACM


Publisher: Association for Computing Machinery, New York, NY, United States



Qualifiers: research-article, refereed limited

Acceptance rates: ASPDAC '21 paper acceptance rate 111 of 368 submissions (30%); overall acceptance rate 466 of 1,454 submissions (32%).
