ABSTRACT
As computer vision tasks target increasingly challenging scenarios, the demand for real-time image processing grows as well, calling for more efficient methods to accelerate convolutional neural networks. For unit-stride convolutions, FFT-based methods and Winograd algorithms can be used to compute the convolutions, effectively lowering the computational complexity by reducing the number of multiplications. For non-unit-stride convolutions, these algorithms usually cannot be applied directly to accelerate the computation. In this work, we propose a novel universal approach that constructs non-unit-stride convolution algorithms from Winograd algorithms for any given stride and filter size. Specifically, we first demonstrate the steps to decompose an arbitrary convolutional kernel so that the Winograd algorithms can be applied separately to compute non-unit-stride convolutions. We then present the derivation of this method and a proof by construction to confirm its validity. Finally, we discuss the minimum numbers of multiplications and additions necessary for non-unit-stride convolutions and evaluate the performance of the decomposed Winograd algorithms. Our analysis of the computational complexity shows that the new approach requires 1.5x to 3x fewer multiplications. In experiments on real DNN layers, we obtain a speedup of around 1.3x (T_old/T_new) for the decomposed Winograd algorithms over the conventional convolution algorithm in various experimental settings.
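To make the kernel-decomposition idea concrete, the following is a minimal NumPy sketch of how a stride-2 1-D convolution can be split into unit-stride sub-convolutions over the even and odd phases of the input and kernel. The function names and sizes here are illustrative assumptions, and the unit-stride sub-convolutions are written as plain dot products where a real implementation would substitute a minimal Winograd algorithm such as F(2, 3); this is a simplified illustration of the decomposition principle, not the paper's exact construction.

```python
# Illustrative sketch (assumed, simplified): a stride-2 1-D convolution decomposed
# into two unit-stride sub-convolutions over the even/odd phases of input and kernel.
import numpy as np

def conv1d_stride2_direct(d, g):
    """Reference stride-2 correlation: out[n] = sum_k g[k] * d[2n + k]."""
    out_len = (len(d) - len(g)) // 2 + 1
    return np.array([np.dot(g, d[2 * n : 2 * n + len(g)]) for n in range(out_len)])

def conv1d_unit_stride(d, g):
    """Unit-stride correlation; in practice this is where a Winograd algorithm is applied."""
    out_len = len(d) - len(g) + 1
    return np.array([np.dot(g, d[n : n + len(g)]) for n in range(out_len)])

def conv1d_stride2_decomposed(d, g):
    """Stride-2 convolution computed as the sum of two unit-stride sub-convolutions."""
    d_even, d_odd = d[0::2], d[1::2]   # even/odd phases of the input
    g_even, g_odd = g[0::2], g[1::2]   # even/odd taps of the kernel
    out_len = (len(d) - len(g)) // 2 + 1
    y_even = conv1d_unit_stride(d_even, g_even)[:out_len]
    y_odd = conv1d_unit_stride(d_odd, g_odd)[:out_len]
    return y_even + y_odd

d = np.arange(12, dtype=float)
g = np.array([1.0, 2.0, 3.0])
assert np.allclose(conv1d_stride2_direct(d, g), conv1d_stride2_decomposed(d, g))
```

Each unit-stride sub-convolution produced by the decomposition can then be computed with a standard Winograd algorithm (for example, F(2, 2) for the even-phase taps and F(2, 1) for the odd-phase tap in this sketch), which is what reduces the multiplication count relative to the direct strided computation.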