ABSTRACT
As computer vision tasks target increasingly challenging scenarios, the demand for real-time image processing grows as well, calling for more efficient methods to accelerate convolutional neural networks. For unit-stride convolutions, FFT-based methods and Winograd algorithms can be used to compute the convolutions, effectively lowering the computational complexity by reducing the number of multiplications. For non-unit-stride convolutions, these algorithms usually cannot be applied directly to accelerate the computation. In this work, we propose a novel universal approach that constructs non-unit-stride convolution algorithms from Winograd algorithms for any given stride and filter size. Specifically, we first demonstrate the steps to decompose an arbitrary convolutional kernel so that the Winograd algorithms can be applied separately to compute non-unit-stride convolutions. We then present the derivation of this method and a proof by construction to confirm its validity. Finally, we discuss the minimum numbers of multiplications and additions necessary for non-unit-stride convolutions and evaluate the performance of the decomposed Winograd algorithms. Our analysis of the computational complexity shows that the new approach requires 1.5x to 3x fewer multiplications. In experiments on real DNN layers, we obtain a speedup of around 1.3x (T_old/T_new) for the decomposed Winograd algorithms over the conventional convolution algorithm in various experimental settings.
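To make the kernel-decomposition idea concrete, the following is a minimal NumPy sketch of how a stride-2 1-D convolution can be split into unit-stride sub-convolutions over the even and odd phases of the input and kernel. The function names and sizes here are illustrative assumptions, and the unit-stride sub-convolutions are written as plain dot products where a real implementation would substitute a minimal Winograd algorithm such as F(2, 3); this is a simplified illustration of the decomposition principle, not the paper's exact construction.

```python
# Illustrative sketch (assumed, simplified): a stride-2 1-D convolution decomposed
# into two unit-stride sub-convolutions over the even/odd phases of input and kernel.
import numpy as np

def conv1d_stride2_direct(d, g):
    """Reference stride-2 correlation: out[n] = sum_k g[k] * d[2n + k]."""
    out_len = (len(d) - len(g)) // 2 + 1
    return np.array([np.dot(g, d[2 * n : 2 * n + len(g)]) for n in range(out_len)])

def conv1d_unit_stride(d, g):
    """Unit-stride correlation; in practice this is where a Winograd algorithm is applied."""
    out_len = len(d) - len(g) + 1
    return np.array([np.dot(g, d[n : n + len(g)]) for n in range(out_len)])

def conv1d_stride2_decomposed(d, g):
    """Stride-2 convolution computed as the sum of two unit-stride sub-convolutions."""
    d_even, d_odd = d[0::2], d[1::2]   # even/odd phases of the input
    g_even, g_odd = g[0::2], g[1::2]   # even/odd taps of the kernel
    out_len = (len(d) - len(g)) // 2 + 1
    y_even = conv1d_unit_stride(d_even, g_even)[:out_len]
    y_odd = conv1d_unit_stride(d_odd, g_odd)[:out_len]
    return y_even + y_odd

d = np.arange(12, dtype=float)
g = np.array([1.0, 2.0, 3.0])
assert np.allclose(conv1d_stride2_direct(d, g), conv1d_stride2_decomposed(d, g))
```

Each unit-stride sub-convolution produced by the decomposition can then be computed with a standard Winograd algorithm (for example, F(2, 2) for the even-phase taps and F(2, 1) for the odd-phase tap in this sketch), which is what reduces the multiplication count relative to the direct strided computation.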