ABSTRACT
Neural ordinary differential equations (Neural-ODEs) outperform conventional deep neural networks (DNNs) in modeling continuous-time and dynamical systems by applying numerical ODE integration to a shallow embedded NN. However, Neural-ODE inference is slow because of the costly iterative stepsize search in numerical integration, especially when higher-order Runge-Kutta (RK) methods and smaller error tolerances are used for better integration accuracy. In this work, we first present algorithmic techniques to speed up RK-based Neural-ODE inference: a two-stage coarse-grained/fine-grained structured pruning method based on top-K sparsification, which reduces the overall computation in the embedded NN by more than 60%, and a history-based stepsize search method that uses past integration steps to reduce the latency of reaching an accepted stepsize by up to 77% in RK methods. A reconfigurable hardware architecture is co-designed around these speedup techniques, featuring three processing loops to support a programmable embedded NN and a variety of higher-order RK methods. A sparse activation processor with multi-dimensional sorters exploits the structured sparsity in activations. Implemented on a Xilinx Virtex-7 XC7VX690T FPGA and evaluated on a variety of datasets, the prototype accelerator achieves more than 2.6x speedup with a more complex 3rd-order RK method compared to the latest Neural-ODE FPGA accelerator using the simplest Euler method. Compared to software execution on an Nvidia A100 GPU, the inference speedup is up to 18x.
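The two algorithmic ideas in the abstract can be illustrated with a small software sketch. This is not the paper's hardware implementation: the function names are hypothetical, the RK pair used is the standard 3rd-order Bogacki-Shampine method, and the "history-based" search is reduced to its core idea, namely warm-starting each step's trial stepsize from the previously accepted one instead of a fixed initial guess, so fewer trial/reject iterations are needed.

```python
import numpy as np

def topk_channel_mask(x, k):
    """Coarse-grained structured sparsification (illustrative): keep the k
    channels of activation matrix x (channels x features) with the largest
    L2 norm; all other channels are pruned for the current step."""
    norms = np.linalg.norm(x, axis=1)
    keep = np.argsort(norms)[-k:]          # indices of the k largest norms
    mask = np.zeros(x.shape[0], dtype=bool)
    mask[keep] = True
    return mask

def rk3_step(f, t, y, h):
    """One step of the 3rd-order Bogacki-Shampine RK pair.
    Returns the 3rd-order solution and an embedded error estimate."""
    k1 = f(t, y)
    k2 = f(t + 0.5 * h, y + 0.5 * h * k1)
    k3 = f(t + 0.75 * h, y + 0.75 * h * k2)
    y3 = y + h * (2 * k1 + 3 * k2 + 4 * k3) / 9.0      # 3rd-order result
    k4 = f(t + h, y3)
    y2 = y + h * (7 * k1 + 6 * k2 + 8 * k3 + 3 * k4) / 24.0  # 2nd-order
    err = float(np.max(np.abs(y3 - y2)))
    return y3, err

def integrate(f, y0, t0, t1, tol=1e-4, h0=0.1):
    """Adaptive integration with a history-informed stepsize search:
    the trial stepsize for each step is carried over (and gently scaled)
    from the last accepted step, rather than reset to a fixed guess."""
    t, y, h = t0, np.asarray(y0, dtype=float), h0
    rejections = 0
    while t1 - t > 1e-12:
        h = min(h, t1 - t)                 # do not overshoot the endpoint
        y_new, err = rk3_step(f, t, y, h)
        if err <= tol:                     # step accepted
            t, y = t + h, y_new
            # history: next trial stepsize grows from the accepted one
            h *= min(2.0, 0.9 * (tol / max(err, 1e-16)) ** (1.0 / 3.0))
        else:                              # step rejected: shrink and retry
            rejections += 1
            h *= max(0.2, 0.9 * (tol / err) ** (1.0 / 3.0))
    return y, rejections
```

For example, integrating dy/dt = -y from t = 0 to 1 with y(0) = 1 should land close to e^-1 while performing few rejected trials, since each accepted stepsize seeds the next search.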
Accelerating Neural-ODE Inference on FPGAs with Two-Stage Structured Pruning and History-based Stepsize Search