DOI: 10.1145/3543622.3573044
short-paper

Accelerating Neural-ODE Inference on FPGAs with Two-Stage Structured Pruning and History-based Stepsize Search

Published: 12 February 2023

ABSTRACT

Neural ordinary differential equations (Neural-ODEs) outperform conventional deep neural networks (DNNs) in modeling continuous-time or dynamical systems by applying numerical ODE integration to a shallow embedded NN. However, Neural-ODE inference is slow due to the costly iterative stepsize search in numerical integration, especially when higher-order Runge-Kutta (RK) methods and smaller error tolerances are used for improved integration accuracy. In this work, we first present algorithmic techniques to speed up RK-based Neural-ODE inference: a two-stage coarse-grained/fine-grained structured pruning method based on top-K sparsification that reduces the overall computation in the embedded NN by more than 60%, and a history-based stepsize search method based on past integration steps that reduces the latency to reach an accepted stepsize by up to 77% in RK methods. A reconfigurable hardware architecture is co-designed based on the proposed speedup techniques, featuring three processing loops to support a programmable embedded NN and a variety of higher-order RK methods. A sparse activation processor with multi-dimensional sorters is designed to exploit structured sparsity in activations. Implemented on a Xilinx Virtex-7 XC7VX690T FPGA and evaluated on a variety of datasets, the prototype accelerator using a more complex 3rd-order RK method achieves more than 2.6x speedup compared to the latest Neural-ODE FPGA accelerator using the simplest Euler method. Compared to software execution on an Nvidia A100 GPU, the inference speedup is up to 18x.
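To make the two algorithmic ideas in the abstract concrete, the sketch below illustrates, in plain Python/NumPy, (i) top-K sparsification of the hidden activations of a shallow embedded NN and (ii) an adaptive Runge-Kutta loop whose trial stepsize is seeded from recently accepted stepsizes. This is an illustrative sketch only, not the authors' implementation: the RK scheme shown is the standard Bogacki-Shampine 3(2) pair rather than the paper's specific method, and all names (top_k_sparsify, embedded_nn, rk23_step, integrate), the halving-on-rejection rule, and the last-three-steps averaging heuristic are hypothetical choices made here for brevity.

    # Minimal sketch of top-K activation sparsification plus a history-seeded
    # adaptive Runge-Kutta integration loop (hypothetical names and heuristics).
    import numpy as np

    def top_k_sparsify(x, k):
        """Keep only the k largest-magnitude activations; zero out the rest."""
        if k >= x.size:
            return x
        idx = np.argpartition(np.abs(x), -k)[-k:]
        out = np.zeros_like(x)
        out[idx] = x[idx]
        return out

    def embedded_nn(t, x, W1, W2, k):
        """Shallow embedded NN f(t, x); sparsifying the hidden layer reduces MACs downstream."""
        h = np.tanh(W1 @ x)
        h = top_k_sparsify(h, k)
        return W2 @ h

    def rk23_step(f, t, x, h):
        """One Bogacki-Shampine step: 3rd-order solution with an embedded 2nd-order error estimate."""
        k1 = f(t, x)
        k2 = f(t + 0.5 * h, x + 0.5 * h * k1)
        k3 = f(t + 0.75 * h, x + 0.75 * h * k2)
        x3 = x + h * (2 * k1 + 3 * k2 + 4 * k3) / 9.0
        k4 = f(t + h, x3)
        x2 = x + h * (7 * k1 + 6 * k2 + 8 * k3 + 3 * k4) / 24.0
        return x3, np.linalg.norm(x3 - x2)

    def integrate(f, x0, t0, t1, tol=1e-3):
        """Adaptive integration; each trial stepsize is seeded from recently accepted steps."""
        t, x = t0, x0.copy()
        h = (t1 - t0) * 0.1
        history = []                        # accepted stepsizes from past integration steps
        while t < t1:
            if history:                     # history-based guess instead of a cold restart
                h = float(np.mean(history[-3:]))
            h = min(h, t1 - t)
            while True:                     # shrink until the local error estimate is accepted
                x_new, err = rk23_step(f, t, x, h)
                if err <= tol:
                    break
                h *= 0.5
            history.append(h)
            t, x = t + h, x_new
        return x

    # Example usage with a tiny 8-dimensional state and a random embedded NN.
    rng = np.random.default_rng(0)
    W1, W2 = rng.standard_normal((16, 8)) * 0.3, rng.standard_normal((8, 16)) * 0.3
    f = lambda t, x: embedded_nn(t, x, W1, W2, k=6)
    x1 = integrate(f, rng.standard_normal(8), 0.0, 1.0)

The design intent mirrors the abstract at a high level: the top-K step creates a predictable number of nonzero activations that a hardware sorter could exploit, and reusing recent accepted stepsizes shortens the accept/reject search that dominates adaptive RK latency.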


Published in

FPGA '23: Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays
February 2023
283 pages
ISBN: 9781450394178
DOI: 10.1145/3543622

Copyright © 2023 Owner/Author

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

Published: 12 February 2023

Qualifiers

short-paper

Acceptance Rates

Overall Acceptance Rate: 125 of 627 submissions, 20%
