ABSTRACT
Neural ordinary differential equations (Neural-ODEs) outperform conventional deep neural networks (DNNs) in modeling continuous-time and dynamical systems by applying numerical ODE integration to a shallow embedded NN. However, Neural-ODE inference is slow because of the costly iterative stepsize search in numerical integration, especially when higher-order Runge-Kutta (RK) methods and smaller error tolerances are used for better integration accuracy. In this work, we first present algorithmic techniques to speed up RK-based Neural-ODE inference: a two-stage coarse-grained/fine-grained structured pruning method based on top-K sparsification, which reduces the overall computation in the embedded NN by more than 60%, and a history-based stepsize search method that uses past integration steps to reduce the latency of reaching an accepted stepsize by up to 77% in RK methods. A reconfigurable hardware architecture is co-designed around these speedup techniques, featuring three processing loops to support a programmable embedded NN and a variety of higher-order RK methods. A sparse activation processor with multi-dimensional sorters exploits the structured sparsity in activations. Implemented on a Xilinx Virtex-7 XC7VX690T FPGA and evaluated on a variety of datasets, the prototype accelerator achieves more than 2.6x speedup with a more complex 3rd-order RK method compared to the latest Neural-ODE FPGA accelerator using the simplest Euler method. Compared to software execution on an Nvidia A100 GPU, the inference speedup is up to 18x.
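The two algorithmic ideas in the abstract can be illustrated with a small software sketch. This is not the paper's hardware implementation: the function names are hypothetical, the RK pair used is the standard 3rd-order Bogacki-Shampine method, and the "history-based" search is reduced to its core idea, namely warm-starting each step's trial stepsize from the previously accepted one instead of a fixed initial guess, so fewer trial/reject iterations are needed.

```python
import numpy as np

def topk_channel_mask(x, k):
    """Coarse-grained structured sparsification (illustrative): keep the k
    channels of activation matrix x (channels x features) with the largest
    L2 norm; all other channels are pruned for the current step."""
    norms = np.linalg.norm(x, axis=1)
    keep = np.argsort(norms)[-k:]          # indices of the k largest norms
    mask = np.zeros(x.shape[0], dtype=bool)
    mask[keep] = True
    return mask

def rk3_step(f, t, y, h):
    """One step of the 3rd-order Bogacki-Shampine RK pair.
    Returns the 3rd-order solution and an embedded error estimate."""
    k1 = f(t, y)
    k2 = f(t + 0.5 * h, y + 0.5 * h * k1)
    k3 = f(t + 0.75 * h, y + 0.75 * h * k2)
    y3 = y + h * (2 * k1 + 3 * k2 + 4 * k3) / 9.0      # 3rd-order result
    k4 = f(t + h, y3)
    y2 = y + h * (7 * k1 + 6 * k2 + 8 * k3 + 3 * k4) / 24.0  # 2nd-order
    err = float(np.max(np.abs(y3 - y2)))
    return y3, err

def integrate(f, y0, t0, t1, tol=1e-4, h0=0.1):
    """Adaptive integration with a history-informed stepsize search:
    the trial stepsize for each step is carried over (and gently scaled)
    from the last accepted step, rather than reset to a fixed guess."""
    t, y, h = t0, np.asarray(y0, dtype=float), h0
    rejections = 0
    while t1 - t > 1e-12:
        h = min(h, t1 - t)                 # do not overshoot the endpoint
        y_new, err = rk3_step(f, t, y, h)
        if err <= tol:                     # step accepted
            t, y = t + h, y_new
            # history: next trial stepsize grows from the accepted one
            h *= min(2.0, 0.9 * (tol / max(err, 1e-16)) ** (1.0 / 3.0))
        else:                              # step rejected: shrink and retry
            rejections += 1
            h *= max(0.2, 0.9 * (tol / err) ** (1.0 / 3.0))
    return y, rejections
```

For example, integrating dy/dt = -y from t = 0 to 1 with y(0) = 1 should land close to e^-1 while performing few rejected trials, since each accepted stepsize seeds the next search.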
Accelerating Neural-ODE Inference on FPGAs with Two-Stage Structured Pruning and History-based Stepsize Search