Abstract
We present a novel method for neural network quantization. Our method, named UNIQ, emulates a non-uniform k-quantile quantizer and adapts the model to perform well with quantized weights by injecting noise into the weights at training time. As a by-product of this noise injection, we find that activations can also be quantized to as low as 8 bits with only minor accuracy degradation. Our non-uniform quantization approach provides a novel alternative to existing uniform quantization techniques for neural networks. We further propose a novel complexity metric, the number of bit operations performed (BOPs), and show that this metric is linearly related to logic utilization and power. We suggest evaluating the trade-off between accuracy and complexity (BOPs). When evaluated on ResNet18/34/50 and MobileNet on ImageNet, the proposed method outperforms the prior state of the art in both the low-complexity and the high-accuracy regimes. We demonstrate the practical applicability of this approach by implementing our non-uniformly quantized CNN on an FPGA.
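The two central ideas in the abstract, k-quantile (equal-mass) quantization of the weights and emulating that quantizer during training by injecting uniform noise, can be illustrated with a minimal sketch. The following is not the authors' implementation: the helper names (`kquantile_levels`, `quantize`, `inject_uniform_noise`) are ours, and replacing each weight by a uniform draw within its quantization bin is one plausible reading of "injecting noise into the weights at training time".

```python
# Minimal NumPy sketch of k-quantile quantization with training-time uniform
# noise injection. Illustrative only; helper names and the exact noise model
# are assumptions, not the UNIQ reference implementation.
import numpy as np

def kquantile_levels(w, num_bits):
    """Split the weights into 2**num_bits equally populated bins (k-quantile
    quantization) and return the bin edges plus one representative level
    (here, the bin mean) per bin."""
    k = 2 ** num_bits
    edges = np.quantile(w, np.linspace(0.0, 1.0, k + 1))
    levels = np.array([
        w[(w >= lo) & (w <= hi)].mean()
        for lo, hi in zip(edges[:-1], edges[1:])
    ])
    return edges, levels

def quantize(w, edges, levels):
    """Hard quantization used at inference time: map each weight to the
    representative level of its bin."""
    idx = np.clip(np.searchsorted(edges, w, side="right") - 1, 0, len(levels) - 1)
    return levels[idx]

def inject_uniform_noise(w, edges, rng):
    """Training-time surrogate: instead of quantizing, perturb each weight with
    uniform noise spanning its bin, so the network is exposed to the same error
    statistics the quantizer would introduce."""
    idx = np.clip(np.searchsorted(edges, w, side="right") - 1, 0, len(edges) - 2)
    lo, hi = edges[idx], edges[idx + 1]
    return rng.uniform(lo, hi)

# Example: 4-bit quantization of a Gaussian weight tensor.
rng = np.random.default_rng(0)
w = rng.normal(size=10_000).astype(np.float32)
edges, levels = kquantile_levels(w, num_bits=4)
w_train = inject_uniform_noise(w, edges, rng)   # used in the forward pass during training
w_infer = quantize(w, edges, levels)            # used at inference
```

Because the bins are quantiles of the weight distribution, every quantization level is used equally often, which is what distinguishes this non-uniform scheme from uniform (fixed step size) quantizers.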