DOI: 10.1145/3352460.3358291

SparTen: A Sparse Tensor Accelerator for Convolutional Neural Networks

Published: 12 October 2019

ABSTRACT

Convolutional neural networks (CNNs) are emerging as powerful tools for image processing. Recent machine learning work has reduced CNNs' compute and data volumes by exploiting the naturally-occurring and actively-transformed zeros in the feature maps and filters. While previous semi-sparse architectures exploit one-sided sparsity in either the feature maps or the filters, but not both, a recent fully-sparse architecture, called Sparse CNN (SCNN), exploits two-sided sparsity to improve performance and energy over dense architectures. However, sparse vector-vector dot product, a key primitive in sparse CNNs, would be inefficient using the representation adopted by SCNN. The dot product requires finding and accessing non-zero elements in matching positions in the two sparse vectors -- an inner join using the position as the key with a single value field. SCNN avoids the inner join by performing a Cartesian product capturing the relevant multiplications. However, SCNN's approach incurs several considerable overheads and is not applicable to non-unit-stride convolutions. Further, exploiting reuse in sparse CNNs fundamentally causes systematic load imbalance not addressed by SCNN. We propose SparTen, which achieves an efficient inner join by providing support for native two-sided sparse execution and memory storage. To tackle load imbalance, SparTen employs a software scheme, called greedy balancing, which groups filters by density via two variants: a software-only one which uses whole-filter density and a software-hardware hybrid which uses finer-grain density. Our simulations show that, on average, SparTen performs 4.7x, 1.8x, and 3x better than a dense architecture, a one-sided sparse architecture, and SCNN, respectively. An FPGA implementation shows that SparTen performs 4.3x and 1.9x better than a dense architecture and a one-sided sparse architecture, respectively.
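The two ideas named in the abstract, a sparse dot product realized as an inner join on element positions and greedy balancing that groups filters by density, can be illustrated with a short sketch. The Python below is a minimal illustration, assuming a bitmask-plus-packed-values sparse format and a largest-density-first grouping policy; these names, the format, and the policy are assumptions for illustration, not SparTen's actual hardware datapath or the paper's exact scheduling algorithm.

```python
# Minimal sketch of the abstract's two key ideas (illustrative only; the
# sparse format, names, and grouping policy are assumptions, not SparTen's RTL).

def sparse_dot(mask_a, vals_a, mask_b, vals_b):
    """Two-sided sparse dot product as an inner join on position.

    Each vector is stored as (bitmask, packed values): mask[i] is 1 where the
    dense vector is non-zero, and vals holds only those non-zeros in order.
    """
    acc = 0
    ia = ib = 0  # cursors into the packed value arrays
    for pos in range(len(mask_a)):
        a_nz, b_nz = mask_a[pos], mask_b[pos]
        if a_nz and b_nz:            # positions match: this pair contributes
            acc += vals_a[ia] * vals_b[ib]
        ia += a_nz                   # advance a cursor whenever its vector has a non-zero here
        ib += b_nz
    return acc


def greedy_balance(filter_densities, num_units):
    """Group filters by density: assign densest-first to the least-loaded unit.

    The largest-density-first policy is an assumed illustration of density-based
    grouping; the paper's software-only and hybrid variants differ in the
    granularity at which density is measured.
    """
    order = sorted(range(len(filter_densities)),
                   key=lambda f: filter_densities[f], reverse=True)
    groups = [[] for _ in range(num_units)]
    load = [0.0] * num_units
    for f in order:
        u = min(range(num_units), key=load.__getitem__)
        groups[u].append(f)
        load[u] += filter_densities[f]
    return groups


# Dense vectors [0, 3, 0, 2] . [5, 4, 0, 1] = 3*4 + 2*1 = 14
assert sparse_dot([0, 1, 0, 1], [3, 2], [1, 1, 0, 1], [5, 4, 1]) == 14
```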

References

1. Jorge Albericio, Alberto Delmas, Patrick Judd, Sayeh Sharify, Gerard O'Leary, Roman Genov, and Andreas Moshovos. 2017. Bit-pragmatic deep neural network computing. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2017, Cambridge, MA, USA, October 14-18, 2017. 382--394. https://doi.org/10.1145/3123939.3123982
2. Jorge Albericio, Patrick Judd, Tayler H. Hetherington, Tor M. Aamodt, Natalie D. Enright Jerger, and Andreas Moshovos. 2016. Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing. In 43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea, June 18-22, 2016. 1--13. https://doi.org/10.1109/ISCA.2016.11
3. Manoj Alwani, Han Chen, Michael Ferdman, and Peter Milder. 2016. Fused-Layer CNN Accelerators. In 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
4. Aayush Ankit, Izzat El Hajj, Sai Rahul Chalamalasetti, Geoffrey Ndu, Martin Foltin, R. Stanley Williams, Paolo Faraboschi, Wen-mei W Hwu, John Paul Strachan, Kaushik Roy, and Dejan S. Milojicic. 2019. PUMA: A Programmable Ultra-efficient Memristor-based Accelerator for Machine Learning Inference. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '19). ACM, New York, NY, USA, 715--731. https://doi.org/10.1145/3297858.3304049
5. N. Bell and M. Garland. 2009. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. 1--11. https://doi.org/10.1145/1654059.1654078
6. V.E. Beneš. 1965. Mathematical Theory of Connecting Networks and Telephone Traffic. Elsevier Science. https://books.google.com/books?id=CANltcFRRHMC
7. Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2014. DaDianNao: A Machine-Learning Supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). IEEE Computer Society, Washington, DC, USA, 609--622. https://doi.org/10.1109/MICRO.2014.58
8. Yu-Hsin Chen, Tushar Krishna, Joel Emer, and Vivienne Sze. 2016. 14.5 Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. In 2016 IEEE International Solid-State Circuits Conference (ISSCC). 262--263. https://doi.org/10.1109/ISSCC.2016.7418007
9. Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, and Yuan Xie. 2016. PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory. In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA '16). IEEE Press, Piscataway, NJ, USA, 27--39. https://doi.org/10.1109/ISCA.2016.13
10. C. Clos. 1953. A study of non-blocking switching networks. The Bell System Technical Journal 32, 2 (March 1953), 406--424. https://doi.org/10.1002/j.1538-7305.1953.tb01433.x
11. Alberto Delmas, Patrick Judd, Dylan Malone Stuart, Zissis Poulos, Mostafa Mahmoud, Sayeh Sharify, Milos Nikolic, and Andreas Moshovos. 2018. Bit-Tactical: Exploiting Ineffectual Computations in Convolutional Neural Networks: Which, Why, and How. CoRR abs/1803.03688 (2018). arXiv:1803.03688 http://arxiv.org/abs/1803.03688
12. C. Deng, S. Liao, Y. Xie, K. K. Parhi, X. Qian, and B. Yuan. 2018. PermDNN: Efficient Compressed DNN Architecture with Permuted Diagonal Matrices. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 189--202. https://doi.org/10.1109/MICRO.2018.00024
13. Caiwen Ding, Siyu Liao, Yanzhi Wang, Zhe Li, Ning Liu, Youwei Zhuo, Chao Wang, Xuehai Qian, Yu Bai, Geng Yuan, Xiaolong Ma, Yipeng Zhang, Jian Tang, Qinru Qiu, Xue Lin, and Bo Yuan. 2017. CirCNN: Accelerating and Compressing Deep Neural Networks Using Block-circulant Weight Matrices. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-50 '17). ACM, New York, NY, USA, 395--408. https://doi.org/10.1145/3123939.3124552
14. Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. 2015. ShiDianNao: Shifting Vision Processing Closer to the Sensor. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA '15). ACM, New York, NY, USA, 92--104. https://doi.org/10.1145/2749469.2750389
15. J. A. Farrell and T. C. Fischer. 1998. Issue logic for a 600-MHz out-of-order execution microprocessor. IEEE Journal of Solid-State Circuits 33, 5 (May 1998), 707--712. https://doi.org/10.1109/4.668985
16. Norman E. Gibbs, William G. Poole, Jr., and Paul K. Stockmeyer. 1976. A Comparison of Several Bandwidth and Profile Reduction Algorithms. ACM Trans. Math. Softw. 2, 4 (Dec. 1976), 322--330. https://doi.org/10.1145/355705.355707
17. V. Gokhale, A. Zaidy, A. X. M. Chang, and E. Culurciello. 2017. Snowflake: An efficient hardware accelerator for convolutional neural networks. In 2017 IEEE International Symposium on Circuits and Systems (ISCAS). 1--4. https://doi.org/10.1109/ISCAS.2017.8050809
18. Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. 2015. Deep Learning with Limited Numerical Precision. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 (ICML'15). JMLR.org, 1737--1746. http://dl.acm.org/citation.cfm?id=3045118.3045303
19. Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. 2016. EIE: Efficient Inference Engine on Compressed Deep Neural Network. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). 243--254. https://doi.org/10.1109/ISCA.2016.30
20. Song Han, Huizi Mao, and William J. Dally. 2016. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings. http://arxiv.org/abs/1510.00149
21. Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both Weights and Connections for Efficient Neural Network. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc., 1135--1143. http://papers.nips.cc/paper/5784-learning-both-weights-and-connections-for-efficient-neural-network.pdf
22. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. CoRR abs/1512.03385 (2015). http://arxiv.org/abs/1512.03385
23. Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 1--12. https://doi.org/10.1145/3079856.3080246
24. Dae Hyun Kim and Sung Kyu Lim. 2015. Impact of TSV and Device Scaling on the Quality of 3D ICs. Springer New York, New York, NY, 1--22. https://doi.org/10.1007/978-1-4939-2163-8_1
25. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 1097--1105. http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
26. H. T. Kung, Bradley McDanel, and Sai Qian Zhang. 2018. Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization. CoRR abs/1811.04770 (2018). arXiv:1811.04770 http://arxiv.org/abs/1811.04770
27. Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (Nov 1998), 2278--2324. https://doi.org/10.1109/5.726791
28. Darryl D. Lin, Sachin S. Talathi, and V. Sreekanth Annapureddy. 2016. Fixed Point Quantization of Deep Convolutional Networks. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48 (ICML'16). JMLR.org, 2849--2858. http://dl.acm.org/citation.cfm?id=3045390.3045690
29. Yen-Chun Lin and Chin-Yu Su. 2005. Faster Optimal Parallel Prefix Circuits: New Algorithmic Construction. J. Parallel Distrib. Comput. 65, 12 (Dec. 2005), 1585--1595. https://doi.org/10.1016/j.jpdc.2005.05.017
30. Daofu Liu, Tianshi Chen, Shaoli Liu, Jinhong Zhou, Shengyuan Zhou, Olivier Temam, Xiaobing Feng, Xuehai Zhou, and Yunji Chen. 2015. PuDianNao: A Polyvalent Machine Learning Accelerator. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '15). ACM, New York, NY, USA, 369--381. https://doi.org/10.1145/2694344.2694358
31. M. Mahmoud, K. Siu, and A. Moshovos. 2018. Diffy: a Déjà vu-Free Differential Deep Neural Network Accelerator. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 134--147. https://doi.org/10.1109/MICRO.2018.00020
32. Naveen Muralimanohar, Rajeev Balasubramonian, and Norman Jouppi. 2009. CACTI 6.0: A tool to model large caches. HP Laboratories (2009).
33. Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, and William J. Dally. 2017. SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 27--40. https://doi.org/10.1145/3079856.3080254
34. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. 2014. ImageNet Large Scale Visual Recognition Challenge. CoRR abs/1409.0575 (2014). http://arxiv.org/abs/1409.0575
35. Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R. Stanley Williams, and Vivek Srikumar. 2016. ISAAC: A Convolutional Neural Network Accelerator with In-situ Analog Arithmetic in Crossbars. In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA '16). IEEE Press, Piscataway, NJ, USA, 14--26. https://doi.org/10.1109/ISCA.2016.12
36. Sayeh Sharify, Alberto Delmas Lascorz, Mostafa Mahmoud, Milos Nikolic, Kevin Siu, Dylan Malone Stuart, Zissis Poulos, and Andreas Moshovos. 2019. Laconic Deep Learning Inference Acceleration. In Proceedings of the 46th International Symposium on Computer Architecture (ISCA '19). ACM, New York, NY, USA, 304--317. https://doi.org/10.1145/3307650.3322255
37. Yongming Shen, Michael Ferdman, and Peter Milder. 2017. Escher: A CNN Accelerator with Flexible Buffering to Minimize Off-Chip Transfer. In 25th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM).
38. L. Song, X. Qian, H. Li, and Y. Chen. 2017. PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). 541--552. https://doi.org/10.1109/HPCA.2017.55
39. James E. Stine, Ivan Castellanos, Michael Wood, Jeff Henson, Fred Love, W. Rhett Davis, Paul D. Franzon, Michael Bucher, Sunil Basavarajaiah, Julie Oh, and Ravi Jenkal. 2007. FreePDK: An Open-Source Variation-Aware Design Kit. In Proceedings of the 2007 IEEE International Conference on Microelectronic Systems Education (MSE '07). IEEE Computer Society, Washington, DC, USA, 173--174. https://doi.org/10.1109/MSE.2007.44
40. Mithuna Thottethodi and T. N. Vijaykumar. 2019. Why the GPGPU is Less Efficient than the TPU for DNNs. https://www.sigarch.org/why-the-gpgpu-is-less-efficient-than-the-tpu-for-dnns/
41. Jiecao Yu, Andrew Lukefahr, David Palframan, Ganesh Dasika, Reetuparna Das, and Scott Mahlke. 2017. Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 548--560. https://doi.org/10.1145/3079856.3080215
42. Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen. 2016. Cambricon-X: An Accelerator for Sparse Neural Networks. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-49). IEEE Press, Piscataway, NJ, USA, Article 20, 12 pages. http://dl.acm.org/citation.cfm?id=3195638.3195662
43. X. Zhou, Z. Du, Q. Guo, S. Liu, C. Liu, C. Wang, X. Zhou, L. Li, T. Chen, and Y. Chen. 2018. Cambricon-S: Addressing Irregularity in Sparse Neural Networks through A Cooperative Software/Hardware Approach. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 15--28. https://doi.org/10.1109/MICRO.2018.00011
44. Ling Zhuo and Viktor K. Prasanna. 2005. Sparse Matrix-Vector Multiplication on FPGAs. In Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-programmable Gate Arrays (FPGA '05). ACM, New York, NY, USA, 63--74. https://doi.org/10.1145/1046192.1046202

Published in

MICRO '52: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture
October 2019, 1104 pages
ISBN: 978-1-4503-6938-1
DOI: 10.1145/3352460
Copyright © 2019 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
