A dynamically configurable coprocessor for convolutional neural networks

Authors:
Srimat Chakradhar, NEC Laboratories America, Inc., Princeton, NJ, USA
Murugan Sankaradas, NEC Laboratories America, Inc., Princeton, NJ, USA
Venkata Jakkula, NEC Laboratories America, Inc., Princeton, NJ, USA
Srihari Cadambi, NEC Laboratories America, Inc., Princeton, NJ, USA

ACM SIGARCH Computer Architecture News, Volume 38, Issue 3, June 2010, pp. 247–257. https://doi.org/10.1145/1816038.1815993

Published: 19 June 2010

Abstract

Convolutional neural network (CNN) applications range from recognition and reasoning (such as handwriting recognition, facial expression recognition, and video surveillance) to intelligent text applications such as semantic text analysis and natural language processing. Two key observations drive the design of a new architecture for CNNs. First, CNN workloads exhibit a widely varying mix of three types of parallelism: parallelism within a convolution operation; intra-output parallelism, where multiple input sources (features) are combined to create a single output; and inter-output parallelism, where multiple independent outputs (features) are computed simultaneously. Workloads differ significantly across different CNN applications and across different layers of a CNN. Second, the number of processing elements in an architecture continues to scale (as per Moore's law) much faster than the off-chip memory bandwidth (or pin count) of chips. Based on these two observations, we show that, for a given number of processing elements and a given off-chip memory bandwidth, a new CNN hardware architecture that dynamically configures the hardware on the fly to match the specific mix of parallelism in a given workload gives the best throughput. Our CNN compiler automatically translates a high-abstraction network specification into a parallel microprogram (a sequence of low-level VLIW instructions) that is mapped, scheduled, and executed by the coprocessor. Compared to a 2.3 GHz quad-core, dual-socket Intel Xeon, a 1.35 GHz C870 GPU, and a 200 MHz FPGA implementation, our 120 MHz dynamically configurable architecture is 4x to 8x faster. This is the first CNN architecture to achieve real-time video stream processing (25 to 30 frames per second) on a wide range of object detection and recognition tasks.
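
The three types of parallelism named in the abstract can be made concrete with a small software sketch of one CNN layer. This is only an illustration in plain Python, not the coprocessor's microprogram or the authors' compiler output; the function names `conv2d_valid` and `conv_layer`, the nested-list data layout, and the use of the correlation form (no kernel flip) are all assumptions made for the example. Each loop is annotated with the kind of parallelism that hardware could exploit at that level.

```python
def conv2d_valid(image, kernel):
    """2D 'valid' convolution of one feature map with one kernel.

    Assumption: correlation form (the kernel is not flipped).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            # Parallelism within a convolution: the kh * kw products for
            # one output pixel are independent and feed a single sum.
            out[y][x] = sum(
                image[y + i][x + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
    return out


def conv_layer(inputs, kernels):
    """Compute every output feature map of one layer.

    inputs:  list of input feature maps (each a list of rows).
    kernels: kernels[o][i] is the kernel linking input i to output o
             (hypothetical layout chosen for this sketch).
    """
    outputs = []
    # Inter-output parallelism: each output map depends only on the
    # inputs, so the iterations of this loop are mutually independent.
    for per_output_kernels in kernels:
        # Intra-output parallelism: one convolution per input map, all
        # of which can run concurrently before being combined.
        partials = [
            conv2d_valid(image, kernel)
            for image, kernel in zip(inputs, per_output_kernels)
        ]
        out_h, out_w = len(partials[0]), len(partials[0][0])
        combined = [
            [sum(p[y][x] for p in partials) for x in range(out_w)]
            for y in range(out_h)
        ]
        outputs.append(combined)
    return outputs
```

In this sketch, a layer with many inputs and few outputs offers mostly intra-output parallelism, while a layer with few inputs and many outputs offers mostly inter-output parallelism, which is the kind of per-layer variation the abstract describes.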

Index Terms

A dynamically configurable coprocessor for convolutional neural networks
1. Computer systems organization
  1. Architectures
    1. Other architectures
    2. Serial architectures
      1. Pipeline computing

Published in
ACM SIGARCH Computer Architecture News Volume 38, Issue 3
ISCA '10
June 2010
508 pages
ISSN:0163-5964
DOI:10.1145/1816038
ISCA '10: Proceedings of the 37th annual international symposium on Computer architecture
June 2010
520 pages
ISBN:9781450300537
DOI:10.1145/1815961
General Chair: André Seznec (INRIA Rennes)
Program Chairs: Uri Weiser (Technion), Ronny Ronen (Intel)
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
convolutional neural networks
dynamic reconfiguration
parallel computer architecture
Qualifiers
- research-article