ABSTRACT
Vector-matrix multiplication dominates the computation time and energy for many workloads, particularly neural network algorithms and linear transforms (e.g., the Discrete Fourier Transform). Utilizing the natural current accumulation of memristor crossbars, we developed the Dot-Product Engine (DPE) as a high-density, power-efficient accelerator for approximate matrix-vector multiplication. We first developed a conversion algorithm that maps arbitrary matrix values to memristor conductances in a realistic crossbar array, accounting for device physics and circuit issues to reduce computational errors. Accurate device resistance programming in large arrays is enabled by closed-loop pulse tuning and access transistors. To validate our approach, we simulated and benchmarked a state-of-the-art neural network for pattern recognition on the DPEs. The results show no accuracy degradation compared with the software approach (99% pattern recognition accuracy on the MNIST data set) with only a 4-bit DAC/ADC requirement, while the DPE can achieve a speed-efficiency product of 1,000× to 10,000× compared to a custom digital ASIC.
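The current-accumulation idea can be illustrated with a minimal sketch of an idealized crossbar. This is not the paper's conversion algorithm, which additionally accounts for device physics and circuit issues; it is a naive linear mapping under stated assumptions. The function names, the conductance range (`g_min`, `g_max`), and the example matrix are hypothetical and chosen only for illustration. Each matrix value is mapped linearly onto a conductance, input values are applied as row voltages, and each column current is the sum of voltage-conductance products, from which the vector-matrix product is recovered by undoing the linear mapping.

```python
def map_to_conductance(W, g_min=1e-6, g_max=1e-4):
    """Linearly map matrix values onto [g_min, g_max] (siemens).

    The conductance range is an assumed, illustrative device window.
    Returns the conductance matrix G and the slope/offset (a, b) of
    the mapping G = a * W + b.
    """
    flat = [w for row in W for w in row]
    w_min, w_max = min(flat), max(flat)
    if w_max == w_min:
        raise ValueError("matrix must contain at least two distinct values")
    a = (g_max - g_min) / (w_max - w_min)
    b = g_min - a * w_min
    G = [[a * w + b for w in row] for row in W]
    return G, a, b


def crossbar_currents(v, G):
    """Ideal crossbar read: column current I_j = sum_i v_i * G[i][j].

    Wire resistance, device nonlinearity, and noise are ignored here.
    """
    n_cols = len(G[0])
    return [sum(v[i] * G[i][j] for i in range(len(v))) for j in range(n_cols)]


def recover_product(currents, v, a, b):
    """Undo the linear mapping: y_j = (I_j - b * sum_i v_i) / a."""
    v_sum = sum(v)
    return [(i_j - b * v_sum) / a for i_j in currents]


# Hypothetical example: a 3x2 matrix and a 3-element input vector (volts).
W = [[0.5, -1.0], [2.0, 0.0], [-0.5, 1.5]]
v = [0.1, 0.2, 0.3]

G, a, b = map_to_conductance(W)
y = recover_product(crossbar_currents(v, G), v, a, b)

# Reference result computed directly in software for comparison.
expected = [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]
print(y, expected)
```

The offset term `b * sum(v)` has to be subtracted because conductances are strictly positive, so negative matrix values are represented only relative to the offset. In a real array, the effects this sketch ignores are the reason a more careful conversion is needed.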