Hadamard product-based in-memory computing design for floating point neural network training

Anjunyi Fan; Yihan Fu; Yaoyu Tao; Zhonghua Jin; Haiyue Han; Huiyu Liu; Yaojun Zhang; Bonan Yan; Yuchao Yang; Ru Huang

doi:10.1088/2634-4386/acbab9

1. Introduction

Deep neural networks (DNNs) become a game-changing solution to artificial intelligence (AI) tasks at various scales [1]. DNNs are data-intensive and involve heavy usage of general matrix-multiplication (GeMM), causing an 'efficiency wall' phenomenon [2]. That is, transferring data dominates the overall energy consumption of a DNN computing system, which hinders further improvement toward higher computing efficiency. Aiming to break this bottleneck, in-memory computing (IMC) technology, a.k.a processing-in-memory (PIM) or compute-in-memory (CIM), has been proposed. A basic structure of IMC is shown in figure 1(a). By fusing memories and processing units, IMC is capable of locally performing computation in memories and thus significantly relaxes the requirement for the data communication bandwidth between standalone processing units and peripheral memories.

**Figure 1.** (a) A conventional structure of IMC; (b)–(d) three challenges in training implementation by the conventional IMC.
Download figure:
Standard image High-resolution image

Multiple possible memory candidates, e.g. static random-access memory (SRAM) [3], magnetic random-access memory (MRAM) [4–6], resistive random-access memory [7–9], ferroelectric field-effect transistor [10–12] and Flash [13–15], have the capability of performing the required arithmetic operations within or near the memory bitcell array [16–20]. Among them, SRAM provides an extreme high operational speed (∼1 ns per read or write), features good compatibility with mature CMOS logic manufacturing platform and offers extraordinary high endurance [21–23]. With these advantages, SRAM emerges as one of the most promising memory technology for IMC as the CMOS technology down-scaling becomes harder than ever [24].

State-of-the-art IMC macro design⁶ is particularly advantageous to integer GeMM operations in terms of energy efficiency and performance [3, 25, 26]. Various SRAM-based IMC macros have been developed and applied for DNN inference. Bankman et al [27] developed a 3.8 µJ/classification mixed-signal processor on the CIFAR-10 image classification data set at 86 $\%$ accuracy using convolutional neural networks (CNNs). Biswas and Chandrakasan [28] presented a SRAM array for binary weight LeNet-5 inference with energy efficiency at 40.3 TOPS W⁻¹.⁷ Khwa et al [29] proposed an IMC macro for binary-DNN inference reaching 55.8 TOPS W⁻¹. Valavi et al [30] designed a charge-domain compute CNN accelerator with balanced energy and area efficiency.

However, most existing IMC designs focusing on DNN inference acceleration have difficulty in conducting training tasks [31, 32]. The reasons behind are threefold:

(a)
Limited data format: most prior macros only support integers or fixed-point numbers. However, these number formats suffer from precision loss and a small range. Even 6-bit integers training is enough for small-scale network training like MNIST on LeNet-5 [33], when it comes to larger networks, however, these schemes suffer DNN inference accuracy degradation even using 32-bit integers and quantization methods [34–36]. At least 16-bit floating point (FP) precision is necessary to guarantee the efficacy of training to reach a satisfactory accuracy (DNN performance) of the large-scale networks [37–40]. Compared to integer GeMM, FP demands both integer addition for the mantissa and extra processing circuits for the exponents (figure 1(b)).Recently, there are several works employing floating-point formats for IMC designs: Tu et al [41] developed a digital processor handling GeMM between integers and FP numbers, reaching 14 TFLOPS/W efficiency computing GeMM with BFloat16 format. Lee et al [42] designed an SRAM-based IMC circuit processing GeMM between BFloat16, which separates exponent and fraction storage with the best computing efficiency at 1.43 TFLOPS W⁻¹; Lee et al [43] give a DRAM-based near-memory computing design with high throughput at 1 TFLOPS per chip. Nevertheless, efficient in-memory simultaneous processing of mantissa and exponents are still not explored.
(b)
Limited data density: data in neural networks mainly include two parts, weights, and activations. In inference, IMC macros store the weights, while the activations are simply fed to the IMC macro as the transferred data. In contrast, during training, activations need to be cached in the IMC macros for extra processing (figure 1(c)). This leads to a critical requirement for large on-chip buffering space. Yet most existing SRAM-based IMC structures have low data density⁸ . The data density in [41, 42] is lower than 0.3Mb mm⁻². This inevitably causes more off-chip memory communication during training, and can hardly scale out for large-scale training tasks at low cost.
(c)
Limited operation type: Prior IMC designs are adept at vector-matrix multiplication (VMM). But other operations in training have poor compatibility with VMM, such as the Hadamard product in the DNN error backpropagation (BP) and vector outer product in weight updating (illustrated in figure 1(d)). These two operations call for the support of another typical type of operator, element-wise multiplication. However, this has been rarely implemented in the existing IMC designs.

In this paper, we propose an FP Hadamard product structure-based IMC design (H-IMC) for DNN training. We firstly design an IMC circuit structure for FP processing. It supports two basic operations in training, VMM and vector Hadamard product (VHP). Then we decompose the training algorithm into these basic operations. The entire design is implemented with the 28 nm commercial process development kit (PDK) and verified on FPGA hardware emulation. The performance of FP H-IMC is explored with a set of configurable design parameters. And the simulated result of training implementation shows that our scheme has considerable optimization results for edge-side network training. Results show that our design has a data density of 769.2Kb mm⁻² with FP processing circuits included. And the proposed IMC training scheme saves $91.2\%$ energy and is $13.9\%$ faster on MobileNet [44] training compared with the GPU platform of GTX 3060 Laptop.

The contribution of this paper can be concluded as follows:

A IMC design supporting FP VMM and VHP operation (FP H-IMC).
Reformulation of training algorithm to deploy it onto proposed hardware.
A circuit-level design space exploration analysis framework on FP H-IMC.

The following paper is organized as follows: section 2 presents the background of PIM and training algorithm; section 3 introduces our scheme on both hardware design and algorithm implementation; section 4 provides evaluation methodology and experimental results with the proposed hardware solution together with the hardware/software codesign technique; section 5 concludes this paper.

2. Preliminaries

2.1. PIM

PIM is a novel architecture that fuses computation into memory macros. A PIM macro often works in two modes: in the memory mode, the PIM macro shares the same function as a conventional memory macro, supporting data writing-in and sensing-out; in the computation mode, it takes the dedicated inputs and the already loaded data into the PIM macro in the memory mode as the operands and performs arithmetical or logical operations. State-of-the-art PIM designs mainly focus on in-memory VMM operation, which can be described as:

$\begin{align} \begin{array}{l} \mathrm{VMM}: \begin{bmatrix} v^\mathrm{in}_0\\ \cdots \\ v^\mathrm{in}_{m}\\ \end{bmatrix}^\mathrm{T} \begin{bmatrix} w_{0,0} & \cdots & w_{n,0}\\ \cdots & & \cdots \\ w_{0,m} & \cdots & w_{n,m}\\ \end{bmatrix} = \begin{bmatrix} \sum\limits^{m}_{j=0} w_{0,j}v^\mathrm{in}_{j} & \cdots & \sum\limits^{m}_{j=0} w_{n,j}v^\mathrm{in}_{j}\\ \end{bmatrix} \end{array}. \end{align} \tag{ 1 }$

By substituting x with the DNN inputs and w with DNN weights, this PIM VMM operation is very convenient to execute the Feed Forward part in the BP algorithm for its naturally matching the weight-stationary data-reuse scheme in DNN inference. This idea as well as the DNN inference deployment on PIM macros have been extensively studied to achieve extraordinarily high energy efficiency [25].

2.2. BP

A DNN is comprised of many neural and synaptic layers. Taking fully-connected layer as an example, it is mathematically described as

$\begin{align} a^l = \sigma(z^l),\ \mathrm{where}\ z^l=w^l a^{l-1} + b^l \end{align} \tag{ 2 }$

where w^l means the weight matrix (i.e. parameters) of lth layer; b^l is the bias; σ is the nonlinear activation function which is often differentiable, with the most commonly used ReLU function outputting 0/in, depending on whether the input in is less than zero/greater than zero; and a^l represents the calculation results of each layer and is often referred to as the 'activation'. l ranges from 1 to L: l = 1 means the input layer, l = L means the output layer, and others are hidden layers. Further, CNNs are a common type of DNN containing convolutional and pooling layers besides fully-connected layers. The weight matrix is replaced by smaller convolutional kernels (i.e. filters) in convolutional layers. The input is a feature map in the matrix form. The computation in a convolutional layer becomes $z^l = a^{l-1} \ast w^l + b^l$ , where $\ast$ denotes the convolution between the input feature map $a^{l-1}$ and the weight kernels w^l. The pooling layer does not contain weights, whereas it reduces the dimension of the input $a^{l-1}$ . Common use pooling layers are max pooling and average pooling [45].

DNNs learn as their weights and biases are updated with interested data. Such a deliberate process, the so-called 'training', endows a DNN with cognitive abilities, such as classification, feature extraction, and regression. BP algorithm is one of the fundamental DNN training algorithms, shown in algorithm 1 and figure 2(a). Here, C is the loss function (also called 'error function' or 'penalty function'); η is the learning rate as a hyperparameter; m is the number of examples in one batch; $\odot$ indicates the element-wise product of two matrices (i.e. Hadamard product) :

$\begin{align} \begin{array}{c} \begin{bmatrix} v^\mathrm{in}_{0,0} & \cdots & v^\mathrm{in}_{n,0}\\ \cdots & & \cdots\\ v^\mathrm{in}_{0,m} & \cdots & v^\mathrm{in}_{n,m} \end{bmatrix} \odot \begin{bmatrix} x_{0,0} & \cdots & x_{n,0} \\ \cdots & & \cdots\\ x_{0,m} & \cdots & x_{n,m} \\ \end{bmatrix} = \begin{bmatrix} x_{0,0}v^\mathrm{in}_{0,0} & \cdots & x_{n,0}v^\mathrm{in}_{n,0} \\ \cdots & & \cdots \\ x_{0,m}v^\mathrm{in}_{0,m} & \cdots & x_{n,m}v^\mathrm{in}_{n,m} \\ \end{bmatrix} \end{array}. \end{align} \tag{ 3 }$

**Figure 2.** (a) Diagram of BP training; (b) flow chart of the mixed-precision training.
Download figure:
Standard image High-resolution image

Algorithm 1. Backpropagation.
Require: X_B: training batch; W: initial weight; B: initial bias.
Ensure: $W_$ : updated weight; $B_$ : updated bias.
1: repeat
2: Input:
A training example x is chosen as $a^{x,1}$ ;
3: Feed Forward:
For each layer, activation $a^{x,l} = \sigma(w^{x,l} a^{x,l-1} + b^{x,l})$ ;
4: Calculate Output Error:
At the output layer, the error $\delta^{x,L} = \bigtriangledown_a C_x \odot \sigma^{^{\prime}}(z^{x,L})$ ;
5: Backpropagate:
For each layer, $\delta^{x,l} = ( (w^{x,l+1})^T \delta^{x,l+1} )\odot\sigma^{^{\prime}}(z^{x,l})$ ;
6: until X_B traversed
7: Weight Update:
$w_^l = w^l - \frac{\eta}{m}\sum_x\delta^{x,l}(a^{x,l-1})^T$ , $b_^l = b^l - \frac{\eta}{m}\sum_x\delta^{x,l}$

BP process includes five major steps: Input, Feed Forward, Calculate Output Error, Backpropagate and Weight Update. The input is first fed to get the activations a^l from the first layer to the last layer. Based on a^l, the error propagates in the reverse direction (i.e. 'backpropogates') to get the error δ for each layer. Subsequently, the weight gradient $\nabla W = \delta^{l}(a^{l-1})^T$ is obtained for Weight Update. After Weight Update, repeat the aforementioned computation for another iteration with another batch of data samples. Algorithm 1 keeps iteration until preset stop conditions, such as reaching a preset iteration loop number or observing a loss function smaller than a preset threshold.

2.3. Mixed precision training

DNN training usually uses single ('FP32') or double ('FP64') [46] precision FP numbers as the standard data format [39]. But the constant usage of FP32/FP64 will result in excessive storage, communication, and computing costs. In order to circumvent this, novel FP formats with reduced bit lengths are developed, including Tensor Float-32, Brain FP, etc. Taking Brain FP with 16-bit length ('BFloat16' or 'BF16') as an example, it truncates the 23-bit fraction (a.k.a. mantissa) in FP32 into only 7 bits while using the same 8-bit exponent. It sacrifices precision (mantissa) for much shorter data bit-length whereas maintains the same range (exponents).

Micikevicius et al [38] introduced a training technique with mixed precision, fusing FP32 and FP16 in training a CNN, and presents ∼50% less memory in use with no accuracy reducing. Zamirai et al [37] found that BFloat16 can be substituted for FP16 with extraordinary training performance. Figure 2(b) shows the flow chart of the FP32/BFloat16 mixed-precision training. In this scheme, FP32 is only used in storing a master copy of weights. Executing BP is always with BFloat16, including the steps of replicating weights, calculating activations and gradients. This dramatically reduces memory usage and computational latency with unnoticeable DNN inference accuracy drop.

3. Method

In order to address the aforementioned difficulties and realize efficient training, we propose a methodology for IMC accelerators for DNN training. We firstly model the features of such macro in section 3.1. Then, section 3.2 elaborates the transistor-level circuit structure of our proposed macro called FP H-IMC. By supporting new operators, section 3.3 reformulates the BP algorithm to fit in such IMC macro. Subsequently, section 3.4 provides a framework to analyze and explore the design space.

3.1. IMC abstraction for training operator

To leverage IMC macros for training, we extend the conventional IMC VMM function and introduce a new working mode. Besides the normal memory mode and working mode, the proposed macro features:

New computing mode: VHP Backpropogate and Weight Update for the fully-connected layers in training are dominated by not only VMM but also element-wise products with activations or their gradients. The proposed macro supports the highly-parallel VHP operator, which can be mathematically represented as:
$\begin{align} \begin{array}{l} \mathrm{VHP}: \begin{bmatrix} v^\mathrm{in}_0 \\ \cdots \\ v^\mathrm{in}_{m} \end{bmatrix} \odot \begin{bmatrix} w_{0}\\ \cdots\\ w_{m}\\ \end{bmatrix} = \begin{bmatrix} w_{0}v^\mathrm{in}_{0}\\ \cdots\\ w_{m}v^\mathrm{in}_{m}\\ \end{bmatrix} \end{array} \end{align} \tag{ 4 }$
where w_m is the data stored in IMC macro and $v^\mathrm{in}_m$ is applied with dedicated IMC input ports. IMC VHP mode highlights high parallelism for element-wise multiplication. That is, the m dimensions ( $1\textrm{st}\sim m\textrm{th}$ ) in equation (4) are performed simultaneously by the IMC processing units, instead of element-by-element execution Hadamard multiplication.
Support for floating-point formatIn this paper, all of the operations mentioned are floating-point operations by default for using IMC as an VLSI component towards general DNN training.

3.2. Circuit structure

In this section, we present the Hadamard product based IMC circuit structure to enable in-memory VHP. Based on this fundamental structure, we develop the implementation scheme of FP format IMC operation.

3.2.1. Hadamard product based IMC structure

The Hadamard product based IMC structure (H-IMC) is a VLSI scheme for SRAM IMC with 6-transistor (6T) bitcells [47]. The transistor-level design is shown in figure 3(a). We use several hierarchical structures (i.e. 'basic component', 'block', and 'compartment') to constitute a complete H-IMC macro.

**Figure 3.** (a) Hadamard product based IMC structure; (b) transistor-level basic component structure; (c) timing diagram of LPU operation; (d) in-out characteristics of DLSA.
Download figure:
Standard image High-resolution image

A column of SRAM cells and a local processing unit (LPU) form a 'basic component' and N_B of such components compose a 'block'. A block has N_I input ports and N_B output ports. N_C blocks form a 'compartment'. All blocks in a single compartment share N_B dynamic logic sense amplifiers (DLSAs) for output signal sensing. M compartments are arranged in parallel for vector-wise computing. Table 1 summarizes the hierarchical structure of H-IMC and variables that define the dimension of IMC macros. Detailed algorithm implementation will be given in section 3.3, and the impact of variables from table 1 on performance will be explored in section 3.4. The detailed vector processing principles will be elaborated in sections 3.2.3 and 3.2.4.

Table 1. Dimensional variable list.

Variable	Meaning
N_I	Number of input bits per block
N_O	Number of operated bits per input bit per block
N_B	Number of bits per row per block
N_R	Number of rows per block
N_C	Number of blocks per compartment
M	Number of compartments per subarray

Each basic component has two functions (working modes): storage and computation. The memory array for the storage function consists of 6T SRAM cells, horizontal write lines (WLs), and vertical bit lines (BLs). In the storage mode of the proposed H-IMC structure, it has the same behavior as conventional SRAM arrays and handles data writing or reading. In the computation mode, one block in a compartment is selected to be active, and N_B-bit-element binary matrix input vⁱⁿ is fed to input ports. The computing function is implemented by LPU computing and DLSA sensing, as shown in figure 3(b). Computing in each block is composed of N_B single-bit logic operations. For each bit, LPU has 4 inputs (i.e. 'INP', 'INN', 'W', and 'WB') routed from IMC macro inputs ('INP' and 'INN') and selected memory cells ('W', and 'WB'). 'INP' and 'INN' are reconfigurably mapped by Input Combinatory Logic (ICL) to conduct various bit-wise operations:

$\begin{align} \begin{aligned} \mathrm{NAND}:\ \mathrm{INP} & =v^\mathrm{in}_{j},\ \mathrm{INN}=0\\ \mathrm{OR}:\ \mathrm{INP} & =0,\ \mathrm{INN}=\overline{v^\mathrm{in}_{j}}\\ \mathrm{XOR}:\ \mathrm{INP} & =v^\mathrm{in}_{j},\ \mathrm{INN}=\overline{v^\mathrm{in}_{j}}\\ \end{aligned}. \end{align} \tag{ 5 }$

LPU will perform a preset bit-wise operation between each input bit and stored bit exported from BL. The computation can be formulated as:

$\begin{align} \begin{bmatrix} v^\mathrm{in}_{0,i} \\ \cdots \\ v^\mathrm{in}_{M-1,i} \\ \end{bmatrix} & \odot \begin{bmatrix} w_{(0,i*N_O)} & \cdots & w_{(0,(i+1)*N_O-1)}\\ \cdots & \cdots & \cdots \\ w_{(M-1,i*N_O)} & \cdots & w_{(M-1,(i+1)*N_O-1)}\\ \end{bmatrix}\nonumber\\ & =\begin{bmatrix} v^\mathrm{in}_{0,i} \Theta w_{(0,i*N_O)} & \cdots & v^\mathrm{in}_{0,i} \Theta w_{(0,(i+1)*N_O-1)}\\ \cdots & \cdots & \cdots \\ v^\mathrm{in}_{M-1,i} \Theta w_{(M-1,i*N_O)} & \cdots & v^\mathrm{in}_{M-1,i} \Theta w_{(M-1,(i+1)*N_O-1)}\\ \end{bmatrix}. \end{align} \tag{ 6 }$

where 'Θ' is the preset bit-wise operation (either AND, OR, or XOR), and ' $\odot$ ' means the VHP operator.

The operation in LPU is done by means of dynamic logic, which works in two stages, 'precharge' and 'evaluate', as shown in figure 3(c) [48]. The final output signal is the voltage on computing lines (CLs): when the clock signal is low, LPU turns on the precharge transistor (MP0) and precharges the parasitic transistor drain capacitors and wire capacitors on CL high; when the clock signal turns high, LPU enters the 'evaluate' stage and begins logic computing by cutting off MP0. Logic computing is realized by discharge transistors (MN0-MN3). Then the output signal is sensed out by the DLSA. DLSA reads the computational result out with one transistor per bit and can offer an extra NOT logic through inverters. DLSA can agilely detect '1' due to the precharging-discharging mechanism of LPU. Figure 3(d) shows the input-output DC characteristics of DLSA.

Here we take the AND operation as an example of a complete dataflow: the input bit $v^\mathrm{in}_j$ and the data bit w are connected to the gates of MN0 and MN2 respectively according to the ICL logic. CL will remain high after precharging unless both $v^\mathrm{in}_j$ and w are high. The following flip-flop with an additional inverter will invert and latch the output results. N_B such bit-wise operations compose the binary matrix logic AND operation, i.e. binary multiplication. In a compartment, all blocks share N_B CLs and DLSAs for N_B output bits computing and sensing in one cycle of logic computing. In this way, H-IMC paves a path for IMC element-wise operation, which is the foundation for the vectorized floating-point operations.

3.2.2. FP vector processing

The H-IMC structure with reconfigurable bit-wise vector logic operation is suitable for FP vector processing. Floating-point numbers consist of three parts: the sign bit, the exponent bits, and the fraction bits. The sign bit implies the number is positive or negative. The exponent bits represent the exponent of the scientific notation with a dedicated offset. The fraction bits (or mantissa) represent the coefficient with an extra hidden bit before the decimal point. For normalized FP numbers, the value is shown in the following equation:

$\begin{align} \mathrm{Value} = (-1)^\mathrm{Sign} \times 1.\mathrm{Fraction} \times 2^{\mathrm{Exponent} - \mathrm{Offset}}. \end{align} \tag{ 7 }$

3.2.3. BFloat16 VHP compartment

The element-wise compartment is developed with the bit-wise operation of H-IMC as shown in figure 4(a). Here BFloat16 is chosen to accommodate the mixed-precision training described in section 2.3, and N_B is set to 16 and N_I is set to 2 in the memory array as the basic configuration for BFloat16 element-wise multiplication. A BFloat16 number is stored in the bitcell array as follows: the eight most significant bits in each block are for 1 sign bit and 7 fraction bits, and the eight least significant bits are for eight exponent bits. Multi-bit inputs are split into a bit-serial form [49] and fed into LPU. The post-processing circuit is designed as shown in figure 4(b) for FP multiplication, and the timing diagram is shown in figure 4(c). BFloat16 VHP includes four steps: fraction bits multiplication, exponent bits addition, sign bits xor, and normalization.

(a)
Fraction Bits Multiplication: Multiplication of fractional bits is essentially the same as what integers do. This can be realized by accumulation in shift adders⁹ . As shown in figure 4(a), the signal from input port in₀ is given into LPUs of sign and fraction bits, and the LPU operation is configured as AND in fraction bits to perform bit-wise multiplication of in₀ and fraction bits. The leading hidden bit '1' in equation (7) needs to be considered during accumulation: the hidden input bit is implicitly included in the output processing circuit, and the extra multiplication of the hidden memory bit is realized by one cycle delay by a flip-flop. Fraction bits multiplication is done after 7 cycles of bit-wise accumulation.
(b)
Sign Bits Xor: The operation of the sign bit in BFloat16 multiplication is simply xor. Shift adders will stop accumulation when fraction bits multiplication is done. The input port in₀ then gives the sign bit of input data the next cycle, and completes the sign bits xor through LPU by ICL configuration.
(c)
Exponent Bits Addition: The addition of exponent bits is decomposed into bit-wise operation by shift registers. The signal from input port in₁ is always set to high for reading out exponent bits. Input exponent bits are kept via a shift register through $IN_{c,1}$ port, achieving the calculation of $x = x*2 + IN_{c,1}$ per cycle with bit-serial input. After eight cycles, input exponent bits are obtained and then summed with those in memory. Note that the addition of exponential bits requires subtracting an offset considering the offset-binary representation.
(d)
Normalization: The partial results of FP multiplication are obtained after the three steps above, which need to be stitched into BFloat16. Final normalization reformats these intermediate parts according to equation (7), and outputs them as the final multiplication result.

Through the four steps above, H-IMC with the peripheral circuit can process FP multiplication at the compartment level. Then the VHP subarray is formed by combining M such compartments in parallel, as figure 3(a) shows. In this way, a subarray can perform M elements of multiplication in parallel, i.e. VHP of a maximum width M.

3.2.4. FP VMM in an IMC subarray

Floating-point IMC VMM combines the aforementioned VHP and post-accumulation. The VHP compartment-level design above provides the function to calculate multiplication in BFloat16. And we implement the post-accumulation process with fused multiply-add (FMA) in BFloat16 as shown in figure 5(b). FMA copes with the calculation of $a\times b+c$ (given $a,b,c$ are FP numbers) with combined logic for optimal area and energy consumption. Instead of performing complete multiplication before addition, it skips the rounding period in the end, and the normalization is deferred after the addition. The normalization part of VHP compartments is disassembled into anomaly detection and rounding. In the VHP mode, data pass through the normalizer and get out as BFloat16, and in the VMM mode, data go to the input port of the fused adder tree after anomaly detection only.

**Figure 5.** (a) Multiplication process in VMM skipping rounding; (b) structure of the fused adder tree; (c) pipeline for the fused adder tree sharing.
Download figure:
Standard image High-resolution image

Floating-point addition includes complex steps of exponent comparison, fraction alignment, fixed-point addition, and normalization. Individual additions between VHP intermediate results are fused in the following accumulation process, as shown in figure 5(b) in VMM macros to reduce latency and area consumption. The detailed fused adder tree is designed as follows based on the structure in [43]:

Exponents Comparison: The exponents of all M unnormalized results are compared firstly to get the maximum value for fraction alignment.

Fraction Alignment: The weight difference in unnormalized fraction bits is obtained by subtracting each exponent from the maximum exponent. Then they are shifted by the subtraction results to get aligned with each other.

Fused Accumulation: The aligned unnormalized fraction bits are accumulated as integers by an adder tree, and then the unnormalized FMA result is obtained.

Normalization: the FMA result is obtained in unnormalized FP32 format and finally truncated into BFloat16.

In this way, alignment and normalization are omitted when obtaining the intermediate VMM results, saving the post-accumulation area. Note that the fused adder tree is combinational logic and only used when multiplication is complete. So in the macro level, the adder tree is shared by two VHP subarrays to improve overall structure utilization. The two VHP subarrays are activated at different cycles, and separate accumulation operations are performed by the time difference of VHP output results, as shown in figure 4(c).

3.3. BP Problem formulation

With the features defined above, this section details how BP algorithm is deployed onto FP H-PIM macros. figure 6(a) illustrates the calculation of algorithm 1 for a single layer using the defined VMM and VHP operators. Feed Forward mainly conducts VMM, whereas Backpropogate and Weight Update need to be implemented by both VHP and VMM. An entire neural network training process can be built up as shown in figure 6(b) with multiple layers: Master weights are kept outside the computational circuit in FP32, and the rest weights, activations, errors, and gradients are all in BFloat16. The computation for each step is elaborated as follows:

3.3.1. Weight install

IMC macros work in memory mode. Before Feed Forward starts, weights need to be written into the IMC macro first in the memory mode. The fixed weights are exploited by different operators in the following parts.

3.3.2. Feed forward

IMC macros work in the VMM mode. After Weight Install, activations begin to Feed Forward between layers. The computation of Feed Forward is mainly the multiplication of input vectors ( $a^{x,l-1}$ ) and weight matrices ( $w^{x,l}$ ). This is achieved by the aforementioned VMM operator. It conducts the computation as follows: in fully connected layers, the main operation is $a^{x,l-1}w^{x,l}$ , where weights $w^{x,l}$ are partitioned into partial matrices for storage and calculated by VMM operators; in convolutional layers, the main operation is $a^{x,l-1} \ast w^{x,l}$ , where convolutional kernels $w^{x,l}$ are reshaped into vectors, and convolution is transformed into VMM operators, as shown in figure 7(a); average pooling layers take average by performing VMM operators with the reciprocal of the size of filters. Max pooling layers and activation functions do not fit in IMC operators, and implementation requires additional circuits: max pooling layers are implemented by comparators, while activation functions are implemented by look-up tables (LUTs) [50].

**Figure 7.** (a) Implementation of the *Feed Forward* part; (b) implementation of the *Backpropagate* part; (c) implementation of the *Weight Update* part.
Download figure:
Standard image High-resolution image

3.3.3. Backpropagate

IMC macros need to switch between VMM/VHP (computation) and storage mode. After activations are fed to the last layer, errors start to Backpropagate. The computing of Backpropagate for convolutional and full-connected layers includes two parts: the operation between errors δ and weights w, and the following operation with activation gradients $\sigma^{^{\prime}}(z)$ . In this subsection, we divide BP into two steps: the first step performs VMM between errors and weights, and the second step performs the continued element-wise product with the activation gradients.

Step 1: VMM: For fully connected layers, VMM operation is performed between errors and the transposed weight matrix ( $u^{l} = ( (w^{l+1})^T \delta^{l+1})$ ); for convolutional layers, it is performed between errors and convolutional kernels rotated by 180^∘ ( $u^{l} = (\delta^{l+1} \ast rot180(W^{l+1})$ ). This step is implemented the same way as Feed Forward with VMM operators in figure 7(b), and gets the result u^l for the following step.

Step 2: Hadamard product: In fully connected and convolutional layers, activation gradients from Feed Forward are operated with the VMM results above ( $\delta^{l} = (u^{l}\odot\sigma^{^{\prime}}(z^{l})$ ). $\sigma^{^{\prime}}(z)$ is first written into IMC macros in the storage mode, then operated with VMM results from Step 1 by VHP operators, as shown in figure 7(b). The VHP operator handles element-wise product between $\sigma^{^{\prime}}(z)$ and u^l, and gives the error of this layer δ^l as output.

In pooling layers, there are no weights. Errors are upsampled instead, which means to reverse the pooling operations in Feed Forward. Average pooling takes the average value from input filters, and its upsampling averages the error back by the size of the filter. This step reuses the reciprocal mentioned in section 3.3.2. Max pooling selects the maximum value from input filters, and its upsampling puts the error back into the position of the maximum value. This step can be realized with the index for the maximum value recorded in Feed Forward. With VMM and VHP operators, Backpropagate is implemented completely.

3.3.4. Weight update

IMC macros need to switch between VMM/VHP and storage mode. After Backpropagate, Weight Update happens in non-pooling layers to get weight gradients ( $\nabla W$ ). In fully connected layers, $\nabla W$ is obtained by vector multiplication ( $\nabla W$ = $\delta^{l}(a^{l-1})^T$ ), which can be expanded as follows:

$\begin{align} \begin{bmatrix} \delta^{x,l}_0\\ \cdots \\ \delta^{x,l}_{m}\\ \end{bmatrix} \cdot \begin{bmatrix} a^{x,l-1}_{0}\\ \cdots\\ a^{x,l-1}_{n}\\ \end{bmatrix}^\mathrm{T} & = \begin{bmatrix} \delta^{x,l}_0\cdot[a^{x,l-1}_{0} & \cdots & a^{x,l-1}_{n}]\\ \cdots\\ \delta^{x,l}_m\cdot[a^{x,l-1}_{0} & \cdots & a^{x,l-1}_{n}]\\ \end{bmatrix}\nonumber\\ & = \begin{bmatrix} \begin{bmatrix} \delta^{x,l}_0\\ \cdots\\ \delta^{x,l}_0\\ \end{bmatrix} \odot \begin{bmatrix} a^{x,l-1}_{0}\\ \cdots\\ a^{x,l-1}_{n}\\ \end{bmatrix} & \cdots & \begin{bmatrix} \delta^{x,l}_m\\ \cdots\\ \delta^{x,l}_m\\ \end{bmatrix} \odot \begin{bmatrix} a^{x,l-1}_{0}\\ \cdots\\ a^{x,l-1}_{n}\\ \end{bmatrix} \end{bmatrix}^\mathrm{T}. \end{align} \tag{ 8 }$

Activations are firstly written as vectors in the storage mode, as shown in figure 7(c). In convolutional layers, this operation can be converted into convolutions between errors and activations ( $\nabla W$ = $\delta^{l} \ast (a^{l-1})$ ). Errors are written like convolutional kernels in Feed Forward and perform VMM. Then, the weight gradients obtained above are sent back to the master weights for further updating.

3.4. Framework for FP H-IMC design space exploration

The performance of FP H-IMC is determined by the parameters in table 1, as shown in figure 8. Those design parameters starting with N define the compartment-level H-IMC macro structure and the M parameter defines the macro-level design with the fused adder tree. Here N_I, N_O, and N_B solely depend on BFloat16 format in the training acceleration scheme, while N_R, N_C, and M are adjustable design parameters. In this section, we model the performance of the circuit based on these three parameters to guide the macro-level design to achieve the desired power, performance, and area.

3.4.1. H-IMC: N_R and N_C

The H-IMC structure has direct impacts on compartment-level performance, which can be mainly described by parameters N_R and N_C.

N_R is the number of rows per block, in other words, the number of WLs in the SRAM array. WL is parallel to CL and vertical to BL. Therefore, N_R does not affect the CL but BL. The load capacitance on BL is much smaller than CL, which means N_R can hardly directly affect the performance of the LPU, and its impact is mainly limited to area consumption: The larger the N_R is, the larger the area occupied by the SRAM array.

N_C is the number of blocks per compartment. It leads to a linear increase of CL length ( $L_\mathrm{CL}$ ), inflicting larger parasitic capacitance $C_\mathrm{CL}$ . This directly impacts the energy of dynamic logic per operation and also the maximum frequency due to its relation with precharging and discharging behaviors. The relationship between those parameters and the macro performance can be mathematically summarized as:

$\begin{align} \mathrm{Area} & = N_R \times N_C \times A_\mathrm{bitcell} + N_C\times A_\mathrm{LPU}\nonumber\\ \mathrm{Power} & \propto {C_\mathrm{CL}} \propto L_\mathrm{CL} \propto N_C \nonumber\\ \mathrm{Max(freq)} & \propto {C_\mathrm{CL}}^{-1} \propto L_\mathrm{CL}^{-1}\propto N_C^{-1} \end{align} \tag{ 9 }$

where $A_\mathrm{bitcell}$ is the area of SRAM bitcell, $A_\mathrm{LPU}$ is the area of LPU per block, $L_\mathrm{CL}$ and $C_\mathrm{CL}$ are the length and effective parasitic capacitance of CL. The quantitative results based on this model are given in section 4.

3.4.2. Peripheral circuit: M

A complete subarray consists of H-IMC compartments and their peripheral circuits. Its performance is positively related to the number of compartments per subarray M, as the overhead of the peripheral circuit for each compartment depends on N_C and N_R.

In an FP H-IMC macro, the number of H-IMC and BFloat16 multipliers is proportional to M, and the fused adder tree has a complex structure as shown in figure 5(b). A fused adder tree is composed of M alignment shifters, M exponent comparators, $2M-1$ intermediate adders, and a normalizer right before the output. Taking X as a performance parameter (X may be area or power), the performance equation will be:

$\begin{align} X_\mathrm{compartment} & = X_\mathrm{array} + X_\mathrm{multiplier} \nonumber\\ X_\mathrm{addertree} & = (2M-1) \times X_\mathrm{adder} + M \times X_\mathrm{alignmentshifter} + M \times X_\mathrm{comparator} + X_\mathrm{normalizer} \nonumber\\ X_\mathrm{macro} & = 2M \times X_\mathrm{compartment} + X_\mathrm{addertree} \end{align} \tag{ 10 }$

where the subscripts denote the components that the area or the power belongs to.

With simulation results of each circuit module, an overall estimation for different parameter M is also given in section 4.

4. Result

4.1. Experiment setup

4.1.1. Circuit-level verification setup

In order to validate the function of the proposed design, we construct an H-IMC behavioral model based on Verilog hardware description language. This behavioral model has been published on the open source platform for public access¹⁰ . In the following part, we use this model to verify the precision of the FP processing circuit and analyze performance in different designs. We implement the peripheral circuit using TSMC 28 nm PDK. The design space exploration is under Cadence Virtuoso® environment and Spectre® SPICE simulators.

4.1.2. Architecture-level verification setup

The proposed scheme aims at the training applications at edge devices with area limitations and low-power requirements. We select the CIFAR-10 [51] image recognition task with several widely-used networks (MobileNet [44], MobileNet v2 [52] and FBNet [53]) for embedded application to benchmark the performance of the proposed scheme. There are several simulators developed for CIM simulation (e.g. SIAM [54], NeuralSIM [55], and MNSIM [56]) The architecture-level evaluation in section 4.5 is carried out with the simulator based on MNSIM 2.0 [57] for the convenience of array-level modification. We replace the process element of this simulator with the FP H-IMC macro and add the training algorithm deployment feature according to the method proposed in section 3.3.

4.2. FP H-IMC: precision verification

The proposed fused adder tree efficiently reduces the area overhead of FP accumulation by 57.8% by reducing normalizers in the circuit. However, this method still has a potential risk that suffers accuracy loss from the mantissa pre-shift behavior. Here, we evaluate the computing accuracy of FP H-IMC by computing input cases randomly generated, the absolute value of which is from 10⁻² to 10² on the H-IMC circuit-level behavior model. Figure 9 shows precision loss of multiplication and accumulation on C model and FP H-IMC, and here FP32 results are considered as the baseline ( $\mathrm{Precision\ Loss} = \frac{\mathrm{Computing\ Error}}{\mathrm{FP32\ Results}}$ ). Figure 9(a) shows the distribution of precision loss in absolute values, and figure 9(b) shows the percentile of the loss, defined as the integral of distribution greater than a certain value ( $\int_{x}^{+\infty}(\mathrm{Percentage})$ ). Notably, the x-axes only show the absolute values of the loss in percentage. Therefore, the x-axes of the first quadrant and second quadrant are both positive and symmetric along the y-axes.

**Figure 9.** Precision comparison between the ideal BFloat16 calculation and the calculation with FP H-IMC.
Download figure:
Standard image High-resolution image

The precision loss of BFloat16 and FP H-IMC has similar distribution: in each figure, the left side reveals MAC results from calculation in C programming language; the right side shows the results from FP H-IMC circuit-level simulation. Their precision loss distributions are both concentrated below $0.2\%$ . Ideal BFloat16 computing has about $86.5\%$ results with precision loss less than $0.2\%$ ; only less than $0.1\%$ of the precision loss is over $0.4\%$ . In comparison, FP H-IMC computing has $86.4\%$ results with precision loss less than $0.2\%$ ; and less than $0.1\%$ of the results are with precision loss more than $0.4\%$ . This comparison shows that the proposed FP H-IMC can achieve as good precision as the ideal BFloat16 computing.

4.3. FP H-IMC: precision verification

Micikevicius et al show us the mixed precision training method has no accuracy loss compared to the full precision baseline [38]. In this section, we trained several widely-used networks for the CIFAR-10 and CIFAR-100 classification tasks with this method. Different training algorithms are used with identical hyperparameters. The top-1 accuracy of different training methods is shown in table 2.

Table 2. CIFAR-10 and CIFAR-100 top-1 accuracy table.

Model	CIFAR-10		CIFAR-100
Model	FP32	Mixed precision	FP32	Mixed precision
VGG-D [58]	87.06%	86.83%	63.75%	63.57%
GoogLeNet [59]	87.45%	87.67%	66.22%	66.02%
Resnet50 [60]	90.57%	90.54%	64.40%	64.95%
MobileNetv2 [52]	87.66%	87.75%	61.48%	61.53%

Taking the FP32 training session as the baseline, our training scheme can match the top-1 accuracy. The result shows that the mixed precision training method using FP numbers with fewer bits in computing has reliable training accuracy with different networks.

4.4. FP H-IMC: circuit design exploration

Section 3.4 analyzes the possible impacts of different parameters on performance. Here these simulation results are consistent with the performance functions from equations (9) and (10).

Figure 10(a) shows relationship between area (figure 10(a1)), data density (figure 10(a2)), max working frequency (figure 10(a3)), power (figure 10(a4)) of the IMC structure and parameters N_R and N_C. The post-processing circuits for the FP numbers and the fused adder tree are not taken into consideration to this extent. From these results, it can be observed that:

(a)
The area of IMC structure is determined by both N_C and N_R. The number of bitcells in the SRAM array is equal to N_C multiplied by N_R. More RAM bitcells bring higher data capacity in one macro but also result in larger area consumption as shown in figure 10(a1).
(b)
N_R mainly influences the number of bitcells in each column of a compartment. When it increases, more bitcells are installed in each column, and the area and data capacity both grow. Since each column possesses one extra LPU for logical operation, as BL increases, the data density ( $\frac{\mathrm{Data\ Capacity}}{\mathrm{Area}}$ ) will increase because data capacity grows faster than the area, while the area efficiency ( $\frac{\mathrm{Throughput}}{\mathrm{Area}}$ ) decreases in that no extra throughput improvement techniques are introduced during computation.Figure 10(a2) shows the correlated result that the data density grows for about 1 × as area efficiency decrease to ${\sim} 33\%$ at the same time in this circuit structure.
(c)
The growth of N_C leads to equivalent growth in the area and data capacity and does not influence the data density like N_R. It mainly influences CL in figure 10, which has a great impact on the precharge behavior of the dynamic logic in use. When N_C grows, the length of CL increases and more parasitic capacitance is attached to CL and drags down the computation speed accordingly. It makes the precharge period more time and energy consuming and subsequently leads to both power growth and a peak throughput reduction. The max working frequency $\mathrm{Max(freq)}$ and peak throughput will have a $\sim 50\%$ reduction when the compartment size increases for 3 × from N_C side as shown in figure 3(a3). In figure 3(a4), the power consumption grows almost linearly to N_C. The linearity of this relationship is due to the parasitic capacitance introduced by N_C. The charging power consumption in computation is proportional to the parasitic capacitance, and the parasitic capacitance grows linearly with N_C. A detailed explanation can be found in section 3.4.1

Figure 10(b) shows the relationship between the power (figure 10(b1)) and area (figure 10(b2)) of the proposed FP H-IMC versus the parameter M, i.e. the number of compartments in one subarray. The BFloat16 post-processing circuits and the fused adder tree are included. The total power and area of the IMC grow as M increases. Yet the average power and area overhead per compartment decrease, because the area power consumption of the fused adder tree does not increase strictly proportional to M. The proposed fused adder tree with the shared normalizer at the final stage effectively saves power and area. Based on the results above, the design parameters of FP H-IMC macros are determined as follows for the balance between storage and computing requirement: N_R = 16, N_C = 4 and S = 32. Detailed performance is compared to other SoTA designs in table 3.

Table 3. Comparison table of SoTA SRAM-Based IMC schemes.

	ISSCC'21 [24]	ISSCC'21 [61]	ISSCC'22 [41]	ISSCC'22 [47]	IEEE Micro'21 [42]	This work
Technology	28 nm	22 nm	28 nm	28 nm	28 nm	28 nm
Input format	INT4/8	INT1-8	INT8/BF16/FP32	INT1-8	BF16	BF16
Memory format	INT4/8	INT4/8/12/16	INT8/BF16/FP32	INT1/4/8	BF16	BF16
Frequency	56 MHz–100 MHz	139 MHz	95 MHz–220 MHz	333 MHz	40 MHz–250 MHz	500 MHz
Operation	VMM	VMM	VMM	VMM & VHP	VMM	VMM & VHP
Data density (IMC part)	234 Kb mm⁻²	317 Kb mm⁻²	102 Kb mm⁻²	1067 Kb mm⁻²	219.5 Kb mm⁻²	769.2 Kb mm⁻²
Computing efficiency	18.9 TOPS W⁻¹ (INT8/INT8)	24.7 TOPS W⁻¹ (INT8/INT8)	29.2 TFLOPS W⁻¹ (BF16/BF16)	27.38 TOPS W⁻¹ (INT8/INT8)	0.76 TFLOPS W⁻¹ (BF16/BF16)	0.56 TFLOPS W⁻¹ (BF16/BF16)

4.5. Microarchitectural benchmark

In training, VHP is applied in the Backpropagate and Weight Update in algorithm 1 to improve the hardware execution performance. Implemented with the proposed FP H-IMC macro, VHP optimizes the number of vector operations down by about $88.9\%$ with the parallel element-wise multiplication in Backpropagate, and by about $92.8\%$ in Weight Update of fully-connected layers. One vector operation is defined as one time FP H-IMC activated for each VHP or VMM operation. Every 16 bytes written in the memory mode is considered as one VMM/VHP operation corresponding to the throughput of each macro.

Here we take layers from MobileNet [44] to explain how VHP optimizes training by reducing the number of operations. MobileNet breaks large convolutional layers into smaller pointwise convolutional layers and depthwise convolutional layers. For example, when the kernel size is 3 in the convolutional layers, a convolutional layer with 7 × 7 × 1024 input and 7 × 7 × 1024 output has 3 × 3 × 1024 × 1024 filters. Yet MobileNet reconstructs it as a depthwise layer with 3 × 3 × 1024 filters and a pointwise layer (1 × 1 × 1024 × 1024 filters at all) for the purpose of operations reduction.

Figure 11 shows the comparison results between pointwise layers, depthwise layers, and fully connected layers with different input feature sizes on the operation count in training. When the input feature size of MobileNet is 224 × 224, the last three layers except the pooling layers are shown in figure 11(a). In a pointwise layer, VHP barely saves the operations because the convolution in Weight Update dominates. In a depthwise layer, the operation number of Backpropagation and Feed Forward increases. The total number of operations decreases by about $29\%$ with VHP optimization on Backpropagate, Step 2 ¹¹ . In a fully connected layer, Weight Update still dominates, but the outer multiplication of vectors in this step is implemented by VHP operation, and a large amount of element-wise multiplication is optimized. The number of operations is reduced by about $88\%$ .

This optimization becomes more obvious when the input feature size shrinks to 32 × 32 (the classification task changes from ImageNet to cifar-10). Although the situation for the pointwise layer remains nearly unchanged, the optimization on the depthwise layer grows to $44\%$ . When the networks shrink for edge applications, VHP with FP H-IMC can significantly reduce the number of operations in depthwise layers and fully connected layers.

Figure 12 shows the performance and energy comparison of FP32/BFloat16 mixed-precision training on various networks. Our simulation focuses on the BFloat16 part for the purpose of evaluating the proposed macro, and the energy consumption of storing the FP32 master copy is not included. In the figure, both energy and time consumption is normalized to the baseline of FP32 training on the GPU platform. On the same platform, mixed-precision training has a geometric average of $24.6\%$ energy saving and $22.1\%$ time saving due to the fewer data bits used. Our mixed-precision training scheme with the proposed FP H-IMC has a geometric average of $96.7\%$ energy saving and $53.1\%$ time saving than training in FP32. The reduction in energy and time stems from two aspects: VHP optimizes Backpropagate, Step 2 and Weight Update of fully connected layers, and FP H-IMC structure reduces computing consumption in VMM.

The comparison with other SoTA hardware schemes is in table 3. Although various types of integers training and inference have been studied, there are few PIM schemes supporting operations between FP numbers. The benchmark operations in the left four columns provide the capability for inference/tiny network training, whereas the two columns on the right provide the capability for general neural network training. Compared to previous work supporting FP operation, the energy efficiency of our design is relatively low due to maintaining high frequency in computing. However, the FP H-IMC structure not only proposes VHP for fully connected layers support, but also provides high data density and area efficiency for the need of area saving for training on edge devices.

5. Conclusion

In conclusion, we develop the FP H-IMC structure for training. Our proposed FP H-IMC design improves the data density to 3.5 × than the previous design, with negligible precision loss compared to the ideal BFloat16 computing. This makes it more possible to achieve on-chip training with high accuracy using the mixed-precision training algorithm. Furthermore, we decompose the BP training into two basic operators, VMM and VHP, for implementation of the IMC macros. We deploy mixed-precision training with VMM and VHP operators onto the proposed FP H-IMC macro. The simulation result shows that our scheme has an average 96.7 × energy saving and $53.1\%$ time saving on various networks compared to the GPU-based platform. The analysis results indicate that VHP has an obvious optimization for fully connected layers acceleration. For convolutional layers, the optimization of VHP becomes better in smaller networks.

The development of edge-computing AI agents is calling for stronger learning capability with limited computing resources. Support for VHP operation and floating-point precision is subsequently becoming a must for fast-evolving AI agents. This work intends to make IMC circuits evolve from pure neural network inference engines into highly-efficient cores for both inference and learning. The experimental results of this work elucidate how circuit-level VHP operator help train neural networks with low energy cost on edge devices. This work can benefit the applications that require high-efficiency learning or fine-tuning for relatively compact neural network models, such as perception and reasoning convolutional or recurrent neural networks on autonomous vehicles.

Acknowledgments

This work was supported by National Natural Science Foundation of China (92264201, 61925401, 92064004, 61927901, 92164302), and the 111 Project (B18001). Y Y acknowledges the support from the Fok Ying-Tong Education Foundation and the Tencent Foundation through the XPLORER PRIZE. The authors thanks for the support from Pimchip Technology Co., Ltd. This work is supported by High-performance Computing Platform of Peking University.

Data availability statement

All data that support the findings of this study are included within the article (and any supplementary files).