Ultra-High-Speed Accelerator Architecture for Convolutional Neural Network Based on Processing-in-Memory Using Resistive Random Access Memory

Processing-in-Memory (PIM) based on Resistive Random Access Memory (RRAM) is an emerging acceleration architecture for artificial neural networks. This paper proposes an RRAM PIM accelerator architecture that does not use Analog-to-Digital Converters (ADCs) or Digital-to-Analog Converters (DACs). Furthermore, the convolution computation requires no additional memory, avoiding a large amount of data transportation. Partial quantization is introduced to reduce the accuracy loss. The proposed architecture substantially reduces the overall power consumption and accelerates computation. Simulation results show that, at 50 MHz, the image recognition rate of the Convolutional Neural Network (CNN) algorithm on this architecture reaches 284 frames per second. The accuracy of partial quantization remains almost unchanged compared with the algorithm without quantization.

In this work, we propose an RRAM PIM architecture that eliminates the need for ADCs and DACs for data conversion. No additional memory is used in convolution computation. The main contributions of this article include:
1. Introducing an RRAM PIM architecture that achieves high-speed processing of CNNs without requiring DACs and ADCs;
2. Proposing a partial quantization scheme that reduces accuracy loss;
3. Achieving an acceleration of up to 284 frames per second while maintaining almost the same accuracy as without quantization.
A conventional PIM configuration is illustrated in Figure 1. The design includes a PIM controller, an input data converter, an output data converter, and a memory and result processing module. The memory array can be implemented with various memory devices, such as Flash, SRAM, and RRAM. The PIM controller first writes weight data into the memory array, and the input data are then converted by the input converter. The conversions may involve transforming the data from column to row and from digital to analog format. The converted data are then processed inside the memories with the weight data. If the in-memory processing is carried out in the analog domain, the output data must be converted back into the digital domain by an ADC. The digital signals are then transmitted to the next round of computation. A typical RRAM PIM configuration is depicted in Figure 2. The design mainly includes a DAC array, an RRAM array, an ADC, and a system controller [23]. After system initialization by the controller, the DAC array converts features from digital to analog to enable processing in the RRAM. The RRAM array then performs computations on these data within the memories in the analog domain. The resulting analog signals output from the RRAM must be converted back into digital form by the ADC, after which they are processed by the controller in the next step, as shown in Figure 2.

Architecture
The conventional RRAM computing accelerator comprises RRAM arrays, control modules, ADCs, and DACs [23], which is a suitable technique for small-scale algorithms.
However, as the scale of the algorithm increases, the power consumption and chip area required for the ADCs and DACs grow significantly. The proposed RRAM PIM architecture, shown in Figure 3a, consists of the pipeline register (Figure 3b), the feature-reshaping module (Figure 3c), the Process Element (PE) array (Figure 3d), the post-feature processing module (Figure 3e), the configuration register, and the SRAM module. The output feature from Layer (N-1) is stored in the pipeline register. The feature shape is then modified in the feature-reshaping module. The reshaped feature is sent to the RRAM array, where convolution is performed with the weight matrix stored in the array. After post-feature processing, including biasing, activation, and the pooling operation, the result is sent to the pipeline register for the next layer's computation.
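The per-layer pipeline ends with the post-feature processing described above. The sketch below illustrates a minimal bias + ReLU + pooling stage; the function name, the use of max pooling, and the test values are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def post_feature_process(conv_out, bias, pool=2):
    """Illustrative post-feature processing (cf. Figure 3e): add bias,
    apply ReLU, then pool. Max pooling with a 2x2 window is assumed here."""
    x = np.maximum(conv_out + bias[:, None, None], 0.0)   # bias + ReLU
    c, h, w = x.shape
    x = x[:, :h - h % pool, :w - w % pool]                # crop to a multiple of pool
    c, h, w = x.shape
    return x.reshape(c, h // pool, pool, w // pool, pool).max(axis=(2, 4))

feat = np.arange(-8, 8, dtype=np.float32).reshape(1, 4, 4)
out = post_feature_process(feat, bias=np.zeros(1, dtype=np.float32))
print(out.shape)  # (1, 2, 2)
```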

• The pipeline register is shown in Figure 3b. Its input comes from Layer (N-1). Because the feature no longer needs to be written to memory, the register also provides the timing required by the next layer;
• The feature-reshaping module is shown in Figure 3c. For rows or columns that require padding, a "0" or "1" is stored in the feature according to the padding request. After reshaping, the feature in a kernel is flattened and sent to the RRAM operation core in the PE module;
• The post-feature processing module is shown in Figure 3e. Activation and biasing are processed inside this module. The "ReLU" activation function is used, but other activation functions suitable for algorithms beyond CNNs can also be implemented.
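The padding-and-flattening step above can be sketched as an im2col-style transform; the function below, including the `pad_value` choice, is an illustrative software analogue (the hardware streams windows rather than materializing a matrix):

```python
import numpy as np

def reshape_feature(feature, kernel, stride=1, pad=0, pad_value=0):
    """Pad the input feature (with '0' or '1' per the padding request), then
    flatten each kernel-sized window into a column vector so that
    convolution reduces to a matrix-vector operation in the RRAM array."""
    c, h, w = feature.shape
    kh, kw = kernel
    padded = np.pad(feature, ((0, 0), (pad, pad), (pad, pad)),
                    constant_values=pad_value)
    out_h = (h + 2 * pad - kh) // stride + 1
    out_w = (w + 2 * pad - kw) // stride + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=feature.dtype)
    for y in range(out_h):
        for x in range(out_w):
            window = padded[:, y*stride:y*stride+kh, x*stride:x*stride+kw]
            cols[:, y * out_w + x] = window.ravel()
    return cols

# A 3x3x2 feature with 2x2x2 kernels yields 2x2 output windows,
# matching the example given later in the text.
feat = np.arange(2 * 3 * 3).reshape(2, 3, 3)
cols = reshape_feature(feat, (2, 2))
print(cols.shape)  # (8, 4): 2*2*2 = 8 rows, 2*2 = 4 output positions
```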

Partial Quantization
In this work, a partial quantization technique is proposed in conjunction with the characteristics of the RRAM PE. Quantizing the weights substantially reduces the required RRAM storage space. The weights were trained on the PyTorch® platform. As shown in Figure 4a, full quantization quantizes both the weights and the features, which may lead to accuracy loss. In contrast, Figure 4b shows partial quantization, where only the weights are quantized. After training, the floating-point data are quantized into three parts: 8-bit fixed-point integers for the weights, a 32-bit floating-point digit for the scale factor, and an 8-bit fixed-point integer for the zero-point parameter. During computation, the 8-bit fixed-point integer weights are written to the RRAM weight matrix, and the scale factor and zero-point parameters are mapped to 25-bit fixed-point digits.
The weight matrix is quantized and dequantized according to Equation (1), where x(c, x+i, y+j) is the feature of input channel c; w(cout, c, i, j) is a weight of output channel cout; zp_w represents the zero point of the weights in the channel; scale_w is the scaling factor; and bias_cout is the bias of the channel. If the zero points of the weights can be trained to zero, Equation (1) can be further simplified. For the feature-out module, since a memory-write operation is not required, the feature can be transferred to the next layer as a 16-bit fixed-point integer without quantization.
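The weight-only (partial) quantization described above can be sketched as follows. The function names, the min/max range mapping, and the test values are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def quantize_weights(w):
    """Affine quantization of weights to int8: 8-bit integer weights plus a
    32-bit float scale and an 8-bit zero point, per the scheme above."""
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / 255.0
    zero_point = np.round(-w_min / scale).astype(np.int32) - 128
    q = np.clip(np.round(w / scale) + zero_point, -128, 127).astype(np.int8)
    return q, np.float32(scale), np.int8(zero_point)

def dequantized_matmul(q_w, scale, zp, x, bias):
    """Partial quantization: only the weights are quantized; the feature x
    stays in its original precision, as in Figure 4b."""
    return scale * (q_w.astype(np.int32) - np.int32(zp)) @ x + bias

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
x = rng.normal(size=(8,)).astype(np.float32)
bias = np.zeros(4, dtype=np.float32)

q_w, scale, zp = quantize_weights(w)
y_ref = w @ x + bias                               # unquantized reference
y_q = dequantized_matmul(q_w, scale, zp, x, bias)  # weight-quantized result
print(np.max(np.abs(y_ref - y_q)))                 # small quantization error
```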

Data Flow without Additional Memory Storage
To reduce the data conversion cost, the ADCs and DACs are not utilized in this design. Instead, the ripple counter in conjunction with the RRAM array can effectively conduct convolution without data conversion, as demonstrated in Figure 5.
As shown in Figure 5, the RRAM PE unit comprises a column converter unit, a padding processing unit, a multi-bit D-flip-flop register, an RRAM computing matrix, a ripple counter, and a controller. As an example, for an input feature of size 3 × 3 × 2, the convolution with two 2 × 2 × 2 kernels produces a 2 × 2 × 2 output feature. The column converter and the padding unit reshape the input feature, and the D-flip-flop register sends the resulting feature vector into the RRAM matrix column by column; each column vector is driven into the RRAM matrix in row order. That is, for the first column vector, at time T0 we send Xin[0][0] to Row 0 of the RRAM array, and the ripple counter counts the output pulses of each column. At time T1, Xin[1][0] is sent to the RRAM array, and the ripple counter continues to count the pulse output of each column and accumulate the pulse count. After all the row bits have been sent, one step of the computation is complete.
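The counting scheme can be emulated in software. In this hedged sketch, each input element is assumed to be presented as a train of pulses whose count equals its value; binary LRS cells pass pulses to the per-column ripple counters, so the final counts equal the vector-matrix product without any DAC or ADC. The exact pulse encoding is an assumption, since the text does not fully specify it:

```python
import numpy as np

def rram_column_mac(x_in, w_bits):
    """Emulated ripple-counter MAC: rows are driven sequentially (T0, T1,
    ...); each LRS ('1') cell passes the row's pulses to its column counter,
    so the counters accumulate the dot product of x_in with each column."""
    counters = np.zeros(w_bits.shape[1], dtype=np.int64)
    for row, x in enumerate(x_in):       # one row per time step
        for _pulse in range(int(x)):     # x pulses on this row
            counters += w_bits[row]      # LRS columns pass the pulse
    return counters

x_in = np.array([3, 1, 2, 0])            # unsigned input feature values
w_bits = np.array([[1, 0],               # binary weight matrix (LRS=1, HRS=0)
                   [0, 1],
                   [1, 1],
                   [1, 0]])
print(rram_column_mac(x_in, w_bits))     # equals x_in @ w_bits
```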
The time to complete a window feature, Twin, can be determined using Equation (3), where Cin is the number of input channels; Bx is the bit width of Xin; Kx is the kernel size in the x-direction; and Ky is the kernel size in the y-direction. Kx and Ky are determined by the algorithm and can be regarded as coefficients, so Bx and Cin determine the computation time of the window. By selecting an appropriate bit width, and provided the number of input channels is supported by the RRAM array, the operation speed of the entire system can be further increased.
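Taking the window time as the product of these four factors (an assumption consistent with the description, with the clock period omitted), the cycle count can be computed directly; the example values below are illustrative:

```python
def window_cycles(c_in, b_x, k_x, k_y):
    """Cycles to stream one window, assuming Twin is proportional to the
    product of input channels, input bit width, and kernel dimensions."""
    return c_in * b_x * k_x * k_y

# e.g. a 3x3 kernel over 2 input channels with 8-bit inputs:
print(window_cycles(c_in=2, b_x=8, k_x=3, k_y=3))  # 144 cycles
```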

Mapping of Algorithm to the Proposed RRAM PIM
To demonstrate the feasibility of the proposed RRAM PIM, the YOLOv3 algorithm is used for mapping. The YOLOv3 algorithm includes ResNet blocks and YOLO blocks. The main operators include convolution, leaky ReLU, up-sampling, route, and the common pooling operator. The operations are consolidated into a convolution with a kernel size of 3 × 3 and a stride of 2 × 2.
To increase the processing speed, a large convolution computation can be distributed to a plurality of PE units. The computation of the PE units is carried out in parallel. The computation results of all the PE units are then accumulated together. When all the PE units have completed their computations, the convolution result is obtained. This result is then sent to the activation function of the post-feature processing module, up-sampling module, etc. The output feature is stored in the Pipeline Register and used as the input feature for the next-layer computation.
To improve the performance of the accelerator, multiple PEs are used for parallel computations. Each PE is responsible for computing a column of features, effectively reducing the computation time for each layer.
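The distribute-then-accumulate scheme above can be sketched in software. In this hedged illustration, the flattened input (and the matching slice of the weight matrix) is split among several PE units, each PE computes a partial product in parallel, and the partial results are accumulated; the splitting strategy and sizes are illustrative assumptions:

```python
import numpy as np

def conv_accumulate_parallel(w_mat, feat_col, n_pe=4):
    """Distribute one large convolution over n_pe PE units: each PE handles
    a slice of the flattened input, and the partial sums are accumulated to
    give the same result as a single large array."""
    w_parts = np.array_split(w_mat, n_pe, axis=1)   # weight slice per PE
    x_parts = np.array_split(feat_col, n_pe)        # input slice per PE
    return sum(w @ x for w, x in zip(w_parts, x_parts))

rng = np.random.default_rng(1)
w_mat = rng.integers(0, 2, size=(16, 12))   # binary weight matrix (LRS/HRS)
x = rng.integers(0, 16, size=(12,))         # one flattened feature column
out = conv_accumulate_parallel(w_mat, x)
print(np.array_equal(out, w_mat @ x))       # True: matches a single PE
```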
In the ResNet block of the YOLOv3 algorithm, the output features from all the previous layers are accumulated to produce the input feature of the current layer. This requires storing the output feature of each of the previous layers, which is kept in the SRAM shown in Figure 3. No other memory is required in the architecture. Figure 6 shows the electrical characteristics of the Au/Ni/HfO2/Ni RRAM (blue line), reported in our previous work [24]. The resistance of the RRAM can be substantially changed by applying a voltage that exceeds its threshold. When a positive voltage applied across the RRAM (from top electrode to bottom electrode) exceeds its positive threshold (VSET), the resistance of the RRAM decreases to the low-resistance state (LRS); this process is generally called the SET process. When a negative voltage applied across the RRAM (from bottom electrode to top electrode) exceeds its negative threshold (VRESET), the resistance of the RRAM increases to the high-resistance state (HRS); this process is generally called the RESET process. The RRAM model, adapted from Knowm's mean metastable switch (MMSS) memristor model [25], is used to fit the experimental data of the Au/Ni/HfO2/Ni RRAM. The equation for the Au/Ni/HfO2/Ni RRAM model is as follows:

Au/Ni/HfO2/Ni RRAM Model
where G(X) represents the conductance of the RRAM and X represents its state variable, a value between 0 and 1. When X = 0, G(X) reaches its minimum value; when X = 1, G(X) reaches its maximum value. dX/dt represents the drift velocity of the state variable, where τ is a time constant and β = q/(kT) = (V_T)^(-1). V_T is the thermal voltage, where q is the elementary charge, k is the Boltzmann constant, and T is the temperature. V_ON and V_OFF represent the positive and negative thresholds of the RRAM, respectively. S(X) is an S-shaped curve function, where α is a tuning constant that adjusts the nonlinearity of S(X) (S(0.5) = 0.5, lim X→0 S(X) = 0, lim X→1 S(X) = 1, α < 0). By introducing the S(X) function, the hard-switching characteristics of the Au/Ni/HfO2/Ni RRAM can be better fitted. The simulated characteristics of the Au/Ni/HfO2/Ni RRAM are shown in Figure 6 (red line), and the parameters used in this work are listed in Table 1.
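The constraints on S(X) are stated above, but its closed form is not reproduced here. One function family that satisfies those constraints exactly, offered purely as an illustrative stand-in for the model's actual equation, is:

```python
def s_curve(x, alpha=-4.0):
    """Illustrative S-shaped function: S(0.5) = 0.5, S(X) -> 0 as X -> 0,
    S(X) -> 1 as X -> 1, with alpha < 0 tuning the nonlinearity.
    This is NOT the fitted Au/Ni/HfO2/Ni model equation, only a stand-in."""
    k = -alpha  # alpha < 0, so the exponent k is positive
    return x**k / (x**k + (1.0 - x)**k)

print(s_curve(0.5))                      # 0.5, as required
print(s_curve(1e-6), s_curve(1 - 1e-6))  # near 0 and near 1, as required
```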

Results and Discussion
The internal delay can be obtained through SPICE® simulation. A hardware acceleration platform for YOLOv3 is used. The system was designed in Verilog, except for the RRAM model. The working frequency was evaluated using the Design Compiler (DC) and IC Compiler (ICC) tools. The Au/Ni/HfO2/Ni RRAM arrays were modeled in Verilog-A, which is non-synthesizable. In the proposed design, only two states are required for the RRAM (i.e., HRS and LRS), and the ON/OFF ratio of the Au/Ni/HfO2/Ni RRAM is large enough; thus, an idealized RRAM model can be used to simulate the proposed PIM architecture. The size of each RRAM array in the proposed design is 36 × 256, containing 9216 RRAM devices. The data are fed into the RRAM array serially, while the outputs are read in parallel (32 × 8 bits). Synopsys® VCS 2018 is used to compile and simulate the architecture. The testbench is built using SystemVerilog and the Universal Verification Methodology (UVM®); 5000 images from the COCO dataset [26] are selected for identification. The images are converted to binary files that are read by SystemVerilog and sent to the proposed system (the DUT). The testbench monitors the intermediate results and the saved output data to ensure that the system performs as expected. The post-processing of the YOLOv3 algorithm is realized using the PyTorch library because it is an efficient and lightweight option. Figure 7 illustrates the processing of images by the proposed architecture; after processing by the RRAM PIM, the images are effectively identified. The architecture parameters of the overall simulation are listed in Table 2.
The simulation results are presented in Table 3. In the simulation, the image size is unified to 416 × 416 pixels. The data bit length of the output feature is 16 bits at a working frequency of 50 MHz. A comparison of the operation and memory resources required by this RRAM PIM architecture with those of other architectures is also presented in Table 3.
The image processing speed of the RRAM PIM in this work is approximately 7 times that of the GPU (NVIDIA RTX 3080): 284 versus 39 frames per second (FPS). This indicates that the RRAM PIM outperforms the GPU in throughput, although its working efficiency is relatively lower. The RRAM PIM with ADCs proposed by Khwa et al. achieves an image processing speed of 11,606 FPS with ResNet20, which is equivalent to 14.37 FPS with YOLOv3; the image processing speed of the RRAM PIM in this work is therefore approximately 20 times faster. In the RRAM PIM with ADCs proposed by Peng et al., the energy consumption and area of the ADCs are 12 times and 20 times those of the RRAM arrays, respectively; these overheads are eliminated in the proposed architecture. Weight replication is used to ensure that each stage of the pipeline requires the same amount of time. As a result, the RRAM PIM consumes 877.22 Mbit of RRAM resources, more than the 497.84 Mbit (quantized to 8 bits) required by the YOLOv3 algorithm itself.
In general, the RRAM PIM outperforms other architectures in terms of the identification rate for heavy-load CNN algorithms, speed, and image size.
The data type of the images after normalization is floating point. When the image data are mapped from floating point to fixed point, the data bit width changes, which may affect the accuracy of the accelerator. To evaluate the overall accuracy of the accelerator, 5000 images with a size of 416 × 416 pixels from the COCO dataset were used.
Figure 8 compares the prediction results of the fully quantized model and the partially quantized model proposed in this work for a typical prediction on a single picture using the YOLOv3 model. As shown in Figure 8, both models accurately predict the presence of a dog, a bicycle, and a truck. In terms of prediction precision, the partially quantized model outperforms the fully quantized model, with the exception of the truck. This behavior holds across the prediction series in the pictures: although the partially quantized model may not show better precision for a specific series, it achieves a better mean Average Precision (mAP) overall.
Figure 9 compares the average precision of the partially quantized model and the model without quantization for different prediction series. For most series, the average precision of the partially quantized model decreases only slightly compared to the unquantized model, and for some series it is even higher. This means that the proposed partial quantization approach can effectively reduce the computation resources of the RRAM PIM with little loss of precision. Figure 10 compares the average precision of the fully quantized model and the partially quantized model for the different series in the COCO database. As depicted in Figure 10, the partially quantized model exhibits better average precision for most, though not all, of the series, and hence achieves a better overall mAP.
The mAP for the various quantization techniques is presented in Figure 11. As can be observed in the figure, the mAP of the proposed RRAM PIM without quantization is 54.135%, while those of partial quantization with 8-bit and 16-bit precision are 53.588% and 53.386%, respectively. Compared with the mAP without quantization, the mAP of partial quantization with 16-bit or 8-bit precision exhibits an insignificant decrease. In contrast, the mAP of full quantization is 50.319%, a decrease of approximately 4%.

Conclusions
In this work, an RRAM PIM architecture is designed to accelerate CNNs at an ultra-high frame rate. A partial quantization technique is proposed to reduce the accuracy loss. The YOLOv3 algorithm is used to build a system model for simulation. The resulting image recognition speed reaches 284 FPS, approximately 7 times faster than the NVIDIA RTX 3080 GPU. The proposed partial quantization with 8-bit and 16-bit precision achieves mAP values of 53.588% and 53.386%, respectively, which remain almost unchanged compared with the algorithm without quantization.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.