1. Introduction
Convolutional neural networks (CNNs) are used in a wide range of image recognition and object detection applications. Compared to other computer vision algorithms, CNNs offer a significant improvement in accuracy for object recognition, target detection, and video tracking. As a result, CNN models have become popular and have played a crucial role in the rapid advancement of computer vision applications. Meanwhile, CNN models are becoming more complex, using different kernel sizes and greater network depth and scale to achieve higher prediction accuracy. These changes place significant demands on the storage and processing capabilities of current hardware accelerators. Thus, new acceleration architectures for object detection CNN models are necessary, applying efficient data streaming, storing, and processing methods [1].
Most CNN models comprise a series of convolutional layers, and each layer is convolved with kernels of different sizes. For example, our target CNN model is the YOLOv5 object detector, in which 99% of the computations are convolutions with various kernel sizes [2]. Accelerating YOLO-like CNNs on hardware devices can significantly improve their inference speed, enabling faster execution compared to traditional CPU or GPU implementations. Furthermore, it is essential to optimize the computations in the convolution operation to support various kernel sizes when designing a new high-speed CNN accelerator. These kernel-based optimizations of the convolution operation blocks help to map a variety of CNN models to the hardware accelerator.
Most of the CNN accelerator architectures use parallel processing element (PE) units, which consist of a multiplier and accumulator (MAC), as shown in Figure 1. A systolic array is a special PE array structure built for the fast and efficient operation of regular array algorithms and to reduce their computation time. The systolic arrays are also effective at reducing memory access by reusing data that have already been passed through other PEs. Therefore, many research works have proposed systolic array architectures to maximize the speed of iterative convolution computations [3,4,5,6,7,8,9,10,11,12,13,14].
In many CNN models, the convolutional layer uses a stride of 2 to convolve the input data with a 3 × 3 kernel, which is faster than using a stride of 1. A convolutional layer with a stride of 2 is advantageous in that it requires less computation: the filter moves two pixels at a time over the input feature map, which results in faster down-sampling of the input. However, most research on CNN accelerators shows that the time consumption for the stride 2 mode is the same as for the stride 1 mode. The reason is that the feature map data are supplied to the PE in the same way as in the 3 × 3 stride 1 kernel mode, which produces relatively low PE utilization. This indicates that the stride 2 mode is not as efficiently implemented as the stride 1 mode on CNN accelerators [15].
In this paper, we propose a hardware architecture for a CNN inference accelerator, using a novel systolic array architecture called the flexible diagonal cyclic array (FDCA) to accelerate the convolution operation and support various kernel sizes, including the 3 × 3 kernel with stride 2. This paper introduces the following new methods to minimize repeated memory accesses, optimize hardware resources for various kernel modes, and enable the mapping of diverse CNN models onto a wide range of FPGA devices.
Flexible Diagonal Cyclic Array (FDCA) for kernel modes: The FDCA is a novel systolic array structure designed to maximize data reuse and speed up computation by efficiently performing convolutions. In the FDCA, PEs are arranged in a 3 × 3 systolic array. Multiplication and accumulation operations are performed to calculate a partial sum, which is then forwarded to the diagonal PEs to accumulate the output result. In this study, we optimized a DCA systolic array for the convolution operation using a 3 × 3 kernel with strides of 1 and 2, which are commonly used in CNN accelerators.
Input Zero Padding: The CNN accelerator supports adding zero-padding data around the input feature map. When the CNN accelerator reads the input feature map from DDR, it decides whether to add input zero padding at the corresponding position. This function provides several advantages over writing padding data for the output feature map, such as reducing DDR access and effectively utilizing on-chip memory. To implement input padding, we designed the input zero padding circuit, which utilizes a 2-bit register to indicate the status required for each situation and a wire to control the global input buffer read enable signal.
Reconfigurable Input FIFO and FIFO Output Cache Memory: The reconfigurable input FIFO consists of three SRAM FIFOs and one register connected in a predefined sequence. When we define the kernel mode for a convolutional operation using a specific stride and kernel size, the reconfigurable input FIFOs are interconnected according to the kernel mode to efficiently reuse the data. The data that are read from the first FIFO will flow to another FIFO and systolic array processing, respectively.
The FIFO output cache memory is a register that supplies a large amount of data to the PE. It can transfer two different data to the PE depending on the kernel mode. The address of this register is used to activate the “read enable” signal of the reconfigurable input FIFO connected to the register and to generate a “write enable” signal for the register, allowing data to be read sequentially whenever needed.
Weight Parameter Quantization: Quantization is a method for reducing model size by converting model weights, biases, and activations from a high-precision floating-point representation to low-precision floating-point (FP) or integer (INT) representations, such as 16-bit or 8-bit. Converting the weights of a model to lower precision can significantly reduce the model size and improve inference speed without sacrificing too much accuracy. Additionally, quantization improves model performance by reducing memory bandwidth requirements and increasing resource utilization [16].
In this work, we used 32-bit floating point parameters for training and then quantized them to 8-bit integers to enable high-speed lightweight CNN inference. By applying a low-bit quantization, we can utilize small-size on-chip memory, multipliers, and adders.
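As a rough illustration of the FP32-to-INT8 step, the following sketch applies symmetric per-tensor quantization. The quantization scheme, the function names quantize_int8 and dequantize, and the sample weights are assumptions made for illustration; the exact method used in this work is not specified at this point.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to INT8 (assumed scheme)."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to 127; guard against an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the signed 8-bit range.
    codes = [max(-128, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from INT8 codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    fp32_weights = [0.5, -1.27, 0.02, 1.0]  # hypothetical trained weights
    codes, scale = quantize_int8(fp32_weights)
    print(codes, scale)
    print(dequantize(codes, scale))
```

With this scheme each weight is stored in one byte instead of four, and the reconstruction error per weight is bounded by half of the scale value.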
The rest of this paper is organized as follows: In Section 2, the background of the overall architecture of the CNN accelerator is described. A detailed explanation of the proposed flexible PE array architecture is given in Section 3. Section 4 outlines the advantages of using the proposed architecture. Section 5 includes the verification process and the results obtained. Section 6 concludes the paper with future research plans.
2. Related Work and Motivation
Many researchers have studied lightweight object detectors and proposed hardware accelerators to accomplish real-time object detection on edge devices. In this section, the acceleration of CNN model inference for object detection is discussed in more detail, with a focus on FPGA-based implementations. Various hardware architecture approaches and optimization methods are explored to examine their impact on throughput and accuracy [17,18,19,20,21,22,23,24,25,26].
Since its initial release [27] in 2016, several versions of YOLO have been developed and accelerated to improve processing efficiency. The problem is that designing a new dedicated accelerator for each new version of YOLO is a time-consuming process.
Most of the studies analyze the acceleration of the YOLOv2 algorithm to improve development speed, power efficiency, and computing performance. Many analyses provide developers with new insights for choosing hardware and architectures to optimize the YOLOv2 algorithm [14,28,29].
The lightweight YOLO versions, including Tinier-YOLO, Tiny-YOLOv3, and Tiny-YOLOv4, have fewer parameters and require fewer computations compared to the full versions. However, they also exhibit some reduction in accuracy. They have been deployed on FPGA-based embedded computing platforms and have achieved better real-time detection results, utilizing architectures with high performance and low energy consumption [30,31,32]. The accelerators have been configured to run Tiny-YOLOv3 [30] and Tiny-YOLOv4 [32] in real time, achieving performance of over 8.3 and 30 frames per second, respectively.
Other studies [13,25,26,33,34] have investigated the implementation of entire CNN-based object detection networks on FPGA devices by building customized computation units and data flows into their accelerator designs. As a result, the impact of data communication bottlenecks is minimized and the overall performance is enhanced. However, the accelerator designs proposed in these works are tailored to specific versions of the YOLO network and lack the versatility to target more recent object detection models.
This paper introduces a novel architecture that enables the deployment of the next generation of models from the YOLO family on a variety of FPGA devices. Our proposed toolflow is designed to efficiently process YOLOv5 and the latest YOLO models, offering high performance and reconfigurability to accommodate new changes and architecture updates. To achieve efficient implementation of YOLOv5-based algorithms, we conducted research on various customized hardware accelerators and proposed new methods to optimize them.
3. Overall Hardware Architecture
Figure 2 illustrates the overall architecture of the proposed CNN accelerator. The architecture consists of the following hardware block components: PE arrays block with four FDCAs, 5 × 5 max-pooling, element-wise adder, upsampling, global in/output buffer, AXI4 data bus, and CNN controller blocks.
In this study, we introduce a new systolic array (SA) structure called the flexible diagonal cyclic array (FDCA) that also supports the stride 2 kernel mode. The SA structure is designed in the form of an array of 3 × 3 PEs, called a kernel unit (KU), to optimize the convolution operation using a 3 × 3 kernel with stride 1. In general, each PE computes a partial sum of the convolution and sends it to other PEs to generate a single convolution result using an accumulator. If the CNN model includes a layer with N × N filters, the proposed PE array can be easily configured to support the required kernel size and stride. The FDCA consists of 4 × 8 KUs, each containing 9 PEs, and it can simultaneously process four input channels and eight output channels.
3.1. Four FDCA for Convolution Acceleration
To accelerate the convolution operation, we employ four FDCAs to calculate 16 input data channels simultaneously. The relevant processing architecture is illustrated in Figure 3. The convolutional result generated by a single filter using the FDCA is stored in a convolutional memory (Conv_mem). After calculating the convolutional outputs for all input channels, the final result is produced by accumulating them and storing the result in the Conv_mem as the final output.
In addition, the four-FDCA architecture is specially designed to maximize data reuse and speed up the processing of the convolutional layer. Therefore, optimized data reuse on KUs of the architecture provides a higher utilization ratio in any 3 × 3 or 6 × 6 kernel modes compared to previous studies.
3.2. Max Pooling
The max pooling operation compares 25 input feature map values and outputs the largest of them. For this operation, we designed a 5 × 5 max pooling hardware block that is composed of 128 comparators. The max pooling block includes 16 in/out flip-flop memories and a (de)channeling controller that is used to reorder and write data back to DDR.
3.3. Element-Wise Adder
The element-wise adder is a hardware architecture designed to perform element-wise addition of data from two different feature maps with equal size. The hardware block consists of parallel adders and input/output buffers. The concept of element-wise addition operation is derived from the latest YOLO models, which merge data from two streams.
3.4. Upsampling (Resize)
Upsampling is a novel hardware sub-circuit used to increase the size of the input feature map. For an input feature map with a size of n × n, the upsampling layer increases the output feature map size to 2n × 2n by copying each input pixel to its right, bottom, and bottom-right diagonal pixel positions.
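The copy pattern described above corresponds to nearest-neighbor 2× upsampling. The following is a minimal software sketch of that behavior; the function name upsample_2x and the list-of-lists data layout are assumptions for illustration and do not describe the hardware implementation.

```python
def upsample_2x(fmap):
    """Nearest-neighbor 2x upsampling of a 2D list (n x n -> 2n x 2n)."""
    out = []
    for row in fmap:
        wide = []
        for value in row:
            wide.extend([value, value])  # original pixel and its right copy
        out.append(wide)                 # upper row of each 2 x 2 output block
        out.append(list(wide))           # bottom and bottom-right copies
    return out


if __name__ == "__main__":
    for line in upsample_2x([[1, 2], [3, 4]]):
        print(line)
```

Each input pixel becomes a 2 × 2 block of identical values in the output, so no arithmetic is needed, only data replication.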
3.5. Global Input/Output Buffers and AXI4 Data Bus
In the proposed architecture, the global input/output buffers are used for sending/receiving data to/from DDR via the 256-bit AXI4 data bus. The proposed architecture includes an additional input buffer called “Instruction Memory”, which is used for writing the instruction microcode from the RISC-V CPU core through the 32-bit AXI4-Lite bus protocol.
3.6. CNN Controller
The CNN controller hardware block controls the processing of functional hardware blocks in the proposed design, such as the four FDCAs for convolution acceleration, 5 × 5 max pooling, element-wise adder, and upsampling. The CNN controller uses microcode information to manage all data processing operations, from reading input data to the processing hardware block from a predefined address to writing output data to DDR DRAM using AXI4 transactions.
4. The Proposed CNN Accelerator
4.1. Stride in Convolutional Operation
The stride defines the number of moving steps of the filter through the input feature map to generate an output value. If the stride is bigger than one (>1), the output feature map size decreases compared to the input.
Equation (1) defines the output feature map size ($o$) for a given stride ($s$), where $i$ is the input feature map size, $k$ is the kernel size, and $p$ is the padding added to the input feature map:
$$o = \left\lfloor \frac{i + 2p - k}{s} \right\rfloor + 1 \qquad (1)$$
Most CNN accelerators only use a stride of 1 when shifting a kernel over an input. In this article, we introduce a new method for implementing stride 2 convolution on hardware.
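The relation between input size, kernel size, stride, and padding can be checked numerically. The sketch below assumes the standard form o = floor((i + 2p - k) / s) + 1, with p zeros added on each side of the input; the function name conv_output_size is hypothetical.

```python
def conv_output_size(i, k, s, p):
    """Output feature map size, assuming o = floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1


if __name__ == "__main__":
    # Hypothetical 160-pixel-wide input, matching the slice size used later.
    print(conv_output_size(160, 3, 1, 1))  # 3 x 3 kernel, stride 1
    print(conv_output_size(160, 3, 2, 1))  # 3 x 3 kernel, stride 2
    print(conv_output_size(160, 1, 1, 0))  # 1 x 1 kernel, stride 1
```

Under these assumptions, a stride of 1 with one pixel of padding preserves the input size, whereas a stride of 2 halves it.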
4.2. Convolution Using Kernel 3 × 3 with Stride 1 in Kernel Unit
The convolution operation using a 3 × 3 kernel with a stride of 1 is a basic kernel mode that utilizes the entire 3 × 3 PE array structure, known as the KU. In this work, the filter convolves the input feature map from top to bottom; the kernel is shifted vertically over the input feature map with a stride of 1. In the KU, pixel data move horizontally from left to right and are reused for one clock cycle in each PE. The weights move vertically, from top to bottom, and are reused in every clock cycle in the KU. In each clock cycle, FIFOs supply three pixels and weight data values to the KU. During processing, each PE transfers its accumulated value to its bottom-right diagonal PE in the KU. As shown in Figure 4, PE6 and PE7 do not have bottom-right diagonal PEs. Therefore, these PEs transfer accumulated values to PE1 and PE2, respectively.
PE2, PE5, and PE8 perform convolution operations for one input channel by accumulating nine data, including six data accumulated by the PEs of the previous two vertical lines. The last vertical line produces one convolution result in every clock cycle after the first result is produced. Since only one final result is produced in the KU for each clock, the conv_select signal is reused to select the results sequentially. The color of each PE and arrow indicates the path of partial sums used to compute one convolution result. The convolution results are repeatedly calculated in the order of red, blue, and green.
4.3. Convolution Using Kernel 3 × 3 with Stride 2 in Kernel Unit
The convolution operation with a 3 × 3 kernel and a stride of 2 uses the same PE array architecture, known as the KU, as that used for a stride of 1. However, in the stride 2 mode, data sharing between PEs differs; the data accumulated in each PE will be transferred horizontally to the next PE, not diagonally. To perform a convolution operation with a stride of n (n > 1), six pixels of data must be loaded into the KU at the same time. We used a FIFO output cache memory between the reconfigurable input FIFO and KU to prepare data for further processing in PEs. Using this cache memory, we can read two pixels of data from the input FIFO at the same time.
As illustrated in Figure 5, the first three pixels of data are sent to the PEs in the first column, as in the stride 1 mode. The next three pixels of data from the cache register are simultaneously sent to the second-column PEs. After the PEs complete the calculations for a clock cycle, the pixel values must be passed from the first-column PEs to the last-column PEs. In this order, we maintain the convolution operation with stride 2, efficiently reusing the data from the first-column PEs. Pixel data values from the first-column PEs of the KU are transferred to the diagonally downward PEs in the last column. This means that, in the stride 2 mode, data reuse only occurs in the first and third columns of the KU. In addition, weights are reused in every clock cycle by vertically rotating them from top to bottom.
Figure 6b illustrates an example of a convolution operation with stride 2, representing motion of the filter over the input feature data.
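To make the cost difference between the two stride modes concrete, the following reference model computes a direct convolution in software. It is a sketch for illustration only: the function name conv2d, the 7 × 7 test input, and the kernel values are assumptions, and the code models the arithmetic rather than the KU dataflow.

```python
def conv2d(x, w, stride=1):
    """Direct 2D convolution of a square input with a square kernel, no padding."""
    n, k = len(x), len(w)
    out_size = (n - k) // stride + 1
    out = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            acc = 0
            for dr in range(k):
                for dc in range(k):
                    acc += x[r * stride + dr][c * stride + dc] * w[dr][dc]
            row.append(acc)
        out.append(row)
    return out


# Hypothetical 7 x 7 input and 3 x 3 kernel.
x = [[r * 7 + c for c in range(7)] for r in range(7)]
w = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]

s1 = conv2d(x, w, stride=1)                 # 5 x 5 outputs
s2 = conv2d(x, w, stride=2)                 # 3 x 3 outputs
subsampled = [row[::2] for row in s1[::2]]  # keep every second stride-1 output

print(s2 == subsampled)
print(len(s2) ** 2, "of", len(s1) ** 2, "stride-1 outputs are needed for stride 2")
```

The stride 2 outputs are exactly every second stride 1 output in each dimension, so an accelerator that feeds data as in the stride 1 mode and discards the rest performs roughly four times the required work on large inputs. This is consistent with the low utilization reported for such designs and motivates a dedicated stride 2 data path.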
4.4. Convolution Using Kernel 1 × 1 with Stride 1 in Kernel Unit
Using the proposed 3 × 3 PEs array of KU, we can run the convolution operation using a 1 × 1 kernel size with a stride of 1. For the 1 × 1 convolution operation, we have designed an accelerator circuit that exclusively utilizes the first horizontal line of PEs in the KU unit. In this mode, the upper left PE receives only one datum in every clock cycle, while the upper right PE generates the output of the operation. After the initial output result is calculated in the proposed circuit, each clock cycle produces a result of the convolution output, similar to convolution with a 3 × 3 kernel mode.
4.5. Convolution Using Kernel 6 × 6 in Kernel Unit
The convolution operation with a 6 × 6 kernel is performed by executing the convolution operation with a 3 × 3 kernel on the KU.
Figure 7 illustrates the segmentation of a convolution operation using a 6 × 6 kernel into four 3 × 3 kernels for processing with the proposed design. In this mode, the input feature data for each convolution operation with a 3 × 3 kernel process a dedicated part of the whole input data, with overlapped contents. To calculate the final output of the convolution operation with a 6 × 6 kernel, we use a two-stage adder tree after finishing convolutions in the FDCAs.
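The segmentation of a 6 × 6 kernel into four 3 × 3 kernels can be verified with a small software model. The sketch below is an illustration under assumptions: the function names, the test data, and the exact sub-input windows are hypothetical and do not describe how the FDCAs are scheduled.

```python
def conv2d(x, w, stride=1):
    """Direct 2D convolution (no padding) of square input x with square kernel w."""
    k = len(w)
    out_size = (len(x) - k) // stride + 1
    return [
        [
            sum(x[r * stride + dr][c * stride + dc] * w[dr][dc]
                for dr in range(k) for dc in range(k))
            for c in range(out_size)
        ]
        for r in range(out_size)
    ]


def add_maps(a, b):
    """Element-wise sum of two equally sized 2D lists."""
    return [[va + vb for va, vb in zip(ra, rb)] for ra, rb in zip(a, b)]


def conv6x6_via_3x3(x, w6, stride=1):
    """6 x 6 convolution as four 3 x 3 convolutions plus a two-stage adder tree."""
    out_size = (len(x) - 6) // stride + 1
    span = (out_size - 1) * stride + 3  # input extent needed by each 3 x 3 pass
    parts = []
    for br in (0, 3):        # row offset of the 3 x 3 sub-kernel
        for bc in (0, 3):    # column offset of the 3 x 3 sub-kernel
            w3 = [row[bc:bc + 3] for row in w6[br:br + 3]]
            # Each sub-kernel reads its own, partially overlapping, input window.
            sub = [row[bc:bc + span] for row in x[br:br + span]]
            parts.append(conv2d(sub, w3, stride))
    stage1_a = add_maps(parts[0], parts[1])  # first adder stage
    stage1_b = add_maps(parts[2], parts[3])
    return add_maps(stage1_a, stage1_b)      # second adder stage


if __name__ == "__main__":
    x = [[(3 * r + 5 * c) % 11 for c in range(10)] for r in range(10)]
    w6 = [[(6 * r + c) % 7 - 3 for c in range(6)] for r in range(6)]
    print(conv6x6_via_3x3(x, w6, 1) == conv2d(x, w6, 1))
```

The four input windows overlap, which matches the description that each 3 × 3 convolution processes a dedicated part of the input with overlapped contents.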
4.6. Reconfigurable Input FIFO
Figure 8 shows the block diagram of the reconfigurable input FIFO. For each kernel mode, we utilize a portion of the reconfigurable input FIFO. Each FIFO stores one column of data for the input slice. Since each address of the FIFO contains four feature map data, the depth of the FIFO is equal to the slice size/4.
The following configurations of the reconfigurable input FIFO are used for different operation modes:
The convolution operation with a 3 × 3 kernel and a stride of 1 requires the use of two SRAM-FIFOs and a register. While the KU reads data from the SRAM-FIFO or register, the data not only go to the KU but also to another SRAM-FIFO containing the previous data from the left column in the feature map.
The convolution operation with a 3 × 3 kernel and a stride of 2 uses three SRAM-FIFOs. When the PEs array loads data from the FIFO address that stores the data for the last column, it also sends the data to another FIFO for reuse in the next first column, bypassing the middle column. This indicates that the stride 2 mode requires the KU to reuse only the data from the last column as the first column next time.
The convolution operation with a 1 × 1 kernel and a stride of 1 only utilizes one SRAM-FIFO. In this mode, the output data only come from PE2.
4.7. FIFO Output Cache Memory
We used FIFO output cache memories to share feature map data between the input FIFO and KU. Each address in the input FIFO contains four feature map data at a moment. Therefore, the FIFO output cache memory block receives four data from FIFO at once, queues them in order, and transfers them to the KU, respectively. Each cache memory consists of eight 16-bit registers and enables the loading of one or two feature map data to the KU. We divided eight 16-bit registers into two parts, named Area0 and Area1, each consisting of four 16-bit registers.
In our design, we used a cache memory with depth = 8, which is twice the size of the data read from the FIFO. By using cache memory with eight registers, we will be able to read two data from the cache without waiting for the next four data from FIFO. In the case of using cache memory with depth = 4, it will not be possible to read and send two data to the KU at the same time.
For example, with a cache memory of depth = 4, after three data have been read from the cache memory, it is not possible to read another two data from it, because only one datum remains. Reading that single remaining datum would force the circuit to wait for the next four data from the FIFO. This memory limitation reduces circuit efficiency and produces incorrect output.
Figure 9 shows that the “finish” signal becomes active when reading data from a specific location in Area0 and Area1. For example, if the address is 2, then finish[0] goes high when reading two data in Area0. The “finish” signal acts as the read enable signal for the input FIFO and generates the write enable signal for the FIFO output cache memory via a register. When data are initially stored in the FIFO output cache memory, the finish signal cannot be activated until the data are read from the FIFO output cache memory. Therefore, the cache memory must read data from the FIFO using the init_rd_en signal generated by the controller.
4.8. Slice and Iteration
Modern CNN models are becoming increasingly complex by using large image sizes for input data and increasing the depth and scale of neural networks to achieve high prediction accuracy. Due to these changes in CNN models, there is a need to develop new hardware accelerator models with high processing capabilities and reconfigurability. This work introduces the concept of slicing, which involves uniformly cutting a whole input feature map into specific-sized parts. The concept of slicing is not only used for defining the height and width of each slice but also the depth of the input channel, which is divided into slices based on the number of input channels.
After applying the slicing of the input data, each slice must be processed separately with filters. It is determined by the number of iterations required to process the entire input data. In our design, we introduced and used new-iteration ideas, such as input, slice, and output iteration, to process the input data efficiently and quickly.
By applying the concept of slicing, we reduce the amount of input data needed for processing in the CNN accelerator at once. As a result, the size of the on-chip memory and computational circuits is effectively reduced. To achieve the aforementioned improvements, this study not only introduced the concept of slicing but also implemented the idea for a limited number of kernels.
Figure 10 illustrates how the iteration is defined in our architecture based on the feature map size and slice. When using the concept of slicing in convolution, there is an overlap between two slices.
Figure 10a shows that the overlap between slices occurs when using 3 × 3 kernels.
Figure 10b,c represent an example of input/output iteration. In this study, because 16 input channels and 8 filters can be used for calculation simultaneously, the number of channels used for one input iteration becomes 16, contrary to what is shown in the figure. Similarly, eight (8) filters are used for convolution processing in each output iteration.
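The iteration counts implied by this parallelism can be estimated with simple arithmetic. The sketch below assumes that channel counts are rounded up to the next multiple of the parallel width and that slices are counted along one spatial dimension without accounting for the overlap between slices; the function name and the example layer dimensions are hypothetical.

```python
import math

IN_CH_PER_ITER = 16   # input channels processed per input iteration (from the text)
OUT_CH_PER_ITER = 8   # filters processed per output iteration (from the text)


def iteration_counts(in_channels, out_channels, fmap_size, slice_size):
    """Estimate input, output, and slice iterations for one layer (assumed model)."""
    return {
        "input_iterations": math.ceil(in_channels / IN_CH_PER_ITER),
        "output_iterations": math.ceil(out_channels / OUT_CH_PER_ITER),
        # Assumption: slices tile one dimension; overlap rows are ignored.
        "slice_iterations": math.ceil(fmap_size / slice_size),
    }


if __name__ == "__main__":
    # Hypothetical layer: 64 input channels, 128 filters, 320-pixel feature map.
    print(iteration_counts(64, 128, 320, 160))
```

A layer with fewer than 16 input channels, such as the first layer operating on a 3-channel image, still occupies one full input iteration under this model.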
4.9. Input Padding
To add padding to the input feature map, most CNN accelerators use a software-based approach before loading the data to the DDR DRAM. In this work, we designed the circuit to add zero padding around the input feature map, a concept known as input padding in the circuit. Using input padding is more effective than using output padding.
Figure 11 presents the ratio of input padding storage pixels to output padding based on the size of the input feature map. The figure shows that as the feature map size decreases, the proportion of storage pixels also decreases by up to 82.6%.
The input padding circuit includes a register and a control signal. The first component is the 2-bit padding state register, known as the “FIFO write selection”, which varies depending on the current/total slice iteration and kernel mode (Figure 12). Table 1 and Figure 13, along with the description below, explain the zero-padding method for each case.
ZEROZERO: When all data entering the reconfigurable FIFO is zero, only zeros are needed for padding.
READZERO: All data from the global input buffer are loaded into the FIFO when the slice iteration does not require any padding.
ZEROREAD: All data in the global input buffer are loaded, and a single zero is inserted in front of the data as padding. It is used to load a part of the zero padding at the top slice of the input feature map.
SHIFTREAD: When importing new data, the last pixel of the previously imported data is concatenated with the newly read data. The function exists to use the most recently imported data in the ZEROREAD scenario.
The 1-bit wire, InB_flag, generates a read enable signal for the global input buffer by entering a two-input AND gate with a read enable signal asserted by the controller in the KU. If the “FIFO write selection” is ZEROZERO, then the wire has a value of 0, and the KU receives only a “zero” value for the padding. Therefore, KU receives “zero” data without accessing the global input buffer.
The kernel mode also affects padding. The 3 × 3 convolution operation with a stride of 1 requires adding zeros around all sides of the input feature map. For a stride of 2, additional zeros are only required for the upper and left sides of the input. Padding is not applied in a 1 × 1 convolution operation.
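The per-mode padding rules can be summarized in a small lookup. The sketch below assumes a padding width of one pixel for the 3 × 3 modes, which is the usual choice for that kernel size but is not stated explicitly here; the function names are hypothetical.

```python
def padding_for_mode(kernel, stride):
    """Return (top, bottom, left, right) zero padding for a kernel mode."""
    if kernel == 3 and stride == 1:
        return (1, 1, 1, 1)   # zeros on all four sides
    if kernel == 3 and stride == 2:
        return (1, 0, 1, 0)   # zeros on the upper and left sides only
    if kernel == 1 and stride == 1:
        return (0, 0, 0, 0)   # no padding for 1 x 1 convolution
    raise ValueError("unsupported kernel mode")


def padded_output_size(i, kernel, stride):
    """Output size along one dimension after applying the mode's padding."""
    top, bottom, _, _ = padding_for_mode(kernel, stride)
    return (i + top + bottom - kernel) // stride + 1


if __name__ == "__main__":
    for mode in [(3, 1), (3, 2), (1, 1)]:
        print(mode, padding_for_mode(*mode), padded_output_size(160, *mode))
```

For an even input size, padding only the upper and left sides in the stride 2 mode is sufficient to produce an output of exactly half the input size, so the lower and right zeros would never be read.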
4.10. Bias–Activation–Scaling Pipeline Architecture
In this study, we targeted the YOLOv5n model and quantized the model to an integer representation. Thus, we designed an extra circuit to calculate bias, activation, and scaling parameters for converting the final convolution result into the feature map for the next layer.
Figure 14 illustrates the bias–activation–scaling (BAS) pipeline architecture used in the proposed SoC hardware. All parameters, bias, activation, and scaling parameters are represented as 16-bit integers, allowing for a fast and low-cost area architecture. The pipeline process consists of six stages and requires seven cycles to process one piece of data.
The bias is seen as a part of batch normalization (BN). BN is usually used to train CNN models. In YOLOv5 training, the BN for CNN layers is calculated using Equation (2):
$$y = \gamma \, \frac{x_{conv} - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta \qquad (2)$$
Here, $x_{conv}$ is the output of the convolutional filter, $\mu$ represents the mean, $\sigma^{2}$ is the variance, epsilon ($\epsilon$) is added for numerical stability, $\gamma$ is the batch normalization scaling factor, and $\beta$ is the shift factor (bias). These parameters are determined during the training process, and they remain constant within each layer during the inference [14,28,29,31].
The convolution operation with bias can be represented by the following equation:
$$y = \sum_{i} w_{i} x_{i} + b \qquad (3)$$
where $w_{i}$ represents the weight, $x_{i}$ is the input feature map, and $b$ is the constant number called “bias”. We simplify the addition in Equation (2) as Equation (3). This simplification reduces hardware costs without sacrificing accuracy. In addition, we used techniques like rounding and truncation in the BAS circuit.
The 36-bit dividers are used in the Leaky ReLU activation circuit, and they require two clock cycles to prevent setup-time violations at a high frequency of 400 MHz. The scaling process consists of four stages and operates for four cycles. The process involves multiplication, division, rounding, addition, truncation, and subtraction, in that order. The scaling applies the parameters generated by quantization.
5. Advantage of the Proposed Architecture
The YOLOv5n model requires numerous computations for each layer, with almost 99% of them involving the convolution operation. Therefore, we developed a reconfigurable and optimized hardware accelerator for convolution operations. The proposed computing method in the CNN accelerator supports stride 1 and stride 2 convolution operations with various kernel sizes. It enhances computational efficiency by using slicing and iterations, thereby accelerating image processing in hardware. Our design efficiently utilizes hardware resources to perform fast convolution operations, offering numerous structural advantages. In this section, we will discuss the improvements in the proposed architecture.
5.1. High PE Array Utilization with Flexibility
Most CNN accelerator architectures demonstrate good resource utilization, reaching up to 90% in commonly used kernel modes [3,28]. However, as CNN models become more complex by increasing the depth and scale of deep neural networks (DNNs) and using different kernel sizes and strides for image processing, these accelerators cannot perform all necessary convolution operations. Traditional CNN accelerators typically only operate with a kernel size of 3 × 3 and a stride of 1, or they may support a stride of 2 with less than 25% utilization. Therefore, we designed a new CNN accelerator architecture with the FDCA to efficiently perform convolution operations using newly introduced kernel modes. Our proposed architecture demonstrates that PE resource utilization exceeds 95%, even for convolutions with kernel sizes of 3 × 3 or 6 × 6 at a stride of 2.
Table 2 presents the PE utilization ratio for various kernel modes with a slice size of 160.
5.2. Convolution Operation 3 × 3 Stride 2 Speed Optimization
Table 3 presents a comparison of the clock cycles required for 3 × 3 convolution with a stride of 2 in the previous and proposed architectures for processing the YOLOv5n model. The proposed architecture consumes about 9.4 times fewer clock cycles because it supplies more data to the PEs in the same time frame.
5.3. Area Efficiency
In general, the optimized architecture for the specific kernel (size) mode demonstrates high PE utilization. However, designing individual sub-circuits for each kernel mode requires a significant amount of hardware resources. Therefore, in our proposed CNN accelerator, we have designed it so that more than 90% of the KU area is shared among all kernel modes. The multiplexer (Mux) is used to configure the connection between the PEs of the KU for the required kernel mode.
Table 4 presents the total number of logic gates for the proposed CNN architecture and previous architectures in different kernel modes. It shows that the proposed architecture’s area is 2.14 times smaller than the total area of the previous architecture. The reduction is 2.14 times rather than 3 times because the convolution operation with a 1 × 1 kernel mode utilizes only one PE, while the total area of the KU is still 9 times larger.
Although the proposed architecture supports two additional kernel modes, the total chip area increases by only 6% compared to using the predicted 3 × 3 stride 1 mode separately. If we expand the proposed CNN accelerator architecture to support 6 × 6 convolution operations with stride 1 and stride 2 modes in the future, the expected chip area will increase by 6% compared to the current area.
5.4. Data Load Optimization
The proposed design significantly improves data loading speed, performing nine times faster than a GPU. In our design, we efficiently utilized the following components to achieve faster data loading on the circuit:
We used reconfigurable input FIFOs to organize, transfer, and reuse data on the KU. The FIFOs manage all data feeding and reuse on the vertical and horizontal lines of the KU during convolution operations. Our design allows for the use of up to three FIFOs, depending on the configuration of the KU (PE array). Because of these three reconfigurable input FIFOs, the KU can reuse feature map data, reducing the number of data loads by up to one-third.
The KU reuses data by sharing it among connected PEs. Typically, the GPU reads each pixel of input data from DDR memory three times. In our design, the KU reads each pixel once and reuses it three times by passing it to other PEs. This mechanism reduces the number of memory accesses by a factor of three.
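The effect of this reuse on the number of loads can be sketched in software. The example below is a simplified one-dimensional model, not the actual KU data path: a 3-wide window slides over one row of pixels, a plain sum stands in for the multiply-accumulate step, and a three-entry FIFO plays the role of the PEs that pass pixels to their neighbours. All function names and the row length of 160 are assumptions made for illustration.

```python
from collections import deque


def loads_without_reuse(row, k=3):
    """Every window position fetches all k pixels from memory again."""
    loads = 0
    sums = []
    for i in range(len(row) - k + 1):
        window = [row[i + j] for j in range(k)]  # k fresh reads per output
        loads += k
        sums.append(sum(window))
    return sums, loads


def loads_with_reuse(row, k=3):
    """Each pixel is fetched once and kept in a k-entry FIFO for reuse."""
    fifo = deque(maxlen=k)
    loads = 0
    sums = []
    for pixel in row:  # exactly one read per pixel
        fifo.append(pixel)
        loads += 1
        if len(fifo) == k:
            sums.append(sum(fifo))
    return sums, loads


row = list(range(160))
naive_sums, naive_loads = loads_without_reuse(row)
fifo_sums, fifo_loads = loads_with_reuse(row)
print(naive_loads, fifo_loads, naive_sums == fifo_sums)
```

Both variants produce the same outputs, while the FIFO variant issues close to one third of the loads, which matches the threefold reduction described above for a 3-wide window.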
5.5. Small On-Chip Memory Size
Each of the four FDCAs consists of 32 KUs and is designed to compute four input and eight output channels simultaneously. By employing four FDCAs in parallel, the overall design processes data from 16 input feature map channels simultaneously.
Modern CNN models, such as the YOLOv5n used in this study, require the processing of more than sixteen input and output data channels during convolutional layer computation. Simultaneously processing all corresponding channels in parallel requires multiple connected PE arrays and on-chip memories, which increases the hardware costs of the CNN accelerator by occupying a large amount of hardware resources. Therefore, we applied the slicing and iteration concepts to efficiently process the input data. As a result, we were able to optimize power consumption and chip area utilization.
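The number of passes that such an iteration scheme needs for one layer can be estimated with a short calculation. The sketch below assumes that each pass covers 16 input and 8 output channels, as described above, and that the remaining channels are handled by plain iteration; the nine PEs per KU are inferred from the 1152-PE total reported in the Conclusions. The function name and the example layer sizes are hypothetical and are not taken from YOLOv5n.

```python
import math

FDCAS = 4
KUS_PER_FDCA = 32
PES_PER_KU = 9           # inferred: 1152 PEs / (4 FDCAs x 32 KUs)
IN_CH_PER_PASS = 16      # 4 FDCAs x 4 input channels each
OUT_CH_PER_PASS = 8


def channel_passes(in_channels, out_channels):
    """Passes needed to cover every input/output channel pair of a layer."""
    in_groups = math.ceil(in_channels / IN_CH_PER_PASS)
    out_groups = math.ceil(out_channels / OUT_CH_PER_PASS)
    return in_groups * out_groups


total_pes = FDCAS * KUS_PER_FDCA * PES_PER_KU
print(total_pes, channel_passes(64, 128))
```

A layer with fewer than 16 input channels still occupies one full input group, which is why the ceiling is taken per dimension rather than over the product.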
By using the concept of slicing, we convolve a part of the input feature map with given filters to generate the sliced output result. In this scenario, only the essential slice data will be copied from DDR memory for processing in the FDCA block. Therefore, the proposed architecture stores partial input feature map data corresponding to a single slice, and it uses a smaller on-chip memory size in the design compared to storing the entire feature map. Utilizing small on-chip memories helps to minimize the number of accesses to the DDR memory, reduce data loading time, and maximize data reuse in slice rotating operations, which involve reusing the same data for different filter weights.
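The slicing idea can be sketched in a few lines of software. The example below is a minimal model under our own assumptions: the feature map is cut into horizontal slices, each slice carries the extra rows it needs to produce its last output row, and stride 1 without padding is used. The function names, the slice height, and the test data are illustrative and do not describe the FDCA data path.

```python
def conv2d_valid(fmap, kernel):
    """Plain 2D convolution (cross-correlation), stride 1, no padding."""
    k = len(kernel)
    rows, cols = len(fmap), len(fmap[0])
    return [
        [
            sum(fmap[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(cols - k + 1)
        ]
        for r in range(rows - k + 1)
    ]


def conv2d_sliced(fmap, kernel, slice_rows):
    """Convolve one row slice at a time; only that slice is held 'on chip'."""
    k = len(kernel)
    out_rows = len(fmap) - k + 1
    out = []
    for start in range(0, out_rows, slice_rows):
        stop = min(start + slice_rows, out_rows)
        # The slice needs k - 1 extra input rows to finish its last output row.
        on_chip = fmap[start:stop + k - 1]
        out.extend(conv2d_valid(on_chip, kernel))
    return out


fmap = [[(3 * r + 5 * c) % 7 for c in range(12)] for r in range(10)]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
print(conv2d_sliced(fmap, kernel, slice_rows=4) == conv2d_valid(fmap, kernel))
```

The sliced result equals the full convolution, while the buffer only ever holds `slice_rows + k - 1` input rows, which is the reason a smaller on-chip memory suffices.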
6. Hardware Implementation Results
In order to evaluate the hardware cost of the proposed CNN accelerator, we implemented it on the Xilinx Zynq UltraScale+ MPSoC ZCU102 FPGA platform. For hardware synthesis, Vivado 2022.2 is used. Our implementation occupies 249,357 LUTs, 2304 DSPs, and 567 KB of BRAMs in FPGA resource utilization. The CNN accelerator operates at 400 MHz, and the reference image inference speed is 47.17 frames per second (FPS).
Table 5 shows the implementation results on FPGA.
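For orientation, the reported clock frequency and frame rate can be converted into a per-frame cycle budget. The short calculation below uses only the 400 MHz and 47.17 FPS figures quoted above and assumes that the accelerator clock runs continuously during inference; the variable names are ours.

```python
CLOCK_HZ = 400e6   # accelerator clock frequency reported above
FPS = 47.17        # reported reference image inference speed

cycles_per_frame = CLOCK_HZ / FPS
latency_ms = 1000.0 / FPS

print(f"{cycles_per_frame / 1e6:.2f} Mcycles per frame, "
      f"{latency_ms:.2f} ms per frame")
```

The budget covers everything that happens within one frame period, including data transfer and control overhead, so it should not be read as a count of convolution cycles alone.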
To test the performance of the proposed CNN accelerator, we accelerated a quantized YOLOv5n model for inference. For this purpose, we have developed a microcode-based CNN controller circuit that allows for the programmability of any CNN model. Our modified YOLOv5n is an object detection model pre-trained on the COCO dataset. All model parameters, including weights, bias values, and input feature map data, were quantized to 8-bit integers. The model’s object detection performance was evaluated using mean average precision (mAP). We set our threshold at 0.5 (mAP@0.5) and achieved a detection mAP of 43.1% on the FPGA. The results demonstrate that the proposed architecture implementation significantly improves inference throughput while maintaining high accuracy, similar to the software model.
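The text does not state which quantization scheme was used. As a minimal sketch of what 8-bit integer quantization involves, the example below assumes symmetric per-tensor quantization with a single scale factor; this choice, the function names, and the sample weights are our assumptions and may differ from the scheme applied to the YOLOv5n model.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to signed 8-bit integers (assumed scheme)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    quantized = [max(-128, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate real values."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.0, 0.64]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, [round(r, 4) for r in restored])
```

Each restored value differs from the original by at most half a quantization step, which is the rounding error that 8-bit inference has to tolerate.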
Furthermore, the proposed CNN accelerator was implemented as a system on chip (SoC) using a Samsung 14 nm CMOS process. The die consists of a shared LPDDR, RISC-V core, and CNN accelerator with FDCA, which are utilized in collaboration with partner companies. The area allocated for the CNN accelerator architecture is 10.96 mm² (3943 µm × 2780 µm). The chip operates at a frequency of 400 MHz, with a timing constraint set at 2.5 ns. The total power consumption of the chip is 18.52 mW. The implementation uses on-chip SRAM with a size of 275.75 KB.
Figure 15 shows the overall chip layout of the proposed CNN accelerator SoC.
7. Conclusions
In this paper, we proposed a high-speed CNN accelerator architecture based on a flexible diagonal cyclic array (FDCA). The proposed four-FDCA architecture comprises 1152 PEs that can process the data for sixteen input channels and eight output channels simultaneously. The proposed architecture enables the execution of convolution operations with different kernel modes and strides to accelerate the latest CNN models. In the proposed design, we introduced new optimization techniques that improved chip area efficiency by 6% and reduced total chip area utilization by 2.14 times compared to individual block designs for each kernel mode. We also minimized the number of DRAM accesses by using data reuse methods.
The CNN accelerator was synthesized and verified on the Xilinx ZCU102 FPGA and implemented in SoC silicon using 14 nm CMOS process technology. The results demonstrate that the proposed CNN accelerator can perform convolution operations 3.8 times faster, using the proposed new PE array structure, compared to previous CNN accelerators.