Compact Convolutional Neural Network Accelerator for IoT Endpoint SoC

As a classical artificial intelligence algorithm, the convolutional neural network (CNN) algorithm plays an important role in image recognition and classification and is gradually being applied in the Internet of Things (IoT) system. A compact CNN accelerator for the IoT endpoint System-on-Chip (SoC) is proposed in this paper to meet the needs of CNN computations. Based on analysis of the CNN structure, basic functional modules of CNN such as the convolution circuit and pooling circuit are designed with a low data bandwidth and a small area, and an accelerator is constructed in the form of four acceleration chains. After the acceleration unit design is completed, the Cortex-M3 is used to construct a verification SoC, and the designed verification platform is implemented on the FPGA to evaluate the resource consumption and performance of the CNN accelerator. The CNN accelerator achieves a throughput of 6.54 GOPS (giga operations per second) while consuming 4901 LUTs and no hardware multipliers. The comparison shows that the compact accelerator proposed in this paper gives the SoC based on the Cortex-M3 kernel a CNN computational power two times higher than that of a quad-core Cortex-A7 SoC and 67% of that of an eight-core Cortex-A53 SoC.


Introduction
With the development of Internet of Things (IoT) technology and artificial intelligence (AI) algorithms, AI computing has moved from the cloud down to the edge [1]. Intelligent Internet of Things (AI+IoT, AIoT), which integrates the advantages of AI and IoT technology, has become a research hotspot in the related fields of the Internet of Things [2]. The IoT endpoint System-on-Chip (SoC) refers to a large number of MCUs near the sensors in the IoT system, some typical examples of which are STM32, ESP32 and MSP430. As the information collector and command executor in the IoT system, the node SoC contains a large amount of information that can be utilized by artificial intelligence. IoT developers are seeking to implement AI algorithms such as face recognition and speech recognition on resource-constrained IoT endpoint devices [3]. Based on these application demands, some well-known IoT chip manufacturers around the world also provide some AI libraries and solutions for their IoT chips, such as ESP-WHO of ESPRESSIF and STM32Cube.AI of STMicroelectronics. Nevertheless, optimizing and tailoring algorithms only from the software level to adapt the IoT endpoint SoC with limited computing capability is insufficient. It is of great practical significance to develop a smart IoT endpoint chip that is suitable for AIoT from the hardware level.

Function Module Design
At present, mainstream CNN algorithms are composed of four kinds of layers: convolution layers, activation layers, pooling layers and full connection (FC) layers. Most state-of-the-art neural networks contain a large number of convolution layers; for example, VGG-16 contains 13 convolution layers and Alexnet contains five, so accelerating the convolution operation is the focus of this CNN accelerator. Activation layers and pooling layers are relatively simple in the algorithm, but since they closely follow the convolution layers and are therefore numerous, they also need to be accelerated by hardware. Although the full connection layer involves the most CNN parameters, its hardware implementation is similar to convolution and can be understood as a special form of convolution operation. Moreover, [12] also notes that the FC layer is not very meaningful in practical applications. Therefore, we focused on the design of the convolution layer.

Convolution Unit
In the CNN convolution layer, high-dimensional convolution is calculated by accumulating multiple two-dimensional (2-D) convolutions. The basic expression of the CNN convolution layer can be expressed as Equation (1):

ofmap_z = Σ_k ( ifmap_k * W_{z,k} )    (1)

where ifmap and ofmap represent the input and output feature maps, z denotes the output feature map number, W denotes the weight of the convolution kernel, k denotes the input feature map number, and '*' denotes the 2-D convolution calculation. According to [13], 2-D convolution is the most basic and important operation in CNN, and it consumes more than 90% of the total computational time. Therefore, 2-D convolution is always the focus of optimization in many CNN accelerators. Furthermore, the convolution operation involves a large amount of data loading and storing, but much of this data is in fact repeated [14]. The 2-D convolution of a matrix X of size N*N with a convolution kernel W of size K*K can be expressed by Equation (2):

Y(i, j) = Σ_{m=0}^{K−1} Σ_{n=0}^{K−1} X(i + m, j + n) × W(m, n),  0 ≤ i, j ≤ N − K    (2)
According to Equation (2), the total data to be loaded to complete the convolution is K × K × (N -K + 1) × (N -K + 1). However, the actual amount of data needed for convolution is only N × N. If the repeatability of convolution data can be effectively utilized, not only the convolution can be accelerated but also memory access bandwidth can be reduced. In particular, for resource-constrained IoT endpoint SoC, the processor kernel and the CNN accelerator often share the on-chip memory. The reduced bandwidth usage of the CNN accelerator enables the processor kernel to release its computing power and further improve the overall performance of SoC.
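To make the data-reuse argument concrete, the following minimal C sketch implements the 2-D convolution of Equation (2) in its naive form and counts the source-matrix loads; the 16-bit data type and the N = 6, K = 3 sizes are just the example values from Figure 2, not the accelerator's implementation.

```c
#include <stdint.h>
#include <stdio.h>

#define N 6              /* source matrix size (example from Figure 2) */
#define K 3              /* convolution kernel size */

/* Naive 2-D convolution of an N x N matrix X with a K x K kernel W
 * (Equation (2)). Every output element re-reads a K x K window of X,
 * so the straightforward loop issues K*K*(N-K+1)*(N-K+1) loads even
 * though only N*N distinct values exist. */
static long conv2d_naive(const int16_t X[N][N], const int16_t W[K][K],
                         int32_t Y[N - K + 1][N - K + 1])
{
    long loads = 0;
    for (int i = 0; i <= N - K; i++) {
        for (int j = 0; j <= N - K; j++) {
            int32_t acc = 0;
            for (int m = 0; m < K; m++) {
                for (int n = 0; n < K; n++) {
                    acc += (int32_t)X[i + m][j + n] * W[m][n];
                    loads++;            /* one source-matrix read per MAC */
                }
            }
            Y[i][j] = acc;
        }
    }
    return loads;
}

int main(void)
{
    int16_t X[N][N] = {0}, W[K][K] = {0};
    int32_t Y[N - K + 1][N - K + 1];
    printf("loads = %ld, distinct data = %d\n", conv2d_naive(X, W, Y), N * N);
    return 0;   /* prints loads = 144, distinct data = 36 for N = 6, K = 3 */
}
```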
In this paper, a 2-D convolution computing unit is designed to decrease the memory bandwidth by utilizing the data repeatability. Its structure is shown in Figure 1.
The data loading unit in the 2-D convolution circuit is the key to reduce data bandwidth. The data loading unit is composed of an address generator and a buffer RAM. Taking a 6*6 matrix (N = 6) and a 3*3 convolution kernel (K = 3) as examples, the working process of the designed data loading unit is illustrated in Figure 2.
We define an operation between the submatrix extracted from the source matrix and the convolution kernel as a convolution unit. The operational data of each convolution unit is read from the source matrix in column-first order and the operations of each convolution unit are performed in left-right reciprocating order. Each scan of the convolution kernel from left to right or from right to left is called a round. Figure 2f illustrates this reciprocating data calculation order. This data loading method maximizes the data repeatability between adjacent convolution units, but the difficulty is in how to accurately retain the useful data in the previous convolution unit. This paper adopts a buffer RAM and designs a data loading strategy to solve this challenge. The workflow diagram of the data buffer RAM is shown in Figure 3.
The blue-filled boxes in Figure 3 are data that need to be read from the source matrix, while the red italic entries mark the starting positions of the data about to be sent to the MAC (multiply-accumulate) unit for convolution calculation. It can be seen from Figure 3 that only three missing data (a number equal to the kernel width K) need to be loaded from the source matrix at a time when using the proposed reading strategy, which greatly reduces the bandwidth pressure on the data RAM. The calculation methods of several key parameters in this structure are as follows.
The initial reading address of the ith convolution unit is expressed as x(i), and it is calculated by Equation (3).
The starting address of the replacement data is expressed as t(i), which can be calculated from x(i) by Equation (4). The starting address s(i) at which data must be read from the source matrix for the next convolution unit is calculated by Equation (5). Each time, K new data are loaded from the source matrix (excluding the first convolution unit); after the starting reading address is obtained, the remaining K − 1 addresses can be calculated in turn according to Equation (6).
Equations (3) to (5) are defined piecewise, distinguishing the first unit of an odd or even round from the other units.
To sum up, we read the data from the source matrix at the address computed by s(i), and fill them into the buffer RAM with t(i) as the starting position, then read K × K data from the buffer RAM starting with x(i) to perform convolution. The flow chart is shown in Figure 4.
The procedure in Figure 4 is as follows: step 1, read K data from the source matrix starting at address s(i); step 2, fill these data into the buffer RAM starting at address t(i); step 3, after step 2 the buffer RAM holds all the data needed for this convolution unit, so K × K data are read starting from x(i) and sent to the convolution calculator in turn. Using this method, the number of data to be loaded from the source matrix, and hence the bandwidth optimization rate η of this module, can be calculated. Taking the first layer of the Alexnet network as an example, N is 224 and K is 11, so η is equal to 90.907%. Analogously, the optimization rate of the first layer in Lenet-5 is 66.591%.
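The following sketch models the load count and the optimization rate η under one simplifying assumption: after the first convolution unit fills the buffer RAM with K × K values, every later unit (including the first unit of a new round, thanks to the reciprocating order) fetches only K replacement values. This is an illustrative model rather than the paper's exact equations; under it, the Alexnet first-layer example (N = 224, K = 11) reproduces the quoted 90.907%.

```c
#include <stdio.h>

/* Rough model of the loading scheme: the first convolution unit fills the
 * buffer RAM with K*K values; every later unit (the reciprocating scan makes
 * round changes no more expensive) fetches only K replacement values.
 * This is an assumption for illustration, not the paper's exact formula. */
static double optimization_rate(long n, long k)
{
    long units  = (n - k + 1) * (n - k + 1);
    long naive  = k * k * units;               /* loads without reuse       */
    long reused = k * k + (units - 1) * k;     /* loads with the buffer RAM */
    return 100.0 * (1.0 - (double)reused / (double)naive);
}

int main(void)
{
    /* First Alexnet convolution layer: N = 224, K = 11 -> about 90.9% */
    printf("Alexnet conv1: eta = %.3f%%\n", optimization_rate(224, 11));
    return 0;
}
```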

Multifunctional Accumulation Unit
Since the multi-dimensional convolution in CNN is calculated by accumulating multiple 2-D convolutions, a vector adder is needed after the 2-D convolution unit. Meanwhile, in some networks, there exists a bias layer behind each convolution layer that provides a bias for the result of convolution. Since the bias unit works only after a convolution layer is fully completed, most of its time remains idle. Therefore, the bias unit and convolution accumulation unit share a data adder in our design, which we call an adder module in the following section.
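A behavioural C sketch of this shared adder is given below; the structure and function names are illustrative only, and the 32-bit psum width anticipates the fixed-point scheme described in the acceleration chain section.

```c
#include <stdint.h>

/* Behavioural sketch of the shared adder ("adder module"). During a layer it
 * accumulates the 2-D convolution results of successive input feature maps
 * into a 32-bit psum; once the layer's last map has been accumulated, the
 * same adder is reused to add the bias. Names and widths are illustrative. */
typedef struct {
    int32_t psum;                       /* partial sum kept between input maps */
} adder_module_t;

static void adder_accumulate(adder_module_t *a, int32_t conv2d_result)
{
    a->psum += conv2d_result;           /* inter-map accumulation phase */
}

static int32_t adder_add_bias(const adder_module_t *a, int16_t bias)
{
    return a->psum + bias;              /* bias phase, reusing the same adder */
}
```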

Serial Max-Pooling Unit
The pooling circuit in CNN is used to down-sample the convolution result, which reduces the input data size of the subsequent network and accelerates the calculation of the neural network. The commonly used pooling methods are average pooling and max-pooling; average pooling involves accumulation and division calculations, so it is not suitable for hardware implementation. Therefore, we select max-pooling as our pooling method.
According to the principle of max-pooling, combined with the characteristics of serial data input, the designed serial pooling circuit consists of a selector, a maximum comparator, a pool controller, a previous-result buffer and a line maximum buffer. The circuit structure is shown in Figure 5a. The pooling circuit works in row order and stores the row pooling results in the max data buffer. When the next row is input, the pooling result of the corresponding pooling area of the previous row is read from the buffer and compared with the input data. Figure 5b illustrates how this pooling circuit works with an example.
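As a software analogue of this circuit, the sketch below performs streaming max-pooling row by row with a buffer of per-row maxima; the 2 × 2 window, stride 2 and row width are assumptions made for illustration.

```c
#include <stdint.h>

#define W      8          /* input row width (assumed even)      */
#define POOL   2          /* pooling window and stride, assumed  */

static int16_t line_max[W / POOL];   /* buffered row maxima ("previous result buffer") */

static int16_t max16(int16_t a, int16_t b) { return a > b ? a : b; }

/* Consume one input row. For the first row of each pooling pair the
 * horizontal maxima are only stored; for the second row they are combined
 * with the buffered values and written to the output row. Returns 1 when
 * an output row was produced. */
static int pool_row(const int16_t *row, int row_idx, int16_t *out_row)
{
    for (int j = 0; j < W; j += POOL) {
        int16_t h = max16(row[j], row[j + 1]);      /* horizontal maximum */
        if (row_idx % POOL == 0)
            line_max[j / POOL] = h;                 /* buffer first row   */
        else
            out_row[j / POOL] = max16(h, line_max[j / POOL]);
    }
    return (row_idx % POOL) == POOL - 1;
}
```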

Acceleration Chain Design
Due to the fact that the calculation sequence of CNN is relatively fixed, this paper constructs the CNN accelerator in the form of an acceleration chain, which reduces the data movement between the operations to obtain a higher performance. The acceleration module designed in Section 2 is connected by an on-chip stream bus to form an acceleration chain, and its structure is shown in Figure 6. To enable the designed acceleration chain to finish a CNN network completely, a bypass control circuit is added to each module. When a module in the chain does not need to work in a certain layer, the specified module can be bypassed without affecting the calculation results and performance.
The accelerator designed in this paper adopts a fixed-point architecture. The data width of each calculation module in the acceleration chain is an important factor that needs to be determined first. To maintain the accuracy of the operation, the data bit width is expanded inside each convolution layer, while the results are truncated before the non-linear operation so that the data widths between layers stay consistent. The variation of data width along a chain is shown in Figure 7. The M in Figure 6 is determined by the convolution kernel size supported by the accelerator circuit, M = ⌈log2(K × K)⌉, while J is calculated as J = ⌈log2(Ln)⌉. The maximum convolution kernel size supported by the accelerator in this paper is 11 × 11, and a convolution kernel can be connected to at most 16 input feature maps, so K = 11, Ln = 16, M = 7 and J = 4. Since the data bus width of an IoT system is usually 32 bits, in order to ensure that the convolution psum, whose width is 2Q + M + J, can be stored in memory, the operand width Q = ⌊(32 − M − J)/2⌋ is restricted to 10 bits. Therefore, the accelerator we designed uses a 10-bit fixed-point architecture to complete the CNN calculation. To conclude, within each convolution layer the data width is expanded and the 32-bit temporary value psum is stored in the buffer storage; at the end of each convolution layer, the 32-bit data are reduced to 10 bits to keep the data bit width consistent between layers and to prevent the data overflow that would otherwise grow with the number of layers.
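The bit-width arithmetic of this paragraph can be reproduced with a few lines of C; the ceiling and floor placements are inferred from the stated results (M = 7, J = 4, Q = 10).

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    int K   = 11;                    /* largest supported kernel          */
    int Ln  = 16;                    /* feature maps per kernel, at most  */
    int bus = 32;                    /* SoC data bus width                */

    int M = (int)ceil(log2((double)(K * K)));   /* growth from K*K MACs: 7  */
    int J = (int)ceil(log2((double)Ln));        /* growth from Ln maps:  4  */
    int Q = (bus - M - J) / 2;                  /* operand width: 10 bits   */

    printf("M=%d J=%d Q=%d psum width=%d\n", M, J, Q, 2 * Q + M + J);
    return 0;   /* psum width 31 bits, which fits the 32-bit bus */
}
```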
In order to evaluate the effect of the 10-bit data width on algorithm accuracy, we compared the accuracy of the Lenet-5 and GoogleNet networks under float, int8 and int10 data widths based on TensorFlow; the recognition accuracy is shown in Table 1.

Table 1. Recognition accuracy under different data widths.

Network   Float    Int8     Int10
Lenet-5   95.18%   95.11%   95.15%
As is widely accepted, fixed-point quantization of data has little impact on the accuracy of CNNs, and the resources saved in this way greatly benefit the IoT system.

Accelerator Structure Design
The structure of the compact CNN accelerator designed for the IoT endpoint SoC is shown in Figure 8. The accelerator consists of a CNN controller, three buffer blocks, four acceleration chains and data selectors. With these modules, the designed accelerator can complete convolution, activation and subsampling operations with four convolution kernels simultaneously while sharing the same source matrix, as shown in Figure 9.
For each acceleration chain, there are four data channels: Src A, Src B, Src C and Result. According to the parallel method of source data sharing, the four acceleration chains share one Src A data channel, so there are 13 data channels in total. It can be seen from the designed structure that the Src B data are provided by the independent COE RAM and the remaining four Src C channels, four Result channels and one Src A channel are all connected to the BUF RAM BANK. Furthermore, for each BUF RAM BANK, an SoC access channel should also be included so that the calculation results can be read by the processor kernel.
The system contains three BUF RAM BANKs, and each of them has four access channels: SoC access channel, Src A read channel, Src C read channel and Result channel. Three banks alternately act as psum memory, source matrix memory, and result buffer memory. An example of the role changes during the working process is shown in Figure 10.
The result of the previous convolution calculation which can be called psum is read as the accumulated value of the next convolution. Furthermore, the BUF RAM BANK that saves the calculation results of each layer acts as the source data RAM of the next convolution layer. This design utilizes the data relationship between operations and layers, reduces the amount of data migration, improves the system performance and reduces power consumption.
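The sketch below illustrates one possible rotation of the three bank roles from layer to layer, chosen so that the bank holding a layer's results becomes the source of the next layer; the enum names and the exact cycle beyond that constraint are assumptions.

```c
/* Illustrative rotation of the three BUF RAM BANK roles. The only constraint
 * taken from the design is that the bank holding a layer's results becomes
 * the source of the next layer; the full cycle chosen here (source -> psum,
 * psum -> result, result -> source) is an assumption for illustration. */
enum bank_role { ROLE_SOURCE, ROLE_PSUM, ROLE_RESULT };

static enum bank_role bank_role_of(int bank, int layer)
{
    /* advancing one layer advances every bank one step along the cycle */
    return (enum bank_role)((bank + layer) % 3);
}
```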
Each BUF RAM BANK is spliced using four independent SP-RAMs to increase the access bandwidth and its structure is shown in Figure 11. For Result and Src C channels, each SP-RAM is independently addressed, while for the SoC access channel and source data access channel the SP-RAMs are uniformly addressed. A fixed priority arbitrator is designed for each SP-RAM to meet the time-sharing access requirements of the four interfaces. According to the importance of the data channel, the order of priority is Result, Src C, Src A, and SoC interface.
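A behavioural model of the fixed-priority arbitration (Result > Src C > Src A > SoC) is sketched below; the real arbiter is of course a hardware circuit, and the request encoding here is only illustrative.

```c
#include <stdbool.h>

/* Request indices in priority order: Result > Src C > Src A > SoC. */
enum { REQ_RESULT, REQ_SRC_C, REQ_SRC_A, REQ_SOC, REQ_COUNT };

/* Grant the highest-priority active request for one SP-RAM, or -1 if idle.
 * Behavioural model only; the actual arbiter is a hardware circuit. */
static int arbitrate(const bool req[REQ_COUNT])
{
    for (int i = 0; i < REQ_COUNT; i++)
        if (req[i])
            return i;
    return -1;
}
```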

In conclusion, the basic parameters of the accelerator designed in this paper are shown in Table 2. The maximum input image supported by this accelerator is 256 × 256 and the maximum convolution kernel is 11 × 11, which can meet the general image recognition requirements of the IoT endpoint SoC. In addition, the accelerator can also use only its convolution or pooling function to accelerate traditional image processing algorithms, thereby meeting the diverse task requirements of IoT node processors.

Table 2. Basic parameters of the accelerator.

Parameter                    Description
Precision                    10-bit fixed-point
Maximum input image          256 × 256
Maximum convolution kernel   11 × 11

Verification Platform Construction
We use the FPGA to verify the prototype of the CNN accelerator. By constructing an IoT node SoC and running an example network, we can more intuitively verify the functions and some performance characteristics of the designed accelerator.

Design of the Verification Platform Based on Cortex-M3
As an MCU kernel launched by ARM, the Cortex-M3 (CM3) adopts the Armv7-M Harvard architecture with a 3-stage pipeline, which achieves a good balance between power and performance. The Dhrystone score of the CM3 kernel is 1.25 DMIPS/MHz (up to 1.89 DMIPS/MHz if inlining and multi-file compilation are permitted), which can meet the processing requirements of IoT node devices [15]. Based on the Cortex-M3 kernel RTL netlist provided by the ARM DesignStart program, we built a testing SoC to complete the functional verification and performance analysis of the designed CNN accelerator. The architecture of the SoC is shown in Figure 12.

The SoC includes basic modules such as 128 KB RAM, 128 KB ROM and common peripherals such as GPIO and UART. As a processor core developed by ARM relatively early, the Cortex-M3 uses the AHB bus as its external interface; thus, the AHB (Advanced High-performance Bus) and APB (Advanced Peripheral Bus) are used as the interconnect buses of the SoC, where high-speed devices such as the SCCB (Serial Camera Control Bus) interface and the CNN accelerator are connected to the kernel through the AHB bus, and low-speed devices such as GPIO are bridged through the APB bus.
We synthesize and implement the verification SoC on the FPGA and obtain the resource report shown in Table 3. Figure 13 shows the breakdown of LUT resources for each circuit.
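From the Cortex-M3's point of view the accelerator is simply a memory-mapped AHB peripheral. The snippet below shows how such a peripheral could be driven by polling; the base address, register offsets and bit names are hypothetical placeholders, not the actual register map of the verification SoC.

```c
#include <stdint.h>

/* Hypothetical register map for the CNN accelerator on the AHB bus; the
 * base address and register offsets below are placeholders for illustration,
 * not the actual memory map of the verification SoC. */
#define CNN_BASE        0x40010000UL
#define CNN_CTRL        (*(volatile uint32_t *)(CNN_BASE + 0x00))
#define CNN_STATUS      (*(volatile uint32_t *)(CNN_BASE + 0x04))
#define CNN_LAYER_CFG   (*(volatile uint32_t *)(CNN_BASE + 0x08))

#define CNN_CTRL_START  (1u << 0)
#define CNN_STAT_DONE   (1u << 0)

/* Start one layer and busy-wait for completion (polling, no interrupts). */
static void cnn_run_layer(uint32_t layer_cfg)
{
    CNN_LAYER_CFG = layer_cfg;
    CNN_CTRL      = CNN_CTRL_START;
    while ((CNN_STATUS & CNN_STAT_DONE) == 0)
        ;                               /* Cortex-M3 polls over the AHB bus */
}
```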

Implementation of the Lenet-5 Network in Verification SoC
Lenet-5, proposed in 1998, is considered one of the earliest and most classical convolutional neural networks. With the development of CNN research, a series of more effective CNN structures have been put forward, but as a classical structure, Lenet-5 and its variants are still used to evaluate the performance of CNN accelerators [16].
The structure of a Lenet-5 variant (abbreviated as Lenet-5) is shown in Figure 14. Its structure is divided into five hidden layers, which are the convolution layer with six convolution kernels, a subsampling layer S1, a partially connected layer containing sixteen convolution kernels, a subsampling layer S2, and a fully connected layer. More information about the Lenet-5 structure can be found in [17]. In Lenet-5, the calculation of the partially connected layer is the most complicated part because the results of this layer are related to the multiple layers or all outputs of the previous layer, so we mainly focus on the implementation of that layer. The specific connection relationship of the partially connected layer is shown in Figure 15.

The accelerator designed in this paper calculates the partially connected layer by reusing the input feature maps, i.e., it takes the input feature maps in order, calculates the connection between each input feature map and all convolution kernels, and stores the results in a BUF RAM BANK as psum. Since there are four acceleration chains in the accelerator, the 2-D convolutions of each feature map with four convolution kernels can be calculated simultaneously. Among them, the first chain is responsible for calculating the connections with {K0, K4, K8, K12}, the second chain for the connections with {K1, K5, K9, K13}, and so on. Figure 16 illustrates this calculation process visually.
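A compact way to express this schedule in C is sketched below; the loop structure mirrors the chain-to-kernel assignment described above, while the connectivity masking of the partially connected layer (not every S1 map feeds every kernel) is omitted for brevity, so the code is illustrative rather than a faithful model.

```c
#define NUM_CHAINS   4
#define NUM_KERNELS 16     /* K0 .. K15 of the partially connected layer  */
#define NUM_IFMAPS   6     /* S1 feature maps from the previous layer     */

/* Illustrative schedule: input feature maps are reused one at a time, and for
 * each map the four chains work in parallel on kernels {c, c+4, c+8, c+12}.
 * The connectivity masking of the partially connected layer is omitted. */
static void schedule_partial_layer(void)
{
    for (int m = 0; m < NUM_IFMAPS; m++)                   /* reuse one ifmap  */
        for (int s = 0; s < NUM_KERNELS / NUM_CHAINS; s++) /* 4 stages per map */
            for (int c = 0; c < NUM_CHAINS; c++) {
                int kernel = c + NUM_CHAINS * s;           /* chain 0: K0,K4,K8,K12 */
                (void)kernel;  /* here: issue the 2-D convolution of map m with
                                  this kernel and accumulate into the psum bank */
            }
}
```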
Through analysis of the structure of the Lenet-5 network, the concrete realization process for completing Lenet-5 with the accelerator designed in this paper can be segmented into four steps, as shown in Table 4.

Table 4. The computation process of Lenet-5 using the accelerator (columns: Layer, Calculation Methods, Description; the steps cover convolution and subsampling, partial connection and subsampling, and full connection).

Performance Analysis of the Accelerator
To evaluate the performance of the proposed architecture, we chose a desktop processor and two mobile application processors as performance evaluation objects. In order to estimate the execution time of the Lenet-5 network on different hardware and software platforms, we used the C language to implement the forward propagation of Lenet-5. The forward propagation program of Lenet-5 was run on the Intel 7500, Samsung S5P6818 and AllWinner H3 to compare with the SoC designed in this paper. The execution time of each platform is listed in Table 5 and Figure 17, in which frames per second (FPS) denotes how many MNIST figures (the size of the figures in the data set is 32 × 32) can be processed in one second.
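For reference, the per-frame time and FPS on the software platforms can be measured with a harness like the one below; lenet5_forward() stands in for the actual C forward-propagation routine and is stubbed out here.

```c
#include <stdio.h>
#include <time.h>

/* Stub for the C forward-propagation routine run on each comparison platform;
 * the name and empty body are placeholders, the real network is linked in. */
static void lenet5_forward(const unsigned char image[32][32])
{
    (void)image;
}

int main(void)
{
    static unsigned char image[32][32];   /* one 32 x 32 MNIST figure */
    const int frames = 1000;

    clock_t t0 = clock();
    for (int i = 0; i < frames; i++)
        lenet5_forward(image);
    double seconds = (double)(clock() - t0) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        seconds = 1e-9;                   /* guard against a too-fast stub */

    printf("time per frame: %.3f ms, FPS: %.1f\n",
           1000.0 * seconds / frames, frames / seconds);
    return 0;
}
```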

Analysis of Resource Consumption
To design a CNN accelerator suitable for IoT systems, a compact structure is a key principle of the design process. In order to evaluate the resource consumption of the accelerating module, we implemented our designed circuit on an FPGA. The FPGA board is a Xilinx VC707 (XC7VX485T-2) (Xilinx, San Jose, CA, USA) and the synthesis tool is Vivado 17.2. The resource consumption comparison between the module designed in this paper and references [11,18,19] is shown in Table 6.

Considering that the target application scenario of this paper is the resource-constrained IoT node SoC, whose cost and area may only be equivalent to those of an STM32 or ESP32, it is impractical to rely on a large number of hardware multipliers to improve computing throughput: high-performance multipliers occupy a large circuit area, consume considerable dynamic power, and are very scarce resources in IoT chips.
The accelerator proposed in this paper takes up less than one-third of the resources of [18] while exceeding its computational power. Although this accelerator cannot compete in performance with some high-performance accelerators such as the one proposed in [19], or with some GPUs, its resource and power consumption are less than 10% of those chips, which meets the need of the IoT node SoC to implement basic CNN computing with a compact structure and low cost. Since the computation of the Tiny-Yolo V2 network is 6.97 GOP per frame, while that of the Squeezenet network is only 0.86 GOP, the accelerator designed in this paper can meet the calculation requirements of these networks. For many IoT nodes such as wearable devices, low power consumption and a small chip area are often more attractive than redundant performance.
In order to evaluate the resource and power consumption characteristics of the designed accelerator more accurately, we will use IC design tools to synthesize the circuit after modifying the accelerator IP in our future work. Meanwhile, we will use the prototype verification system designed to develop some practical IoT applications, such as face detection, face recognition, and license plate recognition.

Conclusions
In this paper, a compact and efficient convolutional neural network accelerator for the IoT endpoint SoC is proposed. Firstly, we propose a convolution calculation method that uses the repeatability of convolution data to reduce data bandwidth usage, and we design functional circuits such as convolution, accumulation and pooling. Secondly, we use these modules to form a compact multi-function accelerator and design an efficient storage scheme for it. Thirdly, a verification SoC based on the Cortex-M3 kernel is implemented on the Xilinx VC707 FPGA. Finally, the designed accelerator is evaluated by migrating the Lenet-5 network, with MNIST as the test case, onto the verification platform and comparing with other platforms. We select a desktop CPU and two mobile application processors, which are typical high-performance IoT SoCs, as reference objects for the performance comparison. The test results show that the proposed accelerator makes the computing power of the Cortex-M3 kernel, with a main frequency of only 80 MHz, nearly two times higher than that of a quad-core Cortex-A7 SoC and 67% of that of the eight-core Cortex-A53 SoC Samsung S5P6818. The accelerator throughput is about 6.54 GOPS at a 220 MHz frequency with a circuit cost of only 4901 LUTs, and it does not use precious hardware multiplier resources, which meets the AI computing needs of the endpoint SoC for the IoT.