An FPGA-Based CNN Accelerator Integrating Depthwise Separable Convolution

Abstract: The Convolutional Neural Network (CNN) has been used in many fields, such as image classification, face detection, and speech recognition, and has achieved remarkable results. Compared to a GPU (graphics processing unit) or an ASIC, an FPGA (field programmable gate array)-based CNN accelerator has great advantages due to its low power consumption and reconfigurability. However, the FPGA's extremely limited resources and the CNN's large number of parameters and high computational complexity pose great challenges to the design. Based on the ZYNQ heterogeneous platform, and balancing resource and bandwidth constraints with the roofline model, the CNN accelerator we designed can accelerate both standard convolution and depthwise separable convolution with high hardware resource utilization. The accelerator can handle network layers of different scales through parameter configuration; it maximizes bandwidth and is fully pipelined by using a data stream interface and a ping-pong on-chip cache. The experimental results show that the accelerator designed in this paper achieves 17.11 GOPS for 32-bit floating point while also accelerating depthwise separable convolution, which is an obvious advantage compared with other designs.


Introduction
Inspired by biological vision systems, the Convolutional Neural Network (CNN) is a well-known deep learning algorithm extended from the Artificial Neural Network (ANN) that has become one of the research hotspots in many scientific fields [1,2]. It has achieved great success in image classification [3], object detection [4], and speech recognition [5]. This technique has also been widely used in industry, such as in monitoring and surveillance, autonomous robot vision, and smart camera technologies [6][7][8][9].
With the development of consumer electronics and the Internet of Things (IoT), embedded devices have come to occupy an important position. However, most current image processing devices are still based on the PC architecture, which is inconvenient in some specific settings. Alternatively, embedded devices are used only for image acquisition and display, while the data processing is still performed by a PC in the background. Consumer-grade IoT devices often rely on high-quality Internet connections, are only available in some areas, and cost more. Therefore, there is great demand for running CNNs with high performance directly on embedded devices.
Achieving high performance relies on the computing platform. Because CNN is computationally intensive, it is not well suited to general-purpose processors such as traditional CPUs. Many researchers have proposed CNN accelerators implemented on the field-programmable gate array (FPGA) [10,11], graphics processing unit (GPU) [3], and application-specific integrated circuit (ASIC) [12]. These accelerators provide an order of magnitude performance improvement and energy advantage over general-purpose processors [13]. Although the GPU has superior computational efficiency for deep learning, it is expensive and consumes a large amount of power, which causes many problems for large-scale deployment and operation. For the same functional design, the power consumption of a single GPU is often tens or even hundreds of times that of an FPGA. Compared to ASICs, FPGAs have a short design cycle and can be reconfigured. In recent years, due to the reconfigurable, customizable, and energy-efficient features of FPGAs [14], together with the rapid development of high-performance products and more flexible architecture design, more and more researchers have focused on FPGA-based CNN hardware acceleration. On the other hand, many efficient network structures have been proposed that effectively reduce the computational complexity and the number of parameters of the model. Among them, depthwise separable convolution is very typical and widely used. It was applied in MobileNet V1 [15] and later in MobileNet V2 [16].
In general, deploying CNN on an FPGA-based hardware platform has become a research boom, with reliable and efficient hardware acceleration solutions adopted to achieve high performance. The literature [7,17,18] implements complete CNN applications on the FPGA with high performance by exploiting different parallelism opportunities. Work [7,17] mainly uses the parallelism within feature maps and convolution kernels. Work [18] uses "inter-output" and "intra-output" parallelism. However, these three improve performance with high bandwidth and dynamic reconfiguration instead of using the on-chip buffer for data reuse. Reference [19] aims to design efficient accelerators under limited external storage bandwidth by maximizing data reuse; however, it does not consider computational performance. Further, it is necessary to reprogram the FPGA when computing the next layer, which greatly increases the total running time. Literature [20] studies the data parallelism of deep learning algorithms using six FPGAs for cloud acceleration; however, it requires a well-coordinated control program and a large system. None of the above studies have taken into account the deployment requirements of mobile devices, such as storage bandwidth and resource constraints, and flexible portability. Work [21] presents an FPGA implementation of CNN designed for portability and power efficiency. The implementation is as efficient as a general-purpose 16-core CPU and is almost 15 times faster than an SoC GPU for mobile applications. The SqueezeNet DCNN is accelerated using an SoC FPGA so that the resulting object recognition capability can be employed in a robotic application [22]. In [23], under the roofline model and considering resources and bandwidth, a CNN accelerator was implemented on the VC707 FPGA board. Literature [24] proposes many optimization methods and uses the Xilinx SDAccel tool to accelerate a convolution layer under the OpenCL framework with a performance improvement of 14.4 times. In [25], the authors present a systematic methodology for maximizing the throughput of an FPGA-based accelerator. In that work, an entire CNN model is proposed consisting of all CNN layers: convolution, normalization, pooling, and classification layers. Work [26] proposes an FPGA accelerator with a scalable architecture of deeply pipelined OpenCL kernels. However, none of the above work [21][22][23][24][25][26] implements depthwise separable convolution; therefore, they cannot be applied to network families such as MobileNet. This paper makes the following major contributions:

1. A configurable system architecture is proposed based on the ZYNQ heterogeneous platform. Under this architecture, the optimal design of the accelerator is completed with the roofline model, and the accelerator is scalable.

2. Based on the single computation engine model, the CNN hardware accelerator we designed efficiently integrates standard convolution and depthwise separable convolution.

3. A ping-pong on-chip buffer maximizes the bandwidth, and the CNN accelerator we designed is fully pipelined.

The rest of this article is organized as follows: Section 2 introduces the basic principles of CNN and depthwise separable convolution. Section 3 describes the architecture of this implementation and elaborates on the design details of the accelerator. Section 4 describes the experimental results of implementing the accelerator on the ZYNQ platform, completing design verification and analysis. Section 5 summarizes the content of this article.

Convolutional Neural Network
A typical CNN contains multiple computation layers which are concatenated together. The most common network layers are the convolutional layer, the pooling layer, and the fully connected layer. The details are as follows.

Convolution Layer
The convolutional layer is the most important layer in a CNN. It is used to extract the features of the input image or of the feature map output by the previous layer. The operation is a two-dimensional convolution of the input data with a number of different convolution kernels, and a new two-dimensional output is obtained through the activation function. The calculation formula for a single two-dimensional convolution is given by Equation (1).

$$O_{xy} = f\left(\sum_{j=0}^{k-1}\sum_{i=0}^{k-1} p_{x+i,y+j}\, w_{ij} + b\right) \tag{1}$$

where $p_{x+i,y+j}$ is the pixel value of the input feature map at the point $(x + i, y + j)$, $k$ is the size of the convolution kernel, $W$ and $H$ are the width and height of the input feature map (which bound $x$ and $y$), $w_{ij}$ is the corresponding weight in the convolution kernel, $b$ is the bias, $f$ is the activation function (e.g., ReLU, Sigmoid, Tanh, etc.), and $O_{xy}$ is the output value of a two-dimensional convolution with a window of size $k \times k$ at the point $(x, y)$.
The calculation of the convolutional layer is composed of many two-dimensional convolution operations, as given by Equation (2).

$$X_j^n = f\left(\sum_{i=1}^{N} X_i^{n-1} * k_{j,i}^n + b_j^n\right), \quad j = 1, \dots, M \tag{2}$$

where $X_j^n$ is the $j$th feature map output by the $n$th convolution layer, $N$ is the number of input feature map channels, $k_{j,i}^n$ is the corresponding convolution kernel, $M$ is the number of convolution kernels, $b_j^n$ is the offset term, $*$ is the convolution operation, and $f$ is the activation function.
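The two formulas above can be illustrated with a minimal pure-Python sketch. It is not the accelerator implementation: it assumes no padding, a stride of 1, ReLU as the activation, and the cross-correlation index order `p[y + j][x + i] * w[j][i]` that is conventional in CNN frameworks. The function names `corr2d` and `conv_layer` are hypothetical.

```python
def relu(v):
    """ReLU activation, one of the choices for f named in the text."""
    return v if v > 0 else 0


def corr2d(p, w):
    """Window sum of p[y+j][x+i] * w[j][i] for a k x k kernel (no padding, stride 1)."""
    k = len(w)
    rows, cols = len(p), len(p[0])
    return [[sum(p[y + j][x + i] * w[j][i] for j in range(k) for i in range(k))
             for x in range(cols - k + 1)]
            for y in range(rows - k + 1)]


def conv_layer(inputs, kernels, biases, f=relu):
    """Equation (2): output map j = f(sum over i of inputs[i] (*) kernels[j][i] + biases[j])."""
    outputs = []
    for j, bias in enumerate(biases):
        # One partial result per input channel, accumulated before the activation.
        partial = [corr2d(x_i, kernels[j][i]) for i, x_i in enumerate(inputs)]
        rows, cols = len(partial[0]), len(partial[0][0])
        outputs.append([[f(sum(m[y][x] for m in partial) + bias) for x in range(cols)]
                        for y in range(rows)])
    return outputs


feature_map = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]
print(corr2d(feature_map, kernel))
print(conv_layer([feature_map], [[kernel]], [-10]))
```

Here `inputs` holds the N input maps and `kernels[j][i]` is the kernel linking input channel i to output channel j, so the layer needs M × N kernels in total.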

Pool Layer
The pooling layer, also called the downsampling layer, reduces feature map redundancy and network computation complexity by reducing the feature map dimensions, and it effectively prevents overfitting. The formula for calculating the pooling layer is shown in Equation (3).

$$X_j^n = f\left(\mathrm{down}\left(X_j^{n-1}\right)\right) \tag{3}$$

where $X_j^n$ is the $j$th feature map output by the $n$th layer, $\mathrm{down}$ is the pooling method (average pooling and maximum pooling are commonly used), and $f$ is the activation function.
When the pooling step size is 2, the process of 2 × 2 maximum pooling is shown in Figure 1.
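As a small illustration of the 2 × 2 maximum pooling with a step size of 2 described above, the following sketch applies it to a 4 × 4 feature map. The function name `max_pool2d` and the sample values are assumptions made for the example; the activation f is omitted.

```python
def max_pool2d(fmap, size=2, stride=2):
    """Maximum pooling: keep the largest value in each size x size window."""
    rows, cols = len(fmap), len(fmap[0])
    return [[max(fmap[y + j][x + i] for j in range(size) for i in range(size))
             for x in range(0, cols - size + 1, stride)]
            for y in range(0, rows - size + 1, stride)]


fmap = [[1, 3, 2, 4],
        [5, 6, 7, 8],
        [9, 2, 1, 0],
        [3, 4, 5, 6]]
print(max_pool2d(fmap))
```

Each output value summarizes a non-overlapping 2 × 2 window, so the 4 × 4 input shrinks to 2 × 2.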

Fully-Connected Layer
The fully connected layer is generally placed at the end of the convolutional neural network; it converts the high-level two-dimensional feature maps extracted by the preceding convolutional layers into a one-dimensional feature map output. In the fully connected layer, each neuron is connected to all neurons of the previous layer and there is no weight sharing.

Depthwise Separable Convolution
In recent years, in order to run high-quality CNN models on mobile terminals with strict memory and computing budgets, many innovative network models have been proposed, such as MobileNet and ShuffleNet. These models use depthwise separable convolution, which effectively reduces the number of parameters and the amount of computation of the network with limited loss of precision.
Standard convolution uses kernels with the same number of channels as the input data and sums the channel-by-channel convolution results into a single output. As Figure 2 shows, depthwise separable convolution is divided into depthwise convolution and pointwise convolution. The former uses a set of two-dimensional kernels (channel number is 1) to convolve each channel of the input feature maps individually. The latter is equivalent to standard convolution with a 1 × 1 kernel size; in the following text, it is implemented as a standard convolution.
Assume that the size of the input feature map is $F \times F \times N$, the size of the convolution kernel is $K \times K \times M \times N$, and the stride is 1. The number of parameters of the standard convolution layer is:

$$W_{sc} = K \times K \times M \times N \tag{4}$$

The amount of calculation is:

$$O_{sc} = F \times F \times K \times K \times N \times M \tag{5}$$

The number of parameters of depthwise separable convolution is:

$$W_{dsc} = K \times K \times N + M \times N \tag{6}$$

The amount of calculation is:

$$O_{dsc} = F \times F \times K \times K \times N + F \times F \times N \times M \tag{7}$$

Thus, the reduction factors on weights and operations are calculated in Equation (8):

$$F_W = \frac{W_{dsc}}{W_{sc}} = \frac{1}{M} + \frac{1}{K^2}, \qquad F_O = \frac{O_{dsc}}{O_{sc}} = \frac{1}{M} + \frac{1}{K^2} \tag{8}$$
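To make the saving concrete, the following sketch counts parameters and multiply-accumulate operations for both kinds of convolution. It assumes an F × F × N input, M kernels of size K × K, a stride of 1, and an output that keeps the F × F spatial size. The function names and the layer dimensions in the example are illustrative choices, not values from the experiments.

```python
def standard_conv_cost(F, K, N, M):
    """Parameters and multiply-accumulates of a standard convolution layer."""
    params = K * K * N * M
    ops = F * F * K * K * N * M
    return params, ops


def depthwise_separable_cost(F, K, N, M):
    """Depthwise (K x K, one kernel per channel) followed by pointwise (1 x 1) convolution."""
    params = K * K * N + N * M
    ops = F * F * K * K * N + F * F * N * M
    return params, ops


F, K, N, M = 14, 3, 512, 512  # hypothetical layer dimensions
w_sc, o_sc = standard_conv_cost(F, K, N, M)
w_dsc, o_dsc = depthwise_separable_cost(F, K, N, M)
print(w_dsc / w_sc, o_dsc / o_sc, 1 / M + 1 / K ** 2)
```

For a 3 × 3 kernel and a large number of output channels, both ratios approach 1/9, so depthwise separable convolution needs roughly an eighth to a ninth of the weights and operations of standard convolution.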

Architecture and Accelerator Design
In AI application scenarios, the CPU is highly flexible but not computationally efficient, whereas an accelerator is computationally efficient but not flexible enough. Therefore, the architecture currently in wide use for deep learning combines a CPU with an accelerator and is called a heterogeneous system. We choose the Xilinx ZYNQ 7100 heterogeneous chip as the hardware platform to complete the design of the system architecture and accelerator.

Design Overview
Because of the hierarchical structure of CNN, there are currently two different implementation modes: the streaming architecture and the single computation engine. The former allocates dedicated hardware resources to each network layer and has the following three characteristics: (1) it can realize inter-layer parallelism and flexibly control the parallelism within each layer; (2) it is highly customized and inflexible; (3) its demand for resources is high, so it only applies to small networks. The latter means that different network layers share the same accelerator through resource reuse; it is a less customized architecture that is more flexible and easier to migrate between platforms. Therefore, considering the limited resources of the hardware platform and the goal of fully exploiting the parallelism within a single network layer, we design the system architecture in the single computation engine mode, as shown in Figure 3.
The system architecture mainly includes the external Double Data Rate (DDR) memory, the processing system (PS), the on-chip buffer, the accelerator in the programmable logic (PL), and the on-chip and off-chip bus interconnection. The initial image data and weights are pre-stored in the external DDR memory. The PS and PL are interconnected through the AXI4 bus. The accelerator receives configuration signals from the CPU through the AXI4-Lite bus (e.g., convolution kernel size, stride, and whether to perform standard convolution or depthwise convolution). Under the control of the DDR controller in the PS, the weights and input data of the current layer required by the accelerator are read from the DDR, converted from the AXI4 memory-mapped format to the AXI4-Stream format by the Direct Memory Access (DMA) engine, and written into the on-chip buffer of the accelerator. The buffer of the IP core uses the AXI4-Stream interface and the ping-pong mode. One buffer acts as a producer, receiving data from the DMA, and the other, called the consumer, takes part in the current calculation of the IP core. In the next stage, the producer and the consumer swap. After being processed by the accelerator, the output is sent back to the DDR through the AXI4 bus, and the above operation is repeated until the calculation of the entire network model is completed.
It can be seen that, under this architecture, the accelerator exchanges data with the external memory twice: once to receive the weights and input feature map and once to send the output feature map back off-chip. Frequent data exchange imposes high requirements on the bandwidth of the platform. Therefore, taking MobileNet + SSD as an example, we use the roofline model to jointly consider the computing platform resources and storage bandwidth and seek the optimal design for the accelerator.
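The producer and consumer swap of the ping-pong buffer can be sketched in software as follows. This is only a sequential model of the schedule: in hardware the load of the next tile and the computation on the current tile overlap in time, which plain Python does not reproduce. The function name `run_ping_pong` and the use of `sum` as a stand-in for the compute engine are assumptions for illustration.

```python
def run_ping_pong(tiles, compute):
    """Process tiles with two buffers that alternate between producer and consumer roles."""
    if not tiles:
        return []
    buffers = [None, None]
    producer = 0
    buffers[producer] = tiles[0]      # prologue: the first tile is loaded before any compute
    results = []
    for step in range(len(tiles)):
        consumer = producer           # the buffer that was just filled is consumed next
        producer = 1 - producer       # the other buffer becomes the producer
        if step + 1 < len(tiles):
            # In hardware, this DMA load runs concurrently with the compute below.
            buffers[producer] = tiles[step + 1]
        results.append(compute(buffers[consumer]))
    return results


print(run_ping_pong([[1, 2], [3, 4], [5]], sum))
```

Because each stage reads from one buffer while the other is being filled, the compute engine does not have to wait for a transfer except for the first tile.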

Accelerator Overview
The structure of the accelerator is shown in Figure 4. The on-chip buffer is divided into three parts: (1) the input buffer for storing the input feature map, (2) the weights buffer for storing weights, and (3) the output buffer for storing intermediate results and the output feature map. In order to maximize the external storage bandwidth, all three use AXI4-Stream interfaces and the ping-pong mode. The three processes of loading the input feature map data and weights, calculating the convolution, and outputting the calculation result are fully pipelined. The compute engine can be set to work in standard convolution or depthwise convolution mode under the control of the CPU through the AXI4-Lite bus.

The Roofline Model of ZYNQ 7100
To predict the performance of a deep learning model on a specific hardware platform, the roofline model [27] provides a method of quantitative analysis based on operational intensity, which calculates the floating-point performance that can be attained under the limits of the external storage bandwidth and computing resources of a hardware platform. This is shown in Figure 5.

[Figure 5 axes: attainable performance (GFLOPS) versus computation to communication ratio (FLOP/Byte)]
Equation (9) formulates the attainable throughput of an application on a specific hardware platform.

$$\text{Attainable performance} = \min\left\{\text{Computational roof},\ \text{CTC ratio} \times \text{BW}\right\} \tag{9}$$

Giga floating-point operations per second (GFLOPS) is a measure of computing performance. The roofline is divided into two regions: compute bound and memory bound. The computational performance that the network model can achieve on the platform cannot exceed the lower of the two bounds. In the compute-bound region, the bottleneck is the computational roof (i.e., all the floating-point operations that the computing platform can complete per second). In the memory-bound region, the bottleneck is the product of the computation to communication (CTC) ratio (i.e., operations per unit of DRAM traffic) and the I/O memory bandwidth (BW).
In our work, we calculate the computational roof and the maximum I/O memory bandwidth roof of the Xilinx ZYNQ 7100 computing platform according to Equation (10), where $N_{DSP}$ is the number of DSPs on the hardware platform, divided by 5 because five DSPs are required to complete one 32-bit floating-point multiply-and-add operation, $N_{HP}$ is the number of High Performance (HP) ports, and $f$ is the system clock frequency (assumed to be 100 MHz).
The constructed skeleton model is shown in Figure 6.
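The roofline reasoning can be sketched numerically as follows. The sketch rests on assumptions that are not stated in the text and should be checked against Equation (10): each multiply-and-add counts as 2 floating-point operations, each HP port is 64 bits wide, and the device figures of 2020 DSP slices and 4 HP ports are taken from the ZYNQ 7100 datasheet. The function names are hypothetical.

```python
def platform_roofs(n_dsp, n_hp, f_hz, dsp_per_mac=5, flops_per_mac=2, port_bits=64):
    """Computational roof (FLOP/s) and bandwidth roof (Byte/s) under the stated assumptions."""
    computational_roof = flops_per_mac * (n_dsp // dsp_per_mac) * f_hz
    bandwidth_roof = (port_bits // 8) * n_hp * f_hz
    return computational_roof, bandwidth_roof


def attainable_performance(computational_roof, ctc_ratio, bandwidth):
    """Roofline bound: the lower of the compute roof and CTC ratio x bandwidth."""
    return min(computational_roof, ctc_ratio * bandwidth)


# Assumed device figures: 2020 DSP slices, 4 HP ports, 100 MHz clock.
comp_roof, bw_roof = platform_roofs(n_dsp=2020, n_hp=4, f_hz=100_000_000)
print(comp_roof / 1e9, "GFLOPS;", bw_roof / 1e9, "GByte/s")
for ctc in (5, 25.25, 60):  # FLOP/Byte, illustrative values
    print(ctc, attainable_performance(comp_roof, ctc, bw_roof) / 1e9)
```

Under these assumptions the two roofs meet at a CTC ratio of about 25 FLOP/Byte: designs below that ratio are memory bound, and designs above it are compute bound.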

Data Partition and Exploring the Design Space
Since the on-chip cache resources are often extremely limited, this is usually unsatisfactory for all input feature maps and weights to be cached on the chip.The data must be partitioned.As shown in Figure 7, since the convolution kernel size K itself is small, it is not divided in this dimension.R, C, M, N, and N i are the width, height, and number of convolution kernels of the output feature map (also the number of channels of the output feature map), the channels of convolution kernel channels, and the channels of the input feature map, respectively.T r , T c , T m , and T n are the block factors of width, height of output feature map, the number and channels of convolution kernels, respectively.T ri , T ci , and T ni are the block factors width, height, and channels of the input feature map, respectively.The above-mentioned block coefficient setting takes into account both standard convolution and depthwise convolution.We use the example in Figure 8 to illustrate how block convolution works.In this example, the input tensor consists of three separated channels of size 8 × 8 with additional zero padding, the kernel size is 3 × 3, and each input feature map is divided into four independent tiles.Since inter-tile dependencies are eliminated in block convolution, it is not possible to obtain an output tile of size 4 × 4 directly from three input tiles at the corresponding position.As shown in Figure 8a, when the stride is 1, an input tile of size 6 × 6 is required to get an output tile of size 4 × 4. In Figure 8b, an input tile of size 5 × 5 is required to get an output tile of size 2 × 2 when the stride is 2. 
In block convolution, the relationship between Block convolution affects the external memory access of the model, which affects the CTC Ratio .See Equation (12), which establishes a mathematical connection between the block factors and CTC Ratio .We use the example in Figure 8 to illustrate how block convolution works.In this example, the input tensor consists of three separated channels of size 8 × 8 with additional zero padding, the kernel size is 3 × 3, and each input feature map is divided into four independent tiles.Since inter-tile dependencies are eliminated in block convolution, it is not possible to obtain an output tile of size 4 × 4 directly from three input tiles at the corresponding position.As shown in Figure 8a, when the stride is 1, an input tile of size 6 × 6 is required to get an output tile of size 4 × 4. In Figure 8b, an input tile of size 5 × 5 is required to get an output tile of size 2 × 2 when the stride is 2. In block convolution, the relationship between T r and T ri can be determined as Equation ( 11): We use the example in Figure 8 to illustrate how block convolution works.In this example, the input tensor consists of three separated channels of size 8 × 8 with additional zero padding, the kernel size is 3 × 3, and each input feature map is divided into four independent tiles.Since inter-tile dependencies are eliminated in block convolution, it is not possible to obtain an output tile of size 4 × 4 directly from three input tiles at the corresponding position.As shown in Figure 8a, when the stride is 1, an input tile of size 6 × 6 is required to get an output tile of size 4 × 4. In Figure 8b, an input tile of size 5 × 5 is required to get an output tile of size 2 × 2 when the stride is 2. 
In block convolution, the relationship between Block convolution affects the external memory access of the model, which affects the CTC Ratio .See Equation (12), which establishes a mathematical connection between the block factors and CTC Ratio .Block convolution affects the external memory access of the model, which affects the CTC Ratio.See Equation (12), which establishes a mathematical connection between the block factors and CTC Ratio.
In particular, for standard convolution, $N = N_i$ and $T_{ni} = T_n$; however, for depthwise convolution, $N = 1$, $M = N_i$, and $T_{ni} = T_m$.

The hardware acceleration effect of a CNN depends largely on how fully the parallelism of the algorithm is exploited. A CNN is a feedforward multi-layer network whose interlayer structure, intra-layer operations, and data stream drive all share certain similarities, so the CNN topology itself offers many kinds of parallelism. These mainly include: (1) multi-channel operation of the input feature map and convolution kernel; (2) the same convolution window and different convolution kernels can perform convolution operations simultaneously; (3) multiple convolution windows and the same convolution kernel can perform convolution operations simultaneously; (4) within a convolution window, all neuron nodes and their corresponding kernel parameters can be operated on simultaneously. These four kinds of parallelism correspond to the dimensions $T_n$, $T_m$, $T_r$, and $K$, respectively. Exploiting computational parallelism not only places requirements on computing resources but also requires an on-chip cache structure that can supply the data needed for parallel computing, which in turn increases the required on-chip cache bandwidth. The Vivado HLS development tool makes it very easy to partition an array in a particular dimension. However, if the parallelism of (3) and (4) is exploited, the cache structure of the data is as shown in Figure 9. As can be seen from the figure, if the data in a convolution window is to be distributed and stored in $K \times K$ buffers, the data is not continuous in the $T_r$ dimension, which is difficult to implement with array partitioning in Vivado HLS. Moreover, this layout repeatedly stores the overlapping data between convolution windows, which greatly increases the consumption of on-chip cache resources. Therefore, we exploit the parallelism of the calculations in the dimensions $T_m$ and $T_n$.
The calculation engine diagram is shown in Figure 10. When calculating the standard convolution, the $T_n$ channels of the input feature map are simultaneously multiplied by the weights of the corresponding channels, and the intermediate results are then continuously accumulated; pipelining this accumulation greatly reduces the delay. At the same time, the same operation is performed in parallel for a group of $T_m$ different convolution kernels. When dealing with depthwise convolution, the channels of the convolution kernel are filled with zeros up to $T_m$ in order to efficiently integrate the two kinds of convolution without destroying the computational parallelism, as shown in Figure 10b.
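The idea of sharing one multiply-accumulate engine between the two kinds of convolution can be sketched in a few lines. This is an illustrative software model, not the HLS source: it handles a single kernel tap, and it assumes that zero filling places each depthwise weight on the diagonal of a square weight tile (so that output channel m only sees input channel m). The function names and that diagonal layout are assumptions made for this example.

```python
def engine_tap(pixels, weights):
    """One kernel tap of the shared engine.

    pixels  : T_n input values, one per input channel.
    weights : T_m rows of T_n weights, one row per convolution kernel.
    Returns T_m partial sums, out[m] = sum_n pixels[n] * weights[m][n].
    """
    return [sum(p * w for p, w in zip(pixels, row)) for row in weights]


def zero_fill_depthwise(taps):
    """Expand per-channel depthwise weights into a zero-filled square tile.

    ASSUMPTION: kernel m keeps its single weight in column m and is zero
    elsewhere, so the standard-convolution data path can be reused as is.
    """
    size = len(taps)
    return [[taps[m] if n == m else 0.0 for n in range(size)]
            for m in range(size)]


pixels = [1.0, 2.0, 3.0]

# Standard convolution: T_m = 2 kernels, each spanning T_n = 3 channels.
standard = [[1.0, 0.0, 1.0], [0.5, 0.5, 0.5]]
print(engine_tap(pixels, standard))

# Depthwise convolution: one weight per channel, zero-filled to a 3 x 3 tile.
depthwise = zero_fill_depthwise([2.0, 3.0, 4.0])
print(engine_tap(pixels, depthwise))
```

In hardware, the two comprehension loops would correspond to the unrolled $T_m$ and $T_n$ dimensions; the multiplications by zero cost resources but keep a single data path for both layer types.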
It can be seen from the above analysis that, under the calculation engine we designed, the physical computation roof can be calculated by Equation (13) for given block factors $T_r$, $T_c$, $T_m$, and $T_n$:

$$\text{physical computation roof} = \frac{\text{total number of operations} \times \text{system clock frequency}}{\text{number of execution cycles}} \qquad (13)$$

where $P = \text{pipeline depth} - 1$ is the pipeline term in the execution cycle count.

Due to array partitioning and ping-pong buffers, the consumption of on-chip cache resources is increased. To associate the on-chip buffer with the block factors, we need to satisfy Equation (14).
In Equation (14), the BRAM consumed by the weight buffer, the input buffer, and the output buffer, each a function of the block factors, must not exceed the number of BRAM available on chip.

Combining the above-mentioned CTC Ratio, physical computation roof, and the analyzed on-chip cache relationship with the roofline model of the ZYNQ 7100 under the block factors $T_r$, $T_c$, $T_m$, and $T_n$, we seek the best design and find the optimal block factors, shown as point A in Figure 11, under the constraints given in Equation (15).
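The roofline reasoning used here can be sketched as follows. The computation roof follows the ratio form of Equation (13), and the attainable performance is the standard roofline bound, the minimum of the computation roof and the CTC Ratio multiplied by the bandwidth. All numbers, names, and the candidate list are hypothetical and only illustrate how a design point would be selected; resource constraints such as the BRAM limit are assumed to have been checked beforehand.

```python
def computation_roof(total_ops, clock_hz, execution_cycles):
    """Operations per second for one layer under a given tiling."""
    return total_ops * clock_hz / execution_cycles


def attainable_performance(comp_roof, ctc_ratio, bandwidth):
    """Roofline bound: min(computation roof, CTC ratio x bandwidth).

    comp_roof : operations per second
    ctc_ratio : operations per byte of external memory traffic
    bandwidth : bytes per second
    """
    return min(comp_roof, ctc_ratio * bandwidth)


def best_candidate(candidates, bandwidth):
    """Pick the candidate with the highest attainable performance.

    candidates: iterable of (label, comp_roof, ctc_ratio) tuples.
    """
    return max(candidates,
               key=lambda c: attainable_performance(c[1], c[2], bandwidth))


# Hypothetical design points: (label, computation roof, CTC ratio)
designs = [
    ("A", 20e9, 16.0),   # compute bound at 2 GB/s
    ("B", 30e9, 4.0),    # memory bound at 2 GB/s
]
print(best_candidate(designs, bandwidth=2e9)[0])
```

Design B has the higher computation roof, but with a low CTC Ratio its attainable performance is capped by bandwidth, which is why the search looks at both quantities together.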
According to the above ideas, the optimal block factors $T_{r\_opt}$, $T_{c\_opt}$, $T_{m\_opt}$, and $T_{n\_opt}$ of the current network layer can be obtained by programming with Matlab. However, the optimal block factors obtained for different layers differ. In particular, $T_m$ and $T_n$ determine the computational parallelism. If $T_m$ and $T_n$ were allowed to vary between layers, a complex hardware architecture would be needed to support reconfiguration of the computational engines and interconnects. So, we solve for the globally optimal $T_m$ and $T_n$ under the whole network model, as shown in Formula (16). Once $T_n$ and $T_m$ have been determined, the remaining configurable parameters of the accelerator are those shown in Table 1.
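A global choice of $T_m$ and $T_n$ can be illustrated with a small exhaustive search. This is only a sketch under stated assumptions: the objective (total execution cycles summed over all layers), the toy per-layer cycle model, and the feasibility check are all hypothetical stand-ins, and the code is written in Python rather than the Matlab used for the original exploration.

```python
from itertools import product
from math import ceil


def global_block_factors(layers, layer_cycles, feasible, tm_range, tn_range):
    """Exhaustively pick the (T_m, T_n) pair with the lowest total cycle count.

    layers       : list of per-layer descriptions
    layer_cycles : function (layer, tm, tn) -> cycles, supplied by the caller
    feasible     : function (tm, tn) -> bool, e.g. a resource check
    tm_range     : (T_m_min, T_m_max), inclusive
    tn_range     : (T_n_min, T_n_max), inclusive
    Returns (tm, tn, total_cycles), or None if no pair is feasible.
    """
    best = None
    for tm, tn in product(range(tm_range[0], tm_range[1] + 1),
                          range(tn_range[0], tn_range[1] + 1)):
        if not feasible(tm, tn):
            continue
        total = sum(layer_cycles(layer, tm, tn) for layer in layers)
        if best is None or total < best[2]:
            best = (tm, tn, total)
    return best


# Toy model (ASSUMPTION): a layer is (M, N) and costs ceil(M/tm) * ceil(N/tn).
def toy_cycles(layer, tm, tn):
    m, n = layer
    return ceil(m / tm) * ceil(n / tn)


# Toy resource limit (ASSUMPTION): at most 16 parallel multipliers.
def toy_feasible(tm, tn):
    return tm * tn <= 16


layers = [(20, 3), (64, 32)]
print(global_block_factors(layers, toy_cycles, toy_feasible, (2, 8), (2, 8)))
```

Restricting the search to the interval between the smallest and largest per-layer optimum keeps the candidate set small, so brute force is sufficient.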

Experimental Evaluation and Results
The accelerator is implemented with Vivado HLS (v2016.4). Vivado HLS allows the accelerator to be written in C++ and converted to RTL as a Vivado IP core, which greatly shortens the development cycle. The parallelism described previously is obtained by adding HLS-defined pragmas, such as pipeline, array partition, and dataflow, to the C++ design. After that, the IP core is imported into a Vivado (v2016.4) project to complete synthesis and verification on the FPGA.

Resource Utilization
The hardware resources consumed by the accelerator are shown in Table 2. The table shows that the implemented accelerator achieves high hardware resource utilization, which also confirms the analysis of the previous design exploration and computational parallelism.

Comparisons of Pipelined and Non-Pipelined Versions
MobileNet + SSD has a total of 47 network layers, which contain both standard convolution and depthwise separable convolution. We take the first network layer, a standard convolution, and the second network layer, a depthwise convolution, as examples to compare the running time of the fully pipelined and non-pipelined versions in light of the layer parameters. Table 3 lists the parameters of the two network layers. More detailed comparisons are shown in Tables 4 and 5, respectively. Because the parameters and structure of each network layer differ, the calculation throughput differs as well. The performance bottleneck of MobileNet + SSD is the bandwidth roof rather than the computation roof. Therefore, the fully pipelined state achieved by the stream data interface and the ping-pong technique greatly reduces the latency. Taking the resource report into account, the pipelined version has higher energy efficiency than the non-pipelined version.
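Why ping-pong buffering reduces latency can be seen from an idealized timing model. The sketch assumes that each tile goes through three stages (load, compute, store), that the stage times are constant, and that with ping-pong buffers the three stages of consecutive tiles overlap perfectly. The cycle counts are hypothetical and are not the values reported in Tables 4 and 5.

```python
def sequential_latency(n_tiles, load, compute, store):
    """No overlap: every tile is loaded, computed, and stored in turn."""
    return n_tiles * (load + compute + store)


def pingpong_latency(n_tiles, load, compute, store):
    """Idealized three-stage pipeline enabled by ping-pong buffers.

    The first tile pays the full load + compute + store latency; each
    further tile adds only the slowest of the three stages.
    """
    if n_tiles == 0:
        return 0
    return load + compute + store + (n_tiles - 1) * max(load, compute, store)


# Hypothetical per-tile stage times in cycles; the load stage dominates,
# as in a bandwidth-bound layer.
tiles, load, compute, store = 100, 40, 30, 10
print(sequential_latency(tiles, load, compute, store))
print(pingpong_latency(tiles, load, compute, store))
```

In this model the pipelined latency approaches the time of the slowest stage alone, so for a bandwidth-bound layer the remaining cost is essentially the data transfer time.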

Comparisons with CPU Implementation
The CPU version runs on the Cortex-A9 core of the ZYNQ 7100. The compiler is "arm_xilinx_eabigcc" from the Xilinx Software Development Kit, and the software version compiled with the -O3 flag is used for comparison with the accelerator. The calculation amount of each layer is shown in Figure 12. The running times of the CPU and the accelerator for the complete MobileNet + SSD network are shown in Figures 13 and 14, respectively.

The running time of each layer in Figures 13 and 14 includes the time for fetching the data from the external memory, computing, and sending the results back to the external memory. Taken together, Figures 13 and 14 indicate that the accelerator is more than 150 times faster than the software implementation.

Comparisons with Others
We compare our implementation of the accelerator with other FPGA-based accelerators in Table 6. Since one Multiply-Accumulate (MACC) contains two operations, we convert the performance indicators into giga operations per second (GOPS) for comparison. Other accelerator implementations do not include depthwise separable convolution, so the performance of our standard convolution is used in the comparison. With a fixed-point calculation engine, our method could perform better, because a fixed-point processing unit uses fewer resources. It can be seen that the accelerator we implemented has certain performance advantages.
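The conversion from MACC counts to GOPS is simple arithmetic and can be written out as below. The helper name and the example figures are hypothetical; only the factor of two operations per MACC comes from the text.

```python
def gops_from_macc(macc_count, seconds):
    """Convert a multiply-accumulate count and a run time into GOPS.

    Each MACC is counted as two operations (one multiply, one add).
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return 2 * macc_count / seconds / 1e9


# Hypothetical layer: 500 million MACCs completed in 0.5 s
print(gops_from_macc(500_000_000, 0.5))
```

Applying the same conversion to every design in a comparison keeps the reported figures on a common footing, regardless of whether the original publication quoted MACC/s or operations per second.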

Conclusions
In this article, we implemented a CNN accelerator on the Xilinx ZYNQ 7100 hardware platform that accelerates both standard convolution and depthwise separable convolution. Thanks to the heterogeneous mode of ZYNQ, the accelerator, based on a single computing engine, can accelerate network layers of different scales under the configurable architecture we designed. Taking the MobileNet + SSD network as an example, we modeled the globally optimal computational parallelism parameters of the entire network under the roofline model of the ZYNQ 7100. In order to maximize bandwidth and reduce the delay caused by on-chip/off-chip data exchange, the three on-chip stream buffers use the data stream interface and operate in ping-pong mode. Whether dealing with standard convolution or depthwise separable convolution, these techniques achieve a fully pipelined state with a much lower delay than the non-pipelined state. In the end, the accelerator achieved a computing performance of 17.11 GFLOPS at a clock frequency of 100 MHz with high resource utilization, which is superior to previous designs. Our current system clock frequency is only 100 MHz, which is lower than that of other designs. If the system clock can be increased, the performance of the accelerator will improve significantly.

Figure 3. FPGA (field programmable gate array) architecture of system implementation.
$T_r$, $T_c$, $T_m$, and $T_n$ are the block factors of the width and height of the output feature map and of the number and channels of the convolution kernels, respectively. $T_{ri}$, $T_{ci}$, and $T_{ni}$ are the block factors of the width, height, and channels of the input feature map, respectively. The above-mentioned block factor setting takes into account both standard convolution and depthwise convolution.

Figure 8. An example of block convolution: (a) The stride of convolution is one; (b) The stride of convolution is two.

Figure 9. Array partitioning in $T_r$ and $K$ dimensions.
Figure 13. CPU running time results.
The computational performance achievable on the platform by the network model cannot exceed the minimum of the two regional bottlenecks. In the case of compute bound, the bottleneck is the computing roof (i.e., a computing platform exhausts all the floating-point operations that can be completed per second). When in the memory bound, the bottleneck is the bandwidth roof multiplied by the computing to communication (CTC) ratio.

In Formula (16), $N$ is the number of network layers. $T_{m\_min}$ and $T_{m\_max}$ are the minimum and maximum values of $T_{m\_opt}$ sought by all network layers. $T_{n\_min}$ and $T_{n\_max}$ are the minimum and maximum values of $T_{n\_opt}$ sought by all network layers. The final global optimal solution is then obtained.

Table 2. Resource utilization of the accelerator.

Table 3. Parameter of the two network layers.

Table 4. Comparisons of the first layer.

Table 5. Comparisons of the second layer.

Table 6. Comparison to previous implementations.