Novel CNN-Based AP2D-Net Accelerator: An Area and Power Efficient Solution for Real-Time Applications on Mobile FPGA

Abstract: Standard convolutional neural networks (CNNs) have large amounts of data redundancy, and the same accuracy can be obtained even with lower-bit weights instead of floating-point representations. Most CNNs have to be developed and executed on high-end GPU-based workstations, and it is hard to transplant the existing implementations onto portable edge FPGAs because of the limited on-chip block memory storage size and battery capacity. In this paper, we present the adaptive pointwise convolution and 2D convolution joint network (AP2D-Net), an ultra-low-power system with relatively high throughput that combines dynamic-precision weights and activations. Our system has high performance, and we make a trade-off between accuracy and power efficiency for unmanned aerial vehicle (UAV) object detection scenarios. We evaluate our system on the Zynq UltraScale+ MPSoC Ultra96 mobile FPGA platform. The target board achieves a real-time speed of 30 fps under 5.6 W, and the FPGA on-chip power is only 0.6 W. The power efficiency of our system is 2.8× better than the best system design on a Jetson TX2 GPU and 1.9× better than the design on a PYNQ-Z1 SoC FPGA.


Introduction
Convolutional neural network (CNN)-based deep learning (DL) algorithms are widely used in autonomous driving, natural language processing, web recommendation systems, etc., and have greatly improved the quality of life of modern society. However, for more intricate tasks, the number of CNN model parameters grows rapidly. Processing a single image usually takes on the order of giga floating-point operations (GFLOP), which is far beyond the computational ability of a central processing unit (CPU) and hard to process in real time. To handle these compute-intensive tasks, researchers leverage the advantages of the graphics processing unit (GPU), such as high bandwidth and thread parallelism. Pioneering GPU chip vendors such as NVIDIA also assist researchers with a systematic and well-designed programming language, CUDA [1], and the NVIDIA CUDA deep neural network (DNN) library, cuDNN [2]. Thus, the NVIDIA GPU can provide higher computational performance and a user-friendly application programming interface (API). Although the GPU's benefit for DNN applications is obvious, its drawbacks are also remarkable: (1) high power dissipation (usually larger than 200 W [3]) limits the applications of deep learning in the industry field; (2) dependence on high-end equipment, which blocks some research from applying deeper networks to achieve higher accuracy.
In this paper, we design a CNN-based accelerator based on binary quantized networks on edge FPGA devices for unmanned aerial vehicle (UAV) object detection. Our design is configurable for different sizes of neural networks and uses pipelined structures for high throughput. We summarize the factors that affect the performance and power consumption of the mobile FPGA accelerator through different experiments. This paper has the following contributions:
1.
We propose a CNN structure named the adaptive pointwise (PW) convolution and 2D convolution joint network (AP2D-Net), which combines dynamic-precision activations and binary weights with a PW convolution plus 2D convolution structure and works on resource-limited mobile platforms (such as the Zynq UltraScale+ MPSoC Ultra96 or PYNQ-Z1/Z2). The architecture can also be configured on other FPGA platforms according to the available hardware resources to achieve the highest throughput.
2. For the feature extraction layers, we simplify the convolution operation by using the XOR gate to remove the multiplication operation. Besides, we use offline preprocessing to combine batch normalization (BN) and the scale/bias (SB) operation to optimize the computational kernels.
3. To get high bandwidth, we use the advanced extensible interface (AXI4) protocol to communicate between the programming logic (PL), the programming system (PS), and memory. Furthermore, we propose a multi-processing scheme to optimize the heterogeneous system and reduce the latency between PL and PS.
4. We conduct a set of experiments considering running speed, hardware resources, object detection accuracy, and power consumption to find the best combination. The code for training, inference, and our system demo is available online and is open source [9].
The rest of the paper is organized as follows. In Section 2, we introduce the background and related works. Section 3 describes the AP2D-Net modeling design, the hardware system design, and optimization. Section 4 presents the experimental setup and results. Finally, Section 5 concludes this paper.

Related Work
In the past few years, there have been many different FPGA designs for CNN accelerators [3,5]. These designs can be divided into three categories: optimization of the computational kernels, bandwidth optimization, and optimization of CNN models.

Optimization of the Computational Kernels
There are mainly three convolution algorithms used in CNNs: general matrix multiplication (GEMM), fast Fourier transform (FFT), and Winograd [31]. Cong et al. [10] provided an algorithmic modification to reduce the GEMM computational workload by 22% on average. However, this work only focused on the GEMM part and ignored the bandwidth design for external memory. Zeng et al. [11] used optimized frequency-domain (FFT) convolution instead of standard convolution to increase the processing speed. However, FFT-based convolution is fast for large convolution kernels, while state-of-the-art CNNs use small kernel sizes such as 3 × 3; in these small-filter scenarios, the benefit of FFT is very limited. Some works [12,13] have applied Winograd convolution to replace the traditional convolution operation. The Winograd operation reduces the multiplication operations but increases add/subtract operations and needs extra transform time. Since add/subtract operations are much faster than multiplications, the overall time can be reduced. Aydonat et al. [12] used the Winograd transformation to boost performance on FPGA, achieving a peak performance of 1.3 trillion floating-point operations per second (TFLOPS) for some specific operations such as the fully-connected (FC) layer. However, in modern DNN applications, especially object detection, the FC layer has low working efficiency and might cause overfitting. Furthermore, this work [12] only evaluated the performance on AlexNet [14], whose DNN structure is primitive. In our optimized GEMM method, we adopted BNN [6,7] and used the XOR operation to replace multiplication to increase the GEMM computational speed.

Bandwidth Optimization to Improve Throughput
Suda et al. [15] used quantized 16 bit fixed-point operations to improve the throughput. However, the throughput was only 117.8 giga operations per second (GOPS), which was still far from the real-time requirement. Li et al. [16] applied parallel operations on convolution layers and a batch-based computing method on FC layers to process multiple input images in parallel. However, this method is not suitable for video sequence applications, which have a temporal input order. Motamedi et al. [17] provided a method to generate a feasible weight file for FPGA. Nurvitadhi et al. [18] used special I/O between the CPU and FPGA to accelerate the detection process. Some works [17,20] applied the roofline model to trade off computing throughput against required memory bandwidth to maximize the utilization of FPGA resources. Still, the performance was only 61.62 giga floating-point operations per second (GFLOPS) in the work of Zhang et al. [20] and 84.2 GFLOPS in the work of Motamedi et al. [17], respectively, which was far below real-time requirements. Gokhale et al. [21] proposed a CNN accelerator that could achieve a peak performance of 200 GOPS. The architecture only included three primary modules: convolution, subsampling, and nonlinear functions, which is not suitable for complex tasks. Rahman et al. [25] proposed an input-recycling convolutional array of neurons (ICAN) architecture implemented with 32 bit fixed-point MAC units, which achieved 147.82 GOPS throughput on a Virtex-7 FPGA. Qiu et al. [26] used a dynamic precision data quantization method and applied singular value decomposition (SVD) to the weight matrix of the FC layer to reduce resource consumption. Besides, they proposed a data arrangement method to improve the utilization of external memory bandwidth. However, the frame rate was only 4.45 fps, far below real-time requirements. Guo et al. [27] implemented a programmable and flexible CNN accelerator architecture, together with a data quantization strategy and compilation tool, which obtained 137 GOPS throughput on a Zynq XC7Z045. Some works [24,28] proposed a quantitative model for mapping CNNs to an FPGA cluster. A multi-FPGA architecture can improve throughput; however, the power dissipation also increases drastically. Guan et al. [29] proposed an end-to-end framework that takes TensorFlow-described DNNs as input and generates the hardware implementation with register-transfer level/high-level synthesis (RTL-HLS) hybrid templates.

Model Optimization
Recently, Fujii et al. [22] used the pruning technique on the FC layer to remove 39.8% of the parameters in the FC layers, which only increased the throughput of the FC layers marginally and did not help much to improve the overall throughput. In fact, due to the low efficiency of the FC layer, it has disappeared from most advanced modern CNN models. Hailesellasie et al. [23] implemented a reduced-parameter CNN on a Xilinx ZYNQ FPGA, but there still existed a large gap between structure optimization and cross-platform portability. Bai et al. [30] adopted depthwise separable convolution to replace standard convolution on embedded platforms, which could reduce the operations and parameters. Gong et al. [5] proposed an optimal paralleling scheme and a balanced pruning-based method to reduce the redundancy of FC layers, which obtained 80 GOPS on a Zynq Z7020 FPGA.

Binary Neural Networks
For GPU implementations, most data types are floating-point numbers. In FPGA implementations, we usually use 8 bit or 16 bit quantized data types. However, even fixed-point data place a heavy burden on on-chip memory storage. BNN [6,7] gives a solution to reduce the memory size. An extreme case of BNN represents both input feature maps (IFMs) and weights as binary numbers. In this case, the convolution operation in BNN can be replaced by the XNOR operation and popcount. However, this case introduces a large accuracy drop. Instead, we usually use a bit width larger than 1 bit to represent IFMs or output feature maps (OFMs) to keep an acceptable accuracy. Some BNN solutions also give a way to train the DNN on FPGA; e.g., DoReFa-Net [32] gives a possible solution to train neural networks on FPGA based on quantized data types. However, quantizing the gradient to reduce the training time would introduce an unacceptable loss of precision. In our solution, we used dynamic precision weights for different layers and dynamically quantized activation precision to get the best combination of accuracy and speed.

Implementation Methodologies
For most research, CNN-based FPGA methodologies can be divided into two categories: the register-transfer level (RTL) design methodology [33,34] and the high-level synthesis (HLS) methodology. RTL refers to programming through hardware description languages (HDLs) such as VHDL and Verilog. This method costs a longer development period and introduces more complexity during programming, but it makes good use of the FPGA resources without any redundancy and gives better cycle-level control of sequential logic. HLS refers to C/SystemC-like synthesis, including the OpenCL method (for both Intel and Xilinx FPGAs) and Vivado HLS (for Xilinx FPGAs only). It generates RTL code automatically from C/C++ source code. Most deep learning FPGA designs are developed using an HLS method [12,15,35,36]. This method can shorten the design period and makes it easy to simulate and modify different designs. Xilinx Vivado HLS supplies advanced HLS tools such as HLS DNN intellectual property (IP), HLS linear algebra, and HLS digital signal processors (DSPs), which makes the design of a CNN FPGA accelerator much easier. Thus, more and more designs are based on Xilinx FPGAs.
Even though recent research has achieved prosperous results, most designs only work on high-end FPGAs. These platforms have some drawbacks: (1) High price: High-end FPGAs usually cost from thousands to tens of thousands of U.S. dollars, which is not suitable for mass production in industrial and civil applications. (2) High power consumption: A high-end FPGA usually dissipates dozens of watts when running. This is relatively lower than traditional GPUs, but still not low enough. Some FPGAs without an ARM core have to work with an external central processing unit (CPU) through the PCI-E port, which introduces even more power consumption. (3) Large size: High-end FPGAs are usually larger than a tablet and inconvenient to carry. The Xilinx Zynq SoC platform changes the FPGA development flow: each FPGA is connected with an integrated ARM-based programming system (PS) without any external CPU, and the communication between the PS and the FPGA programming logic (PL) is very convenient. In addition, it supports the PYNQ framework, which is used to design embedded systems using the Python language and libraries. It can control the PS/PL and data movement and simplifies memory-mapped IO, GPIO, DMA data movement, and hardware memory allocation.
In this work, we developed a UAV object detection system for real-time, high-accuracy, and low-power applications, combined with RTL IPs such as DMA, AXI4-Stream, and DSP, to design our CNN FPGA accelerator on the Zynq UltraScale+ MPSoC Ultra96 development board. The Ultra96 consists of an ARM Cortex-A53 running at 1.5 GHz and a ZU3 FPGA. The Ultra96 board is inexpensive (249 U.S. dollars in 2019), and at 85 mm × 54 mm (smaller than a 4.7 inch smartphone), it is portable. The total power consumption (including the standby power and working power) is 6.6 W, and the FPGA active power is only 0.6 W. In Section 3, we introduce the proposed system design.

Proposed System Architecture and IP Block Design
There are two main challenges involved in the architectural design of CNN accelerators on FPGA: (1) the latency of fetching data from global memory to FPGA on-chip memory is a bottleneck in the design; (2) the hardware resources on FPGA are limited.
To overcome the data communication limitation between FPGA and CPU, we used the advanced extensible interface (AXI) direct memory access (DMA) controller to transfer data streams between the FPGA and DDR memory. Input and output streams are the data format transferred in this design. The limited hardware resources on FPGA create a trade-off between parallel/pipeline throughput and hardware cost.
We used AXI DMA for high-bandwidth direct memory access between the AXI4 memory-mapped (MM) and AXI4-Stream IP interfaces [37]. The proposed system block design is shown in Figure 1. The system processor, or PS, has access to the AXI DMA through the AXI4-Lite interface for control. The M_AXI_MM2S port (M: master; MM2S: memory mapped to stream) of the AXI DMA connects to the PS's S_AXI_HP port (S: slave; HP: high performance), which allows the AXI DMA to fetch stream data from the PS. The M_AXI_S2MM port (S2MM: stream to memory mapped) of the AXI DMA connected to the PS is used to write the stream data from the AXI DMA to the PS. S_AXI_HP0 and S_AXI_HP1 are standard AXI high-performance interfaces in the PS, which are used for the write and read channels. The prefix S stands for "slave", which means the PL master can access (read or write) the PS. The AXI DMA is connected through the AXI interconnect in the system.
The AXI DMA block diagram is shown in Figure 2. The AXI DMA has two channels: MM2S and S2MM. High-speed DMA data movement from memory to a stream target is performed through the AXI4 MM read to the AXI4 MM2S master, whereas data movement from a stream target to memory is performed through the AXI4 S2MM slave to the AXI4 MM write. A logic controller controls the sending of data through the MM2S channel to the target IP; the same controller controls the receiving of data from the target IP through the S2MM channel. In Sections 3.2-3.4, we will introduce the modeling of AP2D-Net, the corresponding hardware design on FPGA, and the optimization of the heterogeneous system, respectively.

AP2D-Net Modeling of the CNN-Based FPGA Accelerator
In order to achieve high accuracy, a standard 2D convolution operation with a sufficient number of parameters is performed to extract the features. However, a large number of parameters also introduces a heavy burden on bandwidth and memory. Therefore, pointwise (PW) convolution is utilized to keep the number of parameters on the target device limited while maintaining the speed and accuracy requirements. Pointwise convolution is also called 1 × 1 convolution or network in network (NIN) [38] and is used to reduce the dimension of the feature map (FM). In this paper, a combination of pointwise convolution and standard 2D convolution was proposed by concatenating the two and generating a new structure named the P2D structure. In addition, we applied dynamically quantized activation precision in the P2D structure, yielding AP2D-Net. This is discussed as follows.
Structure of AP2D-Net
Figure 3 illustrates the structure of AP2D-Net. The model consisted of two parts: the first 15 convolution layers were used to extract the features, and the last three convolution layers were used for object classification and bounding box (BB) regression. The detection module would choose and output the object and location with the highest confidence score. In this model structure, the shape of the input image was 640 × 360 × 3.
The input image stream was resized to 224 × 224 × 3 before sending to PL. Here, we set up the OFM of P2D to have 160 channels, and we will analyze the reason in Section 4.4.

Feature Extraction
For the feature extraction portion, the first convolution layer (CONV1 in Figure 3) is a standard 2D convolution, which includes the 2D convolution (CONV), batch normalization (BN), scale/bias (SB), activation (ReLU), and pooling operations. For the next 14 feature extraction layers, the kernels/weights are represented as binary numbers. The standard binarization function is:

w^b = sign(w) = +1 if w ≥ 0, −1 otherwise (1)

That is, the weights are divided into two categories according to whether their values are above or below zero. For the 1 bit representation on FPGA, we used (2) instead of (1) for the implementation:

θ = 0 if w ≥ 0, θ = 1 otherwise (2)
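As an illustrative sketch (the helper name is ours, not part of the authors' implementation), the relationship between the sign binarization in (1) and the 1 bit hardware encoding in (2) can be written as:

```python
def binarize(w):
    """Binarize one weight value.

    Returns (sign_val, bit):
      sign_val follows Eq. (1): +1 for non-negative, -1 for negative
      bit follows Eq. (2): 0 encodes a non-negative weight, 1 a negative one
    (zero is treated as positive, an assumption on our part)."""
    sign_val = 1 if w >= 0 else -1   # Eq. (1)
    bit = 0 if w >= 0 else 1         # Eq. (2)
    return sign_val, bit
```

For example, `binarize(0.7)` gives `(1, 0)` and `binarize(-0.3)` gives `(-1, 1)`, so the stored bit is simply the sign bit of the trained weight.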
For the classification and regression portion, we used 24 bit weights to ensure high accuracy of the final output bounding box coordinates.

Classification and Regression
Different from ImageNet classification, object detection includes two tasks: classification and regression. Classification gives the class to which the object belongs, and regression gives each object's coordinates in the image.
For network training, we applied the loss function from the idea of the YOLO framework [39] for object detection. Because the input image resolution was 224 × 224, after downsampling the input size 16 times, the size of the final OFMs was 14 × 14 (224/16 = 14).
The feature extraction part extracted W × H × 5 features, where W and H are the width and height of the OFMs in the last layer, respectively. In other words, the image was divided into W × H cells, and for each cell, the feature extractor extracted five features. The five parameters of each BB give its center as:

b_x = σ(t_x) + c_x, b_y = σ(t_y) + c_y

where {c_x, c_y} is the top left corner of the cell and {b_x/W, b_y/H} is the center of the BB relative to the whole image. σ(·) denotes the sigmoid function.
The confidence score of each BB is Pr(object) × IOU(b, object), where Pr(object) is the confidence of the object and IOU(b, object) is the intersection over union (IOU) between the predicted BB and the ground truth BB.
The loss function consists of four terms: no-object, object, class, and coordinate loss.
I^{noobj}_{ijk} denotes that the k-th BB in location (i, j) will be penalized if the IOU is lower than a threshold value T. I^{obj}_{ijk} denotes that the k-th BB in location (i, j) is responsible for that prediction. I^{obj}_{ij} denotes whether the object is in location (i, j). IOU(BB_pred, BB_gt) is the IOU between the predicted BB and the ground truth BB. The MSE loss is used for classification in location (i, j). The last term uses the sum of squared error (SSE) to calculate the coordinate loss with the location information. {w_gt, h_gt} is the ground truth size of the BB relative to the whole image. {b_x/W, b_y/H} is the ground truth center coordinate of the BB. {t_w, t_h} is the natural logarithm of the ground truth size relative to the bounding box prior. We define the parameters λ_noobj = 1, λ_obj = 5, λ_class = 1, λ_coord = 1, and the threshold T = 0.5.
In this section, we introduced the proposed structure of AP2D-Net. The system design on FPGA will be discussed in Section 3.3.

AP2D-NET System Design on FPGA
In our AP2D-Net design, we used two convolution operations: standard 2D convolution and PW convolution. Besides the convolution operations, in order to get the final OFMs, the CNN accelerator had to perform batch normalization (BN), the scale/bias (SB) operation, activation (e.g., the rectified linear unit (ReLU)), pooling, etc.

2D Convolution Module Design
The standard 2D convolution function consisted of two units: sliding window unit (SWU) and matrix vector activation unit (MVAU).

Sliding Window Unit
The SWU was used to convert the IFM into a new matrix for dot production and make the convolution operation faster at the expense of more memory usage. We used a sliding window of size D_k^2 scanning progressively line-by-line over the IFM. By using the sliding window, the IFM can be expanded to a matrix with M × D_k^2 rows and ((D_F − D_k)/S + 1)^2 columns, where S is the stride step. As shown in Figure 4a, the IFM had a shape of 3 × 3 × 2, and after applying the SWU, the shape of the new matrix φ was 8 × 4. The kernels (or weights) were expanded to row vectors. Since the kernels had a 4D shape D_k × D_k × M × N, we concatenated each 3D cube D_k × D_k × M into a row vector. Thus, there were N row vectors in total. As shown in Figure 4b, the kernel shape was 2 × 2 × 2 × 2, and the expanded kernels θ had a shape of 2 × 8. Since the kernels were generated after training, we could organize and reshape the kernels offline, and there was no time consumption for kernel expansion during inference. In Figure 4c, we applied the dot product between the matrix φ and kernel θ_n, where n denotes the n-th row in θ. Thus, there were N such matrix multiplication operations. Finally, we obtained the OFMs with the size of ((D_F − D_k)/S + 1)^2 × N. In this process, we could apply the parallel factor vector_width so that the N matrix multiplications take N/vector_width passes. In Figure 4c, the shape of the final OFMs was 2 × 2 × 2.
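To make the expansion concrete, the following pure-Python sketch (our own illustrative code, not the authors' HLS implementation) expands an IFM of shape M × D_F × D_F into the matrix φ with M·D_k^2 rows and ((D_F − D_k)/S + 1)^2 columns described above:

```python
def swu_expand(ifm, d_k, stride=1):
    """Sliding window (im2col-style) expansion.

    ifm: list of M channels, each a D_F x D_F matrix (list of lists).
    Returns phi with M*d_k^2 rows and ((D_F - d_k)//stride + 1)^2 columns."""
    m_ch = len(ifm)
    d_f = len(ifm[0])
    out_dim = (d_f - d_k) // stride + 1
    cols = []
    for oy in range(out_dim):            # one column per output position
        for ox in range(out_dim):
            col = []
            for m in range(m_ch):        # concatenate all channels' windows
                for ky in range(d_k):
                    for kx in range(d_k):
                        col.append(ifm[m][oy * stride + ky][ox * stride + kx])
            cols.append(col)
    # transpose so that rows = M*d_k^2 and columns = out_dim^2
    return [list(row) for row in zip(*cols)]
```

For the 3 × 3 × 2 IFM of Figure 4a with a 2 × 2 kernel and stride 1, this yields an 8 × 4 matrix, matching the shape of φ given in the text.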

Matrix Vector Activation Unit
The MVAU performs the matrix vector unit (MVU) operation and then applies the ReLU function for the activation operation.
As shown in Figure 5, the MVU multiplies the expanded kernels θ with the IFM matrix φ and generates the OFMs. The time complexity of the MVU operation is O(N × M × D_k^2 × ((D_F − D_k)/S + 1)^2). In the MVU, we applied the parallel factors vector_width and vector_depth. The vectorized factor vector_depth was applied over the M IFMs, which helped to reduce the MVU processing time by vector_depth times. The vectorized factor vector_width was applied over the N OFMs to generate multiple instances of different kernels. The overall MVU operation could be sped up by vector_depth × vector_width times.
There was a trade-off between processing capacity and hardware resources or design area. The parallelism parameters vector_width and vector_depth were configured according to the resource usage. In the MVU, ⊕ indicates the XOR operation. Since the weights were binary numbers, the multiply operation could be replaced by addition and subtraction.
Let us recall (2): θ = 0 indicates that the binary kernel value is a positive number, while θ = 1 means that the kernel is a negative value. In standard BNN, the positive binary number is represented as +1, and the negative kernel is −1. Thus, we can use XOR to get the correct MAC results in (8).
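Under the 0/1 encoding of (2), the multiplier-free MAC of (8) can be sketched as follows (illustrative Python standing in for the RTL; the function name is ours):

```python
def binary_mac(activations, theta):
    """MAC with binary weights: per Eq. (2), a theta bit of 0 encodes +1
    and 1 encodes -1, so each multiply collapses into an add or a subtract
    selected by the bit (in hardware, an XOR on the sign path plus an adder)."""
    acc = 0
    for a, t in zip(activations, theta):
        acc += a if t == 0 else -a
    return acc
```

For example, `binary_mac([3, 1, 2], [0, 1, 0])` returns 3 − 1 + 2 = 4, the same result as multiplying by the +1/−1 weights of the standard BNN formulation.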
Batch normalization (BN) allows us to use higher learning rates to speed up the training process and avoids the gradient vanishing and gradient exploding problems. It also acts as a regularizer, eliminating the need for dropout [40], and obtains a better detection rate. In the BN layer, we normalized each scalar feature independently. For a layer with d-dimensional input x = (x^(1), ..., x^(d)), we normalized each dimension:

x̂^(k) = (x^(k) − E[x^(k)]) / √(Var[x^(k)] + ε) (9)

The parameters E[x^(k)] and Var[x^(k)] are statistically acquired during the training process, and ε is a hyper-parameter. As shown in (9), calculating a batch normalization operation would require two adders (or subtractors), one divider, and one square root operation in hardware. This operation not only costs DSPs and LUTs on FPGA, but also increases latency. In addition, transferring the expectation and variance parameters between external memory and FPGA on-chip memory would introduce a large delay. To solve this problem, we combined batch normalization and the scale/bias (SB) operation together.
In the SB layer, we introduced a pair of parameters (γ^(k), β^(k)), which scaled and shifted the normalized value:

y^(k) = γ^(k) x̂^(k) + β^(k) (10)

We can combine (9) and (10) as:

y^(k) = Ψ x^(k) + Ω (11)

where Ψ = γ^(k)/√(Var[x^(k)] + ε) and Ω = β^(k) − γ^(k) E[x^(k)]/√(Var[x^(k)] + ε). Since all of the parameters (such as variance, expectation, scale, and bias) were collected in the training process, we could pre-compute Ψ and Ω offline, which could greatly increase the speed of the CNN FPGA accelerator. In addition, the offline pre-processing could reduce the DSP and LUT resources and arithmetic logic unit (ALU) power consumption during inference. As shown in Figure 6, the MVAU consisted of three parts: MVU, BN and SB, and ReLU.
1. MVU: This module generated the results after the XOR operation between IFMs and kernels. We could apply the parallel mechanism to IFMs and OFMs to reduce the processing time on FPGA.
2. BN and SB: This module received the MVU results, multiplied them by the pre-computed scale parameter Ψ, and added the bias parameter Ω.
3. ReLU: This module received the data after SB and applied the ReLU function in (12).
After the MVAU, the output stream would be sent out for the next layer's computation. Figure 7a illustrates the architecture of standard 2D convolution. There were mainly three parts in 2D convolution: padding, SWU, and MVAU, where the padding was optional. For standard 2D convolution without padding, the matrix size of the OFM would shrink after each layer, which meant the spatial information at the boundaries would be lost. To preserve low-level features, we recommend applying the padding operation before 2D convolution. Zero padding of size (D_k − 1)/2 was applied around the border of the IFM. The SWU was used to reorganize the IFM according to the sliding window (kernel) size and prepare for faster convolution in the MVAU, and the MVAU output the OFM stream. Figure 7b shows the structure of pointwise convolution. The 1 × 1 convolution would not affect the matrix size of the OFM; thus, padding was eliminated. Furthermore, there was no need for the SWU to reorganize the IFM. Finally, pointwise convolution only had the MVAU module. In Figure 7c, we concatenated pointwise convolution and 2D convolution and grouped them as a module called the P2D module. In this design, we used a multiplexer (MUX) and a demultiplexer (DEMUX) to switch among the branches according to different layers.
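The offline folding of the BN and SB steps in (9) and (10) can be sketched as follows (our own illustrative code; the parameter names are ours):

```python
import math

def fold_bn_sb(gamma, beta, mean, var, eps=1e-5):
    """Pre-compute the folded scale Psi and bias Omega offline so that
    inference only needs one multiply and one add per value:
    y = Psi * x + Omega, replacing BN (Eq. 9) plus scale/bias (Eq. 10)."""
    psi = gamma / math.sqrt(var + eps)
    omega = beta - psi * mean
    return psi, omega
```

At inference time the accelerator then applies only `psi * x + omega`, which is numerically identical to running (9) followed by (10) but removes the divider and square-root hardware from the datapath.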

SWU
The algorithm of this design is shown in Algorithm 1. The input buffer was used to store the input image (for the first layer) or the IFM stream (for other layers). The output buffer was used to store the object and BB stream (for the last layer) or the OFM stream (for other layers). There were L layers in the feature extraction portion of AP2D-Net. The AP2D-Net accelerator would work in a different manner according to the layer number.

Optimization on a Heterogeneous System
For a heterogeneous system, the framework works on both the PS and PL sides. As shown in Figure 9, the AP2D-Net accelerator engine was working on PL, while the image read/resize and the final detected object and BB coordinate calculations were working on PS. In a standard heterogeneous workflow, each engine starts when the previous operation feeds its output. However, when PL was working, the PS entered a sleep mode, which introduced a large latency into the workflow. To solve this problem, we used a multi-processing scheme to fill in the gap on PS while PL was working. Theoretically speaking, the performance of multi-threading and multi-processing would be quite similar. The reason we used multi-processing instead of multi-threading lies in the characteristics of the Python Global Interpreter Lock (GIL). The GIL is a lock that allows only one thread to hold control of the Python interpreter, which means that even in a multi-threading architecture with multiple CPU cores, only one thread can execute at a time. I/O-bound programs usually have to spend a large amount of time waiting for input/output. With the multi-processing approach, each process gets its own Python interpreter and memory space, so the tasks can execute in parallel. As depicted in Figure 9a, the total latency is:

T_total = T_read_PS + T_AP2D + T_write_PS (13)

where T_read_PS is the image read/resize time on PS, T_AP2D is the CNN engine working time on PL, and T_write_PS is the BB calculation and write time on PS. After applying the multi-processing scheme, the image read/resize operation can work simultaneously while the AP2D engine is working. The optimized average latency is:

T_total = max(T_read_PS, T_AP2D) + T_write_PS (14)

The relationship between T_read_PS and T_AP2D is uncertain and depends on the secure digital (SD) card read speed, which varies from 12.5 MB/s to 985 MB/s. If T_read_PS ≥ T_AP2D, the optimized latency is T_read_PS + T_write_PS. If T_read_PS < T_AP2D, the latency is T_AP2D + T_write_PS.
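A small model of the two latency expressions above (our own sketch, not the authors' code) makes the benefit of overlapping the PS read with the PL engine explicit:

```python
def pipeline_latency(t_read_ps, t_ap2d, t_write_ps, multiprocessing=True):
    """Average per-frame latency of the PS/PL workflow.

    Sequential workflow: read + accelerator + write.
    With the multi-processing scheme, the read/resize overlaps the AP2D
    engine, so only the slower of the two contributes to the latency."""
    if multiprocessing:
        return max(t_read_ps, t_ap2d) + t_write_ps
    return t_read_ps + t_ap2d + t_write_ps
```

For example, with a 20 ms read, a 30 ms accelerator pass, and a 5 ms write, the sequential workflow takes 55 ms per frame, while the overlapped scheme takes 35 ms; if a slow SD card pushes the read to 40 ms, the read dominates and the overlapped latency becomes 45 ms.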
The experimental results and performance evaluation are discussed in Section 4.

Performance Evaluation and Experimental Results
To apply our system on UAV applications, we needed to train the UAV dataset [9] on our AP2D-Net framework to get the weight file.

Dataset
The dataset [9] had 12 categories for UAV applications. As shown in Figure 10, there were several challenges in the dataset. (1) Since it was a UAV scenario, most of the objects that needed to be detected only took a tiny portion (1-2%) of the whole 640 × 360 image (as shown in Figure 10a,b). The detection accuracy was very sensitive to such small sizes. (2) The detection task only focused on a specific object if there were multiple objects; e.g., in Figure 10c, there is a group of people in the captured image, but only one specific object needs to be detected and localized. In Figure 10d, only a specific cyclist needs to be detected. (3) There were obstacles in the images; e.g., the horse rider in Figure 10e is covered by the caption, the person in Figure 10f is partially covered by the tree, etc. (4) The lighting was different. As shown in Figure 10g,h, some scenarios were captured in low light.

Training
To train binary weights during back propagation, we only binarized the weights. The gradients were still real values (floating-point numbers) because the changes of gradients were very tiny. The training would not be successful if we also binarized the gradients. After updating the trained weights, the new value could be binarized since the forward process only needed binary weights.
Before training on the dataset, we used training enhancement methods such as that of Zhang et al. [41], who applied visually coherent image mix-up to the object detection task; the mix-up method increases the generalization of the training result. In addition, the distribution of the dataset was unbalanced (as shown in Table 3): the categories "person" and "car" took up more than 20%, while the other categories accounted for only a small portion. To mitigate this imbalance, we added data by copying the samples of every category whose share was below 10%, making the dataset more balanced.
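The duplication step can be sketched as follows (a simplified sketch; the 10% threshold is from the text, while the helper names are ours):

```python
from collections import Counter

def oversample_minorities(labeled, threshold=0.10):
    """Copy every sample whose category share is below the threshold."""
    counts = Counter(label for _, label in labeled)
    total = len(labeled)
    minority = {c for c, n in counts.items() if n / total < threshold}
    extras = [(img, lbl) for img, lbl in labeled if lbl in minority]
    return labeled + extras

# 18 "person", 1 "cow", 1 "dog": the two 5% categories get duplicated
data = [("img", "person")] * 18 + [("img", "cow"), ("img", "dog")]
print(len(oversample_minorities(data)))  # 22
```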

Evaluation Criteria
Bounding box evaluation: For the object detection task, intersection over union (IOU) is an evaluation metric that measures the overlap of bounding boxes. To measure the detection accuracy, the overlap $a_0$ between the predicted bounding box $B_p$ and the ground truth bounding box $B_{gt}$ is calculated by (15):

$$a_0 = \frac{area(B_p \cap B_{gt})}{area(B_p \cup B_{gt})} \qquad (15)$$

where $B_p \cap B_{gt}$ denotes the intersection of the predicted and ground truth BBs, and $B_p \cup B_{gt}$ is their union.
Suppose there are K evaluation images in the dataset and $IOU_k$ is the IOU score for the k-th image. The average IOU is calculated by:

$$\overline{IOU} = \frac{1}{K}\sum_{k=1}^{K} IOU_k \qquad (16)$$

The energy consumption is computed by:

$$E = P \times T \qquad (17)$$

where P is the unit power cost on the FPGA and T is the total time spent during inference. Here, E is a criterion considering both power and speed, and obviously, E should be as low as possible. Let $\bar{E}$ be the average energy consumption of the different FPGA accelerator designs. The energy consumption score ES is calculated by:

$$ES = 1 + \log_2 \frac{\bar{E}}{E} \qquad (18)$$

According to (18), if the energy consumption of our design is lower than $\bar{E}$, then $\log_2 \frac{\bar{E}}{E}$ is positive, which increases the ES value. On the contrary, if a design costs too much power, the log term has a negative effect on ES. The final evaluation score TS is:

$$TS = \overline{IOU} \times (1 + ES) \qquad (19)$$

In (19), TS combines the accuracy, speed, and power together and gives an overall evaluation.
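The IOU metric and the per-image averaging above can be sketched directly (boxes as (x1, y1, x2, y2) corner tuples; the helper names are ours):

```python
def iou(box_p, box_gt):
    """Intersection over union of two axis-aligned bounding boxes."""
    ix1, iy1 = max(box_p[0], box_gt[0]), max(box_p[1], box_gt[1])
    ix2, iy2 = min(box_p[2], box_gt[2]), min(box_p[3], box_gt[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # overlap area
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    union = area_p + area_gt - inter
    return inter / union if union else 0.0

def average_iou(preds, gts):
    """Mean IOU over the K evaluation images."""
    return sum(iou(p, g) for p, g in zip(preds, gts)) / len(gts)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, about 0.1429
```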

AP2D-Net Modeling
There was a trade-off between the accuracy and the modeling parameters. To find the best combination, we conducted a set of experiments with dynamic quantized activation precision and other modeling parameters such as the network channel sizes and the number of layers. The data width of the optional data realignment engine for the AXI DMA stream was up to 512 bit. To guarantee that the stream would not exceed 512 bit, the model followed the constraint in (20):

$$A_b \times C_{max} \leq 512 \qquad (20)$$

where $A_b$ is the activation quantization bit width and $C_{max}$ is the maximum channel number in each layer. We list the results in Table 4. First, we set the activation quantization precision to 5 bit with the same channel numbers; the only difference was the model depth, which was 15, 17, and 19, respectively. According to Table 4, increasing the model depth did not help to increase the accuracy. We attribute this to the fact that the BNN has only a 1 bit representation of the weights, and due to this low bit representation, the training of the BNN suffers from the vanishing gradient problem. Secondly, we reduced the activation quantization precision from 5 bit to 4, 3, and 2 bit, respectively, and gradually increased the channel numbers of the pointwise convolution and 2D convolution. We found a trade-off between the activation quantization precision and the number of channels: increasing the channel numbers improves the features' representation ability, which increases the model's accuracy, while reducing the activation quantization precision decreases the accuracy. We obtained the peak value when the activation quantization precision was 3 bit with 128 pointwise channels and 160 2D convolution channels. As shown in Figures 11 and 12, the best IOU was 0.75 for training and 0.55 for validation. There are solid lines and ghost lines in Figures 11 and 12; the ghost lines are the original data, which jittered with large amplitude, making the trends hard to observe.
We used a smoothing factor of 0.6 to make the curves more stable and obtain clear IOU results. In the legend, A/P/D/L stand for the activation quantization precision, pointwise channel number, 2D convolution channel number, and layer number, respectively. We used 90,000 images for training, and it took 100k epochs in total.
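Assuming the stream-width constraint takes the form $A_b \times C_{max} \leq 512$ (our reading of the constraint; the function name is ours), the largest channel count per activation precision can be checked quickly:

```python
def max_channels(act_bits, stream_bits=512):
    """Largest channel count whose packed activations fit one 512 bit beat."""
    return stream_bits // act_bits

for bits in (2, 3, 4, 5):
    print(bits, max_channels(bits))
# 2 bit -> 256, 3 bit -> 170, 4 bit -> 128, 5 bit -> 102
```

The selected configuration (3 bit activations, 160 2D convolution channels) satisfies the constraint, since 3 × 160 = 480 ≤ 512.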
After obtaining the model with the highest IOU, we wanted to deploy it in the most power-efficient way. Higher speed increases the power consumption, and we found a trade-off between the PL working frequency and the energy consumption.

Trade-Off between Working Frequency and Energy Consumption
As shown in Figure 13, we synthesized the PL under a range of clock frequencies from 125 MHz to 500 MHz. (The quad-core ARM Cortex-A53 CPU ran at 1.5 GHz. We used the clock divisors 12, 10, 7, 5, 4, and 3 to obtain working frequencies of 125 MHz, 150 MHz, 214.3 MHz, 300 MHz, 375 MHz, and 500 MHz, respectively. The PS side only supported a frequency of up to 333.333 MHz; however, the PL supported synthesis at 500 MHz, so our system could work at up to 500 MHz.) From Figure 13, we observe that the power dissipation increased linearly with the synthesis frequency, which is consistent with (21):

$$P_{dynamic} = \alpha C V^2 f \qquad (21)$$

where α is the activity factor, C is the capacitance, V is the supply voltage, and f is the clock frequency. The power dissipation increased from 4.8 W to 6.5 W, while the speed increased from 23.5 fps to 33.7 fps. According to (17), we could compute the energy value at each working frequency.
The energy dissipation reached its minimum of 0.1831 Joules at 300 MHz, where the speed was 30.53 fps and the power was 5.59 W. Thus, we ran this design at 300 MHz. It is worth mentioning that the reported power is the total power dissipation of the Ultra96 board; when the program was not running, the standby power was 5 W, which means the active power cost was only 0.59 W. Table 5 lists the hardware usage of our AP2D-Net system on the FPGA when we set the parallelism factors vector_depth = 32 and vector_width = 8. Specifically, for the first layer IFMs, vector_width = 3 because the first layer IFMs have only three channels. DSPs are dedicated floating/fixed-point arithmetic units such as multiply-accumulators. Block RAMs are small embedded memory blocks on the FPGA. LUTs and registers build and route arbitrary topologies for the programmable logic.
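Using the three operating points quoted above (assuming the 4.8 W/23.5 fps and 6.5 W/33.7 fps endpoints correspond to 125 MHz and 500 MHz), the per-frame energy E = P/fps can be checked directly:

```python
# (frequency in MHz) -> (board power in W, throughput in fps), from the text
points = {125: (4.8, 23.5), 300: (5.59, 30.53), 500: (6.5, 33.7)}

energy = {f: p / fps for f, (p, fps) in points.items()}  # Joules per frame
best = min(energy, key=energy.get)
print(best, round(energy[best], 4))  # 300 MHz gives the minimum, 0.1831 J
```

The higher frequencies buy extra throughput but cost disproportionately more power, which is why the per-frame energy bottoms out in the middle of the range.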

Comparison with FPGA/GPU Designs
We compared our design with the state-of-the-art CNN FPGA accelerator designs listed in Table 6. The works in [5,27] used the Zynq XC7Z020 FPGA platform, which has hardware resource capacities similar to those of our Ultra96 SoC FPGA platform, but they only obtained throughputs of 84.3 GOPS and 80.35 GOPS, respectively. Ma et al. [19] achieved a throughput of 715.9 GOPS on an Intel Arria 10 FPGA, far above ours, because the BRAM and DSP capacities of the Arria 10 are ∼13× and ∼5× larger than those of our device, respectively. Such a high-end FPGA collaborating with an external CPU introduces high power consumption, which decreases the overall power efficiency; in addition, it is not suitable for mobile applications. Bai et al. [30] implemented MobileNet V2 on an Arria 10 SoC and obtained a throughput of 170.6 GOPS, but their usage of BRAMs and DSPs was far above our device's capacity. Geng et al. [24] achieved a throughput of 290 GOPS by using a high-end device that cost ∼7× more DSPs, with a power efficiency of 8.28 GOPS/W. The work of Guan et al. [29] achieved 364.4 GOPS with a power efficiency of 14.57 GOPS/W.
The power efficiency of our proposed design reached 23.3 GOPS/W, which outperformed the previous works. It is worth mentioning that the throughput of our design was 130.2 GOPS because the hardware resources are very limited on this mobile FPGA platform: the Ultra96 is just a tiny mobile SoC FPGA with only 1/5 to 1/4 of the resources of traditional high-end FPGAs. Nevertheless, our throughput outperformed the designs on comparable devices such as those in [5,27].

Table 7 lists the results for the UAV object detection task compared with our proposed AP2D-Net structure on mobile platforms. ICT-CAS and TGIIF are the most advanced designs on a mobile GPU and a mobile FPGA, respectively [42]. ICT-CAS uses the YOLO model as its base structure; it achieves an IOU of around 0.69 because it uses high accuracy floating-point inference on the GPU and hard example mining during training. Due to the limited on-chip memory and bandwidth, the CUDA-based CNN model on the GPU is hard to transplant to our mobile FPGA board. The drawback of the Jetson TX2 is its high power consumption, which results in a power efficiency of 1.95 fps/W. TGIIF uses an 8 bit quantized data type and a closed-source Verilog IP; besides, it prunes and fine-tunes the network so that it has fewer parameters. The IOU of our design was 0.55 because of the accuracy loss of the binary data type. However, our structure ran at 30 fps, and its power efficiency was 5.46 fps/W, which is 2.8× better than the Jetson TX2 GPU solution and 1.9× better than the PYNQ-Z1 FPGA design. These results show that our FPGA system design outperformed the most advanced mobile GPU and FPGA designs with respect to power efficiency.

Conclusions
In this paper, we presented AP2D-Net, an ultra-low power, real-time system design for CNN-based unmanned aerial vehicle applications on the Ultra96 SoC FPGA platform. The system obtained a real-time speed of 30 fps under 5.6 W, with an active power of only 0.6 W. The power efficiency of our system was 2.8× better than the best system design on the Jetson TX2 GPU and 1.9× better than the design on the PYNQ-Z1 SoC FPGA. Our system can be applied in industrial and civil fields such as road object recognition for autonomous vehicles or drones. In the future, more advanced systems will be designed for high accuracy, low power applications on SoC FPGAs.