FPGA Implementation for CNN-Based Optical Remote Sensing Object Detection

In recent years, convolutional neural network (CNN)-based methods have been widely used for optical remote sensing object detection and have shown excellent performance. Some aerospace systems, such as satellites or aircraft, need to adopt these methods to observe objects on the ground. Because logic resources and power are limited in these systems, an embedded device is a good choice for implementing the CNN-based methods. However, it is still a challenge to strike a balance between performance and power consumption. In this paper, we propose an efficient hardware-implementation method for optical remote sensing object detection. First, we optimize the CNN-based model for hardware implementation, which establishes a foundation for efficiently mapping the network onto a field-programmable gate array (FPGA). In addition, we propose a hardware architecture for the CNN-based remote sensing object detection model. In this architecture, a general processing engine (PE) is proposed to implement multiple types of convolutions in the network using a uniform module. An efficient data storage and access scheme is also proposed, which achieves low-latency calculations and a high memory bandwidth utilization rate. Finally, we deployed the improved YOLOv2 network on a Xilinx ZYNQ xc7z035 FPGA to evaluate the performance of our design. The experimental results show that the performance of our implementation on an FPGA is only 0.18% lower than that on a graphics processing unit (GPU) in mean average precision (mAP). At a 200 MHz working frequency, our design achieves a throughput of 111.5 giga-operations per second (GOP/s) with a 5.96 W on-chip power consumption. A comparison with related works demonstrates that the proposed design has clear advantages in terms of energy efficiency and that it is suitable for deployment on embedded devices.


Introduction
Object detection is an important research topic in remote sensing image processing. Object detection in optical remote sensing images aims to predict the location of objects belonging to the categories of interest in a given remote sensing image [1]. Drawing upon recent advances in computer vision, many researchers have applied CNN-based methods to remote sensing object detection applications, such as environmental monitoring [2], intelligent transportation [3], and other vital applications [4][5][6][7]. Traditional remote sensing image processing systems need to download images from satellites to the ground station for processing and analysis. However, with the development of remote sensing technology, the resolution and the data volume of optical remote sensing images are constantly increasing. The increasing volume of image data puts high pressure on the data downlink [8,9]. The processing delay of these systems may be too long to meet timeliness requirements [10,11]. Thus, many works have constructed an onboard remote sensing system to

• We optimized an improved YOLOv2 network for hardware implementation. The optimization mainly includes three aspects: network quantization, layer fusion, and a unified implementation of multiple convolutions. Through these optimization methods, we effectively reduced the scale of the network while maintaining detection accuracy.

• We proposed a hardware architecture for the CNN-based remote sensing object detection model. In this architecture, a general PE is proposed to implement multiple types of convolutions in the network. An efficient data storage and access scheme is also proposed, which achieves low-latency calculations and a high memory bandwidth utilization rate.

• We implemented the optimized network on the Xilinx ZYNQ xc7z035 FPGA to evaluate the performance of the proposed hardware architecture. The experimental results on the Dataset for Object Detection in Aerial Images (DOTA) [31] illustrate that the performance of our implementation on an FPGA is only 0.18% lower than that on a GPU in mean average precision (mAP). A comparison with related works demonstrates that our design can strike an excellent balance between resource consumption and computing time cost.

Background
Several CNN-based methods, such as Region-CNNs (R-CNNs) [32], the Single Shot MultiBox Detector (SSD) [33], and you-only-look-once (YOLO) [34], have been proposed for object detection. YOLO offers an excellent trade-off between accuracy and speed compared with other approaches [29,35]. In Reference [4], an improved YOLOv2 network was proposed for remote sensing object detection. This network adopted dilated convolution and transposed convolution to improve performance for multiscale objects in complex optical remote sensing scenes, and it strikes a balance between model complexity and object detection performance. In this paper, we took the improved YOLOv2 as the fundamental network. The structure of the fundamental network is shown in Figure 1. As shown in Figure 1, the fundamental network contains multiple computational layers, which are concatenated together. The main layers are the convolutional layer, pooling layer, batch normalization layer, and activation function.

Convolutional Layer
In CNNs, convolutional layers are used to extract features from input images. The operation in a convolutional layer is a three-dimensional calculation by input feature maps and convolution kernels. The fundamental network has three types of these operations: a standard convolution, a dilated convolution, and a transposed convolution. The details are described in the following subsections.

Standard Convolution
The calculation of standard convolution is defined as
$$O_{x,y} = \sum_{c=1}^{C_{in}} \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} I_{x+s \times m,\, y+s \times n} \times w_{m,n} + b_{x,y} \tag{1}$$
where $I_{x+s \times m,\, y+s \times n}$ is the value of the input feature map at the point (x + s × m, y + s × n), $C_{in}$ is the number of input channels, s represents the stride of the convolutional layer, $w_{m,n}$ is the corresponding weight in the kernels, K is the size of the kernels, $b_{x,y}$ is the corresponding bias, and $O_{x,y}$ is the value of the output feature map at the point (x, y).

Dilated Convolution
A dilated convolution can expand the receptive field of the feature map without increasing parameters [36]. It is mainly used to introduce fine features and avoid excessive loss of the resolution. The actual size of the kernel can be computed with
$$k_r = k + (k-1) \times (r-1) \tag{2}$$
where k is the original kernel size, $k_r$ is the actual kernel size after dilation, and r is a hyper-parameter that represents the dilation rate. Taking the dilated convolution with r = 2 in the fundamental network as an example, its computation is shown in Figure 2.

Figure 1. The structure of the improved YOLOv2. Each convolutional layer contains not only convolution but also batch normalization and activation sublayers.
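As a quick check of the kernel-size relation k_r = k + (k − 1) × (r − 1), the following minimal Python sketch computes the effective kernel size and the input offsets touched by the weights along one axis. The function names `effective_kernel_size` and `dilated_taps` are illustrative and do not come from the paper.

```python
def effective_kernel_size(k, r):
    """Effective size of a k x k kernel with dilation rate r: k + (k - 1) * (r - 1)."""
    return k + (k - 1) * (r - 1)


def dilated_taps(k, r):
    """Offsets along one axis at which the k weights sample the input."""
    return [i * r for i in range(k)]


# A 3 x 3 kernel with r = 2 covers a 5 x 5 window but still holds only 9 weights.
size = effective_kernel_size(3, 2)
taps = dilated_taps(3, 2)
print(size, taps)
```

With r = 1 the function returns k itself, so the standard convolution is the special case without dilation.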

Transposed Convolution
A transposed convolution is another type of convolutional operation in CNNs. It is used to up-sample the feature maps: the output feature map $f_{out}$ is computed from the input feature map $f_{in}$ using $k^{T}$, the weight matrix after transposition. The essence of the transposed convolution is an operation that expands the size of the input feature map. The size of the input feature map after interpolation can be expressed as follows:
$$H'_{in} = H_{in} + (s-1) \times (H_{in}-1) + 2 \times pad_{in} + pad_{out}$$
$$W'_{in} = W_{in} + (s-1) \times (W_{in}-1) + 2 \times pad_{in} + pad_{out} \tag{4}$$
where s is the stride of the convolution, $pad_{in}$ is the amount of zero-padding added to both sides of the input, $pad_{out}$ is the amount of zero-padding added to one side of the output, and $H_{in}$ and $W_{in}$ represent the height and width of the original input feature map, respectively. $H'_{in}$ and $W'_{in}$ represent the height and width of the input feature map after interpolation, respectively. For the fundamental network, $pad_{in}$ and $pad_{out}$ are both 1 and the stride is equal to 2. The computation of the transposed convolution is shown in Figure 3.
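The interpolation step can be illustrated with a short Python sketch. It assumes that the interpolated size is H' = H + (s − 1) × (H − 1) + 2 × pad_in + pad_out, that s − 1 zeros are inserted between neighbouring pixels, and that the extra pad_out zeros go to the right and bottom edges; this matches the padding pattern the paper describes for s = 2, pad_in = 1, pad_out = 1. The helper names are hypothetical.

```python
def interpolated_size(h_in, s, pad_in, pad_out):
    """Size of one dimension after zero interpolation and padding (assumed formula)."""
    return h_in + (s - 1) * (h_in - 1) + 2 * pad_in + pad_out


def zero_interpolate(fmap, s, pad_in, pad_out):
    """Insert s - 1 zeros between pixels, pad_in zeros on every side,
    and pad_out extra zeros on the bottom and right (assumed placement)."""
    h, w = len(fmap), len(fmap[0])
    h2 = interpolated_size(h, s, pad_in, pad_out)
    w2 = interpolated_size(w, s, pad_in, pad_out)
    out = [[0] * w2 for _ in range(h2)]
    for y in range(h):
        for x in range(w):
            out[pad_in + y * s][pad_in + x * s] = fmap[y][x]
    return out


# Settings of the fundamental network: s = 2, pad_in = 1, pad_out = 1.
expanded = zero_interpolate([[1, 2], [3, 4]], s=2, pad_in=1, pad_out=1)
for row in expanded:
    print(row)
```

A standard 3 × 3 convolution applied to `expanded` with the reordered kernel then yields the up-sampled output, which is why the transposed convolution can reuse the standard convolution datapath.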

Batch-Normalization Layer
In most CNNs, the batch normalization (BN) layer and activation function [37] are placed after the convolutional layers. BN layers are used to prevent over-fitting and speed up the training [38]. The calculation of BN can be defined as follows:
$$O_{bn} = \gamma \times \frac{O - \bar{O}}{\sqrt{\sigma^{2} + \varepsilon}} + \beta \tag{5}$$
where $O$ and $O_{bn}$ are the input and output of the BN layer, and $\bar{O}$ and $\sigma^{2}$ represent the average and variance of the output feature maps of the previous convolution layer, respectively. $\varepsilon$ is a small constant that prevents the denominator from being zero. $\gamma$ and $\beta$ are learnable parameters that apply an affine transformation to the normalized output feature maps.
Electronics 2021, 10, 282

Activation Function
The activation functions are mainly used to apply a nonlinear transformation to the output feature maps. The most commonly used activation functions are the rectified linear unit (ReLU) and LeakyReLU. LeakyReLU activation is widely used in the fundamental network and can be defined as
$$f(x) = \begin{cases} x, & x \ge 0 \\ a \times x, & x < 0 \end{cases} \tag{6}$$
where a is the fixed leaky coefficient in the range (0, 1). ReLU activation can be computed in the same way by setting a = 0.

Pooling Layer
The pooling layer can reduce the size of feature maps by discarding redundant information. The most commonly used pooling operations in CNNs are average pooling and max pooling. In the fundamental network, max pooling is used as the pooling layer. Each output neuron of the max-pooling layer takes the maximum value from the corresponding m × m region of the input feature map. In the fundamental network, the height and the width of the pooling window are both 2, and its vertical and horizontal strides are both 2.
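The max-pooling rule above can be sketched in a few lines of Python. The function name `max_pool` and the sample values are illustrative only; the defaults follow the 2 × 2 window and stride of 2 used in the fundamental network.

```python
def max_pool(fmap, m=2, stride=2):
    """Max pooling over m x m windows with the given stride (single channel)."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for y in range(0, h - m + 1, stride):
        row = []
        for x in range(0, w - m + 1, stride):
            row.append(max(fmap[y + i][x + j] for i in range(m) for j in range(m)))
        out.append(row)
    return out


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 9, 8],
    [7, 6, 3, 2],
]
pooled = max_pool(feature_map)
print(pooled)  # [[4, 5], [7, 9]]
```

Because the operation only compares values, it needs no multipliers in hardware, and each output row depends on just two consecutive input rows.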

Optimization for Implementation
With the computational layers of the fundamental network described above, we can use a processing block to achieve all the forward calculations during inference. The block is shown in Figure 4. We could deploy this block directly on the hardware. However, this deployment method has several disadvantages. First, the values involved in the calculations are represented in floating point, which is not hardware friendly. Second, the multiplications and additions in the calculations are very dense, which limits the processing speed and requires substantial resources. Third, the fundamental network contains multiple types of convolution operations. If we customized a structure for each type of convolution, it would consume enormous hardware resources. To solve these problems, we optimize the network, as shown in Figure 4. The optimization consists of network quantization, layer fusion, and the unified implementation of multiple convolutions. The details are described in the following subsections.

Network Quantization
In the fundamental network, the convolutional layers account for most of the calculations. To deploy the network on the FPGA efficiently, we adopted a hardware-aware symmetric quantization scheme to quantize both the feature maps and weights of the convolutional layers into low-bit integers. Since other layers require high data accuracy [28,29], we retain them at full precision. Considering the general case of N-bit symmetric quantization, the quantization function is defined by the following:
$$q = \mathrm{clamp}\left(\mathrm{Int}\left(\frac{r}{S}\right),\; -2^{N-1}+1,\; 2^{N-1}-1\right) \tag{8}$$
where r represents the floating-point element in the input feature matrix or weight matrix, q represents the corresponding quantized value, S indicates the scaling factor of the matrix, and N indicates the quantization bit width. The clamp(·) function limits the quantized values to the range $[-2^{N-1}+1,\; 2^{N-1}-1]$, and the Int(·) function rounds the data to an integer. The quantized bias can be calculated by the following:
$$\hat{b} = \mathrm{clamp}\left(\mathrm{Int}\left(\frac{b}{S_d \times S_w}\right),\; -2^{N_b-1}+1,\; 2^{N_b-1}-1\right) \tag{9}$$
where $b$ and $\hat{b}$ represent the original and quantized bias, respectively. $S_d$ and $S_w$ represent the scaling factors of the feature matrix and weight matrix, respectively. $N_b$ indicates the quantization bit width of the bias matrix. In this paper, the quantization bit widths of the feature maps and weights are set to 8 bits, and the quantization bit width of the biases is set to 32 bits. With the quantized feature maps and weights, the convolutional layer can be converted into a quantized version as
$$\hat{O}_{x,y} = \sum_{c=1}^{C_{in}} \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} \hat{I}_{x+s \times m,\, y+s \times n} \times \hat{w}_{m,n} + \hat{b}_{x,y} \tag{10}$$
where $\hat{I}$ and $\hat{w}$ represent the feature values and weights after being quantized to the 8-bit fixed-point version, respectively, $\hat{b}$ represents the biases after quantization, and $\hat{O}$ represents the output feature values of the quantized convolutional layer. The output feature maps of the quantized convolutional layer are 32-bit integers. It is necessary to perform inverse quantization to convert these values into floating-point types for the following layers. The operation of inverse quantization is defined as follows:
$$\tilde{q} = q \times S_d \times S_w \tag{11}$$
where $q$ represents the quantized value in the quantized convolutional results, and $\tilde{q}$ represents the corresponding floating-point value after inverse quantization. With the hardware-aware symmetric quantization, the volume of floating-point operations in the network can be reduced. Moreover, this hardware-friendly optimization can efficiently reduce the hardware resources required for implementation.
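The quantization flow of this subsection can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the choice S = max|r| / (2^(N−1) − 1) for the scaling factor is an assumption (the text does not say how S is obtained), Int(·) is modelled with Python's `round` (which rounds halves to even), and all function names and sample values are hypothetical.

```python
def clamp(value, low, high):
    return max(low, min(high, value))


def scale_factor(values, n_bits=8):
    # Assumption: the scale maps the largest magnitude onto 2^(N-1) - 1.
    return max(abs(v) for v in values) / (2 ** (n_bits - 1) - 1)


def quantize(r, scale, n_bits=8):
    """q = clamp(Int(r / S)) within [-(2^(N-1) - 1), 2^(N-1) - 1]."""
    limit = 2 ** (n_bits - 1) - 1
    return clamp(int(round(r / scale)), -limit, limit)


def quantize_bias(b, s_d, s_w, n_bits=32):
    """The bias shares the scale S_d * S_w of the integer accumulator."""
    return quantize(b, s_d * s_w, n_bits)


def dequantize(acc, s_d, s_w):
    """Inverse quantization of the 32-bit accumulator."""
    return acc * s_d * s_w


# One output value of a tiny "convolution" with three taps.
features = [0.5, -1.0, 0.25]
weights = [0.2, 0.1, -0.4]
bias = 0.05

s_d, s_w = scale_factor(features), scale_factor(weights)
q_d = [quantize(v, s_d) for v in features]
q_w = [quantize(v, s_w) for v in weights]
acc = sum(d * w for d, w in zip(q_d, q_w)) + quantize_bias(bias, s_d, s_w)

approx = dequantize(acc, s_d, s_w)
exact = sum(d * w for d, w in zip(features, weights)) + bias
print(acc, approx, exact)
```

The multiply-accumulate loop touches only integers; a single floating-point multiplication by S_d × S_w recovers an approximation of the full-precision result.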

Layer Fusion
Applying symmetric quantization to the fundamental network eliminates most of the floating-point operations in the convolutional layers. However, there are still dense floating-point operations in the other layers, the quantization layers, and the inverse quantization layers. To further reduce the volume of floating-point calculation in the fundamental network, we present another optimization, layer fusion, which merges the floating-point multiplications and additions in adjacent layers. In Equation (5), $\bar{O}$, $\sigma^{2}$, $\gamma$, and $\beta$ are determined after the network is trained. With the determined parameters, Equation (5) can be rewritten as
$$O_{bn} = \frac{\gamma}{\sqrt{\sigma^{2}+\varepsilon}} \times O + \left(\beta - \frac{\gamma \times \bar{O}}{\sqrt{\sigma^{2}+\varepsilon}}\right) \tag{12}$$
where $O$ is the output of the preceding convolutional layer. Compared to Equation (5), Equation (12) requires one less addition and one less multiplication during inference. This transformation simplifies the hardware design when deploying the BN layer on the FPGA.
Generally, BN layers are placed after convolutional layers. Applying the symmetric quantization inserts an inverse quantization layer between the quantized convolutional layer and the BN layer. Considering that the multiplications in the inverse quantization layer and the BN layer can be achieved by only one multiplication, we fused the inverse quantization into the BN layer. The calculation of the fused BN layer is defined as follows:
$$O_{bn} = \gamma' \times \hat{O} + \beta' \tag{13}$$
where $\hat{O}$ is the 32-bit output of the quantized convolutional layer, and $\gamma'$ and $\beta'$ represent the multiplication and addition coefficients of the fused BN layer, respectively. They can be calculated by the following:
$$\gamma' = \frac{\gamma \times S_d \times S_w}{\sqrt{\sigma^{2}+\varepsilon}}, \qquad \beta' = \beta - \frac{\gamma \times \bar{O}}{\sqrt{\sigma^{2}+\varepsilon}} \tag{14}$$
These two coefficients can be calculated offline. For hardware implementations, this fused layer only needs to perform one floating-point multiplication and one addition. Notably, if a network does not include BN layers, the same fused layer can still perform the inverse quantization by setting $\gamma' = S_d \times S_w$ and $\beta' = 0$, without any redesign.
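The equivalence between the fused layer and the sequence "inverse quantization followed by BN" can be checked numerically. The sketch below assumes the fused coefficients γ' = γ × S_d × S_w / sqrt(σ² + ε) and β' = β − γ × mean / sqrt(σ² + ε), which follow from folding the two multiplications together; the parameter values, the value of ε, and the function names are made up for illustration.

```python
import math


def batch_norm(o, mean, var, gamma, beta, eps=1e-5):
    """Reference BN at inference time."""
    return gamma * (o - mean) / math.sqrt(var + eps) + beta


def fused_bn_coeffs(mean, var, gamma, beta, s_d, s_w, eps=1e-5):
    """Coefficients computed off-line; they absorb the inverse quantization."""
    inv_std = 1.0 / math.sqrt(var + eps)
    gamma_f = gamma * inv_std * s_d * s_w
    beta_f = beta - gamma * mean * inv_std
    return gamma_f, beta_f


def fused_bn(acc, gamma_f, beta_f):
    """One multiplication and one addition per 32-bit accumulator value."""
    return gamma_f * acc + beta_f


# Illustrative (made-up) trained parameters and scaling factors.
mean, var, gamma, beta = 0.3, 1.7, 1.2, -0.4
s_d, s_w = 0.02, 0.005
acc = 1234  # integer output of the quantized convolution

reference = batch_norm(acc * s_d * s_w, mean, var, gamma, beta)
gamma_f, beta_f = fused_bn_coeffs(mean, var, gamma, beta, s_d, s_w)
fused = fused_bn(acc, gamma_f, beta_f)
print(reference, fused)
```

Setting `gamma_f = s_d * s_w` and `beta_f = 0` turns the same function into a plain inverse quantization, which corresponds to the case of a network without BN layers.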
In the fundamental network, LeakyReLU is used to activate the output feature maps of a convolutional layer. With the network quantization, a quantization layer is placed before the following convolutional layer. Both the activation function and the quantization layer involve one floating-point multiplication. As in the previous layer fusion, the activation function can be fused into the following quantization layer. However, the max-pooling layer may make the two layers non-adjacent. Considering that the max-pooling layer only involves a comparison operation, and that changing the calculation order does not change the result of the network, we placed the following quantization layer before the current max-pooling layer. The fused activation layer is defined as
$$\hat{I} = \begin{cases} \mathrm{clamp}\left(\mathrm{Int}\left(\frac{1}{S_d} \times O_{bn}\right)\right), & O_{bn} \ge 0 \\ \mathrm{clamp}\left(\mathrm{Int}\left(\frac{a}{S_d} \times O_{bn}\right)\right), & O_{bn} < 0 \end{cases} \tag{15}$$
where $O_{bn}$ is the output of the fused BN layer, $\hat{I}$ is the quantized feature value passed to the following convolutional layer, and clamp(·) and Int(·) are the functions used in the quantization function. Similar to the fused BN layer, the multiplying factors $1/S_d$ and $a/S_d$ of this fused layer can be calculated off-line before implementation. Figure 5 shows the optimized processing block with network quantization and layer fusion. Compared with the original, most floating-point calculations have been converted to an 8-bit fixed-point version, and the processing flow is optimized. These optimizations can reduce the hardware resources required.
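The two claims in this paragraph (the activation can be folded into the quantization multiplier, and quantization can be moved in front of max pooling) can be illustrated with a small Python sketch. The leaky coefficient a = 0.125 and the scale S_d = 0.0625 are arbitrary powers of two chosen so that the floating-point arithmetic is exact; they are not values from the paper, and the function names are hypothetical.

```python
def quantize(r, scale, n_bits=8):
    limit = 2 ** (n_bits - 1) - 1
    return max(-limit, min(limit, int(round(r / scale))))


def leaky_relu(x, a):
    return x if x >= 0 else a * x


def fused_act_quant(x, a, s_d, n_bits=8):
    """LeakyReLU and quantization with a single multiplication.
    The factors 1/S_d and a/S_d would be precomputed off-line."""
    factor = (1.0 / s_d) if x >= 0 else (a / s_d)
    limit = 2 ** (n_bits - 1) - 1
    return max(-limit, min(limit, int(round(x * factor))))


a, s_d = 0.125, 0.0625  # illustrative values only
window = [0.5, -1.25, 2.0, -0.3]  # one 2 x 2 pooling window, flattened

# Unfused order: activation -> max pooling -> quantization.
unfused = quantize(max(leaky_relu(v, a) for v in window), s_d)
# Fused order: (activation + quantization) -> max pooling on 8-bit values.
fused = max(fused_act_quant(v, a, s_d) for v in window)
print(unfused, fused)
```

The reordering is valid because the fused function is monotonically non-decreasing for a > 0 and S_d > 0, so it commutes with the maximum; as a side benefit the pooling comparators then operate on 8-bit integers.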

Unified Implementation of Multiple Convolutions
As shown in Figure 1, the fundamental network contains many different types of convolutional layers. It is inefficient and inflexible to customize a hardware module for each type of convolution. Furthermore, while the data type is optimized by network quantization, the volume of calculations in the convolutional layers remains unchanged. It is still a great challenge to implement these calculations on an FPGA. To solve these problems, we first optimized the original loop computation in a standard convolutional layer to run it efficiently on an FPGA. We then proposed transformation methods that convert all other types of convolutions in the fundamental network into a standard convolution by data arrangement. With these transformation methods, only one processing engine needs to be designed to implement all types of convolutional layers in the fundamental network.

Loop Optimization for Standard Convolutions
The execution of convolution exhibits numerous sources of parallelism. Due to hardware constraints, it is impossible to exploit all of the parallelism patterns fully [39]. The standard convolutional layer contains N filters, and each filter consists of M channels of K × K kernels. We optimize the original loop computation of the standard layer, as shown in Figure 6. During calculation, the K × K rectangular window slides along the width of the input feature maps, which is called row processing. The extracted pixels need to be calculated with the N corresponding kernels. However, the N kernels may not be calculated at the same time due to the limited hardware resources. Thus, we divided the N kernels into multiple groups, and n kernels participate in the calculation each time. This operation repeats $N_i = N/n$ times until all the N intermediate results are obtained. The N intermediate results generated by the calculation are stored in the accumulation buffer. These kernels are reused for subsequent row processing until the rectangular window shifts to the end of the channel. In one row processing, $N \times H_{out}$ intermediate results are obtained. The rectangular window then shifts to the subsequent channels and repeats the above row processing. For each row processing, the corresponding kernels in the filters are taken out for convolution. Traversing all channels requires M row processings, which is called channel processing. After that, the rectangular window shifts down by one row and repeats the above-mentioned channel processing. The number of repetitions is $H_{in} + 2 \times pad - K + 1$, where pad is the amount of zero-padding added to both sides of the input image. Therefore, to process the whole input feature maps, the weights of the filters should be read $H_{in} + 2 \times pad - K + 1$ times. To avoid performance degradation, we hide the waiting time in row processing by weight prefetching.
Notably, the intermediate results produced by one row processing are accumulated in the next row processing. Therefore, the size of the accumulation buffer is $M \times H_{out} \times 32$ bit. This computing pattern has several advantages. First, the same feature values are not read repeatedly from external memories, which avoids frequent memory access. Second, the intermediate results can be accumulated in time, reducing the consumption of on-chip resources. Finally, the feature values of the output feature maps are obtained row by row, which is beneficial to pooling.
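The loop order described above can be mirrored in software to check that it produces the same result as a textbook convolution. The following Python sketch is only a sequential model under simplifying assumptions (stride 1, no padding, integer data); the n filters of a group would run in parallel in hardware, and in this sketch the accumulation buffer simply holds one output row per filter. All names and test data are hypothetical.

```python
def conv_reference(ifm, weights, K):
    """Textbook convolution: ifm[c][y][x], weights[f][c][m][n], stride 1, no padding."""
    M, H, W, N = len(ifm), len(ifm[0]), len(ifm[0][0]), len(weights)
    Ho, Wo = H - K + 1, W - K + 1
    return [[[sum(ifm[c][y + m][x + n] * weights[f][c][m][n]
                  for c in range(M) for m in range(K) for n in range(K))
              for x in range(Wo)] for y in range(Ho)] for f in range(N)]


def conv_row_channel(ifm, weights, K, n_par):
    """Same result, computed in the row-processing / channel-processing order."""
    M, H, W, N = len(ifm), len(ifm[0]), len(ifm[0][0]), len(weights)
    Ho, Wo = H - K + 1, W - K + 1
    out = [[None] * Ho for _ in range(N)]
    for y in range(Ho):                          # window shifts down one row at a time
        acc = [[0] * Wo for _ in range(N)]       # accumulation buffer for this output row
        for c in range(M):                       # channel processing: M row passes
            for x in range(Wo):                  # row processing: slide along the width
                window = [[ifm[c][y + m][x + n] for n in range(K)] for m in range(K)]
                for g in range(0, N, n_par):     # kernel groups of n_par filters
                    for f in range(g, min(g + n_par, N)):
                        acc[f][x] += sum(window[m][n] * weights[f][c][m][n]
                                         for m in range(K) for n in range(K))
        for f in range(N):                       # a finished output row per filter
            out[f][y] = acc[f]
    return out


# Small deterministic example: M = 2 channels, 4 x 5 input, N = 3 filters, K = 3.
ifm = [[[(7 * c + 3 * y + x) % 5 - 2 for x in range(5)] for y in range(4)] for c in range(2)]
weights = [[[[(f + 2 * c + m - n) % 3 - 1 for n in range(3)] for m in range(3)]
            for c in range(2)] for f in range(3)]
result = conv_row_channel(ifm, weights, K=3, n_par=2)
```

Each extracted window is used by every kernel group before the window moves on, which is what lets the feature values be read only once, and a complete output row becomes available after every channel pass, ready for pooling.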

Transformation Method for Other Convolutions
The standard convolution with a kernel size of 3 × 3 is the representative convolution in the fundamental network. In this section, we focus on converting the remaining types of convolutions into 3 × 3 standard convolutions for implementation. For a 1 × 1 standard convolution, we can convert the size of the kernel from 1 × 1 to 3 × 3 by zero-padding. Notably, an additional zero-padding is added to the input feature maps to obtain the correct size of the output feature maps. The calculation process after conversion is shown in Figure 7a. The essence of the 3 × 3 transposed convolution and the 3 × 3 standard convolution is the same. Their basic operations are both the convolution of nine weights and the corresponding feature values. Notably, the 3 × 3 transposed convolution has two differences. First, its kernels need to be transposed for convolution. We adjust the order of the weights off-line to avoid a matrix transpose operation during the calculation. The other difference from the standard convolution is that the input feature maps need to be interpolated with zeros according to Equation (4). For the fundamental network, the s, $pad_{in}$, and $pad_{out}$ of the 3 × 3 transposed convolution are equal to 2, 1, and 1, respectively. Thus, 1-pixel-wide zero-padding is required on the left and top, and 2-pixel-wide zero-padding is required on the right and bottom of the input feature maps. Moreover, 1-pixel-wide zero-padding is required between every two rows and between every two columns of the input feature maps. We have designed a reading strategy to efficiently complete the interpolation of feature maps and avoid extra overhead, as shown in Figure 7b. The Enable signal represents a valid signal for the input feature maps, but the feature values are not continuously read from the storage. The read enable signal of the memory, named Rd_en, is a square-wave signal corresponding to the zero-padding mode. The Rd_en signal is set to zero when the input feature maps need to be interpolated with zeros, and the input feature value at this time is set to zero.
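The 1 × 1 to 3 × 3 conversion mentioned at the start of this subsection can be verified with a small Python sketch. It assumes a single channel and stride 1, and places the 1 × 1 weight at the centre of the zero-padded 3 × 3 kernel; the helper names are illustrative.

```python
def conv2d(fmap, kernel):
    """Single-channel standard convolution, stride 1, no padding."""
    k, h, w = len(kernel), len(fmap), len(fmap[0])
    return [[sum(fmap[y + m][x + n] * kernel[m][n] for m in range(k) for n in range(k))
             for x in range(w - k + 1)] for y in range(h - k + 1)]


def pad_kernel_1x1_to_3x3(weight):
    """Embed a 1 x 1 weight at the centre of an otherwise zero 3 x 3 kernel."""
    return [[0, 0, 0], [0, weight, 0], [0, 0, 0]]


def zero_pad(fmap, p=1):
    """Add p zeros on every side so the 3 x 3 output keeps the input size."""
    width = len(fmap[0]) + 2 * p
    top = [[0] * width for _ in range(p)]
    bottom = [[0] * width for _ in range(p)]
    middle = [[0] * p + list(row) + [0] * p for row in fmap]
    return top + middle + bottom


fmap = [[1, 2, 3], [4, 5, 6]]
direct = conv2d(fmap, [[3]])                                # original 1 x 1 convolution
converted = conv2d(zero_pad(fmap), pad_kernel_1x1_to_3x3(3))  # same layer on the 3 x 3 datapath
print(direct, converted)
```

The eight added weights are zero, so the conversion trades some wasted multiplications for the ability to reuse a single 3 × 3 processing engine.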

Transformation Method for Other Convolutions
The standard convolution with a kernel size of 3 × 3 is representative in the fundamental network. In this section, we focus on converting the remaining types of convolutions into the 3 × 3 standard convolutions for implementation. For a 1 × 1 standard convolution, we can convert the size of the kernel from 1 × 1 to 3 × 3 by zero-padding. Notably, an additional zero-padding is added to the input feature maps to obtain the correct size of the output feature maps. The calculation process after conversion is shown in Figure 7a. The essence of 3 × 3 transposed convolution and 3 × 3 standard convolution is the same. Their basic operations are both the convolution of nine weights and the corresponding feature values. Notably, the 3 × 3 transposed convolution has two differences. Firstly, its kernels need to be transposed for convolution. We adjust the order of the weights off-line to avoid matrix transpose operation during the calculation. The other difference with the standard convolution is that the input feature maps need to be interpolated by zero according to Equation (4). For the fundamental network, the s, pad in and pad out of the 3 × 3 transposed convolution is equal to 2, 1 and 1, respectively. Thus, 1-pixel-wide zero-padding is required in the left and top part; 2-pixel-wide zero-padding is required in the right and bottom part of the input feature maps. Moreover, 1-pixel-wide zero-padding is required between two rows and between two columns of the input feature maps. We have designed a reading strategy to efficiently complete the interpolation of feature maps and avoid extra overhead, as shown in Figure 7b. The Enable signal represents a valid signal for the input feature maps, but the feature values are not continuously read from the storage. The read enable signal of the memory is a square-wave signal corresponding to the zero-padding mode, named Rd_en signal. 
The Rd_en signal is set to zero when the input feature maps need to be interpolated by zero, and the input feature map at this time is set to zero. Through the above method of processing the input feature map and kernel, the transposed convolution can be converted into a 3 × 3 standard convolution for implementation. Through the above method of processing the input feature map and kernel, the transposed convolution can be converted into a 3 × 3 standard convolution for implementation. Moreover, the fundamental network contains two types of dilated convolutions with a stride of 1 and 2, respectively. Figure 2 illustrates the principle of the dilated convolution with a 1-pixel stride. In essence, the operational of a 3 × 3 dilated convolution with a 1-pixel stride is constant with the 3 × 3 standard convolution. Notably, the feature values are extracted from non-adjacent rows and columns in the feature maps due to the expansion of the kernel. Furthermore, the feature values used in two consecutive operations are totally different. If the kernel slides in the original order, like a 3 × 3 standard convolution, frequent memory access will cause long latency as the bandwidth is limited. Therefore, we propose an efficient method to transform the dilated convolution with a 1-pixel stride into a 3 × 3 standard convolution for implementation, as shown in Figure 7c. When reading the input feature map, we will group the feature values according to odd rows, even rows, odd columns, and even columns. The input feature values are reordered according to the group and then convolved with the 3 × 3 kernel. The output feature maps can be obtained by reordering the obtained calculation results according to the groups. In addition, the principle of the dilated convolution with a 2-pixel stride is similar to that of the dilated convolution with a 1-pixel stride. 
The difference is that only the feature values located in odd rows and columns of the feature maps participate in the operation. Therefore, we only need to retain the corresponding feature values when grouping and perform the same subsequent operations as for the dilated convolution with a 1-pixel stride.
Its calculation process is shown in Figure 7d. With these transformations, all types of convolutions in the fundamental network are converted to 3 × 3 standard convolutions. Compared to the method of designing custom architectures for each convolution, our method can reduce resource overhead and increase flexibility.
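The three conversions described in this section can be checked numerically on a single channel. The following is a minimal pure-Python sketch, not the authors' hardware implementation. It makes several assumptions that are not stated in the text: the 1 × 1 weight is placed at the centre of the 3 × 3 kernel, the off-line weight reordering is modelled as a 180-degree flip of the kernel, and the dilation rate is 2 (suggested by the odd/even grouping). All function names are hypothetical.

```python
def conv3x3(fmap, kernel):
    """Valid 3x3 standard convolution (cross-correlation) on a 2-D list."""
    h, w = len(fmap), len(fmap[0])
    return [[sum(fmap[r + i][c + j] * kernel[i][j]
                 for i in range(3) for j in range(3))
             for c in range(w - 2)]
            for r in range(h - 2)]


def conv1x1_as_standard(fmap, weight):
    """1x1 convolution: zero-padded kernel (assumed centred) on a zero-padded input."""
    w = len(fmap[0])
    padded = [[0] * (w + 2)] + [[0] + list(row) + [0] for row in fmap] + [[0] * (w + 2)]
    return conv3x3(padded, [[0, 0, 0], [0, weight, 0], [0, 0, 0]])


def transposed_direct(fmap, kernel, stride=2, pad=1, out_pad=1):
    """Reference transposed convolution: scatter every input value, then crop."""
    h, w = len(fmap), len(fmap[0])
    full_h, full_w = (h - 1) * stride + 3, (w - 1) * stride + 3
    full = [[0] * (full_w + out_pad) for _ in range(full_h + out_pad)]
    for r in range(h):
        for c in range(w):
            for i in range(3):
                for j in range(3):
                    full[r * stride + i][c * stride + j] += fmap[r][c] * kernel[i][j]
    out_h = (h - 1) * stride - 2 * pad + 3 + out_pad
    out_w = (w - 1) * stride - 2 * pad + 3 + out_pad
    return [row[pad:pad + out_w] for row in full[pad:pad + out_h]]


def transposed_as_standard(fmap, kernel, stride=2, pad=1, out_pad=1):
    """Same result via zero interpolation, asymmetric padding and conv3x3."""
    h, w = len(fmap), len(fmap[0])
    lead = 3 - 1 - pad               # zeros on the top and left (1 here)
    trail = 3 - 1 - pad + out_pad    # zeros on the bottom and right (2 here)
    up_h = (h - 1) * stride + 1 + lead + trail
    up_w = (w - 1) * stride + 1 + lead + trail
    up = [[0] * up_w for _ in range(up_h)]
    for r in range(h):
        for c in range(w):
            up[lead + r * stride][lead + c * stride] = fmap[r][c]
    flipped = [row[::-1] for row in kernel[::-1]]  # assumed off-line reordering
    return conv3x3(up, flipped)


def dilated_direct(fmap, kernel, rate=2):
    """Reference dilated convolution (stride 1, no padding)."""
    h, w = len(fmap), len(fmap[0])
    return [[sum(fmap[r + rate * i][c + rate * j] * kernel[i][j]
                 for i in range(3) for j in range(3))
             for c in range(w - 2 * rate)]
            for r in range(h - 2 * rate)]


def dilated_as_standard(fmap, kernel):
    """Rate-2 dilated convolution via four odd/even groups, each a plain conv3x3."""
    h, w = len(fmap), len(fmap[0])
    out = [[0] * (w - 4) for _ in range(h - 4)]
    for pr in (0, 1):          # row parity of the group
        for pc in (0, 1):      # column parity of the group
            group = [row[pc::2] for row in fmap[pr::2]]
            for a, row in enumerate(conv3x3(group, kernel)):
                for b, value in enumerate(row):
                    out[pr + 2 * a][pc + 2 * b] = value  # reorder the results
    return out


# Arbitrary test data; both checks compare against the direct definitions.
fmap = [[(3 * r + 5 * c) % 7 - 3 for c in range(6)] for r in range(7)]
kernel = [[1, -2, 3], [4, 5, -6], [7, 8, 9]]
small = [row[:3] for row in fmap[:4]]
assert transposed_as_standard(small, kernel) == transposed_direct(small, kernel)
assert dilated_as_standard(fmap, kernel) == dilated_direct(fmap, kernel)
```

With s = 2, pad_in = 1, and pad_out = 1, the sketch pads one zero on the top and left and two zeros on the bottom and right, which agrees with the padding widths given above. The stride-2 dilated case would keep only one of the four groups.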

Hardware Implementation
As described in the previous section, we optimized the fundamental network to reduce the complexity of the FPGA-based hardware implementation. On this basis, this section presents a hardware architecture that implements the optimized fundamental network. As shown in Figure 8, the proposed hardware architecture is composed of an Advanced RISC Machines (ARM)-centric processing system (PS) and programmable logic (PL). The PS contains general-purpose input/output (GPIO), Direct Memory Access (DMA) support, an Ethernet interface, an interrupt controller, etc. The PL contains the following main components: an input data reordering module, a decoding module, a DDR controller, a memory interface generator (MIG), a parameters buffer, and a processing array. Both the PS and the PL have an external Double Data Rate (DDR) SDRAM memory, called the PS-DDR and the PL-DDR, respectively. The PS-DDR and PL-DDR communicate with the FPGA through the DMA and the MIG IP core, respectively. The PS-DDR is mainly used to store feature maps, while the PL-DDR is used to store network parameters. A host PC connects with the PS through the Ethernet interface. The host PC provides images to the PS-DDR and parameters to the PL-DDR and generates the detection results.
During operation, the configuration instruction of each layer is issued by the ARM and transmitted to the PL through the GPIO. After the decoding module decodes the instruction, control signals are sent to the relevant modules. The DMA fetches the original image from the PS-DDR and transmits it to the PL. The input data reordering module rearranges the pixels and feeds them to the processing array. The DDR controller in the PL fetches the model parameters from the PL-DDR into the parameters buffer, which then provides the parameters to the processing array. In the proposed architecture, N PEs are adopted to build a processing array for parallel computing. These parallel PEs share the same input feature map and compute different output channels. The PEs complete the calculation of each layer in parallel. Finally, the output feature maps of the last layer are transferred back to the host PC from the PS-DDR. With the final feature maps, the host PC performs Non-Maximum Suppression (NMS) to obtain the object detection results. The details are discussed in the following subsections.
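The output-channel parallelism of the processing array can be illustrated with a small scheduling sketch. It is only a software model under stated assumptions: the default of 32 PEs is the value chosen later in the experiments, the idea that a layer with more output channels than PEs is handled in sequential passes is an assumption, and the function names and the 40-channel layer are hypothetical.

```python
def schedule_output_channels(num_out_channels, num_pes=32):
    """Split the output channels into passes; each pass maps one channel to one PE."""
    passes = []
    for start in range(0, num_out_channels, num_pes):
        stop = min(start + num_pes, num_out_channels)
        passes.append(list(range(start, stop)))
    return passes


def idle_pes(num_out_channels, num_pes=32):
    """Number of PEs left inactive in the final pass."""
    remainder = num_out_channels % num_pes
    return 0 if remainder == 0 else num_pes - remainder


# A layer whose channel count is a multiple of 32 keeps every PE busy.
full_layer = schedule_output_channels(64)
# A hypothetical 40-channel layer leaves some PEs inactive in its last pass.
ragged_layer = schedule_output_channels(40)
```

Every PE in a pass reads the same input feature values, so only the weights differ between PEs.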

Processing Engine Architecture Design
With the optimizations in Section 3, all calculations during the inference phase of the fundamental network are divided into multiple identical processing blocks. Thus, only one hardware module needs to be designed to deploy the fundamental network on the FPGA. To achieve this goal, we propose an efficient PE. The architecture of the proposed PE is shown in Figure 9. In the proposed PE, a finite state machine (FSM) is adopted to configure the routers to carry out the calculations of the different processing blocks. Two local memories in the PE are used as data registers. The first local memory stores the intermediate results during calculations. These intermediate results can be accumulated in the overlap-add operation. The second local memory caches one row of feature maps for max pooling. This row is read out to perform the pooling operation when the next row of convolutional results is obtained. The convolution unit implements the standard 3 × 3 convolution. This unit is composed of nine multipliers and an adder tree. The operation bit width of each stage in the adder tree is increased by one bit to prevent overflow. A clamp function is used to prevent overflow of the 32-bit adders in the adder tree. With this function, values that exceed the upper and lower bounds are set to $2^{31} - 1$ and $-2^{31} + 1$, respectively. The pseudo-code of the clamp operation in the hardware implementation is shown in Algorithm 1. A fixed-to-float conversion unit converts the fixed-point convolutional results to floating-point values for the following fused layers. After the floating-point calculations of the two fused layers, the results are converted back to integers. In addition, a comparison operation is used to implement max pooling.
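The bit-growth rule of the adder tree and the saturating clamp can be modelled in a few lines of software. This is a hedged sketch, not a transcription of Algorithm 1: the symmetric lower bound follows our reading of the bounds quoted above, the pairwise reduction order of the nine products is an assumption, and the names clamp32 and adder_tree are hypothetical.

```python
INT32_UPPER = 2**31 - 1
INT32_LOWER = -(2**31) + 1  # assumed symmetric lower bound, as read from the text


def clamp32(value):
    """Saturate a value to the 32-bit range instead of letting it wrap around."""
    return max(INT32_LOWER, min(INT32_UPPER, value))


def adder_tree(products, in_bits):
    """Sum the products pairwise and widen the operands by one bit per stage.

    Returns the sum and the bit width of the final stage. The internal
    assertion checks that no stage overflows its signed width.
    """
    level = list(products)
    width = in_bits
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an odd element passes through to the next stage
        level = nxt
        width += 1
        lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
        assert all(lo <= v <= hi for v in level)
    return level[0], width


# Nine products reduce in four stages (9 -> 5 -> 3 -> 2 -> 1).
total, final_width = adder_tree([127 * 127] * 9, in_bits=16)
saturated = clamp32(2**40)
```

Because each stage at most doubles the magnitude of its operands, one extra bit per stage is sufficient, and the clamp is only needed where the width is fixed at 32 bits.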

Data Storage and Transmission
In this section, we focus on the data storage and transmission scheme used to efficiently deploy the network on an FPGA with the proposed PEs. To achieve this goal, a multi-level memory structure and a corresponding data path are designed to reuse calculation data and effectively access external memories. Notably, the design of the multi-level memories is based on and extends our previous work [40]. To buffer and rearrange the feature maps, an input data reordering module is designed. As shown in Figure 10, this module is mainly composed of four block random access memories (Brams). The four Brams store four rows of all channels, one row each. This module is designed as a Ping-Pong buffer. For example, at the beginning of the calculation, the first three rows of input feature maps are stored in Bram_1, Bram_2, and Bram_3 for calculation. At the same time, the fourth row of input feature maps is written into Bram_4. At this time, Bram_1, Bram_2, and Bram_3 form the Ping buffer; Bram_2, Bram_3, and Bram_4 form the Pong buffer. After the calculation of the current three rows is completed, the fifth row of feature maps is written into Bram_1. At this time, Bram_2, Bram_3, and Bram_4 form the Ping buffer; Bram_3, Bram_4, and Bram_1 form the Pong buffer. Notably, the Brams of the Ping buffer and the Pong buffer are ordered, and the order cannot be reversed. This scheme continuously provides feature values to the PEs and hence enables low-latency calculations and improves the DDR bandwidth utilization rate. In addition, zero-padding is achieved by selecting zero as the output of the input data reordering module at the appropriate positions.
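The rotation of the four Brams can be expressed as simple modular arithmetic. The sketch below only models which Bram holds which role at each step of the sliding three-row window; the function name bram_roles and the zero-based step counter are hypothetical, and the real module is of course implemented in hardware.

```python
def bram_roles(step, num_brams=4):
    """Return (ping, pong, write_target) as 1-based Bram indices.

    step 0 is the first three-row window. The Ping buffer feeds the current
    calculation, the Pong buffer is the window used next, and write_target is
    the Bram that receives the incoming row while the calculation runs.
    """
    ping = [(step + k) % num_brams + 1 for k in range(3)]
    pong = [(step + 1 + k) % num_brams + 1 for k in range(3)]
    write_target = (step + 3) % num_brams + 1
    return ping, pong, write_target


def incoming_row(step):
    """1-based index of the feature-map row written during the given step."""
    return step + 4


first = bram_roles(0)   # rows 1 to 3 are computed while row 4 is written
second = bram_roles(1)  # rows 2 to 4 are computed while row 5 is written
```

The order inside each list matters because it determines which Bram supplies the top, middle, and bottom row of the 3 × 3 window.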
The efficient calculation of PEs requires not only feature values but also parameters. For the fundamental network, the multiplying factors of the fused activation layer in each layer are two fixed values, which can be directly initialized in the read-only memory (ROM). The other parameters of each layer are stored in external memory due to their large volume. Figure 11 illustrates the parameter storage module, which contains the DDR controller and the parameters buffer. The parameters buffer is used to provide parameters to the PEs. The DDR controller is responsible for interacting with the PL-DDR through the MIG IP core. It reads parameters from the PL-DDR according to the control signals and stores them in the parameters buffer.
During the inference phase, the parameters of each layer need to be reused multiple times. For the fundamental network, the data amount of b, γ′, and β is small. Thus, these parameters of a layer can be stored in Brams. Nevertheless, it is not feasible to cache all the weights of a large layer. Therefore, the weights are cached in a FIFO and discarded immediately after a single use. In this case, we read them from the PL-DDR multiple times to meet the computing demand. The time spent repeatedly reading weights is completely overlapped by the calculation time, which preserves the acceleration performance.
With the provided parameters and feature maps, the processing array can obtain the corresponding calculation results. These results are transmitted to the PS-DDR through the DMA. As shown in Figure 12, we divide the storage space of the PS-DDR into several subspaces, which are used to store the results of different stages. Among them, two subspaces constitute a set of memories for alternately storing the input and output feature values of a layer. In particular, if the output values are related to the route layer, they are stored in another subspace to avoid being overwritten. The results of the current layer are written back into the input data reordering module as the input feature map of the following layer. When all the network calculations are completed, the output feature maps of the final layer are transmitted to the host PC via the Ethernet interface. The host PC runs NMS and obtains the detection results of the remote sensing image.

Experimental Evaluation and Results
In this section, we evaluate the performance of the proposed design through several experiments. The evaluation experiments were divided into two parts. First, the quantized fundamental network was trained and tested on a publicly available remote sensing image scene dataset to evaluate its detection metrics and obtain hybrid-type parameters for FPGA implementation. We then implemented the quantized network on the FPGA using the proposed architecture and tested the processing performance of the implementation. The experimental settings and detailed experimental results are provided below. We used the training set of the DOTA dataset to train the quantized fundamental network and used the validation set to test it. In the training phase, all images were cropped to 1024 × 1024-pixel patches by the DOTA development kit. Standard data augmentation techniques, including random crops, rotations, and hue, saturation, and exposure shifts, were applied to the images during the training phase. Following previous work [4], the images were first cropped with a stride of 512 pixels in the testing phase. The detection results of each patch were then combined to obtain the results of the original images. Samples of the DOTA dataset are shown in Figure 13.
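The sliding-window cropping used in the testing phase can be sketched as follows. This is an illustrative approximation and not the behaviour of the DOTA development kit: the 1024-pixel test patch size, the handling of the last patch at the image border, and the axis-aligned box format are assumptions, and both function names are hypothetical.

```python
def patch_origins(length, patch=1024, stride=512):
    """Start offsets of overlapping patches along one image axis.

    The last patch is shifted back so that it ends exactly at the border
    (an assumption about border handling).
    """
    if length <= patch:
        return [0]
    origins = list(range(0, length - patch + 1, stride))
    if origins[-1] + patch < length:
        origins.append(length - patch)
    return origins


def to_image_coords(box, origin_x, origin_y):
    """Shift an axis-aligned (x_min, y_min, x_max, y_max) box from patch to image coordinates."""
    x_min, y_min, x_max, y_max = box
    return (x_min + origin_x, y_min + origin_y, x_max + origin_x, y_max + origin_y)


# Patches covering a hypothetical 2500 x 2048 image.
xs = patch_origins(2500)
ys = patch_origins(2048)
patches = [(x, y) for y in ys for x in xs]
```

Detections from every patch are shifted with to_image_coords before they are merged, so that duplicate boxes in the overlapping regions can be removed by NMS on the full image.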

Experimental Procedure
To quickly evaluate the quantized fundamental network and obtain the hybrid-type parameters for FPGA implementation, we adopted fine-tuning to train the quantized network. In our previous work [4], we obtained the weight parameters that perform best on the DOTA validation set. These weight parameters were used to initialize the quantized network. The quantized network was then trained for 20 epochs. The weight parameters were optimized with the Adam optimization method with a weight decay of 0.0005. A multistep learning rate was adopted. The detailed settings of the learning rate for this experiment are shown in Table 1. The batch size was set to 14. This experiment was performed on two NVIDIA Titan Xp GPUs with PyTorch 1.2.0 and TorchVision 0.4.0.
To test the processing performance of the hardware implementation, we ran the quantized network on a hardware platform with a Xilinx ZYNQ xc7z035 SoPC chip and two Micron DDR3 SDRAMs. The DDR3 SDRAMs were used as the PL-DDR and the PS-DDR, respectively. The proposed design was implemented in Very-High-Speed Integrated Circuit Hardware Description Language (VHDL) and synthesized with Vivado Design Suite 2017.2. The code for the embedded ARM processor and the host PC was written in C and Python, respectively. The power results were obtained with the Vivado Power Analysis tool. For the fundamental network, the number of output channels is a multiple of 32, except for the last layer. Thus, we chose 32 PEs to build the processing array based on a trade-off between resource overhead and calculation time. For the last layer, the unneeded PEs were not activated. Table 2 presents the hardware resource utilization of our design. The utilized numbers of look-up tables (LUTs), flip-flops (FFs), Brams, and digital signal processing (DSP) slices were 83,240, 108,883, 369, and 192, respectively. The Brams were primarily consumed by the on-chip buffers. The embedded DSP slices were mainly used to implement the calculations of the network.
As shown in Table 2, the available resources of the Xilinx ZYNQ xc7z035 SoPC are limited. However, with our implementation method, the large-scale fundamental network was successfully deployed on the SoPC platform with appropriate resource utilization.

Performance Evaluation
To evaluate the performance of the hardware implementation, we used its throughput and detection accuracy as evaluation criteria. The throughput is defined as the total number of operations divided by the required execution time, expressed in giga-operations per second (GOP/s). The total number of operations reflects the complexity of the network. For the FPGA implementation in this design, the total number of operations of the quantized fundamental network is 379 GOP. For an input of size 1024 × 1024 × 3, it takes 3.4 s to obtain the output results of the last layer on the Xilinx ZYNQ xc7z035 FPGA. The proposed design therefore achieved an overall throughput of 111.5 GOP/s for the quantized fundamental network at the 200 MHz working frequency of the FPGA. Moreover, the mAP is the metric most commonly used to evaluate the accuracy of object detection [41]. It indicates the mean accuracy over all classes, considering both recall and precision. Generally, a higher mAP indicates better performance. Experiments show that our hardware implementation deployed on a ZYNQ xc7z035 device achieves a 69.32% mAP on the DOTA dataset. Some results on the DOTA dataset are shown in Figure 14. It can be seen that arbitrarily oriented objects are correctly detected by our hardware implementation on the FPGA.
Figure 14. Some results on the DOTA dataset that were detected by our hardware architecture on Xilinx ZYNQ xc7z035 SoPC.

Performance Comparison
Several comparative experiments were conducted to show the effectiveness of our implementation. Firstly, the proposed FPGA-based hardware design was compared with different off-the-shelf platforms. We implemented the fundamental network on an Intel Xeon Gold 5120T CPU with 128 GB DDR4 DRAM and an NVIDIA Titan Xp GPU with 12 GB GDDR5X memory. The main results of the CPU, the GPU, and the proposed design are listed in Table 3. Notably, the network quantization strategy was also adopted on these platforms when mapping the fundamental network. The Thermal Design Power (TDP) values of the CPU and GPU were 105 W and 250 W, respectively. According to the power report supplied by the Vivado Design Suite, the total on-chip power of our hardware architecture was only 5.96 W. Therefore, our design is suitable for deployment in power-limited application scenarios. Table 3 shows that the GPU has an obvious advantage in throughput among the platforms: its throughput was 89.6 times that of the CPU and 47.3 times that of the proposed design. Compared with the CPU, the energy efficiency of our hardware architecture was 33.4 times higher. Additionally, the energy efficiency of our hardware architecture reached 89% of that of the GPU. Thus, our hardware architecture had better performance and energy efficiency than the CPU, and its energy efficiency was comparable with that of the GPU running at 1.6 GHz. In addition, our design is scalable: the performance and energy efficiency can be improved by increasing the number of PEs. As shown in the experimental results in Table 3, the detection accuracy of our implementation on the FPGA was only 0.18% lower than the mAP of the quantized fundamental network deployed on the GPU. We attribute this small accuracy loss to the fusion of some operations for hardware implementation. This degree of accuracy loss is acceptable in practical applications.
The proposed design was also compared with related state-of-the-art works. The performance comparison is shown in Table 4, wherein the relevant references are indicated. In Reference [42], a novel method to implement the YOLOv1 network framework on an FPGA is presented. However, the implementation allocated independent hardware resources for the convolution and fully connected layers, which limits the utilization of the available resources. This design used 800 DSPs, and its performance was only 18.82 GOP/s. In Reference [43], a Tiny-YOLOv2 algorithm, which contains nine convolutional layers and six max-pooling layers, was implemented on an FPGA. As shown in Table 4, this work reported a low resource overhead, which makes it hardware-friendly. However, its processing performance was significantly limited: only 21.6 GOP/s. The authors of [44] successfully deployed YOLOv2 on the Xilinx ZYNQ xc7z020 FPGA, a chip with limited resources. However, the design consumes almost all the DSPs on the chip (211 of 220) and therefore has limited expansion capability. In Reference [45], the YOLOv2 model was implemented on a Xilinx ZCU102 FPGA with an overall performance of 102.5 GOP/s at a 300 MHz clock frequency. However, this design has extremely high requirements for computing resources (600 DSPs). Among the related works listed in Table 4, the design in Reference [46] has the highest processing performance, reaching 500 GOP/s. Meanwhile, this design occupies the most logic resources, consuming more than 1000 DSPs and Brams, which makes it difficult to deploy in resource-limited scenarios. The processing performance of a design is closely related to its resource overhead and operating frequency [12]. Hence, for a fair comparison, we considered the performance density, which is defined as the number of operations that one DSP slice executes in one cycle [13].
Compared with the above works, our design not only has superior processing speed but also performs best in performance density, which reached 2.90 OP/DSP/cycle. The comparison with these related works demonstrates that our design can strike a satisfactory balance between resource consumption and computing time cost and is suitable for deployment on embedded devices with a limited resource budget.
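The headline metrics can be reproduced from the figures reported in this paper. The short calculation below is a worked check, with hypothetical function names; the inputs (379 GOP, 3.4 s, 5.96 W, 192 DSP slices, 200 MHz, and the reported 111.5 GOP/s) are taken from the text, and nothing else is assumed.

```python
def throughput_gops(total_gop, latency_s):
    """Throughput in GOP/s: total operations divided by execution time."""
    return total_gop / latency_s


def energy_efficiency(throughput, power_w):
    """Energy efficiency in GOP/s/W."""
    return throughput / power_w


def performance_density(throughput, num_dsp, freq_hz):
    """Operations executed per DSP slice per clock cycle (OP/DSP/cycle)."""
    return throughput * 1e9 / (num_dsp * freq_hz)


measured = throughput_gops(379.0, 3.4)               # close to the reported 111.5 GOP/s
efficiency = energy_efficiency(111.5, 5.96)           # GOP/s/W
density = performance_density(111.5, 192, 200e6)      # OP/DSP/cycle
```

Rounded to two decimals, the last two values agree with the reported 18.71 GOP/s/W and 2.90 OP/DSP/cycle.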

Discussion
We implemented an improved YOLOv2 network on an FPGA for large-scale optical remote sensing image object detection. Our implementation is a hardware/software co-design, which improves generalizability and flexibility. Compared with related works, our design has significant advantages in hardware resource requirements. Therefore, to further improve performance, we can increase the processing speed of our design by appropriately increasing the number of PEs. In future work, we will focus on reducing the resource overhead and improving the computing performance of the proposed architecture. To achieve this goal, we aim to implement the proposed design on a Xilinx MPSoC board with a large on-chip memory. The external memories can then be replaced by taking full advantage of the on-chip UltraRAMs. In this case, the design could run faster while using less power.

Conclusions
In this paper, we propose a hardware implementation method for CNN-based optical remote sensing object detection under power-limited conditions. First, we optimized the fundamental network for hardware implementation. The optimization mainly includes three aspects: network quantization, layer fusion, and the unified implementation of multiple convolutions. With these optimization methods, we effectively reduced the scale of the network and the resource requirements during deployment. Based on these optimizations, we further propose a hardware architecture for the CNN-based remote sensing object detection model. In this architecture, a PE is proposed to implement multiple types of convolutions in the network. An efficient data storage and access scheme is also proposed, and it achieves low-latency calculations and a high memory bandwidth utilization rate. We deployed an improved YOLOv2 network on a Xilinx ZYNQ xc7z035 FPGA using the proposed hardware architecture. The experimental results show that our design achieves an overall throughput of 111.5 GOP/s and an energy efficiency of 18.71 GOP/s/W at the 200 MHz working frequency of the FPGA. Compared with the CPU, the proposed accelerator improves energy efficiency by 33.4 times. Additionally, the energy efficiency of our hardware architecture reaches 89% of that of the GPU. Moreover, the performance of the proposed accelerator can be further improved by increasing the number of PEs. The total on-chip power of our hardware architecture was only 5.96 W, which is much lower than the power consumption of the CPU and the GPU. In addition, experimental results on the DOTA dataset show that the proposed design strikes an excellent balance between hardware resource overhead and time cost. The detection accuracy of our implementation on the FPGA is only 0.18% lower than the mAP of the quantized fundamental network deployed on the GPU.
Furthermore, several recent advanced FPGA-based implementations were compared to verify the superiority of the proposed hardware accelerator.