An Efficient FPGA-Based Implementation for Quantized Remote Sensing Image Scene Classification Network

Deep Convolutional Neural Network (DCNN)-based image scene classification models play an important role in a wide variety of remote sensing applications and achieve great success. However, the large-scale remote sensing images and the intensive computations make the deployment of these DCNN-based models on low-power processing systems (e.g., spaceborne or airborne) a challenging problem. To solve this problem, this paper proposes a high-performance Field-Programmable Gate Array (FPGA)-based DCNN accelerator by combining an efficient network compression scheme and reasonable hardware architecture. Firstly, this paper applies the network quantization to a high-accuracy remote sensing scene classification network, an improved oriented response network (IORN). The volume of the parameters and feature maps in the network is greatly reduced. Secondly, an efficient hardware architecture for network implementation is proposed. The architecture employs dual-channel Double Data Rate Synchronous Dynamic Random-Access Memory (DDR) access mode, rational on-chip data processing scheme and efficient processing engine design. Finally, we implement the quantized IORN (Q-IORN) with the proposed architecture on a Xilinx VC709 development board. The experimental results show that the proposed accelerator has 88.31% top-1 classification accuracy and achieves a throughput of 209.60 Giga-Operations Per Second (GOP/s) with a 6.32 W on-chip power consumption at 200 MHz. The comparison results with off-the-shelf devices and recent state-of-the-art implementations illustrate that the proposed accelerator has obvious advantages in terms of energy efficiency.


Introduction
Scene classification in remote sensing images refers to categorizing scene images into a discrete set of meaningful land use and land cover classes based on image content [1]. This task plays an essential role in natural disaster detection, vegetation mapping, urban planning and other applications [2][3][4][5], and it has therefore become an important research topic in recent years. Drawing upon significant achievements in the field of computer vision, various deep convolutional neural network (DCNN)-based methods have been proposed for remote sensing scene classification [6][7][8][9]. To perform remote sensing scene classification in real time, many researchers focus on building aerospace image processing systems, such as spaceborne or airborne systems. However, the large-scale remote sensing images and the intensive computations make the deployment of high-performance DCNN-based remote sensing image scene classification methods on power-limited real-time aerospace systems a challenging task.
Electronics 2020, 9, 1344

To address this challenge, many researchers are committed to building aerospace remote sensing image processing systems on low-power, high-performance embedded devices. At present, many works adopt Central Processing Units (CPUs) and Graphics Processing Units (GPUs) as the implementation platforms of DCNNs for both training and inference phases. While CPUs and GPUs have high flexibility and excellent computing performance, their high power consumption and low performance-power ratio hinder their employment in power-limited systems. To achieve a trade-off between power consumption and computing performance, more attention has been paid to Field-Programmable Gate Arrays (FPGAs) and Application-Specific Integrated Circuits (ASICs) recently. Although FPGAs and ASICs rarely reach the same throughput as GPUs, their power consumption is much lower. Furthermore, compared with FPGAs, the high cost and long development period of ASICs make them unable to keep up with the rapid changes of DCNNs. Thus, FPGA-based DCNN implementation has become a hotspot recently.
While FPGAs have relatively outstanding characteristics, the dense computations in DCNNs still cause difficulties for their deployment on FPGAs. Thus, optimizing DCNN models is of great significance. In recent years, many researchers have proposed various DCNN optimization methods to reduce network scale and computational complexity. In general, these optimization methods can be divided into three categories. The first category adopts computational transforms to reduce the volume of computation and the execution time of the network. Zhang et al. [10] used the Fast Fourier Transform (FFT) algorithm to optimize the convolution operation and achieved good performance in experiments. However, this method is effective only for large convolution kernels and does little for small ones. Zhang et al. [11] adopted matrix multiplication to speed up the inference phase of the DCNN. However, the data replications in this method introduced a high memory cost. The second category reduces the amount of computation by cutting down the elements involved in DCNNs. Zhang et al. [12] effectively reduced the number of parameters in AlexNet by utilizing weight pruning and successfully deployed the compressed network on FPGA. However, because not all calculations were offloaded to the FPGA, their accelerator did not achieve ideal calculation efficiency or good overall throughput. Mei et al. [13] compressed the network by applying low rank approximation to the fully connected (FC) layer. However, since the computation of DCNNs is mainly concentrated in the convolution layers, their accelerator still occupied a lot of computing resources. The last category quantizes the models to reduce the volume of parameters and calculations in DCNNs. Extensive research [14,15] has shown that quantized models can reach performance similar to that of floating-point ones and can be deployed well on FPGA platforms.
Therefore, compared with the former two methods, the methods based on quantization schemes have received the most attention in the design of FPGA-based accelerators. Han et al. [16] quantized the weights of the convolutional layers and FC layers into 8-bit and 5-bit, respectively, to relieve the storage pressure. However, 5-bit precision is not effective for hardware implementation. Through comparative experiments, Gysel et al. [14] concluded that dynamic fixed-point (DFP) quantization had better performance than fixed-point quantization in the low-bit case. However, DFP quantization took up more on-chip storage compared with fixed-point quantization. Furthermore, many researchers have paid close attention to extreme bit compression [17,18]. Nevertheless, extreme bit compression reduces the network scale at the expense of network accuracy.
Although these optimization schemes provide significant convenience for the design of DCNN-based accelerators, the compressed DCNN still cannot be directly deployed on FPGA with limited resources. Aiming at this problem, the state-of-the-art implementations first advocate mapping a limited number of processing elements (PEs) on FPGA and reusing them during calculation. Then, to match the computing requirements of the PEs and obtain a high computing performance, these implementations also focus on configuring efficient storage schemes and data-paths when designing hardware architecture. To match the calculation amount with the memory bandwidth to a certain extent, Zhang et al. [19] designed a processing engine based on adder tree structure and theoretical roofline model. Similar to [19], Liu et al. [20] also applied the roofline model to coordinate resource and bandwidth issues in their design. Alwani et al. [21] effectively reduced off-chip access by optimizing the order of input feature maps and fusing multiple adjacent convolutional layers. Sun et al. [22] improved accelerator performance by adopting different parallelism schemes for different convolutional layers and optimizing the data path with loop unrolling and tiling. Mei et al. [13] implemented the proposed hardware accelerator on a Xilinx VC709 evaluation board by employing dual-channel Double Data Rate Synchronous Dynamic Random-Access Memory (DDR), tile-grain on-chip feature map buffers and the proposed Propagate Partial Multiply-Accumulate processor. Li et al. [23] developed an efficient hardware architecture with the proposed neural processing units and hierarchical on-chip storage organization.
In this paper, we propose an efficient remote sensing image scene classification accelerator. We adopt the improved oriented response network (IORN) [24] as the baseline, in which the average active rotating filters (A-ARFs) can effectively reduce the parameters of the network. To facilitate the deployment on FPGA, we further quantize the IORN with a quantization-awareness training method [25], which reduces the model size and computation amount without accuracy loss. Then, based on the quantized network, an efficient hardware architecture is proposed. During the development of the hardware architecture, we mainly consider parallel design and data-path optimization. We arrange parallel computing units with appropriate scale and propose an efficient hardware processing engine to maintain high hardware efficiency. In addition, a reasonable data-path is developed based on data reuse, efficient dual-channel DDR access mode and rational on-chip data processing scheme. Finally, we implement the quantized IORN (Q-IORN) with the proposed architecture on FPGA. Compared with off-the-shelf devices and other advanced implementations, the proposed accelerator can strike a better trade-off between performance and power consumption. The main contributions of this paper are summarized as follows:

• A Q-IORN is proposed to facilitate the implementation of remote sensing scene classification on FPGA. The quantization-awareness training method used in this network can convert the feature maps and parameters of the network from floating-point to fixed-point, which efficiently compresses the model size without accuracy loss.

• We analyze and optimize each calculation module of the proposed Q-IORN, which lays a foundation for the subsequent hardware implementation.

• An efficient hardware architecture is proposed to implement the proposed Q-IORN. In this architecture, efficient dual-channel DDR access mode, rational on-chip data processing scheme and high-performance processing engine are adopted.

• We verify the proposed hardware architecture on a Xilinx VC709 development board and evaluate it on the NWPU-RESISC45 dataset. The experimental results show that the classification accuracy of the proposed accelerator is consistent with that of GPU, i.e., 88.31%. The proposed accelerator achieves an overall energy efficiency of 33.16 Giga-Operations Per Second Per Watt (GOP/s/W) at 200 MHz, which is superior to CPU, GPU and other advanced accelerators.

The rest of this paper is organized as follows. Section 2 introduces the framework of the proposed Q-IORN. In Section 3, we discuss the algorithm basics and the mapping optimization scheme of each layer in the network in detail. The proposed hardware accelerator is illustrated in Section 4. The experiments and results are shown in Section 5. Finally, the conclusions are presented in Section 6.

Quantized IORN
In this paper, IORN [24] was adopted as the basic network since it can classify the remote sensing scene images with high accuracy and is conducive to hardware implementation. Then, we optimized the original IORN with network quantization to create quantized IORN (Q-IORN). The detailed descriptions of IORN and Q-IORN will be discussed in the following sub-sections.

IORN
Recently, DCNN-based models have attracted the most attention among various remote sensing image scene classification methods. However, DCNNs are not fully applicable to remote sensing scene classification [26]. Different from natural scene images, remote sensing scene images show frequent orientation variations due to the rotation of the earth and changing shooting angles, which increases the difficulty of recognizing scene categories. Aiming at this problem, the authors of [24] proposed an efficient network, IORN, for remote sensing scene classification. IORN was built by adding an A-ARF module and a Squeeze-ORAlign (S-ORAlign) module to the fundamental network VGG16 [27], and it has satisfactory learning ability with less training data for different remote sensing scenes. As a DCNN using A-ARFs, IORN applies the prior knowledge of rotation to the most basic unit of DCNNs (i.e., the convolution calculation unit) rather than introducing additional functional modules or new network topologies. According to [24,28], A-ARFs in IORN actively rotate during the convolution process to generate feature maps with explicit location and orientation encoding. Thanks to the A-ARFs, IORN can produce within-class rotation-invariant features while maintaining inter-class discrimination for classification tasks, which effectively handles the directionality of remote sensing image scenes. Moreover, with A-ARFs, IORN requires significantly fewer convolutional parameters with a negligible computation overhead. According to the different rotation angles of A-ARFs, two kinds of IORN were proposed in [24]: IOR4-VGG16 and IOR8-VGG16. The A-ARF is described in detail as follows.
According to [24,28], the construction of A-ARF is based on the circular displacement property of the Fourier Transform. In practical application, only a limited number of orientations is required to ensure the accuracy. An A-ARF with $N$ orientation channels consists of two parts: a materialized filter $\Gamma$ and $N - 1$ immaterialized filters derived from its rotation. The $n$-th immaterialized filter, $n \in [1, N - 1]$, is produced by rotating $\Gamma$ anticlockwise by $2\pi n/N$. In this work, the $3 \times 3 \times N_{in} \times 4$ A-ARFs are mainly used, where $N_{in}$ and 4 are the number of input feature maps' channels and filter orientation channels, respectively. Taking a $3 \times 3 \times 64$ materialized filter $H$ as an example, the calculation process of its $\theta$ anticlockwise rotated version $H_{\theta}$ contains two steps. First, the result of coordinate rotation $\Psi_{\theta}$ can be obtained according to Equation (1), where $N_{in} = 64$, $\theta = k\frac{2\pi}{8}$, and

$$I = \begin{pmatrix} 0 & 1 & 2 \\ 7 & & 3 \\ 6 & 5 & 4 \end{pmatrix}$$

is a mapping for the coordinate index, as shown in Figure 1. After the coordinate rotation, the orientation rotation is performed by Equation (2) to produce the other three immaterialized filters.
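The coordinate rotation step can be illustrated with a short sketch. Equation (1) is not reproduced in this text, so the code below rests on an assumption: for a 3 × 3 kernel, rotating anticlockwise by k · 2π/8 is treated as a cyclic shift of the eight outer taps along the index ring I, with the centre tap left in place. The names `RING`, `rotate_kernel` and `rotate_filter` are hypothetical, and the orientation rotation of Equation (2) is not covered.

```python
# Outer-tap coordinates of a 3x3 kernel in the order of the index mapping
# I = [[0, 1, 2], [7, -, 3], [6, 5, 4]] (clockwise, starting at the top-left).
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]


def rotate_kernel(kernel, k):
    """Rotate one 3x3 kernel anticlockwise by k * 45 degrees.

    Assumption (not taken from the paper's Equation (1)): the rotation is a
    cyclic shift of the eight outer taps along RING; the centre is unchanged.
    """
    rotated = [row[:] for row in kernel]
    for j, (r, c) in enumerate(RING):
        src_r, src_c = RING[(j + k) % 8]
        rotated[r][c] = kernel[src_r][src_c]
    return rotated


def rotate_filter(channels, k):
    """Apply the same coordinate rotation to every channel of a filter."""
    return [rotate_kernel(channel, k) for channel in channels]


H = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
quarter_turn = rotate_kernel(H, 2)  # k = 2 corresponds to 90 degrees
```

Because the shift only permutes eight values, it maps naturally onto exchanging register contents, which matches the later remark that coordinate rotation needs no arithmetic.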

Q-IORN
Since IOR4-VGG16 strikes a better trade-off between classification accuracy and the volume of parameters than IOR8-VGG16, this paper adopts IOR4-VGG16 as the basic network for remote sensing scene classification. To facilitate hardware implementation, this paper further proposes a Q-IORN (i.e., Quantized IOR4-VGG16) based on the original IORN (i.e., IOR4-VGG16) with the following steps. First, the S-ORAlign layer is removed in the proposed Q-IORN, as it has no significant effect on performance and the reorder operation in this layer is not hardware-friendly. Second, a network quantization scheme is applied to the original IORN, where the A-ORConv layer (the convolutional layer with the A-ARFs) and the FC layer in the original IORN are replaced by the quantized A-ORConv layer and the quantized FC layer, respectively. With this quantization scheme, the volume of the weights and feature maps is reduced by a factor of four relative to the original IORN. Taking the A-ORConv layer and the quantized A-ORConv layer as an example, the differences between them are shown in Figure 3. According to Figure 3, the quantized A-ORConv layer adopts more fixed-point arithmetic units with low bit-width than the A-ORConv layer. The adopted network quantization scheme obtains the integer-only parameters through a quantization-awareness training method instead of directly quantizing the trained floating-point parameters. This can effectively reduce the network scale while maintaining high classification accuracy. Moreover, applying this method ensures the consistency of the network prediction between the training phase on a general computing platform and the inference phase on the proposed accelerator.

Algorithm Basics and Mapping Optimization Scheme
The proposed Q-IORN is mainly composed of Quantized A-ORConv layers, activation function layers, pooling layers, Quantized FC layers and a Softmax layer. The algorithm basics and mapping scheme of each layer are discussed in this Section.

Network Quantization
Based on our previous work [25], an efficient quantization scheme is applied to IORN. With this scheme, the floating-point matrices of the network can be converted to integer matrices with low bit-width. The relationship between the two matrices is shown in Equation (3):

$$M_{float} \approx q \cdot M_{int} \tag{3}$$

where $M_{float}$ indicates a floating-point matrix, $M_{int}$ indicates an integer matrix, and $q$ is the quantization interval. Since 8-bit quantization is employed in this work, the quantization interval $q$ in Equation (3) can be computed with:

$$q = \frac{\max\left(|M_{float}|\right)}{127} \tag{4}$$

From Equation (4), calculating the floating-point constant $q$ first requires finding the maximum value of $|M_{float}|$. After network training, the weight matrix is determined. Thus, the quantization interval of the weights $q_w$ can be easily computed by Equation (4). However, the feature map matrices differ when different images are fed into the network, and finding the maximum element for every input image during the inference phase would require a large amount of online computation. To solve this problem, the quantization interval of the input feature maps $q_f$ is obtained through the following two steps: estimating the largest absolute value during training, and calculating $q_f$ based on the estimation. After the quantization interval is determined, the quantized integer matrix can be calculated by the following equation:

$$M_{int} = \mathrm{clamp}\left(\mathrm{round}\left(\frac{M_{float}}{q}\right)\right) \tag{5}$$

where $q$ is the quantization interval, round(·) is used to find the nearest integer, and clamp(·) is used to limit the quantized elements to a specific range. For 8-bit quantization, the range is [−127, 127].
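A minimal sketch of this symmetric 8-bit scheme follows. It assumes that the quantization interval is the largest absolute value divided by 127 and that quantization divides by the interval, rounds to the nearest integer and clamps to [−127, 127], as the description above implies. The function names are hypothetical, and Python's built-in `round` (round-half-to-even) stands in for whatever rounding mode the original training code uses.

```python
def quantization_interval(matrix, bits=8):
    """Return q such that the largest |value| maps to the top integer level."""
    levels = 2 ** (bits - 1) - 1  # 127 for 8-bit quantization
    max_abs = max(abs(v) for row in matrix for v in row)
    if max_abs == 0:
        raise ValueError("an all-zero matrix has no meaningful interval")
    return max_abs / levels


def quantize(matrix, q, bits=8):
    """Divide by q, round to the nearest integer, clamp to the signed range."""
    limit = 2 ** (bits - 1) - 1
    return [[max(-limit, min(limit, round(v / q))) for v in row]
            for row in matrix]


def dequantize(matrix_int, q):
    """Approximate reconstruction of the floating-point matrix."""
    return [[q * v for v in row] for row in matrix_int]


weights = [[127.0, -63.2], [10.7, 0.0]]
q_w = quantization_interval(weights)   # interval derived from the data itself
weights_int = quantize(weights, q_w)
```

For feature maps, `q` would instead come from a maximum estimated during training, so `quantize` is deliberately written to accept an externally supplied interval; values beyond the estimate are clamped rather than overflowing.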

Quantized A-ORConv
The A-ORConv layer is a custom convolutional layer, which replaces general filters with A-ARFs. Since an A-ARF contains $N$ orientation channels, the output feature map $\tilde{Z}$ calculated by it also contains $N$ orientation channels, and the $j$-th channel can be obtained by Equation (6):

$$\tilde{Z}^{j} = \sum_{n} H_{\theta}^{n} * Z^{n} + Bias_{j} \tag{6}$$

where $H_{\theta}$ is the $j$-th rotated version of the materialized filter $H$, $Z$ is the input feature map, $Bias_{j}$ is the bias corresponding to $H_{\theta}$, and $Z^{n}$ and $H_{\theta}^{n}$ are the $n$-th channel of $Z$ and $H_{\theta}$, respectively.
To reduce the consumption of storage resources and computing resources, an efficient quantization scheme is applied to both parameters and feature maps in the network. With Equations (3) and (6), the Quantized A-ORConv is defined as follows:

$$\tilde{Z}^{j} \approx q_w q_f \sum_{n} H_{\theta\,int}^{n} * Z_{int}^{n} + Bias_{j} \tag{7}$$

where $q_w$ and $q_f$ are the quantization intervals of weights and input feature maps, respectively, and $H_{\theta\,int}^{n}$ and $Z_{int}^{n}$ are the quantized $H_{\theta}^{n}$ and the quantized $Z^{n}$, respectively. Since $q_w$ and $q_f$ are both non-zero constants, the bias quantization interval $q_b = q_w q_f$ is applied to Equation (7). Thus, the Quantized A-ORConv can be written as:

$$\tilde{Z}^{j} \approx q_w q_f \left( \sum_{n} H_{\theta\,int}^{n} * Z_{int}^{n} + Bias_{int\,j} \right) \tag{8}$$

where $Bias_{int\,j}$ is $Bias_{j}$ quantized with the interval $q_b$. With this expression, the biases can be represented by integers. Moreover, the floating-point calculations of the A-ORConv layer are mainly converted to low-bit fixed-point calculations. This can greatly relieve the computational pressure when deploying the network on the proposed accelerator. Based on the above-mentioned descriptions, the Quantized A-ORConv can be mapped on FPGA with the following steps. First, 8-bit quantization is performed on weights, biases and input feature maps by Equation (5). Second, only the materialized filters need to be stored in memory, while the immaterialized filters for convolution calculation are produced by logic. The mapping of coordinate rotation can be accomplished by simply exchanging register values. To implement orientation rotation, a suitable storage scheme is indispensable. Finally, the corresponding convolution calculations are completed by using the processed weights, biases and input feature maps. As fixed-point multiplication and fixed-point addition are used in convolution operations, it is necessary to pay attention to the variation of data bit-width during hardware deployment.
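The integer-only accumulation and the bit-width growth mentioned above can be sketched as follows. This is an illustration under stated assumptions rather than the accelerator's datapath: the bias is quantized with q_b = q_w · q_f and is not clamped here, and `quantized_mac` and `accumulator_bits` are hypothetical helper names.

```python
def quantized_mac(w_int, z_int, bias, q_w, q_f):
    """Integer multiply-accumulate followed by one floating-point rescale.

    Returns the integer accumulator and its floating-point interpretation.
    """
    q_b = q_w * q_f                 # shared interval of products and bias
    acc = round(bias / q_b)         # bias expressed as an integer
    for w, z in zip(w_int, z_int):
        acc += w * z                # fixed-point multiply and add only
    return acc, q_b * acc


def accumulator_bits(num_products, operand_bits=8):
    """Signed bit-width that holds the worst-case sum of the products."""
    limit = 2 ** (operand_bits - 1) - 1       # operands lie in [-127, 127]
    max_abs_sum = num_products * limit * limit
    return max_abs_sum.bit_length() + 1       # one extra bit for the sign


acc, value = quantized_mac([2, -3, 4], [10, 20, 30], bias=1.0, q_w=0.5, q_f=0.25)
bits_3x3x64 = accumulator_bits(3 * 3 * 64)    # one 3 x 3 x 64 filter window
```

Under these assumptions a single 8-bit by 8-bit product fits in 15 signed bits, and summing the 576 products of a 3 × 3 × 64 window needs 25 bits before the bias is added, which is why the accumulator has to be wider than the operands.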

Activation Function
To enhance the nonlinear expression ability of DCNNs, various nonlinear activation functions have been widely used, including Sigmoid, Tanh, Rectified Linear Unit (ReLU) and so on. Among them, ReLU has drawn the most attention in recent times because it can efficiently alleviate gradient vanishing with low computational complexity. The neurons of the proposed Q-IORN are also activated by ReLU, as shown in Equation (9):

$$\tilde{Z} = \max(0, Z) \tag{9}$$

where $\tilde{Z}$ and $Z$ in Equation (9) indicate the output and input of the activation function, respectively. According to Equation (9), the hardware implementation of ReLU can be completed by simply checking the sign bit of the input neurons.
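As a small illustration of the sign-bit test, the sketch below models a neuron as a two's-complement word and zeroes it whenever the most significant bit is set. The word width of 8 bits and the function name `relu_sign_bit` are assumptions made for the example; the actual width of the data entering the activation stage is not stated here.

```python
def relu_sign_bit(x, width=8):
    """ReLU on a signed integer, decided only by its sign bit."""
    raw = x & ((1 << width) - 1)          # two's-complement bit pattern
    sign = (raw >> (width - 1)) & 1       # most significant bit
    return 0 if sign else raw             # negative inputs are forced to zero


activated = [relu_sign_bit(v) for v in (-128, -5, 0, 7, 127)]
```

No comparator or subtractor is involved: in hardware the same decision reduces to a multiplexer controlled by one bit.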

Pooling Layer
The pooling layer in DCNNs is mainly used to reduce the scale of the features. Max-pooling and mean-pooling are two common pooling methods. In Q-IORN, max-pooling is employed as pooling layers. The output neuron $\tilde{Z}$ of the 2 × 2 max-pooling layer can be calculated as:

$$\tilde{Z} = \max\left(Z_{xy}\right) \tag{10}$$

where $Z_{xy} = \begin{pmatrix} Z_{00} & Z_{01} \\ Z_{10} & Z_{11} \end{pmatrix}$ is the neuron matrix in the pooling window. From Equation (10), the max-pooling operation can be regarded as finding the maximum value from the pooling window, which can be achieved by hardware comparators.
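The comparator view of 2 × 2 max-pooling can be sketched as a two-stage comparison: one comparison per row of the window, then one comparison between the two row winners. The stride of 2 (non-overlapping windows) is an assumption of this example, and `compare` and `max_pool_2x2` are hypothetical names.

```python
def compare(a, b):
    """Model of a single hardware comparator: forwards the larger input."""
    return a if a >= b else b


def max_pool_2x2(feature_map):
    """2x2 max-pooling with stride 2 over a 2-D list of integers."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - 1, 2):
        out_row = []
        for c in range(0, cols - 1, 2):
            top = compare(feature_map[r][c], feature_map[r][c + 1])
            bottom = compare(feature_map[r + 1][c], feature_map[r + 1][c + 1])
            out_row.append(compare(top, bottom))   # three comparisons per window
        pooled.append(out_row)
    return pooled


pooled = max_pool_2x2([[1, 5, 2, 0],
                       [3, 4, -1, 7],
                       [0, 0, 9, 8],
                       [-2, -3, 6, 5]])
```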

Quantized FC Layer
In classification networks, FC layers are generally used to integrate the features of the convolutional layers. Each output neuron of the FC layer is related to all input neurons, and the $j$-th output neuron $\tilde{Z}_{j}$ can be represented as:

$$\tilde{Z}_{j} = \sum_{i=1}^{N} W_{ij} Z_{i} + Bias_{j} \tag{11}$$

where $N$ is the number of input neurons, $Z_{i}$ is the $i$-th input neuron, $W_{ij}$ is the weight corresponding to $Z_{i}$ when generating the $j$-th output neuron, and $Bias_{j}$ is the bias for the $j$-th output neuron. Since the quantization scheme is applied to the FC layer, the $j$-th output neuron of the quantized FC layer is defined as:

$$\tilde{Z}_{j} \approx q_b \left( \sum_{i=1}^{N} W_{int\,ij} Z_{int\,i} + Bias_{int\,j} \right) \tag{12}$$

where $q_b = q_w q_f$, $q_w$ and $q_f$ are the quantization intervals of weights and input neurons, respectively, and $Z_{int\,i}$ and $W_{int\,ij}$ are the quantized $Z_{i}$ and the quantized $W_{ij}$, respectively. Similar to the quantized convolutional layers, the mapping of the quantized FC layers also relies on fixed-point multiplication and fixed-point addition. Moreover, the volume of parameters in the FC layers is too large to be stored in on-chip memory. Thus, these parameters need to be stored in external memory when deploying the accelerator.

Softmax Layer
The Softmax layer is mostly applied in multi-classification networks, where it converts the output neurons of the last FC layer to probabilities. The Softmax layer is defined as follows:

$$\hat{Z}_{j} = \frac{e^{\tilde{Z}_{j}}}{\sum_{k=1}^{N} e^{\tilde{Z}_{k}}} \tag{13}$$

where $N$ is the number of output neurons of the last FC layer, and $\tilde{Z}_{j}$ and $\hat{Z}_{j}$ are the $j$-th output of the last FC layer and the Softmax layer, respectively. As shown in Equation (13), the Softmax layer contains exponential operations and division operations. These operations are unfriendly to hardware implementation. Because the exponential function is monotonically increasing and all outputs share the same denominator, the largest FC output always yields the largest probability. Thus, the Softmax layer can be replaced by a simple maximum operation. This can reduce the logical resource consumption while obtaining the correct classification prediction.
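The equivalence between the Softmax prediction and a plain maximum search can be checked with a few lines. The sketch below is only a software illustration: `softmax` and `argmax` are hypothetical helper names, and the logits are made-up integer values standing in for the outputs of the last FC layer.

```python
import math


def softmax(logits):
    """Reference Softmax, shifted by the maximum for numerical stability."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest element, found with comparisons only."""
    best = 0
    for i in range(1, len(values)):
        if values[i] > values[best]:
            best = i
    return best


logits = [12, -40, 87, 3, 86]              # illustrative FC outputs
predicted_class = argmax(logits)           # what the hardware would compute
reference_class = argmax(softmax(logits))  # what the Softmax layer would give
```

Both paths select the same class, so the exponentials and the division can be dropped when only the predicted label is needed; the probabilities themselves are then no longer available.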

Hardware Implementation
Based on the detailed algorithm description in Section 3, an efficient hardware accelerator for classifying remote sensing image scenes is proposed. The architecture of the proposed hardware accelerator is shown in Figure 4. In the proposed accelerator, we adopt the Advanced RISC Machine (ARM) as a processing system (PS). The PS is used to load the trained model and original images to programmable logic (PL), in addition to recording image classification results. The PL is composed of an FPGA chip and independent dual-channel DDR. To ensure low power consumption and short time overhead, the FPGA chip is utilized to implement and accelerate all calculations of the Q-IORN. Due to the limited on-chip memory, the dual-channel DDR is adopted to store the trained model and the output feature map of each layer. As illustrated in Figure 4, PL is the main part of the proposed hardware accelerator. To employ efficient network calculations on PL, two efficient storage schemes and a high-performance processing engine are proposed. The details are described in the following sub-sections.

Efficient Storage Scheme
In our accelerator, both off-chip memory and on-chip memory are adopted to store weights and features. The external memory directly connected with PL is utilized to save the model parameters and the output features of each layer. The on-chip memory with low power consumption and fast access speed is used to cache a part of the input data and intermediate results. The on-chip memory is divided into several independent sub-storage units, including input buffer, weight buffer, offset buffer, output buffer and data buffer (for storing intermediate results). The depth of each buffer is configured independently according to the computing requirements. In addition, we propose two efficient data storage schemes to significantly reduce the occupation of storage resources and the time overhead caused by data interaction.
Firstly, a layer-wise storage scheme is proposed to take advantage of the independent dual-channel DDR. By using the ping-pong buffer technique, this scheme avoids the conflict between storing the output feature map and accessing the input feature map in each layer. Moreover, the scheme arranges the storage and access of network weights according to the idle time of the dual-channel DDR. The working state of the dual-channel DDR is changed alternately by the DDR controller, as shown in Figure 5a. For the n-th layer, the output features of the previous layer and the weights of the current layer are read from DDR3-B and DDR3-A, respectively. DDR3-A is also used to receive the output features of the current layer. For the (n + 1)-th layer, the working modes of DDR3-A and DDR3-B are exchanged. This greatly reduces the time overhead between layers and ensures that a full pipeline can be implemented within each layer. The storage space of the dual-channel DDR is arranged accordingly. As shown in Figure 5b, the weights and output features of the odd-numbered convolutional layers are stored in DDR3-A, while the corresponding data of the even-numbered convolutional layers and the weights of the FC layers are stored in DDR3-B. Moreover, the storage space occupied by the output features is reused in this scheme, which helps to reduce the resource consumption caused by additional logical control.
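The alternating use of the two DDR channels for the convolutional layers can be sketched as a small Python function. The function name, the dictionary keys and the 1-based layer index are illustrative assumptions; only the odd/even assignment follows the description above, and the FC layers are deliberately left out.

```python
def ddr_roles(conv_layer_index):
    """Return which DDR channel plays which role for a convolutional layer.

    conv_layer_index is 1-based. Odd layers keep their weights and outputs in
    DDR3-A and read the previous layer's outputs from DDR3-B; even layers swap.
    """
    if conv_layer_index % 2 == 1:
        own, other = "DDR3-A", "DDR3-B"
    else:
        own, other = "DDR3-B", "DDR3-A"
    return {
        "read_input_features": other,   # outputs of the previous layer
        "read_weights": own,            # weights of the current layer
        "write_output_features": own,   # outputs of the current layer
    }

roles_layer_1 = ddr_roles(1)
roles_layer_2 = ddr_roles(2)
```

With this assignment, the channel that a layer writes its output features to is always the channel the next layer reads its input features from, so a layer does not read and write feature maps on the same channel.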

The second efficient storage scheme is proposed based on the data bit-width relationship among the input buffer, the output buffer and the external memory. The proposed data interaction scheme transfers the output matrix through the steps shown in Figure 6. Since the output of the n-th layer in DCNNs serves as the input of the (n + 1)-th layer, we first arrange the n-th layer's output elements in the order of the input elements required by the (n + 1)-th layer. The order is displayed in Figure 6a. Then, in the n-th calculation layer, we store the spliced 512-bit data $a_i$ in the output buffer instead of the 8-bit output element $c_{i,j}$. As shown in Figure 6b, $a_i$ is generated by splicing the elements of the i-th column in Figure 6a. By this means, we effectively reduce the storage resource consumption and match the data bit-width of DDR. Next, in the (n + 1)-th calculation layer, the $a_i$ read from DDR is stored in the input buffer. The relationship between the 32-bit output element $b_{i,k}$ and the 512-bit input element $a_i$ in the input buffer is presented in Figure 6c. Finally, the element $b_{i,k}$ read from the input buffer is grouped and reorganized based on the computing requirements. An example of the data rearrangement process is shown in Figure 6d. Evidently, the order of the obtained input elements in Figure 6d is consistent with that of the elements in Figure 6a. With this scheme, we meet the requirements of pipeline implementation within each layer and significantly reduce the storage resource consumption. Moreover, the contradiction between calculation speed and data storage speed of the first layer in the network is eliminated.

Figure 6. An efficient on-chip storage scheme. (a) The order of the (n + 1)-th layer's input elements (or the n-th layer's output elements). $c_{i,j}$ is the element generated by the j-th calculation unit and stored in the i-th address of the output buffer. N is the maximum storage address of the output elements; (b) The composition of $a_i$ (i.e., the n-th layer's DDR input element or the (n + 1)-th layer's DDR output element); (c) The relationship of the output element $b_{i,k}$ and the input element $a_i$ in the input buffer. $b_{i,k}$ is the element stored in the (16i + k)-th output address of the input buffer; (d) The process of obtaining the required input element order.
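The bit-width relationship behind the second storage scheme (64 elements of 8 bits spliced into one 512-bit word, which is later read out as 16 elements of 32 bits) can be illustrated with a short Python sketch. The function names and the little-endian placement of element j in bits [8j, 8j + 8) are assumptions made for illustration; the actual splicing order is the one defined in Figure 6.

```python
def pack_word(elements):
    """Splice 64 8-bit output elements into one 512-bit word (assumed bit order)."""
    assert len(elements) == 64
    word = 0
    for j, value in enumerate(elements):
        word |= (value & 0xFF) << (8 * j)
    return word

def split_word(word):
    """Read a 512-bit word back as 16 elements of 32 bits each."""
    return [(word >> (32 * k)) & 0xFFFFFFFF for k in range(16)]

def input_buffer_address(i, k):
    """Output address of the k-th 32-bit element of the i-th 512-bit word."""
    return 16 * i + k

word = pack_word(list(range(64)))      # hypothetical 8-bit outputs of 64 PEs
chunks = split_word(word)
```

Each 32-bit element carries four consecutive 8-bit values, so one 512-bit DDR transfer fills 16 consecutive output addresses of the input buffer.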

Processing Engine Architecture
To implement the proposed Q-IORN efficiently, a hardware processing engine architecture is proposed, as shown in Figure 7. The proposed hardware processing engine is mainly composed of a convolutional processing engine and an FC processing engine. The former performs the calculations for the Quantized A-ORConv layer, ReLU layer and max-pooling layer, and the latter is designed to calculate the Quantized FC layer, ReLU layer and Softmax layer. The implementation of the key modules in the hardware processing engine is described as follows.
(1) A-ARF generation module. As shown in Section 3.1, the unmaterialized filters of A-ARF are produced by coordinate rotation and orientation rotation. The mapping order of coordinate rotation and orientation rotation is exchanged in the proposed hardware accelerator. Specifically, the orientation rotation is implemented before the coordinate rotation. This can reduce the logical resource consumption while generating the required unmaterialized filters. Figure 8 depicts the procedure of generating the m-th input channel of A-ARF. The materialized filter H is stored with a custom data arrangement, as shown in Figure 8. In this case, four adjacent channels (i.e., {$H_{4i-3}$, $H_{4i-2}$, $H_{4i-1}$, $H_{4i}$}) of the filter H required for orientation rotation can be provided simultaneously, where $H_i$ represents the i-th input channel of the materialized filter H. The orientation rotation module reads the i-th column in the storage unit and reorders it according to the value of the counters. For example, when j = 1, the sequence read from the storage unit is reordered into a new sequence, which is defined as $F_1$. After the orientation rotation, the m-th input channel of A-ARF is finally generated by rotating the coordinates of each kernel in $F_j$ separately. The detailed process of the coordinate rotation is shown in Figure 1.
In conclusion, the implementation of A-ARF mainly depends on reasonable storage arrangement and efficient logical control. The proposed scheme can provide the convolution kernels required for convolution calculation in real-time, and relieve the on-chip storage pressure caused by storing unmaterialized filters.
(2) Feature padding module. Unlike the implementation in [29], which used additional Block Random Access Memory (BRAM) to store padding data, the proposed hardware accelerator performs the padding operation during the calculation by judging the position of the current calculation patch in the feature map. With this optimization, only a few register resources are required for mapping this module.
(3) Convolution calculation module. During the deployment of the convolution calculation module, the main challenge is to map the dense multiplication and addition operations to FPGA. Thus, we focus on exploring a suitable parallel computing scheme. In the proposed convolution calculation module, a total of 64 PEs are placed. Each PE with nine multipliers adopts an adder tree structure to carry out the calculation in a 3 × 3 window. Based on the network structure and the proposed parallel scheme, our accelerator reads three consecutive rows of the input feature map (with a size of $3 \times N_W \times N_{if}$) and the required convolution kernel (with a size of $3 \times 3 \times N_{if} \times N_{of}$) to the corresponding on-chip buffer each time. Then a row of the output feature map (with a size of $1 \times N_W \times N_{of}$) can be generated according to the obtained input feature map and convolution kernel, as shown in Figure 9. The detailed calculation process for generating the m-th row of the output feature map is illustrated in Figure 10. After obtaining the required input feature map and convolution kernel, we divide them into $N_{if}$ groups. The i-th group contains three rows of the i-th input channel in the input feature map (with a size of $3 \times N_W \times 1$) and the i-th channel of the convolution kernel (with a size of $3 \times 3 \times 1 \times N_{of}$). This operation provides a guarantee for the realization of pipelining within each layer. As 64 PEs are set in this module, the convolution kernels in the i-th group are further grouped by 64 (i.e., the kernels in each group are further divided into $N_{of}/64$ units, and each unit contains 64 3 × 3 windows), as shown in Figure 10. After that, the final outputs are produced by combining data reuse, vector inner product within group and accumulation between groups.
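The grouping described for the convolution calculation module can be sketched in plain Python as follows. This is a behavioural sketch under stated assumptions, not the hardware design: the function name, the nested-list data layout and the loop order are illustrative, the PEs that run in parallel in hardware are modelled by an ordinary loop, and the column padding is decided from the patch position in the spirit of the feature padding module.

```python
def conv_output_row(rows, kernels, num_pe=64):
    """Compute one output row of a 3x3 convolution (stride 1, zero padding).

    rows[c][r][x]       : three consecutive rows (r = 0..2) of input channel c
    kernels[c][o][r][s] : 3x3 window linking input channel c to output channel o
    Returns out[o][x], one row of the output feature map.
    """
    n_if = len(rows)
    n_of = len(kernels[0])
    width = len(rows[0][0])
    out = [[0] * width for _ in range(n_of)]
    for c in range(n_if):                          # one group per input channel
        for base in range(0, n_of, num_pe):        # units of num_pe 3x3 windows
            unit = range(base, min(base + num_pe, n_of))
            for x in range(width):
                # Padding is decided from the patch position, not stored in memory.
                patch = [
                    rows[c][r][x + s - 1] if 0 <= x + s - 1 < width else 0
                    for r in range(3) for s in range(3)
                ]
                for o in unit:                     # the PEs of one unit reuse the patch
                    window = [kernels[c][o][r][s] for r in range(3) for s in range(3)]
                    # inner product within the group, accumulation between groups
                    out[o][x] += sum(p * w for p, w in zip(patch, window))
    return out

# Hypothetical toy sizes: 2 input channels, 3 output channels, row width 4.
toy_rows = [[[1] * 4 for _ in range(3)] for _ in range(2)]
toy_kernels = [[[[1] * 3 for _ in range(3)] for _ in range(3)] for _ in range(2)]
toy_out = conv_output_row(toy_rows, toy_kernels, num_pe=2)
```

The same input patch is shared by every PE of a unit, which is the data reuse mentioned above, and the partial sums of the $N_{if}$ groups are accumulated into the same output row.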

(4) FC calculation module. The mapping of the FC calculation module is mainly limited by the access of weights. Since DDR can only provide one 512-bit data word per clock cycle, we set up a PE with 64 multipliers for FC calculation. With this scheme, we can make full use of the bandwidth of DDR. The detailed calculation process of the FC layer is shown in Figure 11. To match the bit-width of the DDR, the input neurons and weights are processed into a custom form through flattening and splicing (the input neurons are processed online, and the weights are processed offline). In this way, the input neurons and corresponding weights are divided into $N_i/64$ groups, and each group has a 512-bit input and $N_o$ 512-bit weights. The input neurons are reused during the calculation process, which effectively reduces the access to on-chip storage. The weights are read from the external memory in real-time for calculation. Similar to the convolution calculation module, the final output neurons are also generated by applying the vector inner product within group and the accumulation between groups.
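The grouped FC calculation can be sketched in Python in the same spirit. The function name and data layout are illustrative assumptions; each slice of 64 values stands for one 512-bit word of 8-bit data, and the 64 multipliers of the PE are modelled by a 64-element inner product.

```python
def fc_layer(inputs, weights, lanes=64):
    """Grouped fully connected layer on integer data.

    inputs     : list of N_i input neurons
    weights[o] : list of N_i weights for output neuron o (N_o rows in total)
    """
    n_i, n_o = len(inputs), len(weights)
    assert n_i % lanes == 0, "N_i is assumed to be a multiple of the lane count"
    acc = [0] * n_o
    for g in range(n_i // lanes):                      # N_i / 64 groups
        x_word = inputs[g * lanes:(g + 1) * lanes]     # one input word, reused N_o times
        for o in range(n_o):
            w_word = weights[o][g * lanes:(g + 1) * lanes]  # one weight word per cycle
            acc[o] += sum(x * w for x, w in zip(x_word, w_word))
    return acc

toy_inputs = list(range(128))                 # hypothetical values, N_i = 128
toy_weights = [[1] * 128, [2] * 128]          # N_o = 2
toy_outputs = fc_layer(toy_inputs, toy_weights)
```

Each input word is fetched once and reused for all $N_o$ output neurons, whereas every weight word is consumed exactly once, which is why the weight stream from DDR sets the pace of this module.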
(5) Fusion calculation module. Based on our previous work [25], we propose a fused layer to apply the quantization operation to the proposed processing engine. The proposed fused layer is obtained by merging the n-th de-quantized layer and the (n + 1)-th quantized layer and is implemented before ReLU. From Equations (3) and (5), the implementations of the de-quantization and quantization rely on a multiplication operation and a division operation, respectively. For the fused layer, only a multiplication operation is required, which reduces the consumption of computing resources. Figure 12 depicts the process of producing the (n + 1)-th convolutional layer's input element $e_{0,0}$ in general network inference and optimized network inference. As shown in Figure 12, this optimized implementation scheme also reduces the input bit-width of the ReLU module and max-pooling module from 32-bit to 8-bit, which effectively reduces the consumption of storage resources.

Figure 12. The process of generating element $e_{0,0}$ in general network inference and optimized network inference. The figure shows the output of the convolution module, the outputs of the de-quantization module (the n-th layer), ReLU module and max-pooling module in general network inference, and the outputs of the fusion module and ReLU module in optimized network inference. 'Quant' and 'De-Quant' refer to 'quantized' and 'de-quantized', respectively.
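Why the fused layer needs only one multiplication, and why it can be placed before ReLU and max-pooling, can be illustrated with the following Python sketch. Equations (3) and (5) are not reproduced in this section, so the sketch assumes a generic symmetric scheme (real value = scale × integer, clipping to [-127, 127], rounding with Python's `round`); the scale names `s_in`, `s_w`, `s_next` and all numeric values are hypothetical.

```python
def clip_int8(value):
    """Clip to the symmetric 8-bit range assumed in this sketch."""
    return max(-127, min(127, value))

def general_path(acc, s_in, s_w, s_next):
    """De-quantize, ReLU, 1x2 max-pooling, then quantize for the next layer."""
    real = [a * (s_in * s_w) for a in acc]                   # multiplication
    real = [max(0.0, r) for r in real]                       # ReLU on wide data
    pooled = [max(real[i], real[i + 1]) for i in range(0, len(real), 2)]
    return [clip_int8(round(p / s_next)) for p in pooled]    # division

def fused_path(acc, s_in, s_w, s_next):
    """One multiplication by a precomputed factor, then 8-bit ReLU and pooling."""
    m = (s_in * s_w) / s_next                                # computed offline
    q = [clip_int8(round(a * m)) for a in acc]               # fused layer
    q = [max(0, v) for v in q]                               # ReLU on 8-bit data
    return [max(q[i], q[i + 1]) for i in range(0, len(q), 2)]

acc = [-300, 45, 1000, 17, 5000, -8, 0, 96]   # hypothetical 32-bit convolution outputs
scales = dict(s_in=0.5, s_w=0.125, s_next=2.0)
general = general_path(acc, **scales)
fused = fused_path(acc, **scales)
```

Rounding and clipping are monotonic and map zero to zero, so they commute with ReLU and with the maximum taken by pooling; this is what allows quantization to move ahead of both modules. The scales here are powers of two so that both paths are exact in floating point; with arbitrary scales the two paths may differ by one unit at rounding boundaries.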

Experiments and Results
In this section, extensive experiments were conducted to evaluate the performance of the proposed Q-IORN and hardware accelerator. The evaluation experiments were divided into two parts. First, the proposed Q-IORN was trained and tested on a publicly available remote sensing image scene dataset to evaluate its classification accuracy and obtain the integer-only model for FPGA implementation. Then, we implemented the proposed hardware accelerator on FPGA and tested its processing performance. The experimental settings and detailed experimental results are provided below.

Dataset Description
The proposed Q-IORN and hardware accelerator were evaluated on the NWPU-RESISC45 dataset [1], which is a large-scale remote sensing scene classification dataset. This dataset covers 45 scene classes and contains 700 images per class. The images in the dataset have a uniform size of 256 × 256. Sample images of each class are shown in Figure 13. In this work, we randomly selected 10% of the images in the NWPU-RESISC45 dataset for training, and the rest were used for testing. Data augmentation techniques, including random cropping, random horizontal flip and normalization, were applied in the training phase. In the testing phase, only center cropping and normalization were applied. The size of the cropped image was 224 × 224.

Figure 13. Sample images of each class in the NWPU-RESISC45 dataset.

Experimental Procedure
To test the performance of the proposed algorithm and obtain the integer-only model for FPGA implementation, we trained the Q-IORN with the quantization-aware training method on an NVIDIA Titan Xp GPU with PyTorch 0.4. The 8-bit quantized network Q-IORN was trained for 200 epochs with a starting learning rate of 0.002. The learning rate was decayed by a factor of 0.5 every 50 epochs. The weight parameters were trained by a stochastic gradient descent (SGD) optimizer with a weight decay of $5 \times 10^{-4}$ and a momentum of 0.9. The batch size was set to 64.
To verify the performance of the proposed hardware accelerator, we described the proposed hardware architecture with Very-High-Speed Integrated Circuit Hardware Description Language (VHDL) and synthesized it with Xilinx Vivado 2017.2. The Xilinx VC709 development board with a Xilinx XC7VX690T FPGA and two DDR3 memory modules was utilized as the implementation platform. In the proposed Q-IORN, the channel numbers of the feature map in each layer are multiples of 64, and the maximum feature size is 224 × 224. Thus, the main modules of the convolutional processing engine were configured as follows: the convolution calculation module placed 64 convolution PEs, the input buffer comprised six BRAMs (three BRAMs with a depth of 512 were used to cache the original image, and the other three with a depth of 224 were used to cache input feature maps), the weight buffer was composed of 64 BRAMs with a depth of 1024, and the output buffer consisted of a BRAM with a depth of 224. The maximum number of output neurons in the FC layers is 4096. Thus, an FC PE and a data buffer composed of a First-In First-Out (FIFO) memory with a depth of 4096 were employed in the FC processing engine. The performance of the proposed hardware accelerator was compared with CPU, GPU and several recent advanced FPGA-based implementations.

Performance Evaluation of the Q-IORN
In this section, we evaluate the performance of the proposed Q-IORN. We took the original IORN as the baseline, and the comparison results are summarized in Table 1. As shown in Table 1, the classification accuracy of the proposed Q-IORN is 88.31%. Compared with the baseline, the Q-IORN has almost no loss in classification accuracy, which shows that symmetric quantization and the S-ORAlign operation have negligible effects on classification accuracy. Moreover, the model size of the proposed Q-IORN is almost 4× smaller than that of the baseline. In general, the proposed Q-IORN efficiently reduces the scale of the network while maintaining comparable classification accuracy.

Performance Evaluation of the Proposed Accelerator
We deployed the proposed hardware architecture on the FPGA to demonstrate its effectiveness. In the proposed architecture, Digital Signal Processing (DSP) slices implement the multiplication operations, both fixed-point and floating-point. The detailed DSP occupation reported by the Vivado design suite is shown in Table 2. The fusion calculation module uses 65 floating-point multipliers, each built from two DSPs. The convolutional processing engine and the FC processing engine together use 640 fixed-point multipliers, each built from one DSP. Thus, the accelerator uses 65 × 2 + 640 × 1 = 770 DSPs in total, which is consistent with the DSP utilization shown in Table 2. The Look-Up Tables (LUTs) are used by the control units, buffer units, calculation units, data splicing units and other logic. The BRAMs mainly store offsets, weights and feature maps. The hardware resource utilization of the proposed accelerator is summarized in Table 3. As shown in Table 3, the accelerator uses 770 DSPs, 404.5 BRAMs, 116,742 Flip-Flops (FFs) and 73,320 LUTs. The utilization of each resource is below 30%. These results illustrate that the proposed hardware architecture can be implemented on a platform with limited resources.
To show the effectiveness of the proposed hardware accelerator, we compared it with off-the-shelf CPU and GPU platforms. The CPU platform is an Intel Xeon E5-2697 v4 CPU @ 2.3 GHz with 224 GB DDR4 DRAM. The GPU platform is an NVIDIA TITAN Xp GPU with 12 GB GDDR5X memory. The performance comparison of the CPU, the GPU and the proposed hardware accelerator is shown in Table 4, where 'GOP/s' stands for Giga-Operations Per Second.
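The resource figures reported for Tables 2 and 3 can be cross-checked with a few lines of arithmetic. The sketch below recomputes the DSP total from the multiplier counts and estimates the utilization percentages. The device capacities in `available` are not taken from this paper; they are the XC7VX690T totals as recalled from the Xilinx Virtex-7 product table and should be treated as an assumption to verify against the data sheet.

```python
# Multiplier counts and DSPs per multiplier, as stated in the text.
FLOAT_MULTIPLIERS, DSPS_PER_FLOAT = 65, 2
FIXED_MULTIPLIERS, DSPS_PER_FIXED = 640, 1

total_dsps = (FLOAT_MULTIPLIERS * DSPS_PER_FLOAT
              + FIXED_MULTIPLIERS * DSPS_PER_FIXED)

# Resources used by the accelerator, as reported for Table 3.
used = {"DSP": 770, "BRAM": 404.5, "FF": 116_742, "LUT": 73_320}

# ASSUMPTION: XC7VX690T capacities recalled from the Virtex-7 product
# table, not from the paper. Verify before relying on them.
available = {"DSP": 3_600, "BRAM": 1_470, "FF": 866_400, "LUT": 433_200}

utilization = {name: 100.0 * used[name] / available[name] for name in used}

print("total DSPs:", total_dsps)
for name, percent in utilization.items():
    print(f"{name}: {used[name]} of {available[name]} ({percent:.1f}%)")
```

Under these assumed capacities, every resource stays below the 30% bound stated above, with BRAM the most heavily used.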
Based on the data types used in Q-IORN, fixed-point arithmetic units were mainly adopted on these platforms when mapping the network. Since the quantization-aware training method applied in this paper effectively simulates the hardware behavior, the proposed accelerator has the same classification accuracy as the CPU and GPU. The Thermal Design Power (TDP) values of the CPU and GPU are 145 W and 250 W, respectively. The power report supplied by the Vivado design suite shows that the total power of the proposed accelerator is only 6.32 W. As shown in Table 4, the GPU has a clear advantage in throughput, which is 26.49 times that of the CPU and 3.44 times that of the proposed accelerator. The proposed accelerator, however, achieves the best energy efficiency among all platforms: 174.53 times that of the CPU and 11.51 times that of the GPU.
Recent FPGA-based implementations were also compared in this section. The comparison results of the proposed hardware accelerator and several recent FPGA-based implementations are presented in Table 5, where 'conv' and 'all' refer to the performance of the convolutional layers and of the overall system, respectively. The experimental results show that the overall throughput of the proposed accelerator reaches 209.60 GOP/s at 200 MHz, and the throughput of the convolutional layers is higher, at 224.43 GOP/s. Since the performance of an FPGA-based implementation is closely related to the utilization of on-chip resources, evaluating performance by throughput alone is one-sided. Therefore, the energy efficiency of the different FPGA-based implementations was also taken into consideration. As shown in Table 5, the energy efficiency of the proposed accelerator is 33.16 GOP/s/W, which is higher than that of the previous works.
Overall, the proposed hardware accelerator strikes a better trade-off between performance and power consumption than previous FPGA-based implementations.
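The energy efficiency figures follow directly from throughput divided by power. The sketch below recomputes the accelerator's value and, as a rough consistency check, estimates the CPU and GPU values from the throughput ratios quoted above. It assumes that the CPU and GPU energy efficiencies are computed against their TDP values; the derived CPU and GPU throughputs are estimates from rounded ratios, not figures taken from Table 4.

```python
# Values stated in the text.
ACCEL_THROUGHPUT_GOPS = 209.60
ACCEL_POWER_W = 6.32
CPU_TDP_W, GPU_TDP_W = 145.0, 250.0
GPU_OVER_ACCEL_THROUGHPUT = 3.44
GPU_OVER_CPU_THROUGHPUT = 26.49

# Energy efficiency = throughput / power, in GOP/s/W.
accel_efficiency = ACCEL_THROUGHPUT_GOPS / ACCEL_POWER_W

# ESTIMATES: throughputs implied by the rounded ratios above.
gpu_throughput = GPU_OVER_ACCEL_THROUGHPUT * ACCEL_THROUGHPUT_GOPS
cpu_throughput = gpu_throughput / GPU_OVER_CPU_THROUGHPUT

# ASSUMPTION: CPU and GPU efficiency are measured against TDP.
gpu_efficiency = gpu_throughput / GPU_TDP_W
cpu_efficiency = cpu_throughput / CPU_TDP_W

gain_over_gpu = accel_efficiency / gpu_efficiency
gain_over_cpu = accel_efficiency / cpu_efficiency

print(f"accelerator: {accel_efficiency:.2f} GOP/s/W")
print(f"gain over GPU: {gain_over_gpu:.2f}x")
print(f"gain over CPU: {gain_over_cpu:.2f}x")
```

The accelerator value matches the reported 33.16 GOP/s/W. The estimated gains land within about 2% of the reported 11.51 and 174.53, with the small gap attributable to rounding in the quoted throughput ratios.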

Conclusions
In this paper, we developed a DCNN accelerator for classifying remote sensing scene images under power-limited conditions. First, we combined the A-ARF algorithm with the symmetric quantization algorithm to compress the DCNN model. With this scheme, the parameters, feature maps and computational load of the network are greatly reduced, while the proposed Q-IORN has almost no classification accuracy loss compared with the IOR4-VGG16 network. Then, an efficient hardware accelerator was proposed for Q-IORN. In the proposed accelerator, several storage schemes tailored to the characteristics of the off-chip memory and on-chip buffers reduce the storage resource occupation and time overhead. Two processing engines that exploit data reuse and computing parallelism meet the high-performance computing requirements of the convolutional layers and FC layers. The proposed accelerator was implemented on a Xilinx VC709 development board. The experimental results show that the throughput and energy efficiency of the proposed accelerator are 209.60 GOP/s and 33.16 GOP/s/W, respectively. Compared with the CPU and GPU platforms, the energy efficiency of the proposed accelerator is 174.53 times and 11.51 times higher, respectively. The comparison with several recent FPGA-based implementations further shows that the proposed accelerator achieves a higher energy efficiency.