Modified Deep Reinforcement Learning with Efficient Convolution Feature for Small Target Detection in VHR Remote Sensing Imagery

Abstract: Small object detection in very-high-resolution (VHR) optical remote sensing images is a fundamental but challenging problem due to its latent complexities. To tackle this problem, the MdrlEcf model is proposed by modifying deep reinforcement learning (DRL) and extracting efficient convolution features. First, an efficient attention network is constructed by introducing local attention into the convolutional neural network. By effectively combining the shallow low-level features, which carry rich detail descriptions, with the high-level features, which carry more semantic meaning, efficient convolution features can be obtained. In this way, the attention network can effectively enhance the extraction of small-target features while suppressing useless features. Second, the efficient feature map is sent to the region proposal network constructed by the modified DRL. Using the modified reward function, the model can accumulate more rewards to guide the search process and potentially generate effective subsequent proposals and classification scores. It can also improve the localization and classification of small targets. Quantitative and qualitative experiments are conducted to verify the detection performance of different models. The results show that the proposed MdrlEcf can effectively and accurately locate and identify the related small objects.


Introduction
VHR remote sensing imagery (RSI) is developing quickly thanks to the wide exploration of sensor technologies and aerospace research. Its typical resolution is a 3-4 m ground sample distance (GSD), and the objects in VHR images usually have diverse shapes and arbitrary orientations. With the advantages of large-scale images and multi-angle data, VHR remote sensing images have supported an increasingly wide range of applications, including resource exploration, urban planning, natural disaster assessment, and military target detection and recognition, and their application fields [1,2] are still expanding. Object detection in RSI aims to determine whether a given aerial or satellite image contains one or more objects belonging to the classes of interest and to determine the position of each predicted object. Different from natural images, objects in VHR RSI such as cars have a relatively small spatial extent (usually smaller than 15 pixels [3]) compared with other, larger satellite objects. The much smaller objects [4] and complex background content [5] greatly limit detection performance and pose severe challenges for the applications [6,7].
In the literature, various models have been proposed to effectively detect objects of interest. Traditional methods mainly deploy handcrafted features and shallow machine learning models, which are easy to overfit and usually require a large number of calculations. Convolutional neural networks (CNNs) can automatically and powerfully learn and extract features from data, and they offer better robustness and higher detection accuracies [8][9][10][11]. They have provided great improvements and achieved much better accuracies for object detection than traditional approaches [12][13][14][15][16][17][18][19][20]. Consequently, traditional detection models have gradually been replaced by deep learning-based methods.
Considering the feature maps obtained by a CNN, the deep high-level maps have greatly reduced resolutions, which may harm their capacity for high-quality object localization due to the loss of detail information, while the shallow low-level maps have high-resolution features but reduced representational capacity for object recognition. Thus, most CNN-based detectors show poor performance when detecting small objects, mainly due to the coarseness of the obtained deep feature maps [21,22]. Ignoring the low-level features has greatly limited the performance of CNN-based detectors. Nowadays, the attention mechanism, which draws on human attention, has been proven to be a potential means of enhancing network performance [23,24]; it benefits from the human brain's visual mechanism of quickly filtering high-value knowledge out of a large amount of information. Constructed by integrating attention into a deep neural network, the attention network has shown satisfactory performance. As presented in FPN [23] and PANet [24], the performance of object detection models has been effectively enhanced by introducing attention into the CNN and generating integrated features. Both works suggest that the high-level and low-level maps are complementary and that an attention network can be utilized to learn complementary features for object detection [23,24]. Therefore, it is important and natural to develop an attention network that effectively extracts object features and significantly enhances detection performance.
Generally speaking, object detection models need to accomplish two tasks: classification and localization. If the two are not properly balanced, performance will be suboptimal because one task may be compromised. The imbalance between classification and localization has become an increasingly important issue that limits detection performance. By integrating deep learning's strong understanding of visual perception problems with the decision-making ability of reinforcement learning, deep reinforcement learning is becoming a promising framework for object detection with satisfying performance [25][26][27]. Its success can be attributed to the balance between classification and localization for object detection. Additionally, DRL can enhance accuracy and, at the same time, reduce the various costs associated with the usage of VHR images. Although DRL can effectively exploit the strong understanding ability of deep learning and the decision-making ability of reinforcement learning, problems remain for the detection of remote sensing images, such as the premature termination of the search and the sparse rewards obtained from it.
To address the above issues, we propose a novel small object detection model for VHR remote sensing images that exploits deep reinforcement learning and efficient convolution feature learning (MdrlEcf). First, local attention is added to the CNN to construct the attention network and obtain efficient convolution features, which integrate the low-level features of content with the high-level features of semantic meaning. By effectively depicting the images, more discriminative features are generated for small targets in different positions; that is, the network can selectively enhance the detail-rich features to improve detection accuracy. Second, a modified DRL with newly designed reward functions is exploited to effectively detect small objects. It can accumulate more rewards to guide the search process and potentially generate effective subsequent proposals and classification scores. For VHR remote sensing images, experimental results show that the proposed MdrlEcf can effectively improve the quantitative and qualitative results of small object detection.
The rest of this paper is organized as follows. Section 2 introduces specific algorithms and frameworks related to this paper. An overview of the proposed model is presented in Section 3. Section 4 briefly introduces the experimental setup and results. Section 5 is the conclusion.

Related Works
Object Detection. Object detection models based on deep learning can be roughly divided into two categories, anchor-based algorithms and anchor-free algorithms, depending on whether anchor points are utilized to extract the region proposals. Anchor-based algorithms include the popular two-stage detection models R-CNN [12], Fast R-CNN [13], Faster R-CNN [14], etc., and the one-stage detection models YOLOv2 [15], SSD [16], etc. Integrating region proposals, R-CNN explored a high-capacity CNN trained with bottom-up candidate boxes to locate and segment objects [12]. Following the idea of R-CNN, Girshick [13] proposed Fast R-CNN to improve training and testing speed while increasing detection accuracy. Ren et al. [14] first introduced a region proposal network (RPN), which generates high-quality region proposals to tell the unified network where to look; it marked a huge step in the development of object detection. The one-stage detection model SSD [16] predicts category scores and box offsets for a fixed set of default bounding boxes using small convolutional filters. Instead of using anchor points, anchor-free approaches describe the bounding box in other ways, e.g., YOLO [17], CornerNet [18], ExtremeNet [19], and FCOS [20]. YOLO [17] is a representative anchor-free algorithm that directly predicts bounding boxes and class probabilities from the full image in one evaluation. Law et al. [18] regarded the bounding box as a pair of key points (the upper-left and lower-right corners of the target) and exploited a single neural network to perform the detection. Based on this idea, Zhou et al. [19] presented a novel object detection framework that detects the four bottom-up extreme points of the target (the top-most, bottom-most, left-most, and right-most points).
As an anchor-free and proposal-free algorithm, FCOS [20] solves object detection in a per-pixel prediction fashion, analogous to semantic segmentation.
Attention Networks. Attention has been proven to be a potential means of enhancing the performance of deep neural networks [21,22] because it can utilize multi-level features to generate discriminative feature representations. By integrating attention into a deep neural network, an attention network can be constructed that achieves satisfactory results. FPN [23] proposed lateral connections to enhance the semantic characteristics of shallow layers via a top-down pathway and has shown huge improvements as a generic feature extractor. After that, PANet [24] explored a bottom-up pathway to further enhance the low-level information of deep layers. Hu et al. [28] focused on channels and used attention to explicitly model the interdependence between channels in a CNN to enhance network performance. Based on [28], Wang et al. [29] proposed efficient channel attention via a fast one-dimensional convolution, which involves only a handful of parameters while bringing clear performance gains. Currently, attention networks have been widely utilized in various tasks, such as natural language processing (NLP) [30], image classification [31], speech recognition [32], and facial recognition [33], achieving remarkable results.
Deep Reinforcement Learning in Object Detection. With the development of deep reinforcement learning, object detection has become a new task in this field. In [34], Bellver et al. proposed a hierarchical deep reinforcement learning object detection framework characterized by a top-down exploration of a hierarchy of regions guided by an intelligent agent. Utilizing a multi-agent algorithm, Kong et al. [35] proposed a joint search algorithm based on collaborative deep reinforcement learning to learn the optimal strategy for target localization. To reduce the high computational and monetary costs, a reinforcement learning agent was proposed for large images [36] that adaptively selects the spatial resolution of each image. In [37], a novel and effective detector was proposed by integrating bottom-up single-shot convolutional neural networks with a top-down operating strategy.

Methods
Exploring deep reinforcement learning and an attention network, the MdrlEcf model is proposed for small object detection in VHR optical remote sensing images. First, a CNN (here, VGG16) is utilized as the main network of the attention network for feature learning. To integrate the detail features of the shallow layers into the semantic features of the deep layers, local attention is introduced into VGG16; we conduct experiments in Section 4.2 to determine where the local attention should be added. In this way, the efficient convolution characteristics of small targets can be fully depicted. Then, the integrated convolution features are delivered to the modified DRL with the proposed reward function. By accumulating many more rewards during the search process, the modified DRL can potentially generate effective subsequent proposals and classification scores; that is, it can, to a certain degree, balance localization and classification for small object detection. Finally, the prediction bounding boxes and classification results are output. The overall framework of the proposed MdrlEcf model is shown in Figure 1.

Efficient Convolution Feature Learning
As presented in [28], the SE module has been widely explored due to its advantage of improving network performance. Therefore, it is chosen as the basis for the local attention of our attention network. In the SE module, the squeeze part globally exploits contextual information, and the excitation part exploits channel-wise dependencies. The purple rectangle of Figure 1 presents our attention network, where local attention is explored to allow VGG16 to selectively enhance the low-level features with rich detail information. In this way, the feature map obtained by our attention network efficiently contains the low-level informative features and the high-level semantic information simultaneously. Additionally, it adapts dynamically to the input, which helps enhance the generalization ability of the proposed model.
Let X represent an input VHR image, and let the map F_1 denote the shallow features generated by the first block of VGG16. Two 3 × 3 convolutional layers with 64 channels and one max-pooling layer are combined to construct the first block, in which the local detail information can be well perceived by the convolutional layers.
The local attention is added between the first block and the second block according to the results in Section 4.2. The feature map F_1, with size W × H × C, is taken as the input of the local attention. Global average pooling (GAP) is performed to eliminate the spatial interference between features and generate a 1 × 1 × C weight z.
Here the parameter C is set to 64. Then, a fully connected (FC) layer is utilized to incorporate the learned weights and obtain the final weight s. The local attention is formulated as follows:

F̃_1 = s ⊙ F_1,

where F̃_1 is the re-weighted feature map and ⊙ denotes the element-wise product. The overall structure of the local attention is presented in Figure 2. As a soft attention, the utilized local attention pays more attention to informative areas or channels. This means that the local attention shares a latent aim with the modified DRL described in Section 3.2: generating effective subsequent proposals and classification scores. Its parameters can be learned through forward propagation, gradient computation, and backward propagation.
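As a concrete illustration, the squeeze (GAP), excitation (FC plus sigmoid), and re-weighting steps can be sketched in NumPy as follows. The single FC layer and the input shapes are simplifying assumptions of this sketch; a full SE block typically uses two FC layers with a reduction ratio.

```python
import numpy as np

def local_attention(feature_map, fc_weight, fc_bias):
    """SE-style channel re-weighting sketch for a W x H x C feature map.

    feature_map: (W, H, C) shallow features from the first VGG16 block
    fc_weight:   (C, C) weights of the illustrative FC layer (assumption)
    fc_bias:     (C,)   bias of the illustrative FC layer (assumption)
    """
    # Squeeze: global average pooling -> one weight per channel (1 x 1 x C)
    z = feature_map.mean(axis=(0, 1))                       # shape (C,)
    # Excitation: FC + logistic sigmoid -> channel weights s in (0, 1)
    s = 1.0 / (1.0 + np.exp(-(fc_weight @ z + fc_bias)))    # shape (C,)
    # Re-weight: element-wise product, broadcast over W and H
    return feature_map * s

rng = np.random.default_rng(0)
F = rng.standard_normal((8, 8, 64))                         # C = 64 as in the paper
W_fc = rng.standard_normal((64, 64)) * 0.1
b_fc = np.zeros(64)
F_tilde = local_attention(F, W_fc, b_fc)
print(F_tilde.shape)                                        # (8, 8, 64)
```

Because each channel weight lies in (0, 1), the module can only attenuate channels, never amplify them, which matches the "selective enhancement" role of soft attention.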
After re-weighting, the new feature map F̃_1 is sent to the four remaining blocks of the attention network. In detail, the second block includes two 3 × 3 convolutional layers with 128 channels and one max-pooling layer; the third block includes three 3 × 3 convolutional layers with 256 channels and one max-pooling layer; and the fourth and fifth blocks each include three 3 × 3 convolutional layers with 512 channels and one max-pooling layer. Through these blocks, the low-level informative features are selectively combined and enhanced, and then integrated with the deep features, which are usually obtained by the fully connected layer at the end of the network. Finally, the attention network outputs the efficient convolution feature map F.

Deep Reinforcement Learning with the Modified Reward Function
Exploring deep reinforcement learning is a new research direction for solving object detection problems. When performing DRL, the agent receives input data from its environment and estimates how good or bad the taken actions are according to a reward function. Normally, the reward function assigns a numerical value to each action performed from a given state, and the actions taken aim to achieve a predetermined goal. In addition, the agent reaches a new state after performing an action. The framework of DRL in the proposed method is displayed in the orange rectangle of Figure 1.
Let S represent the state space, A the action space, and R the reward set. There are two types of actions in A (the fixate action a_f and the done action a_d), corresponding to the two rewards in R (the fixate reward r_f and the done reward r_d). The feature map F is input into the modified DRL and forms an initial state s_0. Then, in each time slot t, the agent selects the best action a_t to output by utilizing the policy π(a | s), a stochastic strategy that maps states to actions in the policy center. The policy is formulated as follows:

π(a_t | s_t) = p = σ(w⊤[s_t; d]),

where p represents the probability map of the new position z_{t+1}; σ(·) is the logistic sigmoid function; w is the trainable weight vector; d ∈ R^n is a vector generated by the first done action; and [· ; ·] denotes concatenation.
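To make the decision step concrete, the following sketch samples one action from a sigmoid-gated policy of this general shape. The concatenated state/done-vector input and the single weight vector are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def select_action(state, done_vec, w, rng):
    """One decision step of a sketched policy pi(a | s).

    The probability of the done action is modelled as sigmoid(w . [state; done_vec])
    (an assumption of this sketch); otherwise the agent fixates a new location.
    """
    x = np.concatenate([state, done_vec])
    p_done = sigmoid(w @ x)
    action = "done" if rng.random() < p_done else "fixate"
    return action, p_done

rng = np.random.default_rng(42)
state = rng.standard_normal(16)      # current state s_t (toy dimensionality)
done_vec = np.zeros(4)               # vector produced by the first done action
w = rng.standard_normal(20) * 0.1    # trainable weight vector
action, p = select_action(state, done_vec, w, rng)
print(action, float(p))
```

In a trained model, the weight vector would be learned from the accumulated rewards so that the agent stops searching once further fixations are unlikely to pay off.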
When the agent chooses the fixate action a_f at t > 0, the new location z_t is visited and the fixate reward r_f is obtained. Meanwhile, the regions of interest (RoIs) are updated with the areas centered at z_t. Let IoU_t^i represent the intersection-over-union (IoU) of the i-th ground-truth instance given by the RoIs of the specific time slot t, and let IoU_{0:t-1}^i be defined as the maximum IoU between the predicted bounding boxes and the ground-truth bounding box of the i-th instance over the time slots 0, …, t − 1. The modified fixate reward in time slot t rewards the IoU improvement of any instance whose IoU_t^i exceeds both the threshold τ and its previous best IoU_{0:t-1}^i, and otherwise assigns a small negative reward −β (with β set to 0.075 [26]). The fixate reward reflects the quality of the selected location z_t. After the fixate reward is obtained, all the corresponding RoIs are sent to the RoI pooling module, which then performs the classification and bounding-box offset prediction for the specific class. The predictions are mapped to their locations and added to the history h_c of the specific class. The reward and the history h_c are combined with the original state to form a new state s_{t+1}. At time t + 1, the agent decides in the policy center whether to take new actions according to the new state s_{t+1}. If the agent decides to perform the done action a_d, it stops searching the feature map and collects all selected predictions in the entire trajectory for prediction and classification. The agent also gains a done reward r_d reflecting the quality of the search, which can be utilized to guide the next search process. The modified done reward r_d is computed from the final IoUs, IoU_{0:T}^i, of all instances at the terminal time slot T.
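The modified fixate reward described above can be sketched as an IoU-improvement reward. This is only a plausible illustration of its general shape: the per-instance improvement term and the use of −β as a per-fixation penalty are assumptions of the sketch, not the paper's exact formula.

```python
def fixate_reward(ious_now, ious_best, tau=0.45, beta=0.075):
    """Sketch of an IoU-improvement fixate reward (illustrative only).

    ious_now:  IoU of each ground-truth instance with the current RoIs
    ious_best: best IoU achieved for each instance in time slots 0..t-1
    Returns the reward and the updated per-instance best IoUs.
    """
    reward = -beta                       # small negative cost per fixation
    new_best = []
    for iou_t, iou_prev in zip(ious_now, ious_best):
        if iou_t > max(tau, iou_prev):   # instance newly covered, and better
            reward += iou_t - iou_prev   # reward the IoU improvement
        new_best.append(max(iou_t, iou_prev))
    return reward, new_best

# Instance 0 improves from 0.5 to 0.6 (above tau = 0.45); instance 1 stays below tau.
r, best = fixate_reward([0.6, 0.2], [0.5, 0.1], tau=0.45, beta=0.075)
print(round(r, 3), best)
```

Under a reward of this shape, uninformative fixations accumulate only the −β penalties, which is exactly the sparse-reward behaviour the modified function is designed to mitigate.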
According to the rewards obtained during the search process, the agent can effectively optimize the search over the next feature map. Finally, the prediction and classification results are output. The overall pseudocode of the modified deep reinforcement learning is shown in Algorithm 1.

Algorithm 1: while the done action has not been selected, the agent fixates new locations and generates the region proposals; once the done action is chosen, the agent is stopped, the prediction boxes are calculated and drawn based on the region proposals, and the classification results are obtained.

Experiments
In this section, we first describe the datasets, comparison methods, experiment settings, and evaluation metrics. Then, we compare and analyze the results obtained by the proposed MdrlEcf and six compared approaches on the experimental datasets.

Experimental Setup
(1) Datasets: To verify the performance and effectiveness of the proposed model, experiments are carried out on three public VHR datasets. The details are listed in Table 1.
(2) Comparison Methods: The compared approaches are RICNN [2], Faster R-CNN [14], DRL-Fr [26], MDRL, SSD [16], and YOLO [17]. MDRL is an ablation method introduced to verify the effectiveness of the convolution feature learning described in Section 3.1; it is a DRL model with the modified reward functions only. By comparing DRL-Fr and MDRL, the necessity and effectiveness of the modified reward functions can be verified; by comparing MDRL and MdrlEcf, those of the proposed efficient convolution feature learning can be evaluated.
(3) Experimental Settings: We retrained the approaches DRL-Fr [26], MDRL, and MdrlEcf using the datasets in Section 4.1. For fairness, all the ablation experiments use VGG16 as the backbone, and all hyperparameters are kept consistent. In the experiments, the parameter settings of DRL-Fr [26], MDRL, and MdrlEcf are as follows. The number of iterations is 110,000, and the batch size is 256. The learning rate is 0.00025 and is automatically decayed by a factor of 0.1 every 80,000 iterations. The input image is resized to 600 pixels on the shortest side and at most 1000 pixels on the longest side. The IoU threshold is set to 0.45. For the other methods (Faster R-CNN [14], RICNN [2], SSD [16], and YOLO [17]), their pretrained weights and related settings are utilized in our experiments.
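The stepwise learning-rate schedule stated above can be written compactly as follows (a sketch of the stated settings; the function name is ours):

```python
def learning_rate(iteration, base_lr=0.00025, step=80_000, factor=0.1):
    """Base LR 2.5e-4, multiplied by 0.1 every 80,000 iterations."""
    return base_lr * factor ** (iteration // step)

print(learning_rate(0))        # 0.00025
print(learning_rate(80_000))   # 2.5e-05
```

With 110,000 total iterations, only a single decay step at iteration 80,000 actually takes effect during training.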
The experimental conditions are as follows. TensorFlow and Python are utilized to build the target detection environment. The experiments run on Linux, on a platform equipped with an 8 GB GPU (Tesla P100) and a 14-core CPU (Intel(R) Xeon(R) Gold 5117 @ 2.00 GHz). The GPU and CPU are used for joint training.
(4) Evaluation Metrics: Average Precision (AP) and mean Average Precision (mAP) are utilized as the evaluation indicators. AP measures the performance of the detector in each category; mAP estimates the detector performance over all categories. The higher the values of AP and mAP, the better the detection performance. Let p(r) represent the P-R curve, and let N be the number of target types in the test set. The two metrics can be formulated as follows:

AP = ∫_0^1 p(r) dr,    mAP = (1/N) Σ_{i=1}^{N} AP_i.
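Under the standard definitions, AP is the area under the P-R curve p(r) and mAP averages AP over the N classes. A minimal Python sketch (trapezoidal integration is one common approximation; VOC-style evaluations often use interpolated sampling instead):

```python
import numpy as np

def average_precision(recalls, precisions):
    """Area under the P-R curve p(r), approximated by trapezoidal integration.
    recalls must be sorted in increasing order."""
    return float(np.trapz(precisions, recalls))

def mean_average_precision(ap_per_class):
    """mAP: mean of the per-class AP values over the N target types."""
    return sum(ap_per_class) / len(ap_per_class)

# A perfect detector keeps precision 1.0 at every recall level -> AP = 1.0
ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 1.0])
print(ap, mean_average_precision([ap, 0.5]))   # 1.0 0.75
```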

Estimating Locations of the Added Local Attention
Quantitative experiments are conducted to determine where to add the local attention module. We choose the SE module as a comparison and utilize VGG16 as the network; the SE module and the local attention share the same experimental settings. The following description of the SE module also applies to the local attention, with the SE module simply replaced by the local attention.
First, the SE module (or the local attention) is added after the first block only, and the model is trained; then the test results and the output matrix table can be calculated. After that, we increase the number of SE modules: adding them after the first and second blocks, after the first three blocks, after the second and third blocks, and after all five blocks. In addition, the scaling parameter is set to 16, 32, and 64 for tuning. The experiments for the local attention are conducted in the same way. Table 2 presents the detection results of the different configurations on the NWPU VHR-10 dataset. In Table 2, "Add layer" indicates the position where the SE module (or local attention) is added, and the number indicates the corresponding block of VGG16. The mAP values differ considerably as the adding locations change. From Table 2, it can be easily observed that the best result is obtained when the local attention is added after the first block, which demonstrates, to a certain degree, the effectiveness of integrating the low-level detailed characteristics into the deep features. Based on the above results, the local attention is added after the first block of VGG16 in the proposed approach. The final structure of the efficient convolution feature learning is displayed in Figure 3.

Quantitative Analysis
RICNN [2], Faster R-CNN [14], DRL-Fr [26], SSD [16], YOLO [17], MDRL, and MdrlEcf are evaluated on the three popular datasets, and the results are shown in Tables 3-5, where the bold numbers indicate the best results. Table 3 presents the comparison of detection accuracies on the NWPU VHR-10 dataset. It can be easily seen that the mAP of MdrlEcf, which reaches 83.4%, is higher than those of the other compared methods. Analyzing the AP values of each category, we find that SSD and YOLO show better results in some categories, but their overall performance is not as good as that of MdrlEcf. In the ablation experiment, MDRL achieves better mAP and training time than DRL-Fr. Additionally, the mAP value and the training time of MdrlEcf are much better than those of MDRL, and most of MdrlEcf's AP values are superior to those of MDRL.
For the SAR-ship-dataset, presented in Table 4, MdrlEcf achieves the best mAP (91.7%). This dataset has only one category, ship, and the ratio of a ship's length or width to the image size is in the range of 0.04 to 0.24, much smaller than PASCAL VOC's 0.2 to 0.9. The superior values of MdrlEcf in Table 4 fully show its effectiveness in detecting small targets.
For the RSOD dataset, presented in Table 5, our method shows obvious improvements in detecting small and medium-sized targets (such as oil tanks, playgrounds, and aircraft). For the overpass class, which offers only shape information compared with the other classes, YOLO achieves the best AP (85.1%). This is because overpasses belong to the large-scale objects in remote sensing images, and the standard YOLO is usually good at detecting large-scale objects. Nevertheless, for the overpass class, the AP obtained by MdrlEcf increases by almost 4% compared with those of DRL-Fr and MDRL, which demonstrates that the added local attention and the modified reward function can well enhance the detection performance.
From Tables 3-5, although the proposed MdrlEcf does not show superior AP values in some categories, its overall performance is better than that of the other compared methods, which means MdrlEcf can stably detect small targets in VHR remote sensing datasets. In the ablation experiments, MdrlEcf achieves superior mAP values and takes much less training time, which verifies the effectiveness of the proposed modules in Section 3. Nonetheless, the proposed MdrlEcf takes longer to train than SSD or YOLO, and its training time grows with the number of images and IoU computations. In the future, we will try to reduce the training time.

Visualization Results and Analysis
The visualization results on the different VHR datasets are shown in Figures 4-6. Figure 4 shows sample detection results of the proposed MdrlEcf on the NWPU VHR-10 dataset. It can be clearly observed that the proposed MdrlEcf accurately detects small and medium-sized objects of different classes, such as airplanes, vehicles, ships, and playgrounds. Additionally, objects that stand densely together are also clearly detected.
The SAR-ship-dataset is a two-dimensional ship dataset that contains a large number of small objects. As shown in Figure 5, the proposed MdrlEcf can effectively detect ships of different sizes and angles; however, some ships fail to be detected in Figure 5. This can be attributed to the fact that similar backscattering mechanisms are shared by the targets and backgrounds such as buildings, harbors, and islands. Therefore, there is still considerable room for improving the proposed method.
The detection results on the RSOD dataset are shown in Figure 6. Obviously, the proposed MdrlEcf not only accurately detects small and medium-sized targets (aircraft, oil tanks, and playgrounds) but also correctly detects large-sized objects (overpasses) against complex backgrounds. Notably, the targets of RSOD are much smaller than those of the NWPU VHR-10 dataset, which further demonstrates the effectiveness of MdrlEcf in detecting small objects.
To compare the effectiveness of object localization, we design an experiment to evaluate the accuracy of the prediction bounding boxes obtained by Faster R-CNN, DRL-Fr, MDRL, and MdrlEcf. Figure 7 presents the numerical IoU results. Comparing the numbers in the blue labels, it can be clearly seen that the average IoU results of MdrlEcf are better than those of Faster R-CNN, DRL-Fr, and MDRL, and the corresponding prediction bounding boxes are more accurate. In Figure 7, the Faster R-CNN, DRL-Fr, and MDRL models are not well fitted, so multiple prediction bounding boxes appear in their results. To further evaluate the localization accuracy of MdrlEcf, Figure 8 compares the prediction bounding boxes with the related ground truth, where the red boxes represent the prediction bounding boxes and the green boxes the ground truth. Comparing the bounding boxes and the IoU values, it can be observed that the prediction bounding boxes of MdrlEcf match the ground truth well, which is also reflected by the IoU values in the blue labels. The visualization analysis in Figures 4-8 further proves that the proposed method offers superior effectiveness and robustness compared with the other methods.
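The IoU used throughout this evaluation can be computed as follows; corner-format boxes [x1, y1, x2, y2] are an assumption of the sketch:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    # Coordinates of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two 10 x 10 boxes overlapping by half: intersection 50, union 150
print(iou([0, 0, 10, 10], [5, 0, 15, 10]))
```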

Conclusions
By modifying deep reinforcement learning and enhancing the convolution features, the MdrlEcf model is proposed for small object detection in VHR remote sensing images. Using the local attention mechanism and a CNN, an attention network is constructed to integrate low-level features into high-level features. The resulting complementary feature map can effectively capture the remarkable characteristics of small targets. After that, the DRL with an improved reward function exploits the feature map to achieve small object detection. Using the redesigned reward function, the modified DRL can greatly increase the search rewards and efficiently guide the agent toward the informative features, so that the small objects can be well located and classified. Three popular VHR remote sensing datasets with quite different categories are used to evaluate the performance of MdrlEcf and six compared models. The experimental results verify the effectiveness and feasibility of the proposed MdrlEcf and suggest that it can achieve better results both qualitatively and quantitatively.
There is no doubt that the proposed approach still has room for improvement. Being based on deep reinforcement learning, MdrlEcf takes a relatively long time to train. In further research, we will try to improve the speed of predicting the bounding boxes in deep reinforcement learning; we are also interested in designing a lightweight detector and transplanting it to portable hardware.