Small-Object Detection in Remote Sensing Images with End-to-End Edge-Enhanced GAN and Object Detector Network

The detection performance of small objects in remote sensing images is not satisfactory compared to large objects, especially in low-resolution and noisy images. A generative adversarial network (GAN)-based model called enhanced super-resolution GAN (ESRGAN) shows remarkable image enhancement performance, but the reconstructed images lack high-frequency edge information. Therefore, object detection performance degrades for small objects in recovered noisy and low-resolution remote sensing images. Inspired by the success of the edge-enhanced GAN (EEGAN) and ESRGAN, we apply a new edge-enhanced super-resolution GAN (EESRGAN) to improve the image quality of remote sensing images and use different detector networks in an end-to-end manner, where the detector loss is backpropagated into the EESRGAN to improve detection performance. We propose an architecture with three components: ESRGAN, an edge-enhancement network (EEN), and a detection network. We use residual-in-residual dense blocks (RRDB) for both the ESRGAN and the EEN, and for the detector network, we use the faster region-based convolutional network (FRCNN) (two-stage detector) and the single-shot multi-box detector (SSD) (one-stage detector). Extensive experiments on a public (car overhead with context) and a self-assembled (oil and gas storage tank) satellite dataset show the superior performance of our method compared to standalone state-of-the-art object detectors.

with and without noise. These models have two subnetworks: a generator and a discriminator. Both subnetworks consist of deep CNNs. Datasets containing HR and LR image pairs are used for training and testing the models. The generator generates HR images from LR input images, and the discriminator predicts whether a generated image is a real HR image or an upscaled LR image. After sufficient training, the generator generates HR images that are similar to the ground truth HR images, and the discriminator can no longer correctly discriminate between real and fake images.
Although the resulting images look realistic, the compensated high-frequency details such as image edges may cause inconsistency with the HR ground truth images [22]. Some works showed that this issue negatively impacts land cover classification results [23,24]. Edge information is an important feature for object detection [25], and therefore, this information needs to be preserved in the enhanced images for acceptable detection accuracy.
In order to obtain clear and distinguishable edge information, researchers proposed several methods using separate deep CNN edge extractors [26,27]. The results of these methods are sufficient for natural images, but the performance degrades on LR and noisy remote sensing images [22]. A recent method [22] used a GAN-based edge-enhancement network (EEGAN) to generate visually pleasing results with sufficient edge information. EEGAN employs two subnetworks for the generator: one network generates intermediate HR images, and the other generates sharp and noise-free edges from the intermediate images. The method uses a Laplacian operator [28] to extract edge information and, in addition, a mask branch to obtain noise-free edges. This approach preserves sufficient edge information, but the final output images are sometimes blurry compared to a current state-of-the-art GAN-based SR method [21] due to noise introduced in the enhanced edges, which can hurt object detection performance.
Another important issue with small-object detection is the huge cost of HR imagery for large areas. Many organizations are using very high-resolution satellite imagery to fulfill their purposes. When it comes to continuous monitoring of a large area for regulation or traffic purposes, it is costly to buy HR imagery frequently. Publicly available satellite imagery such as Landsat-8 [29] (30 m/pixel) and Sentinel-2 [30] (10 m/pixel) are not suitable for detecting small objects due to the high ground sampling distance (GSD). Detection of small objects (e.g., oil and gas storage tanks and buildings) is possible from commercial satellite imagery such as 1.5-m GSD SPOT-6 imagery but the detection accuracy is low compared to HR imagery, e.g., 30-cm GSD DigitalGlobe imagery in Bing map.
We have identified two main problems in detecting small objects from satellite imagery. First, the accuracy of small-object detection is lower than for large objects, even in HR imagery, due to sensor noise, atmospheric effects, and geometric distortion. Second, we need access to HR imagery, which is very costly for a vast region with frequent updates. Therefore, we need a solution that increases the detection accuracy for smaller objects from LR imagery. To the best of our knowledge, no work has employed both an SR network with edge enhancement and an object detector network in an end-to-end manner, i.e., using joint optimization, to detect small remote sensing objects.
In this paper, we propose an end-to-end architecture where object detection and super-resolution are performed simultaneously. Figure 1 shows the significance of our method. State-of-the-art detectors miss objects when trained on the LR images; in comparison, our method can detect those objects. The detection performance improves when we use SR images for the detection of objects from two different datasets. Average precision (AP) versus different intersection over union (IoU) thresholds (for both LR and SR) are plotted to visualize the overall performance on the test datasets. From figure 1, we observe that for both datasets, our proposed end-to-end method yields significantly higher AP values at the same IoU thresholds. In section 4.2, we discuss AP and IoU in more detail, and these results are discussed in section 4.

Figure 1: Detection on LR (low-resolution) images (60 cm/pixel) is shown in (I); in (II), we show the detection on generated SR (super-resolution) images (15 cm/pixel). The first row represents the COWC (car overhead with context) dataset [31], and the second row represents the OGST (oil and gas storage tank) dataset [32]. AP (average precision) values versus different IoU (intersection over union) thresholds for the LR test set and the SR images generated from the LR images are shown in (III) for both datasets. We use the FRCNN (faster region-based CNN) detector on the LR images for detection. Then, instead of using the LR images directly, we use our proposed end-to-end EESRGAN (edge-enhanced SRGAN) and FRCNN architecture (EESRGAN-FRCNN) to generate SR images and simultaneously detect objects from the SR images. Red bounding boxes represent true positives, and yellow bounding boxes represent false negatives. IoU = 0.75 is used for detection.
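The IoU criterion used throughout (e.g., the IoU = 0.75 threshold above) is the ratio of the intersection area to the union area of two bounding boxes. A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

A detection counts as a true positive when its IoU with a ground truth box meets the chosen threshold; AP is then computed over the resulting precision-recall curve.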

Contributions of Our Method
Our proposed architecture consists of two parts: an EESRGAN network and a detector network. Our approach is inspired by the EEGAN and ESRGAN networks and shows a remarkable improvement over EEGAN in generating visually pleasing SR satellite images with sufficient edge information. We employ a generator subnetwork, a discriminator subnetwork, and an edge-enhancement subnetwork [22] for the SR network. For the generator and the edge-enhancement network, we use residual-in-residual dense blocks (RRDB) [21]. These blocks contain multi-level residual networks with dense connections that have shown good performance on image enhancement.
We used a relativistic discriminator [33] instead of a normal discriminator. Besides GAN loss and discriminator loss, we employed Charbonnier loss [34] for the edge-enhancement network. Finally, we used different detectors [8,10] to detect small objects from the SR images. The detectors acted like the discriminator as we backpropagated the detection loss into the SR network and, therefore, it improved the quality of the SR images.
We created the oil and gas storage tank (OGST) dataset [32] from satellite imagery (Bing map), which has 30 cm and 1.2 m GSD. The dataset contains labeled oil and gas storage tanks from the Canadian province of Alberta, and we detected the tanks on SR images. Detection and counting of the tanks are essential for the Alberta Energy Regulator (AER) [35] to ensure safe, efficient, orderly, and environmentally responsible development of energy resources. Therefore, there is a potential use of our method for detecting small objects from LR satellite imagery. The OGST dataset is available on Mendeley [32].
In addition to the OGST dataset, we applied our method on the publicly available car overhead with context (COWC) [31] dataset to compare the performance of detection for varying use-cases. During training, we used HR and LR image pairs but only required LR images for testing. Our method outperformed standalone state-of-the-art detectors for both datasets.
The remainder of this paper is structured as follows. We discuss related work in section 2. In section 3, we introduce our proposed method and describe each of its parts. The description of the datasets and the experimental results are given in section 4, a final discussion is provided in section 5, and section 6 concludes the paper with a summary.

Related Work
Our work consists of an end-to-end edge enhanced image SR network with an object detector network. In this section, we discuss existing methods related to our work.

Image Super-Resolution
Many SR methods using deep CNNs have been proposed. Dong et al. proposed the super-resolution CNN (SRCNN) [17] to enhance LR images with end-to-end training, outperforming previous SR techniques. Deep CNNs for SR evolved rapidly, and researchers introduced residual blocks [20], densely connected networks [36], and residual dense blocks [37] to improve SR results. He et al. [38] and Lim et al. [39] used deep CNNs without batch normalization (BN) layers and observed significant performance improvements and stable training with deeper networks. These works were done on natural images.
Liebel et al. [40] proposed deep CNN-based SR network for multi-spectral remote sensing imagery. Jiang et al. [22] proposed a new SR architecture for satellite imagery that was based on GAN. They introduced an edge-enhancement subnetwork to acquire smooth edge details in the final SR images.

Object Detection
Deep learning-based object detectors can be categorized into two subgroups: region-based CNN (R-CNN) models that employ two-stage detection, and uniform models using single-stage detection [41]. Two-stage detectors comprise R-CNN [42], Fast R-CNN [43], and Faster R-CNN [8], while the most widely used single-stage detectors are SSD [10], You Only Look Once (YOLO) [11], and RetinaNet [9]. In the first stage of a two-stage detector, regions of interest are determined by selective search or a region proposal network. Then, in the second stage, the selected regions are checked for particular types of objects, and minimal bounding boxes for the detected objects are predicted. In contrast, single-stage detectors omit the region proposal step and run detection on a dense sampling of all possible locations. Therefore, single-stage detectors are faster but usually less accurate. RetinaNet [9] uses a focal loss function to deal with the data imbalance caused by the many background objects and often shows performance similar to the two-stage approaches.
Many deep CNN-based object detectors have been proposed for detecting and counting small objects, such as vehicles, in remote sensing imagery [13,44,45]. Tayara et al. [13] introduced a convolutional regression neural network to detect vehicles from satellite imagery. Furthermore, a deep CNN-based detector was proposed [44] to detect multi-oriented vehicles from remote sensing imagery. A method combining a deep CNN for feature extraction and a support vector machine (SVM) for object classification was proposed in [45]. Ren et al. [46] modified the faster R-CNN detector to detect small objects in remote sensing images; they changed the region proposal network and incorporated context information into the detector. Another modified faster R-CNN detector was proposed by Tang et al. [47], who used a hyper region proposal network to improve recall and a cascade boosted classifier to verify candidate regions. This classifier can reduce false detections by mining hard negative examples.
An SSD-based end-to-end airplane detector with transfer learning was proposed, where the authors used a limited number of airplane images for training [48]. They also proposed a method to overcome input size restrictions by dividing a large image into smaller tiles, detecting objects on the tiles, and finally mapping each tile back to the original image. They showed that their method performed better than the plain SSD model. In [49], the authors showed that finding a suitable parameter setting helps to boost the object detection performance of convolutional neural networks on remote sensing imagery. They used YOLO [11] as the object detector to optimize the parameters and infer the results.
In [3], the authors detected conifer seedlings along recovering seismic lines from drone imagery. They used a dataset from different seasons and used faster R-CNN to infer the detection accuracy. Another work [50] related to plant detection used sliding window techniques and an optimized convolutional neural network to detect palm trees from satellite imagery.
Some works produced excellent results in detecting small objects. Lin et al. [51] proposed feature pyramid networks, a top-down architecture with lateral connections that builds high-level semantic feature maps at all scales. These feature maps boost object detection performance, especially for small objects, when used as a feature extractor for faster R-CNN. Inspired by the receptive fields in the human visual system, Liu et al. [52] proposed a receptive field block (RFB) module that uses the relationship between the size and eccentricity of receptive fields to enhance feature discrimination and robustness. Hence, the module increases the detection performance for objects of various sizes when used as a replacement for the top convolutional layers of SSD.
A one-stage detector called single-shot refinement neural network (RefineDet) [53] was proposed to increase detection accuracy and enhance inference speed. The detector worked well for small-object detection. RefineDet uses two modules in its architecture: an anchor refinement module to remove negative anchors and an object detection module that takes the refined anchors as input. The refinement helps detect small objects more efficiently than previous methods. In [54], feature fusion SSD (FSSD) was proposed, where features from different layers with different scales are concatenated, and downsampling blocks then generate new feature pyramids. Finally, the features are fed to the multibox detector for prediction. The feature fusion in FSSD increases detection performance for both large and small objects. Zhu et al. [55] trained single-shot object detectors from scratch and obtained state-of-the-art performance on various benchmark datasets. They removed the first downsampling layer of SSD and introduced a root block (with modified convolutional filters) to exploit more local information from an image. Therefore, the detector can extract powerful features for small-object detection.
All of the aforementioned works were proposed for natural images. A method related to small object detection on remote sensing imagery was proposed by Yang et al. [56]. They used modified faster R-CNN to detect both large and small objects. They proposed rotation dense feature pyramid networks (R-DFPN), and the use of this network helped to improve the detection performance of small objects.
There is an excellent review paper by Zhao et al. [57], in which the authors provide a thorough review of object detectors along with the advantages and disadvantages of different approaches. The effect of object size is also discussed in that paper. Another survey about object detection in remote sensing images, by Li et al. [58], reviews and compares different methods.

Super-resolution along with Object Detection
The positive effects of SR on object detection tasks were discussed in [5], where the authors used remote sensing datasets for their experiments. Simultaneous CNN-based image enhancement with object detection using the single-shot multibox detector (SSD) [10] was done in [59]. Haris et al. [60] proposed a GAN-based generator to generate an HR image from an LR image and then used a multi-task network as a discriminator as well as for localization and classification of objects. These works were done on natural images, and LR and HR image pairs were required. In another work [12], a method for simultaneous super-resolution and object detection on satellite imagery was proposed. The SR network in this approach was inspired by the cycle-consistent adversarial network [61]. A modified faster R-CNN architecture was used to detect vehicles from the enhanced images produced by the SR network.

Method
In this paper, we aim to improve the detection performance for small objects in remote sensing imagery. Towards this goal, we propose an end-to-end network architecture that consists of two modules: a GAN-based SR network and a detector network. The whole network is trained in an end-to-end manner, and HR and LR image pairs are needed for training.
The SR network has three components: a generator (G), a discriminator (D Ra ), and an edge-enhancement network (EEN). Our method uses end-to-end training, as the gradient of the detection loss from the detector is backpropagated into the generator. Therefore, the detector also works like a discriminator and encourages the generator G to generate realistic images similar to the ground truth. Our entire network structure can also be divided into two parts: a generator part, consisting of G and the EEN, and a discriminator part, which includes D Ra and the detector network. In figure 2, we show the role of the detector as a discriminator. The generator G generates intermediate super-resolution (ISR) images, and the final SR images are generated after applying the EEN network. The discriminator (D Ra ) discriminates between ground truth (GT) HR images and ISR images. The inverted gradients of D Ra are backpropagated into the generator G in order to create SR images allowing for accurate object detection. Edge information is extracted from the ISR images, and the EEN network enhances these edges. Afterwards, the enhanced edges are added back to the ISR images from which the original edges, extracted by the Laplacian operator, have been subtracted, yielding the output SR images with enhanced edges. Finally, we detect objects from the SR images using the detector network.
We use two different loss functions for the EEN: one compares the difference between the SR and ground truth images, and the other compares the difference between the edges extracted from the SR output and those extracted from the ground truth. We also use the VGG19 [62] network for feature extraction, which is used for the perceptual loss [21]. This helps generate more realistic images with more accurate edge information. We divide the whole pipeline into a generator and a discriminator, and these two components are elaborated in the following.

Generator
Our generator consists of a generator network G and an edge-enhancement network EEN. In this section, we describe the architectures of both networks and the corresponding loss function. We use the generator architecture from ESRGAN [21], where all batch normalization (BN) layers are removed, and RRDB is used. The overall architecture of generator G is shown in figure 3, and the RRDB is depicted in figure 4.

Generator Network G
Inspired by the architecture of ESRGAN, we remove BN layers to increase the performance of the generator G and to reduce the computational complexity. The authors of ESRGAN also state that the BN layers tend to introduce unpleasant artifacts and limit the generalization ability of the generator when the statistics of training and testing datasets differ significantly.
We use RRDB as the basic block of the generator network G, which uses a multi-level residual network with dense connections. Those dense connections increase network capacity, and we also use residual scaling to prevent unstable conditions during the training phase [21]. We use the parametric rectified linear unit (PReLU) [63] in the dense blocks so that the activation slope is learned jointly with the other network parameters. As discriminator (D Ra ), we employ a relativistic average discriminator similar to the work presented in [21]. In equations 1 and 2, the relativistic average discriminator is formulated for our architecture. Our generator G depends on the discriminator D Ra , and hence we briefly discuss the discriminator D Ra here and describe all details in section 3.2. The discriminator predicts the probability that a real image (I HR ) is relatively more realistic than a generated intermediate image (I ISR ).
In equations 1 and 2, σ, C(·), and E I ISR represent the sigmoid function, the discriminator output, and the operation of calculating the mean over all generated intermediate images in a mini-batch, respectively. The generated intermediate images are created by the generator, where I ISR = G(I LR ). It is evident from equation 3 that the adversarial loss of the generator contains both I HR and I ISR , and hence it benefits from the gradients of both generated and ground truth images during training. The discriminator loss is depicted in equation 4.
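With the symbols just defined, equations 1–4 can be written out following the relativistic average GAN formulation of ESRGAN [21] that this architecture adopts (a reconstruction, not a verbatim copy of the paper's typesetting):

```latex
D_{Ra}(I^{HR}, I^{ISR}) = \sigma\left(C(I^{HR}) - \mathbb{E}_{I^{ISR}}\left[C(I^{ISR})\right]\right) \tag{1}
D_{Ra}(I^{ISR}, I^{HR}) = \sigma\left(C(I^{ISR}) - \mathbb{E}_{I^{HR}}\left[C(I^{HR})\right]\right) \tag{2}
L_{G}^{Ra} = -\mathbb{E}_{I^{HR}}\left[\log\left(1 - D_{Ra}(I^{HR}, I^{ISR})\right)\right] - \mathbb{E}_{I^{ISR}}\left[\log D_{Ra}(I^{ISR}, I^{HR})\right] \tag{3}
L_{D}^{Ra} = -\mathbb{E}_{I^{HR}}\left[\log D_{Ra}(I^{HR}, I^{ISR})\right] - \mathbb{E}_{I^{ISR}}\left[\log\left(1 - D_{Ra}(I^{ISR}, I^{HR})\right)\right] \tag{4}
```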
We use two more losses for generator G: one is the perceptual loss (L percep ), and the other is the content loss (L 1 ) [21]. The perceptual loss is calculated using the feature map (vgg f ea (·)) before the activation layers of a fine-tuned VGG19 [62] network, and the content loss calculates the 1-norm distance between I ISR and I HR . The perceptual loss and content loss are shown in equations 5 and 6.
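In the notation above, these two losses can be sketched as follows (consistent with the ESRGAN formulation [21] the text references):

```latex
L_{percep} = \left\lVert \mathrm{vgg}_{fea}(I^{ISR}) - \mathrm{vgg}_{fea}(I^{HR}) \right\rVert_{1} \tag{5}
L_{1} = \left\lVert I^{ISR} - I^{HR} \right\rVert_{1} \tag{6}
```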
The EEN network removes noise and enhances the edges extracted from an image. An overview of the network is depicted in figure 5. First, a Laplacian operator [28] is used to extract edges from the input image. The extracted edge information is then passed through convolutional, RRDB, and upsampling blocks. A mask branch with sigmoid activation removes edge noise, as described in [22]. Finally, the enhanced edges are added to the input images from which the Laplacian-extracted edges were subtracted.
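As a concrete illustration of the edge-extraction step, the following is a minimal NumPy sketch of a Laplacian filter; the standard 4-neighbour kernel is assumed here, since the paper does not specify which Laplacian variant is used:

```python
import numpy as np

# 3x3 Laplacian kernel (4-neighbour variant); responds to intensity
# changes in all directions, so it highlights object edges.
LAPLACIAN = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]], dtype=np.float64)

def laplacian_edges(img):
    """Extract an edge map from a 2-D image with a same-size Laplacian filter."""
    padded = np.pad(img, 1, mode="edge")  # replicate borders to keep size
    out = np.zeros_like(img, dtype=np.float64)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * LAPLACIAN)
    return out
```

A flat region yields a zero response (the kernel weights sum to zero), while intensity discontinuities produce a strong response, which is exactly the edge map the EEN then denoises and enhances.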
The EEN network is similar to the edge-enhancement subnetwork proposed in [22], with two improvements. First, we replace the dense blocks with RRDB, which shows improved performance according to ESRGAN [21]. Second, we introduce a new loss term to improve the reconstruction of edge information. In [22], the authors extracted the edge information from I ISR and enhanced the edges using an edge-enhancement subnetwork, the output of which was afterwards added to the edge-subtracted I ISR . To train the network, [22] proposed to use the Charbonnier loss [34] between I ISR and I HR . This function is called the consistency loss for images (L img_cst ) and helps obtain visually pleasant outputs with good edge information. However, the edges of some objects are sometimes distorted and produce noise and, consequently, poor edge information. Therefore, we introduce a consistency loss for the edges (L edge_cst ) as well. To compute L edge_cst , we evaluate the Charbonnier loss between the extracted edges (I edge_SR ) from I SR and the extracted edges (I edge_HR ) from I HR . The two consistency losses are depicted in equations 7 and 8, where ρ(·) is the Charbonnier penalty function [64]. The total consistency loss is finally calculated for both images and edges by summing the individual losses; the loss of our EEN is shown in equation 9.
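With ρ(·) the Charbonnier penalty, equations 7–9 plausibly take the following form (a sketch reconstructed from the description above):

```latex
L_{img\_cst} = \rho\left(I^{HR} - I^{SR}\right) \tag{7}
L_{edge\_cst} = \rho\left(I^{edge\_HR} - I^{edge\_SR}\right) \tag{8}
L_{cst} = L_{img\_cst} + L_{edge\_cst} \tag{9}
\text{where } \rho(x) = \sqrt{x^{2} + \varepsilon^{2}}
```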
Finally, we obtain the overall loss for the generator module by adding the losses of the generator G and the EEN network. The overall loss for the generator module is shown in equation 10, where λ 1 , λ 2 , λ 3 , and λ 4 are weight parameters that balance the different loss components. We empirically set λ 1 = 1, λ 2 = 0.001, λ 3 = 0.01, and λ 4 = 5.
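Equation 10 then plausibly combines the four components as follows; the pairing of each weight with its term is our assumption, chosen to be consistent with the ESRGAN defaults (small weight on the adversarial and content terms):

```latex
L_{G\_een} = \lambda_{1} L_{percep} + \lambda_{2} L_{G}^{Ra} + \lambda_{3} L_{1} + \lambda_{4} L_{cst} \tag{10}
```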

Discriminator
As described in the previous section, we use the relativistic discriminator D Ra for training the generator G. The architecture of the discriminator is taken from ESRGAN [21] which employs the VGG-19 [62] architecture. We use Faster R-CNN [8] and SSD [10] for our detector networks. The discriminator (D Ra ) and the detector network jointly act as discriminator for the generator module. We briefly describe these two detectors in the next two sections.

Faster R-CNN
The Faster R-CNN [8] is a two-stage object detector and contains two networks: a region proposal network (RPN) to generate region proposals from an image and another network to detect objects from these proposals. In addition, the second network also tries to fit the bounding boxes around the detected objects.
The task of the RPN is to return image regions that have a high probability of containing an object. The RPN network uses a backbone network such as VGG [62], ResNet, or ResNet with feature pyramid network [51]. These networks are used as feature extractors, and different types of feature extractors can be chosen based on their performance on public datasets. We use ResNet-50-FPN [51] as a backbone network for our faster R-CNN. We use this network because it displayed a higher precision than VGG-19 and ResNet-50 without FPN (especially for small object detection) [51]. Even though the use of a larger network might lead to a further performance improvement, we chose ResNet-50-FPN due to its comparably moderate hardware requirements and more efficient convergence times.
After the RPN, there are two branches for detection: a classifier and a regressor. The classification branch is responsible for classifying a proposal to a specific object, and the regression branch finds the accurate bounding box of the object. In our case, both datasets contain objects with only one class, and therefore, our classifier infers only two classes: the background class and the object class.

SSD
The SSD [10] is a single-shot multibox detector that detects objects in a single stage, meaning that classification and localization are done in a single forward pass through the network. Like Faster R-CNN, SSD has a feature extractor network, and different types of networks can be used. To serve the primary purpose of SSD, which is speed, we use VGG-16 [62] as the feature extractor. After this network, SSD has several convolutional feature layers of decreasing size, which can be seen as a pyramid representation of the image at different scales. Detection therefore happens at every layer, and finally, we obtain the object detection output as class values and bounding-box coordinates.

Loss of the Discriminator
The relativistic discriminator loss (L Ra D ) is described in the previous section and depicted in equation 4. This loss is added to the detector loss to obtain the final discriminator loss.
Both Faster R-CNN and SSD have similar regression/localization losses but different classification losses. For regression/localization, both use the smooth L 1 loss [8] between the detected and ground truth bounding-box coordinates (t * ). The classification loss (L cls_ f rcnn ), regression loss (L reg_ f rcnn ), and overall loss (L det_ f rcnn ) of Faster R-CNN are combined as L det_ f rcnn = L cls_ f rcnn + λL reg_ f rcnn (13), where λ balances the two losses and is set to 1 empirically; Det cls_ f rcnn and Det reg_ f rcnn denote the classifier and regressor of Faster R-CNN. Analogously, the classification loss (L cls_ssd ), regression loss (L reg_ssd ), and overall loss (L det_ssd ) of SSD are combined with a balancing parameter α, also set to 1 empirically; Det cls_ssd and Det reg_ssd denote the classifier and regressor of the SSD.
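Written out, the combined detector losses and the smooth L 1 penalty take the following standard form (a hedged reconstruction; the precise argument lists of the paper's equations 11 and 12 are not reproduced here):

```latex
L_{det\_frcnn} = L_{cls\_frcnn} + \lambda\, L_{reg\_frcnn} \tag{13}
L_{det\_ssd} = L_{cls\_ssd} + \alpha\, L_{reg\_ssd}
\mathrm{smooth}_{L_{1}}(x) =
\begin{cases}
0.5\,x^{2} & \text{if } |x| < 1 \\
|x| - 0.5 & \text{otherwise}
\end{cases}
```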

Training
Our architecture can be trained in separate steps or jointly in an end-to-end way. We discuss the details of these two types of training in the next two sections.

Separate Training
In separate training, we train the SR network (generator module and discriminator D Ra ) and the detector separately. The detector loss is not backpropagated to the generator module. Therefore, the generator is not aware of the detector and only gets feedback from the discriminator D Ra . For example, in equation 11, no error is backpropagated to the G G_een network (the network is detached during the calculation of the detector loss) while calculating the loss L cls_ f rcnn .

End-to-End Training
In end-to-end training, we train the whole architecture end-to-end, which means that the detector loss is backpropagated to the generator module. Therefore, the generator module receives gradients from both the detector and the discriminator D Ra . We obtain the final discriminator loss (L D _ det ) as follows, where η is the parameter that balances the contribution of the detector loss; we empirically set it to 1. Finally, we obtain an overall loss (L overall ) for our architecture.
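The combination just described can be written compactly as follows (a sketch; the naming of the generator-module loss as L G_een follows the notation used elsewhere in the text):

```latex
L_{D\_det} = L_{D}^{Ra} + \eta\, L_{det}
L_{overall} = L_{G\_een} + L_{D\_det}
```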

Experiments
As mentioned above, we trained our architecture separately and in an end-to-end manner. For separate training, we first trained the SR network until convergence and then trained the detector networks on the SR images. For end-to-end training, we also employed separate training as a pre-training step for weight initialization. Afterwards, the SR and object detection networks were jointly trained, i.e., the gradients from the object detector were propagated into the generator network.
In the training process, the learning rate was set to 0.0001 and halved after every 50k iterations. The batch size was set to 5. We used Adam [65] as the optimizer with β 1 = 0.9, β 2 = 0.999 and updated the whole architecture's weights until convergence. We used 23 RRDB blocks for the generator G and 5 RRDB blocks for the EEN network. We implemented our architecture with the PyTorch framework [66] and trained/tested it using two NVIDIA Titan X GPUs. The end-to-end training with COWC took 96 hours for 200 epochs. The average inference speed was approximately 4 images/second using faster R-CNN and 7 images/second using SSD. Our implementation can be found on GitHub [67].
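The schedule above (base learning rate 10⁻⁴, halved every 50k iterations) can be sketched as a simple function; in PyTorch this is typically realized with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=50_000, gamma=0.5)`:

```python
def learning_rate(iteration, base_lr=1e-4, step=50_000):
    """Return the learning rate at a given iteration: the base rate
    halved once for every completed block of `step` iterations."""
    return base_lr * 0.5 ** (iteration // step)
```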

Cars Overhead with Context Dataset
The cars overhead with context (COWC) dataset [31] contains 15 cm (one pixel covers 15 cm at ground level) satellite images from six different regions. The dataset contains a large number of unique cars and covers regions from Toronto in Canada, Selwyn in New Zealand, Potsdam and Vaihingen in Germany, and Columbus and Utah in the United States. Of these six regions, we used the data from Toronto and Potsdam; therefore, when we refer to the COWC dataset, we refer to the data from these two regions. There are 12651 cars in our selected dataset. The dataset contains only RGB images, and we used these images for training and testing.
We used 256-by-256 image tiles, and every image tile contains at least one car. The average length of a car was between 24 and 48 pixels, and the width was between 10 and 20 pixels. Therefore, the area of a car was between 240 and 960 pixels, which can be considered small relative to other objects in satellite imagery. We used bi-cubic downsampling to generate LR images from the COWC dataset. The downscale factor was 4x, and therefore, the LR images were 64-by-64 pixels. A text file associated with each image tile contains the coordinates of the bounding box for each car.
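The LR training images are produced by 4x bi-cubic downsampling (e.g., via PIL's `Image.resize(..., Image.BICUBIC)`). As a dependency-light illustration of the 4x size reduction, here is a block-averaging stand-in; note that this averages pixel blocks rather than performing true bi-cubic interpolation:

```python
import numpy as np

def downscale_4x(img):
    """4x downscale of a 2-D image by averaging 4x4 pixel blocks.
    A simple stand-in for the bi-cubic downsampling used to create
    the LR training images; height and width must be multiples of 4."""
    h, w = img.shape
    return img.reshape(h // 4, 4, w // 4, 4).mean(axis=(1, 3))
```

Applied to a 256-by-256 COWC tile, this yields the 64-by-64 LR counterpart used for training.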
Our experiments considered only one class, car, and no other object types. Figure 6 shows examples from the COWC dataset. We experimented with a total of 3340 tiles for training and testing. Our train/test split was 80%/20%, and the training set was further divided into training and validation sets at an 80%/20% ratio. We trained our end-to-end architecture with an augmented training dataset using random horizontal flips and ninety-degree rotations.

Oil and Gas Storage Tank Dataset
The oil and gas storage tank (OGST) dataset has been compiled at the Alberta Geological Survey (AGS) [68], a branch of the Alberta Energy Regulator (AER) [35]. AGS provides geoscience information and supports AER's regulatory functions on energy developments, which are to be carried out in a manner that ensures public and environmental safety. To assist AER with sustainable land management and compliance assurance [69], AGS is utilizing remote sensing imagery to identify the number of oil and gas storage tanks inside well pad footprints in Alberta.
While the SPOT-6 satellite imagery at 1.5 m pixel resolution provided by the AGS has sufficient quality and detail for many regulatory functions, it is difficult to detect small objects within well pads, e.g., oil and gas storage tanks, with ordinary object detection methods. The diameter of a typical storage tank is about 3 m, and the tanks are usually placed vertically and side-by-side with less than 2 m of separation. To train our architecture for this use case, we needed a dataset providing pairs of low- and high-resolution images. Therefore, we created the OGST dataset using free imagery from Bing Maps [70].
The OGST dataset contains 30 cm resolution remote sensing images (RGB) from the Cold Lake Oil Sands region of Alberta, Canada, where there is a high level of oil and gas activity and a high concentration of well pad footprints. The dataset contains 1671 oil and gas storage tanks from this area.
We used 512-by-512 image tiles, and every tile in our experiment contained at least one oil and gas storage tank. The average area covered by an individual tank was between 800 and 1600 pixels. Some industrial tanks were large, but most of the tanks covered small regions of the imagery. We downscaled the HR images using bicubic downsampling with a factor of 4x, and therefore obtained LR tiles of size 128-by-128 pixels. Every image tile was associated with a text file containing the coordinates of the bounding boxes for the tanks on that tile. We show examples from the OGST dataset in figure 7. As with the COWC dataset, our experiments considered one class, tank, and we had a total of 760 tiles for training and testing. We used a 90%/10% split for our train/test data. The training data was further divided by a 90%/10% train/validation split. The percentage of training data was higher here than for the COWC dataset in order to increase the amount of training data, given the smaller size of the dataset. The dataset is available at [67].

Evaluation Metrics for Detection
We obtained our detection output as bounding boxes with associated classes. To evaluate our results, we used average precision (AP), and calculated intersection over union (IoU), precision, and recall to obtain the AP.
We denote the set of correctly detected objects as true positives (TP) and the set of falsely detected objects as false positives (FP). The precision is then the ratio between the number of TPs and the number of all predicted objects:

precision = |TP| / (|TP| + |FP|) (19)

We denote the set of objects which are not detected by the detector as false negatives (FN). The recall is then defined as the ratio of detected objects (TP) to the number of all objects in the dataset:

recall = |TP| / (|TP| + |FN|) (20)

To measure the localization error of predicted bounding boxes, the IoU computes the overlap between two bounding boxes: the detected box and the ground truth box. If we take all detections with an IoU ≥ τ as TP and consider all other detections as FP, we obtain the precision at IoU threshold τ. If we vary τ from 0.5 to 0.95 with a step size of 0.05, we obtain ten different precision values, which are averaged into the average precision (AP) at IoU=0.5:0.95 [8]. Note that in the case of multi-class detection, we would need to compute the AP for each object class separately; to obtain a single performance measure, the mean AP (mAP) is computed, which is the most common performance measure for object detection quality.
In this paper, both of our datasets contain only a single class, and hence we used AP as our evaluation metric. We mainly report AP at IoU=0.5:0.95, as our method performed increasingly better than the other models as the IoU threshold increased. We show this trend in section 4.3.4.
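The metrics above can be sketched in a few lines; the box format and function names are our own:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(n_tp, n_fp, n_fn):
    """Precision and recall from counts of TP, FP, and FN, per the equations above."""
    return n_tp / (n_tp + n_fp), n_tp / (n_tp + n_fn)

# The ten IoU thresholds averaged for AP at IoU=0.5:0.95.
IOU_THRESHOLDS = [0.5 + 0.05 * i for i in range(10)]
```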

Detection without Super-Resolution
We ran the two detectors to document object detection performance on both LR and HR images. We used SSD with a VGG16 [62] backbone and faster R-CNN (FRCNN) with a ResNet-50-FPN [51] backbone. We trained the two models with both HR and 4x-downscaled LR images. Testing was also done with both HR and LR images.

Table 1. Detection on LR (low-resolution) and HR (high-resolution) images without using super-resolution. Detectors are trained with both LR and HR images, and AP (average precision) values are calculated using 10 different IoUs (intersection over union).

In table 1, we show the detection performance of the detectors for the different train/test combinations. When we used only LR images for both training and testing, we observed 64% AP for faster R-CNN. When training on HR images and testing on LR images, the accuracy dropped for both detectors. We also added detection results (using LR images for training/testing) for both datasets using SSD with RFB modules (SSD-RFB) [52], where the accuracy increased slightly over the base SSD.

The last two rows in table 1 depict the accuracy of both detectors when training and testing on HR images. We achieved up to 98% AP with the faster R-CNN detector. This shows the large impact of resolution on object detection quality and sets a natural upper bound on how close an SR-based method can get when working on LR images. In the next sections, we demonstrate that our approaches considerably improve the detection rate on LR imagery and come remarkably close to the performance of working directly on HR imagery. In the next experiment, we created 4x-upsampled images from the LR input images using bicubic upsampling and different SR methods. Note that no training was needed for bicubic upsampling since it is a parameter-free function. We used the SR images as test data for the two types of detectors. We compared three GAN architectures for generating SR images: our new EESRGAN architecture, ESRGAN [21], and EEGAN [22]. Each network was trained separately on the training set before the object detector was trained. For the evaluation, we again compared detectors trained on the SR images from the particular architecture and detectors trained directly on the HR images.

Separate Training with Super-Resolution
In table 2, the detection output of the different combinations of SR methods and detectors is shown for the different train/test pairs. As can be seen, our new EESRGAN architecture displayed the best results, already coming close to the detection rates observed when working with HR images only. Moreover, after training, EESRGAN can be applied directly to LR imagery where no HR data is available and still achieves very good results. Furthermore, we observed that the other SR methods, EEGAN and ESRGAN, also improved the AP considerably when used for preprocessing LR images. However, for both datasets, EESRGAN outperformed the other two methods.

End-to-End Training with Super-Resolution
We trained our EESRGAN network and the detectors end-to-end for this experiment. The discriminator (D_Ra) and the detectors jointly acted as a discriminator for the entire architecture. The detector loss was backpropagated to the SR network, and therefore contributed to the enhancement of the LR images. At training time, LR-HR image pairs were used to train the EESRGAN part, and the generated SR images were then sent to the detector for training. At test time, only the LR images were fed to the network. Our architecture first generated an SR image from the LR input before object detection was performed. We also compared our results with different architectures, using ESRGAN [21] and EEGAN [22] with the detectors for comparison. Table 3 clearly shows that our method delivers superior results compared to the others.
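The end-to-end coupling can be sketched as a combined generator objective in which the detector loss is simply added to the SR losses so that its gradients flow back into EESRGAN. The function name and weights below are illustrative placeholders, not the values used in our experiments:

```python
def generator_objective(l_pixel, l_adv, l_edge, l_det,
                        w_adv=0.005, w_edge=1.0, w_det=1.0):
    """Combined loss for the SR generator in end-to-end training: pixel,
    adversarial, and edge-consistency terms plus the backpropagated
    detector loss (weights are illustrative)."""
    return l_pixel + w_adv * l_adv + w_edge * l_edge + w_det * l_det
```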

AP versus IoU Curve
We calculated AP values at different IoU thresholds. In figure 8, we plot the AP versus IoU curves for our datasets. The performance of EESRGAN-FRCNN, end-to-end EESRGAN-FRCNN, and standalone FRCNN is shown in the figure. The end-to-end EESRGAN-FRCNN network performed better than the separately trained network. The difference is most evident at the higher IoUs on the COWC dataset.
Our results indicate excellent performance compared to the highest possible AP values obtained from standalone FRCNN (trained and tested on HR images). The OGST dataset displayed less performance variation than the COWC dataset. The objects in the OGST dataset are larger than those in the COWC dataset, and therefore the performance difference between standalone FRCNN and our method was smaller for the OGST dataset than for the COWC dataset. To conclude, training our new architecture in an end-to-end manner displayed an improvement on both datasets.

Precision versus Recall
In figure 9, precision-recall curves are shown for both of our datasets. The precision-recall curve for the COWC dataset is depicted in figure 9a, and figure 9b shows the curve for the OGST dataset. For each dataset, we plot the curves for standalone faster R-CNN with LR training/testing images and for our method with and without end-to-end training. We used IoU=0.5 to calculate precision and recall. The precision-recall curves for both datasets show that our method attains higher precision at high recall values than the standalone faster R-CNN models. Our models with end-to-end training performed better than those without. In particular, the end-to-end models detected more than 99% of the cars with 96% AP on the COWC dataset. For the OGST dataset, our end-to-end models detected more than 81% of the tanks with 97% AP.

Effects of Dataset Size
We trained our architecture with different training set sizes and tested with a fixed test set. In figure 10, we plot the AP values (IoU=0.5:0.95) against different numbers of labeled objects in the training data for both of our datasets. We used five different dataset sizes, {500, 1000, 3000, 6000, 10000} cars and {100, 200, 400, 750, 1491} tanks, to train our model with and without the end-to-end setting.
We obtained the highest AP value of 95.5% with our full COWC training dataset (10000 cars), using the same test set (1000 cars) for all training set sizes (with the end-to-end setting). We used another set of 1000 labeled cars for validation. Using 6000 cars, we obtained an AP value close to the highest AP, as shown in the plot of AP versus dataset size (COWC). The AP value decreased significantly when we used only 3000 labeled cars as training data. We obtained the lowest AP using only 500 labeled cars, and the trend of AP was still decreasing, as depicted in figure 10a. Therefore, we infer that around 6000 labeled cars were needed to obtain a precision higher than 90% for the COWC dataset. We observed slightly lower AP values for all COWC training set sizes when we did not use the end-to-end setting, and the differences between the two settings (with and without end-to-end) were larger when we used fewer than 6000 labeled cars. The OGST dataset yielded 83.2% AP (with the end-to-end setting) using the full training dataset (1491 tanks); we used 100 labeled tanks as test data and the same number as validation data for all training set sizes. We obtained high AP values with 50% of the full training dataset, as depicted in figure 10b. AP values dropped below 80% when we further decreased the training data. As with the COWC dataset, we obtained comparatively lower AP values for all OGST training set sizes without the end-to-end setting. The differences between the two settings (with and without end-to-end) were slightly larger when the training data consisted of fewer than 400 labeled tanks, as shown in the plot of AP versus dataset size (OGST dataset).
We used 90% of the OGST dataset for training, while we used 80% of the COWC dataset for the same purpose. The accuracy on the OGST test data increased slightly when we added more training data, as depicted in figure 10b. Therefore, we used a larger percentage of training data for the OGST dataset than for the COWC dataset, which slightly helped to improve the relatively low accuracy on the OGST test data.

Enhancement and Detection
In figure 11, we show input LR images, the corresponding generated SR images, the enhanced edge information, and the final detections. The image enhancement helped the detectors achieve high AP values and also made the objects easy to identify visually. It is evident from the figure that the visual quality of the generated SR images is good compared to the corresponding LR images, and that the FRCNN detector detected most of the objects correctly. In EEGAN [22], only the image consistency loss (L_img_cst) was used for enhancing the edge information. This loss generated noisy edge information, and as a result the final SR images became blurry. The blurry output with noisy edges obtained using only the L_img_cst loss is shown in figure 12a. The blurry final images gave lower detection accuracy than sharp outputs. Therefore, we introduced an edge consistency loss (L_edge_cst) in addition to the L_img_cst loss, which gives noise-free enhanced edge information similar to the edges extracted from the ground truth images; the effect of the L_edge_cst loss is shown in figure 12b. The ground truth HR image with extracted edges is depicted in figure 12c.
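As an illustration of the role of L_edge_cst, the following sketch extracts edges with a 3x3 Laplacian filter (a stand-in for the actual edge extractor, which is not reproduced here) and penalizes the L1 difference between the SR and ground-truth edge maps:

```python
import numpy as np

def laplacian_edges(img):
    """Edge map of a 2-D grayscale image via a 3x3 Laplacian kernel
    (illustrative; not the exact edge extractor used in EESRGAN)."""
    k = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)
    h, w = img.shape
    out = np.zeros((h, w), dtype=float)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y, x] = np.sum(k * img[y - 1:y + 2, x - 1:x + 2])
    return np.abs(out)

def edge_consistency_loss(sr, hr):
    """L1 distance between the edge maps of the SR output and the HR ground
    truth, mirroring the role of L_edge_cst described above."""
    return float(np.mean(np.abs(laplacian_edges(sr) - laplacian_edges(hr))))
```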

Discussion
The detection results of our method presented in the previous section indicate that our end-to-end SR-detector network improved detection accuracy compared to several other methods. Our method outperformed standalone state-of-the-art detectors such as SSD and faster R-CNN when applied to low-resolution remote sensing imagery. We used EESRGAN, EEGAN, and ESRGAN as the SR network with the detectors, and showed that our EESRGAN with the detectors performed better than the other methods and that the edge enhancement helped to improve detection accuracy. The AP improvement was larger at high IoU thresholds than at lower ones. We also showed that precision increased with higher resolution. The improvement of AP values for the OGST dataset was smaller than that for the COWC dataset because the area covered by a tank is slightly larger than that of a car, and the tanks' sizes and colors are less diverse than those of the cars.
Our experimental results indicated that the AP values could be improved slightly by increasing the training data. The results also demonstrated that we could use less training data for both datasets and still obtain a level of accuracy similar to that obtained with the full training data.
The faster R-CNN detector gave us the best results, but it took longer than the SSD detector. If detection results are needed for a vast area, then SSD would be the right choice, sacrificing some accuracy.
We had large numbers of cars from different regions in the COWC dataset, and we obtained high AP values at different IoUs. On the other hand, the OGST dataset needed more data to obtain a general detection result because we used data from a specific area and a specific season; this was one of the limitations of our experiment. Most likely, more data from different regions and seasons would make our method more robust for the use case of oil and gas storage tank detection. Another limitation is that we evaluated performance only on datasets that contain a single class with little variation. We look forward to exploring the performance of our method on a broader range of object types and landscapes from different satellite datasets.
We used LR-HR image pairs to train our architecture, and the LR images were generated artificially from their HR counterparts. To our knowledge, there is no suitable public satellite dataset that contains both real HR and real LR image pairs together with ground truth bounding boxes for detecting small objects. Therefore, we created LR images that do not precisely correspond to true LR images. However, improving resolution through deep learning has consistently improved object detection performance on remote sensing images (for both artificial and real low-resolution images), as discussed in the introduction and related works sections of this paper [5]. Impressive works [61,71] exist in the literature for creating realistic LR images from HR images. For future work, we plan to explore these approaches to create more accurate LR images for training.

Conclusions
In this paper, we propose an end-to-end architecture that takes LR satellite imagery as input and outputs object detection results. Our architecture contains an SR network and a detector network. We used different combinations of SR systems and detectors to compare AP values for detection on two different datasets. Our experimental results show that the proposed SR network with faster R-CNN yielded the best results for small objects on satellite imagery. However, we need to add more diverse training data to the OGST dataset to make our model robust in detecting oil and gas storage tanks. We also need to explore more diverse datasets and techniques for creating more realistic LR images. In conclusion, our method combines different strategies to provide a better solution to the task of small-object detection on LR imagery.