YOLO-Fine: One-Stage Detector of Small Objects Under Various Backgrounds in Remote Sensing Images

Abstract: Object detection from aerial and satellite remote sensing images has been an active research topic over the past decade. Thanks to the increase in computational resources and data availability, deep learning-based object detection methods have achieved numerous successes in computer vision, and more recently in remote sensing. However, the ability of current detectors to deal with (very) small objects remains limited. In particular, the fast detection of small objects from a large observed scene is still an open question. In this work, we address this challenge and introduce an enhanced one-stage deep learning-based detection model, called You Only Look Once (YOLO)-fine, which is based on the structure of YOLOv3. Our detector is designed to detect small objects with high accuracy and high speed, allowing further real-time applications within operational contexts. We also investigate its robustness to the appearance of new backgrounds in the validation set, thus tackling the issue of domain adaptation that is critical in remote sensing. Experimental studies conducted on both aerial and satellite benchmark datasets show a significant improvement of YOLO-fine as compared to other state-of-the-art object detectors.


Introduction
Object detection consists in both estimating the exact object position (object localization) and determining which category it belongs to (object classification). It is a central component in many real-world applications, including detection of pedestrians and vehicles in autonomous driving, detection of people in order to analyze their behaviors, and detection and further recognition of faces for surveillance purposes (see [1] for a recent survey). However, due to the high variations of viewpoints, poses, occlusion, and lighting conditions, object detection remains a challenging problem. Before the advent of deep learning, the standard object detection pipeline in computer vision was made of three main steps: (i) selecting regions of interest (ROIs), where objects may appear or be present, (ii) extracting some characteristics from these regions/objects (a.k.a. feature extraction), and (iii) feeding a supervised classifier with these features. These so-called handcrafted feature design-based approaches have generally provided good results for real-time applications in operational systems [1]. However, they were mostly unable to exceed a certain level of accuracy. Within the last decade, the emerging development of deep neural networks, and more particularly of convolutional neural networks (CNNs), brought both a paradigm change and some significant improvements [2,3]. This gain in accuracy has been possible thanks not only to some methodological advances, but also to the availability of large amounts of data and to the increased computing capacity. One of the most remarkable milestones in object detection was achieved by the proposition of Region-based CNN (R-CNN) in 2014 [4]. Since then, object detection using deep learning approaches has evolved at an unstoppable speed [1][2][3], attempting to achieve better and better detection accuracy as well as to ensure faster and faster detection for real-time applications.
We briefly mention here some milestones in two-stage detection frameworks, i.e., which first generate regions of interest and then perform classification and bounding box regression for each proposal, such as Fast R-CNN [5], Faster R-CNN [6], R-FCN (Region-based Fully Convolutional Network) [7], Feature Pyramid Network [8], Mask R-CNN [9], etc., and one-stage detectors, i.e., which perform a one-pass regression of class probabilities and bounding box locations, such as YOLO (You Only Look Once) [10], YOLOv2 [11], YOLOv3 [12], SSD (Single Shot Multibox Detector) [13], DSOD (Deeply Supervised Object Detector) [14], RetinaNet [15], RON (Reverse connection with Objectness prior Networks) [16], M2Det [17], EfficientDet [18], etc. In general, two-stage detectors favor detection accuracy while one-stage models are faster. To illustrate, we recall the results reported in [14]: the inference speed of Faster R-CNN, R-FCN, SSD, and YOLOv2 for PASCAL VOC data is, respectively, 7, 11, 46, and 81 FPS (frames per second), while the detection rates provided by the two former (two-stage) were better than the two latter (one-stage). For more details about their concept and performance, we refer readers to the aforementioned surveys.
In remote sensing, object detection has also been an active research area due to its central role in various Earth Observation (EO) applications: detecting and counting man-made structures (such as roads, buildings, airports, etc.) [19][20][21]; detecting and tracking moving objects (such as vehicles, ships, etc.) [22][23][24][25]; and, detecting endangered species (e.g., wildlife animals, sea mammals, etc.) [26,27]. The popularity of CNN-based object detection methods introduced in the computer vision domain made these solutions a first choice when detecting objects from optical remote sensing data (e.g., aerial and satellite images). However, such a transfer from computer vision to remote sensing is not straightforward due to the specific characteristics of geospatial objects that appear in the observed scenes. Unlike in natural images, objects in remote sensing data are highly varied in size and shape, from very small (vehicle, animal, etc.) to very large (airport, football yard, swimming pool, etc.) ones, with also a high range of spectral signatures depending on the acquisition sensor, acquisition lighting, weather conditions, etc. Unsurprisingly, data collection and annotation play an important role in achieving good performance in addition to the design or improvement of detection methods. We refer readers to some recent surveys on object detection in remote sensing using deep learning [28][29][30] for more details about detectors, data, and applications in the remote sensing community.
One of the most challenging tasks pointed out by the above surveys is the limited performance of current detectors when dealing with small and very small objects (i.e., less than 10 pixels in width), such as vehicles or wildlife animals in high resolution optical satellite images [28][29][30]. The trade-off between detection accuracy and computational time does not allow for rapidly detecting small objects in large EO scenes, in particular to perform real-time detection or to use tracking systems within an operational context. Both two-stage and one-stage detector models have their own issues when dealing with small objects. While two-stage frameworks are very slow (as reported in [14]), one-stage models are often designed for fast inference and not to deal with small objects. In this paper, we aim at addressing real-time detection of small objects in remote sensing images. To do so, we propose to build a one-stage object detector that could succeed in detecting small objects with both high accuracy and high speed. The proposed detector is named YOLO-fine and it focuses on small object detection by performing finer regression to search for finer objects. Our YOLO-fine is inspired by YOLOv3 [12], which is one of the state-of-the-art detectors in computer vision as well as in remote sensing, as illustrated in the next section. We also investigate the ability of our detector to deal with new backgrounds that appear in the validation set. Experiments conducted on both aerial and satellite datasets show that YOLO-fine significantly improves the detection performance w.r.t. state-of-the-art detectors, deals better with various backgrounds to detect known objects, and is able to perform real-time prediction.
In the remainder of the paper, Section 2 briefly goes through literature works in object detection in remote sensing with a focus on small objects. Section 3 revisits the one-stage YOLO detector family and describes our proposal, called YOLO-fine. In Section 4, we report our experimental studies on three different datasets, including two aerial (VEDAI [31] and MUNICH [32]) and one satellite (XVIEW [33]) remote sensing datasets. We also provide our experimental design to investigate the performance of our detector regarding various types of backgrounds. Finally, the main contributions and improvement prospects are considered in Section 5.

Related Studies
Many efforts have been devoted to tackling the detection of small objects in the computer vision domain, mostly by adapting the existing one-stage and two-stage frameworks mentioned in our introduction to better deal with small objects. We refer readers to a very recent survey on small object detectors in computer vision based on deep learning [34]. In this review, the authors have mentioned five crucial aspects that are involved in recent small object detection frameworks, including multiscale feature learning, data augmentation, training strategy, context-based learning, and generative network-based detection. They also highlighted some powerful models to detect generic small objects, such as improved Faster R-CNN [35,36], Feature-fused SSD [37], MDSSD (Multi-scale deconvolutional SSD) [38], RefineDet [39], SCAN (Semantic context aware network) [40], etc. In the remote sensing community, the detection of small objects has been mostly tackled by exploiting two-stage object detectors thanks to their capacity to generally provide more accurate detection performance as compared to one-stage detectors. In this scenario, the region-based approach appears to be more relevant since the first stage, whose aim is to search for candidate objects, could be set to focus on small regions and ignore large ones. Many recent works have exploited state-of-the-art two-stage detectors, such as Faster R-CNN [22,41,42,43], deconvolution R-CNN [44], and deformable R-CNN [45,46], to detect, e.g., small vehicles, airplanes, ships, and man-made structures in remote sensing datasets collected mostly from aerial or Google Earth images. In [41], the RPN (Region proposal network) of Faster R-CNN was modified by setting appropriate anchors and leveraging a single high-level feature map of finer resolution for small region searching. The authors also incorporated contextual information with the proposal region to further boost the performance of detecting small objects.
In [44], Deconv R-CNN was proposed by setting a deconvolution layer after the last convolutional layer in order to recover more details and better localize the position of small targets. This simple but efficient technique helped to increase the performance of ship and plane detection compared to the original Faster R-CNN. In [46], an IoU-adaptive deformable R-CNN was developed with the goal of adapting the IoU threshold according to the object size, to ease dealing with small objects whose loss (according to the authors) would be absorbed during the training phase. Therefore, such an IoU-based weighted loss was adopted to train the network and improve the detection performance on small objects in the large-scale aerial DOTA dataset [47]. Recently, other developments based on a feature pyramid detector (MCFN) [48], R2CNN [49], or SCRNet [50] were proposed within the two-stage detection scheme to tackle small object detection. While these methods achieve a proper level of accuracy, they are still limited by their computational cost (see results from [14] recalled in the introduction).
In contrast to the growing number of studies relying on two-stage detection networks to deal with remote sensing objects of small size, the use of one-stage detectors has been less explored. Only a few studies have used YOLO-based frameworks (i.e., YOLOv2 and YOLOv3) to perform small object detection, such as persons [51], aircraft [52], ships [53], and building footprints [54], from remote sensing images. This relatively limited interest could be explained by the fact that, although YOLO detectors could be seen as a first choice for real-time applications in computer vision, they provide much lower detection accuracy when compared to region-based detectors, notably in detecting small objects. In [51], UAV-YOLO was proposed to adapt YOLOv3 to detect small objects from unmanned aerial vehicle (UAV) data. A slight modification of YOLOv3 was made by concatenating two residual blocks of the network backbone having the same size. The authors actually focused more on training optimization of their dataset with UAV-viewed perspectives. They reported superior performance when compared to YOLOv3 and SSD models. In [52], the authors exploited YOLOv3 without any modification and showed that its performance in computational time is much better than Faster R-CNN or SSD. In [53], the authors compared YOLOv3 and YOLT (You Only Look Twice) [55], which is actually quite similar to YOLOv2, for ship detection. Again, no modification of the original architecture was proposed. In [54], the authors proposed the locally-constrained YOLO (named LOCO) to specifically detect small and dense building footprints. However, LOCO was designed for building detection with rectangular forms and could not be generalized to other objects with various shapes and forms. We argue that YOLO or other one-stage detectors, despite their limited use in remote sensing, remain appealing for real-time applications in remote sensing thanks to their high efficiency.
In the sequel, we will present our enhanced one-stage architecture, called YOLO-fine, to achieve the detection of small objects with both high accuracy and efficiency, thus enabling its embedding into real-time operating systems.

Inspiration from the YOLO Family
YOLO (You Only Look Once) [10] is an end-to-end deep learning-based detection model that determines the bounding boxes of the objects present in the image and classifies them in a single pass. It does not involve any region proposal phase conversely to two-stage detectors, such as the R-CNN family. The YOLO network first divides the input image into a grid of S × S non-overlapping cells. For each cell, YOLO predicts three elements: (1) the probability of an object being present; (2) if an object exists in this cell, the coordinates of the box surrounding it; and, (3) the class c the object belongs to and its associated probability. As illustrated in Figure 1a, for each cell, the network predicts B bounding boxes. Each of them contains five parameters (x, y, w, h, sc), where sc is the objectness confidence score of the box. Subsequently, the network calculates the probabilities of the classes for each cell. If C is the number of classes, the output of YOLOv1 is a tensor of size (S, S, B × 5 + C).
In the literature, it is usually reported that the first YOLO version (YOLOv1) is faster (i.e. lower computational time), but less accurate than SSD [13], another popular one-stage detector. In 2017, YOLOv2 [11] (also called YOLO9000 at the beginning) was proposed to significantly improve the detection accuracy while making it even faster. Many changes were proposed as compared to the first version. YOLOv2 only uses convolutional layers without fully-connected layers and it introduces the anchor boxes rather than arbitrary boxes present in YOLOv1. These are a set of predefined enclosing boxes of certain height and width to capture the scale and aspect ratio of the objects to be detected. Anchors are usually set based on the size (and aspect ratio) of the objects in the training set. The class probabilities are also calculated for each anchor box and not for each cell as in YOLOv1 (see Figure 1b). Finally, the detection grid is also refined (i.e. 13 × 13 instead of 7 × 7 in YOLOv1).
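To make the idea of data-driven anchors concrete, the following minimal sketch clusters the (width, height) pairs of training boxes with k-means under a 1 − IoU distance, in the spirit of the dimension clusters introduced with YOLOv2 [11]. The function names, the deterministic initialisation, and the sample boxes are assumptions made for illustration; the exact anchor-selection procedure used in this work is not specified here.

```python
def iou_wh(box, anchor):
    """IoU of two boxes given as (width, height), aligned on the same corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iterations=50):
    """Cluster (w, h) pairs with distance 1 - IoU; return k anchors sorted by area."""
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    # Deterministic initialisation (an assumption): k boxes evenly spaced
    # in the area-sorted list.
    anchors = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Highest IoU is equivalent to smallest 1 - IoU distance.
            best = max(range(k), key=lambda j: iou_wh(box, anchors[j]))
            clusters[best].append(box)
        new_anchors = []
        for j, members in enumerate(clusters):
            if members:
                w = sum(m[0] for m in members) / len(members)
                h = sum(m[1] for m in members) / len(members)
                new_anchors.append((w, h))
            else:
                new_anchors.append(anchors[j])  # keep the anchor of an empty cluster
        if new_anchors == anchors:
            break  # assignments are stable
        anchors = new_anchors
    return sorted(anchors, key=lambda a: a[0] * a[1])


# Hypothetical (width, height) pairs in pixels, standing in for training-set boxes.
sample_boxes = [(8, 8), (9, 9), (10, 8), (20, 10), (22, 11), (18, 9)]
anchors = kmeans_anchors(sample_boxes, k=2)
print(anchors)
```

Using IoU rather than Euclidean distance keeps large boxes from dominating the clustering, which matters when most objects are only a few pixels wide.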
Inspired by the evolution of two-stage detection networks, in particular the Feature Pyramid Network [8], YOLOv3 was proposed in 2018 [12] to further improve the detection accuracy, and in particular to be able to find objects of different sizes in the same image. YOLOv3 offers three detection levels instead of only one in the two previous versions, which helps to search for smaller objects (see Figure 1c). Compared to YOLOv2, the following changes were made: (1) predict three box anchors for each cell instead of five in YOLOv2; (2) detect at three different levels with the searching grids S × S, 2S × 2S, and 4S × 4S; (3) exploit a deeper backbone network (Darknet-53) for feature map extraction. As a result, the number of layers increases considerably to 106 as compared to 31 in the two previous versions. Obviously, because YOLOv3 offers a deeper feature extraction network with three-level prediction, it also becomes slower than YOLOv2 (the fastest in the YOLO family). However, because one-stage detectors in general, and YOLO in particular, are characterized by rather lower accuracy than their two-stage counterparts, they remain largely unexplored for detecting small objects in remote sensing, as noted in Section 2.
Figure 1. (a) In YOLOv1, the output is a tensor of dimension (S, S, B × 5 + C). By default in [10], S = 7, B = 2 and C = 20 for the PASCAL VOC dataset. For an input image of size 448 × 448 pixels, the output is a tensor of size 7 × 7 × 30. (b) In YOLOv2, the output is a tensor of dimension (S, S, B × (5 + C)). The difference is that the class probabilities are calculated for each anchor box. By default in [11], S = 13, B = 5 anchor boxes and C = 20 for the PASCAL VOC dataset. For an input image of size 416 × 416 pixels, the output is a tensor of size 13 × 13 × 125. (c) In YOLOv3, the output consists of 3 tensors of dimension (S, S, B × (5 + C)), (2S, 2S, B × (5 + C)) and (4S, 4S, B × (5 + C)), which correspond to the 3 detection levels (scales). By default in [12], S = 13, B = 3 anchor boxes and C = 80 for the COCO dataset. For an input image of size 416 × 416 pixels, the outputs are three tensors of size 13 × 13 × 255, 26 × 26 × 255 and 52 × 52 × 255.
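The output dimensions quoted for the three YOLO versions can be checked with a few lines of arithmetic. The sketch below simply encodes the formulas (S, S, B × 5 + C) and (S, S, B × (5 + C)) given in the text; the function name is a hypothetical helper, not part of any YOLO implementation.

```python
def yolo_output_shapes(version, S, B, C):
    """Return the list of output tensor shapes for a given YOLO version."""
    if version == 1:
        # YOLOv1: class probabilities are shared by the B boxes of a cell.
        return [(S, S, B * 5 + C)]
    if version == 2:
        # YOLOv2: each anchor box carries its own class probabilities.
        return [(S, S, B * (5 + C))]
    if version == 3:
        # YOLOv3: three detection levels with grids S, 2S and 4S.
        return [(k * S, k * S, B * (5 + C)) for k in (1, 2, 4)]
    raise ValueError("version must be 1, 2 or 3")


print(yolo_output_shapes(1, S=7, B=2, C=20))   # PASCAL VOC defaults of [10]
print(yolo_output_shapes(2, S=13, B=5, C=20))  # PASCAL VOC defaults of [11]
print(yolo_output_shapes(3, S=13, B=3, C=80))  # COCO defaults of [12]
```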

Proposed Model: YOLO-Fine
The YOLO-fine network proposed in this paper relies on the structure of YOLOv3 and pursues three objectives: (1) detect small and very small objects (i.e., with a height or width lower than 10 pixels) better than any other one-stage detector; (2) be efficient enough to enable prediction in real-time applications; and, (3) be lighter in parameters and weights to facilitate the implementation within an operational context. The proposed network is summarized in Figure 2 and compared to the three reference YOLO versions in Table 1.
Table 1. Comparison of the three YOLO versions and our YOLO-fine model in terms of architecture parameters. We note that the reported input image size of each model was the one originally proposed by the authors. In our experiments, we set all to 512 × 512 in order to perform a fair comparison (cf. Section 4).
In order to achieve the aforementioned objectives, we have designed YOLO-fine with the following improvements.

1. While YOLOv3 is appealing for operational contexts due to its speed, its performance in detecting small objects remains limited because the input image is divided into three detection grids with subsampling factors of 32, 16, and 8. As a result, YOLOv3 is not able to detect objects measuring less than eight pixels per dimension (height or width) or to discriminate two objects that are closer than eight pixels. The ability of YOLOv3 to detect objects of a wide range of sizes is relevant in numerous computer vision applications, but not in our context of small object detection in remote sensing. The highly subsampled layers are then no longer necessary, thus our first proposition is to remove the two coarse detection levels related to large-size objects often observed in natural images. We replace them with two finer detection levels with lower subsampling factors of 4 and 2, using skip connections that link the feature maps from high-level layers to those from low-level layers, which have a very high spatial resolution. The objective is to refine the object search grid in order to be able to recognize and discriminate objects smaller than eight pixels per dimension in the image. Moreover, objects that are relatively close (such as building blocks, cars in a parking lot, small boats in a harbor, etc.) could be better discriminated.

2. In order to facilitate the storage and implementation within an operational context, we also attempt to remove unnecessary convolutional layers from the backbone Darknet-53. Based on our experiments, we found that the last two convolutional blocks of Darknet-53 include a high number of parameters (due to the high number of filters), but are not useful to characterize small objects due to their high subsampling factors. Removing these two blocks reduces both the number of parameters in YOLO-fine (making it lighter) and the feature extraction time required by the detector.

3. The three-level detection (that does not exist in YOLOv1 and YOLOv2) strongly helps YOLOv3 to search and detect objects at different scales in the same image. However, this comes with a computational burden for both YOLOv3 and our YOLO-fine. Moreover, when refining the search grid, the training and prediction times also increase, hence the compromise between detection accuracy and computing time. The reason that we maintain the three detection levels in YOLO-fine is to make it able to provide good results for various sizes of very small and small objects, as well as to provide at least equivalent performance to YOLOv3 on medium or larger objects (i.e., the largest scale of YOLO-fine is the smallest of YOLOv3). Thanks to the efficient behavior of YOLO in general, our YOLO-fine architecture remains able to perform detection in real time using a GPU. We report this accuracy/time trade-off later in the experimental study.
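The effect of replacing the two coarse detection levels by two finer ones can be illustrated by computing the detection grids for a 512 × 512 input. The sketch below assumes that the three YOLO-fine levels use subsampling factors of 8, 4, and 2, as described in the first improvement; the names are hypothetical.

```python
def detection_grids(input_size, strides):
    """Return (stride, grid side) pairs: one grid cell covers stride x stride pixels."""
    return [(s, input_size // s) for s in strides]


YOLOV3_STRIDES = (32, 16, 8)    # subsampling factors of YOLOv3
YOLO_FINE_STRIDES = (8, 4, 2)   # two coarse levels replaced by two finer ones

for name, strides in (("YOLOv3", YOLOV3_STRIDES), ("YOLO-fine", YOLO_FINE_STRIDES)):
    for stride, grid in detection_grids(512, strides):
        print(f"{name}: stride {stride:2d} -> grid {grid} x {grid}")
```

The finest YOLO-fine grid has 256 × 256 cells against 64 × 64 for YOLOv3, i.e., 16 times more cells to evaluate at that level, which is the source of the accuracy/time trade-off discussed above.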
Table 1 provides a comparison of YOLO-fine with the three YOLO versions from the literature. We can see that our YOLO-fine model contains significantly fewer layers than YOLOv3 (68 versus 103) due to the removal of the two final convolution blocks of Darknet-53 as mentioned above. Although our model is deeper than YOLOv1 and YOLOv2 (31 layers each), it provides multilevel detection, as in YOLOv3. More importantly, the size of the weight file remains the smallest with only 18 MB, as compared to 237 MB for YOLOv3 and even higher for the two older versions. This weight file size depends on the number of convolution filters of each layer more than on the number of layers. The size of the input image is set to 512 × 512 pixels, since this choice allows us to work with the experimental datasets that will be described in Section 4. This size can be easily set in the network during implementation. It is also important to note that even if YOLO-fine is developed to target the detection of small objects (preferably with dimension ≤ 20 pixels) in operational mode, it can still detect medium or large objects. In this case, its accuracy is comparable to YOLOv3 but with a higher prediction time because of its finer searching grids. We support these observations by a set of experiments reported in the next section.
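The remark that the weight file size depends on the number of filters more than on the number of layers follows from the parameter count of a convolution, k × k × C_in × C_out. The sketch below is only an illustration: the channel counts are assumed values chosen to contrast a wide deep-backbone layer with a narrow early layer, and weights are assumed to be stored as 32-bit floats.

```python
def conv_params(kernel, c_in, c_out, bias=False):
    """Number of weights of a kernel x kernel convolution (optionally with biases)."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


def size_mb(n_params, bytes_per_param=4):
    """Storage in megabytes, assuming 32-bit floats by default."""
    return n_params * bytes_per_param / 1e6


wide = conv_params(3, 512, 1024)   # illustrative layer with many filters
narrow = conv_params(3, 64, 128)   # illustrative layer with few filters
print(f"wide:   {wide:,} params, {size_mb(wide):.2f} MB")
print(f"narrow: {narrow:,} params, {size_mb(narrow):.2f} MB")
```

Under these assumptions, a single wide layer weighs about as much as 64 narrow ones, which is why dropping the blocks with the most filters shrinks the model far more than the reduction in layer count alone would suggest.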

Experimental Study
In this section, we first provide details of the public datasets used in our experiments, namely VEDAI (color and infrared versions) [31], MUNICH [32] and XVIEW [33]. The first two datasets are made of aerial images while the last one contains satellite images. Next, we describe the experimental setup and recall the standard evaluation metrics used in our study. We then show and discuss the detection accuracy obtained by our model as compared against its counterparts, as well as provide a comparison on their performance in terms of storage requirement and computational time. More interestingly, we also set up and study the effect of new backgrounds appearing in validation sets.

VEDAI
The VEDAI aerial image dataset [31] serves as a reference public source for multiple small-vehicle detection under various environments. The objects in VEDAI, in addition to being small, present different variabilities, such as multiple orientations, changes in lighting, shading, and occlusions. Additionally, different types of background can be found in this database, including urban, peri-urban, and rural areas, or even more varied environments (desert, forest, etc.). Each image in VEDAI is available in both color and infrared versions. The spatial resolutions are 12.5 cm for the 1024 × 1024 images and 25 cm for the 512 × 512 images, while object dimension varies between 16 and 40 pixels (for the 12.5 cm version) and between eight and 20 pixels (for the 25 cm version). In our experiments, we exploit both 25-cm and 12.5-cm versions (called VEDAI512 and VEDAI1024 in the sequel), and also both color and infrared versions to study the performance of our proposed model. Note that, for the 12.5 cm version (VEDAI1024), the images were cut into patches of 512 × 512 pixels to fit the network's input size. The original dataset contains 3757 objects belonging to nine different classes: car, truck, pickup, tractor, camper, ship, van, plane, other. However, since there are only a few objects of the class "plane", we decided to merge them with the class "other". An illustrative image is shown in Figure 3. A precise experimental protocol by cross-validation is also provided by the authors of [31], ensuring that the experimental results obtained by different models can be properly replicated and compared. Indeed, the dataset, which contains 1250 images, is divided into training and validation sets using 10-fold cross-validation (it is divided into 10 subsets and, at each experiment, nine are used for training and the remaining one is used for validation). We also adopted this cross-validation protocol within our experiments.
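The 10-fold scheme described above can be sketched as follows. This is a generic round-robin split written for illustration only: the fold assignment is an assumption, and reproducing published results requires the fold definitions provided by the authors of [31].

```python
def k_fold_splits(items, k=10):
    """Yield (train, validation) lists; each fold serves once as validation set."""
    folds = [items[i::k] for i in range(k)]  # round-robin assignment (assumption)
    for i in range(k):
        validation = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation


image_ids = list(range(1250))  # placeholder identifiers for the 1250 images
splits = list(k_fold_splits(image_ids, k=10))
for train, validation in splits[:2]:
    print(len(train), len(validation))
```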

MUNICH
We also consider the DLR 3K Munich Vehicle dataset [32] (called MUNICH in this section). Similarly to VEDAI, MUNICH offers a large number of small vehicles and a large variability of backgrounds. The dataset contains 20 large images of size 5616 × 3744 pixels taken by an RGB sensor at an altitude of 1000 m with a spatial resolution of 13 cm. Two classes of vehicles are considered: 9300 cars and 160 trucks. Because of this strong class imbalance, MUNICH is particularly interesting to evaluate detectors in a context where the number of objects per category is highly unbalanced.
In this work, we first cut the large images into patches of 1024 × 1024 pixels, then resize them into 512 × 512 pixels (which corresponds to a spatial resolution of 26 cm), so that the vehicles would have smaller sizes (from 8 to 20 pixels). Our dataset finally contains 1226 training images and 340 validation images. Figure 4 shows an illustrative image.
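The patch extraction step can be sketched as below. The stride (here equal to the tile size, i.e., no overlap) and the handling of image borders (the last tile is shifted back so that it stays inside the image) are assumptions, since the text does not specify them; the helper names are hypothetical.

```python
def tile_origins(length, tile, stride):
    """Start offsets of tiles of size `tile` along an axis of size `length`."""
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] != length - tile:
        starts.append(length - tile)  # shift the last tile back inside the image
    return starts


def tile_boxes(width, height, tile=1024, stride=1024):
    """Return (x_min, y_min, x_max, y_max) for every tile of the image."""
    return [(x, y, x + tile, y + tile)
            for y in tile_origins(height, tile, stride)
            for x in tile_origins(width, tile, stride)]


tiles = tile_boxes(5616, 3744)          # one MUNICH image
resized_gsd_cm = 13 * 1024 / 512        # resolution after resizing to 512 x 512
print(len(tiles), resized_gsd_cm)
```

Halving each patch from 1024 to 512 pixels doubles the ground sample distance from 13 cm to 26 cm, which is what brings the vehicles down to the 8 to 20 pixel range targeted here.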

XVIEW
The third and last dataset considered in our study is XVIEW [33]. It was collected from WorldView-3 satellites at 30-cm spatial resolution (ground sample distance). A total of 60 classes are available, but since we focus here on small objects, we gather 19 classes of vehicles, namely {17, 18, 19, 20, 21, 23, 24, 26, 27, 28, 32, 41, 60, 62, 63, 64, 65, 66, 91} (these numbers correspond to the initial classes from the original XVIEW data), to create only one vehicle class. Our purpose is not to achieve a state-of-the-art detection rate on the XVIEW dataset, but to experiment and validate the capacity of YOLO-fine to detect vehicles from such high resolution satellite images. This single-class dataset presents the highest number of training and validation images, with a total of 35k vehicles whose size varies from six to 15 pixels. Three sample images are given in Figure 5.
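The merging of the 19 vehicle classes into a single class amounts to a simple filter-and-relabel step. In the sketch below, the class identifiers are those listed in the text, while the annotation layout (a list of (class_id, box) pairs) and the function name are assumptions made for illustration.

```python
# Original XVIEW class identifiers gathered into the single "vehicle" class.
XVIEW_VEHICLE_IDS = frozenset(
    {17, 18, 19, 20, 21, 23, 24, 26, 27, 28, 32, 41, 60, 62, 63, 64, 65, 66, 91}
)


def to_single_vehicle_class(annotations):
    """Keep only vehicle annotations and relabel them as class 0."""
    return [(0, box) for class_id, box in annotations if class_id in XVIEW_VEHICLE_IDS]


# Hypothetical annotations: (class_id, (x_min, y_min, x_max, y_max)).
sample = [(18, (10, 10, 20, 18)), (73, (0, 0, 200, 150)), (91, (40, 40, 52, 49))]
merged = to_single_vehicle_class(sample)
print(merged)
```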

Experimental Setup and Evaluation Criteria
Experiments were conducted on our three datasets to study the performance of the proposed YOLO-fine as compared to several state-of-the-art object detection frameworks. We considered several detection models from the YOLO family (https://github.com/AlexeyAB/darknet), including YOLOv2 [11], YOLOv3 [12], YOLOv3-tiny (which is a fast and light version of YOLOv3) and YOLOv3-spp (which is YOLOv3 with spatial pyramid pooling operator [56]). We also investigated other one-stage and two-stage methods, such as SSD (https://github.com/lufficc/SSD) [13,57], Faster R-CNN (https://github.com/facebookresearch/maskrcnn-benchmark) [6,58] (both with VGG16 backbone), RetinaNet (https://github.com/yhenon/pytorch-retinanet) [15] (with ResNet-50 backbone), and EfficientDet (https://github.com/zylo117/Yet-Another-EfficientDet-Pytorch) [18] to enrich our comparative study. Although different implementations were adopted (mentioned in the provided links), equivalent parameters were set in order to perform a fair comparison. We retained the standard parameter setting of each framework and adapted the anchor sizes with regard to the training set of each dataset. The input size to all networks was fixed to 512 × 512 and batch size was set to 64. All of the detectors were trained with a learning rate of 0.001 with a momentum of 0.9 during a fixed number of 200 epochs from which the best detection result was reported for each detector.
The following standard criteria are exploited to quantitatively evaluate and compare the detection accuracy:
• Intersection over Union (IoU) is one of the most widely used criteria in the literature to evaluate the object detection task. This criterion measures the overlapping ratio between the detected box and the ground truth box. IoU varies between 0 (no overlap) and 1 (total overlap) and decides if the predicted box is an object or not (according to a threshold to be set).
• Precision, recall, and F1-score are computed as Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F1-score = 2 × Precision × Recall / (Precision + Recall), with TP, FP, and FN denoting True Positive, False Positive, and False Negative, respectively.
• Precision/Recall curve plots the precision as a function of the recall rate. It is a decreasing curve. When we lower the threshold of the detector (the confidence index), the recall increases (as we will detect more objects), but the precision decreases (as our detections are more likely to be false alarms), and vice versa. The visualization of the precision/recall curve gives us a global vision of the compromise between precision and recall.
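The criteria above can be written down in a few lines. The sketch below uses the standard definitions of IoU, precision, recall, and F1-score; the box format (x_min, y_min, x_max, y_max) and the function names are assumptions, and the numbers in the usage lines are illustrative rather than taken from the experiments.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1-score from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# A 10 x 10 pixel box shifted by only 2 pixels already loses a third of its IoU.
small_shift = iou((0, 0, 10, 10), (2, 0, 12, 10))
# Illustrative counts: 70 correct detections, 30 false alarms, 30 missed objects.
p, r, f1 = precision_recall_f1(tp=70, fp=30, fn=30)
print(round(small_shift, 3), round(p, 2), round(r, 2), round(f1, 2))
```

The first usage line shows why small objects are demanding for localization: a shift of two pixels is negligible for a large object but brings a 10-pixel box close to commonly used IoU thresholds.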

Detection Performance
The detection performance is now assessed on each dataset (VEDAI, MUNICH, and XVIEW). Note that for VEDAI, we focus on the VEDAI512 dataset (with 25-cm resolution) as it is more relevant to the scope of our paper, since its objects are smaller (from eight to 20 pixels). Nevertheless, detection results on the VEDAI1024 dataset are provided in order to show that YOLO-fine provides slightly better or at least equivalent accuracy for detecting medium and larger objects. Finally, the computational time of YOLO-fine compared to other YOLO versions, measured on three different GPUs, is reported.

VEDAI512 Color and Infrared
The detection results on VEDAI512 obtained with different detection models compared to the proposed YOLO-fine are shown in Tables 3 and 4 for the color and infrared versions, respectively. We recall that the object size varies here from eight to 20 pixels (Table 2). As observed in Table 3a, our YOLO-fine reached mAP = 68.18%, showing an increase of 6.09% when compared to the second-best counterpart, which is YOLOv3 (mAP = 62.09%). YOLOv3-spp obtained results equivalent to YOLOv3 with mAP = 61.57%, while YOLOv3-tiny provides the poorest results (mAP = 44.97%) among the YOLO-based models. This is not surprising, since the YOLOv3-tiny model was designed to perform very fast detection for tracking systems. We report its performance in computational time later within the results on XVIEW. Other detectors, including SSD, EfficientDet, Faster R-CNN, and RetinaNet, provided lower detection accuracy, since they were not designed to detect small objects. Even though their anchor sizes were adapted according to the training set, they could not naturally yield good detection results for small objects without any modification and adaptation of their network architectures. However, we will observe later their good performance in detecting medium and larger objects in VEDAI1024. Back to Table 3a, we observe that the best results for four of the eight classes (pickup, tractor, boat, and van) were achieved by the proposed YOLO-fine. It is important to note that very good results were obtained for the three classes of small vehicles: car (76.77%, second-best after YOLOv3-spp), pickup (74.35%, best), and tractor (78.12%, best), especially for the tractor class, which is often difficult to detect with all other detectors (the second-best result was only 67.46%, thus more than 10% lower than the YOLO-fine model).
From Table 3b, YOLO-fine globally produced a good number of true positives and few false positives/negatives, thus achieving a good balance between precision (0.67) and recall (0.70), as well as the highest F1-score (0.69). This behavior is confirmed in Figure 6a, which shows the precision/recall curves obtained by all detectors for the color VEDAI512 when the detector confidence threshold ranges from 0 to 1. We observe that the YOLO-fine curve (red) remains higher than the others, thus confirming the superior performance of YOLO-fine w.r.t. existing methods, regardless of the detection threshold. From Table 4, similar results were obtained on the infrared VEDAI512. Let us note that this specific VEDAI infrared version has rarely been investigated in the literature. We are interested here in assessing the performance of our model on infrared images, which are of high importance in many remote sensing scenarios. Our first observation is that, for this dataset, exploiting only the infrared band provided results equivalent to those obtained with the color bands. This shows that infrared VEDAI images contain information for discriminating different vehicle classes that is as rich as that of the three color bands. Back to the detection results from Table 4a, YOLO-fine achieved the highest mAP of 68.83%, i.e., a gain of 5.27% when compared to the second-best result of YOLOv3-spp (mAP = 63.56%). Again, YOLOv3 yielded results equivalent to YOLOv3-spp, while SSD, Faster R-CNN, and EfficientDet achieved results equivalent to YOLOv2 (mAP = 50.36%), much lower than the proposed YOLO-fine. In terms of classwise accuracy, the proposed model again achieved the best results for pickup (70.65%) and boat (60.84%), as in the previous case of color VEDAI512. Good results were obtained as well for other small vehicle classes, namely car (79.68%) and pickup (74.49%), even if they were not the best.
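The precision/recall curves discussed above are obtained by sweeping the detector confidence threshold and recomputing precision, recall, and F1-score at each value. A minimal sketch of this procedure is given below. It assumes that each detection has already been matched against the ground truth; the function name `pr_curve`, the matching step, and the toy detections are illustrative assumptions, not the evaluation code behind Table 3b or Figure 6.

```python
from typing import List, Tuple


def pr_curve(
    detections: List[Tuple[float, bool]],
    num_ground_truth: int,
    thresholds: List[float],
) -> List[Tuple[float, float, float, float]]:
    """Return (threshold, precision, recall, F1) for each confidence threshold.

    `detections` holds (confidence, is_true_positive) pairs. Matching the
    detections to ground-truth boxes is assumed to have been done beforehand.
    """
    curve = []
    for t in thresholds:
        # Keep only the detections whose confidence reaches the threshold.
        kept = [is_tp for conf, is_tp in detections if conf >= t]
        tp = sum(kept)
        fp = len(kept) - tp
        fn = num_ground_truth - tp
        # Convention (assumption): precision is 1.0 when nothing is kept.
        precision = tp / (tp + fp) if kept else 1.0
        recall = tp / (tp + fn) if num_ground_truth else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        curve.append((t, precision, recall, f1))
    return curve


# Toy data (hypothetical): five detections, four ground-truth objects.
toy = [(0.95, True), (0.80, True), (0.60, False), (0.55, True), (0.30, False)]
for t, p, r, f1 in pr_curve(toy, num_ground_truth=4, thresholds=[0.0, 0.5, 0.9]):
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Note that recomputing the F1-score from the rounded precision and recall quoted in the text may differ in the last digit from the tabulated value, which is presumably derived from unrounded counts.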
Finally, the results from Table 4b and Figure 6b also confirm the good recall/precision balance provided by YOLO-fine. The best F1-score of 0.71 was achieved by our model, significantly higher than the other detectors, such as EfficientDet(D1) and YOLOv3 (both 0.64), YOLOv3-spp (0.65), or SSD (0.63).
Table 5 provides the detection results on VEDAI1024 (12.5-cm resolution). This dataset is a higher-resolution version of the previous one, also proposed by the authors in [31]. Recall that the size of the vehicles in VEDAI1024 ranges from 16 to 40 pixels, thus double the size of those in VEDAI512. We therefore consider the objects in VEDAI1024 to be of medium size rather than small size. By conducting experiments on this dataset, we show that, although YOLO-fine is designed mainly to detect small objects, it can provide detection accuracy at least equivalent to that of the original YOLOv3 for other object sizes. Table 5 shows the comparative detection results of the different detectors on the color and infrared versions of VEDAI1024. Globally, the detection results of all detectors are better than those on the VEDAI512 version. In particular, we now observe good results from reference detectors, including SSD, Faster R-CNN, RetinaNet, and especially EfficientDet, which was not the case for VEDAI512, as previously reported. This could be explained by the fact that objects in VEDAI1024 are larger, so that these state-of-the-art detectors can naturally exploit their detection capacity. As for our YOLO-fine model, while it provided a significant improvement on the previous VEDAI512 dataset (i.e., a gain of 6.09% and 5.27% in mAP on the color and infrared versions, respectively), it now provides slightly better performance than other reference models on VEDAI1024.
For the color version (left part of Table 5), it achieved mAP = 76.0%, which is 0.96% better than the second-best YOLOv3-spp and 1.99% better than the third-best EfficientDet(D1). For the infrared version (right part of the table), it achieved mAP = 75.17%, again slightly better than YOLOv3-spp (gain of 1.47%). In terms of F1-score, EfficientDet(D1) slightly outperformed YOLO-fine in both cases (i.e., 0.79 compared to 0.78 on the color version, and 0.78 compared to 0.76 on the infrared version). However, these differences in F1-score are not significant and, given that YOLO-fine yielded better mAP in both cases (by 1.99% for the color version and 3.94% for the infrared version), we consider that YOLO-fine provides detection performance at least equivalent to that of EfficientDet in detecting medium and larger objects from VEDAI1024.
Finally, Figures 7 and 8 illustrate some detection results for the sake of qualitative assessment, to enrich our experiments on VEDAI data. We can observe in Figure 7 that the YOLO-fine model often provides the best performance in terms of true positives for the five sample images (and for the entire validation set as previously observed in Table 5). For instance, a perfect detection was achieved by YOLO-fine for the second image containing two "vans", three "trucks", and one "car". YOLOv3 missed two "trucks" and yielded two false detections on this image, while YOLOv2 confused a "truck" with a "van" and also gave a false positive of class "car". Within the whole validation set, we observe that the two classes "car" and "pickup" were often confused (for example, in the fifth image, where YOLO-fine detected two "cars" as "pickups"). We also note that the false alarms produced by the network sometimes correspond to objects that visually look very similar to vehicles, as shown in the first image with an object at the bottom that looks like an old vehicle, or in the third image with an object visually similar to a boat.
Qualitative remarks about the infrared version from Figure 8 are similar to those made for the color version. We observe that the YOLOv3 and YOLO-fine models produce fewer false positives but more false negatives in infrared images than in color images. We can see that YOLO-fine missed three "tractors" in the first image, one "camping car" in the third image, and two objects of class "other" in the fourth image. Such false negatives must be avoided in some applications. The false negative/false positive (FN/FP) trade-off is similar to the recall/precision trade-off observed with the recall/precision curves. Finally, we note that all detectors, especially our YOLO-fine model, are able to deal with infrared images. Indeed, the loss of accuracy remains insignificant if the objects are well characterized by their shape rather than their color or spectral signatures.

Table 6 provides the detection results of YOLO-fine compared to reference detectors for the MUNICH dataset. A very high detection rate was achieved by our YOLO-fine (mAP = 99.69%), with an increase of 1.82% over YOLOv3 (second-best), 2.12% over SSD (third-best), and 19.93% over YOLOv2 (lowest mAP). In terms of F1-score, both YOLO-fine and SSD provided the highest value of 0.98. Compared to VEDAI512 (with similar spatial resolution and object size), MUNICH appears less complicated, because: (1) it contains only two very different classes, namely "car" and "truck" (as compared to the eight classes of vehicles that look similar in VEDAI images); (2) the number of objects in MUNICH is higher (9460 compared to 3757 in VEDAI); and (3) the level of background variation is low (the images were all acquired from a semi-urban area of the city of Munich). Therefore, achieving good performance on MUNICH is a requirement to demonstrate the relevance of our proposed model. Besides, the detection rate is higher for the "car" class (99.93%) than for the "truck" class (99.45%), since there are many more cars than trucks in the training set (and also in the validation one). However, the difference is very small. This observation is confirmed by Figure 9, where some detection results are provided. We can observe that: (i) YOLO-fine detects cars and trucks with very high confidence indices (close to 1); (ii) cars are well detected even if they are on the image borders (see yellow rectangles in the third image in the third row); (iii) the image ground truth is sometimes wrongly annotated by ignoring the vehicles under the shadow, but the YOLO detectors manage to detect them (see the yellow rectangles in the second image from the second row); and (iv) vehicles in a parking lot are also well detected and discriminated even though they are very close to each other (see the fifth image in the fifth row).
In summary, the studied MUNICH dataset represents an easy small object detection task (two distinct classes against a simple background), and the detection performance achieved by the proposed YOLO-fine was confirmed.
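As a quick consistency check on the figures quoted for MUNICH, the mAP can be recomputed as the unweighted mean of the per-class average precisions. The sketch below assumes this standard definition of mAP; the helper name `mean_average_precision` is illustrative and is not taken from the evaluation code of the paper.

```python
from statistics import mean
from typing import Dict


def mean_average_precision(class_ap: Dict[str, float]) -> float:
    """Average the per-class AP values (in %) into a single mAP (in %)."""
    if not class_ap:
        raise ValueError("at least one class is required")
    return mean(list(class_ap.values()))


# Per-class AP values quoted in the text for YOLO-fine on MUNICH.
munich_ap = {"car": 99.93, "truck": 99.45}
print(f"mAP = {mean_average_precision(munich_ap):.2f}%")
```

With the two class values above, the result agrees with the reported mAP of 99.69%.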

XVIEW
The last dataset used in our experiments is XVIEW, and the detection results measured on this dataset are given in Table 7. Some illustrative results are also given in Figure 10 in order to show the good quality of the detections yielded by YOLO-fine, which are close to the ground truth boxes. As observed with the two previous datasets, YOLO-fine again provided the best detection mAP = 84.34%, with a gain of 1.83% over EfficientDet, 5.41% over YOLOv3, and even 16.25% over SSD and 27.33% over Faster R-CNN. We note that objects in XVIEW are quite small when compared to those in MUNICH or VEDAI (spatial resolution of 30 cm compared to 25 cm in MUNICH and VEDAI512). Thus, the mAP = 84.34% achieved by YOLO-fine could be considered as successful for real-time prediction using a one-stage approach. One may wonder why this result is higher than that of VEDAI512. The answer is that for XVIEW, we only consider one class of vehicle (which was, in fact, obtained by merging 19 vehicle classes from the original dataset, as described in Section 4.1); hence, the detection task becomes simpler than in the VEDAI dataset. The proposed YOLO-fine is thus able to perform small object detection from both aerial and satellite remote sensing data.

Figure 9. Illustration of detection results on MUNICH (panels: ground truth, YOLOv2, YOLOv3, and YOLO-fine).
Finally, using the XVIEW data, we conducted a comparative study in terms of computational performance. We note that only detector models from the YOLO family were investigated, since the other detectors were implemented using different frameworks, hence hindering a fair comparison of computational time. In Table 8, we report the comparison of YOLO-fine against YOLOv2, YOLOv3, YOLOv3-tiny, and YOLOv3-spp in terms of network weights (i.e., required disk space for storage), BFLOPS (billion floating point operations), and prediction time in milliseconds per image as well as in frames per second (FPS), tested on three different NVIDIA GPUs: Titan X, RTX 2080 Ti, and Tesla V100. According to the table, YOLO-fine has the smallest weight size, with only 18.5 MB compared to 246.3 MB for the original YOLOv3. The heaviest weights are those of YOLOv3-spp with 250.5 MB, while YOLOv2 also requires 202.4 MB of disk space. The BFLOPS count (the lower the better) is also reduced in YOLO-fine, with 63.16 compared to 98.92 for YOLOv3. In terms of prediction time (average prediction time computed over 1932 XVIEW images from the validation set), YOLO-fine is the slowest (due to the accuracy/computation time trade-off) but, as already mentioned, it can still achieve real-time performance with 34 FPS, 55 FPS, and 84 FPS on the Titan X, RTX 2080 Ti, and Tesla V100 GPUs, respectively. One may wonder why YOLO-fine, with several removed layers and significantly lighter weights than YOLOv3, is still slower (34 FPS compared to 37 FPS on Titan X). This is due to the fact that we refine the detection grid to look for small and very small objects. However, this slowdown is minor relative to the large gain in accuracy over YOLOv3 shown in the previous results on our three datasets.
We can also note that YOLOv2 remains the fastest detector among the YOLO family (as observed in the literature) with 57 FPS on Titan X, 84 FPS on RTX 2080 Ti, and 97 FPS on Tesla V100. YOLOv3-tiny, for its part, is a very light version of YOLOv3 with a small weight size (34.7 MB compared to 246.4 MB), only 8.25 BFLOPS compared to 98.92 BFLOPS, and a very fast prediction time (215 FPS on Tesla V100). However, both YOLOv2 and YOLOv3-tiny provided significantly lower detection accuracy than YOLO-fine. Accordingly, the table confirms that YOLO-fine offers the best compromise between accuracy, size, and time among the five compared models.
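The two prediction-time units reported above (milliseconds per image and FPS) are reciprocal, so either can be recovered from the other. The short sketch below converts the FPS values quoted in the text into per-image latency and computes the weight-size ratio between YOLOv3 and YOLO-fine. It assumes images are processed one at a time; the helper name `fps_to_ms` is illustrative.

```python
def fps_to_ms(fps: float) -> float:
    """Convert a throughput in frames per second into latency in ms per image.

    Assumes sequential, one-image-at-a-time prediction (an assumption).
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


# FPS values quoted in the text for YOLO-fine on the three GPUs.
yolo_fine_fps = {"Titan X": 34, "RTX 2080 Ti": 55, "Tesla V100": 84}
for gpu, fps in yolo_fine_fps.items():
    print(f"{gpu}: {fps} FPS ~ {fps_to_ms(fps):.1f} ms/image")

# Weight sizes (MB) quoted in the text for YOLOv3 and YOLO-fine.
yolov3_mb, yolo_fine_mb = 246.3, 18.5
ratio = yolov3_mb / yolo_fine_mb
print(f"YOLOv3 weights are about {ratio:.1f}x larger than those of YOLO-fine")
```

For example, 34 FPS on Titan X corresponds to roughly 29 ms per image, against roughly 27 ms for YOLOv3 at 37 FPS.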

Setup
Detecting known objects under various backgrounds (which are related to the environments and conditions of image acquisition) is an important challenge in many remote sensing applications. To do so, a deep learning model should be able to generalize the characteristics of the sought objects, without prior knowledge of the image background. Ideally, the same objects should be detected effectively, regardless of their environment (or image background). We aim here at evaluating the ability of the YOLO-fine model to handle unknown backgrounds.
For these experiments, we relied on the 25-cm VEDAI512 dataset by dividing it into three subsets, each corresponding to a different type of background. Specifically, we split the data into a training set composed of images acquired in rural, forest, and desert environments, and two validation sets with new backgrounds representing residential areas on the one hand, and very dense urban areas on the other hand. Figure 11 illustrates some sample images from the training set and the two validation sets (namely "urban" and "dense urban"). Table 9 provides some information related to the number of images and objects of each class included in the three sets. Even if VEDAI offers a limited number of images and objects, we make sure that each class comes with enough samples in the training set, but also in the two validation sets, in order to conduct a fair comparative study.
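The background-based split described above can be sketched as follows. This is a hypothetical illustration: it assumes that each image carries a background tag, which is not necessarily how the annotation is stored for VEDAI, and the tag names, subset names, and function `split_by_background` are all invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical mapping from a per-image background tag to a subset name.
SUBSET_OF_BACKGROUND = {
    "rural": "train",
    "forest": "train",
    "desert": "train",
    "urban": "val_urban",
    "dense_urban": "val_dense_urban",
}


def split_by_background(images: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (image_id, background) pairs into one training and two validation subsets."""
    subsets: Dict[str, List[str]] = defaultdict(list)
    for image_id, background in images:
        if background not in SUBSET_OF_BACKGROUND:
            raise KeyError(f"unknown background tag: {background}")
        subsets[SUBSET_OF_BACKGROUND[background]].append(image_id)
    return dict(subsets)


# Toy usage with made-up image identifiers.
toy_images = [
    ("img_001", "rural"),
    ("img_002", "urban"),
    ("img_003", "dense_urban"),
    ("img_004", "desert"),
]
for name, ids in split_by_background(toy_images).items():
    print(name, ids)
```

Keeping the validation backgrounds entirely out of the training subset is what makes the evaluation a test of generalization to unseen backgrounds rather than of in-distribution accuracy.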

Results
We report in Table 10 the detection results that were obtained by YOLOv2, YOLOv3, and our YOLO-fine model on the two validation sets. We first observe that all detectors led to better results for the first set than for the second set, e.g., with a respective mAP of 62.79% and 48.71% for YOLO-fine. This can be explained by the level of similarity between the background of the validation set and the background of the training set. The rural environment (training set) is more similar to the semi-urban environment (validation set 1) than to the dense urban environment (validation set 2). When comparing the results with those reported in Table 3, where YOLO-fine achieved mAP = 68.18% when trained/evaluated on all backgrounds, we can see that introducing new, unseen backgrounds in the validation set led to a loss in performance (i.e., 5.39% for the urban background and 19.47% for the dense urban background). However, when compared to its counterparts YOLOv2 and YOLOv3, YOLO-fine still achieves the best performance, with a gain over YOLOv3 of 7.11% on the first background and 5.34% on the second background. YOLO-fine also achieved the best results for 6/8 classes on each background. A satisfying accuracy is maintained for the four classes "car", "truck", "pickup", and "tractor", notably on the first background. On the other hand, a significant drop in detection accuracy is observed for the class "other", on which YOLO-fine reached only 6.61% accuracy with the second background. This class is indeed quite challenging for every detector, since it gathers various objects (bulldozer, agrimotor, plane, helicopter) of various shapes and sizes.
Table 10. Performance of YOLO-fine as compared to YOLOv2 and YOLOv3 on detection of known objects with the appearance of new backgrounds in validation sets: the first background contains residential/semi-urban areas and the second background very dense urban areas. Best results in bold.
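The performance losses quoted in the results above are plain differences between mAP values, i.e., percentage points. The sketch below reproduces them from the figures given in the text and also reports the drop relative to the all-background reference, a complementary view that is not reported in the paper. The function name `map_drop` is illustrative.

```python
from typing import Tuple


def map_drop(reference_map: float, new_map: float) -> Tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in %)."""
    if reference_map <= 0:
        raise ValueError("reference_map must be positive")
    absolute = reference_map - new_map
    relative = 100.0 * absolute / reference_map
    return absolute, relative


# mAP values (in %) quoted in the text for YOLO-fine on VEDAI512.
reference = 68.18  # trained/evaluated on all backgrounds (Table 3)
unseen = {"urban": 62.79, "dense urban": 48.71}  # new backgrounds (Table 10)

for background, value in unseen.items():
    absolute, relative = map_drop(reference, value)
    print(f"{background}: -{absolute:.2f} points ({relative:.1f}% relative)")
```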

Conclusions and Future Works
In this paper, we have presented an enhanced one-stage detection model, named YOLO-fine, to deal with small and very small objects from remote sensing images. Our detector was designed based on the state-of-the-art YOLOv3 with the main purpose of increasing the detection accuracy for small objects while being light and fast to enable real-time prediction within further operational contexts. Our experiments on three public benchmark datasets, namely VEDAI, MUNICH, and XVIEW, demonstrate the overall superiority of our YOLO-fine model over the well-established YOLOv3 solution. Indeed, YOLO-fine provides the best compromise between detection accuracy (highest mAP), network size (smallest weight size), and prediction time (able to perform real-time prediction).
While YOLO-fine brings promising results, some issues still remain and call for further research. First, as already mentioned, YOLO-fine focuses on small and very small objects, while its performance on medium and large objects remains comparable to that of YOLOv3, but with a slower prediction speed. In other words, YOLO-fine might not be the best choice if the sought objects have a very wide range of sizes. This bottleneck could be solved by making a transparent pass between YOLO-fine and YOLOv3 in the same model, allowing for an efficient detection of objects from very small sizes to very large sizes. Second, our investigation related to the behavior of YOLO-fine when dealing with new background appearances has provided some good preliminary results, but it remains limited to the proposed split made on the VEDAI dataset. It would be interesting to explore this issue further and consider other datasets where the difference between backgrounds in training and test scenes could be more significant. Finally, because EO sensors usually provide (close to) nadir views of the observed scenes, rotation invariance is a useful property for object detection in remote sensing. Thus, we aim to integrate the principle of oriented bounding boxes into YOLO-fine.

Conflicts of Interest:
The authors declare no conflict of interest.