Correcting Imprecise Object Locations for Training Object Detectors in Remote Sensing Applications

Abstract: Object detection on aerial and satellite imagery is an important tool for image analysis in remote sensing and has many areas of application. As modern object detectors require accurate annotations for training, manual and labor-intensive labeling is necessary. In situations where GPS coordinates for the objects of interest are already available, there is potential to avoid the cumbersome annotation process. Unfortunately, GPS coordinates are often not well-aligned with georectified imagery. These spatial errors can be seen as noise regarding the object locations, which may critically harm the training of object detectors and, ultimately, limit their practical applicability. To overcome this issue, we propose a co-correction technique that allows us to robustly train a neural network with noisy object locations and to transform them toward the true locations. When applied as a preprocessing step on noisy annotations, our method greatly improves the performance of existing object detectors. Our method is applicable in scenarios where the images are only annotated with points roughly indicating object locations, instead of entire bounding boxes providing precise information on the object locations and extents. We test our method on three datasets and achieve a substantial improvement (e.g., 29.6% mAP on the COWC dataset) over existing methods for noise-robust object detection.


Introduction
Applications of machine learning and artificial intelligence have gained much attention in the remote sensing community in recent years. In particular, object detection methods are often employed to recognize and localize objects of interest in aerial and satellite imagery, e.g., [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15]. This kind of image analysis and interpretation is key for several different areas of application, such as urban planning, precision agriculture, geological hazard detection, or geographic information system (GIS) updating [16,17]. However, a critical part for learning-based systems to work well in practice is data annotation. In the case of deep-learning-based object detectors, annotations consist of bounding boxes in pixel coordinates within the training images and class labels describing the object classes. Generally, these bounding box annotations have to be available in high quality and large numbers. As annotation is usually carried out manually and potentially requires expert knowledge [2][3][4][6][8][9], the availability of annotations poses an obstacle in many scenarios.
At the same time, object locations in the form of GPS coordinates are often available. In addition, geolocation data can be aligned with any georectified imagery of the same geospatial region; for instance, imagery from different points in time or imagery resulting from varying camera equipment. Hence, using GPS labels as a source of supervision for object detection tasks is desirable, as it allows us to save time and resources in the annotation process [11]. However, training an object detector based on GPS locations is challenging in practice due to various reasons.
On the one hand, the prediction quality of a machine learning model heavily depends on the quality of annotations used for training. In the computer vision community, it is usually assumed that training annotations for object detectors consist of correct class labels, as well as precise bounding boxes. Consequently, models are optimized for these conditions, and data annotations are created specifically in a format that meets the needs and expectations. However, deviating settings mostly lead to suboptimal results [18][19][20]. In particular, it has been shown that neural networks have the capacity to overfit to noisy supervision [21] and that noise concerning both the class labels and bounding boxes degrades the performance of object detectors [18].
On the other hand, the process of aligning data from different sources based on their geolocation is not exact, leading to discrepancies between the imagery and the annotations. An obvious reason for these discrepancies is that GPS occasionally suffers from imprecisions [22], which poses one source of noise in the localization of objects. Furthermore, the process of georectifying aerial imagery is never completely exact due to inevitable factors, such as changing camera altitudes and tilts or uneven terrain. These uncertain conditions lead to spatial errors that can be problematic in practical applications [23][24][25]. In addition, the authors of [5] report noticeable offsets (https://sites.research.google/open-buildings/?lat=6.49073243944365&lng=3.3867427492778024&zoom=19#explore, accessed on 5 October 2021) due to the orthorectification.
Existing non-learning-based methods for the correction of such localization errors, such as [26][27][28], are severely limited in practice, as such methods often rely on multi-view data, strong assumptions, or manual steps. Apart from that, when object locations are collected as GPS coordinates, they are usually not provided as bounding boxes, as required by common object detectors. Instead, resulting object locations are single points indicating only the object centers and not their spatial extents.
All of these factors can result in critical obstacles for the training of object detectors on such data. However, the amount of available GPS data makes it a rich source of data annotations. Making these annotations appropriate and usable for object detection models could greatly benefit remote sensing. Thus, we aim to bridge the gap between GPS annotations and machine learning approaches for object detection. We present a framework that allows for a neural network to learn robustly against the noisy locations in geo-annotations. Employing this network to correct the annotations, we manage to substantially improve the performance of object detectors in a situation where, altogether, only rough point labels are available. In summary, our contribution is threefold:

• We propose a training framework (called co-correction) that builds upon a novel label correction scheme and allows for the learning of accurate class activation maps from noisy point supervision;
• We propose a label correction scheme that takes noisy object locations (as single points, not boxes), as well as a learned class activation map, as an input and corrects them toward their true location;
• We demonstrate the high quality of our learned class activation maps by successfully mining bounding box sizes from them in a simplistic manner.

Object Detection in Remote Sensing
In remote sensing, there are numerous applications of learning-based object detectors on aerial and satellite imagery [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15]. Thus, object detection is an important tool for image interpretation in a broad variety of subfields, such as vegetation monitoring and urban planning. There are also dedicated and publicly available datasets, such as NWPU VHR-10 [29], COWC [49], and DIOR [16], that are used to advance and benchmark work in this field. These datasets mostly adopt the annotation format with precise object locations and bounding boxes, which is common in computer vision. Furthermore, many applied studies perform manual data labeling in order to obtain annotations in this format [2][3][4][6][8][9]. In contrast, we focus on a data format that is natural in many situations but mostly neglected: single points given as GPS coordinates. Two of the very few works that move in that direction and utilize geo-annotations for object detection are [1,11]. However, they do not attempt to solve the arising issue of imprecise localizations, which we address in this study.

Learning with Noisy Labels in General
Corrupted and noisy labels have been shown to harm the training and generalization of neural networks [21]. Therefore, general methods for making the training of neural networks more robust against label noise have been proposed [30][31][32][33][34]. Their basic idea is to identify noisy labels and to separate them from clean labels.
These techniques for noise-robust learning may be effective for tasks such as classification, but they are not perfectly suited for training object detectors with noisy annotations. The reason is that localization noise and classification noise can occur independently. Furthermore, a sample, i.e., an image, cannot be easily categorized as either noisy or clean, as it can contain multiple annotations for different objects, of which only some may be noisy. Moreover, localization noise is continuous in nature, so it is hard to draw a line between noisy and clean annotations.

Object Detection with Noisy Labels
As a consequence, noise-tolerant training strategies for object detectors have been developed. We compare our framework to the following ones: The work of [35] represents a natural specialization of the co-teaching framework [31] for object detection. In [18], a label correction scheme is used instead of sample selection. Both of these methods are designed to deal with the bounding box and label noise. In this regard, they are more general than our work, which only assumes the object locations to be noisy. On the other hand, they also require at least noisy information on the object sizes, which is not available in our setting. The approach of [36] assumes that annotation noise only occurs in the bounding boxes. Hence, it is closest to ours with respect to the setting. Nonetheless, it also requires bounding boxes, and not just point labels.
From the technical perspective, our learning framework is similar to [20], a method for sparsely annotated object detection. As we can consider the sparse annotations as annotations that were corrupted by class label noise, this work also belongs to the field of noise-tolerant object detection. However, a direct comparison with our method is not valid because the approaches focus on complementary aspects of annotation noise. Furthermore, the approach of [37] has been shown to work well under class label noise.
Aside from the mentioned methods, there are works that employ noise-robust object detection techniques in an auxiliary task for reaching other goals. In particular, [38,39] tackle the problems of weakly supervised and semi-supervised object detection, respectively. We do not compare our method to these works, as they were clearly outperformed by [18] on the task of noise-resistant object detection, especially in the case of heavy bounding box noise.

Contrastive Learning
Contrastive learning approaches, such as [40][41][42][43], have recently managed to narrow the gap between supervised and unsupervised learning. Our framework, with its two architecture branches and the usage of two augmented image versions, generally resembles a contrastive learning approach. However, like [20], we use this machinery for noise-robust learning instead of unsupervised learning. The concrete purpose here is the reduction in self-confirmation bias. To the same end, [31][32][33][35] use two distinct partner networks instead of two input versions, which constitutes a slight conceptual difference.

Object Detection with Point Labels
Few works have been published on object detection with weak supervision in the form of point labels. The methods of [44,45] are conceived to predict object centers instead of bounding boxes. The approach of [46] is able to estimate object extents as well, but was specifically designed for crowded scenes. Nevertheless, none of these methods incorporate mechanisms to account for noisy supervision. In contrast, the point-label-based object detection framework proposed in [47] is robust with respect to the exact placement of the point labels, i.e., the points can be placed anywhere within the object boundaries. However, this method does not learn the notion of accurate bounding boxes ex nihilo, but relies on a sample subset with complete and clean annotations. In our setting, such a set of clean annotations is not available, making this method inapplicable.

Problem Setting
Our problem setting is as follows: we assume that we are given a dataset where each sample consists of an image $I \in \mathbb{R}^{3 \times H \times W}$ and the corresponding noisy object locations $(c_i^{(l)})_{i=1\dots k_l}$ for every class $l$, where $k_l$ is the number of annotated objects of class $l$. Using this information, our main goal is to recover the true object locations $(\bar{c}_i^{(l)})_{i=1\dots \bar{k}_l}$, with $\bar{k}_l$ being the true number of objects of class $l$. We denote our estimates for the true locations by $(\hat{c}_i^{(l)})_{i=1\dots k_l}$. Thereby, we assume $k_l = \bar{k}_l$, i.e., we assume that there are no or only an insignificant number of object annotations missing for every class.
Depending on the application, it might also be desirable to estimate the sizes and extents of the present objects. To this end, we mine bounding boxes in the format of $(\hat{c}_{i,1}^{(l)}, \hat{c}_{i,2}^{(l)}, \hat{h}_i^{(l)}, \hat{w}_i^{(l)})_{i=1\dots k_l}$ for every class. That is, the first two values of the quadruples are the corrected object center coordinates and the latter two correspond to the object heights and widths, respectively.

Method
In the following, we first explain our framework for learning class activation maps from noisy point labels. Then, we introduce our point label correction algorithm, which is used to generate corrected object locations. Finally, we describe a method to estimate bounding box sizes based on the corrected box locations and the learned class activation maps.

Learning Class Activation Maps from Noisy Point Supervision
We want to train a neural network $\Phi$ that, when given an image $I$, produces a class activation map $\hat{Y}^{(l)} = \Phi(I)^{(l)} \in [0, 1]^{H' \times W'}$ for every class $l$. Depending on the architecture of $\Phi$, we might already have $H' = H$ and $W' = W$, or we can easily achieve this condition by upsampling the output to the original image size. Alternatively, one can also downscale the annotations to match the class activation maps.
As depicted in the overview in Figure 1, we first created two transformed versions $I_1$ and $I_2$ of $I$ by image augmentation with color jitter, image noise, flipping, and transposing. Without loss of generality and to simplify the notation, we assumed that $I_1$ is obtained without geometric transformations and that the coordinate space of the original annotations $(c_i^{(l)})_{i=1\dots k_l}$ is aligned with $I_1$. For $I_2$, we denoted the geometric transformation with $g(\cdot)$ and its inverse with $g^{-1}(\cdot)$, i.e., $(g(c_i^{(l)}))_{i=1\dots k_l}$ are the annotations aligned with $I_2$. Thereafter, we fed both $I_1$ and $I_2$ into $\Phi$ and obtained our class activation maps $\hat{Y}_1^{(l)} = \Phi(I_1)^{(l)}$ and $\hat{Y}_2^{(l)} = \Phi(I_2)^{(l)}$.
Figure 1. Our proposed co-correction framework. Two image versions are created and fed through a Siamese network. The outputs are used to generate corrected annotations, which are, in turn, employed as supervision for the other branch. In doing so, we need to apply the geometric transformation $g(\cdot)$ to the upper branch and undo it on the lower branch with $g^{-1}(\cdot)$. Aerial image from [48].
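To make the role of $g(\cdot)$ and $g^{-1}(\cdot)$ concrete, the following minimal sketch maps point annotations through a horizontal flip followed by a transposition and back again. The function name `make_flip_transpose` and the (row, column) point convention are assumptions chosen for illustration; they are not taken from the original implementation.

```python
def make_flip_transpose(width, flip=True, transpose=True):
    """Return (g, g_inv) acting on (row, col) points of an image with `width` columns.

    g mirrors the image horizontally (if flip) and then transposes it
    (if transpose); g_inv undoes both steps in reverse order.
    """
    def g(point):
        r, c = point
        if flip:
            c = (width - 1) - c      # horizontal mirroring
        if transpose:
            r, c = c, r              # swap the two image axes
        return (r, c)

    def g_inv(point):
        r, c = point
        if transpose:
            r, c = c, r              # undo the transposition first
        if flip:
            c = (width - 1) - c      # then undo the mirroring
        return (r, c)

    return g, g_inv


g, g_inv = make_flip_transpose(width=512)
noisy_points = [(10, 20), (300, 400)]
moved = [g(p) for p in noisy_points]     # annotations aligned with I_2
restored = [g_inv(p) for p in moved]     # back in the coordinates of I_1
```

Because both operations are their own inverses, `g_inv` only has to apply them in the opposite order; other augmentations (e.g., rotations) would need an explicit inverse.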
Similar to [20][31][32][33][35], we employed the corrected object centers resulting from one branch, i.e., one image version, to supervise the network output of the other branch. Therefore, we dubbed our method co-correction. This technique has the advantage of reducing self-confirmation bias, which can occur if the supervision used to update a network is generated by the same network with the same data.
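We utilized a pixelwise target $Y_1^{(l)}$ for the output $\hat{Y}_1^{(l)}$ that places a Gaussian blob on every corrected object center obtained from the other branch and mapped back with $g^{-1}$: $Y_1^{(l)}(x) = \sum_{i=1}^{k_l} \exp\left(-\frac{\lVert x - g^{-1}(\hat{c}_{2,i}^{(l)}) \rVert^2}{2\sigma^2}\right)$, where $\hat{c}_{2,i}^{(l)}$ denotes the $i$-th corrected center computed from $\hat{Y}_2^{(l)}$. Here, the standard deviation $\sigma$ is a predefined hyperparameter and the normalization factor of the bivariate normal density function is omitted to obtain a reasonable scale for the target values. Analogously, the target $Y_2^{(l)}$ for $\hat{Y}_2^{(l)}$ is built from the centers $g(\hat{c}_{1,i}^{(l)})$ that were corrected on the first branch. As the loss function, we chose the binary cross-entropy, which is also suitable for our soft targets. That is, every pixel $x$ of branch $j \in \{1, 2\}$ contributes the term $-Y_j^{(l)}(x) \log \hat{Y}_j^{(l)}(x) - (1 - Y_j^{(l)}(x)) \log(1 - \hat{Y}_j^{(l)}(x))$. An illustration of the described process that offers more detail than the overview in Figure 1 can be found in Figure 2.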
The reasons for using soft targets instead of deterministic target distributions for the pixels are twofold. First, the softness reflects the uncertainty in the target distribution due to the label noise. Second, when given point labels, it is the natural choice to use Gaussian blobs to localize the objects in the target. Using deterministic targets of a certain shape always poses a stronger assumption on the underlying features. If such assumptions can be justified (for instance, because not only point labels but also the spatial extents of the objects are given), it can of course be sensible to adjust the targets accordingly. Let us also note that, this way, soft targets $Y_1^{(l)}(x) > 1$ are theoretically possible. However, in practice, this hardly occurs, and is not detrimental for training.
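The construction of the soft targets can be sketched as follows. This is a minimal pure-Python illustration, assuming that the target is a sum of unnormalized Gaussian blobs and that the per-pixel binary cross-entropy terms are averaged; the function names, the averaging, and the clipping constant `eps` are assumptions and not part of the original description.

```python
import math


def gaussian_target(centers, height, width, sigma):
    """Sum of unnormalized Gaussian blobs, one per (row, col) center."""
    target = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            target[r][c] = sum(
                math.exp(-((r - cr) ** 2 + (c - cc) ** 2) / (2.0 * sigma ** 2))
                for cr, cc in centers
            )
    return target


def soft_bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predictions in (0, 1) and soft targets."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, y in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
            count += 1
    return total / count


# Two nearby centers: the blobs overlap, so the target exceeds 1 at a center.
target = gaussian_target([(4, 4), (4, 6)], height=9, width=11, sigma=1.5)
peak = target[4][4]

# A constant prediction of 0.5 yields a loss of log(2) for any target.
flat_prediction = [[0.5] * 11 for _ in range(9)]
loss_flat = soft_bce(flat_prediction, target)
```

The value `peak` illustrates the remark above: with overlapping blobs, target values slightly above 1 can occur when two centers are close to each other.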

Label Correction
Once we obtained an activation map, we employed Algorithm 1 to refine the original noisy object locations. This algorithm is a modified version of the weighted k-means clustering algorithm, which we applied to pixel locations. More precisely, we initialized the $k_l$ centroids as the given noisy object locations $(c_i^{(l)})_{i=1\dots k_l}$. Then, we iteratively updated the centroids as the weighted mean of pixels that are (i) closer to the respective centroid than to all other centroids and (ii) within a range of $d$ from the original centroid. The weights correspond to the activations from the activation map learned in the previous step.
Apart from restriction (ii), this is exactly the weighted k-means algorithm applied on pixel locations. The underlying idea is that we wanted to place each point label into a local center of (relatively) high activations. Simultaneously, we wanted to avoid placing all point labels in the same location. In addition, we included restriction (ii), which limits the displacements and, therefore, introduces the principle of locality. This reflects the assumption that the noisy labels are still reasonably close to the true locations, and it prevents erroneous and distant activations from distorting the other centroids. In other words, when trying to find the optimal label position for an object in one image corner, we did not have to take distant pixels in opposite corners into account. This robustness, and the property that a slight activation near a noisy label suffices in attracting and correcting the label, are the main advantages of the proposed algorithm.
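The described procedure can be sketched as follows. This is a minimal pure-Python interpretation of the text above and not a reproduction of Algorithm 1: the function name `correct_labels`, the fixed number of iterations `n_iter`, and the rule of keeping a centroid in place when no activated pixel is assigned to it are assumptions.

```python
def correct_labels(activation, noisy_points, d, n_iter=10):
    """Restricted weighted k-means on pixel locations (sketch).

    activation   : 2D list of non-negative weights, indexed [row][col]
    noisy_points : list of (row, col) noisy locations (initial centroids)
    d            : maximum distance of a contributing pixel from the
                   ORIGINAL noisy location of its centroid
    """
    if not noisy_points:
        return []
    centroids = [(float(r), float(c)) for r, c in noisy_points]
    for _ in range(n_iter):
        # per centroid: weighted row sum, weighted column sum, weight sum
        sums = [[0.0, 0.0, 0.0] for _ in centroids]
        for r, row in enumerate(activation):
            for c, w in enumerate(row):
                if w <= 0.0:
                    continue
                # (i) assign the pixel to its nearest current centroid
                k = min(
                    range(len(centroids)),
                    key=lambda j: (r - centroids[j][0]) ** 2
                    + (c - centroids[j][1]) ** 2,
                )
                # (ii) locality: skip pixels too far from the original label
                r0, c0 = noisy_points[k]
                if (r - r0) ** 2 + (c - c0) ** 2 > d ** 2:
                    continue
                sums[k][0] += w * r
                sums[k][1] += w * c
                sums[k][2] += w
        # assumption: a centroid without any contributing pixel stays in place
        centroids = [
            (s[0] / s[2], s[1] / s[2]) if s[2] > 0 else cen
            for s, cen in zip(sums, centroids)
        ]
    return centroids


activation = [[0.0] * 20 for _ in range(20)]
activation[5][5] = 1.0     # activation of the first object
activation[5][14] = 1.0    # activation of the second object
activation[18][18] = 1.0   # distant, spurious activation
corrected = correct_labels(activation, [(7, 4), (3, 12)], d=6)
```

In this toy example, both labels are pulled onto their nearby activations, while the spurious activation in the corner is ignored because it lies farther than `d` from the original label it would be assigned to.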

Box Size Estimation
On some occasions, it might be necessary to not only estimate the locations, but also the spatial extents of objects. To address this additional problem, we proposed a simple method (see Algorithm 2) that allowed us to estimate the object sizes from the learned class activation maps. The inputs for this method are the predicted class activation maps and the corrected object centers. Furthermore, a standard box size $s^{(l)}$ as an initial guess for the object size is required. To choose this hyperparameter, one can use an estimated average object size for a class or employ prior knowledge.
In the algorithm, we determined the set of pixel locations that are assigned to the respective centroid for every object present in the image. From these locations, we disregarded the ones with an activation below a certain threshold $\tau$. The first estimate for the box shape is then the minimal bounding box around the locations in this cluster of activated pixels. As this estimate is not always reliable, we further refined it in two ways. First, depending on the hyperparameter setup and the class activation maps, this method tends to overestimate the real sizes of objects. Therefore, we introduced an additional hyperparameter $\alpha$, with which we scale down the estimated box size. Second, to counter the effect of false or missing activations, we take a convex combination of the estimated box with the initial, standard-sized box. The proportion parameter $\lambda$ for the convex combination is chosen according to the average confidence in the activations for this object. The exact choice of the function to determine $\lambda$ has no special justification and may also be seen as subject to tuning.
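A minimal sketch of these steps is given below. It is an interpretation of the description and not a reproduction of Algorithm 2. In particular, setting $\lambda$ to the mean activation of the retained pixels is an assumption made purely for illustration, as are the function name, the default values of `tau` and `alpha`, and the fallback to the standard size for objects without activated pixels.

```python
def estimate_box_sizes(activation, centers, std_size, tau=0.01, alpha=0.8):
    """Return one (height, width) estimate per corrected (row, col) center."""
    if not centers:
        return []
    clusters = [[] for _ in centers]
    for r, row in enumerate(activation):
        for c, a in enumerate(row):
            if a < tau:
                continue  # discard weakly activated pixels
            # assign the pixel to its nearest corrected center
            k = min(
                range(len(centers)),
                key=lambda j: (r - centers[j][0]) ** 2
                + (c - centers[j][1]) ** 2,
            )
            clusters[k].append((r, c, a))

    sizes = []
    for pixels in clusters:
        if not pixels:
            # assumption: fall back to the standard size without activations
            sizes.append((float(std_size), float(std_size)))
            continue
        rows = [p[0] for p in pixels]
        cols = [p[1] for p in pixels]
        # minimal bounding box of the cluster, scaled down by alpha
        h = alpha * (max(rows) - min(rows) + 1)
        w = alpha * (max(cols) - min(cols) + 1)
        # ASSUMED choice of lambda: mean activation of the retained pixels
        lam = sum(p[2] for p in pixels) / len(pixels)
        sizes.append((lam * h + (1.0 - lam) * std_size,
                      lam * w + (1.0 - lam) * std_size))
    return sizes


activation = [[0.0] * 12 for _ in range(12)]
for r in range(2, 6):
    for c in range(3, 9):
        activation[r][c] = 0.5   # a 4 x 6 region with moderate confidence
sizes = estimate_box_sizes(activation, [(3.5, 5.5)], std_size=10, alpha=0.5)
```

With a mean activation of 0.5, the estimate lies halfway between the scaled cluster extent (2 by 3 pixels) and the standard size of 10 pixels, which illustrates how low confidence pulls the box toward the prior.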

Overall Training Process
When using this method, we trained the network Φ as described in Figure 1. Before we started with the co-correction, we trained on the uncorrected, raw labels for a few epochs in a warm-up phase. The usage of corrected targets during training supports the learning process (see Section 5.5). After determining the corrected object locations with the fully trained network and the proposed label correction method (Algorithm 1), one can use Algorithm 2 to estimate box sizes and shapes if desired. Finally, the refined annotations were stored and used as the ground truth for training object detectors in a second training stage. Note that, for every image, our correction scheme outputs exactly as many object locations as it received as an input. Therefore, our method only deals with bounding box noise and not with noise regarding the object classes. Furthermore, our method is not directly involved in the final training of the object detector. Hence, it is arbitrarily combinable with other object detection methods having certain strengths (see Appendix A). For example, wrong class labels could be corrected with methods such as [18,35] after our correction of locations has been applied.

Setting and Data
We evaluated our label correction method on three datasets: COWC [49], NWPU VHR-10 [29], and a third dataset published in [48], which we subsequently refer to as PalmTrees. The datasets contain objects from different domains, such as vegetation, urban artifacts, and means of transport. By this choice, we want to demonstrate that our method is not restricted to a specific use case, but is rather suitable for many different applications in remote sensing. As all of these datasets are annotated with clean labels, we simulate localization noise in our experiments.

COWC
This dataset contains aerial imagery of multiple locations and was published for the task of detecting vehicles. We sliced the images into patches of 512 by 512 pixels and used random splits of 10% each for validation and testing. The original labels are provided as points representing the centers of the cars. For our label correction scheme, we could directly use these point labels, whereas for training object detectors, we created a bounding box with a fixed size around each center. For the size, we chose 32 pixels, corresponding to a height and width of 4.8 m, as the standardized pixel size is 15 cm.
To simulate annotation noise, we displaced the vehicle locations with Gaussian noise with a mean of zero and a standard deviation of 16 pixels. Labels being located outside the image extent after adding the noise were removed. Apart from that, all labels were kept and only their locations were manipulated. Altogether, our simulation of noise results in an average intersection over union (IoU) of roughly 30% when comparing the corrupted with the ground-truth boxes.
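The noise simulation can be sketched as follows. The snippet displaces point labels with zero-mean Gaussian noise, drops labels that leave the image, and measures the IoU between fixed-size square boxes around the clean and the noisy centers. The function names, the random seed, and the choice of placing all clean labels at the patch center are assumptions for illustration only.

```python
import random


def add_location_noise(points, sigma, height, width, rng):
    """Displace (row, col) points with zero-mean Gaussian noise.

    Returns (clean, noisy) pairs; labels that leave the image are dropped.
    """
    pairs = []
    for r, c in points:
        nr = r + rng.gauss(0.0, sigma)
        nc = c + rng.gauss(0.0, sigma)
        if 0 <= nr < height and 0 <= nc < width:
            pairs.append(((r, c), (nr, nc)))
    return pairs


def square_iou(center_a, center_b, size):
    """IoU of two axis-aligned squares with the same side length."""
    dr = abs(center_a[0] - center_b[0])
    dc = abs(center_a[1] - center_b[1])
    inter = max(0.0, size - dr) * max(0.0, size - dc)
    return inter / (2.0 * size * size - inter)


rng = random.Random(0)
clean = [(256.0, 256.0)] * 5000  # labels in the middle of a 512 x 512 patch
pairs = add_location_noise(clean, sigma=16.0, height=512, width=512, rng=rng)
mean_iou = sum(square_iou(a, b, 32.0) for a, b in pairs) / len(pairs)

# A shift of one standard deviation along a single axis gives an IoU of 1/3.
iou_one_sigma = square_iou((100.0, 100.0), (116.0, 100.0), 32.0)
```

The last line gives a feeling for the severity of the noise: a 32-pixel box shifted by 16 pixels along one axis already shares only one third of the joint area with its ground-truth box.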

PalmTrees
As the name suggests, the objects of interest in this aerial image dataset are palm trees. Once again, we extracted patches of 512 by 512 pixels. The annotation consists of a set of GPS coordinates, which are also available at OpenStreetMap. In the training of object detectors, we assume a fixed size of 64 pixels for the trees. With a ground sampling distance of 8 cm, this corresponds to a tree diameter of 5.12 m. Since the alignment of the point annotations with the imagery has already been corrected in this dataset, we simulate noisy locations by adding Gaussian noise with zero mean and a standard deviation of 24 pixels to the initial positions.

NWPU VHR-10
Compared to the first two datasets, this dataset is more diverse. Instead of a single class, it contains objects of ten different classes: bridge, harbor, airplane, ship, vehicle, storage tank, baseball diamond, tennis court, basketball court, and ground track field. The image sizes vary, which is why we rescaled every image to 1024 by 1024 pixels. Here, the original annotations are also complete bounding boxes instead of just single point locations, as these objects have a larger variation in their sizes. However, for our method, we only used the box centers and the average box sizes per class to create training targets. For the standard deviations of the simulated Gaussian localization noise, we used one third of the average box size in pixels for each class. As this dataset contains precise information on the object sizes, we used it to evaluate our box size estimation method. For the other two datasets, this part of our framework was omitted because we could not evaluate it with the given annotations.

Co-Correction
For our label correction technique, we used a pretrained ResNet-50 [50] with a reduced stride in the last two layers; that is, the resolution of our class activation maps is one eighth of the input. Furthermore, we applied the sigmoid function to the outputs to ensure that the activations lie within [0, 1]. The Gaussian blobs for the targets were created with standard deviations depending on the dataset and the object class. We noticed that reducing the width of the target blobs, compared to the standard deviations of the noise, led to better results due to a better separation of single instances. Moreover, we augmented the training images by mirroring, adding noise, and color jittering. As an optimization algorithm, we used Adam [51].
Before feeding the obtained class activation maps into our label refinement algorithms (Algorithms 1 and 2), we preprocessed them by setting small values (below 0.01) to zero and taking the fourth power to obtain a better distinction of activated and non-activated regions. Discarding the small activations is especially essential for the box estimation, which is why we explicitly included this in Algorithm 2.
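This preprocessing step can be sketched in a few lines. The function name and the list-of-lists representation are assumptions; the threshold of 0.01 and the fourth power follow the description above, applied in that order.

```python
def preprocess_activations(activation, min_value=0.01, power=4):
    """Zero out small activations, then raise the remaining ones to `power`."""
    return [
        [0.0 if a < min_value else a ** power for a in row]
        for row in activation
    ]


processed = preprocess_activations([[0.005, 0.5, 1.0]])
```

In this example, the ratio between the strong and the moderate activation grows from 2 to 16, which is the intended sharper distinction between activated and non-activated regions.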

Object Detection
In our experiments, we demonstrated the effectiveness of our method by comparing the performance of different (noise-robust) object detection frameworks when trained on noisy, corrected, and clean labels, respectively. In particular, we used RetinaNet [52] and Faster R-CNN [53], as well as Noise-resistant Faster R-CNN [18], Faster R-CNN trained with the annotation refinement of [36], and RetinaNet trained with co-teaching [35].
As the code for the latter three training strategies is not (yet) published, we implemented them ourselves. For this, we stuck as closely as possible to the descriptions in the papers and plugged the modifications into the Faster R-CNN or RetinaNet architecture, respectively. In particular, for co-teaching, we used RetinaNet as a one-stage detector and, for the Noise-resistant Faster R-CNN and annotation refinement, we employed the Faster R-CNN architecture. Moreover, we omitted the soft label correction module in the Noise-resistant Faster R-CNN, as it is not necessary in our setting without class label noise. To achieve a fair comparison, all detectors were equipped with a pretrained ResNet-50 backbone. Approaches based on RetinaNet were trained with Adam, whereas the Faster R-CNN architectures achieved a better performance with SGD and a momentum of 0.9.
We used Hyperopt [54] for the hyperparameter search in all experiments. Details on the best hyperparameter settings can be found in Appendix B. Furthermore, our experiments were executed on a single NVIDIA GeForce RTX 2080 Ti or a comparable device.

Quantitative Results and State-of-the-Art Comparison
The most straightforward way to evaluate our correction method is to directly measure the distances between the corrected locations and their ground-truth locations. The effectiveness of our approach is demonstrated by the decrease in root-mean-square errors (RMSE) between the refined labels and their ground-truth counterparts in Table 1. As a baseline, we used the noisy object locations without any preprocessing. The corrected labels were obtained by our method when training only with the initial noisy annotations. Even if the improvement is substantial, the corrected labels still do not perfectly match their true positions. Nonetheless, this improvement is crucial for the training of object detectors, which we analyze in the following.
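The evaluation measure can be sketched as follows, assuming that every label is matched to its closest ground-truth center as stated in the caption of Table 1. The function name and the example coordinates are hypothetical, and the per-class normalization used for NWPU VHR-10 is omitted.

```python
import math


def rmse_to_closest(points, ground_truth):
    """RMSE between (row, col) points and their closest ground-truth centers."""
    squared_errors = [
        min((r - gr) ** 2 + (c - gc) ** 2 for gr, gc in ground_truth)
        for r, c in points
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical labels: squared distances of 25 and 9 to the closest centers.
rmse = rmse_to_closest([(3, 4), (10, 10)], [(0, 0), (10, 13)])
```

Matching to the closest center rather than to a fixed partner means that a label pulled onto a neighboring object still counts as well localized, which should be kept in mind when reading the scores.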
We trained object detectors with different approaches and settings and, thereafter, conducted a thorough comparison (see Table 2). As a measure for performance, we chose the widely used mean average precision (mAP) with an IoU threshold of 0.5 on the test set. In the first two sections of the table, only noisy annotations containing square boxes of a fixed size for each class (we consider this as equally informative as point labels) were used for training the models. In the second section, we applied our refinement method on the noisy annotations before training the detectors. For NWPU VHR-10, we also employed our box size estimation. Since none of the existing methods are designed to estimate box sizes from scratch, i.e., without even noisy box sizes as supervision, a direct comparison of these methods with our framework is not valid when integrating the box size estimation. Nevertheless, we included the scores for the sake of completeness and to show the practical use of this technique. The box size estimation setting was omitted for the other two datasets, as their point label annotations do not allow for a meaningful evaluation. We also included the scores for RetinaNet and Faster R-CNN when training with the clean and complete ground-truth bounding box annotations to obtain an upper bound for the performance on noisy labels.
From the existing methods, the annotation refinement of [36] and especially the Noise-resistant Faster R-CNN of [18] are the best-performing. Unfortunately, we were not able to reproduce the promising results of the co-teaching approach of [35]. However, it is evident that our label co-correction is superior to all other methods for dealing with localization noise. For COWC and PalmTrees, it largely closes the gap between noisy and clean supervision. On NWPU VHR-10, this gap remains rather large, which we found to be primarily caused by inaccurate box sizes rather than inaccurate locations. Note that our setting for NWPU VHR-10 poses noisy and weak supervision, making it more challenging.
We reason that the main advantage of our approach is that it makes maximal use of every single label. Our correction scheme is forced to move every label to a position that is in agreement with the learned features. In this process, rather low activations at the true position suffice to correct the label if the surrounding activations are even lower. Another advantageous feature of our label correction is that the locations are optimized jointly and not independently from each other, i.e., the correction of one label influences the correction of other nearby labels. Our method fulfills this property because it is based on k-means clustering. Thus, the cluster centroids, i.e., the corrected labels, cannot degenerate and share the same location. This property is especially useful when objects are close to each other and, for example, the noise shifts a label to a neighboring object. In such a scenario, our method will mostly avoid assigning two labels to one object while assigning no label to the other object. In the COWC dataset, this kind of situation can particularly occur in parking areas, where many vehicles are densely located in a small region.
Table 1. RMSE between the noisy and corrected object locations and their closest ground-truth object center. For NWPU VHR-10, we normalized the Euclidean errors (in pixels) with the average box size for each class to prevent classes of larger objects from dominating the class macro average of RMSEs.
Table 2. mAP scores (in percent) for different object detection methods and datasets at test time. Results in the first, second, and third section were obtained when training on noisy, our refined, and clean ground-truth annotations, respectively. Scores for ground-truth labels serve as an upper bound for the models in the first two sections.

Ablation Study
Since it is the only dataset where our full framework was employed, we conducted an ablation study on the NWPU VHR-10 dataset. The scores measuring the quality of the corrected and refined labels are given in Table 3. The "Initial noisy annotations" are the boxes with a fixed size per class and a randomly displaced center. The "Refined locations (single branch)" were obtained with our proposed correction framework but with only a single branch, i.e., the supervision for an output was generated based on the very same output. As we can see from the large margin in the scores, our correction method is highly effective. If we add a second branch ("Refined locations (co-correction)"), the score is improved once again. However, the gap is rather marginal and, in cases where the computational overhead caused by the second branch is an issue, the single-branch method might be preferable. This method does not make any improvements on the box sizes and shapes and, therefore, we can achieve another substantial increase for the score if we employ our box size estimation ("Full (co-correction + box size estimation)"). The effect of the box size estimation can also be observed in Table 2. However, the weakest point of our overall framework is the estimation of box sizes. We noticed that, in some cases, the class activation maps adopt the shapes of the Gaussian kernels in the targets instead of directly relating the activations to the present objects and their spatial extents. This can hurt the quality of the estimated box sizes, but it has barely any impact on the object localization.
Table 3. Ablation study on NWPU VHR-10 with scores for the quality of annotations obtained in different settings. The scores are the percentages of boxes having an IoU of at least 0.5 with their corresponding ground truth box.

Setting | Score
Initial noisy annotations | 15.2
Refined locations (single branch) | 56.7
Refined locations (co-correction) | 57.1
Full (co-correction + box size estimation) | 63.8

To provide further insights, we compare training and validation curves for training with and without co-corrected supervision on the PalmTrees dataset in Figure 3. The validation errors are the percentages of corrected labels that are not located in a square box around their corresponding ground-truth label. The size of these boxes was chosen to be 32 pixels, which is half the height and width of the ground-truth boxes we used for evaluating the object detectors on this dataset. When we start applying our co-correction scheme and use the corrected labels as supervision, both the training and the validation errors drop instantly. At the same time, if the corrected labels are not used for supervision, the losses and validation errors stay at a relatively high level. Hence, using our corrected labels as supervision is extremely beneficial for the learning process, as not only does it decrease the training loss, but it also improves the performance on unseen data.

As can be seen, it leads to a much better object localization, which, in turn, allows for a much better training of object detectors. Furthermore, the predicted box shapes are decent estimates for the real object extents. The only failure case among these examples can be observed in the top right. Here, one label is placed in the middle of two neighboring cars, whereas another car is covered by two labels. However, in this case, a manual assignment of labels to objects is not straightforward, as the displacements due to the noise are occasionally larger than the distances between objects. Furthermore, it is noteworthy that, even under the hard conditions in this example, no label is placed into a region without an object by our correction mechanism.
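The two annotation-quality measures used in this ablation can be written down compactly. The following is a minimal sketch and not the evaluation code used for the experiments: the function names, the (x_min, y_min, x_max, y_max) box format, and the pairing of annotations with ground truth by list index are assumptions made for illustration. We also assume that the 32-pixel box is centered on the ground-truth label, that 32 is its side length, and that points on its border count as inside.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def annotation_score(boxes, gt_boxes, threshold=0.5):
    """Percentage of boxes whose IoU with the paired ground-truth box is >= threshold.

    Assumption: boxes[i] corresponds to gt_boxes[i].
    """
    hits = sum(iou(b, g) >= threshold for b, g in zip(boxes, gt_boxes))
    return 100.0 * hits / len(boxes)


def location_error(points, gt_points, box_size=32):
    """Percentage of points lying outside a square of side box_size
    centered on the paired ground-truth point (assumed convention)."""
    half = box_size / 2
    misses = sum(
        abs(px - gx) > half or abs(py - gy) > half
        for (px, py), (gx, gy) in zip(points, gt_points)
    )
    return 100.0 * misses / len(points)


# Toy example: one perfectly placed box and one that barely overlaps.
example_score = annotation_score(
    [(0, 0, 10, 10), (0, 0, 10, 10)],
    [(0, 0, 10, 10), (8, 8, 18, 18)],
)
# Toy example: one point on its ground truth and one displaced by 20 pixels.
example_error = location_error([(0, 0), (20, 0)], [(0, 0), (0, 0)])
```

Both functions rely on a known correspondence between each annotation and its ground-truth object, which is available here because the noisy labels were generated from the ground truth.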
In addition, we also show some learned class activation maps in Figure 5. Again, our correction method still works in crowded scenes with many objects, and the separation of single instances is consistently good. Furthermore, we can see that the class activation maps provide meaningful information on the spatial extents of the objects. We utilize this in our bounding box size estimation. However, in some cases, and particularly for some classes, such as the baseball field on the right, the class activation maps adopt the Gaussian kernel shapes from the training targets, making it harder to estimate object sizes from them. We can definitely see potential for improvement with respect to that in the future. Nevertheless, this behavior does not harm the correction of object locations, which works well on all datasets and classes we used for our experiments.
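A short calculation illustrates why activation maps that copy the Gaussian target shape are problematic for size estimation. Suppose, purely as an assumption for illustration (the actual procedure is Algorithm 2 in the paper and is not reproduced here), that the width of an object is read off as the extent of the region in which the activation exceeds a fraction τ of its peak value a_0, with 0 < τ < 1. If the map around an object is simply a copy of the target kernel with standard deviation σ, then:

```latex
\begin{align}
a(r) &= a_0 \exp\!\left(-\frac{r^2}{2\sigma^2}\right), \\
a(r) \ge \tau a_0 \;&\Longleftrightarrow\; r \le \sigma \sqrt{2 \ln(1/\tau)}, \\
w &= 2\sigma \sqrt{2 \ln(1/\tau)} .
\end{align}
```

Under this assumption, the estimated width w depends only on σ and τ and carries no information about the actual object. For τ = 0.5, for example, w ≈ 2.355 σ regardless of the object size.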

Qualitative Results
As a third visual example, we depict the predictions of a RetinaNet trained with and without corrected labels in Figure 6. The predictions on the left were obtained after training on the original noisy annotations and the predictions on the right were obtained after training with our corrected object locations. We can see that poorly placed anchors have a relatively high confidence, which leads to imprecise and redundant predictions. This is caused by noisy labels providing positive supervision for these anchors during training. Consequently, the network struggles to learn a distinction between well-placed and poorly placed bounding boxes. This effect does not occur with our corrected annotations, demonstrating the benefit of our method.

Conclusions
In this work, we propose a label correction technique that can be employed for training object detectors with annotations only containing noisy object locations as points. It consists of a co-correction training framework to learn class activation maps and object locations in a noise-robust way. Moreover, the correction module works in a one-to-one manner, i.e., it proposes exactly one corrected location for every noisy point label. On top of that, we used our learned class activation maps to estimate bounding box sizes, making our framework applicable in settings with noisy and weak supervision.
A major advantage of our method is that it can be seen as a label preprocessing step, which allows it to be combined with any other approach for (noise-tolerant) object detection. In doing so, we observed a remarkable improvement in performance when training on our corrected labels instead of the initial noisy annotations. The most room for improvement of our framework lies in the estimation of precise bounding boxes. Our approach is rather simple and is mainly conceived to demonstrate an additional use of our learned class activation maps. However, more sophisticated machinery may produce better box estimates. Furthermore, it will be interesting to integrate our technique into an end-to-end trainable approach and also to extend it to deal with other types of annotation noise in the future. Particularly, sparsely annotated datasets, i.e., datasets where annotations are missing for an unknown subset of objects [1,20], pose a problem that is not addressed by our study.
In the context of remote sensing, we hope that this work is a first step toward the successful usage of GPS annotations for object detection. This would improve the practicability of these powerful methods in many real-world applications. For instance, in vegetation monitoring, GPS records resulting from field surveys could be employed to train object detectors on corresponding aerial imagery. Subsequently, expensive field surveys may be replaced by cheaper aerial surveys. Another use case could be urban planning, where existing data from GIS, e.g., building locations, may be aligned with aerial or satellite images more precisely.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Combination with Other Methods
In Table A1, we evaluate different object detection methods on different sets of labels on the COWC dataset. This also includes the combinations of our approach with other noise-robust methods. The effectiveness of our proposed label correction technique can be seen in the rightmost column with the scores for the detectors trained on corrected labels. These values show that the improvement due to our label correction is much more influential than the improvements due to the noise-tolerant training methods of existing works.
Among the existing methods, the annotation refinement [36] and the Noise-resistant Faster R-CNN [18] outperform the others in the noisy setting. They also lead to an improvement over the vanilla Faster R-CNN with corrected labels, which indicates that our method can indeed be combined with other techniques. However, when training on corrected labels, the vanilla RetinaNet seems to outperform all of the noise-tolerant models. As we already mentioned in the paper, we were not able to produce good results with co-teaching for object detection [35]. Often, the best scores for this method were observed during the warm-up phase and, during the actual training with a selected set of annotations, the performance decreased. This effect might occur because, in our setting, each annotation provides at least a rough localization of the object and is thus still informative. Supervision with these rough locations could therefore be better than ignoring these "hard" instances completely.

Appendix B. Hyperparameters
Here, we give details on the best hyperparameter configurations for different models and datasets. In general, we used an Adam optimizer for RetinaNet-based methods and SGD with a momentum of 0.9 for Faster R-CNN-based methods. Furthermore, the ReduceLROnPlateau learning rate scheduler of PyTorch was used in all experiments, with a decay rate of 0.1. As already mentioned in the paper, we employed Hyperopt [54] for searching the best hyperparameters. Depending on the number of hyperparameters and their possible range, we conducted the searches with different numbers of runs. From these runs, we chose the one with the best performance on the validation set for every method and reported its performance on the test set. In Table A2(a), we list the hyperparameter settings for our co-correction method. Here, σ refers to the standard deviation of the Gaussian kernels in the targets. Apparently, choosing this standard deviation to be significantly smaller than the average object sizes is advantageous. We reason that this allows for a better separation of single instances. Note that s^(l) denotes the average box size for class l on NWPU VHR-10. Parameter d refers to the maximum correction distance in Algorithm 1 in the paper.
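The effect of σ on instance separation can be illustrated with a toy target map. The sketch below is not the target generation code of the paper: the function name `gaussian_target`, the peak value of 1, and the combination of overlapping kernels by a pixel-wise maximum are assumptions made for illustration. It renders two point labels that are 12 pixels apart and compares the target value halfway between them for a small and a large σ.

```python
import math


def gaussian_target(points, width, height, sigma):
    """Render one Gaussian kernel (peak value 1) per point label.

    Assumption: overlapping kernels are combined with a pixel-wise maximum.
    """
    target = [[0.0] * width for _ in range(height)]
    for cx, cy in points:
        for y in range(height):
            for x in range(width):
                dist_sq = (x - cx) ** 2 + (y - cy) ** 2
                value = math.exp(-dist_sq / (2.0 * sigma ** 2))
                if value > target[y][x]:
                    target[y][x] = value
    return target


# Two neighboring objects, 12 pixels apart on the same row.
points = [(10, 8), (22, 8)]
narrow = gaussian_target(points, width=32, height=16, sigma=2.0)
wide = gaussian_target(points, width=32, height=16, sigma=8.0)

# Target value at the midpoint between the two labels (x = 16, y = 8).
valley_narrow = narrow[8][16]  # close to 0: two clearly separated peaks
valley_wide = wide[8][16]      # close to the peak value: the kernels merge
```

With the small σ, the target drops almost to zero between the two labels, so each object corresponds to its own distinct peak. With the large σ, the kernels merge into a single broad blob, which makes neighboring instances harder to tell apart.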

Appendix B.2. Label Correction and Box Size Estimation
For the correction of object locations (Algorithm 1 in the paper), the only hyperparameter is the maximum correction distance d. This was chosen to be 64 for COWC and PalmTrees and s^(l) · 4/3 for NWPU VHR-10.
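To make the role of d concrete, the following sketch moves a noisy point label to the strongest activation within distance d of its original position. This is only one plausible reading of a distance-bounded, one-to-one correction and is not a reproduction of Algorithm 1: the function names, the use of a plain maximum search, the Euclidean distance, and the nested-list map format are all assumptions made for illustration. Only the values of d are taken from the text above.

```python
def max_correction_distance(dataset, avg_box_size=None):
    """Maximum correction distance d as listed above.

    avg_box_size is the average box size s^(l) of the class (NWPU VHR-10 only).
    """
    if dataset in ("COWC", "PalmTrees"):
        return 64
    if dataset == "NWPU VHR-10":
        return avg_box_size * 4 / 3
    raise ValueError(f"unknown dataset: {dataset}")


def correct_location(activation_map, point, d):
    """Return the position of the highest activation within distance d of point.

    Hypothetical rule: the label stays where it is unless a strictly higher
    activation exists within the allowed radius. activation_map is indexed as
    activation_map[y][x]; point is an (x, y) pair of integer pixel coordinates.
    """
    px, py = point
    height, width = len(activation_map), len(activation_map[0])
    radius = int(d)
    best, best_value = (px, py), activation_map[py][px]
    for y in range(max(0, py - radius), min(height, py + radius + 1)):
        for x in range(max(0, px - radius), min(width, px + radius + 1)):
            within_reach = (x - px) ** 2 + (y - py) ** 2 <= d * d
            if within_reach and activation_map[y][x] > best_value:
                best, best_value = (x, y), activation_map[y][x]
    return best


# Toy map: a nearby peak at (7, 5) and a stronger but distant peak at (0, 0).
toy_map = [[0.0] * 10 for _ in range(10)]
toy_map[5][7] = 1.0
toy_map[0][0] = 5.0

near = correct_location(toy_map, (5, 5), d=3)   # reaches the nearby peak only
stay = correct_location(toy_map, (5, 5), d=1)   # nothing better within reach
far = correct_location(toy_map, (5, 5), d=10)   # distant peak is now reachable
```

The toy example shows the trade-off behind the choice of d: a value that is too small prevents the label from reaching its object, whereas a value that is too large lets it jump to a stronger activation that belongs to a different object.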
Table A2(b) contains the best hyperparameters for our box size estimation on the NWPU VHR-10 dataset. The parameter names correspond to the names in Algorithm 2 in the paper. Once again, the standard box size s^(l) was chosen as the average box size per class. The best hyperparameter configurations for the standard object detectors RetinaNet [52] and Faster R-CNN [53] are given in Table A3(a,b). The four sections correspond to the different label sets and noise settings. The "Estimated Boxes" section refers to the annotations obtained with our correction of locations and our estimation of box sizes. On COWC and PalmTrees, we did not apply our box estimation scheme, as the available point labels do not allow us to sensibly evaluate it (all objects are assumed to be of the same size).