Benchmarking Anchor-Based and Anchor-Free State-of-the-Art Deep Learning Methods for Individual Tree Detection in RGB High-Resolution Images

Urban forests contribute to maintaining livability and increase the resilience of cities in the face of population growth and climate change. Information about the geographical distribution of individual trees is essential for the proper management of these systems. RGB high-resolution aerial images have emerged as a cheap and efficient source of data, although detecting and mapping single trees in an urban environment is a challenging task. Thus, we propose the evaluation of novel methods for single tree crown detection, as most of these methods have not been investigated in remote sensing applications. A total of 21 methods were investigated, including anchor-based (one and two-stage) and anchor-free state-of-the-art deep-learning methods. We used two orthoimages divided into 220 non-overlapping patches of 512 × 512 pixels with a ground sample distance (GSD) of 10 cm. The orthoimages were manually annotated, and 3382 single tree crowns were identified as the ground-truth. Our findings show that the anchor-free detectors achieved the best average performance with an AP50 of 0.686. We observed that the two-stage anchor-based and anchor-free methods showed better performance for this task, emphasizing the FSAF, Double Heads, CARAFE, ATSS, and FoveaBox models. RetinaNet, which is currently commonly applied in remote sensing, did not show satisfactory performance, and Faster R-CNN had lower results than the best methods but with no statistically significant difference. Our findings contribute to a better understanding of the performance of novel deep-learning methods in remote sensing applications and could be used as an indicator of the most suitable methods in such applications.


Introduction
The urban population is expected to grow at the highest rates in human history in the next decades, with an increase of 1.2 billion urban residents worldwide by 2030 [1]. Densely populated areas are hotspots of numerous environmental problems, including air pollution [2,3] and hydrological disturbance [4,5], and are also linked to mental illness and health problems [6,7]. Global climate change affects climate patterns, and the increase in surface temperature has led to more frequent, longer, and more severe heatwaves [8] and, likewise, increased the occurrence of floods [9]. In this scenario, urban forests could play an important role in mitigating some of these threats [10] and filling the gap between sustainable and livable cities. These systems are important assets to achieve urban sustain-
Santos et al. [19] applied three deep-learning methods (Faster R-CNN, YOLOv3, and RetinaNet) to detect one tree species, Dipteryx alata Vogel (Fabaceae), in Unmanned Aerial Vehicle (UAV) high-resolution RGB images. The authors found that RetinaNet achieved the best results. Culman et al. [39] implemented RetinaNet to detect palm trees in aerial high-resolution RGB images, achieving a mean average precision of 0.861. Further, Oh et al. [40] used YOLOv3 to count cotton plants. Roslan et al. [41] applied RetinaNet to detect individual trees in super-resolution RGB images of a tropical forest, and RetinaNet was also evaluated for a tropical forest in [42]. However, most of the research in this field has used methods such as Faster R-CNN and RetinaNet, both dating from before 2018. With the constant development of new methods, there is a need to assess the performance of these methods in remote sensing applications.
Despite these initial efforts, there is a lack of studies assessing the performance of novel deep-learning methods for individual tree crown detection, regardless of tree species or size, in urban areas. This task is challenging in the urban context due to the heterogeneity of these scenes [20], with different tree types and sizes combined with overlap between objects, shadows, and other situations. Our objective is to benchmark anchor-based and anchor-free detectors for tree crown detection in high-resolution RGB images in urban areas.
To the best of our knowledge, our study is the first to present a large assessment of novel deep-learning detection methods for individual tree crown detection in urban areas. Further, we provide an analysis covering the main lines of research in computer vision for anchor-based methods (one- and two-stage) and anchor-free methods. Different from previous studies, our focus is to detect all trees, regardless of tree species or size, in an urban environment. Thus, our study intends to fill this gap and demonstrate the performance of the most advanced object detection methods in remote sensing applications.
Two high-resolution RGB orthoimages were manually annotated and split into non-overlapping patches. We evaluate 21 novel deep-learning methods for the proposed task, covering the main directions in object detection research. We present a quantitative and qualitative analysis of the performance of each method and of each main type of detector. The dataset is publicly provided for further investigation at: https://github.com/pedrozamboni/individual_urban_tree_crown_detection (accessed on 21 June 2021).

Image Dataset
We used two RGB high-resolution orthoimages of 5619 × 5946 pixels, with a ground sample distance (GSD) of 10 cm, covering the urban area of Campo Grande, Mato Grosso do Sul state, Brazil (Figure 1). These are airborne images collected in 2013 by the city hall of Campo Grande. Campo Grande has 96.3% of its urban households on public roads with afforestation, and the city was recognized [43], in 2019, as a Tree City of the World by the Food and Agriculture Organization of the United Nations and the Arbor Day Foundation. A total of 161 plant species were identified on the streets of the municipality, totaling more than 150 thousand trees [44]. Licania tomentosa is the most abundant species, representing 18.35%, followed by Ficus benjamina with 18.18%, and 66 species presented only one individual [44].
We manually annotated the orthoimages with rectangles (bounding boxes) in QGIS software. Since the object detection inputs are image patches, the orthoimages were split into 220 non-overlapping patches of 512 × 512 pixels (51.20 × 51.20 m), each covering an area of 2621.44 m². The manually annotated polygons were converted into bounding boxes (Figure 2), and 3382 trees were identified as ground-truth. The object detection methods were trained to learn and predict the bounding box coordinates in the images given the ground-truth data. For our experiments, we divided the patches into training (60%), validation (20%), and test (20%) sets (Table 1). The validation set was used during an intermediate phase in training to select the model hyperparameters.
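The patch arithmetic above can be checked with a short sketch. It assumes that only complete 512 × 512 tiles are kept and that the partial strips at the right and bottom edges are discarded; the text does not state how the edges were handled, and the function names are illustrative only.

```python
# Sketch: tiling an orthoimage into non-overlapping patches.
# Assumption (not stated in the text): incomplete edge tiles are discarded.

def patch_origins(width, height, patch=512):
    """Top-left (x, y) pixel coordinates of every full patch in the image."""
    return [
        (x, y)
        for y in range(0, height - patch + 1, patch)
        for x in range(0, width - patch + 1, patch)
    ]

GSD_M = 0.10                      # ground sample distance in metres per pixel
IMAGE_W, IMAGE_H = 5619, 5946     # orthoimage size in pixels
N_IMAGES = 2

patches_per_image = len(patch_origins(IMAGE_W, IMAGE_H))
total_patches = N_IMAGES * patches_per_image
patch_side_m = 512 * GSD_M        # side length of one patch on the ground
patch_area_m2 = patch_side_m ** 2

print(patches_per_image, total_patches, patch_side_m, round(patch_area_m2, 2))
```

Under this assumption each orthoimage yields 10 × 11 full tiles, which agrees with the 220 patches and the 2621.44 m² per patch reported above.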

Individual Tree Crown Detection Approach
Our experiment was divided into two parts (Figure 3). First, 21 state-of-the-art algorithms were evaluated in this task. These methods cover the most diverse approaches currently used to detect objects, including anchor-based (one and two-stage) and anchor-free (Table 2). Second, we selected the best five methods in terms of AP50. Faster R-CNN and RetinaNet were also included (among the best ones) since these methods are present as a standard baseline in the remote sensing literature.
We evaluated the seven selected methods (top five + Faster R-CNN + RetinaNet) using a holdout procedure repeated four times to obtain a more robust evaluation, given bias-variance tradeoffs. In each repetition, we randomly shuffled and split the data into three disjoint sets: training, validation, and test. The average and standard deviation of the AP50 were then computed for each method over five runs (the initial run plus the four repetitions). We also performed One-Way ANOVA with the Holm-Bonferroni post hoc test to assess whether these AP50 averages were statistically different. We used the methods implemented in the MMDetection project source code proposed by the Multimedia Laboratory [45].
Figure 3. The workflow for individual tree crown detection. Initially, the images were annotated with bounding boxes. In the first step, 21 deep-learning methods were trained, and the best methods were selected based on the value of the third quartile plus Faster R-CNN and RetinaNet. In the second step, the selected methods were trained four more times with randomly shuffled datasets.
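As an illustration of the repeated holdout described in this section, the sketch below shuffles the 220 patch identifiers and splits them into disjoint 60/20/20 sets, once per repetition. The seeds and the function name are hypothetical; the text does not report how the shuffling was implemented.

```python
import random

def holdout_split(items, seed, fractions=(0.6, 0.2, 0.2)):
    """Shuffle items and split them into disjoint train/val/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    n_train = round(fractions[0] * len(shuffled))
    n_val = round(fractions[1] * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]        # remainder goes to the test set
    return train, val, test

patch_ids = list(range(220))                 # one id per image patch
# Four repetitions with hypothetical seeds 0..3.
repetitions = [holdout_split(patch_ids, seed) for seed in range(4)]

for train, val, test in repetitions:
    print(len(train), len(val), len(test))
```

Each method would then be trained and tested once per split, and the per-split AP50 values summarized by their mean and standard deviation.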
For the training, the backbone of all methods was initialized with pre-trained weights from the well-known ImageNet dataset. A stochastic gradient descent optimizer with a momentum of 0.9 and weight decay of 0.0001 was applied. The initial learning rate was empirically set to 0.00125. All the models were trained over 24 epochs. Figure 4 illustrates the training and validation loss curves. The training loss decreased rapidly after a few epochs and stabilized at the end. This indicates that the number of epochs was sufficient and that the learning rate was adequate. The training and testing procedures were conducted in Google Colaboratory with GPU.
Table 2 (only fragments recovered): Faster R-CNN, X101-32x4d-FPN-dconv-c3-c5-1x, 2018, [49], AB-TS; YoloV3, DarkNet-53, 2018.
Figure 4. Loss curves for training (blue) and validation (orange) for each object detection method (panel (u): YoloV3). For YoloV3, NAS-FPN, and FoveaBox, we only show the validation curves since the logs for these methods did not return the training loss.
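The optimizer settings reported in this section could be written as a configuration fragment along the following lines. This is a sketch in the dictionary style used by MMDetection 2.x configs; the exact key layout depends on the installed version and is an assumption here, as the text does not reproduce the authors' configuration files. The learning-rate schedule and the ImageNet-pretrained backbone settings are not shown because the text does not specify them.

```python
# Config fragment in MMDetection 2.x style (assumed layout, version dependent).
# Values are taken from the text; everything else is left to the base config.
optimizer = dict(
    type='SGD',
    lr=0.00125,           # initial learning rate, set empirically
    momentum=0.9,
    weight_decay=0.0001,
)
# Total training length: 24 epochs for every model.
runner = dict(type='EpochBasedRunner', max_epochs=24)
```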

Performance Evaluation
We assessed the overall performance of the methods using the Average Precision (AP). The AP is the area under the precision-recall curve. The precision and recall were estimated using Equations (1) and (2). To obtain the precision and recall values, we defined the Intersection Over Union (IoU). The IoU is the ratio between the overlapping area and the union area of the predicted and ground-truth bounding boxes. When a predicted bounding box reaches an IoU value greater than the threshold, the prediction is classified as a true positive (TP). On the other hand, if the IoU value is below the threshold, the prediction is a false positive (FP). Further, if a ground-truth bounding box is not detected by any prediction, it is considered a false negative (FN). We used an IoU threshold of 0.5 (AP50), the most common IoU threshold used in computer vision.
Precision = TP / (TP + FP) (1)
Recall = TP / (TP + FN) (2)
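A minimal sketch of these definitions follows. It assumes boxes given as (x_min, y_min, x_max, y_max), predictions already sorted by descending confidence, and a greedy one-to-one matching in which a prediction counts as a true positive when its IoU is at least the threshold; the text does not specify these matching details, and the function names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def count_matches(predictions, ground_truths, iou_threshold=0.5):
    """Greedy matching; returns (TP, FP, FN). Assumes sorted predictions."""
    matched = set()
    tp = fp = 0
    for pred in predictions:
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truths):
            if idx in matched:
                continue  # each ground-truth box can be matched only once
            value = iou(pred, gt)
            if value > best_iou:
                best_iou, best_idx = value, idx
        if best_idx is not None and best_iou >= iou_threshold:
            tp += 1
            matched.add(best_idx)
        else:
            fp += 1
    fn = len(ground_truths) - len(matched)  # ground truths never detected
    return tp, fp, fn

def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example: one correct detection, one spurious box, one missed tree.
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0, 0, 10, 10), (50, 50, 60, 60)]
tp, fp, fn = count_matches(preds, gts)
print(tp, fp, fn, precision_recall(tp, fp, fn))
```

The AP then summarizes precision and recall over all confidence thresholds as the area under the resulting curve; that integration step is not shown here.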

Statistical Analysis
We performed the Shapiro-Wilk test to check the normality of the data. All samples reported p-values greater than 0.05; therefore, we cannot reject the null hypothesis that the samples were normally distributed. We also conducted Bartlett's test for equal variances. The p-values were greater than 0.05, failing to reject the null hypothesis, and we can, thus, assume that the samples had equal variance. As for data independence, the samples were randomly obtained from the set.
An ANOVA test with the Holm-Bonferroni post hoc test was performed to assess whether the means of the best methods (top five + Faster R-CNN + RetinaNet) were statistically different. For the ANOVA, the p-value was compared to the significance level (α = 0.05) to assess the null hypothesis. If the p-value was equal to or less than the significance level, there were statistically significant differences between the means. However, the ANOVA test does not identify which pairs differ; it only indicates that not all AP50 means are equal. Therefore, after rejecting the null hypothesis using ANOVA, the evaluation proceeded with the Holm-Bonferroni post hoc test to identify the differences between pairs of algorithms.
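For reference, the one-way ANOVA statistic can be sketched in plain Python as the ratio of the between-group to the within-group mean square. The numbers in the example are made up for illustration and are not the AP50 values of this study; converting the F statistic into a p-value requires the F distribution (available in statistical libraries) and is not shown.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples (lists)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Illustrative data only (three groups of three observations).
example = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_stat, df_b, df_w = one_way_anova_f(example)
print(f_stat, df_b, df_w)
```

With seven methods and five AP50 values per method, the same function would give degrees of freedom of 6 and 28.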

Results
Here, we present the results of our experiments. First, we performed a quantitative and qualitative analysis for all 21 methods. The results are separated by the type of method, i.e., anchor-based (AB-OS: one-stage; AB-TS: two-stage; and AB-MS: multi-stage) and anchor-free (AF). In the quantitative analysis, we evaluated the methods using the IoU threshold of 0.5 (AP50). The qualitative analysis was conducted to identify in which situations the models had good and bad performance under different conditions, such as shadow and overlap by other objects.
Later, we present the results for the second part of the experiments with the top five models, Faster R-CNN, and RetinaNet. The images presented in this section were from the test set; therefore, the images provide a better indication of the performance of the models. Even though different areas (with different tree species, tree crown sizes, and distributions) were used to train and test the model, the two images are from the same city. Thus, it is not possible to comment on the capacity for generalizability of these models on different datasets.

Anchor-Based (AB) Detectors
In this section, we discuss the performance of the one-, two-, and multi-stage anchor-based detectors. For one-stage methods, the average AP50 was 0.657 ± 0.032. Table 3 shows the test set results for the one-stage methods. The Gradient Harmonized Single-stage Detector outperformed all the others in AP50, with gains ranging from 1.4% to 10%. RetinaNet, NAS-FPN, and SABL provided similar results. Probabilistic Anchor Assignment and Generalized Focal Loss presented similar performances, and YoloV3 was the worst method.
Table 4 shows the performance for the two-stage and multi-stage (DetectoRS) methods. During the test, on average, the two-stage and multi-stage methods reached 0.669 ± 0.023 for AP50. The Double Heads method achieved the best AP50 among these methods, outperforming the others by 0.2% to 6.8%. The CARAFE and Empirical Attention methods obtained performances similar to Double Heads in terms of the AP50. Faster R-CNN, DetectoRS, Deformable ConvNets v2, and Dynamic R-CNN reached similar performances, and Weight Standardization provided the worst results.
Figures 5 and 6 show the tree detection achieved using the one-stage methods. As we can see in Figure 5, the one-stage methods detected smaller and even medium-sized tree crowns accurately. However, for larger crowns (Figure 6), we observed a decrease in performance, with Probabilistic Anchor Assignment being the only method with good performance. For more irregular trees, where the crown did not have a circular shape, the methods usually detected more than one bounding box for a given ground-truth annotation. In areas with large agglomerations of trees, the methods did not detect the trees or detected only a part of them.
Figures 7 and 8 present the detection for two-stage methods. Similar to the one-stage methods, the two-stage methods presented good performance in detecting smaller and medium-sized tree crowns.
For larger ones and in areas with a greater agglomeration of objects (Figure 8), the two-stage methods performed substantially better than the one-stage methods. Thus, these methods appeared to handle the problem better, detecting tree crowns more accurately in more complex scenes. Further, we observed that the presence of shadow did not cause a great decrease in the detection. We observed that the main challenge was to detect single trees with larger crowns and areas where the limits of each object were not clear. In such cases (Figures 6 and 9), even for the human eye, it is difficult to separate the trees from each other.
Figure 9. Examples of tree detection in areas with high density using the anchor-free methods.

Anchor-Free (AF) Detectors
The results obtained for anchor-free (AF) methods are described in Table 5. In the test, the anchor-free methods achieved an average AP50 of 0.686 ± 0.014. FSAF reached the best performance in terms of the AP50 with 0.701, outperforming the others by 0.9% to 3.7%. FoveaBox, ATSS, and VarifocalNet (2) had similar results in terms of the AP50, and VarifocalNet (1) had the worst performance.
Anchor-free methods demonstrated similar behavior when compared with the one-stage methods. These models performed well for small trees (Figure 10). For occluded objects and more irregular tree crowns, we observed a decrease in the performance, with multiple detections and the detection of only part of the object. For areas with larger tree crowns and more agglomerations, the performance also decreased. VarifocalNet (2) was the only method that managed to produce relatively good detection in the most complex images (Figure 9). This highlights that these areas with larger canopies and more agglomerations are the main challenges for the methods.
Figure 10. Example of tree detection using anchor-free methods.

Analysis of the Best Methods
Here, we present the best five methods considering AP50, which were FSAF, Double Heads, CARAFE, ATSS, and FoveaBox. We also included Faster R-CNN and RetinaNet, since these two are commonly used in remote sensing. We noticed that none of the five best was a one-stage method. As seen in the previous sections, the anchor-free methods showed better average performance compared with the one and two-stage methods in terms of the AP50. Figure 11 shows the box plot for the methods. Figures 12-15 show some results for the best methods. We observed that Double Heads, a two-stage method, achieved the best average AP50 (0.732), with differences ranging from 0.4 percentage points, when compared to ATSS, to 4.6 percentage points, when compared to RetinaNet. ATSS and CARAFE achieved similar values, with average AP50 of 0.728 and 0.724, respectively, which were close to Double Heads. FSAF and FoveaBox had slightly worse performances with average AP50 values of 0.720 and 0.719. Faster R-CNN (average AP50 of 0.700) and RetinaNet (average AP50 of 0.686) obtained the worst average results.
In addition to the performance analysis conducted using the AP50, we performed One-Way ANOVA to assess whether the average AP50 values of the best methods were significantly different. One-Way ANOVA for the top five, Faster R-CNN, and RetinaNet indicated a p-value of 0.019, which is less than the significance level (α = 0.05). Therefore, we can reject the null hypothesis that the mean AP50 values were equal. We continued the evaluation using a post hoc test to identify differences between pairs of algorithms.
A simple strategy in multiple comparisons is to compare each p-value with α/m, where m is the number of comparisons; this is the Bonferroni correction. However, this criterion is conservative and can lead to failing to reject a false null hypothesis (Type II error). Holm-Bonferroni adjusts the rejection criterion for each comparison, reducing the chance of a Type II error while still controlling the family-wise Type I error rate. The Holm-Bonferroni procedure sorts the p-values in increasing order, creating a ranking P_1, ..., P_k, ..., P_m, and compares each P_k with α/(m + 1 − k), where k is the rank in the comparison. When P_k < α/(m + 1 − k) is false, the procedure stops, and we cannot reject the null hypothesis for P_k or for any subsequent p-value. Table 6 shows the results of the Holm-Bonferroni test. For simplicity, the column P-value corr represents this comparison, and, when its value is lower than 0.05, we can reject the null hypothesis.
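The step-down rule described above can be sketched as follows. The p-values in the example are invented for illustration and are not those of Table 6. The second function returns corrected p-values in the usual Holm form, which can be compared directly with α; whether the P-value corr column of Table 6 was computed in exactly this way is an assumption.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p-values
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] < alpha / (m + 1 - rank):
            reject[idx] = True
        else:
            break  # stop at the first failure; the rest are not rejected
    return reject

def holm_adjusted(p_values):
    """Holm-corrected p-values, comparable directly with alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order, start=1):
        candidate = min(1.0, (m + 1 - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep values monotone
        adjusted[idx] = running_max
    return adjusted

# Illustrative p-values only.
example_p = [0.01, 0.04, 0.03, 0.005]
print(holm_bonferroni(example_p))
print(holm_adjusted(example_p))
```

In the example, the two smallest p-values are rejected and the procedure stops at 0.03, because 0.03 is not below 0.05/2. With seven methods compared pairwise, m would be 21.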
The test indicates that the results of RetinaNet were significantly different from those of ATSS, CARAFE, and Double Heads. The comparisons between the other methods showed no statistically significant differences. This suggests that RetinaNet, among the tested models, is not recommended for the proposed task.

Discussion
Anchor-based one-stage methods achieved the lowest average AP50 (0.657). Gradient Harmonized Single-Stage Detector was the best with an AP50 of 0.691, and YoloV3 was the worst with an AP50 of 0.591. The commonly used RetinaNet had an AP50 of 0.650, being the second worst one-stage method.
A previous study [39] implemented RetinaNet to detect Phoenix palms with the best AP value of 0.861; however, the authors split the dataset only into training and validation sets and used a score threshold of 0.2 and an IoU of 0.4. This may have led to better performance. They also considered only one tree species as the target. Roslan et al. [41] used RetinaNet to detect individual trees and achieved superior results with a precision of 0.796, and similar results were found by [42]. Both studies [41,42] utilized images of non-urban areas (tropical forests). In our experiments, RetinaNet was among the less accurate one-stage methods.
Anchor-based two-stage methods had the second-highest average AP50 of 0.669. Double Heads had the best performance among these methods, and DetectoRS had the worst. Faster R-CNN and RetinaNet (baseline) had similar results. Santos et al. [19] investigated both methods and concluded that RetinaNet outperformed Faster R-CNN and YOLOv3 in the detection of a single tree species, achieving an AP50 higher than 0.9. Wu et al. [65] proposed a model that used Faster R-CNN as a detector in a hybrid model to detect and segment apple tree crowns in UAV imagery. In the detection stage, the authors achieved high precision for the task. However, these authors considered only one tree species and used images with higher resolution and small variation in scale. These factors may lead to better performance for the methods.
On the other hand, the anchor-free methods had the best average AP50 with 0.686. FSAF, ATSS, and FoveaBox stood out among the others. The results for the anchor-free methods corroborate the study of Gomes et al. [23], where ATSS also outperformed Faster R-CNN and RetinaNet by about 4%.
Previous studies [28,29] also reported that two-stage methods performed better than one-stage methods, which corroborates our findings. We found that anchor-free methods performed similarly to anchor-based two-stage methods. This behavior has already been reported in the literature. The advantage of anchor-free detectors is the removal of the hyper-parameters associated with the anchors, implying potentially better generalization [29]. RetinaNet (one-stage) and Faster R-CNN (two-stage) showed relatively poor results when compared with the top five methods selected. It is important to note that these two methods have been reported in the literature as having superior performance in other remote sensing applications [19,24].
As previously presented, our experiment aimed to detect all the trees in urban scenes. Compared to previous work that targeted only a single tree species, our objective was considerably more challenging. First, urban scenes are more complex and heterogeneous. Second, our dataset presented various tree species and tree crown sizes with overlap between objects, shadows, and other situations. In the Campo Grande urban area, there are 161 tree species and more than 150 thousand trees. Thus, this complexity in the task led to better performance of the two-stage anchor-based methods, especially in more challenging images, as can be seen in Figure 15. These methods first filter the regions that may contain an object and then eliminate most negative regions [27]. Comparatively, [66] proposed the identification of trees in urban areas using street-level imagery and Mask R-CNN [67]. They found an AP50 between 0.620 and 0.682.

Conclusions
Here, we presented a large assessment of the performance of novel deep-learning methods to detect single tree crowns in urban high-resolution aerial RGB images. We evaluated a total of 21 object detection methods, including anchor-based (one, two, and multi-stage) and anchor-free detectors in a remote sensing relevant application. We provided a quantitative and qualitative analysis of each type of method. We also provided a statistical analysis of the best methods as well as RetinaNet and Faster R-CNN.
Our results indicate that the anchor-free methods showed the highest average AP50, followed by anchor-based two-stage and anchor-based one-stage methods. Our findings suggest that the best methods for the current task were the two-stage anchor-based and anchor-free detectors. Among the one-stage anchor-based detectors, only the Gradient Harmonized Single-stage Detector came close to the best methods, performing only slightly worse. This may be an indication that one-stage methods are not recommended for the proposed task. Meanwhile, the two-stage (Double Heads and CARAFE) and anchor-free (FSAF, ATSS, and FoveaBox) detectors achieved superior performance and are therefore the methods this study suggests for urban single tree crown detection.
Our experimental results demonstrated that RetinaNet, one of the most used methods in remote sensing, did not have satisfactory performance for the proposed task and underperformed several of the best methods (ATSS, CARAFE, and Double Heads). This may indicate that this method is not suitable for the proposed task. Faster R-CNN had slightly inferior results compared with the best methods, although no statistically significant difference was found. It is worth mentioning that research aimed at detecting single trees in an urban environment is still incipient, and further investigation regarding the most appropriate techniques is needed. In our work, we set out to detect all tree crowns in an urban environment. This task is considerably more complex than detecting specific species or types of trees since there is a greater variety of trees. Likewise, images from the urban environment are more complex and challenging than those from rural environments, as they present a more heterogeneous scene.
Our work demonstrates the potential of existing deep-learning techniques by applying different methods to remote sensing data. This study may contribute to innovations in remote sensing based on deep-learning object detection. The majority of the research applying deep learning in remote sensing was done using methods dated before 2018 (e.g., Faster R-CNN and RetinaNet), and, with the development of new methods, it is essential to evaluate their performance in these tasks. The development of techniques capable of accurately detecting trees using RGB images is essential in preserving and maintaining forest systems. These tools are essential for cities, where accelerated population growth and climate change are becoming significant threats. Future work will focus on developing a method capable of working with high-density objects. We also intend to increase the size of the dataset with images from different cities in order to obtain models with better generalization capabilities.