Individual Tree-Crown Detection and Species Classification in Very High-Resolution Remote Sensing Imagery Using a Deep Learning Ensemble Model

Abstract: Traditional methods for individual tree-crown (ITC) detection (image classification, segmentation, template matching, etc.) applied to very high-resolution remote sensing imagery have been shown to struggle in disparate landscape types or image resolutions due to scale problems and information complexity. Deep learning promised to overcome these shortcomings thanks to its superior performance and versatility, proven with reported detection rates of ~90%. However, such models still find their limits in transferability across study areas, because of different tree conditions (e.g., isolated trees vs. compact forests) and/or resolutions of the input data. This study introduces a highly replicable deep learning ensemble design for ITC detection and species classification based on the established single shot detector (SSD) model. The ensemble design is based on varying the input data for the SSD models, coupled with a voting strategy for the output predictions. Very high-resolution unmanned aerial vehicle (UAV) and aerial remote sensing imagery and elevation data are used in different combinations to test the performance of the ensemble models in three study sites with highly contrasting spatial patterns. The results show that ensemble models perform better than any single SSD model, regardless of the local tree conditions or image resolution. The detection performance and the accuracy rates improved by 3-18% with as few as two participating single models, regardless of the study site. However, when more than two models were included, the performance of the ensemble models improved only slightly or even dropped.


Introduction
The identification of individual tree-crowns (ITC) is an important research topic in forestry, remote sensing and computer vision [1]. It is a requirement in forest management and monitoring. A central property of the classifiers used for this task is generalization [23,24]. In inductive learning, generalization is the main objective of a classifier, as it aims to use a finite set of input data to accurately classify new examples [25]. The interest in combining different neural networks started with the first designs of such models, and clear advantages in generalization power were observed for model ensembles compared with single (monolithic) neural networks [26][27][28]. The design of any ensemble model of neural networks needs to take into account the concept of error independence, meaning that the neural networks involved in the ensemble need to make independent prediction errors [28]. It has been shown that error correlation and prediction power are inversely related. In consequence, an ensemble model needs to increase diversity in order to lower the degree of correlation between the networks [29,30]. Multiple approaches for designing error-independent ensemble models are described in the literature [28]:
• Varying the training data on which a neural network model is trained;
• Varying the initial set of random weights from which each neural network is trained, while keeping the training data constant;
• Varying the topology or the architecture of the hidden layers within the same algorithm;
• Varying the algorithm used for training the same data.
Generally, a neural network ensemble has better performance than any single neural network involved in the design. Hence, the last step in constructing an ensemble network is to combine the predictions to increase accuracy. Various methods for combining the predictions of ensemble neural networks are described in the literature [28,31]: averaging and weighted averaging, majority rules, voting schemes, stacked generalization and Bayesian methods. Examples of ensemble models include [32], where an ensemble surpassed traditional CNNs in terms of object detection accuracy; [33], where remote sensing imagery was classified with different models and an ensemble of neural and statistical algorithms exceeded single models' performance; and [19], where a cascade neural network for ITC detection improved on single trained models. However, many of the methods to design ensemble models are either specific to the nature of the neural network algorithm, computationally intensive because of intermediate processing tasks, or mathematically complex, leading to a decreased degree of reproducibility. It has been shown [28,34] that, in the design of ensemble models, two of the best methods for obtaining error-independent models are varying the training data and varying the algorithm.
In this study, a novel design of deep learning ensemble models for ITC detection is proposed. It takes advantage of the multiple data products available from UAV and LiDAR scanning. The novelty of this approach consists in the application of the single shot detector (SSD) [35] to a deep learning ensemble model in order to reduce the complexity of implementation and increase the transferability of the design for disparate spatial patterns. The proposed deep learning ensemble models are built using different input remote sensing data and output voting strategies. The objectives of the study are as follows: (1) demonstrate the efficiency of the ensemble model design in ITC detection and species classification compared to single SSD models, (2) establish the ensemble model's performance and limits regarding input data variation and output predictions and (3) demonstrate the transferability of the model in contrasting spatial patterns and image resolutions. The performance of the models is evaluated both globally and at the level of species.

Study Sites and Materials
In order to assess ITC detection under disparate spatial pattern conditions, three different study sites were chosen (Figure 1). The first site is an orchard belonging to the University of Agronomic Sciences and Veterinary Medicine, situated in the NE of Bucharest, Romania, at the approximate coordinates 26°15′43″E longitude 44°30′8″N latitude. The orchard has an area of roughly 47 hectares and consists of plum (Prunus domestica), apricot (Prunus armeniaca) and walnut (Juglans regia) tree species. The trees are planted in straight lines at intervals of 3-5 m and between-trees distances of 1.5-3 m. The stem density is moderate but there are few overlapping tree-crowns, especially for walnut trees. The understory level consists of continuous vegetation cover of small herbaceous plants which do not exceed 10 cm in height. The terrain is predominantly flat, with an overall elevation difference of ~1 m/km. There are few topographic irregularities, but these are very superficial, with slopes below 2 degrees. We used a DJI Phantom 4 UAV to survey the site and capture RGB imagery in late August 2019. The UAV imagery was processed using Drone2Map for ArcGIS, which generated an RGB orthophoto, a Digital Surface Model (DSM) and a Digital Terrain Model (DTM) at 6-cm (RGB) and 10-cm (DSM and DTM) spatial resolutions, respectively.
The second site is located in Brașov county, Romania, at the approximate coordinates 25°15′35″E longitude 45°25′52″N latitude.
The site is a naturally wooded pasture, with the vegetation cover dominated by mixed tree species of Norway spruce (Picea abies) and European beech (Fagus sylvatica). The spatial pattern is heterogeneous, as trees are found either isolated or clustered. The wooded pastures are wide open and are larger in surface than the forested area. The understory level in the forested area is composed of smaller trees from the same two main species. In the wooded pastures, the understory is a continuous vegetation cover of small herbaceous plants. On direct visual assessment, the canopy closure varies between very dense in the forested area to sparse in the wooded pastures. Other spatial features include bushes of Juniperus communis or small rock outcrops. The local terrain is uneven, with moderate slopes and altitudes that vary between 1290 and 1350 m.
The data for this site consisted of a LiDAR point cloud obtained through an airborne laser scanning campaign in the autumn of 2013 using a Riegl LMS-Q560 scanner. An RGB orthophoto was also obtained during the same flight using a multispectral camera. The LiDAR point cloud has an average point density of 22.5 points/m², and the RGB orthophoto has a spatial resolution of 12 cm. We further processed the LiDAR point cloud using the tools available in the ArcGIS 10.8 software and obtained a DSM and a DTM at 1-m spatial resolution.
The third site is located in the city of Erfurt, Thuringia, Germany, and covers the central-west portion of the city at the coordinates 10°59′28″E longitude 50°58′15″N latitude. This is an urban site and consists of a heavily mixed spatial pattern of artificial features and vegetation. Trees are either singular, surrounded by buildings in the residential areas, or bundled in small groups in the park areas. Canopy cover has not been evaluated, but the tree density is higher in parks and other small green fields and lower in built-up areas. As directly assessed on the imagery, the understory is composed of a sparse vegetation cover in the park areas and is totally absent in built-up areas. The terrain is relatively flat, with an overall elevation difference of 30 m/km. The highest altitudes (~300 m) are in the western part of the city and slowly decrease towards the east, along the Gera river, reaching values of ~200 m. An RGB orthophoto at 20-cm spatial resolution as well as a LiDAR-derived DSM and DTM at 1-m spatial resolution were downloaded for free use from the Thuringian State Office for Soil Management and Geographic Information (Thüringer Landesamt für Bodenmanagement und Geoinformation) [36].

Overview
The flowchart of the deep learning tree-detection workflow is shown in Figure 2. First, a series of derived remote sensing products were generated from the RGB orthophoto and DSM using the ArcGIS 10.8 software (Figure 2-P1.1). Then, in addition to the derived products, we also generated two multi-band rasters which combine DSM and RGB information (Figure 2-P1.2). A data products stack (Table 1) was obtained from the DSM and RGB input data products.

Table 1. Remote sensing data products and spatial resolution for each study site. The Original product column indicates the source datasets; the Input product column describes the input products for the single shot detector (SSD) models; Combinations are single-file three-band rasters built from the input products described.

Trees were manually digitized for each site, and labels with species information were assigned to the polygons in order to create a tree label database.

Data Pre-Processing
As stated above, the RGB orthophotos and DSMs were used to derive six additional data products used to train the SSD models (Figure 2-P1.1). From the RGB data, two products were derived: a grayscale image and the first component of Principal Components Analysis (PCA) [37], at the same spatial resolution. Four main products were derived from the DSM: slope, slope normalized for frequency distribution, hillshade and Canopy Height Model (CHM). Slope and hillshade were derived in a standard approach as implemented in ArcGIS/Spatial Analyst version 10.8. The distribution of slope values is typically skewed, so the statistical analysis of the slope layer is often biased [38]. Thus, an additional layer was derived, consisting of slope normalized to frequency distribution by the Box Cox transformation using a tool developed by the authors of [38]. CHM was derived by subtracting the digital terrain model from the DSM.
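The two DSM-derived steps described above can be sketched as follows. This is a minimal illustration: negative-value clipping in the CHM is an assumption, and the one-parameter Box Cox form stands in for the transform that the study estimated with the tool of [38].

```python
import numpy as np

def canopy_height_model(dsm, dtm):
    """CHM = DSM - DTM; negative differences (noise) are clipped to zero
    here, which is an assumption, not a stated processing step."""
    return np.clip(np.asarray(dsm) - np.asarray(dtm), 0.0, None)

def boxcox_transform(x, lam):
    """One-parameter Box Cox transform for positive values; in the study,
    the transform parameter would be fitted to the slope distribution."""
    x = np.asarray(x, dtype=float)
    return np.log(x) if lam == 0 else (x ** lam - 1.0) / lam
```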
In addition to the six derived data products, two three-band rasters were generated with the following layers as bands: Grayscale-DSM-Slope and DSM-Slope-Hillshade, with a spatial resolution equal to that of the DSM (Figure 2-P1.2). In total, the input data products stack consists of 10 rasters, which were used to train separate single shot detector deep learning models with identical parameters.
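The multi-band stacking step can be illustrated with a minimal NumPy sketch; the arrays here are random stand-ins for the real co-registered rasters.

```python
import numpy as np

# Hypothetical 2-D arrays standing in for the co-registered, resampled
# single-band rasters (real data would be read from the raster files).
grayscale = np.random.rand(100, 100).astype(np.float32)
dsm = np.random.rand(100, 100).astype(np.float32)
slope = np.random.rand(100, 100).astype(np.float32)

# Stack the three layers into one three-band raster (band, row, col),
# matching the Grayscale-DSM-Slope combination described in the text.
gray_dsm_slope = np.stack([grayscale, dsm, slope], axis=0)
```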

Preparing the Training and Validation Data
For each site, scattered trees were manually digitized and labeled with species information (Figure 2-P2.1). Understory trees were not digitized, as they are only present in site 2 and the RGB imagery is not of sufficient resolution for this task. In site 1, all the trees from the plot were digitized, as some of them were available from a field campaign. Each dataset of field trees was randomly split into 80% training and 20% validation samples. Table 2 describes the number of labels used for training and validation. Image chips of the labeled tree locations were exported using the ArcGIS API for Python (Figure 3). In order to reduce the risk of overfitting, data augmentation was used to boost the number of training chips to the order of tens of thousands. Augmentation included sample rotation at different angles and a stride shift with 50% overlap, thus obtaining additional subsets from the main chip image. The resulting image chips were rectangular subsets clipped from the input raster data and had different sizes according to the spatial resolution of the rasters, which differs between RGB and DSM. The different resolution of the input data is discernible between the three sites.
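The random split and the stride-shift augmentation described above can be sketched as follows; `split_labels` and `chip_origins` are illustrative helpers, not the ArcGIS API functions actually used.

```python
import random

def split_labels(labels, train_frac=0.8, seed=42):
    """Random 80/20 train/validation split of the tree label database."""
    shuffled = labels[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def chip_origins(width, height, chip_size, overlap=0.5):
    """Top-left origins of rectangular chips clipped from a raster with
    a stride giving the stated 50% overlap."""
    step = int(chip_size * (1 - overlap))
    return [(x, y)
            for y in range(0, height - chip_size + 1, step)
            for x in range(0, width - chip_size + 1, step)]
```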
The shape of the tree-crown polygon is irrelevant to the training process, as the exported image chips store the polygon geometry as extent coordinates, which always describe a rectangular shape.
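Reducing a digitized polygon to its extent coordinates amounts to a one-line bounding-box computation (illustrative sketch):

```python
def polygon_extent(vertices):
    """Bounding-box extent (xmin, ymin, xmax, ymax) of a digitized
    tree-crown polygon; the exact crown shape is discarded."""
    xs, ys = zip(*vertices)
    return min(xs), min(ys), max(xs), max(ys)
```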
Next, SSD deep learning models were trained using the ArcGIS API for Python (Figure 2-P3). The SSD (Figure 4) is implemented in the API using the Fast.AI [39] and PyTorch [40] deep learning frameworks. SSD achieves high speed and accuracy through the use of multiple default boxes of different sizes and aspect ratios for detecting features. Predictions from multiple feature maps of different resolutions are combined to handle objects of various sizes [35]. The training of the SSD was done using the ResNet-152 architecture [41] from Torchvision version 0.3.0 [42].
To further reduce overfitting, in addition to data augmentation, early stopping was used for all models, which stopped the training if the validation loss did not improve for 5 epochs. The full model architecture with all parameters and layers is presented in Appendix A.
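The early-stopping rule can be sketched as follows, with `train_one_epoch` and `validate` as hypothetical stand-ins for the actual training loop:

```python
def fit(train_one_epoch, validate, max_epochs=100, patience=5):
    """Early stopping on validation loss: stop when the loss has not
    improved for `patience` consecutive epochs."""
    best_loss, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_loss = validate()
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_loss, best_epoch
```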


Detecting with Single Models
The trained SSD individual models on the ten input data products were used to predict tree locations and species in the three chosen sites. For each data product, multiple rectangular bounding boxes around predicted trees were obtained (Figure 2-P4), each bounding box having a confidence score ranging from 0 to 1, which indicates the degree of certitude for the presence of a tree. Afterwards, a non-maximum suppression algorithm [44] was used to remove the redundant bounding boxes that overlapped, by keeping the one with the highest confidence score. Finally, all bounding boxes with a confidence score < 0.2 were removed, thus retaining the best bounding box candidates to validate the detection algorithm.
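The post-processing steps above can be sketched as a greedy non-maximum suppression followed by the confidence cut-off; this is an illustration, not the exact algorithm of [44].

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def nms(boxes, scores, iou_thresh=0.5, score_thresh=0.2):
    """Greedy NMS: keep the highest-scoring box among overlapping ones,
    then drop boxes below the 0.2 confidence cut-off used in the text."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if scores[i] < score_thresh:
            continue
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep
```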
The validation samples, along with the predicted tree bounding boxes, were used to compute the intersection-over-union (IoU) statistic. IoU is a geometric statistic which measures the area of the intersection of the ground-truth and predicted bounding boxes divided by the area of their union. This indicator is very commonly used for the validation of deep learning object detection models, and an IoU > 0.5 is generally accepted as a proper threshold for a successful detection [21,32,45,46]. An IoU > 0.5 was therefore the threshold used to select the bounding boxes that were further statistically processed to assess the detection performance. The validation statistics used in reporting the results are the detection percentage and the f1-Score diagnostic [47], both computed using the metrics of recall and precision. Equations (1)-(5) describe the validation statistics used:

Recall = TP / (TP + FN)  (1)
Precision = TP / (TP + FP)  (2)
f1-Score = 2 × (Precision × Recall) / (Precision + Recall)  (3)
f1-weighted = Σn wn × f1n  (4)
Detection percentage = (TP / TS) × 100  (5)

TP (true positive) indicates the number of trees successfully detected; FP (false positive) denotes the number of objects incorrectly detected as trees; FN (false negative) indicates the number of trees not detected; f1n indicates the f1-Score computed for a certain species; wn indicates the weight attributed to each species; and TS denotes the total number of trees used for validation. The weights used for computing the overall f1-Score are proportional to the number of validation samples per species. The f1-Score (Equation (3)) was used to assess the performance of species classification; therefore, it was only computed for sites 1 and 2, where species information was available. Due to the class imbalance between species, a weighted f1-Score (Equation (4)), as implemented in the Scikit-learn package [48], was used when reporting the f1-Score for the overall performance of a model, regardless of species.
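A minimal sketch of the validation statistics defined above, computed from the counts TP, FP, FN and TS (not the authors' code):

```python
def detection_metrics(tp, fp, fn, ts):
    """Recall, precision, f1-Score and detection percentage from the
    counts defined in the text (TP, FP, FN, TS)."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "f1": f1,
            "detection_pct": 100.0 * tp / ts}

def weighted_f1(f1_per_species, weights):
    """Overall f1-Score as a weighted sum of per-species f1-Scores, with
    weights proportional to the number of validation samples."""
    return sum(w * f for f, w in zip(f1_per_species, weights))
```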

Ensemble Learning
The ensemble models were created from the bounding box outputs of the single models (Figure 2-P5). The bounding boxes predicted by the single models were stacked together in different input data product combinations (see Figure 2). Ensemble models were created by handling the ten input data products as a mathematical set: taking all k-combinations of the ten products yields 2^10 - 1 = 1023 unique combinations, excluding the empty set. In order to reduce data redundancy and improve variation, the combinations with duplicate input information were excluded, that is, the cases where any of the data product pairs described in Table 3 appear in a set. In this manner, the number of possible combinations was reduced to 150.

Table 3. Invalid input product pairs for the ensemble models.
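The combination counting and filtering logic can be sketched with `itertools`; the invalid pairs below are hypothetical placeholders for the actual pairs listed in Table 3.

```python
from itertools import combinations

PRODUCTS = ["RGB", "Grayscale", "PCA", "DSM", "Slope", "BoxCox",
            "Hillshade", "CHM", "Gray-DSM-Slope", "DSM-Slope-Hillshade"]

# Hypothetical invalid pairs; the actual redundant pairs are those of Table 3.
INVALID_PAIRS = [{"Slope", "BoxCox"}, {"DSM", "CHM"}]

def valid_ensembles(products, invalid_pairs):
    """All non-empty k-combinations of the products, minus those that
    contain any invalid (redundant) product pair."""
    all_combos = [set(c)
                  for k in range(1, len(products) + 1)
                  for c in combinations(products, k)]
    assert len(all_combos) == 2 ** len(products) - 1  # 1023 for 10 products
    return [c for c in all_combos
            if not any(pair <= c for pair in invalid_pairs)]
```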

Using the validation dataset, each ensemble model was first validated for tree detection by applying a veto rule and a threshold IoU > 0.5. The veto rule is implemented for tree detection only and accepts all single input predictions, counting a correct tree detection if at least one single model part of the ensemble reaches IoU > 0.5. Secondly, for species identification, a voting strategy [49] was implemented. In this voting strategy, each single SSD model that was part of an ensemble which correctly identifies a tree was treated as binarized output (true/false). For each ensemble model, an array was constructed which contained true/false tree detection values, as well as tree species information corresponding to each single SSD model in the ensemble.
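The veto rule reduces to a one-line check over the member models' IoU values (illustrative sketch):

```python
def ensemble_detects(member_ious, thresh=0.5):
    """Veto rule: the ensemble counts a correct tree detection if at least
    one member single model reaches IoU > thresh for that tree."""
    return any(i > thresh for i in member_ious)
```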
Then, a voting algorithm takes the output array of an ensemble and decides on the species detected. We tested four voting strategies, as follows:
• Majority: the majority of single models must agree on the output species;
• Unison: all single models must agree on the output species;
• Confidence: the ensemble outputs the species with the highest SSD confidence value;
• Weighted: the output species is given by a weighted sum that applies weights based on the single models' accuracy in terms of the f1-Score.
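The four voting strategies can be sketched as below, assuming each member model's output is represented as a hypothetical (species, confidence, model_f1) tuple; this is an illustration, not the exact implementation of [49].

```python
from collections import Counter

def vote(predictions, strategy):
    """predictions: one (species, confidence, model_f1) tuple per single
    model that correctly detected the tree; returns the species, or None
    when the strategy's agreement condition is not met."""
    species = [p[0] for p in predictions]
    if strategy == "majority":
        label, count = Counter(species).most_common(1)[0]
        return label if count > len(species) / 2 else None
    if strategy == "unison":
        return species[0] if len(set(species)) == 1 else None
    if strategy == "confidence":
        return max(predictions, key=lambda p: p[1])[0]
    if strategy == "weighted":
        scores = Counter()
        for label, _conf, f1 in predictions:
            scores[label] += f1
        return scores.most_common(1)[0][0]
    raise ValueError(strategy)
```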
The validation results of the ensemble models under each voting strategy were finally processed statistically (Figure 2-P6) to estimate the detection performance, using the same indicators presented in Section 2.2.5 for the single models.

Single Product Models
As seen in Table 4, the overall detection rates were below 50%, with the notable exceptions of RGB in site 2 (64.73% detection) and site 3 (73%), Grayscale in site 3 (60.33%) and Box Cox in site 1 (56.7%). However, the detection varies by species. For example, in site 1 (Figure 5), the walnut detection values are much higher (by 20-30%) than those of the other species in almost all products except Box Cox. In site 2 (Figure 6), the coniferous species has a slightly higher detection percentage (by 5-10%) for all products except RGB and DSM-Slope-Hillshade.

The variation between species shows a more pronounced discrepancy in terms of accuracy for site 1 (Figure 8a). Walnut trees reach a maximum f1-Score of 0.76 in Grayscale and markedly higher values than the other species in the remaining products, except for Box Cox. Plum trees have the lowest accuracy, with f1-Score values consistently below 0.4, except for Box Cox, where they reach 0.7. In site 2 (Figure 8b), the coniferous and deciduous trees have roughly the same f1-Score in RGB, while for the rest of the models, the deciduous species have a lower accuracy by a margin of 0.1-0.2, except for DSM-Slope-Hillshade.

Ensemble Models
The ensemble models reached overall detection values of over 70% in all sites (Table 5). In site 1, the highest detection rate reached 76.8% with the combination of DSM + Slope + Hillshade + PCA + Box Cox + CHM. The other combinations reached nearly the same detection percentage, with differences of 0.2-0.8%. In site 2, the highest detection percentage was 71.82%, reached equally by the first two models; the next two models showed only a ~0.2% detection reduction. In site 3, the first four models all reached a maximum detection percentage of 76.33%.

Table 5. Best performing ensemble models in terms of % overall tree detection in each site, based on the majority voting strategy. If two ensemble models had the same detection percentage, the one with the fewest individual single models is reported.

Regarding species differences, Tables 6 and 7 summarize the detection percentages and f1-Scores of the best performing ensemble model in each of the two sites with species information.

Ensemble Models
The ensemble models reached overall detection values of over 70% in all sites (Table 5). In site 1, the highest detection rate reached 76.8% with the combination of DSM + Slope + Hillshade + PCA + Box Cox + CHM. The other combinations reached nearly the same detection percentage, with differences of 0.2-0.8%. In site 2, the highest detection percentage is 71.82%, reached equally by the first two models; the next two models show only a ~0.2% reduction. In site 3, the first four models all reach a maximum detection percentage of 76.33%.

Table 5. Best performing ensemble models in terms of % overall tree detection in each site, based on the majority voting strategy. If two ensemble models had the same detection percentage, the one reported is the one with the fewest individual single models.

Regarding species differences, Tables 6 and 7 summarize the detection percentages and f1-Scores for the best performing ensemble model in each of the two sites.

Table 6. Detection percentage per species and voting strategy of the best performing ensemble models in terms of % of trees detected (Det. per. %) in sites 1 and 2. If two ensemble models had the same detection percentage, the one reported is the one with the fewest individual single models.

Table 7. Detection performance per species and voting strategy of the best ensemble models in terms of f1-Score (f1-S.) in sites 1 and 2. If two ensemble models had the same f1-Score, the one reported is the one with the fewest individual single models.

In site 1, the walnut species has the best detection percentage of 90.29% for the weighted method on the combination of DSM + Slope + Hillshade + Grayscale + Box Cox + CHM, and also the best f1-Score of 0.925 for the majority method on the combination of DSM + Slope + Hillshade + Grayscale + CHM. The plum and apricot species have similar detection values and f1-Scores, at around 59-71% and 0.73-0.82, respectively, with apricot slightly underperforming.
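The majority voting behind these ensemble results can be sketched as follows. This is a simplified, hypothetical implementation: boxes predicted by the individual SSD models are greedily grouped by overlap, and a group is kept only when a strict majority of the models contributed a box to it. The axis-aligned box format and the IoU threshold of 0.5 are assumptions, not details taken from the text.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def majority_vote(per_model_boxes, iou_thr=0.5):
    """per_model_boxes: one list of predicted boxes per single SSD model.
    Greedily groups overlapping boxes across models, then keeps a group
    only if a strict majority of the models contributed a box to it."""
    n_models = len(per_model_boxes)
    groups = []  # each group is a list of (model_index, box) pairs
    for m, boxes in enumerate(per_model_boxes):
        for box in boxes:
            for g in groups:
                if iou(box, g[0][1]) >= iou_thr:  # match against group anchor
                    g.append((m, box))
                    break
            else:
                groups.append([(m, box)])
    return [g[0][1] for g in groups
            if len({m for m, _ in g}) > n_models / 2]
```

For instance, with three models where two agree on an overlapping crown and the third predicts a distant box, only the agreed crown survives the vote; boxes supported by a single model are discarded.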

In site 1, walnut species has the best detection accuracy of over 90.29% for the weighted method on the combination of DSM + Slope + Hillshade + Grayscale + Box Cox + CHM and also the best f1-Score of 0.925 for the majority method on the combination of DSM + Slope + Hillshade + Grayscale + CHM. The plum and apricot species have similar detection values and f1-Scores, at around 59-71% and 0.73-0.82%, respectively, with apricot slightly underperforming.
For each site, a frequency evaluation of each input data product was also performed for the top 15% of the ensemble models, based on f1-Scores for sites 1 and 2 and on the percentage of trees detected for site 3. All results are summarized in Table 8. The most heavily present input data products are Box Cox in site 1, with a count of 23 across all voting methods, and RGB in sites 2 and 3, with 22 counts across all voting methods. Other input data products, such as CHM, Hillshade, Slope and DSM, are also frequent in all sites. Some products have very low or no contribution to ensemble models, such as the three-band rasters Grayscale-DSM-Slope and DSM-Slope-Hillshade. We next investigated whether there is a relationship between the number of input data products in an ensemble model and the detection percentage or accuracy of the model. In Figure 9, the maximum detection percentage and f1-Score are plotted against the number of single models in an ensemble model. For both sites, it is observed that combining single models into an ensemble model generally increases the detection percentage and, to a lesser degree, the f1-Score. Furthermore, by looking at the percentage difference in detection performance or f1-Score when adding a new single model to an ensemble model, it can be observed (Figure 10) that the maximum increase for both indicators is reached when adding just one more single model. For example, an increase of 18.8% in detection performance and a 14.3% increase in f1-Score were recorded for site 1.
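The frequency evaluation summarized in Table 8 amounts to tallying how often each input data product appears among the top-ranked ensembles. A minimal sketch with invented ensemble compositions and scores (here the top two of four stand in for the top 15%):

```python
from collections import Counter

# Invented ensemble compositions and f1-Scores, for illustration only
ensembles = [
    (("RGB", "DSM", "CHM"), 0.81),
    (("RGB", "Box Cox", "Slope"), 0.80),
    (("Grayscale", "CHM", "Hillshade"), 0.79),
    (("RGB", "CHM", "Hillshade"), 0.71),
]

# Rank by f1-Score and keep the top slice (top two of four here)
top = sorted(ensembles, key=lambda e: e[1], reverse=True)[:2]

# Tally how often each input data product appears in the top ensembles
freq = Counter(product for members, _ in top for product in members)
print(freq.most_common(1))  # [('RGB', 2)]
```

A product with a high tally, such as Box Cox in site 1 or RGB in sites 2 and 3, is one that the best performing ensembles consistently rely on.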
Sites 2 and 3 follow a similar trend but with slightly lower values: a 6.74% increase in detection performance and a 1.5% increase in f1-Score in site 2, and a 2.7% increase in detection performance in site 3. Continuing to add new single models to an ensemble model leads to rapidly diminishing gains in the accuracy parameters, or even to a decrease in f1-Score, as observed in site 2. The same relationship between the number of single models in an ensemble and the indicators of model performance is observed in all sites, although at different scales.


Single vs. Ensemble Models
Designing an ensemble with only two single models increased the detection accuracy by a large margin of 3-18%. Adding more single models further increased the performance, but more slowly, and in some cases actually decreased the accuracy. This effect has also been observed in [17] and [18] for certain species, in terms of detection performance on similar remote sensing imagery, and can be attributed to subtle spectral differences between species and other small-scale effects which are not captured by the deep learning algorithm. Another explanation could be that increasing the number of combined models in an ensemble decreases the error independence between the single models. Although interspecies differences are present, the overall performance and effects of ensemble models are common to all sites (Figure 10) and clearly show superior efficiency. This validates and strengthens the general finding [31] that ensemble neural networks are superior to single ones in terms of accuracy, not only in the context of non-geospatial data but also in applications that deal with high-resolution remote sensing data. Studies which designed and tested ensemble models with remote sensing data reported analogous outcomes, albeit with different ensemble model designs.
For example, in object detection tasks, the authors of [17] proposed an ensemble model fusing hyperspectral imagery and LiDAR data for tree-crown detection, which achieved accuracy superior to single trained models by a margin of 5-15%; the authors of [32] designed an ensemble model for object detection in remote sensing imagery at different scales, which outperformed most traditional CNNs in terms of accuracy by 5-10%, especially for densely packed objects; and the authors of [18] implemented a fusion of remote sensing imagery and LiDAR information in the training data space of a dense convolutional network, achieving accuracy for tree detection in an urban environment that was 5-10% higher than that of single product trained models. Ensemble models used for remote sensing imagery classification report analogous results. For instance, the authors of [33] designed a similar voting strategy ensemble for the classification of remote sensing imagery, which resulted in substantially better performance than single models, at a statistical significance level of over 95%. In [50], an ensemble model fusing two CNN architectures for land classification using remote sensing data achieved an accuracy up to 4% higher than that of the single CNN models tested; the authors of [51] implemented a cascade ensemble from hyperspectral imagery and LiDAR data that outperformed traditional CNN-based methods in a classification task by 4-8%. On account of all the above, we conclude that the ensemble models surpassed the single product trained models by a large margin in terms of both detection percentage and accuracy.

Input Data Products Performance vs. Image Resolution
The input data products, including the derived ones, varied in performance by species and site. Table 8 presents the frequency of input data products in the top 15% best performing ensemble models. RGB is the most frequent in sites 2 and 3, where it appears 22 times, while in site 1 it appears only 6-8 times. Sites 2 and 3 have similar pixel sizes of 12 cm and 20 cm, respectively, while site 1 has a 6-cm spatial resolution. In site 1, due to the ultra-high spatial resolution, the spectral information of objects is rich. Consequently, the discrimination between species is hindered, especially between the plum and apricot species, which, on visual interpretation of the imagery data, appear almost structurally and spectrally indistinguishable (Figure 11). Therefore, species discrimination is driven by adding data derived from the DSM, namely Box Cox, which is present 23 times, followed by CHM, Hillshade and Slope, with frequencies of 17-19 times. The DSM and CHM performances for this site are roughly the same in terms of overall detection percentage (Table 4) and contribution to ensemble models (Table 8). The same effect is observed in site 3, which has a similar, relatively flat topography. This suggests that in flat topography, the DSM and CHM carry largely the same importance for individual tree-crown detection.
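The Box Cox product referenced above is a power transform of the DSM values. A minimal numpy sketch, assuming strictly positive elevation values and a transform parameter λ selected by a coarse grid search over the Box Cox profile log-likelihood; the grid, the synthetic elevation values and the direct application to raw DSM values are illustrative assumptions, and the paper's actual preprocessing may differ:

```python
import numpy as np

def boxcox_transform(x, lambdas=np.linspace(-2.0, 2.0, 81)):
    """Box Cox transform with the parameter lambda chosen by maximizing
    the profile log-likelihood. x must be strictly positive."""
    x = np.asarray(x, dtype=float)
    n = x.size
    log_x = np.log(x)
    best_ll, best_lam, best_y = -np.inf, None, None
    for lam in lambdas:
        # lam == 0 corresponds to the log transform
        y = log_x if abs(lam) < 1e-8 else (x ** lam - 1.0) / lam
        ll = -0.5 * n * np.log(y.var()) + (lam - 1.0) * log_x.sum()
        if ll > best_ll:
            best_ll, best_lam, best_y = ll, lam, y
    return best_y, best_lam

# Synthetic, right-skewed "elevation" values standing in for a DSM band
dsm = np.random.default_rng(0).lognormal(mean=5.0, sigma=0.5, size=1000)
transformed, lam = boxcox_transform(dsm)
```

The effect of such a transform is to compress the skewed elevation distribution toward a more symmetric one, which may make crown-height variation easier for the detector to exploit than the raw DSM values.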

Figure 11. Spectral differences between apricot (a) and plum (b). Unmanned aerial vehicle (UAV) RGB imagery at 6-cm spatial resolution.
In site 2, on the other hand, species discrimination is driven primarily by the RGB information. As seen in Table 4, the DSM yields no detections as a single model, which might be due to the significant differences in elevation values between trees. The detrimental effect of the DSM on tree detection in uneven terrain has also been shown in [52]. However, the coniferous and deciduous trees have clear spectral and structural differences, information which drives the distinction between these species. With counts of 22 for RGB, compared to 10-12 for the DSM and its derivatives, the advantage of RGB in this context is clear.
The influence of canopy cover on tree detection accuracy was not evaluated. In a recent study [53], different tree-crown detection algorithms were assessed under various canopy cover conditions, and a reduction in accuracy directly related to an increase in canopy cover rate was reported. However, the authors report a high dependency of the tree detection methods on spatial resolution and did not employ deep learning techniques. Given the large diversity of spatial features between our sites and the consistently good results for tree-crown detection across image resolutions, we argue that the ensemble models are robust enough to deal with forested sites that have a high rate of canopy cover. The risk of overfitting is greatly reduced by data augmentation and the early stopping procedure. In addition, the ensemble models are designed to reduce information duplication, which can lead to overfitting, by removing invalid combinations (Table 4).

Performance of Voting Strategies
In sites 1 and 2, we tested the performance of the voting strategies for species discrimination. Tables 6 and 7 summarize the detection percentages and f1-Scores for each species and voting strategy. In site 1, the majority strategy outperforms the other three by 1-5% for the plum and apricot species, while the weighted strategy performs better for walnut. In site 2, the weighted strategy surpasses the other ones in nearly all cases by a margin of 1-3%. In a study [54] which tested two box voting strategies for ensemble models, an accuracy difference of 2% was found, which is similar to our results. The differences between voting strategies, while small at first glance, are actually indicative of slight divergences in the voting design, which can be exploited to satisfy the various objectives of an object detection task with ensemble models. In detail, they can be used to balance the precision and recall metrics in order to better accomplish the detection objective. For example, in Table 9, the unison strategy, which accepts a detection only when all models in the ensemble give the same result, yields the highest precision in all sites and, at the same time, has the lowest recall. For an ensemble model which needs to maximize precision, the unison voting strategy may be used. Conversely, if a model needs to have good recall, that is, to find as many objects as possible while still maintaining good precision, a weighted or confidence strategy may be used.
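The voting strategies compared above differ only in the acceptance rule applied to a candidate detection once the single-model predictions have been matched. A hypothetical sketch of such rules, with illustrative weights and confidence threshold; the paper's exact formulations may differ:

```python
def accept(votes, strategy, n_models, weights=None, conf_thr=0.5):
    """votes: (model_index, confidence) pairs supporting one candidate box.
    Returns True if the candidate detection is kept under the strategy."""
    supporters = {m for m, _ in votes}
    if strategy == "unison":       # all models must agree: highest precision
        return len(supporters) == n_models
    if strategy == "majority":     # more than half of the models agree
        return len(supporters) > n_models / 2
    if strategy == "weighted":     # per-model weights, e.g. validation accuracy
        w = weights or {m: 1.0 for m in range(n_models)}
        return sum(w[m] for m in supporters) > sum(w.values()) / 2
    if strategy == "confidence":   # mean confidence of supporters vs. threshold
        return sum(c for _, c in votes) / len(votes) >= conf_thr
    raise ValueError(f"unknown strategy: {strategy}")

votes = [(0, 0.9), (1, 0.8)]          # 2 of 3 models support this box
print(accept(votes, "majority", 3))   # True
print(accept(votes, "unison", 3))     # False: unison trades recall for precision
```

Stricter rules (unison) reject any box without unanimous support, raising precision at the cost of recall, while looser rules (weighted, confidence) keep more candidates, which matches the precision-recall trade-off reported in Table 9.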

Conclusions
In this article, an ensemble deep learning design based on the single shot detector (SSD) model was developed for individual tree-crown detection and species classification from very high-resolution remote sensing data. The design was tested in study sites with disparate spatial patterns. The results showed an increased performance of ensemble models compared to single ones by a margin of 3-18%. RGB information was found to be the most important factor influencing species identification. DSM-derived data were shown to have significant importance in species discrimination, especially in the structurally complex site 1, where RGB-trained models performed poorly. Lastly, the voting strategies for combining the outputs allowed us to better tune the ensemble models to accommodate specific detection objectives. Due to the common effects observed for the ensemble models, our design proposal has been shown to have notable transferability across disparate tree conditions. Further research is necessary to investigate more complex data derivatives from RGB and DSM, in addition to those presented in this study.

Acknowledgments: Some of the datasets used for this study were obtained from the collaboration between the Faculty of Geography, University of Bucharest, and the Faculty of Horticulture from the University of Agronomic Sciences and Veterinary Medicine of Bucharest (Adrian Peticila). We also want to acknowledge the contribution of Radu Irimia for image collection and tree species mapping. The authors also acknowledge the support of Esri Romania for kindly providing the hardware infrastructure to train the deep learning models. We sincerely thank the three anonymous reviewers, whose valuable suggestions and comments helped to greatly improve the quality of the article.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A
The following tables describe the full architecture (layers and parameters) of the SSD models used. The total number of parameters is 24,141,904, and the number of hidden layers is 105. The learning rate, batch size and other training options are also presented.