A Novel Deep Learning Method to Identify Single Tree Species in UAV-Based Hyperspectral Images

Abstract: Deep neural networks are currently the focus of many remote sensing approaches related to forest management. Although they return satisfactory results in most tasks, some challenges related to hyperspectral data remain, like the curse of data dimensionality. In forested areas, another common problem is the highly-dense distribution of trees. In this paper, we propose a novel deep learning approach for hyperspectral imagery to identify single-tree species in highly-dense areas. We evaluated images with 25 spectral bands ranging from 506 to 820 nm taken over a semideciduous forest of the Brazilian Atlantic biome. We included a band combination selection phase in our network's architecture. This phase learns, from multiple combinations of bands, which ones contribute the most to the tree identification task. This is followed by a feature map extraction and a multi-stage model refinement of the confidence map to produce accurate results for a highly-dense target. Our method returned f-measure, precision, and recall values of 0.959, 0.973, and 0.945, respectively. The results were superior when compared with a principal component analysis (PCA) approach. Unlike other learning methods, ours estimates, within the network's architecture, the combination of hyperspectral bands that contributes most to the mentioned task. With this, the proposed method achieved state-of-the-art performance for detecting and geolocating individual tree species in UAV-based hyperspectral images in a complex forest.


Introduction
The rapid development of lightweight sensors, associated with the market availability of unmanned aerial vehicles (UAV), has contributed to the development of techniques for fast and accurate acquisition of surface information [1]. In forest monitoring, UAV-based images have become a powerful tool to constantly monitor regional or local areas because UAVs offer advantages related to operational costs WorldView-2/3 to classify urban tree species with the Dense Convolutional Network (DenseNet) method [37]. Nevertheless, RGB and multispectral sensors cannot provide a similar amount of spectral information as hyperspectral sensors. This type of spectral heterogeneity can significantly contribute to tree-species differentiation [17].
Although the previously mentioned studies returned satisfactory performances for most tasks regarding tree detection, some challenges related to hyperspectral data are still faced by the remote sensing community. One of them is the Hughes phenomenon, also called the curse of dimensionality. This issue is persistent, especially when dealing with small sample sizes [38]. The high dimensionality of data can be problematic even for deep neural networks because an increased number of features may decrease their performance, as it introduces noise and sparsity in the feature space [39]. When applying a CNN, which is one of the most commonly used deep learning architectures for image and pattern recognition [40], data dimensionality reduction approaches are therefore usually required. For this purpose, either a principal component analysis (PCA) or mutual information is normally used [41].
In many environments, hyperspectral data can deliver highly detailed views of objects according to their response to the analyzed spectral band. In many cases, it is common to use a band selection step to identify the bands that best characterize the object of interest [42]. A PCA [43] is a common example of a band selection technique widely used in data analysis [11,44,45]. The PCA is a linear scheme for reducing the dimensionality of high-dimensional data [46]. Still, PCA learns to reduce the spectral bands without considering the target position such as individual trees or any other information in a supervised manner. Therefore, with the growth in data volumes due to the large increase of spectral bands, more efficient methods are needed.
Another challenge related to remote sensing images of forested areas comes from the high density of their environment. The spectral differences between tree and non-tree pixels are important because the brighter pixels are often recognized as the tree crown, while darker pixels are viewed as indicative of its boundary [47,48]. In highly-dense areas, this type of differentiation can be difficult, even for deep neural network-based approaches, as some of them rely on bounding boxes [32,49]. For this reason, in a previous study, we developed a CNN-based method to deal with highly-dense vegetation [50]. In that study, however, we evaluated only a preliminary version of our network to identify citrus trees in an orchard. This method, implemented with data captured by a multispectral sensor on a UAV platform, significantly outperformed object detection methods based on bounding-box estimation like RetinaNet and Faster-RCNN.
The aforementioned challenges still impose problems for UAV hyperspectral data processing, and we intend to fill part of this gap in the forest environment context. In this paper, we propose a novel deep learning method for hyperspectral imagery to detect and geolocate single-tree species in a tropical forest. Our approach was constructed to cope with a highly-dense scene while implementing a strategy to deal with the Hughes phenomenon. Unlike a PCA, which is considered a pre-processing step, we aim to estimate, within the network's architecture, the combination of hyperspectral bands that contributes most to the mentioned task. For this, we included a band selection phase as the initial step of our network. The phase learns, from multiple combinations of bands, which ones contribute the most to the tree identification task. This is followed by a feature map extraction and a multi-stage model refinement of the confidence map to produce an accurate result of the tree geolocation in a highly-dense scene. The rest of the paper is organized as follows: Section 3 describes in detail the method adopted; Section 4 presents and discusses the results; and, finally, Section 5 summarizes the main conclusions.

Study Area
To assess the proposed method, we used a transect area inside a forest fragment known as Ponte Branca (Figure 1). The Ponte Branca fragment is composed of a submontane semideciduous forest, which is part of the Black-Lion-Tamarin Ecological Station, in the countryside of the western region of the São Paulo state, in Brazil. The area has been protected by governmental laws since 2002 [51,52] and suffered illegal logging until the end of the 1970s [53]. From the 1970s to the 2000s, forest degradation was noticed in the northern part of Ponte Branca [54], where the transect is located. In the transect area, more than 20 tree species were encountered [17,53,54]. These are considered pioneer and secondary tree species, with the majority considered to be in the primary stage of regeneration [17,54].
Remote Sens. 2020, 12, x FOR PEER REVIEW 4 of 19
Among the tree species present in this area, Syagrus romanzoffiana is a key species since it is one of the most common palm trees in the Brazilian Atlantic forest [55]. Palm trees can be considered key species in tropical forests because of their abundance of fruits and seeds and their contribution to the forest structure [56,57]. Syagrus romanzoffiana is an evergreen, shade-tolerant tree with great potential to be used for fauna restoration and conservation [58]. As Syagrus romanzoffiana blooms and produces fruits almost the entire year [55,59], it can be related to animal dispersion. Its fruits are consumed by at least 60 different vertebrate species [60]. Among the frugivorous animals, there are crab-eating, raccoons and mainly, tapirs [55,61]. Besides, Syagrus romanzoffiana density can be related to the successional stage of forests in the area. According to the Brazilian Ministry of the Environment [58], there is a higher number of Syagrus romanzoffiana samples in early secondary forests than in late secondary forests. In this manner, this tree species can be used as an indicator of forest regeneration: a higher frequency of Syagrus romanzoffiana indicates that the Atlantic forest is in the initial stage of regeneration, whereas a lower frequency indicates a more preserved forest.

Image Acquisition
The images that composed the dataset were acquired on 16 August 2016, 1 July 2017, and 16 June 2018, during the winter and dry season, using a Rikola hyperspectral camera (Senop Oy, Oulu, Finland) onboard a UX4 UAV quadcopter (Nuvem UAV, Presidente Prudente, Brazil). This camera produces 25 spectral bands ranging from 506 nm to 820 nm, which were acquired over the transect area depicted in Section 2.1 (Figure 1, Table 1). Each image datacube is acquired by the two CMOS sensors of the camera, both with a pixel size of 5.5 µm and a frame format of 1017 × 648 pixels. The flights were conducted 160 m above the ground at a speed of 4 m·s⁻¹, providing images with a ground sample distance (GSD) of 10 cm and forward and side overlaps higher than 70% and 50%, respectively. After the image acquisition, the dark current correction was performed with a dark image acquired before the flight campaign. Next, geometric processing was carried out in the Agisoft PhotoScan software (version 1.3) (Agisoft LLC, St. Petersburg, Russia) using initial interior orientation parameters (IOPs) and exterior orientation parameters (EOPs) from the Global Positioning System (GPS) receiver of the camera. Additionally, during the bundle block adjustment process, three ground control points (GCPs) were used for each flight. The geometric processing was carried out for the bands centered at 550.39 nm, 609.00 nm, 679.84 nm, and 769.89 nm of each dataset, and the remaining bands were estimated by the method developed in [62,63]. The following products were created during this process: refined EOPs and IOPs, a sparse point cloud, and a digital surface model (DSM) of the area.
In a subsequent step, we used the EOPs, IOPs, sparse point cloud and DSM of the area for the radiometric block adjustment. This step is based on the methodology developed by Honkavaara et al. [62,64] and aims to reduce illumination differences among images and to correct them from the Bidirectional Reflectance Distribution Function (BRDF) effects. The radiometric process was carried out in the radBA software [62,64] and uses common points among the images, the Sun position (i.e., the Sun zenithal and azimuthal angles), and the incident and reflected angles of each pixel. As the final product, we obtained the orthomosaics of each year radiometrically corrected. Moreover, the empirical line [65] was applied to transform the digital numbers (DN) into reflectance factor values. The empirical line parameters were calculated using three radiometric reference targets colored in light-grey, grey and black. More details about radiometric block adjustment can be seen in [17,62,64,66]. It is worth noting that from now on the hyperspectral orthomosaic will be referred to as hyperspectral image.

Proposed Method
The proposed CNN method takes a hyperspectral image as input and computes the individual tree positions. The hyperspectral image has 25 bands with w × h pixels each. The tree identification and location are modeled as a 2D confidence map estimation, following the procedures described in [50,67]. The confidence map is a 2D representation of the likelihood of a tree occurring at each pixel of the image. First, the hyperspectral images go through a band learning process before the feature map is extracted. This allows the method to improve its accuracy by learning the best band combination for tree detection. We included the Pyramid Pooling Module (PPM) [68], which uses global and local information to improve the estimation of the confidence map. Besides, we implemented a multi-stage prediction that refines the confidence map toward a more accurate prediction of the center of the trees. Figure 2 presents our approach for tree detection and geolocation. The method starts with a band-learning module that is responsible for learning m new bands from the hyperspectral image (Figure 2b). Next, a feature map (Figure 2c) is extracted using the output volume of the band-learning module. This feature map obtains global and local neighborhood information when passing through the PPM (Figure 2d). The volume is then processed by a Multi-Stage Module (MSM) (Figure 2e) with T stages to refine the tree detection. Finally, we obtain the trees' positions (Figure 2f) at the end of the method.
The following sections detail the main modules of the proposed method: Section 2.3.1 shows the band learning module; Section 2.3.2 presents the feature map and its enhancement with the PPM module; and Section 2.3.3 presents the refinement of the confidence map by the MSM and the extraction of the tree positions.

Band Learning Machine Module
To improve the band selection process of our network, we propose an end-to-end band learning module. This module receives a hyperspectral image with w × h pixels and 25 bands and learns m filters of size 1 × 1 × 25 to generate an output image with dimensions w × h × m. Figure 3 illustrates an example of the application of the last filter, represented by the yellow color. Each filter is convolved through the input image (Figure 3a) with a stride of 1 pixel, creating a corresponding output volume (Figure 3c). During training, each filter has its weights adjusted to detect the bands that have more influence on the single-tree detection task. In this way, the layers that respond most strongly when detecting objects are enhanced, while the others are discarded in the process.
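Because each filter has size 1 × 1 × 25 and a stride of 1, the module amounts to a per-pixel linear combination of the input bands. The following NumPy sketch illustrates only this forward computation; the function name `band_learning`, the random weights (stand-ins for trained ones), and the patch contents are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def band_learning(image, weights, bias=None):
    """Apply m filters of size 1 x 1 x B to an (h, w, B) image.

    A 1 x 1 convolution with stride 1 is a per-pixel linear combination
    of the B input bands, so it reduces to a matrix product over the
    band axis.
    """
    h, w, b = image.shape
    assert weights.shape[0] == b, "one weight per input band is required"
    out = image.reshape(-1, b) @ weights          # shape (h * w, m)
    if bias is not None:
        out = out + bias
    return out.reshape(h, w, weights.shape[1])

rng = np.random.default_rng(0)
image = rng.random((256, 256, 25))                # hypothetical 25-band patch
weights = rng.normal(size=(25, 5))                # m = 5 filters (random, not learned)
learned = band_learning(image, weights)           # shape (256, 256, 5)
```

In a trained network the weights would be updated by backpropagation together with the rest of the model, which is what distinguishes this step from an unsupervised reduction such as PCA.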

Feature Map Extraction
The feature map is extracted using a CNN (Figure 2c), based on the VGG19 method [34], from the hyperspectral image learned in the previous step (Section 2.3.1). Our CNN has eight convolutional layers composed of 64, 128 and 256 convolutional filters with size 3 × 3 to consider spatial information. After the second and fourth convolutional layers, we reduce the spatial volume size in half using the max-pooling layer with a 2 × 2 window. In each convolutional layer, we applied a Rectified Linear Units (ReLU) function.
To characterize global and local information from the image, we adopt the PPM [68]. This module aims to make our method invariant to scale, which is important for detecting trees at different scales and even growth stages. The PPM module (Figure 2d) receives the feature map and applies four branches with max-pooling layers, resulting in four volumes with resolutions of 1 × 1, 2 × 2, 3 × 3, and 6 × 6. The most general level, shown in orange in Figure 2d, creates a feature map that describes the global context of the image, while the other branches divide the feature map into subregions to better characterize the local information. The features of each branch are upsampled to the same size as the input feature map and are concatenated with it to form an improved description of the image.
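The pooling, upsampling, and concatenation steps can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: nearest-neighbor upsampling is used for brevity, any per-branch convolution is omitted, and the helper names (`adaptive_max_pool`, `upsample_nearest`, `pyramid_pooling`) and the 64 × 64 × 8 input are hypothetical.

```python
import numpy as np

def adaptive_max_pool(fmap, bins):
    """Max-pool an (h, w, c) feature map down to (bins, bins, c)."""
    h, w, c = fmap.shape
    rows = np.linspace(0, h, bins + 1).astype(int)
    cols = np.linspace(0, w, bins + 1).astype(int)
    pooled = np.empty((bins, bins, c), dtype=fmap.dtype)
    for i in range(bins):
        for j in range(bins):
            region = fmap[rows[i]:rows[i + 1], cols[j]:cols[j + 1]]
            pooled[i, j] = region.max(axis=(0, 1))
    return pooled

def upsample_nearest(pooled, h, w):
    """Nearest-neighbor upsampling of a (bins, bins, c) volume to (h, w, c)."""
    bins = pooled.shape[0]
    row_idx = np.arange(h) * bins // h
    col_idx = np.arange(w) * bins // w
    return pooled[row_idx][:, col_idx]

def pyramid_pooling(fmap, bin_sizes=(1, 2, 3, 6)):
    """Concatenate the input feature map with its upsampled pooled branches."""
    h, w, _ = fmap.shape
    branches = [upsample_nearest(adaptive_max_pool(fmap, b), h, w)
                for b in bin_sizes]
    return np.concatenate([fmap] + branches, axis=-1)

fmap = np.random.default_rng(0).random((64, 64, 8))   # hypothetical feature map
enhanced = pyramid_pooling(fmap)                      # shape (64, 64, 40)
```

The 1 × 1 branch repeats the global maximum of every channel at each pixel, which is how the global context reaches every location of the enhanced feature map.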

Tree Localization
The trees' positions are located using a refined confidence map obtained by the MSM (Figure 2e). The MSM estimates a confidence map from the feature map obtained in the last module (see Section 2.3.2) and is composed of T refinement stages. The first stage contains three layers with 128 convolutional filters of 3 × 3 size, one layer with 512 convolutional filters of 1 × 1 size, and a last layer with a single convolutional filter that corresponds to the confidence map C_1 of the first stage.
The T − 1 remaining stages refine the positions predicted in the first stage, forming a hierarchical learning of the trees' positions. In a stage t ∈ [2, 3, ..., T], the prediction returned by the previous stage, C_{t−1}, and the feature map from the PPM module are concatenated and used to produce a refined confidence map C_t. These stages have, in total, seven convolutional layers: five layers with 128 filters of 7 × 7 size, one layer with 128 filters of 1 × 1 size, and a last layer with a sigmoid activation function, so that each pixel represents the probability of the occurrence of a tree (with values in [0, 1]). The remaining layers have a ReLU activation function. Additionally, the use of the improved feature map at the entrance of each stage allows multi-scale features, obtained from global and local context information, to be incorporated into the refinement process.
To avoid the vanishing gradient problem during the training phase, we adopted a loss function at the end of each stage, as shown in Equation (1):

$$f_t = \sum_{p} \left\lVert \hat{C}_t(p) - C_t(p) \right\rVert_2^2 \quad (1)$$

where Ĉ_t and C_t are, respectively, the ground truth and the refined confidence maps at location p in stage t. The general loss function is given by:

$$f = \sum_{t=1}^{T} f_t \quad (2)$$

To train our approach, a confidence map Ĉ_t is generated as the ground truth for each stage t using the annotations of the trees. The ground-truth confidence map is generated by placing a 2D Gaussian kernel at the labeled tree centers. The Gaussian kernel has a standard deviation σ_t that controls the spread of the peak. Our approach uses a different value of σ_t for each stage t to refine the tree prediction during each stage. The σ_1 of the first stage is set to a maximum value σ_max, while the σ_T of the last stage is set to a minimum value σ_min. The σ_t of each intermediate stage is equally spaced between σ_max and σ_min. During the early phase of our experiment, the usage of different σ values helped to refine the confidence map, improving its robustness.
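The ground-truth generation and the stage-wise loss can be illustrated with a small NumPy sketch. Several details here are assumptions rather than facts from the text: the stage loss is taken to be a sum of squared differences, overlapping Gaussian peaks are merged with a maximum, the kernel is left unnormalized so that each center has value 1, and the tree annotations and zero-valued predictions are dummy data. The values σ_max = 3, σ_min = 1, and six stages are the ones selected later in the parameter validation.

```python
import numpy as np

def sigma_schedule(sigma_max, sigma_min, n_stages):
    """Equally spaced sigmas from sigma_max (stage 1) to sigma_min (stage T)."""
    return np.linspace(sigma_max, sigma_min, n_stages)

def gaussian_confidence_map(shape, centers, sigma):
    """Place a 2D Gaussian peak (value 1 at the center) on each labeled tree."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    conf = np.zeros(shape, dtype=float)
    for cx, cy in centers:
        peak = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
        conf = np.maximum(conf, peak)   # assumption: overlaps merged by maximum
    return conf

def stage_loss(ground_truth, predicted):
    """Assumed stage loss: squared differences summed over all locations p."""
    return float(np.sum((ground_truth - predicted) ** 2))

sigmas = sigma_schedule(3.0, 1.0, 6)
centers = [(20, 30), (40, 12)]                      # hypothetical (x, y) annotations
targets = [gaussian_confidence_map((64, 64), centers, s) for s in sigmas]
predictions = [np.zeros((64, 64)) for _ in sigmas]  # dummy stage outputs
total_loss = sum(stage_loss(gt, pr) for gt, pr in zip(targets, predictions))
```

With this schedule the first stage is trained against broad peaks and the last stage against narrow ones, so each stage is asked for a slightly sharper localization than the previous one.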
The trees' locations are then obtained from the confidence map C_T of the last stage of the MSM module. To locate the trees, we estimate the peaks (local maxima) of the confidence map by analyzing the 4-pixel neighborhood of each location p. Thus, p = (x_p, y_p) is a local maximum if C_T(p) > C_T(v) for all the neighbors v, where v is given by (x_p ± 1, y_p) or (x_p, y_p ± 1). An example of the tree location from the confidence map peaks is shown in Figure 4.
To avoid noise or positions p with a low probability of occurrence, a peak in the confidence map is considered an object only if C_T(p) > τ. We also set a minimum distance δ to prevent the detection of objects very close to each other. After a preliminary experiment, we used δ = 1 pixel and τ = 0.35.
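The peak extraction described in this section can be sketched as below. The 4-neighbor test and the threshold τ = 0.35 follow the text; how the minimum distance δ is enforced is not specified, so the greedy suppression with a Chebyshev distance used here is an assumption, as are the function name `find_tree_peaks` and the toy confidence map.

```python
import numpy as np

def find_tree_peaks(conf, tau=0.35, delta=1):
    """Return (x, y) peaks of a confidence map.

    A pixel is a peak if its value exceeds tau and is strictly greater
    than its 4-connected neighbors. Peaks closer than delta pixels to an
    already accepted, stronger peak are dropped (assumed strategy).
    """
    h, w = conf.shape
    candidates = []
    for y in range(h):
        for x in range(w):
            value = float(conf[y, x])
            if value <= tau:
                continue
            neighbors = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
            if all(value > conf[ny, nx] for nx, ny in neighbors
                   if 0 <= nx < w and 0 <= ny < h):
                candidates.append((value, x, y))
    candidates.sort(reverse=True)                 # strongest peaks first
    peaks = []
    for _, x, y in candidates:
        if all(max(abs(x - px), abs(y - py)) >= delta for px, py in peaks):
            peaks.append((x, y))
    return peaks

conf = np.zeros((5, 5))
conf[1, 1] = 0.9      # strong peak at (x=1, y=1)
conf[3, 3] = 0.2      # local maximum below tau, ignored
conf[1, 3] = 0.6      # peak at (x=3, y=1)
peaks = find_tree_peaks(conf)
```

With δ = 1 pixel, two distinct pixels are never closer than δ, so in this sketch the threshold τ does most of the filtering.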

Experimental Setup
The images were split into non-overlapping patches of 256 × 256 pixels. The patches were randomly divided into training, validation, and testing sets, in a proportion of 50%, 25%, and 25%, respectively. Figure 5 shows the images used to extract the training, validation, and test sets in each year (2016, 2017, and 2018), and Table 2 shows the number of samples. The number of samples differs between years because of slight differences in the image acquisition. For training, we initialized the weights of the first part of our network with weights pre-trained on ImageNet and applied a stochastic gradient descent optimizer with a momentum of 0.9. The validation set was used to adjust the learning rate and the number of epochs, reducing the risk of overfitting in our method. After the adjustments, the learning rate was set to 0.001 and the number of epochs was set to 100. The proposed approach was implemented in Python on the Ubuntu 18.04 operating system and used the Keras-Tensorflow API. The workstation used for both training and testing has an Intel(R) Xeon(R) E3-1270 @ 3.80 GHz CPU, 64 GB of memory, and an NVIDIA Titan V graphics card, which includes 5120 CUDA (Compute Unified Device Architecture) cores and 12 GB of graphics memory. Lastly, to evaluate the performance of the approaches, we adopted three metrics: precision, recall, and f-measure [69]. They were calculated for the 311 tree samples (Table 2), which were not used in the previous steps.
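The patch extraction and the random 50%/25%/25% division can be sketched with the Python standard library. The orthomosaic size, the fixed random seed, and the choice to discard incomplete patches at the image border are assumptions for illustration; the text does not state how borders were handled.

```python
import random

def patch_corners(height, width, size=256):
    """Top-left (row, col) corners of non-overlapping size x size patches."""
    return [(row, col)
            for row in range(0, height - size + 1, size)
            for col in range(0, width - size + 1, size)]

def split_patches(patches, fractions=(0.5, 0.25, 0.25), seed=0):
    """Randomly divide patches into training, validation, and test sets."""
    shuffled = patches[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(fractions[0] * len(shuffled)))
    n_val = int(round(fractions[1] * len(shuffled)))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

patches = patch_corners(1024, 2048)        # hypothetical orthomosaic size in pixels
train, val, test = split_patches(patches)  # 16, 8, and 8 patches
```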

Validation of the Parameters
We first evaluated the influence of the parameters of the proposed method using only the validation images and reported the average f-measure of the three years. Parameters σ_min, σ_max, and the number of stages, responsible for the refinement task in the density map prediction, were evaluated with the data displayed in Figure 6. From the f-measure shown in Figure 6a, σ_min = 1 obtained the best result. Smaller values in this graph represent a small spread of the density map's peak around the center of the trees, thus impairing their detection. On the other hand, higher values of σ_min in the last stage of our method return a large spread that can cover more than one tree per area. In this case, for example, only one tree would be detected instead of two. As shown in Figure 4, σ_max may be larger since it determines the density map of the first stage, which is refined in the subsequent stages. This parameter can be situated between 2.8 and 3.2, although the best value in the experiment was 3 (Figure 6b). The number of stages n ranged from 2 to 8, as shown in Figure 6c. We found that n = 6 achieved the highest overall f-measure. In this manner, the refinement step of our network used the following parameters: σ_min = 1, σ_max = 3, and n = 6.
Figure 6. Evaluation of (a) σ_min = 1, (b) σ_max = 3, and (c) number of stages responsible for the refinement of the density map prediction.
The input images in the experiment have a total of 25 spectral bands. Our method can detect how many of them contribute effectively to the tree detection task. We therefore evaluated the proposed convolutional layer for learning m linear band combinations (Figure 7). The experiment showed that m = 5 band combinations reached the best f-measure, 0.939, against 0.892 when all 25 spectral bands were considered. The data show that adding more linear combinations does not improve the results. These results indicate that the proposed layer appropriately combines the bands that should be considered, while avoiding the correlation and scarcity problems that hinder most deep learning methods.
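As a rough illustration of what learning m linear band combinations means, the sketch below applies m weight vectors to the spectrum of every pixel, which is what a convolution with a 1 × 1 kernel computes. The 1 × 1 kernel, the absence of a bias term, the fixed averaging weights, and the names `combine_bands` and `combine_image` are assumptions; in the network the weights would be learned during training.

```python
def combine_bands(spectrum, weights):
    """Apply m linear combinations to one pixel's spectrum.

    `weights` holds m rows with one coefficient per input band.
    """
    if any(len(row) != len(spectrum) for row in weights):
        raise ValueError("each weight row needs one coefficient per band")
    return [sum(w * v for w, v in zip(row, spectrum)) for row in weights]


def combine_image(image, weights):
    """Apply the same combinations at every pixel of an H x W x bands image."""
    return [[combine_bands(spectrum, weights) for spectrum in row] for row in image]


N_BANDS, M = 25, 5  # 25 input bands and m = 5 combinations, as in the text

# Placeholder weights: combination k averages bands 5k .. 5k+4.
# Learned weights would replace these in a trained network.
weights = [
    [0.2 if 5 * k <= b < 5 * (k + 1) else 0.0 for b in range(N_BANDS)]
    for k in range(M)
]

# Tiny 2 x 3 synthetic image in which the value of band b is simply b.
image = [[[float(b) for b in range(N_BANDS)] for _ in range(3)] for _ in range(2)]

reduced = combine_image(image, weights)  # shape 2 x 3 x 5
```

The output keeps the spatial size of the input but carries 5 channels instead of 25, so every later stage of the network sees a reduced spectral dimension.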

Band Analysis
To determine the robustness of the band selection module as the initial step of our network, we compared it against our network baseline (i.e., every step beyond the feature map extraction, Figure 2) fed with different inputs. One input consisted of all 25 spectral bands, whereas the other was composed of spectral bands obtained through a PCA approach. It is important to emphasize that the results in this section were obtained from the test images, whereas the parameters of the methods were estimated from the validation set. Additionally, the PCA bands contained 99.27% of the total information.
Table 3 displays the overall precision, recall, and f-measure for the test images in the scenarios described above. The precision values show that the baseline combined with the PCA spectral bands returned higher values than the baseline with all 25 bands. These precision values indicate that the methods produce few false positives (i.e., they rarely detect trees incorrectly). In terms of recall, the proposed method with the band selection module is better than both alternatives, which indicates that it detects most trees while the others miss a larger share of them. Regarding the f-measure, the harmonic mean of precision and recall, the PCA input outperformed the use of all 25 bands (0.921 against 0.889). Compared to the baseline with the 25 spectral bands, the proposed method using five linear band combinations significantly improved the f-measure, from 0.889 to 0.959. In addition, the supervised band reduction proposed here proved superior to the PCA method, with an increase of 3.8 percentage points in f-measure (from 0.921 to 0.959) and 7.4 percentage points in recall (from 0.871 to 0.945). Figure 9 shows a qualitative view of the tree detection results for the test images obtained in 2016 and 2018.
In Figure 9, detected trees have a yellow circle (true positives), while undetected trees have a red circle (false negatives). The yellow dots indicate incorrect detections by both methods (false positives). Using all bands, the network returned the worst results because of the redundancy of spectral information, which is consistent with the Hughes phenomenon. The PCA improved the detection of trees (Figure 9b), although it failed to detect a portion of them, which explains its lower recall compared to the proposed method. As shown here, the proposed method was able to detect the majority of trees correctly (Figure 9c).
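The relation between these detection outcomes and the reported metrics can be sketched as follows. The matching rule (greedy one-to-one assignment within a maximum pixel distance), the function names, and the example coordinates are assumptions for illustration; the text does not state the matching criterion used. Only the last line uses values reported in the paper.

```python
import math


def match_detections(detections, annotations, max_dist):
    """Greedy one-to-one matching of detected points to annotated tree centers.

    Returns (true positives, false positives, false negatives).
    The distance-based rule is an assumption of this sketch.
    """
    unmatched = list(annotations)
    tp = 0
    for dx, dy in detections:
        best_index, best_dist = None, max_dist
        for i, (ax, ay) in enumerate(unmatched):
            dist = math.hypot(dx - ax, dy - ay)
            if dist <= best_dist:
                best_index, best_dist = i, dist
        if best_index is not None:
            unmatched.pop(best_index)  # each annotation is matched at most once
            tp += 1
    return tp, len(detections) - tp, len(unmatched)


def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative coordinates only, not data from the study.
annotated = [(10, 10), (20, 20), (30, 30)]
detected = [(11, 10), (19, 21), (50, 50)]
tp, fp, fn = match_detections(detected, annotated, max_dist=3.0)
precision, recall = precision_recall(tp, fp, fn)

# Precision and recall reported for the proposed method.
reported_f = round(f_measure(0.973, 0.945), 3)
```

Applied to the reported precision (0.973) and recall (0.945), the harmonic mean rounds to 0.959, which agrees with the f-measure given for the proposed method.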

Discussion
The methodological contribution of our CNN-based method is evident in both the quantitative and the qualitative comparisons (Figure 9 and Table 3). The band selection module within our network's architecture not only reduces the noise caused by the dimensionality of hyperspectral data but also achieves better performance in the proposed task. The comparison with the PCA method, which is commonly used to reduce the number of bands, demonstrates the importance of adopting a method that considers the spectral information of the labeled object to select the right number of bands. Deep neural networks do not commonly include such a feature within their architectures, and future methods could benefit from the module proposed here.
Concerning the high-density scene, the remaining stages of our network had already proved effective under other conditions [50]. Nonetheless, this was the first time that we applied them to a highly dense forested environment and to hyperspectral data. The PPM module and the MSM stage refinement are important phases since they produce a high-quality density map containing the object's location, which yields accurate predictions even when trees are located near each other. In this sense, these modules are important because they enable our method to predict both overlapping and isolated trees (Figure 9c).
Regarding the results of the proposed network in detecting Syagrus romanzoffiana, the high f-measure achieved (0.959, as shown in Table 3) stands out. This palm tree is essential to forest regeneration [58], and its accurate identification can improve the monitoring of forest successional stages. Additionally, the identification of Syagrus romanzoffiana can support fauna studies, such as tapir monitoring, since this mammal is one of the main consumers of the palm's fruits and spreads its seeds through its feces, contributing to the dissemination of the species [55,61].
Moreover, besides the developed method, the characteristics of Syagrus romanzoffiana may themselves assist its identification. Results from Miyoshi et al. [17] showed a higher reflectance factor for this species than for the other seven tree species in the transect area, especially in the near-infrared region of the electromagnetic spectrum. In this region, the vegetation response is mainly affected by the leaf's cell structure [70], and it is an important region for tree species identification [71,72]. Beyond that, Syagrus romanzoffiana has a distinctive crown shape: it resembles a star, while the other tree species have umbrella, oval, broad, or irregular shapes, among others, in addition to differences in the layering of their crowns [17].
Lastly, our results are consistent with those of other studies that applied deep learning. Sothe et al. [23] showed a better performance of CNN than SVM and RF when identifying tree species in an ombrophilous dense forest. Safonova et al. [24] found f-measures of up to almost 93% when applying data augmentation and CNN to RGB images. Furthermore, Nezami et al. [22] achieved high precision and recall values (i.e., higher than 0.9) when identifying three tree species using a 3D-CNN. Using the Residual Neural Network (ResNet) and RGB images acquired with a UAV over three years, Natesan et al. [73] achieved an average f-measure of 80% when identifying three types of pine trees. The use of deep learning on RGB images is also shown by Santos et al. [32], who achieved an average precision of 92% in identifying the Dipteryx alata tree species. These accuracies demonstrate that our method, with an f-measure of 0.959 (Table 3), was also able to return state-of-the-art performance for the detection of tree species in a forest environment.

Conclusions
In this paper, we presented a novel deep learning method, based on a CNN architecture, that handles the high dimensionality of UAV-based hyperspectral images to detect single-tree species. Our approach was constructed with a band selection feature as its initial step. This implementation within the network proved appropriate for dealing with high dimensionality and was superior to both the baseline method considering all 25 spectral bands and the PCA approach. The band selection step is followed by a feature map extraction and a multi-stage model refinement of the confidence map. The constructed architecture considers the possibility that every pixel in the image corresponds to an actual tree of the target species, which was important for producing accurate results in a highly dense scene. The proposed method returned state-of-the-art performance for detecting and geolocating trees in UAV-based hyperspectral images, with f-measure, precision, and recall values of 0.959, 0.973, and 0.945, respectively. Unlike other current deep neural networks, our method estimates, within the network's architecture, the combination of hyperspectral bands that contributes most to the task. The approach demonstrated here is relevant to forest environment monitoring because it provides accurate identification of single trees.

Conflicts of Interest:
The authors declare no conflict of interest.