Tree Crown Delineation Algorithm Based on a Convolutional Neural Network

Tropical forests concentrate the largest diversity of species on the planet and play a key role in maintaining environmental processes. Due to the importance of these forests, there is growing interest in mapping their components and obtaining information at the individual tree level to conduct reliable satellite-based forest inventories for biomass and species distribution quantification. Individual tree crown information could be gathered manually from high-resolution satellite images; however, to achieve this task at large scale, an algorithm that identifies and delineates each tree crown individually, with high accuracy, is a prerequisite. In this study, we propose the application of a convolutional neural network, the Mask R-CNN algorithm, to perform tree crown detection and delineation. The algorithm uses very high-resolution satellite images of tropical forests. The results obtained are promising: the Recall, Precision, and F1 score values obtained were 0.81, 0.91, and 0.86, respectively. In the study site, a total of 59,062 tree crowns were delineated. These results suggest that this algorithm can be used to assist the planning and conduct of forest inventories. As the algorithm is based on a deep learning approach, it can be systematically trained and applied to other regions.


Introduction
Forest ecosystems are important for maintaining life on our planet: they secure food for local populations, contribute to soil conservation, mitigate the effects of climate change, provide habitats for species, and regulate water flow [1]. In particular, tropical forests play a fundamental role in maintaining biodiversity. For example, the Amazon rainforest is responsible for hosting about a [...]

[...] Techniques that search for pixels of locally maximum brightness (called local maximum algorithms) may be useful in temperate forest regions, but this category of algorithm may not be suitable for a tropical forest region due to the large variety of tree crown shapes [13]. In addition, pixels with maximum brightness may not be at the top of the crowns but in a region close to the edge; this situation occurs mainly in rounded tree crowns [13].
The techniques that use region growing are based on the crowns' spectral characteristics. Region growing is a segmentation approach that splits an image into different areas and recognizes objects within each sub-image; the result of its application depends on the density of the forest, the tree positions, and the dataset resolution [29,31]. The technique relies on the assumption that the color intensity is high at the top of the tree crown and decreases gradually until the border, which is shaded, is reached [32].
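To make this assumption concrete, the sketch below implements a minimal region-growing pass in Python. It is an illustration of the general idea described above, not the implementation of [29,31,32]; the seed point, the tolerance parameter, and the function names are ours.

```python
import numpy as np
from collections import deque

def grow_crown(brightness, seed, drop_tolerance=0.5):
    """Minimal region growing from a treetop seed: accept 4-connected
    neighbors whose brightness does not increase and stays within a
    tolerated drop from the seed value (crowns darken toward the
    shaded border, per the assumption in the text)."""
    rows, cols = brightness.shape
    seed_value = brightness[seed]
    mask = np.zeros_like(brightness, dtype=bool)
    mask[seed] = True
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and not mask[nr, nc]:
                # accept pixels that are no brighter than the current pixel
                # and not below the tolerated drop from the seed value
                if brightness[nr, nc] <= brightness[r, c] and \
                   brightness[nr, nc] >= seed_value * (1.0 - drop_tolerance):
                    mask[nr, nc] = True
                    queue.append((nr, nc))
    return mask

# toy example: a bright synthetic "crown" fading toward its border
y, x = np.mgrid[0:64, 0:64]
img = np.clip(1.0 - np.hypot(y - 32, x - 32) / 20.0, 0.0, 1.0)
crown_mask = grow_crown(img, seed=(32, 32))
print(crown_mask.sum(), "pixels assigned to the crown")
```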
The template matching approach is based on the tree crown's shape [31]. Generally, this approach models a tree crown using an ellipsoid (the template equation), and different tree crown shapes can be modeled by varying the ellipsoid surface (changing the parameters of the ellipsoid equation). Regions with a high correlation with the template are then considered likely to be tree crowns [31]. Artificial neural networks (ANNs) have also been applied as a template matching step in TCDD algorithms [33,34].
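As an illustration of this idea (not the specific method of [31] or the ANN variants of [33,34]), the sketch below builds an ellipsoid brightness template and scores every image window by Pearson correlation; windows with a high score become candidate crowns. The radius and the threshold are illustrative choices.

```python
import numpy as np

def ellipsoid_template(radius=8):
    """Half-ellipsoid brightness model of a sunlit crown: bright at the
    apex, darkening toward the edge (a common template-matching assumption)."""
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    d2 = (x**2 + y**2) / float(radius**2)
    return np.sqrt(np.clip(1.0 - d2, 0.0, None))  # height of a unit ellipsoid

def correlation_map(image, template):
    """Pearson correlation between the template and every image window;
    peaks above a chosen threshold (e.g., 0.8) are candidate tree crowns."""
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.sqrt((t**2).sum())
    out = np.full(image.shape, -1.0)
    for r in range(image.shape[0] - th + 1):
        for c in range(image.shape[1] - tw + 1):
            w = image[r:r + th, c:c + tw]
            w = w - w.mean()
            denom = np.sqrt((w**2).sum()) * t_norm
            if denom > 0:
                out[r + th // 2, c + tw // 2] = (w * t).sum() / denom
    return out
```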
Recently, a novel ANN approach, the convolutional neural network (CNN), has become the state of the art for solving different computer vision problems, such as face recognition [35], object detection [36], human pose estimation [37], and tree species detection [38]. Due to their promising results in image processing, CNNs have been used to solve different problems within remote sensing, such as land cover classification [39], scene classification [40], object extraction [41], species classification (e.g., oil palm tree detection in a region in the south of Malaysia [42]), fine-grained mapping of vegetation species and communities in central Chile [43], tree crown detection [44], and very high-resolution regional tree species maps [13].
The CNN (a deep learning algorithm) is a feed-forward neural network trained in a supervised way, which has gained prominence due to its application in computer vision, mainly for solving instance segmentation problems in a scene [45]. Instance segmentation aims to identify an object at the pixel level and perform its complete delineation. Among CNN architectures, Mask R-CNN stands out, having outperformed other architectures designed for instance segmentation tasks [45]. Despite the promising results achieved by this CNN in recent studies, such as hangar detection [46], livestock farming management [47], and ship detection [48], little has been studied about its application to high spatial resolution satellite images.
One of the main problems faced during the application of CNNs (including Mask R-CNN) is the composition of a training set with enough training examples (training patterns) for neural network learning and, hence, for solving the problem in a satisfactory way [44,45]. Deep neural networks, including CNNs, require a large training set due to the number of free parameters (weights and biases) belonging to the network architecture; these free parameters need to be adjusted during the learning process, and there is a directly proportional relationship between the number of free parameters and the number of patterns required as input to the neural network in the training phase [49,50]. Gathering patterns for CNN training that allow the algorithm to solve object detection or instance segmentation problems is costly and difficult, because training sets must be composed of thousands of images, all with the objects of interest correctly delineated [44]. In addition, the quality and quantity of training patterns can impact the prediction accuracy. When using a CNN for TCDD, collecting training samples can become even more difficult because, even in high-resolution images, and especially over tropical forest regions, identifying ITC samples is not trivial. One solution proposed for this problem is the use of an unsupervised algorithm to select the training patterns, but the inaccuracy of such an algorithm during the selection of training samples may negatively impact the CNN's performance [44,51]. Another alternative is the use of LiDAR point cloud information to help with the manual delineation [44].
In this context, this research proposes the application of Mask R-CNN to perform TCDD in very high spatial resolution WV-2 images (0.5 m per pixel) from a highly diverse tropical forest area.
To construct the training set, an algorithm was implemented to obtain synthetic images. This algorithm produces synthetic images from a set of hand-annotated crowns, and its implementation has two main objectives: first, to overcome the need to delineate by hand a large training set composed of images with all ITCs delineated, and, second, to evaluate its use as an alternative to other techniques (such as LiDAR and unsupervised algorithms) for training set construction. The main objective of the study is to present a new application of deep learning to delineate each tree crown individually in tropical forests. The main innovation presented is the use of a deep learning-based algorithm to perform TCDD over a tropical forest image, producing the tree crown delineations as its response. In addition, according to Weinstein et al. [44], it is difficult to compare TCDD results due to variation in the measurements applied; therefore, this research provides a set of metrics and graphical analyses that can be used as a guideline for the analysis of other research performing TCDD. For these reasons, this methodology can be applied as an auxiliary tool in the development of forest inventories of tropical regions.

Study Site
The study site is the Santa Genebra Forest Reserve, a remaining fragment of Atlantic tropical rainforest located in the municipality of Campinas (São Paulo State, Brazil); see Figure 1. The Santa Genebra Reserve is located at 22°49′13.46″ S and 47°06′38.47″ W. The canopy cover in this region is highly heterogeneous and comprises deciduous and evergreen species; the reserve is well preserved and occupies an area of 237.6 ha [20]. Surveys performed in the Santa Genebra Reserve found nearly 100 woody species within one hectare [52,53]. The predominant climate of the region is tropical humid, with rainfall distributed throughout the year. The region receives approximately 1500 mm of precipitation per year, with a rainy season in the summer months (December to February), when monthly rainfall exceeds 200 mm, and a drier winter (June to August), when monthly precipitation is below 100 mm [13].

WorldView-2 Satellite Image
The WV-2 satellite was launched in 2009 by DigitalGlobe (DigitalGlobe, Inc., Westminster, CO, USA). WV-2 images contain eight spectral bands. The multispectral bands encompass the electromagnetic spectrum from 400 nm to 1040 nm with 2.0 m spatial resolution. The panchromatic band covers the spectrum from 450 nm to 800 nm with 0.5 m spatial resolution; see Table 1. A pan-sharpening process was applied to obtain an RGB image (a combination of the red, green, and blue bands) at 0.5 m spatial resolution (i.e., the same resolution as the panchromatic band), producing an image that allowed the manual delineation of the tree crowns. The pan-sharpening algorithm used was local mean and variance matching (LMVM), which has yielded good results in pan-sharpening comparison studies [54,55]. Figure 2 shows the result of the LMVM algorithm over a sub-image of the study area.
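The LMVM scheme rescales the high-resolution panchromatic detail to the local mean and variance of each multispectral band. Below is a compact Python sketch of this formulation, assuming the multispectral bands have already been resampled to the 0.5 m panchromatic grid; the window size is illustrative, and this is not the exact implementation used in this study.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def lmvm(pan, band, window=7, eps=1e-6):
    """Local mean and variance matching (LMVM) for one multispectral band
    resampled to the panchromatic grid: the PAN detail is rescaled,
    window by window, to the band's local mean and variance."""
    pan = pan.astype(np.float64)
    band = band.astype(np.float64)
    pan_mean = uniform_filter(pan, window)
    band_mean = uniform_filter(band, window)
    pan_var = uniform_filter(pan**2, window) - pan_mean**2
    band_var = uniform_filter(band**2, window) - band_mean**2
    gain = np.sqrt(np.clip(band_var, 0, None) / (np.clip(pan_var, 0, None) + eps))
    return (pan - pan_mean) * gain + band_mean

# sharpened RGB stack: np.dstack([lmvm(pan, b) for b in (red, green, blue)])
```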

Individual Tree Crown Dataset
From the pan-sharpened WV-2 image, individual examples of tree crowns were manually delineated. Only crowns that could be clearly identified in the satellite image were outlined to compose the training and validation sets (validation during the learning process). There was no specific concern with tree species when constructing the training set: tree crowns of different sizes, shapes, and colors were collected so that the neural network could delineate the ITCs of any species in the Santa Genebra forest. A total of 1506 tree crowns were manually delineated; among them, 1050 (69.7%) were selected to compose the training set and 456 (30.3%) the validation set (validation during the neural network training). Figure 3 shows some training and validation pattern examples. Table 2 shows the minimum, mean, and maximum area of ITCs in the training and validation datasets.

Instance Segmentation with Mask R-CNN
According to Bai and Urtasun [56], instance segmentation seeks to identify the semantic class of each pixel as well as associate each pixel with a physical instance of an object. Instance segmentation is a challenging computer vision problem because it encompasses two hard image processing tasks: object detection, which aims to classify all the objects in a scene and locate each within a bounding box, and semantic segmentation, which seeks to determine the pixels that belong to a specific object of the scene [45,56]. Instance segmentation performs the correct delineation of the different objects in a scene and assigns each object a specific identification number (an ID value).
Mask R-CNN (Figure 4) is an extension of Faster R-CNN, a CNN developed to perform object detection within an image [45]. Its innovation is the inclusion of a new branch in the Faster R-CNN architecture to perform instance segmentation [57]. Mask R-CNN can be divided into two distinct modules that work together to perform the instance segmentation. The first module is composed of the Faster R-CNN, which performs the following operations. (i) A set of convolution layers extracts a feature map from the image. (ii) In the Faster R-CNN, there is a lightweight neural network called the region proposal network (RPN). The feature map is input into the RPN, which scans it and finds areas with a high probability of containing an object; these areas are called regions of interest (RoIs). (iii) Each RoI obtained from the RPN can have a different shape; hence, the algorithm applies an operation (performed by a pooling layer) to convert all RoIs to the same shape. (iv) Fully connected networks (FCNs) work as an RoI classifier, determining the class (label) of the object and refining the location and size of the bounding box that encapsulates the object.
The second module (the new branch) of Mask R-CNN is composed of a set of convolution layers that performs the masking around the object. The algorithm selects the RoIs with the highest overlap with the ground truth (the positive RoIs). Then, the convolution layers work over the positive RoIs and determine the pixels belonging to each object. Therefore, the Mask R-CNN response is a correct bounding box containing the object of interest and its classification within the target classes, as well as the object mask, comprising all the pixels in the scene belonging to the object. This new branch increases the computational time, but, even with this increase, instance segmentation close to real time is still feasible [45].
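For reference, this is roughly how the Mask R-CNN implementation cited later in this work ([61], the Matterport Keras/TensorFlow code) is driven at inference time; the configuration values and the weights file name are illustrative assumptions, not the exact setup of this study.

```python
import numpy as np
import mrcnn.model as modellib
from mrcnn.config import Config

class CrownConfig(Config):
    """Minimal inference configuration: one background class plus one
    'tree crown' class, one image per GPU."""
    NAME = "tree_crown"
    NUM_CLASSES = 1 + 1
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

model = modellib.MaskRCNN(mode="inference", config=CrownConfig(), model_dir="logs")
model.load_weights("mask_rcnn_tree_crown.h5", by_name=True)  # hypothetical weights file

rgb_patch = np.zeros((128, 128, 3), dtype=np.uint8)  # stand-in for a real image patch

# detect() returns one dict per image with 'rois' (bounding boxes),
# 'class_ids', 'scores', and 'masks' (one binary mask per detected instance)
results = model.detect([rgb_patch], verbose=0)
crowns = results[0]
```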

Synthetic Forest Images for Training
In images of tropical forests, the construction of a set of samples for training a CNN is a challenge, mainly due to the density of trees. The manual delineation of all the crowns in a region such as the Santa Genebra Reserve, with approximately 100 tree species per hectare, is almost impossible. To overcome this difficulty, an algorithm was developed for the creation of synthetic forest images using a set of well-delineated tree crowns created by hand. The algorithm steps are described in Algorithm 1.
Algorithm 1: Building a synthetic forest image to compose the training dataset.

    Result: synthetic forest image
    initialization;
    count = 1;
    create a matrix with the same dimensions as subimage;
    while (count < numberOfCrowns) do
        select one place within the subimage;
        select one crown within the polygons;
        if (place is free) then
            put the crown in the subimage;
            fill matrix with the count value at that place;
            count += 1;
        end
    end
    create a geometry matrix from matrix;
    return (matrix, subimage)

In Algorithm 1:
• subimage is the background on which the synthetic forest is created. It can be an image patch from a region of a WV-2 image or a black image; its dimensions are determined by the user, but the channels must be R, G, and B;
• polygons is the set of manually delineated crowns; in the specific case of this research, a shapefile with the geometry of each manually delineated crown;
• count is the variable that controls the number of crowns in the subimage; the algorithm copies a specific number of crowns into the subimage;
• matrix is a two-dimensional array that is polygonized to create the corresponding forest shapefile, in which each crown has a geometry;
• the algorithm checks whether the place selected for copying the crown into the subimage is free, i.e., it assesses whether pasting the new tree crown into the subimage would cover most of an existing tree crown;
• the matrix is filled at the selected place with the count value;
• the matrix is converted into a shapefile containing the geometries of each tree within; and
• the algorithm returns the forest image (subimage) and its shapefile (matrix).
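A minimal Python sketch of Algorithm 1 is given below. It assumes each hand-delineated crown has been cut out as an (RGB pixels, boolean mask) pair; the "place is free" test and the polygonization step are simplified, so this is an illustration of the procedure rather than the exact code of this study.

```python
import numpy as np

def build_synthetic_forest(subimage, crowns, number_of_crowns,
                           max_tries=1000, cover_limit=0.5, rng=None):
    """Sketch of Algorithm 1: paste hand-delineated crowns onto a
    background patch (subimage) and record each crown's pixels in an
    integer ID matrix, which can later be polygonized (e.g., with GDAL)
    into the per-crown geometries of a shapefile."""
    rng = rng or np.random.default_rng()
    canvas = subimage.copy()
    id_matrix = np.zeros(canvas.shape[:2], dtype=np.int32)
    count, tries = 1, 0
    while count < number_of_crowns and tries < max_tries:
        tries += 1
        pixels, mask = crowns[rng.integers(len(crowns))]  # select one crown
        h, w = mask.shape
        if h > canvas.shape[0] or w > canvas.shape[1]:
            continue
        r = rng.integers(canvas.shape[0] - h + 1)  # select one place
        c = rng.integers(canvas.shape[1] - w + 1)
        window = id_matrix[r:r + h, c:c + w]
        # simplified "place is free" test: reject the placement if the new
        # crown would sit mostly on top of already-placed crowns
        if (window[mask] > 0).mean() > cover_limit:
            continue
        canvas[r:r + h, c:c + w][mask] = pixels[mask]  # put crown in subimage
        window[mask] = count                           # fill matrix with count
        count += 1
    return canvas, id_matrix
```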

Training the Mask R-CNN for TCDD
Using the algorithm for synthetic forest image creation and the vector files containing the manually delineated tree crowns, 19,656 synthetic images and their respective labels (an example is in Figure 6) were created. From this sample, 15,122 were used for the Mask R-CNN training, and 4534 images were used for validation. The number of tree crowns within these synthetic forest images ranged from 4 to 150. Each synthetic image was generated with a dimension of 128 × 128 pixels, which represents an area of 4096 m². In the WV-2 image used in this research, the number of tree crowns within a 128 × 128 pixel grid cell (4096 m²) was normally less than 150. The dimension of 128 × 128 pixels for the synthetic images was chosen to avoid problems (mainly memory allocation errors) during the training procedure. The hardware used for image processing was the main constraint that led to the definition of the dimensions of the synthetic images (hardware configuration in Table 3). The Mask R-CNN algorithm uses the graphics processing unit (GPU) to improve training and prediction performance. During Mask R-CNN training on the hardware used in this research, when the neural network was fed an image with more than 150 objects of interest, the hardware could not allocate enough GPU memory to process the image, so the computer stopped the training execution. Sometimes the training did run, but the algorithm could not detect all the tree crowns present in the image. The initial learning rate and momentum were 0.001 and 0.9, respectively. The model was trained for 120 epochs, where each epoch is a full pass over the training set. To improve the training process, a learning rate decay (a reduction in the value of the learning rate) by a factor of 10 every 40 epochs was applied; therefore, the learning rate at epochs 41 and 81 was 1 × 10⁻⁴ and 1 × 10⁻⁵, respectively. During the training phase, data augmentation was randomly applied to the training images before they were input into the neural network. The data augmentation comprised three image transformations: (1) horizontal or vertical flip; (2) rotation of 90°, 180°, or 270°; and (3) a change in pixel brightness within the range of 50% to 150%. One of these transformations was selected and applied to each training image. Other transformations, such as shearing and changing hue and saturation, were also tested, but training with data augmentation composed of flipping, rotation, and brightness change produced the best results. A sketch of this augmentation policy is given below.
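The augmentation policy can be expressed with the imgaug library (the one commonly used with the implementation of [61]); the sketch below mirrors the three transformations, with one chosen at random per image. It is an assumption of how such a policy would be written, not the verbatim training code.

```python
import imgaug.augmenters as iaa

# One of the three transformations is selected at random for each training
# image; geometric transforms are applied to the masks as well by the
# Mask R-CNN training loop of [61] (model.train(..., augmentation=...)).
augmentation = iaa.OneOf([
    iaa.OneOf([iaa.Fliplr(1.0), iaa.Flipud(1.0)]),  # (1) horizontal or vertical flip
    iaa.Rot90([1, 2, 3]),                           # (2) rotation of 90, 180, or 270 degrees
    iaa.Multiply((0.5, 1.5)),                       # (3) brightness change, 50% to 150%
])
```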
The metrics applied to evaluate the Mask R-CNN training and validation are class loss, bounding box loss, mask loss, and total loss; their values after 120 epochs can be seen in Table 4.
• class loss: how close the model is to predicting the correct class;
• bounding box loss: the distance between the ground-truth (validation) bounding box parameters (height and width) and the predicted bounding box parameters; in other words, how good the model is at locating objects within the image;
• mask loss: per-pixel misclassification, obtained by comparing the ground-truth pixels and the predicted pixels; and
• total loss: the sum of the other losses.

The stopping criterion for the neural network training was the stabilization of the total loss metric. Figure 7 shows the evolution of the total loss for both training and validation during the 120 epochs. After the training, the Mask R-CNN was applied over the WV-2 pan-sharpened image (depicted in Figure 1B) to perform the TCDD. This image was split into patches of 128 × 128 pixels using a regular grid with an overlap of two columns and two rows of pixels between patches, and each patch was presented to the Mask R-CNN. This overlap is important because, together with the regular grid, it helps to merge tree crowns that were split between two patches. The algorithm for merging two parts of a tree crown was developed in R, and it performs the following operations: (1) it selects all the tree crowns that intersect the grid lines; (2) it then checks whether the intersection between two tree crowns is a polygon (not a line or point); if so, these two segments are merged (a sketch is given at the end of this section). The patch dimension cited in this paragraph was also chosen to match the dimension of the training images. The Mask R-CNN output is a vector file in which each tree detected in the image is delimited by a polygon with an individual identification number. All the code developed in this research was implemented in Python and R, using the libraries TensorFlow [58], Keras [59], and GDAL [60]. The Mask R-CNN algorithm was obtained from the implementation in [61]. All the implemented code, i.e., the Mask R-CNN for TCDD, the algorithm for the creation of synthetic images, and the code for merging the tree crowns split by the grid, is available on GitHub: https://github.com/jgarciabraga/MASK_RCNN_TCDD.
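The merging step was implemented in R in this study; the sketch below re-expresses the same two operations in Python with GeoPandas/Shapely (select the crowns touching the grid lines, then union any pair whose intersection is areal), as an illustration under those assumptions.

```python
import geopandas as gpd
from shapely.ops import unary_union

def merge_split_crowns(crowns, grid_lines):
    """crowns: GeoDataFrame of predicted crown polygons;
    grid_lines: (Multi)LineString with the patch-grid lines.
    Step 1 - select the crowns intersecting the grid lines;
    Step 2 - merge pairs whose mutual intersection is a polygon
    (an areal overlap, not just a shared line or point)."""
    on_grid = crowns[crowns.intersects(grid_lines)].reset_index(drop=True)
    groups, used = [], set()
    for i in range(len(on_grid)):
        if i in used:
            continue
        group = [on_grid.geometry[i]]
        for j in range(i + 1, len(on_grid)):
            if j in used:
                continue
            inter = on_grid.geometry[i].intersection(on_grid.geometry[j])
            if not inter.is_empty and inter.area > 0:  # areal overlap -> same crown
                group.append(on_grid.geometry[j])
                used.add(j)
        groups.append(unary_union(group))
    untouched = crowns[~crowns.intersects(grid_lines)].geometry.tolist()
    return gpd.GeoDataFrame(geometry=untouched + groups, crs=crowns.crs)
```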

Independent Algorithm Assessment
We conducted the algorithm validation with three main objectives: (1) to provide future research with a set of metrics (applied in different previous studies and summarized here) that can be used for result analysis, making it possible to establish a standard for analyzing TCDD results; (2) to compare our results with those obtained in previous TCDD research; and (3) to compare the algorithm response with an independent evaluation dataset.
The evaluation dataset was obtained using a set of 989 points randomly generated over the Santa Genebra forest. A visual interpretation was performed over these points: 428 points were classified as true crowns, each of which was manually delineated, and 561 were marked as non-crowns. Within the true crowns, the average area was 15.23 m², the area ranged from 3.18 m² to 567.55 m², and most values ranged from 5.43 m² to 133.85 m² (the 5th and 95th percentiles, respectively).
The objective of the assessment was to measure the algorithm's performance in terms of tree crown detection accuracy and tree crown delineation accuracy. First, the detection accuracy was verified to evaluate the Mask R-CNN's ability to correctly detect a tree crown in the tropical forest region. The confusion matrix was applied for this purpose. The confusion matrix is a statistical technique, in our study made up of two rows and two columns, which reports the number of true positives (a true crown is detected by the algorithm), false positives (a crown is detected by the algorithm where there was no crown), true negatives (no crown is detected where there was no crown), and false negatives (no crown is detected where there was a crown) [13]. Furthermore, for a tree crown detected by Mask R-CNN to be classified as a true positive, at least 50% of its pixels must be correctly classified; when fewer than 50% of the pixels are correctly identified, the tree crown is classified as a false negative. From the confusion matrix, we computed the Kappa index [62] and the overall accuracy; these metrics were obtained to check the algorithm's ability on the detection problem.
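As a worked example of these detection metrics, the snippet below computes the overall accuracy and Cohen's Kappa from the 2 × 2 confusion matrix counts; fed with the counts reported later in Table 5, it reproduces the 96% and 0.919 values of this study.

```python
def detection_metrics(tp, fp, fn, tn):
    """Overall accuracy and Cohen's Kappa for a 2x2 (crown / no-crown)
    confusion matrix."""
    total = tp + fp + fn + tn
    observed = (tp + tn) / total  # overall accuracy
    # chance agreement computed from the row/column marginals
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# counts from Table 5: 395 TP, 6 FP, 33 FN, 555 TN
overall, kappa = detection_metrics(tp=395, fp=6, fn=33, tn=555)
print(f"overall accuracy = {overall:.0%}, kappa = {kappa:.3f}")  # 96%, 0.919
```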
Other metrics were applied to evaluate the algorithm from the perspective of the automatic delineation of the tree crowns. Using the true positive results, the Mask R-CNN delineation and the manual delineation (ground truth) were compared, and the following metrics were computed. The pixel excess (number of pixels of the segmented crown outside the true crown) and the pixel deficit (number of pixels inside the true crown missing from the segmented crown) were computed [13]. The intersection over union (IoU) is the bounding box area overlapped by the manual delineation and the Mask R-CNN delineation (intersection area) divided by the area of the union of the two bounding boxes (union area). An object is considered correctly delineated when its IoU is ≥50%. The Recall is the intersection area divided by the bounding box area of the manually delineated crown. The Precision is the intersection area divided by the bounding box area of the object delineated by the proposed algorithm. To clarify, Figure 8 shows how the Recall, Precision, and IoU values are obtained for each delineated crown. Recall and Precision yield the F1 score, computed as in Equation (1):

F1 = 2 × (Precision × Recall) / (Precision + Recall). (1)

The F1 score, Recall, Precision, and IoU are traditional metrics for evaluating object delineation algorithms, and higher values mean better algorithm results.
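The bounding-box-based definitions above translate directly into code. The sketch below, using Shapely (an assumption, since any polygon library would do), computes Recall, Precision, IoU, and the F1 score of Equation (1) for one matched pair of crowns.

```python
from shapely.geometry import box, Polygon

def delineation_metrics(truth: Polygon, predicted: Polygon):
    """Recall, Precision, IoU, and F1 as defined in the text, computed on
    the bounding boxes of a hand-delineated crown and the matching
    Mask R-CNN crown."""
    bt = box(*truth.bounds)       # bounding box of the manual delineation
    bp = box(*predicted.bounds)   # bounding box of the predicted crown
    inter = bt.intersection(bp).area
    union = bt.union(bp).area
    recall = inter / bt.area
    precision = inter / bp.area
    iou = inter / union
    f1 = 2 * precision * recall / (precision + recall) if inter > 0 else 0.0
    return recall, precision, iou, f1
```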

Detection Accuracy
From the evaluation dataset of 428 tree crowns, the algorithm correctly detected 395 (92.3%) (true positives), and only 33 (7.7%) were not detected (false negatives). Within the 561 non-crown points (for example, shadows), 555 (98.9%) were correctly classified (true negatives), and just 6 points (1.1%) were classified as tree crowns (false positives). These results are summarized in the confusion matrix (Table 5). From the confusion matrix (Table 5), we calculated the Kappa index, which was equal to 0.919, and the overall accuracy, equal to 96%. Within the group of 395 tree crowns classified as true positives, the mean percentage of correct pixels was 84%, and 345 tree crowns (81% of the evaluation crowns) had more than 70% of their pixels correctly detected.
Within the correctly detected crowns, 355 were intersected by just one tree crown. The average crown area in this group was 28 m², with areas ranging between 3.18 m² and 567.54 m². Forty were intersected by two or more crowns in the algorithm response; the areas in this group ranged between 11.83 m² and 405.10 m², with an average of 113.20 m². Table 6 presents the number of segments that intersect each crown and the respective frequency (e.g., the row showing 2 segments with a frequency of 33 means that 33 crowns from the evaluation set were intersected by 2 segments). The number of segments is the number of tree crowns that compose a single crown. Table 6. The number of segments intersecting each tree crown in the evaluation set, with the respective frequency (relative to the total number of detected crowns, i.e., 395). Figure 9 shows the number of crowns by the number of segments, as defined by the delineation algorithm.

Delineation Accuracy
The mean area of the 395 correctly detected tree crowns was 14 m² (56 pixels). The area ranged from 2.75 m² (11 pixels) to 333.5 m² (1334 pixels), and the majority of the values ranged from 5.5 m² (22 pixels) to 127 m² (508 pixels) (5th and 95th percentiles, respectively). The relationship between the tree crown area (in pixels) from the evaluation set and the Mask R-CNN delineation had a coefficient of determination (R²) of 0.9312 (see Figure 10).

The comparison between the area (in pixels) of each tree crown from the evaluation set and each delineated crown resulting from the Mask R-CNN was used to calculate the pixel deficit (see Figure 11A) and excess (see Figure 12A). The mean pixel deficit was 16.6%, and the number of tree crowns with pixel deficit was 188 (47.6% of the 395 detected crowns). The average area of tree crowns with pixel deficit was 49 m². From the graphical analysis in Figure 11A, it is possible to note that there was no association between the validation crown size and pixel deficit (R² equal to −0.005). Within the tree crowns with pixel deficit, the Mask R-CNN obtained an average area accuracy (the percentage of each evaluation crown's area correctly determined) of 77.7%, and 147 tree crowns (78.2% of the 188 with pixel deficit) had an area accuracy over 70%; see Figure 11B.

The mean pixel excess was 25.2%, and the number of segmented crowns in this group was 196 (49.6% of the 395 detected crowns). There was no association between the validation tree crown size and pixel excess (see Figure 12A). The average area of tree crowns with pixel excess was 25 m². The black dashed line in Figure 12A corresponds to an approximation of the relationship between pixel excess and tree crown area (from the evaluation set) by a linear model, with an R² equal to 0.01. A total of 11 (2.8%) tree crowns had no pixel deficit or excess. The average area accuracy within the tree crowns with pixel excess was 92.1%, and 191 (97.4% of the 196 with pixel excess) had an area accuracy over 70%; see Figure 12B. The distribution of pixel excess and deficit was approximately normal, with a mean of −6.1 pixels and a standard deviation of 65.5 pixels (see Figure 13).

Over a subset of the Santa Genebra Reserve image (Figure 14A), the Mask R-CNN was able to delineate (Figure 14B) and identify (assign a specific ID, represented as different fill colors; see Figure 14C) most of the visually noticeable tree crowns. This region covers 40,501.5 m² (around 4 ha), and the deep learning algorithm detected and delineated 1283 tree crowns. The subset image had a large number of trees with different crown sizes: the largest tree crown delineated by the algorithm had an area of 414 m², and the smallest had an area of 2 m². The average area of the delineated tree crowns in this region was 12.39 m², and most crown areas ranged from 2.5 m² to 29.45 m² (the 5th and 95th percentiles, respectively). The total number of tree crowns delineated by Mask R-CNN in the Santa Genebra Forest was 59,062 (see Figures S1 and S2 in the Supplementary Materials).

Of the 395 tree crowns detected by Mask R-CNN, 349 (88%) obtained an IoU value greater than 0.5. For the evaluation set, the average IoU value was 0.61; within the group with IoU ≥ 0.5, the average IoU was 0.73. Figure 15 shows the distribution of IoU values and the relation between the IoU value and the area (in meters) of each crown from the evaluation set.
A total of 216 (55%) achieved an IoU greater than 0.7. This IoU value indicates high fidelity to the ground truth (i.e., the hand-annotated tree crowns); in cases where the IoU value was over 0.7, the bounding box overlap between the response and the ground truth is almost perfect (see Figure 16). In the Supplementary Materials, Figure S3 is provided, where it is possible to check the training patterns, the evaluation set, and the Mask R-CNN delineation. Considering the evaluation set applied, the Mask R-CNN obtained an average F1 score of 0.77 for all the tree crowns detected, and for the tree crowns with an IoU value ≥ 0.5, the average F1 score was 0.86 (see Table 7).

Discussion
Our research proposes the application of one of the most recently developed deep learning techniques for image instance segmentation, known as Mask R-CNN [45], to perform tree crown detection and delineation in a tropical forest using a very high-resolution image. The study site was the forest of the Santa Genebra Reserve, a well-preserved fragment of Atlantic rainforest with heterogeneous canopy cover [20].
One of the main difficulties faced when working with CNNs for image segmentation is building the training set, as thousands of images with the object of interest manually delineated are needed to properly train the network. For TCDD within a tropical forest, the difficulty of obtaining images with all tree crowns hand-delineated is high due to the environment's complexity (e.g., the number of tree crowns and species), even using very high spatial resolution images. As an alternative, this research uses an algorithm for the creation of synthetic forests. Using the proposed algorithm together with a set of hand-delineated tree crowns, it was possible to construct thousands of synthetic forest images, as the algorithm creates images with tree crown overlap while keeping the correct delineations. There is also the possibility of varying the number of tree crowns within the image, thereby allowing the creation of regions with a variety of canopy densities. Furthermore, each new tree crown overlay is a new pattern to be input to the neural network during the training process, which can increase its capacity to detect tree crowns within a forest and helps to avoid overfitting during the learning process.
The main advantages of the method proposed in this article include the ability to detect and delineate tree crowns within a tropical forest with high accuracy while working only with an RGB satellite image, which facilitates its reproduction in other areas of interest. Our findings confirmed that the CNN-based model is able to accurately detect and delineate tree crowns in a highly heterogeneous tropical forest. In the subsections below, we discuss in greater detail the model's performance, limitations, and perspectives.

TCDD Detection Performance
With Kappa index and global accuracy values for tree crown detection of 0.919 and 96%, respectively (Table 5), the proposed method proves to be useful for application over tropical forests. Using a threshold of 70% of pixels correctly detected, a total of 345 tree crowns were considered true positives, and the Kappa index and global accuracy were 0.813 and 90%, respectively. Instead of the number of pixels correctly detected, the IoU value can also be used to evaluate the detection accuracy. The research developed by [44] considered computing the metrics using only the results with a minimum IoU value of 0.5 to be more stringent. Our research detected 349 tree crowns with an IoU value ≥ 0.5, while 79 had an IoU value < 0.5 or were not detected. Using these values, the Kappa index obtained was 0.821 and the global detection accuracy was 90%.
Compared with a recent study of the same region, our research achieved better results: Wagner et al. [13] obtained (for crown detection) a Kappa index value of 0.70 and a global accuracy of 85%. The research developed by Wagner et al. [13] proposed an algorithm based on edge detection and region growing to perform the TCDD.
In the research developed by Larsen et al. [63], which compared six different tree crown detection approaches, global accuracy results ranged from 32.9% (a region with high crown density) to 99.7% (a tree planting region). The six algorithms used in that research were region growing, the treetop technique, template matching (but not a machine learning approach), scale-space, Markov random fields, and the marked point process. Thus, our detection results are close to the best values obtained, with the difference that the results of our research were obtained over a highly diverse tropical forest region, while the results of Larsen et al. [63] were obtained over coniferous forests. Table 8 summarizes the detection results obtained by Larsen et al. [63] in the region with high crown density, the detection result obtained by Wagner et al. [13], and the detection results of our research. Table 8. Comparison between the detection accuracy in the research developed by [63] (region with high crown density), Wagner et al. [13], and our research, considering three situations: tree crowns with pixels correctly detected (PCD) ≥ 50%; crowns with PCD ≥ 70%; and tree crowns with IoU ≥ 0.5.

Research | Algorithm | Detection Accuracy
Larsen et al. [63] | region growing | 59.2%
Larsen et al. [63] | treetop technique | 52.3%
Larsen et al. [63] | template matching | 52.6%
Larsen et al. [63] | scale-space | 32.9%
Larsen et al. [63] | Markov random fields | 47.5%
Larsen et al. [63] | marked point process | 49.2%
Wagner et al. [13] | edge detection and region growing | 85%
Our research (PCD ≥ 50%) | Mask R-CNN | 96%
Our research (PCD ≥ 70%) | Mask R-CNN | 90%
Our research (IoU ≥ 0.5) | Mask R-CNN | 90%

Another important result concerns the Mask R-CNN's robustness in avoiding tree crown over-segmentation and under-segmentation. Over-segmentation occurs when the Mask R-CNN splits one example from the evaluation set into two or more tree crowns; under-segmentation occurs when two or more examples from the evaluation set are detected by the neural network as just one tree crown. Of the total tree crowns detected by Mask R-CNN (395), 89.9% were intersected by one segment; in other words, they were composed of only one tree crown (one segment) and hence did not suffer from over-segmentation (see Table 6). Only 10.1% were intersected by (i.e., composed of) two or more segments (tree crowns). As shown in the graph in Figure 9, the number of segments that intersect a tree crown tends to increase with the dimensions of the crown. This occurs because the great majority of tree crowns obtained for the neural network training have a crown area smaller than 20 m². This problem could be circumvented by introducing larger tree crowns into the training set. Under-segmentation did not occur in this research.

TCDD Delineation Performance
In this research, 88% of the tree crowns from the evaluation dataset obtained IoU values over 0.5 (the overall delineation accuracy); for these tree crowns, the Recall, Precision, and F1 score values were 0.81, 0.91, and 0.86, respectively. The Recall, Precision, and F1 score considering all tree crowns detected by Mask R-CNN were, respectively, 0.68, 0.89, and 0.77. For semantic segmentation problems using deep learning, this result is considered significant. In the research developed by Weinstein et al. [44], a pipeline was proposed using hand-annotated examples and LiDAR data to train a CNN to perform the TCDD; the authors evaluated their research using three different approaches for the CNN's training: the first used the hand-annotated data during the training phase, the second used a self-supervised model, and the third, called the full model, combined the two previous strategies. The study developed by Gomes et al. [64] performed the TCDD over sub-meter satellite imagery and also applied the Recall, Precision, and F1 metrics for result evaluation. The algorithm proposed by [64] was based on a marked point process (MPP) to detect and delineate individual tree crowns, and that research obtained average values of 0.67, 0.60, and 0.63 for Recall, Precision, and F1 score, respectively, but none of the study sites used had the tree crown complexity of a tropical rainforest. Table 9 summarizes the Recall, Precision, and F1 score values of recent research that performed TCDD over sub-meter satellite imagery. Table 9. Average values of the evaluation metrics Recall, Precision, and F1 score of recent research that also conducted the TCDD.

Research | Recall | Precision | F1 Score
Silva et al. [65], obtained from [44] | 0.14 | 0.07 | 0.09
Gomes et al. [64] | 0.67 | 0.60 | 0.63

With the overall accuracy rate of 88%, our method proves useful for performing the TCDD, as recent research on TCDD algorithm development has obtained similar accuracy values: Tochon et al. [25] and Singh et al. [15] reported accuracies of 68% and 69.2%, respectively, Wagner et al. [13] reported 80%, and Dalponte et al. [66] achieved 88.8%. Moreover, the relationship between the tree crown area (in pixels) of the examples from the evaluation set and the tree crown area (in pixels) obtained by the proposed segmentation algorithm (Figure 10) has an R² of 0.9312, which demonstrates the robustness of the Mask R-CNN in estimating crown dimensions and, hence, in performing the tree crown delineation.
Dalponte et al. [66] reported that the size of small treetops tends to be underestimated when optical images are applied in the TCDD process. In our study, this problem did not occur, since there is no relationship between pixel deficit and tree crown area, and the pixel deficit occurs mainly in larger tree crowns (see Figure 11). In addition, our results show that the tree crowns with the smallest area have the best IoU values; see Figure 15A.
However, the proposed algorithm tended to overestimate the tree crown area, because the number of crowns with pixel excess is greater than the number with pixel deficit (see Figures 11 and 12). The overestimated areas occur mainly in tree crowns with an area of less than 12.5 m² (see Figure 12), but the crown delineation result was not significantly impaired, because the pixel difference between the segmented tree crowns and the evaluation tree crowns had a normal distribution with a mean of −6 pixels (see Figure 13) and an IoU value greater than 0.5 for 88% of the examples. In addition, the F1 score achieved significant values, over 0.70. In this research, the total number of tree crowns detected and delineated by Mask R-CNN was 59,062, with tree crown areas ranging from 2 m² to 413.75 m² (see Figure S1 in the Supplementary Materials). The average crown area was 8.75 m², and most of the areas ranged from 3.75 m² to 34 m² (the 5th to 95th percentiles, respectively). In the research developed by Wagner et al. [13], which applied a traditional TCDD algorithm to the Santa Genebra Forest Reserve, only 23,278 tree crowns were detected and delineated.

Shade Effect-Limitations and How to Resolve Them
One of the main issues for the accuracy of TCDD algorithms is the shade effect [13,66], and, in our analysis, some tree crowns present in shadowed regions were ignored by the model (Figure 14). According to Dalponte et al. [66], the shadow effect in TCDD algorithms could be addressed using LiDAR data. Considering TCDD performed by a neural network approach, another strategy to decrease the shadow effect when using multi-spectral images could be to feed the model with labeled images of trees located in shadowed regions during the neural network training.

How to Deal with the Leaf Fall Effect
The image used in this study was taken during the wet season, when all crowns are likely foliated. However, the Santa Genebra reserve is a seasonal semi-deciduous forest with leaf loss ranging from 20% to 50% during the dry season [20], plus seasonal changes in spectral characteristics [67]. These characteristics could increase the presence of shadow and decrease the algorithm's accuracy. As the algorithm presented in this research is based on a supervised neural network, presenting images with leafless trees during training could be an alternative to reduce algorithm errors when handling leafless tree crowns. Moreover, with this alternative, the algorithm could operate over images taken throughout the year. Further work is needed to improve the model's training sample and to limit the effects of shade and seasonal changes in reflectance.

Algorithm Limitations
The prediction of the algorithm was made on patches of 128 × 128 pixels in a grid approach; between the patches, there is an overlap of two columns and two rows of pixels, and the predictions for the patches were then merged. After the prediction, the algorithm that performs the tree crown merging is applied to unite the crowns that were split by the grid. A sketch of the tiling step is given below.
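This is a minimal sketch of the prediction grid, assuming the two-pixel overlap described above; border patches smaller than 128 × 128 would need padding before being passed to the network, and the function name is ours.

```python
def split_into_patches(image, patch=128, overlap=2):
    """Tile an image into patch x patch windows whose origins advance by
    (patch - overlap) pixels, so adjacent patches share `overlap`
    columns/rows of pixels; yields (row, col, window) for prediction."""
    step = patch - overlap
    rows, cols = image.shape[:2]
    for r in range(0, rows, step):
        for c in range(0, cols, step):
            yield r, c, image[r:r + patch, c:c + patch]
```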
The grid size (128 × 128 pixels) is not a requirement of the algorithm but was chosen to deal with the limitations of the GPU hardware, mainly its memory (hardware configuration in Table 3). More recent hardware could make the prediction with a larger grid size.
A visual analysis of Figure 14B shows that the algorithm that performs the tree crown merging works correctly. However, if a different grid size is adopted, this algorithm must be modified, because it was developed to perform the merging under the constraints of this research.

Algorithm's Advance
Since the launch of Ikonos in 1999, a great number of very high-resolution optical images from different satellites (e.g., WorldView, GeoEye, and QuickBird) have become available for the study of ITCs. The development of automatic techniques to perform the TCDD using imagery from these satellites has attracted attention from researchers in image processing, remote sensing, and forestry [26,64].
Most of the algorithms that perform TCDD on very high-resolution satellite imagery were developed to work on specific regions of temperate forests [26]. Typically, these algorithms use techniques such as region growing, edge detection, local maximum, and template matching (with a specific geometric form, such as an ellipse) [64]. Applying these algorithms over tropical forests may be very difficult, as they could need extensive reconfiguration: in these regions, the ITCs are not homogeneous (sizes and shapes differ), and their spectral characteristics and texture vary widely. However, the application of our technique could be extended to any type of region (for example, temperate forests or other tropical forests), because it depends only on some hand-annotated ITCs to feed the synthetic image creation algorithm.
Tropical forests can be composed of a great number of tree crowns with an extensive variety of colors, sizes, and shapes, and even using very high-resolution images, it is difficult (or impossible) to hand-annotate all ITCs within a specific region. For the training of a delineation algorithm based on a CNN approach, all the ITCs within the training set imagery must be annotated, or else the algorithm does not converge to the desired response. Therefore, our approach, which uses the creation of synthetic images, could be an alternative to overcome the need to hand-annotate all tree crowns within a tropical forest region. Moreover, our technique proved to be useful, achieving an F1 score and average IoU value of 0.86 and 0.73, respectively (considering the Mask R-CNN responses with IoU ≥ 0.5).

Application Perspectives
An important aspect of tropical forests is that their biomass is concentrated in large trees [68]. The research developed by Blanchard et al. [69] shows that the relationship between diameter at breast height and crown area for individual trees within tropical regions is stable, with no significant variation. From this perspective, our delineation algorithm could be applied to biomass estimation and large-scale assessments using optical imagery. However, more detailed work applying our method, with validation analyses over different forested areas, is still needed to support this idea. For example, field data could be used to calibrate our algorithm, whose output could then be used to estimate the biomass, and biomass change, of large areas using high-resolution satellite images.
Another possible application of our TCDD algorithm is related to species mapping. In research developed at the same forest site by Wagner et al. [13] and Ferreira et al. [67], support vector machines (SVMs) were successfully applied after the delineation to determine the species of each delineated tree crown, showing that spectral information can be used to predict species. The Mask R-CNN could also be applied to species mapping. The Mask R-CNN works with two distinct modules: (i) one to determine the bounding box of the object of interest; and (ii) another to determine the pixels belonging to each object of interest (see Figure 4). These two modules are formed by sets of convolutional layers, also known as convolutional filters. In the module formed by the Faster R-CNN, the convolutional filters closest to the input layers are responsible for detecting low-level features, whereas the last convolutional filters are more specialized and detect high-level features. Thus, an analysis of the responses of the high-level filters of Mask R-CNN (mainly those that compose the Faster R-CNN) could be applied to identify the filters that are activated during prediction, and the feature set obtained from these filters could be applied to perform a more specialized image segmentation, such as species classification. A review of CNN filter analysis can be found in Bilal et al. [70]. For example, after the delineation, high-level feature information could be obtained from the high-level filters and used for species recognition: since each tree is delineated as a unique object by our algorithm, another machine learning technique (another CNN, or even the Mask R-CNN) could be used to analyze that filter information and identify the tree species. However, as with the biomass estimation perspective, more detailed work on this issue should be developed to support this assumption.
Another potential application for our CNN-based TCDD algorithm is in forest dynamics studies related to tree mortality and logging detection (either legal or illegal). As an example, the study developed by Dalagnol et al. [71] demonstrated that multi-temporal very high-resolution optical imagery (e.g., WV-2 and GeoEye-1) allows the semi-automatic detection of individual tree crown loss with moderate accuracy (>60%). However, their tree loss detection was based on a simple watershed-based method for TCDD and on analyzing the spectral difference between the two dates' images. If such a study used an improved TCDD method such as ours, it could produce more reliable estimates of the spectral difference between images, because of the improved tree detection and segmentation and the minimal inclusion of shadows in the tree crown segments. Moreover, a more precise TCDD would even allow a direct spatial comparison between dates, comparing which objects are present or absent, which was not conducted in that study.

Conclusions
This work presents the application of a state-of-the-art CNN model for instance segmentation, Mask R-CNN, to TCDD in very high-resolution RGB satellite images. Additionally, we developed a methodology to produce simulated forest images, which enabled the training of the CNN model. The main advantages of the proposed method are (i) easing the production of training sample images for dense forest and (ii) obtaining individual TCDD with an unprecedentedly high accuracy for this highly diverse tropical forest (the detection accuracy and Kappa index were 96% and 0.919, respectively, considering a PCD ≥ 50%; the average F1 score for all tree crowns detected was 0.77; considering IoU ≥ 0.5, the detection accuracy, Kappa index, and F1 score were 90%, 0.821, and 0.86, respectively). Moreover, our research proposes the use of several metrics already applied in previous studies as guidelines for the evaluation of future TCDD research.
Further work is needed to test the algorithm in other tropical forest regions and with images of different spatial resolutions, such as images from the WorldView-3 satellite, and to use the proposed method for tree species mapping and biomass estimation in tropical forests. However, the use of this method is by no means restricted to tropical forests, and it can be used in other regions. Moreover, the method used for the generation of training samples could be applied to create training and validation sets elsewhere. Another line of future work relates to increasing the accuracy of training pattern gathering: data from a LiDAR sensor or field data could be used to improve this gathering, and the results obtained could be used to validate this research and even improve its results.