Comparison of Classical Methods and Mask R-CNN for Automatic Tree Detection and Mapping Using UAV Imagery

Abstract: Detecting and mapping individual trees accurately and automatically from remote sensing images is of great significance for precision forest management. Many algorithms, including classical methods and deep learning techniques, have been developed and applied for tree crown detection from remote sensing images. However, few studies have evaluated the accuracy of different individual tree detection (ITD) algorithms and their data and processing requirements. This study explored the accuracy of ITD using the local maxima (LM) algorithm, marker-controlled watershed segmentation (MCWS), and Mask Region-based Convolutional Neural Networks (Mask R-CNN) in a young plantation forest with different test images. Manually delineated tree crowns from UAV imagery were used to assess the accuracy of the three methods, followed by an evaluation of the data processing and application requirements of each method. Overall, Mask R-CNN can best use the information in multi-band input images for detecting individual trees. The results showed that the Mask R-CNN model with a multi-band combination produced higher accuracy than the model with a single-band image, and the RGB band combination achieved the highest accuracy for ITD (F1 score = 94.68%). Moreover, the Mask R-CNN models with multi-band images provided higher accuracies for ITD than the LM and MCWS algorithms. The LM and MCWS algorithms also achieved promising accuracies for ITD when the canopy height model (CHM) was used as the test image (F1 score = 87.86% for the LM algorithm and 85.92% for the MCWS algorithm). The LM and MCWS algorithms are easy to use and have lower computational requirements, but they are unable to identify tree species and are limited by algorithm parameters, which need to be adjusted for each classification. Deep learning, with its end-to-end learning approach, is very efficient and capable of deriving information from multi-layer images, but an additional training set is needed for model training, robust computer resources are required, and a large number of accurate training samples are necessary. This study provides valuable information for forestry practitioners to select an optimal approach for detecting individual trees.


Introduction
The management of young forests has long-term impacts on forest establishment [1]. In the context of extensive tree planting programs, which have been implemented in many the predecessor-Faster R-CNN, Mask R-CNN has the ability to execute the classification and the segmentation parts independently [31]. It can predict an exact mask within the bounding box. Therefore, Mask R-CNN has the potential to become one of the most widely used algorithms for tree crown detection and delineation in the future. For instance, the Mask R-CNN approach was used to detect and segment coconut trees, achieving an overall mean average precision of 91% [32]. Combined with object-based image analysis, Mask R-CNN was used to segment scattered vegetation in drylands using high-resolution optical images [33]. Relying on Mask R-CNN, Braga et al. reported promising results for tree crown delineation in tropical forests from high-resolution satellite images, with a total of 59,062 tree crowns delineated (F1 score = 0.86) [34]. In addition, Mask R-CNN is capable of detecting other tree attributes simultaneously. For example, tree crown and height were detected and delineated automatically and simultaneously by Hao et al. in a young forest plantation [35].
The LM algorithm, MCWS, and Mask R-CNN are proven methods for tree detection and mapping. However, based on the existing literature, few studies have compared the performance of these algorithms for tree crown detection. This study evaluated these algorithms in a newly forested plantation and assessed their accuracy and convenience (potential data processing time) for detecting individual tree crowns. The aim of this research is to compare the accuracies of the different algorithms and to evaluate their data and processing requirements for detecting individual trees.

Study Site
The study site comprised a 4-ha plantation forest located in the Pushang national forest farm, Shunchang County, Fujian Province, China (Figure 1). The plantation forest is situated in a mountain range with an elevation between 174 m and 226 m, and an average slope of 27.8°. The area is mainly covered by several types of Chinese fir, which was planted in 2018, with some replanting in 2019 due to tree mortality. The Chinese fir has a rounded tree crown, a height between 1 m and 4 m, and a north-south crown width between 0.7 m and 2.8 m in the study site. The tree spacing is 2.0 m × 1.5 m, with 5 to 7 columns of Chinese fir in each group. A single row of broad-leaved trees was used to isolate each Chinese fir type.

Image Acquisition and Preprocessing
A DJI Phantom4-Multispectral (https://www.dji.com/p4-multispectral, accessed on 1 October 2021) was used to obtain site UAV imagery in December 2019. The integrated camera has six imaging sensors (1600 × 1300 pixels), including five multispectral sensors (blue (450 ± 16 nm), green (560 ± 16 nm), red (650 ± 16 nm), red edge (730 ± 16 nm), and near infrared (NIR) (840 ± 26 nm)) and one RGB sensor [36]. The flight parameters included a flight altitude of 30 m above ground with an 85% forward overlap and 80% side-lap. During the flight, the real-time kinematic (RTK) positioning and navigation system on the UAV was linked to the D-RTK 2 Mobile GPS station to derive high-precision UAV waypoint positions. An XY precision of 2 cm and a Z precision of 3 cm were achieved. A total of 5616 images were collected for the study site. Next, the DJI Terra software was used to generate the digital surface model (DSM) and ortho-mosaic images. The generated images (spatial resolution of 0.76 cm for the ortho-mosaics and 1.47 cm for the DSM) were resampled to 2.0 cm per pixel. The DJI Terra software does not generate a separate point cloud, so it is not possible to compute the CHM directly. Therefore, non-forest DSM locations were identified, and the altitude of the corresponding locations was extracted to create the digital terrain model (DTM) using interpolation. A total of 5518 non-forest locations were randomly selected, and the CHM with a spatial resolution of 2.0 cm was created by subtracting the DTM from the DSM [37].
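The CHM derivation described above can be illustrated with a minimal sketch. The text states only that the DTM was obtained "using interpolation"; the inverse-distance weighting used here is an assumption chosen for brevity, and the function names, the toy DSM, and the ground samples are hypothetical.

```python
from math import hypot


def idw_height(x, y, ground_points, power=2.0):
    """Interpolate terrain height at (x, y) from (gx, gy, gz) ground samples.

    Inverse-distance weighting is an assumed interpolator, not the one
    documented in the study.
    """
    num = den = 0.0
    for gx, gy, gz in ground_points:
        d = hypot(x - gx, y - gy)
        if d == 0.0:
            return gz  # the query cell is itself a ground sample
        w = 1.0 / d ** power
        num += w * gz
        den += w
    return num / den


def canopy_height_model(dsm, ground_points):
    """Return CHM = DSM - DTM, with the DTM interpolated cell by cell."""
    chm = []
    for row_idx, row in enumerate(dsm):
        chm.append([surface - idw_height(col_idx, row_idx, ground_points)
                    for col_idx, surface in enumerate(row)])
    return chm


# Hypothetical 3 x 3 DSM in metres; the four corners are bare ground at 200 m.
dsm = [[200.0, 201.5, 200.0],
       [201.0, 203.0, 201.0],
       [200.0, 201.5, 200.0]]
ground = [(0, 0, 200.0), (2, 0, 200.0), (0, 2, 200.0), (2, 2, 200.0)]
chm = canopy_height_model(dsm, ground)  # the centre cell is about 3 m tall
```

On real rasters the same subtraction would be done with array operations after the DTM surface has been interpolated once, rather than cell by cell.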

Tree Crown Delineation
The aim of the tree crown delineation was to provide reference data for assessing the accuracy of the different algorithms. A previous study reported that it was possible to accurately delineate tree crowns manually [35]. In this study, tree crowns were manually delineated by visual interpretation in a GIS interface by a person with extensive forestry and remote sensing expertise and double checked by another expert in order to rule out misinterpretation [38]. A total of 1818 Chinese fir were manually delineated from the UAV imagery in the study site and divided into two regions (Figure 1). Region 1 (1019 Chinese fir) was used to train the Mask R-CNN model, and region 2 (797 Chinese fir) served as baseline data for comparison with the automated detection algorithms.

Individual Tree Crown Detection (ITD)
In this study, ITD was performed using the LM algorithm, the MCWS algorithm, and Mask R-CNN. Six single-band images (blue, green, red, red edge, NIR, and CHM) were used as test images for all three algorithms. Additionally, two band combinations (RGB band combination and Multi band) and the visible-spectrum RGB imagery from the RGB sensor were used as input images for Mask R-CNN model training. Thus, a total of six test images for the LM and MCWS algorithms and nine test images for Mask R-CNN were used to detect individual trees (Table 1).

Table 1. Test images used for each algorithm.
Test image | Bands | LM | MCWS | Mask R-CNN
Blue | Blue | + | + | +
Green | Green | + | + | +
Red | Red | + | + | +
Red edge | Red edge | + | + | +
NIR | Near-infrared | + | + | +
CHM | Canopy height | + | + | +
RGB band combination | Blue, green, red | - | - | +
Visible-spectrum RGB imagery | Blue, green, red | - | - | +
Multi band | Blue, green, red, red edge, near-infrared | - | - | +
Note: "+" represents the image was used for testing, "-" represents the image was not used for testing.

Local Maxima (LM) Algorithm
Figure 2 shows a workflow example for ITD using the LM algorithm. Because of the shape of the Chinese fir crown, the pixels corresponding to each tree crown have higher height values or spectral reflectance than the surrounding area (a mountain-like structure) in the image [5]. Therefore, a circular smoothing window with a radius of 10 cm (5 pixels) was first used to smooth the image. Smoothing helps to reduce false positive detections, since the pixels of some individual crowns are not homogeneous in high spatial resolution imagery [39]. Next, the smoothed image was used to generate the local maximum image using the focal statistics tool with a fixed circular window of 50 cm radius (25 pixels). Then, the possible treetops were identified by subtracting the smoothed image from the local maximum image; treetops are the pixels where the difference is 0. Finally, possible treetops with a height lower than 0.3 m were filtered out in order to remove outliers [37]. The same processing was applied to all six test images and was implemented in ArcGIS 10.8 (ESRI, Redlands, CA, USA). Additionally, an automated detection model for ITD using the LM algorithm was developed with ArcGIS in this study (Appendix A, Figures A1 and A2).
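The LM steps (circular smoothing, focal maximum, zero-difference test, height filter) can be sketched in a few lines. This is a pure-Python illustration under stated assumptions, not the ArcGIS model from Appendix A: the function names and the toy CHM are hypothetical, and the toy radii of 1 and 2 pixels stand in for the 5-pixel and 25-pixel windows used in the study.

```python
def circular_offsets(radius):
    """Offsets (dr, dc) of all cells within `radius` pixels of the centre."""
    r = int(radius)
    return [(dr, dc) for dr in range(-r, r + 1) for dc in range(-r, r + 1)
            if dr * dr + dc * dc <= radius * radius]


def focal(image, radius, reducer):
    """Apply `reducer` (mean or max) over a circular moving window."""
    rows, cols = len(image), len(image[0])
    offsets = circular_offsets(radius)
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            values = [image[r + dr][c + dc] for dr, dc in offsets
                      if 0 <= r + dr < rows and 0 <= c + dc < cols]
            out[r][c] = reducer(values)
    return out


def mean(values):
    return sum(values) / len(values)


def detect_treetops(chm, smooth_radius, window_radius, min_height):
    """Return (row, col) cells where the smoothed image equals its focal maximum."""
    smoothed = focal(chm, smooth_radius, mean)
    local_max = focal(smoothed, window_radius, max)
    tops = []
    for r, row in enumerate(smoothed):
        for c, value in enumerate(row):
            # A difference of 0 marks a candidate treetop; low candidates are dropped.
            if local_max[r][c] - value == 0 and chm[r][c] >= min_height:
                tops.append((r, c))
    return tops


# Hypothetical CHM (metres) with two crowns centred at (2, 2) and (2, 6).
chm = [[0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 1, 1, 0, 1, 1, 1, 0],
       [0, 1, 3, 1, 0, 1, 2, 1, 0],
       [0, 1, 1, 1, 0, 1, 1, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0]]
tops = detect_treetops(chm, smooth_radius=1, window_radius=2, min_height=0.3)
# tops == [(2, 2), (2, 6)]
```

Here the height filter reads the CHM directly; for the spectral test images the candidate locations would be checked against the CHM in the same way.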

Marker-Controlled Watershed Segmentation (MCWS) Algorithm
The MCWS algorithm, which is based on the concept of flow direction and watershed boundaries, was used to delineate Chinese fir crown extents in this study (Figure 3) [14,23]. First, the image was inverted so that the treetop locations appear as local minima and each tree crown becomes a water basin. Next, a circular smoothing window with a radius of 10 cm (5 pixels) was used to smooth the inverted image in order to remove outliers. The smoothed image was then used to generate the flow directions, and the local minimum image was extracted from the smoothed image with a fixed circular window size of 50 cm (25 pixels) [37]. Then, tree crown profiles were delineated based on the edge of the water in each basin using the watershed tool in ArcGIS 10.8, and the results were converted into polygons in shapefile format. Finally, the tree crown polygons were generated by intersecting the tree crown profiles with the area where the CHM exceeded 0.3 m, in order to remove the ground region [40]. The same processing was applied to all six test images and was implemented in ArcGIS 10.8. Additionally, an automated detection model for ITD using the MCWS algorithm was developed with ArcGIS in this study (Appendix A, Figures A3 and A4).
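The idea of growing one basin per treetop marker can be sketched as follows. Note that this sketch swaps in a priority-queue flooding scheme in place of the flow-direction and watershed tools in ArcGIS that the study used; the function name, the toy CHM row, and the marker positions are hypothetical, and smoothing is omitted for brevity.

```python
import heapq


def marker_watershed(chm, markers, min_height):
    """Grow one labelled region per marker over the inverted CHM.

    Cells are flooded in order of increasing inverted height, that is, from
    the treetops downwards. Cells at or below `min_height` are treated as
    ground and stay unlabelled (label 0).
    """
    rows, cols = len(chm), len(chm[0])
    labels = [[0] * cols for _ in range(rows)]
    heap = []
    for label, (r, c) in enumerate(markers, start=1):
        labels[r][c] = label
        heapq.heappush(heap, (-chm[r][c], r, c))  # negation inverts the surface
    while heap:
        _, r, c = heapq.heappop(heap)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            if labels[nr][nc] != 0 or chm[nr][nc] <= min_height:
                continue
            labels[nr][nc] = labels[r][c]
            heapq.heappush(heap, (-chm[nr][nc], nr, nc))
    return labels


# Hypothetical CHM (metres): two crowns in the middle row, ground elsewhere.
chm = [[0.0] * 9,
       [0.0, 1.0, 2.0, 1.2, 0.6, 1.0, 1.5, 1.0, 0.0],
       [0.0] * 9]
markers = [(1, 2), (1, 6)]  # treetops, e.g. from a local maxima search
labels = marker_watershed(chm, markers, min_height=0.3)
# labels[1] == [0, 1, 1, 1, 1, 2, 2, 2, 0]
```

The labelled regions correspond to the tree crown profiles; converting them to polygons is a separate vectorization step that is not shown.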

Training and Application of Mask R-CNN Model
The Mask R-CNN workflow for ITD is presented in Figure 4.

(1) Input Image Preparation
Mask R-CNN allows different combined input predictors as the input image. In order to test the optimal input image for Chinese fir crown detection, nine input images were used for Mask R-CNN model training (Table 1).

(2) Training Dataset Preparation and Model Training
Manually delineated tree crowns in region 1 (1019) were used as the training and validation sets, and manually delineated tree crowns in region 2 (797) were used as the test set. Next, the training and validation sets and each input image were used to generate the training dataset. Each input image was split into 256 × 256-pixel image tiles with a stride giving 50% overlap [41], to match the input constraints of the Mask R-CNN architecture and to ensure that every tree was captured in at least one image tile [2]. In addition, each image was rotated by 90°, 180°, and 270° for data augmentation, because the Mask R-CNN model needs an extensive training set [34]. In summary, nine training datasets were used in this study, and each training dataset contained 3384 image chips and 26,584 features.
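The tiling and rotation steps can be sketched as below. This is only an illustration of the geometry, assuming square tiles and a stride of half the tile size; the function names are hypothetical, and the chip export in the study was handled by the ArcGIS tooling rather than by code like this.

```python
def tile_origins(width, height, tile=256, overlap=0.5):
    """Top-left (x, y) corners of square tiles covering a width x height image.

    A 50% overlap corresponds to a stride of half the tile size. Partial
    tiles at the right and bottom edges are ignored in this sketch.
    """
    stride = int(tile * (1 - overlap))
    xs = range(0, max(width - tile, 0) + 1, stride)
    ys = range(0, max(height - tile, 0) + 1, stride)
    return [(x, y) for y in ys for x in xs]


def rotate90(tile):
    """Rotate a tile (a list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*tile[::-1])]


def augment(tile):
    """Return the tile plus its 90, 180, and 270 degree rotations."""
    rotations = [tile]
    for _ in range(3):
        rotations.append(rotate90(rotations[-1]))
    return rotations


origins = tile_origins(512, 512)       # 3 x 3 = 9 overlapping tiles
variants = augment([[1, 2], [3, 4]])   # four orientations of a toy 2 x 2 tile
```

With a 128-pixel stride, any crown narrower than 128 pixels (2.56 m at 2.0 cm per pixel) that lies away from the image border falls entirely inside at least one tile.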
Finally, the Mask R-CNN model was trained using each training dataset. A pre-trained ResNet-50 backbone was used for transfer learning. In order to prevent overfitting, the training and validation set was split so that 90% was used for model training and 10% was retained for validation. Moreover, training stopped early if the validation loss did not improve for five epochs [42]. Except for the different input images, the same processing procedures were used for all nine Mask R-CNN models, which were implemented in the ArcGIS API for Python. A laptop with an AMD Ryzen 9 CPU, 16 GB of RAM, and an Nvidia GeForce RTX 2060 GPU was used for model training.
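The early-stopping rule can be written as a short sketch. The function name and the validation-loss values are hypothetical and serve only to show when a patience of five epochs triggers; the study relied on the training routine of the ArcGIS API for Python rather than on custom code.

```python
def stop_epoch(validation_losses, patience=5):
    """Return (epoch at which training stops, best validation loss seen).

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs; epochs are counted from 1.
    """
    best = float("inf")
    stale = 0  # epochs since the last improvement
    for epoch, loss in enumerate(validation_losses, start=1):
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch, best
    return len(validation_losses), best


# Hypothetical losses: the best value (0.60) is reached at epoch 3.
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.60, 0.63, 0.64, 0.65, 0.66]
result = stop_epoch(losses)  # (8, 0.6): five epochs without improvement
```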

(3) Model Application
A total of nine models were applied to predict ITD in the corresponding input image. The output of each Mask R-CNN model was a vector file of tree crowns. Each tree crown has a confidence value between 0 and 1, which indicates the model's confidence that a tree exists. In this study, only tree crowns with a confidence score >0.2 were retained [42]. Where retained tree crowns overlapped, the crown with the higher score was kept and the lower-scoring one was removed using a non-maximum suppression algorithm [43].
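A minimal sketch of the confidence filter followed by non-maximum suppression is given below. It makes two assumptions that are not in the text: crowns are approximated by axis-aligned boxes (x1, y1, x2, y2) instead of polygons, and the overlap threshold `max_overlap` of 0.5 is a hypothetical value, since the study does not report one. All names and detections are illustrative.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def filter_detections(detections, min_score=0.2, max_overlap=0.5):
    """Drop low-confidence detections, then suppress overlapping ones.

    `detections` is a list of (box, score) pairs. `max_overlap` is an
    assumed threshold, not a value reported in the study.
    """
    candidates = sorted((d for d in detections if d[1] > min_score),
                        key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in candidates:
        # Keep a detection only if it does not overlap a higher-scoring one.
        if all(box_iou(box, kept_box) <= max_overlap for kept_box, _ in kept):
            kept.append((box, score))
    return kept


# Hypothetical detections: two overlapping crowns, one weak, one isolated.
detections = [((0, 0, 2, 2), 0.9), ((0.2, 0, 2.2, 2), 0.6),
              ((5, 5, 7, 7), 0.15), ((4, 0, 6, 2), 0.4)]
kept = filter_detections(detections)
# kept == [((0, 0, 2, 2), 0.9), ((4, 0, 6, 2), 0.4)]
```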

Accuracy Evaluation
Model performance was assessed using recall, precision, and F1 score (Equations (1)-(3)) [44]. Recall is the number of correctly identified trees divided by the number of trees delineated in the independent visual assessment. Precision is the number of correctly identified trees divided by the number of trees predicted by the algorithm. The F1 score is the harmonic mean of recall and precision and summarizes the overall accuracy.

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{1}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2}$$

$$F1 = \frac{2 \times \mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}} \tag{3}$$

where TP represents the correctly identified trees, FN represents the omitted trees, and FP represents the false positive detections (e.g., broad-leaved trees or weeds).

The MCWS algorithm and Mask R-CNN perform both tree crown detection and tree crown delineation. Therefore, the intersection over union (IoU) was used to evaluate the accuracy of the tree-crown polygons (Equation (4)). IoU is the ratio of the intersection area to the union area of the manual and predicted crown delineations [2,34,42].

$$\mathrm{IoU} = \frac{\mathrm{area}(B_{actual} \cap B_{predicted})}{\mathrm{area}(B_{actual} \cup B_{predicted})} \tag{4}$$

where B_actual is the set of crown polygons from the test set, and B_predicted is the set of crown polygons predicted by the MCWS algorithm or the Mask R-CNN model. The intersection and union operations represent the common area and the combined area of B_actual and B_predicted, respectively. For both algorithms, a crown was considered correctly delineated when its IoU was higher than 50% [42,45]. For Mask R-CNN, the confidence score additionally had to be higher than 0.2.
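The four measures can be checked with a short sketch. The counts used below (747 correct detections, 34 false positives, 50 omissions) are illustrative values chosen to be consistent with the recall and precision reported for the best model and the 797 reference crowns; they are not taken from Table 2. Crowns are represented as sets of raster cells for the IoU example, which is an assumption made for simplicity.

```python
def detection_scores(tp, fp, fn):
    """Return (recall, precision, F1) from detection counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * recall * precision / (recall + precision)
    return recall, precision, f1


def mask_iou(actual, predicted):
    """IoU of two crowns given as sets of (row, col) raster cells."""
    union = actual | predicted
    return len(actual & predicted) / len(union) if union else 0.0


# Illustrative counts (assumed, see the note above): 747 + 50 = 797 crowns.
recall, precision, f1 = detection_scores(tp=747, fp=34, fn=50)
# recall = 93.73%, precision = 95.65%, F1 = 94.68% after rounding

# Two hypothetical 4-cell crowns sharing 3 cells: IoU = 3 / 5 = 0.6 > 0.5,
# so the predicted crown would count as correctly delineated.
actual = {(0, 0), (0, 1), (1, 0), (1, 1)}
predicted = {(0, 1), (1, 0), (1, 1), (1, 2)}
iou = mask_iou(actual, predicted)
```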

Results
The results of individual tree detection using the LM algorithm, MCWS algorithm, and Mask R-CNN are presented in Table 2. For the LM algorithm, the 797 manually delineated tree crown polygons in region 2 were used as the reference. A single point within a tree-crown polygon was regarded as a correctly identified treetop. More than one point within a tree-crown polygon and points outside tree-crown polygons were considered false detections. A tree-crown polygon with no point was regarded as undetected (Figure 5). The highest accuracy was achieved when the LM algorithm used the CHM (F1 score = 87.86%), followed by the red edge image (F1 score = 86.85%) and the NIR image (F1 score = 86.58%). The lowest accuracies for the LM algorithm occurred with the green image (F1 score = 71.95%) and the red image (F1 score = 66.67%).

The results of the MCWS algorithm with the different test images are also shown in Table 2. The total number of detected trees in region 2 ranged from 1007 to 1543 depending on the test image, while the number of correctly identified Chinese fir (IoU > 50%) ranged from 351 to 775. ITD using the CHM as the test image achieved the highest accuracy (F1 score = 85.92%; an example is shown in Figure 6), followed by the red edge image (F1 score = 81.35%) and the NIR image (F1 score = 79.65%). The lowest accuracy was obtained with the single red band image (F1 score = 30.00%).

The manually delineated Chinese fir in region 2 were used to evaluate the performance of the nine Mask R-CNN models. The results of ITD from the Mask R-CNN models are shown in Table 2. Input images with band combinations achieved higher accuracy for ITD than single-band input images. The highest F1 scores were achieved when the model used the RGB band combination (F1 score = 94.68%), Multi band (F1 score = 94.07%), and the visible-spectrum RGB imagery (F1 score = 93.60%) as the input image, followed by the models with the red edge image (F1 score = 90.90%), the NIR image (F1 score = 89.16%), and the CHM (F1 score = 84.72%). Among the single-band images, the model with the red edge image produced the highest accuracy for ITD, and the lowest accuracies were obtained by the models with the green image (F1 score = 69.79%) and the red image (F1 score = 65.55%).
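The point-in-crown rule used to score the LM treetops can be sketched as follows. The text does not say how a crown containing several points enters the counts, so this sketch assumes that every point in such a crown is one false detection and that the crown is counted as neither correct nor omitted. The function names and the toy crowns are hypothetical.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test; `polygon` is a list of (x, y) vertices."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # the edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def score_treetops(points, crowns):
    """Return (TP, FP, FN) for treetop points against reference crowns."""
    counts = [0] * len(crowns)
    outside = 0
    for p in points:
        hits = [i for i, poly in enumerate(crowns) if point_in_polygon(p, poly)]
        if hits:
            counts[hits[0]] += 1
        else:
            outside += 1  # a point outside every crown is a false detection
    tp = sum(1 for n in counts if n == 1)
    fn = sum(1 for n in counts if n == 0)
    fp = outside + sum(n for n in counts if n > 1)  # assumption, see lead-in
    return tp, fp, fn


# Three hypothetical square crowns and four detected treetops.
crowns = [[(0, 0), (2, 0), (2, 2), (0, 2)],
          [(3, 0), (5, 0), (5, 2), (3, 2)],
          [(6, 0), (8, 0), (8, 2), (6, 2)]]
points = [(1, 1), (3.5, 1), (4.5, 1), (10, 10)]
tp, fp, fn = score_treetops(points, crowns)
# (tp, fp, fn) == (1, 3, 1)
```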

Classical Detection Methods
Classical methods gained popularity because of the ease of data processing and, in particular, their low computational requirements. For the LM algorithm, a suitable window size is a critical factor affecting tree crown detection accuracy [46]. In a forest or plantation with a relatively consistent crown size, the LM algorithm with a fixed window size can achieve good accuracy for ITD. Fawcett et al. reported a high mapping accuracy of 98.2% for palms from a CHM using the LM algorithm with a fixed window size of 9 m [47]. In a forest with large differences in tree height and crown size, the detection accuracy of the LM algorithm with a fixed window size would likely decrease significantly [48,49]. Previous studies demonstrated that a single tree is split into multiple trees when the window size is too small [50,51]. When the window size is too large, trees can be missed because the window contains several trees [17]. To address this issue, the LM algorithm with a variable window size was developed and has been used successfully for tree crown detection [52]. The variable window size is typically determined by the relationship between crown radius and height [14,53], the local slope break [17], or an average semivariance range [51]. Xu et al. proposed a revised LM algorithm that finds crown center seeds by searching for local maxima in transects along the row and column directions of an image; applied to four high spatial-resolution images, it achieved overall accuracies between 85% and 91% [26]. For the MCWS algorithm, the accuracy of tree crown delineation depends on the precision of each tree's location. In this study, the MCWS algorithm derived the tree crown boundaries from treetops detected with the LM algorithm. Previous research has reported different approaches for determining tree locations. For instance, Fang et al. proposed exploiting tree inventories as a guide for tree identification using the MCWS approach [54]. Combining a random forest classification and existing tree inventory data, Wallace et al. applied a workflow for detecting and delineating all of the individual trees in the city of Melbourne, Victoria, Australia [15]. However, it is still a challenge for classical methods to detect individual trees in a complex forest because of overlapping crowns and variable crown sizes [8]. In this study, because the study site is a newly forested plantation with few overlapping crowns, sufficient distance between trees, and Chinese fir of a relatively consistent age, it was feasible to use a fixed window size for the LM and MCWS algorithms. A circular window was used because the Chinese fir has a compact, radially distributed crown shape. For the parameters, a circular smoothing window size of 10 cm and a fixed circular window size of 50 cm were selected as the appropriate combination for ITD. Hao et al. reported details on the impacts of these factors on ITD optimization [37].
Comparing the results of ITD by the classical methods, the LM algorithm achieved higher accuracy than the MCWS algorithm when the same test image was used. This may be because the MCWS algorithm also needs to delineate individual trees, which could reduce the detection accuracy. Moreover, the CHM is the optimal image for ITD, as it generated a higher accuracy (F1 score = 87.86% for the LM algorithm and F1 score = 85.92% for the MCWS algorithm) than the other test images. This may be because the CHM is a DSM normalized by subtracting the DTM, which removes the influence of ground elevation and retains the most information about tree crowns. Compared to the other single-band images, it has a simpler data signature.

Mask R-CNN
It has been shown that CNNs can outperform classical methods because they no longer rely on consistent spectral signatures or rule-based algorithms [55]. Compared with classical methods (e.g., the LM and MCWS algorithms), one of the main advantages of CNNs is the capability to extract information from multi-band images. Once trained, the model produces results as soon as a remote sensing image is supplied. In contrast, classical methods struggle to detect individual trees from a multi-band image, because the multi-dimensional data must first be derived or compressed into a single-band image. For example, to select the appropriate image band, Xu et al. [26] used the principal component method and band ratios to obtain a single-band image for ITD. For the input image, the Mask R-CNN model with a multi-band combined image achieved higher accuracy than the model with a single band. The highest accuracy of the Mask R-CNN model with a single band was 90.09%, and the lowest accuracy was 65.55%, which is even lower than the results of ITD using the LM algorithm or MCWS algorithm. This can be explained by the limited information in a single band, which cannot meet the feature requirements of Mask R-CNN. In addition, the accuracy of the models with the red edge and NIR images was higher than that of the models with the blue, green, and red images. This result can be explained by the red edge and NIR images containing more vegetation information than the blue, green, and red images.
Another advantage of Mask R-CNN is its ability to distinguish the target tree species in images. In this study, the optimal Mask R-CNN model using the RGB band combination achieved higher accuracy in Chinese fir detection (F1 score = 94.68%) than the optimal predictions from the LM algorithm (F1 score = 87.86%) and the MCWS algorithm (F1 score = 85.92%). This difference in accuracy is explained by the broad-leaved trees in the study site that were detected by the LM and MCWS algorithms and counted as false detections here. If the detected broad-leaved trees were removed from the results, the F1 scores of the LM algorithm and MCWS algorithm using the CHM as the input image increased to 95.21% and 93.04%, respectively.
Although the Mask R-CNN model has significant power to identify tree species, it requires manually or semi-automatically delineated training samples for model training [34]. In this study, because of the morphology of Chinese fir and the clear tree crowns, it was possible to delineate tree crowns manually for model training. However, for complex forests, it is a challenge to delineate tree crowns by hand, even using high spatial resolution images. In addition, Mask R-CNN requires a large volume of highly accurate training samples, which is the main difficulty in using the model [34]. To address this limitation, data augmentation and a pre-trained ResNet-50 backbone were used in this study. In many scenarios, it is not feasible to design and train a new ConvNet from scratch [56], because labeling data is time-consuming and model training has high computational demands. It has been shown that model weights can be fine-tuned to adapt to the target object during transfer learning, which can greatly reduce the need for training samples [57]. For instance, Fromm et al. used a pre-trained Faster R-CNN network to detect conifer seedlings, which did not cause a significant loss of accuracy even with the training dataset reduced to a couple of hundred seedlings [10]. Once model training is completed, applying the model to other images is fast. Kattenborn et al. reported that although it took 2.5 h to train the model, its application to detecting cover fractions of plant species and communities was very efficient and took only seconds for each tile [58].

The Limitation and Application of These Algorithms
All the algorithms used for detecting tree crowns exhibited advantages and disadvantages in terms of data processing, requirements, and accuracy. Based on the developed detection models, the LM algorithm took approximately 9.5 s to complete the detection of individual trees in region 1, and the MCWS algorithm needed 32.9 s, while Mask R-CNN took about 2 min to detect individual trees in region 1. The biggest weakness of these algorithms is that they have difficulty detecting individual tree crowns in highly overlapping forests [16,59,60]. The main limitation of the LM and MCWS algorithms is the inability to identify tree species. In monoculture forests, classical methods can achieve good accuracy for ITD because there is no interference caused by differences between tree species [47,61]. For mixed forests, it is suggested to manually delineate the region of the target tree species and then use the classical methods for individual tree detection, which works around the tree species identification problem. For example, Aeberli et al. achieved banana tree detection by delineating the study area for banana trees from a commercial banana farm and surrounding area that hosted forestry, residential farming land uses [39]. For mixed forests with multiple tree species, deep learning methods become a better option than classical methods because of their capability for tree species identification. However, although Mask R-CNN exhibited advantages in terms of tree species identification and multi-layer input images, it is still at an initial stage of application in forestry. Moreover, there is no model for identifying different tree species that can be applied universally. Therefore, model training is a necessary step for the application of the Mask R-CNN technique. Currently, the complexity of model training is still the main limitation for applying this method, but with the further development of deep learning techniques, it is expected that common models will become available. Deep learning exhibits the power to identify individual trees and is an alternative and promising survey technique for forest management.

Conclusions
Effective detection and mapping of individual trees is critical for monitoring the growth of young forests. This study explored the ability of the LM algorithm, the MCWS algorithm, and Mask R-CNN to detect and map trees in a newly forested Chinese fir plantation using UAV imagery. Manually delineated tree crowns based on the UAV imagery were used to evaluate the accuracy of ITD by the LM algorithm, the MCWS algorithm, and Mask R-CNN. In total, nine different test images, including single-band images, band combinations, and derivations of the UAV imagery, were tested to select the most appropriate approach and the optimal image for ITD. The Mask R-CNN model with the RGB band combination outperformed the other models in this study, yielding recall = 93.73%, precision = 95.65%, and F1 score = 94.68%. Moreover, the accuracy of ITD using the optimal Mask R-CNN model was higher than that of the optimal LM algorithm (F1 score = 87.86%) and MCWS algorithm (F1 score = 85.92%), indicating that Mask R-CNN has higher accuracy potential than classical methods for individual tree detection. This study provides valuable information on how to select the optimal approach to detect and delineate tree crowns from UAV imagery.