Application and Evaluation of a Deep Learning Architecture to Urban Tree Canopy Mapping

Urban forests are dynamic urban ecosystems that provide critical benefits to urban residents and the environment. Accurate mapping of urban forest plays an important role in greenspace management. In this study, we apply a deep learning model, the U-net, to urban tree canopy mapping using high-resolution aerial photographs. We evaluate the feasibility and effectiveness of the U-net in tree canopy mapping through experiments at four spatial scales: 16 cm, 32 cm, 50 cm, and 100 cm. The overall performance of all approaches is validated on the ISPRS Vaihingen 2D Semantic Labeling dataset using four quantitative metrics: Dice, Intersection over Union, Overall Accuracy, and Kappa Coefficient. Two evaluations are performed to assess the model performance. Experimental results show that the U-net with the 32-cm input images performs the best, with an overall accuracy of 0.9914 and an Intersection over Union of 0.9638. The U-net achieves state-of-the-art overall performance in comparison with the object-based image analysis (OBIA) approach and other deep learning frameworks. The outstanding performance of the U-net indicates the possibility of applying it to urban tree segmentation at a wide range of spatial scales. The U-net accurately recognizes and delineates tree canopy for different land cover features and has great potential to be adopted as an effective tool for high-resolution land cover mapping.


Introduction
Urban forests are an integral part of urban ecosystems. They provide a broad spectrum of perceived benefits, such as improved air quality, lower surface and air temperatures, and reduced greenhouse gas emissions [1][2][3][4]. Additionally, urban trees can improve human mental health by adding aesthetic and recreational values to the urban environment [5][6][7]. The last decade has witnessed a dramatic decline of urban tree cover. From 2009 to 2014, there was an estimated 70,820 hectares of urban tree cover loss throughout the United States [8]. A better understanding of tree canopy cover is more important than ever for sustainable monitoring and management of urban forests.
Urban tree canopy, defined as the ground area covered by the layer of tree leaves, branches, and stems [9], is among the most widely used indicators of urban forest pattern. Conventionally, field-based surveys are conducted to manually measure the area of urban vegetation [3,10,11]. These surveys are mostly conducted by regional forestry departments and various research programs, and many of them are highly costly and labor-intensive [12]. The availability of digital images and advances in remote sensing technologies provide unique opportunities for effective urban tree canopy mapping [13,14]. For instance, the Moderate Resolution Imaging Spectroradiometer-Vegetation Continuous Field (MODIS-VCF) product provides global coverage of percent tree canopy cover at a spatial resolution of 500 m [15]. This product is one of the most widely adopted vegetation map products for studying global tree canopy trends and vegetation dynamics. Other satellite images, such as the Landsat imagery [16] and the Satellite Pour l'Observation de la Terre (SPOT) imagery, have also been used for tree canopy mapping, although in some of these studies accuracy was assessed based on the location of tree stems rather than the extent of tree canopies.
Compared to other urban land cover types, urban tree canopy mapping is a challenging task because tree canopies can take a variety of shapes and forms depending on the age, size, and species of the trees. As trees and other types of green vegetation (e.g., shrubs/grass) are usually planted together, accurate segmentation of trees from other vegetation types is rather difficult due to their spectral similarity. In this study, we utilized the U-net architecture to map the urban tree canopy over Vaihingen, Germany. Coupling aerial images and the deep U-net model, this study aims to: (1) apply the U-net to urban tree canopy mapping using aerial photographs, (2) assess the performance of the U-net architecture at multiple spatial scales, and (3) test the effectiveness of the U-net in comparison with the OBIA approach.

Study Area and Data
This study utilized an image dataset published by the International Society for Photogrammetry and Remote Sensing (ISPRS). Images were taken over Vaihingen, Germany in 2013 (Figure 1). The dataset contains 33 patches, each of which consists of an orthophoto and a labeled ground truth image. The orthophotos have three bands: near-infrared (NIR), red, and green, with a spatial resolution of 8 centimeters (cm).

U-Net Architecture
The U-net architecture was designed for boundary detection and localization and was built upon the FCN architecture [46]. It can be visualized as a symmetrical, U-shaped process with three main operations: convolution, max-pooling, and concatenation. Figure 2 gives a visual demonstration of the U-net architecture for tree canopy segmentation.

The U-net architecture consists of two paths: a contracting path, shown on the left side of the U-shape, and an expansive path, shown on the right side (Figure 2). In the contracting path, the black solid arrows refer to the convolution operation with a 3 × 3 kernel (conv 3 × 3). With the convolution, the number of channels increased from 3 to 64. The black dashed arrows pointing down refer to the max-pooling operation with a 2 × 2 kernel (max pool 2 × 2). In the max-pooling operation, the size of each feature map was reduced from 128 × 128 to 64 × 64. The preceding processes were repeated four times. At the bottom of the U-shape, an additional convolution operation was performed twice.
The expansive path restores the output image size from the contraction path to the original 128 × 128. The black dashed arrows pointing up refer to the transposed convolution operation, which increases the feature map size while decreasing the number of channels. The green arrows pointing horizontally refer to a concatenation process that concatenates the output images from the previous step with the corresponding images from the contracting path. The concatenation combines information from the earlier layers to achieve a more precise prediction. The preceding processes were repeated four times. The gray solid arrow at the upper right corner refers to a convolution operation with a 1 × 1 kernel (conv 1 × 1) that reshapes the images according to the prediction requirements.
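The size reduction along the contracting path can be illustrated with a minimal NumPy sketch of the 2 × 2 max-pooling step (this is an illustration of the operation only, not the study's TensorFlow/Keras implementation):

```python
import numpy as np

def max_pool_2x2(feature_map: np.ndarray) -> np.ndarray:
    """2 x 2 max-pooling with stride 2, as used along the U-net contracting path."""
    h, w = feature_map.shape
    # Group pixels into non-overlapping 2 x 2 blocks and take the block maximum.
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

# A 128 x 128 feature map is reduced to 64 x 64, matching the first
# contraction step of the architecture.
fmap = np.arange(128 * 128, dtype=float).reshape(128, 128)
pooled = max_pool_2x2(fmap)
print(pooled.shape)  # (64, 64)
```

Repeating this step four times takes the feature maps from 128 × 128 down to 8 × 8 at the bottom of the U-shape, while the convolutions along the way increase the channel count.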


Model Training
A total of four experiments were performed, with the input images downsampled from the original 8 cm to 16 cm, 32 cm, 50 cm, and 100 cm, respectively. Figure 3 shows the model training workflow for the 16-cm experiment; the same workflow was repeated in the other three experiments. First, the original 8-cm image was cropped into tiles of 256 × 256 pixels. Then, both the training and test datasets were downsampled to 16 cm, resulting in an image size of 128 × 128. Ninety percent of the tiles were used for training and 10% for testing. Within the training process, 85% of the dataset was used for training and 15% for validation. We performed two evaluations to assess the model performance. In the first evaluation, the predicted output was first upsampled to 8 cm and then compared with the original 8-cm ground truth data. In the second evaluation, the ground truth data were resampled to the spatial resolution of the predicted output dataset before comparison. For example, to evaluate the performance of the 16-cm model, the ground truth data were downsampled to 16 cm before the accuracy assessment.
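As a rough sketch of this preprocessing (tiling into 256 × 256 patches, factor-2 downsampling to 16 cm, and the 90/10 split), the following NumPy code uses a randomly generated stand-in array; the actual study worked on the ISPRS orthophotos with GIS tooling, and the block-averaging downsampling here is an assumption about the resampling method:

```python
import numpy as np

def crop_tiles(image: np.ndarray, tile: int = 256) -> list:
    """Crop an orthophoto (H x W x bands) into non-overlapping tile x tile patches."""
    h, w = image.shape[:2]
    return [image[r:r + tile, c:c + tile]
            for r in range(0, h - tile + 1, tile)
            for c in range(0, w - tile + 1, tile)]

def downsample_2x(tile: np.ndarray) -> np.ndarray:
    """Downsample from 8 cm to 16 cm by 2 x 2 block averaging (256 -> 128 pixels)."""
    h, w, b = tile.shape
    return tile.reshape(h // 2, 2, w // 2, 2, b).mean(axis=(1, 3))

rng = np.random.default_rng(0)
ortho = rng.random((1024, 1024, 3))       # stand-in for an 8-cm NIR/red/green patch
tiles = crop_tiles(ortho)                 # 16 tiles of 256 x 256
tiles_16cm = [downsample_2x(t) for t in tiles]

# 90% of the tiles for training, 10% held out for testing.
split = int(0.9 * len(tiles_16cm))
train, test = tiles_16cm[:split], tiles_16cm[split:]
print(len(tiles), tiles_16cm[0].shape, len(train), len(test))
```

The same pattern applies at the other scales, with larger downsampling factors for the 32-cm, 50-cm, and 100-cm experiments.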

Dice Loss Function
In deep learning, neural networks are trained iteratively. At the end of each training iteration, a loss function is used as a criterion to evaluate the prediction outcome. In this study, we employed the Dice loss function in the training process. The Dice loss can be calculated from the Dice similarity coefficient (DSC), a statistic developed in the 1940s to gauge the similarity between two samples [49]. The Dice loss is given by

Dice loss = 1 − (2 Σ p_i g_i) / (Σ p_i + Σ g_i)    (1)

where p_i and g_i represent the pixel values of the i-th pixel in the training output and the corresponding ground truth images, respectively. In this study, the pixel values in the ground truth images are 0 and 1, referring to the non-tree class and the tree class, respectively. The Dice loss was calculated during the training process to provide an assessment of the training performance at each epoch. The value of the Dice loss ranges from 0 to 1, where a lower value denotes a better training performance. Figure 4 shows how the Dice loss changes at each epoch for the four experiments during the model training process. As the number of epochs increases, the Dice loss drops rapidly and stabilizes at Epoch 99, 243, 97, and 300 for the 16-cm, 32-cm, 50-cm, and 100-cm experiments, respectively.
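Equation (1) can be written directly in NumPy; the small smoothing term `eps` is an added numerical-stability convention, not part of the paper's formula:

```python
import numpy as np

def dice_loss(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-7) -> float:
    """Dice loss = 1 - DSC over per-pixel predictions and 0/1 ground truth."""
    p, g = pred.ravel(), truth.ravel()
    dsc = (2.0 * np.sum(p * g) + eps) / (np.sum(p) + np.sum(g) + eps)
    return 1.0 - dsc

truth = np.array([[1, 1], [0, 0]], dtype=float)
print(dice_loss(truth, truth))        # perfect prediction -> loss near 0
print(dice_loss(1.0 - truth, truth))  # fully wrong prediction -> loss near 1
```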



Model Parameters and Environment
We used randomly selected tiles as our training datasets. For the experiment at each scale, 300 epochs with 8 batches per epoch were applied, and the learning rate was set at 0.0001 for all training models (Table 1). The Adam optimizer was utilized during the training process. We applied horizontal shift augmentation to all images to increase the number of tiles in the training dataset. The number of shift pixels differed at each scale to ensure enough training samples were included in each model. The U-net architecture, as well as the whole semantic segmentation procedure, was implemented with the TensorFlow and Keras deep learning frameworks in the Python programming language. All other processing and analyses were carried out using open-source modules, including GDAL, NumPy, Pandas, OpenCV, and Scikit-learn, among others. The deep learning network experimentation and modeling were executed on the Google Colab platform.
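A minimal sketch of the horizontal shift augmentation is given below, assuming exposed pixels are zero-filled; the paper does not state the fill mode, so this is one plausible choice:

```python
import numpy as np

def horizontal_shift(tile: np.ndarray, shift: int) -> np.ndarray:
    """Shift a tile horizontally by `shift` pixels, zero-filling the exposed edge."""
    out = np.zeros_like(tile)
    if shift > 0:
        out[:, shift:] = tile[:, :-shift]   # shift content to the right
    elif shift < 0:
        out[:, :shift] = tile[:, -shift:]   # shift content to the left
    else:
        out[:] = tile
    return out

tile = np.arange(16, dtype=float).reshape(4, 4)
shifted = horizontal_shift(tile, 2)
print(shifted)
```

Applying a range of shift values to each tile multiplies the number of training samples, which is why the per-scale shift sizes were chosen to keep every model's training set sufficiently large.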

Performance Evaluation
We used four accuracy metrics to evaluate the performance of the U-net model: the overall accuracy (OA), the DSC, the Intersection over Union (IoU), and the kappa coefficient (KC). All metrics were calculated based on a confusion matrix (Table 2), which records the percentage of pixels that are true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). The OA was computed as the percentage of correctly classified pixels (Equation (2)). The DSC is a commonly used metric in semantic segmentation [50]; it was employed in Section 2.4 for calculating the loss function and was used here again for accuracy assessment (Equation (3)). The IoU, also known as the Jaccard Index, represents the ratio of the intersection to the union between the predicted output and the ground truth labeling images (Equation (4)) [51]. The KC, used widely in remote sensing applications, measures how the classification results compare to values assigned by chance (Equation (5)) [52]. All metrics range from 0 to 1, with higher scores indicating higher accuracies.
OA = (TP + TN) / (TP + TN + FP + FN)    (2)

DSC = 2TP / (2TP + FP + FN)    (3)

IoU = TP / (TP + FP + FN)    (4)

KC = (N(TP + TN) − [(TP + FP)(TP + FN) + (FN + TN)(FP + TN)]) / (N² − [(TP + FP)(TP + FN) + (FN + TN)(FP + TN)])    (5)

where N = TP + TN + FP + FN is the number of pixels.
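The four metrics of Equations (2)-(5) can be computed from the confusion-matrix counts with a short helper; this is a sketch of the standard definitions, not code from the study:

```python
def metrics_from_confusion(tp: float, fp: float, fn: float, tn: float) -> dict:
    """OA, DSC, IoU, and kappa (Equations (2)-(5)) from confusion-matrix counts."""
    n = tp + fp + fn + tn
    oa = (tp + tn) / n
    dsc = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    # Expected agreement by chance, from the row/column marginals.
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kc = (oa - p_e) / (1 - p_e)
    return {"OA": oa, "DSC": dsc, "IoU": iou, "KC": kc}

# Toy confusion matrix (counts are illustrative, not from the study).
print(metrics_from_confusion(tp=40, fp=5, fn=5, tn=50))
```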

Object-Based Classification
The output of the U-net model was further compared to that of the object-based classification, a popular and widely applied approach to fine-scale land cover mapping. The object-based classification involves two main steps: segmentation and classification. We first used multiresolution segmentation to group spectrally similar pixels into discrete objects. The shape, compactness, and scale factors were set to 0.1, 0.7, and 150, respectively. Based on a series of trial-and-error tests, an object with a mean value in the near-infrared band greater than 120 and a maximum difference value greater than 0.9 was classified as tree canopy. The object-based classification was implemented in the eCognition software package [53].
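The decision rule above (mean NIR > 120 and maximum-difference > 0.9) amounts to a simple per-object threshold test. The sketch below uses hypothetical object statistics, since the actual segmentation and classification were performed in eCognition:

```python
def classify_objects(objects: list) -> list:
    """Label each segmented object as tree canopy (True) or non-tree (False)
    using the study's decision rule: mean NIR > 120 and max-difference > 0.9."""
    return [(obj["mean_nir"] > 120 and obj["max_diff"] > 0.9) for obj in objects]

# Hypothetical per-object statistics, as a multiresolution segmentation might yield.
objects = [
    {"mean_nir": 150.0, "max_diff": 1.2},  # vigorous vegetation -> tree
    {"mean_nir": 90.0,  "max_diff": 1.1},  # low NIR (e.g., pavement) -> non-tree
    {"mean_nir": 130.0, "max_diff": 0.5},  # high NIR but low max-diff -> non-tree
]
print(classify_objects(objects))  # [True, False, False]
```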

Performance of the U-Net Model at Multiple Scales
We performed two evaluations to assess the model performance. Evaluation 1 compared the predicted results with the 8-cm ground truth images. The accuracy metrics are shown in Table 3. All accuracy metric scores were higher than 0.91 except for those at the 100-cm scale. Consistently across all metrics, there was a score increase when the scale changed from 16 cm to 32 cm, followed by a slight decrease from 32 cm to 50 cm, and a drastic drop from 50 cm to 100 cm. The U-net model achieved the best performance on the 32-cm dataset and the worst on the 100-cm dataset. The highest metric score was 0.9914 (OA) and the lowest score was 0.7133 (IoU). Evaluation 2 compared the predicted results with the ground truth images after adjusting to the spatial resolution of the predicted output. Table 4 shows the accuracy metric scores for Evaluation 2 at the four scales. First, all metric values were higher than those in Evaluation 1 regardless of scale. Second, all metrics were above 0.99 except for the 16-cm experiment. Similar to Evaluation 1, there was a substantial increase in all four metrics when the scale went from 16 cm to 32 cm. Different from Evaluation 1 though, the trend flattened out when the spatial resolution changed from 32 cm to 100 cm. Lastly, Evaluation 2 yielded much higher metric scores than Evaluation 1 at the 100-cm scale with the highest and lowest metric score of 0.9984 (OA) and 0.9934 (IoU), respectively.

Visual Evaluation of the U-Net Performance
To visually assess the performance of the U-net model, we selected an example area with moderate tree canopy cover and compared the predicted output with the ground truth images. Figure 5 shows the selected area on an aerial photo (Figure 5a), ground truth images at 8 cm, 16 cm, and 100 cm (Figure 5b,d,f), and predicted output at 16 cm and 100 cm (Figure 5c,e). Overall, both U-net predictions (Figure 5c,e) showed great consistency with the ground truth images (Figure 5b,d,f). Specifically, the similar patterns in Figure 5b,c echoed the high accuracy scores at 16 cm in Evaluation 1 (Table 3). Because Figure 5d was downscaled to match the spatial resolution of Figure 5c, there was a higher level of similarity between Figure 5c,d than between Figure 5c,b. This was corroborated by the higher accuracy scores in Table 4 than in Table 3. The lower scores at the 100-cm scale in Table 3 were in part due to the blurry canopy boundaries, seen in Figure 5e as a result of downsampling. A better result was identified when comparing Figure 5e with Figure 5f, consistent with the much higher accuracy scores at 100 cm in Table 4.

Performance Comparison between the U-Net and OBIA
Table 5 shows the accuracy scores of tree canopy mapping generated by the OBIA approach and the U-net model at the 16-cm scale. Both models were compared with the 8-cm ground truth data. For the OBIA, the highest metric score was 0.857 (OA) and the lowest was 0.489 (IoU). All the OBIA scores were lower than the U-net scores. It is worth noting that even the highest score for the OBIA (OA: 0.857) was lower than the lowest score for the U-net (IoU: 0.9138), indicating that the U-net was superior to the OBIA in accurately mapping tree canopy at the 16-cm scale.
Figure 6 shows a comparison between the predicted output of the U-net and that of the OBIA in reference to the 8-cm ground truth image. By visual inspection, the U-net output (Figure 6c) showed much better consistency with the ground truth image (Figure 6b) than the OBIA output (Figure 6d). There were many more misclassified pixels in the OBIA output than in the U-net output, especially when identifying trees from grass and shrubs (yellow rectangle in Figure 6d). Further, the U-net model successfully distinguished trees from buildings, while both the OBIA and the ground truth failed to do so (red rectangle in Figure 6d). Overall, the U-net outperformed both the OBIA and the ground truth image in accurately extracting tree canopy cover from other complex urban land cover features.

Performance of the U-Net in Urban Tree Canopy Mapping
In this study, we tested the effectiveness of the U-net in urban tree canopy mapping. We conducted the experiments at four different scales and performed two evaluations to assess the model performance from two different angles. Evaluation 1 (Table 3) compared the predicted output with the 8-cm ground truth; it tests the model performance on datasets at different scales. Our results show that the U-net performed the best on the 32-cm dataset with an overall accuracy of 0.9914 (Table 3). While the 32-cm dataset is a coarser-resolution dataset compared to the 8-cm and 16-cm datasets, each image patch of the 32-cm dataset contains more geographic features than the other two. In deep learning, the spatial extent of the input is called "the receptive field", defined as the size of the region in the input that produces the feature [54]. In this context, the receptive field indicates how many land cover features can be perceived from an input patch. Figure 7 shows examples of input image patches at three scales: (a) 16 cm, (b) 32 cm, and (c) 100 cm. As shown in Figure 7a, despite the higher spatial resolution, a 16-cm patch is too small to cover enough land cover objects such as buildings and trees with large crowns [54]. This in part explains the lower accuracies with the 8-cm and 16-cm datasets, as they come with too small of a receptive field.
An overly large receptive field may also be problematic. Figure 7c shows an example of a 100-cm input patch. While it is large enough to include a significant number of land cover features, it comes at the cost of degraded spatial detail, resulting in a loss of locational accuracy, especially at the tree edges. This is part of the reason why the DSC and KC values in the 100-cm experiment are much lower than those of the experiments performed at other scales. It is recognized that the size of the receptive field plays an important role in the training process of a deep learning neural network. Too small of a receptive field can limit the amount of contextual information, while too large of a receptive field may cause a loss of spatial details and a decline of locational accuracy [55]. An optimal receptive field ensures a good number of land cover features go into the training model while retaining the spatial accuracy of the dataset. That is the case for the 32-cm dataset in our study, which achieves the highest accuracy scores and the best model performance (Table 3). It is worth noting that even with the 100-cm dataset, the U-net is still able to achieve an OA of 0.9324 (Table 3), indicating the overall effectiveness of the U-net architecture in urban tree canopy mapping.
Figure 6. Comparison of the urban tree canopy segmentation between the U-net and OBIA: (a) original orthophoto, (b) ground truth image, (c) predicted output of the U-net, and (d) OBIA classification result (white areas refer to tree canopy pixels; black areas refer to non-tree pixels; red rectangle shows an example area where trees are adjacent to buildings; yellow rectangle shows an example area where there is a mix of trees, grass, and shrubs).
We performed Evaluation 2 to assess the accuracy of the U-net models based on input and output of the same spatial resolution. This evaluation is essential because high-resolution ground truth data are not always attainable. Results show that the performance of the U-net architecture is exceptional based on the very high accuracy metric scores (Table 4). All metric scores are above 0.99 for scales from 32 cm to 100 cm. These results indicate promising applications of the U-net architecture. An example is the National Agriculture Imagery Program (NAIP), which offers freely accessible aerial imagery across the United States at 1-m (100-cm) spatial resolution. Given that all metric scores in Table 4 are above 0.99 at the 100-cm scale, the U-net model appears highly effective and promising for fine-scale land cover mapping based on NAIP data.

Comparison between the U-Net and OBIA
The OBIA has been a mainstream approach for high-resolution land cover mapping during the last decade. Myint et al. (2011) used the OBIA method to extract major land cover types in Phoenix from QuickBird images; in their study, the DSC score for the tree class was 0.8551 [28]. Over the same study area, a later study developed another set of decision rules using the NAIP imagery and successfully raised the DSC score to 0.88 [32]. Apart from multispectral satellite imagery, Zhou (2013) supplemented the height and intensity from LiDAR data and yielded a DSC score of 0.939 [56]. Compared with the large number of existing studies using the OBIA approach, our study using the U-net model achieves a higher mapping accuracy than almost all of the OBIA-based studies in the literature (DSC: 0.9816).
To further compare the performance of the U-net with the OBIA, we selected a sample study area with a variety of land cover types and applied the 16-cm U-net model and the OBIA approach to the same area. Figure 6 provides a visual comparison between the OBIA and U-net output. Urban tree canopy mapping is challenging because trees are typically planted on grassland or closely adjacent to buildings (Figure 6a). The U-net model has the unique capacity of accurately distinguishing trees from grass and buildings (Figure 6c) while the OBIA approach is not as effective (Figure 6d).

One of the major downsides of the OBIA is that it requires substantial expert knowledge of the study area and the land cover types under investigation [57]. In contrast, the U-net automatically learns the features in the study area without much human intervention in parameter decisions [58]. Moreover, the U-net, as a deep learning network, offers a very high level of automation with minimal need for manual editing after classification. This is a prominent advantage of the U-net over the OBIA, because the accuracy of the OBIA depends to a great extent on post-classification manual editing, which is time-consuming and labor-intensive. As the U-net is free of manual editing, it has great potential to become a mainstream mapping tool, especially when dealing with large amounts of high-resolution data.

Comparison between the U-Net and Other Deep Learning Methods
The test dataset of this study was made available by the ISPRS 2D Semantic Labeling Contest. The dataset contains orthophotos, digital surface model (DSM) images, and normalized DSM (nDSM) images over the city of Vaihingen, Germany. The same data were used by a number of deep learning studies on tree canopy mapping. Table 6 lists several of these studies along with their methods, datasets, and DSC values on the tree class. Audebert et al. (2016) applied multimodal and multi-scale deep networks to the orthophotos; the DSC score on the tree class was 0.899 [59]. Sang and Minh (2018) used both the orthophotos and the nDSM images to train a fully convolutional neural network (FCNN); the classification accuracy was not improved despite adding the nDSM images on top of the orthophotos [60]. Paisitkriangkrai et al. (2015) took advantage of the entire dataset (all three data types) to train a multi-resolution convolutional neural network, yielding a DSC value of 0.8497. Compared with the above studies using the Vaihingen dataset, our U-net models conducted at 16 cm, 32 cm, and 50 cm outperform all of them with DSC values above 0.95; note that adding DSM images failed to improve overall model performance in those studies. Our best-performing model, the 32-cm U-net model, achieves a DSC of 0.9816, surpassing the DSC of the contest winner (0.908) and indicating the exceptional effectiveness of the U-net architecture for tree canopy mapping.
Numerous studies have modified the U-net architecture in an attempt to improve model performance [42,62,63]. For instance, Diakogiannis et al. (2020) integrated the U-net with a residual neural network using high-resolution orthophotos and DSM images [64]. The dataset they used was also published by the ISPRS 2D Semantic Labeling Contest [65] and was similar to the Vaihingen dataset used in this study. The modified framework achieved a DSC score of 0.8917 on the tree class, compared to 0.9816 in this study (Table 3).
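The defining feature of the U-net is its encoder-decoder layout with skip connections that carry high-resolution encoder features directly into the decoder. The following toy sketch illustrates that data flow only: real U-net blocks apply learned convolutions, whereas here pooling, nearest-neighbour upsampling, and a plain average stand in for them so the shapes and skip path stay visible.

```python
import numpy as np

def downsample(x: np.ndarray) -> np.ndarray:
    """2x2 average pooling (stand-in for an encoder conv + pool step)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x: np.ndarray) -> np.ndarray:
    """Nearest-neighbour 2x upsampling (stand-in for a transposed conv)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def unet_like_forward(x: np.ndarray) -> np.ndarray:
    """Toy one-level encoder-decoder with a skip connection, U-net style.

    The skip connection restores fine spatial detail that the contracting
    path discards; the real network fuses the two paths by concatenation
    followed by convolutions, caricatured here as a simple mean.
    """
    skip = x                    # encoder feature saved for the skip path
    bottleneck = downsample(x)  # contracting path: coarser, more context
    up = upsample(bottleneck)   # expanding path: back to input size
    return (up + skip) / 2.0    # fuse coarse context with fine detail

out = unet_like_forward(np.arange(16, dtype=float).reshape(4, 4))
print(out.shape)  # prints (4, 4): spatial size preserved end to end
```

This spatial-size-preserving fusion of context and detail is what makes the relatively simple U-net competitive for per-pixel labeling tasks such as tree canopy delineation.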
Although the U-net is simpler, it is more effective than the other, more complex deep learning architectures. This is consistent with Ba and Caruana (2014), who found that while depth can make the learning process easier, it may not always be essential [66]. Choosing the most efficient and suitable neural network is a top priority for ensuring the best overall performance of a deep learning framework.

Conclusions
Mapping urban trees using high-resolution remote sensing imagery is important for understanding urban forest structure for better forest management. In this study, we applied the U-net to urban tree canopy mapping using high-resolution aerial photos. We tested the effectiveness of the U-net at four different scales and performed two evaluations to assess the model performance from two different angles. Evaluation 1 shows that the U-net performed the best on the 32-cm dataset, with an overall accuracy of 0.9914. The experiments conducted at four scales indicate the significance of an optimal receptive field for training a deep learning model. Evaluation 2 shows that the U-net can be used as a highly effective and promising tool for fine-scale land cover mapping with exceptional accuracy scores. Moreover, the comparison experiment shows the outstanding performance of the U-net compared to the widely used OBIA approach and other deep learning methods.
This study shows the utility of the U-net in urban tree canopy mapping and discusses the possibility of extending its use to other applications. A broad application of the U-net to high-resolution land cover mapping faces several challenges. First, as with any fine-scale land cover mapping task, the availability of freely accessible high-resolution imagery is an issue. U-net model training often requires images with a spatial resolution of 1 m or finer, and it remains a challenge to acquire very high-resolution data for regions of interest at desired times. Second, the lack of publicly available training datasets poses another problem. Ground truth data are usually produced by local governments or research institutions through field surveys or manual digitization, and generating accurate ground truth data is a complex and laborious task. A possible solution is to introduce techniques and strategies from transfer learning. Approaches such as pre-training, fine-tuning, and domain adaptation can alleviate the dependence on large labeled datasets. Therefore, integrating the U-net and transfer learning is a potential direction for future research.

Data Availability Statement:
Publicly available datasets used in this study can be accessed from https://www2.isprs.org/commissions/comm2/wg4/benchmark/2d-sem-label-vaihingen/ (accessed on 3 April 2021).

Conflicts of Interest:
The authors declare no conflict of interest.