Deep Learning-Based Detection of Urban Forest Cover Change along with Overall Urban Changes Using Very-High-Resolution Satellite Images

Urban forests globally face severe degradation due to human activities and natural disasters, making deforestation an urgent environmental challenge. Remote sensing technology and very-high-resolution (VHR) bitemporal satellite imagery enable change detection (CD) for monitoring forest changes. However, deep learning techniques for forest CD concatenate bitemporal images into a single input, limiting the extraction of informative deep features from individual raw images. Furthermore, they are developed for middle- to low-resolution images focused on specific forests such as the Amazon or a single element in the urban environment. Therefore, in this study, we propose deep learning-based urban forest CD along with overall changes in the urban environment by using VHR bitemporal images. Two networks are used independently: DeepLabv3+ for generating binary forest cover masks, and a deeply supervised image fusion network (DSIFN) for the generation of a binary change mask. The results are concatenated for semantic CD focusing on forest cover changes. To carry out the experiments, full-scene tests were performed using the VHR bitemporal imagery of three urban cities acquired via three different satellites. The findings reveal significant changes in forest covers alongside urban environmental changes. Based on the accuracy assessment, the networks used in the proposed study achieved the highest F1-score, kappa, IoU, and accuracy values compared with those using other techniques. This study contributes to monitoring the impacts of climate change, rapid urbanization, and natural disasters on urban environments, especially urban forests, as well as the relations between changes in the urban environment and urban forests.


Introduction
Urban forests, consisting of urban trees, grass, and forests, are components of urban ecosystems providing a full spectrum of services such as alleviating urban heat, enhancing air quality, reducing stormwater runoff, and reducing greenhouse gas emissions, benefiting humans directly or indirectly [1][2][3][4]. However, urban forests around the world are under significant pressure of degradation due to natural disasters and human activities such as wildfires, floods, new construction, and illegal logging [5]. As a result, deforestation has become one of the most intractable environmental problems [6]. Deforestation monitoring is usually conducted through tedious manual procedures including visual inspections, which require frequent visits to forest regions and can be costly and dangerous [7].
In the last few decades, with the advancements in remote sensing technology and the availability of bitemporal satellite imagery, change detection (CD) has been used for forest change monitoring [8]. In traditional methods, either vegetation masks for bitemporal images are generated by using a conventional vegetation detection technique such as the normalized difference vegetation index (NDVI) [9], or CD is carried out by using traditional approaches such as pixel-based CD or object-based CD [10]. The NDVI exploits the different absorption of solar radiation by green plants in the red and near-infrared spectral bands [11]. However, vegetation masks generated by NDVI from very-high-resolution (VHR) satellite images of urban environments can suffer from noise because of the abundance of detailed information in VHR imagery [12,13]. Furthermore, researchers have shown that pixel-based CD techniques are sensitive to noise because they do not fully consider the spatial context [14,15]. Object-based CD approaches, on the other hand, have shown better accuracies [16]. However, the effectiveness of these approaches depends on image segmentation quality [17]. Because of the complex land cover types, such as large urban areas, in VHR satellite imagery, the over- and under-segmentation of objects occurs, reducing the efficiency and accuracy of object-based CD techniques [17]. Moreover, these techniques are usually developed for a specific dataset or site, meaning that similar results may not be achieved when they are applied to a new dataset or site [18].
The use of deep learning networks reduces the manual effort involved in monitoring changes by automating feature extraction and avoiding manual feature selection during CD [19]. Recently, deep learning-based techniques have demonstrated considerable success in a range of applications, including segmentation and CD, particularly in the context of forest detection and forest change monitoring [20][21][22][23][24]. For example, researchers in [25] performed forest cover CD in incomplete satellite images by using a deep neural network in a data-driven format for automatic feature learning. In another study, land cover classification and CD using Sentinel-2 satellite data were carried out by combining a fully convolutional network with a long short-term memory network [26]. A baseline U-Net model with Sentinel-2 data was used for regular CD in a Ukrainian forest [27]. Furthermore, analysts introduced a semantic segmentation-based framework for forest estimation and CD, in which multitemporal Landsat-8 images were fed into a trained U-Net model and binary forest cover maps were generated. Afterward, the pixel-wise difference between the two binary maps (i.e., pre-change and post-change binary maps) was calculated for generating a change map [28]. In another study, forest CD in bitemporal satellite images was performed by generating an enhanced forest fused difference image and extracting changed and unchanged regions of forest with a recurrent residual-based U-Net network [29]. Moreover, coastal forest CD was carried out by using convolutional neural networks (CNNs) [30].
However, most CD networks are modified from networks that were proposed for single-image semantic segmentation tasks. In these networks, bitemporal images are concatenated to meet the requirement of a single image input; as a result, such early fusion networks fail to provide the informative deep features of individual raw images for image reconstruction [31]. In [31], Zhang et al. addressed this problem by introducing a deeply supervised image fusion network (DSIFN) for CD in VHR imagery. In order to generate highly representative deep bitemporal features, feature extraction is conducted via an independently trained fully convolutional two-stream architecture [31]. Furthermore, among different semantic segmentation networks, analysts have demonstrated the effectiveness of DeepLabv3+ [32] for various types of vegetation extraction and detection [33][34][35][36][37][38]. With DeepLabv3+, high-level features of different scales can be extracted using atrous spatial pyramid pooling (ASPP). Additionally, DeepLabv3+ combines multiple features with the encoder-decoder approach, making it a highly efficient and accurate semantic segmentation method [39].
The aforementioned techniques used low- or middle-resolution satellite images, in which small changes related to vegetation can easily be ignored or remain undetected. When such techniques are tested on VHR bitemporal imagery, the final forest CD result may suffer from a large number of false detections or missed detections. Also, these studies focused on huge regions such as the Amazon forests and did not consider forests around urban areas, where changes in forest and other urban elements occur simultaneously due to rapid urban expansion. Additionally, forest changes in these regions are small compared to other changes in an urban environment, and a deep learning technique used to directly detect changes in forest cover may suffer from the class imbalance problem, because the non-change regions in the scene are huge compared to the changes in the forest region alone. By utilizing pre- and post-change binary forest covers with a binary change mask, forest change can be monitored together with overall changes in an urban environment. Therefore, in this study, we addressed the aforementioned problems by benefiting from DeepLabv3+ and DSIFN. We introduced transfer learning-based forest change (i.e., increase or decrease) detection together with the detection of overall urban changes in VHR bitemporal satellite imagery in a semantic CD manner [40]. We trained the two networks independently on open-source datasets and then performed transfer learning using our own datasets. The trained DeepLabv3+ was used for the generation of binary forest masks from both pre- and post-change VHR images, while DSIFN was used for binary change mask generation.
The contributions of the proposed study are as follows: (1) the utilization of two networks for CD, in a semantic CD manner, in an urban environment while focusing on forest cover decrease as well as increase with respect to overall changes in the scene, (2) the usage of VHR bitemporal imagery for deforestation detection, (3) the utilization of the detected binary forest masks of pre- and post-change imagery for reducing false detections, missed detections, and salt-and-pepper noise in the final result, and (4) the transfer learning of both networks trained on open-source datasets to our own VHR imagery dataset.

Datasets
For forest detection, a remote sensing land cover dataset for domain-adaptive semantic segmentation known as LoveDA [41] was used. The dataset consists of 2522 training images and 1669 validation images of 1024 × 1024 pixels with red, green, and blue bands. The labels comprise seven classes: buildings, roads, water, barren, forest, agriculture, and background. However, as our task is related to urban forest detection, we extracted only the forest class from the labels and merged the other classes into the background class. Moreover, due to memory limitations, we cropped each image into four image patches of 512 × 512 pixels; the final dataset thus comprised 10,088 images for training and 6676 for validation with two classes (i.e., forest and background). For change detection, we used the dataset provided by the authors of DSIFN. Initially, the dataset consisted of 3600 images of 512 × 512 pixels with red, green, and blue spectral bands for training and 340 for validation. However, we reduced the image size to 128 × 128 pixels.
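The label merging and cropping described above can be sketched as follows. This is a minimal illustration rather than the preprocessing code used in the study: the function names are hypothetical, and the forest class value `FOREST_ID = 6` is an assumption that should be checked against the dataset documentation.

```python
import numpy as np

# Assumed label value of the forest class; verify against the dataset documentation.
FOREST_ID = 6


def binarize_forest(label, forest_id=FOREST_ID):
    """Return 1 where the label equals the forest class and 0 (background) elsewhere."""
    return (label == forest_id).astype(np.uint8)


def split_into_patches(array, patch=512):
    """Split an (H, W[, C]) array into non-overlapping patch x patch tiles.

    For a 1024 x 1024 input and patch=512 this yields four tiles.
    """
    height, width = array.shape[:2]
    return [
        array[r:r + patch, c:c + patch]
        for r in range(0, height, patch)
        for c in range(0, width, patch)
    ]


# Synthetic stand-ins for one RGB image and its seven-class label map.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(1024, 1024, 3), dtype=np.uint8)
label = rng.integers(0, 8, size=(1024, 1024), dtype=np.uint8)

image_patches = split_into_patches(image)
label_patches = [binarize_forest(p) for p in split_into_patches(label)]
```

Four patches per image is consistent with the reported dataset sizes: 2522 × 4 = 10,088 training patches and 1669 × 4 = 6676 validation patches.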
For the transfer learning of both networks and the evaluation of the proposed methodology, we generated datasets for each network from VHR bitemporal images of three sites acquired via three different satellites. The images were acquired over the South Korean cities of Sejong, Daejeon, and Gwangju via Kompsat-3, QuickBird-2, and WorldView-3, respectively. The overall description of the bitemporal images is provided in Table 1. Binary forest labels for each bitemporal image and binary CD labels were generated through the visual inspection and manual digitization of images. The bitemporal images together with their binary forest labels are given in Figure 1, and their CD labels are shown in Figure 2. Briefly, 2800 image patches and corresponding label patches of 512 × 512 pixels were generated from the bitemporal images of Sites 1 and 3 for the transfer learning of the forest detection network. We extracted the patches with NIR, red, and green spectral bands because the red and NIR bands give useful information regarding the vegetation in satellite images. Similarly, for the change detection network, 2800 image patches of 128 × 128 pixels were generated with red, green, and blue spectral bands. The patches consisted of pre-change, post-change, and CD label images. Site 2 was utilized as the test dataset to evaluate the performance of transfer learning.

Methodology
The proposed method is mainly divided into three steps: (1) binary forest mask generation by using a well-known semantic segmentation technique, DeepLabv3+, (2) binary change mask generation through DSIFN, and (3) forest change monitoring with respect to overall changes in the scene. The flowchart of the proposed method is provided in Figure 3. VHR bitemporal images are independently employed in DeepLabv3+ for urban forest mask generation. At the same time, these images are given as inputs to DSIFN for binary change mask generation. Then, the three binary masks are combined to generate a semantic CD result, and forest change is monitored with overall changes.

Binary Forest Mask Generation
In this study, DeepLabv3+ was used to generate binary forest masks for VHR bitemporal satellite images. DeepLabv3+ is a semantic segmentation network designed for image classification at the pixel level and was developed to improve segmentation results. It extends DeepLabv3 by adopting an encoder-decoder architecture in which the encoder applies ASPP and a decoder refines the segmentation result. The encoder uses a pre-trained CNN for generating high-level features from the input image. The input image passes through multiple convolutional layers that decrease the spatial dimensions and enhance the feature channels. Multi-scale contextual information is generated at the end of the encoder module by the ASPP. The decoder module is responsible for restoring the spatial resolution of the segmented image. This is achieved by upsampling the feature maps and incorporating fine-grained features between the encoder and decoder stages. Detailed information regarding the architecture of DeepLabv3+ can be found in [32].
In this study, ResNet-50 trained on ImageNet was used for extracting high-level features. Through a series of convolutional layers, the spatial dimensions of the images were reduced while the feature channels were enhanced. ASPP generated feature maps that contained contextual information at different scales and enhanced the model's ability to understand and segment forest regions accurately. In the decoder module, the spatial resolution of the forest segmentation map was recovered. This process ensured that detailed information was maintained during upsampling, resulting in a higher-resolution forest mask. Finally, pixel-level classification was performed using a sigmoid activation function. The sigmoid function transformed the pixel values to a range between 0 and 1, representing the probability of each pixel belonging to the forest class. To obtain a binary mask, a manual thresholding approach was employed in this study. Furthermore, both pre-change and post-change images were independently input into DeepLabv3+, and binary forest masks were then generated for each image. The overall architecture of binary forest mask generation is illustrated in Figure 4.

Binary Change Mask Generation
To generate a binary change mask, we used DSIFN introduced in [31]. The main idea behind DSIFN is to develop a deep learning-based network that can effectively fuse information from two bitemporal remote sensing images and perform CD. DSIFN preserves the change region boundaries and reconstructs high-quality maps by extracting deep bitemporal features independently and via the layer-wise concatenation of deep features and image difference features. DSIFN is divided into three streams. The first stream extracts deep features from the pre-change image using layers of a pre-trained VGG16 network. The second stream extracts deep features from the post-change image by sharing the structure and parameters of the first stream. The extracted features from the pre- and post-change images are stacked at the same scales in order to supply both low-level and high-level raw image features to the third stream (i.e., the CD stream). Overall, the first two streams consist of several convolutional layers, each followed by a non-linear activation function such as the rectified linear unit (ReLU).
The CD stream uses a difference discrimination network responsible for upsampling the features back to the original resolution and generating the fused CD map. The deepest layers of the first two streams acquire broad receptive fields and condense global information after progressive abstraction through stacked convolutional and pooling layers. Therefore, the last layers of these streams serve as the initial input to the difference discrimination network to generate a preliminary global change map of a small size. Earlier layers that include the low-level information of the input images are skip-connected to the difference discrimination network at the same scales. Three convolutional layers are applied to generate compact-sized difference image features. For feature map refinement across the spatial dimensions, a spatial attention module is used. Then, the image difference feature maps are upsampled to enlarge them. For fusing raw deep features with image difference features, a channel attention module is used. A detailed explanation regarding DSIFN can be found in [31]. The overall network architecture of DSIFN is provided in Figure 5.

Forest Change Monitoring
To generate the final semantic change result, the pre- and post-change forest masks obtained through the proposed forest detection technique were first combined separately with the change mask generated via the change detection network (expressed in Equation (1)) to extract the forest change pixels from the two forest masks. The pre- and post-change binary forest maps were used for identifying forest cover decrease (Forest_d) and increase (Forest_i).
In Equation (1), Mask_T1 and Mask_T2 are the pre- and post-change binary forest masks, and Mask_C denotes the change mask.
After the integration of the two masks according to Equation (1), a binary forest change map was generated (e.g., Mask_T1 and Mask_C for the forest decrease map, and Mask_T2 and Mask_C for the forest increase map) by assigning 0 to the unchanged forest pixels and changed non-forest pixels, and 1 to the changed forest pixels. Through this process, the forest mask pixels that belong to changed regions were preserved, and the pixels related to unchanged forest regions were eliminated.
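A form of Equation (1) that is consistent with the description above can be written as a pixel-wise logical AND of the binary masks. This is offered as an assumption based on the surrounding text, not as the authors' exact notation:

```latex
\begin{equation}
\mathrm{Forest}_{d} = \mathrm{Mask}_{T1} \wedge \mathrm{Mask}_{C}, \qquad
\mathrm{Forest}_{i} = \mathrm{Mask}_{T2} \wedge \mathrm{Mask}_{C}
\end{equation}
```

Under this reading, a pixel takes the value 1 only when it is forest in the corresponding forest mask and is also flagged as changed, which matches the assignment of 0 to unchanged forest pixels and to changed non-forest pixels.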
After concatenating the bitemporal binary forest increase and decrease masks and the binary CD mask, we created a comprehensive semantic change map as shown in Figure 6. The semantic change map provides a detailed representation of the forest cover changes during the specific period under consideration. The semantic change map includes four classes: forest cover increase, forest cover decrease, non-forest change regions, and falsely changed regions.
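The combination of the three binary masks into a four-class map can be illustrated with a short NumPy sketch. Several points here are assumptions made for illustration only: the forest decrease and increase maps are taken as the pixel-wise AND of each forest mask with the change mask, the integer class codes are hypothetical, and pixels flagged as both decrease and increase (forest at both dates yet marked as changed) are assigned to the falsely changed class.

```python
import numpy as np

# Hypothetical class codes, chosen only for this sketch.
NO_CHANGE, FOREST_DECREASE, FOREST_INCREASE, NON_FOREST_CHANGE, FALSE_CHANGE = 0, 1, 2, 3, 4


def semantic_change_map(mask_t1, mask_t2, mask_c):
    """Combine pre/post-change forest masks and a change mask into one class map."""
    forest_t1 = mask_t1.astype(bool)
    forest_t2 = mask_t2.astype(bool)
    changed = mask_c.astype(bool)

    forest_d = forest_t1 & changed  # forest before, pixel flagged as changed
    forest_i = forest_t2 & changed  # forest after, pixel flagged as changed

    out = np.full(changed.shape, NO_CHANGE, dtype=np.uint8)
    out[changed] = NON_FOREST_CHANGE          # default for every changed pixel
    out[forest_d] = FOREST_DECREASE
    out[forest_i] = FOREST_INCREASE
    out[forest_d & forest_i] = FALSE_CHANGE   # assumption: forest at both dates
    return out


mask_t1 = np.array([[1, 1, 0, 0, 1]], dtype=np.uint8)
mask_t2 = np.array([[0, 1, 1, 0, 1]], dtype=np.uint8)
mask_c = np.array([[1, 1, 1, 1, 0]], dtype=np.uint8)

result = semantic_change_map(mask_t1, mask_t2, mask_c)
print(result)  # [[1 4 2 3 0]]
```

Because the forest masks only relabel pixels that the change mask already marks as changed, isolated errors in either forest mask outside changed regions do not reach the final map.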

Validation
To assess the overall extent of changes in the scene, we calculated the percentage of overall changed regions by dividing the total number of changed pixels in the semantic change map by the total number of pixels in the scene.To gain further insights into the forest cover changes, we analyzed the forest cover decrease and increase individually.The percentage of forest cover decrease is determined by dividing the total number of pixels indicating a decrease in forest cover by the total number of changed pixels in the semantic change map.Similarly, the percentage of forest cover increase is calculated by dividing the total number of pixels indicating an increase in forest cover by the total number of changed pixels.These metrics show the trend of forest cover change with respect to other urban changes.
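These percentages can be computed directly from a class-coded semantic change map. The sketch below assumes hypothetical integer class codes (0 for no change, 1 for forest decrease, 2 for forest increase) and counts every pixel that is not coded as no change as a changed pixel; both choices are illustrative assumptions.

```python
import numpy as np


def change_percentages(semantic_map, no_change_id=0, decrease_id=1, increase_id=2):
    """Return (overall change %, forest decrease %, forest increase %).

    The overall percentage is relative to all pixels in the scene; the decrease
    and increase percentages are relative to the changed pixels only.
    """
    total = semantic_map.size
    changed = int(np.count_nonzero(semantic_map != no_change_id))
    decrease = int(np.count_nonzero(semantic_map == decrease_id))
    increase = int(np.count_nonzero(semantic_map == increase_id))

    overall_pct = 100.0 * changed / total
    decrease_pct = 100.0 * decrease / changed if changed else 0.0
    increase_pct = 100.0 * increase / changed if changed else 0.0
    return overall_pct, decrease_pct, increase_pct


example_map = np.array([0, 0, 0, 0, 0, 0, 1, 2, 3, 3], dtype=np.uint8)
overall_pct, decrease_pct, increase_pct = change_percentages(example_map)
print(overall_pct, decrease_pct, increase_pct)  # 40.0 25.0 25.0
```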
Firstly, the two networks were trained independently on open-source datasets for each task. Then, transfer learning was performed using our dataset. Tests were carried out by using the full-scene bitemporal images of Site 2 as well as Sites 1 and 3. For the quantitative evaluation of the networks used in this study, the F1-score, kappa, accuracy, intersection over union (IoU), false alarm rate (FAR), and miss rate (MR) were calculated using each predicted result and the manually digitized labels. The binary forest masks generated via DeepLabv3+ in this study were compared with the binary forest masks generated via U-Net [42], SegNet [43], and the NDVI. Moreover, we compared the final semantic change detection result of the proposed method with the result obtained by combining the change detection map from this study with the deforestation detection result of the unsupervised deforestation detection technique introduced in [13].
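For reference, the listed metrics can be derived from the confusion matrix of a predicted binary mask and its label, as in the sketch below. The text does not state the exact definitions of FAR and MR, so the forms used here, FAR = FP / (FP + TN) and MR = FN / (TP + FN), are assumptions based on common usage; the function name is hypothetical.

```python
import numpy as np


def _safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(pred, truth):
    """Compute F1, kappa, accuracy, IoU, FAR, and MR for two binary masks."""
    pred = np.asarray(pred).astype(bool).ravel()
    truth = np.asarray(truth).astype(bool).ravel()

    tp = int(np.count_nonzero(pred & truth))
    fp = int(np.count_nonzero(pred & ~truth))
    fn = int(np.count_nonzero(~pred & truth))
    tn = int(np.count_nonzero(~pred & ~truth))
    n = tp + fp + fn + tn

    accuracy = _safe_div(tp + tn, n)
    # Expected agreement by chance, used by Cohen's kappa.
    expected = _safe_div((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn), n * n)
    return {
        "f1": _safe_div(2 * tp, 2 * tp + fp + fn),
        "kappa": _safe_div(accuracy - expected, 1.0 - expected),
        "accuracy": accuracy,
        "iou": _safe_div(tp, tp + fp + fn),
        "far": _safe_div(fp, fp + tn),  # assumed definition
        "mr": _safe_div(fn, tp + fn),   # assumed definition
    }


pred = [1, 1, 0, 0, 1, 0, 0, 0]
truth = [1, 0, 0, 0, 1, 1, 0, 0]
metrics = binary_metrics(pred, truth)
```

In this toy example there are two true positives, one false positive, one false negative, and four true negatives, giving an accuracy of 0.75 and an IoU of 0.5.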

Experimental Results
The networks were trained using TensorFlow on a machine with an AMD Ryzen 7 5800X 8-core CPU, 64.0 GB of RAM, and an NVIDIA GeForce RTX 3060 GPU. The networks were trained on the open-source datasets for several epochs, and the epochs with the best accuracies were chosen (i.e., 25 for DeepLabv3+ and 60 for DSIFN). A binary cross-entropy loss and an Adam optimizer were used for both networks. The minimum and maximum learning rates during training with a learning rate reduction for DeepLabv3+ were set to 0.000001 and 0.0001, and those for DSIFN were set to 0.000001 and 0.0001, respectively. The maximum learning rate was set differently for both networks according to the variations in the training and validation accuracies and losses. The batch size was set to 8 and 32 for DeepLabv3+ and DSIFN, respectively.
After training on the open-source datasets, the final training and validation accuracies of DeepLabv3+ were 0.954 and 0.937, and its training and validation losses were 0.115 and 0.175. For DSIFN, the corresponding accuracies were 0.958 and 0.926, and the losses were 0.095 and 0.185.
Then, the transfer learning of both networks was performed using our own dataset. During transfer learning, the epochs with the best accuracies for DeepLabv3+ and DSIFN were 100 and 40, respectively. The training and validation accuracies were 0.942 and 0.903 for DeepLabv3+, and 0.991 and 0.972 for DSIFN. Similarly, the training and validation losses were 0.148 and 0.267 for DeepLabv3+, and 0.022 and 0.087 for DSIFN. The training and validation performance of the two networks, in terms of accuracy and loss, is shown in Figures 7 and 8.

Binary Forest Masks
Firstly, full-scene binary forest masks were generated using the bitemporal images of the three sites. Patches were generated from the pre-change VHR image of each site. Then, the trained DeepLabv3+ network was used to predict a forest cover mask for each patch. The predicted patches were combined to produce a result of the same size as the original image for each site. Afterward, multiple thresholds were tested, and the one with the best results (0.4) was selected for binary forest mask generation. The same process was repeated using the post-change VHR image. After binary forest mask generation for both (i.e., pre-change and post-change) images, we visually compared the results with the binary forest masks generated by using the NDVI. For generating the masks using the NDVI, the threshold with the best accuracy was selected. The binary forest masks generated via DeepLabv3+ in this study and via the NDVI from the pre-change image of each site are shown in Figure 9 along with the label images.
Compared to the label images (i.e., Figure 9c,f,i) and the results generated via the NDVI (Figure 9a,d,g), the proposed method effectively detected forest covers, as shown in Figure 9b,e,h. Through a visual inspection, the binary forest covers generated via the NDVI for all three sites have missed as well as falsely detected regions, which makes them seem [...]
[...] and the IoU was 0.737. At Site 2, the F1-score reached 0.824, the kappa coefficient was 0.817, the accuracy was 0.987, and the IoU was 0.701. Similarly, for Site 3, we observed an F1-score of 0.823, a kappa coefficient of 0.811, an accuracy of 0.977, and an IoU of 0.700. The FAR and MR of each site were 0.036 and 0.124, 0.005 and 0.201, and 0.006 and 0.243, respectively. The predicted CD masks and CD labels of each site are provided in Figure 10. It is apparent that the CD network in the proposed study detected the changes successfully in all three sites. However, upon visual comparison with the CD label images, it can be observed that the boundaries of the detected objects in the results generated via the proposed method exhibited instances of both over-detection and missed detection. Furthermore, it is worth noting that certain falsely detected regions, such as high-rise buildings, were present in the results due to variations in the acquisition angles of the satellite sensor during image acquisition.
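The patch-wise prediction, mosaicking, and thresholding used for the full-scene forest masks can be sketched as follows. This is an illustration under stated assumptions, not the study's code: `predict_patch` is a hypothetical stand-in for the trained network that returns a per-pixel forest probability, the scene is padded by edge replication so that it divides evenly into patches (the padding strategy is not described in the text), and pixels with a probability at or above the threshold are labeled as forest.

```python
import numpy as np


def predict_full_scene(image, predict_patch, patch=512, threshold=0.4):
    """Predict an (H, W, C) scene patch by patch and return a binary mask.

    predict_patch: hypothetical callable mapping a (patch, patch, C) array to a
    (patch, patch) array of forest probabilities in [0, 1].
    """
    height, width = image.shape[:2]
    pad_h = (-height) % patch
    pad_w = (-width) % patch
    # Assumption: pad by edge replication so the scene tiles evenly.
    padded = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)), mode="edge")

    probability = np.zeros(padded.shape[:2], dtype=np.float32)
    for r in range(0, padded.shape[0], patch):
        for c in range(0, padded.shape[1], patch):
            tile = padded[r:r + patch, c:c + patch]
            probability[r:r + patch, c:c + patch] = predict_patch(tile)

    # Crop back to the original scene size before thresholding.
    probability = probability[:height, :width]
    return (probability >= threshold).astype(np.uint8)


def toy_predictor(tile):
    """Stand-in for the trained network: scaled first band as a 'probability'."""
    return tile[..., 0].astype(np.float32) / 255.0


rng = np.random.default_rng(0)
scene = rng.choice(np.array([0, 255], dtype=np.uint8), size=(600, 700, 3))
forest_mask = predict_full_scene(scene, toy_predictor, patch=256, threshold=0.4)
```

In practice, overlapping patches with blending are sometimes used to reduce seams at patch borders; the non-overlapping tiling above is the simplest variant.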

Finalizing Forest Cover Changes
After all the binary masks were generated, they were concatenated to generate semantic change results for each site, focusing on forest changes. This process helps minimize noise as well as falsely detected forest change regions. The semantic change maps and reference maps were generated by first adding the predicted results and the label images, respectively, to extract forest change regions from the binary forest masks. These forest change regions were then concatenated with the change mask to obtain the final result.
To show the effectiveness of the proposed method, we compared its results with those of an unsupervised deforestation detection technique [13]. To this end, the unsupervised deforestation detection technique was used to generate the deforestation masks (i.e., forest decrease masks), while the forest increase mask was generated by swapping the bitemporal images. However, the technique was mainly developed for middle- to low-resolution satellite imagery, and because VHR imagery was used, the final forest change masks suffered from falsely detected regions and a copious amount of salt-and-pepper noise. Therefore, for an effective comparison, we combined these masks with the change masks generated in the proposed study. The final semantic change maps generated via the proposed methodology and the semantic change maps generated by combining the change mask with the forest decrease and increase masks from the unsupervised deforestation detection technique are shown in Figure 11 together with the semantic change reference maps. In Figure 11, yellow indicates a decrease in forest cover, purple an increase in forest cover, red non-forest changes, white falsely detected or falsely labeled forest changes, and black a no-change region. The proposed method effectively detected decreased forest cover with a small number of false detections and missed detections in all three sites (i.e., Figure 11b,e,h) compared to the reference data (i.e., Figure 11c,f,i). Furthermore, undetected change regions were present in the binary change mask used in the proposed study; however, since our focus is forest cover CD, the figures show that these undetected change regions have only a subtle impact on forest cover CD and can thus be ignored. Moreover, due to the higher MR of the post-change binary forest mask compared to that of the pre-change binary forest mask generated via the proposed method, and because the increase in forest cover was minute, the increase remained undetected or was falsely detected by the proposed method in Site 1. On the other hand, when the unsupervised deforestation detection technique was used together with a change mask, numerous non-forest-related changes were detected as decreased forest regions (i.e., shown in yellow in Figure 11a,d,g). Similarly, the increased forest regions were either missed or falsely detected by the unsupervised deforestation detection technique in Sites 1 and 3.
After the semantic change maps were generated, forest changes relative to the overall changes in the scenes were determined. The percentage of change in the overall scene of Site 1 was around 16.736% in the results predicted via the proposed method, whereas in the reference map it was around 15.64%. Moreover, in the results predicted via the proposed method, the total decrease in forest cover relative to overall changes was around 13.617% and the calculated increase was 1.034%. In the reference map, these values were 15.74% and 2.49%. Due to the higher MR of the post-change forest cover map and the low percentage of increase, the percentage of increase obtained by the proposed method in Site 1 was considered to be an inaccurate result. Moreover, the percentage of decrease in the predicted results was higher than that in the reference map because in some regions the non-forest changes were detected as forest decrease regions. On the other hand, the results obtained through the unsupervised deforestation detection technique displayed a significantly different pattern. Here, the percentages of decrease and increase in forest cover relative to overall changes were approximately 43.99% and 21.81%, respectively. This discrepancy can be attributed primarily to the numerous falsely detected forest change regions resulting from the use of VHR imagery.
Similarly, in Site 2, the proposed method showed that around 3.5% of the full scene changed in 4 years; 12.6% of the total changes were related to a decrease in forest cover, whereas 1.43% were related to an increase in forest cover. In the overall scene of Site 3, 5.63% of changes occurred in 1 year. Of the total changes, the decrease in forest cover was 8.21%, while the increase in forest cover was 1.25%. In Sites 2 and 3, the forest cover increase was detected with fewer falsely detected forest increase regions than in the results generated for Site 1. In the results generated through the unsupervised deforestation detection technique for Sites 2 and 3, the decrease in forest cover was around 50.03% and 43.99% of overall changes, respectively. The increase in forest cover in Site 3 was shown to be 21.81%, and in Site 2 it was not detected via the aforementioned method.
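The percentages above follow from simple pixel counts on a semantic change map. The sketch below illustrates the arithmetic under stated assumptions: the integer class codes and the function name `change_percentages` are hypothetical, and "overall changes" is assumed to mean the union of forest decrease, forest increase, and non-forest change pixels.

```python
import numpy as np

# Hypothetical integer codes for the classes of a semantic change map.
NO_CHANGE, DECREASE, INCREASE, NON_FOREST = 0, 1, 2, 3

def change_percentages(semantic_map):
    """Return (overall change %, forest decrease %, forest increase %).

    Overall change is expressed relative to the full scene; decrease and
    increase are expressed relative to the changed pixels only.
    """
    m = np.asarray(semantic_map)
    changed = np.isin(m, (DECREASE, INCREASE, NON_FOREST))
    n_changed = int(changed.sum())

    overall = 100.0 * n_changed / m.size
    if n_changed == 0:
        # Nothing changed, so the forest shares are defined as zero.
        return overall, 0.0, 0.0

    decrease = 100.0 * int((m == DECREASE).sum()) / n_changed
    increase = 100.0 * int((m == INCREASE).sum()) / n_changed
    return overall, decrease, increase

# Toy scene of 10 pixels: 5 changed, of which 2 decrease and 1 increase.
toy = [[0, 0, 0, 0, 0],
       [1, 1, 3, 3, 2]]
overall, decrease, increase = change_percentages(toy)
```

In the toy scene, half of the pixels changed, and 40% and 20% of those changes are forest decrease and forest increase, respectively.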

Discussion
Traditional CD methods that perform direct CD between binary forest masks increase the number of incorrectly identified forest change regions. The proposed methodology accurately detected the decreased regions of forest cover in Site 1, Site 2, and Site 3 with a small number of missed and falsely detected regions. The non-forest change regions, however, contained a number of missed detections, but since our study is focused on detecting changes in forest cover, these missed or false detections can be disregarded. Figure 12 shows a close-up view of regions of interest (ROIs) from the results predicted via the proposed method together with the pre- and post-change images of the same ROIs from each site.
As mentioned earlier, in Site 1, due to the low percentage of increase in forest cover as well as the higher MR of the post-change forest cover map, the detected increased forest cover regions are considered to be inaccurate. A close-up view of the forest cover increase detected via the proposed method in an ROI from Site 1 is shown in Figure 13, in which the changes occur from a built-up region to agricultural land or from agricultural land to bare soil. In Sites 2 and 3, the proposed method effectively detected the increased forest regions shown in Figure 14. In Site 3, as shown in Figure 14d-f, although the increase was detected, half of the forest change region in the close-up view was detected as a non-forest change region.

Conclusions
In this study, we proposed a semantic CD technique focusing on urban forest changes along with other urban changes. To this end, two networks, DeepLabv3+ for binary forest mask generation and DSIFN for binary change detection, were utilized and trained independently on open-source datasets. Then, transfer learning was performed using the dataset generated from VHR bitemporal images acquired via two different satellites with different spatial resolutions. The results generated by each network were then concatenated to generate a semantic change result. To carry out the experiments, full scene tests were performed using the VHR bitemporal imagery of three urban cities acquired via three different satellites. The binary forest masks generated via the proposed method from pre- and post-change images showed a higher F1-score, kappa, IoU, and accuracy compared with the results generated via the NDVI, Unet, and SegNet. The final semantic change results showed that the proposed method can detect the changes in forest cover along with other urban changes. Moreover, the results showed that forest covers are changing considerably along with the changes in the urban environment. Overall, in Sites 1, 2, and 3, changes of 16.73%, 3.5%, and 5.63% occurred, in which 13.61%, 12.6%, and 8.21% of the total changes were related to a decrease in the urban forest cover. The use of both pre- and post-change VHR images minimized salt-and-pepper noise in regions related to forest cover changes in the semantic change result.
The results showed that the proposed method can effectively detect the regions related to forest cover decrease. However, because forest cover decrease is usually more prevalent than forest cover increase, and because the forest cover increase regions in the bitemporal images used in this study were very small, regions where the forest cover decreased were detected more effectively than those where it increased. The proposed method can be used for monitoring the impacts of climate change, rapid urbanization, and natural disasters on urban environments, especially on urban forests, as well as the relations between changes in urban environments and urban forests. Moreover, this study can be used for the planning and development of cities and for map updating. In the future, we will integrate the two networks into one to avoid using two separate networks and training them independently. A complex dataset will be generated for a semantic CD task containing changes in the classes related to the urban environment (i.e., urban grass, urban forest, urban trees, and built-up regions). Furthermore, we will apply the proposed method to additional datasets related to forest cover increase regions acquired via different satellite sensors.

Figure 1. VHR bitemporal imagery and binary forest labels: (a) pre- and (b) post-change images of Site 1, (c) pre- and (d) post-change images' forest labels, (e) pre- and (f) post-change images of Site 2, (g) pre- and (h) post-change images' forest labels, (i) pre- and (j) post-change images of Site 3, and (k) pre- and (l) post-change images' forest labels.

Figure 3. Flowchart of the proposed method.

where $Mask_{pre}$ and $Mask_{post}$ are the pre- and post-change binary forest masks, and $Mask_{CD}$ denotes the change mask.

The semantic change map provides a detailed representation of the forest cover changes during the specific period under consideration. The semantic change map includes four classes: forest cover increase, forest cover decrease, non-forest change regions, and falsely detected change regions.
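The four classes can be illustrated by combining the pre- and post-change binary forest masks with the binary change mask. The following is a minimal sketch under assumptions, not the exact fusion rule of this study: the integer class codes and the function name `semantic_change_map` are hypothetical, and a forest mask difference that is not confirmed by the change mask is assumed to be what the falsely detected class represents.

```python
import numpy as np

# Hypothetical integer codes for the output classes.
NO_CHANGE, DECREASE, INCREASE, NON_FOREST, FALSE_CHANGE = 0, 1, 2, 3, 4

def semantic_change_map(forest_pre, forest_post, change):
    """Fuse two binary forest masks and a binary change mask (sketch)."""
    pre = np.asarray(forest_pre, dtype=bool)
    post = np.asarray(forest_post, dtype=bool)
    chg = np.asarray(change, dtype=bool)

    out = np.full(pre.shape, NO_CHANGE, dtype=np.uint8)

    # Forest in the pre-change image only, confirmed by the change mask.
    out[pre & ~post & chg] = DECREASE
    # Forest in the post-change image only, confirmed by the change mask.
    out[~pre & post & chg] = INCREASE
    # Change detected while the forest status stays the same.
    out[(pre == post) & chg] = NON_FOREST
    # Forest masks disagree but no change was detected (assumed meaning
    # of the falsely detected class).
    out[(pre != post) & ~chg] = FALSE_CHANGE
    return out

# One-dimensional toy example covering every class.
pre = [1, 1, 0, 0, 1, 0]
post = [0, 0, 1, 1, 1, 0]
chg = [1, 0, 1, 0, 1, 0]
result = semantic_change_map(pre, post, chg)  # codes: 1, 4, 2, 4, 3, 0
```

Requiring agreement between the forest mask difference and the change mask is what suppresses isolated forest mask errors, at the cost of inheriting any missed detections of the change mask.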

Figure 6. Semantic change detection focusing on forest changes.

Figure 7. Training and validation graphs for transfer learning of DeepLabv3+: (a) training loss, (b) training accuracy, (c) validation loss, and (d) validation accuracy.

Figure 8. Training and validation graphs for transfer learning of DSIFN: (a) training loss, (b) training accuracy, (c) validation loss, and (d) validation accuracy.

Figure 9. Binary forest masks of the pre-change image of each site generated by using (a) the NDVI for Site 1, (b) the proposed method for Site 1, (c) label of Site 1, (d) the NDVI for Site 2, (e) the proposed method for Site 2, (f) label of Site 2, (g) the NDVI for Site 3, (h) the proposed method for Site 3, and (i) label of Site 3.

Figure 11. Semantic CD results: (a) unsupervised deforestation detection, (b) proposed method, and (c) reference data of Site 1; (d) unsupervised deforestation detection, (e) proposed method, and (f) reference data of Site 2; (g) unsupervised deforestation detection, (h) proposed method, and (i) reference data of Site 3.

Figure 12. Region of interest indicating decrease in forest cover: (a) reference data, (b) proposed method, (c) pre-change image, and (d) post-change image of Site 1; (e) reference data, (f) proposed method, (g) pre-change image, and (h) post-change image of Site 2; (i) reference data, (j) proposed method, (k) pre-change image, and (l) post-change image of Site 3.

Figure 13. A close-up view of detected forest increase in Site 1: (a) proposed method, (b) pre-change image, and (c) post-change image.

Figure 14. A close-up view of detected forest increase in Sites 2 and 3: (a) proposed method, (b) pre-change image, and (c) post-change image of Site 2; (d) proposed method, (e) pre-change image, and (f) post-change image of Site 3.