DSNUNet: An Improved Forest Change Detection Network by Combining Sentinel-1 and Sentinel-2 Images

Abstract: The use of remote sensing images to detect forest changes is of great significance for forest resource management. With the development and implementation of deep learning algorithms in change detection, a large number of models have been designed to detect changes in multi-phase remote sensing images. Although synthetic aperture radar (SAR) data have strong potential for application in forest change detection tasks, most existing deep learning-based models have been designed for optical imagery. Therefore, to effectively combine optical and SAR data in forest change detection, this paper proposes a double Siamese branch-based change detection network called DSNUNet. DSNUNet uses two sets of feature branches to extract features from dual-phase optical and SAR images and employs shared weights to combine features into groups. In the proposed DSNUNet, different feature extraction branch widths were used to compensate for the difference in the amount of information between optical and SAR images. The proposed DSNUNet was validated by experiments on a manually annotated forest change detection dataset. According to the obtained results, the proposed method outperformed other change detection methods, achieving an F1-score of 76.40%. In addition, different combinations of widths between feature extraction branches were analyzed in this study. The results revealed an optimal performance of the model at initial channel numbers of the optical image branch and SAR image branch of 32 and 8, respectively. The prediction results demonstrated the effectiveness of the proposed method in accurately predicting forest changes and suppressing cloud interference to some extent.


Introduction
Change detection is an important task in the remote sensing field, which aims to reveal surface changes in multi-temporal remote sensing data [1]. Forests are important natural resources that play a major role in maintaining the Earth's ecological environment. As a sub-task of change detection, forest change detection has been widely used in land and resource inventory, deforestation control, and forest management.
Early forest change detection was generally performed using optical images, which have obvious color characteristics, with certain color bands being sensitive to specific changes [2,3]. Currently, optical images are the main data source in the change detection field [4]. However, the quality of optical images is strongly affected by clouds and fog. Moreover, the temporal difference in multi-phase images captured by a sensor may show spectral changes for the same objects [5]. With the development of synthetic aperture radar (SAR) technology, numerous studies have been carried out on SAR image-based forest change detection in recent years [6]. T. A. Schroeder et al. detected forest fire and deforestation using the supervised classification of Landsat 8 time series [22], whereas W. B. Cohen et al. used an unsupervised post-classification differencing approach to detect deforestation in the Pacific Northwest using Landsat 8 data [23]. SAR-based detection has become a common approach for obtaining accurate forest change detection results with reduced interference from clouds and fog. M. G. Hethcoat et al. used the random forest algorithm to analyze deforestation based on Sentinel-1 time series data [24]. J. Reiche et al. combined dense Sentinel-1 time series with Landsat and ALOS-2 PALSAR-2 data to perform near real-time tropical forest monitoring [25]. With the development of change detection technology, deep learning-based models have begun to be applied to forest change detection tasks. R. V. Maretto et al. improved the traditional U-Net model and applied it to forest change detection based on Landsat-8 OLI data, demonstrating the effectiveness of the improved U-Net model in achieving high forest-change detection accuracy [26]. F. Zhao et al. extracted deforestation areas using the U-Net model and Sentinel-1 time series to process the VV and VH data, providing evidence of the efficiency of SAR data as a data source [5].
Although these methods can effectively identify forest change, the application of advanced change detection algorithms has not been thoroughly explored. In addition, most of the major change detection algorithms have been based on high-resolution optical images, while the combination of low- and medium-resolution optical and SAR data has rarely been considered. After reviewing the characteristics of the major change detection algorithms and the requirements of forest change detection tasks, this study proposes a double-Siamese nested U-Net (DSNUNet) model based on the encoder-decoder structure to improve forest change detection accuracy. The encoder includes two sets of Siamese branches used to extract features from optical and SAR images, while the decoder aggregates the optical and SAR features and restores the feature scales. DSNUNet is derived from the change detection algorithm SNUNet-CD. In the proposed model, different feature channel combinations are used to extract effective features from optical and SAR images, as well as to compensate for the differences between these image data. Moreover, to overcome the imbalance between positive and negative samples in the change detection task, a combination of focal loss and dice loss is used as the loss function of the proposed model. The proposed model was validated using Sentinel-1 and Sentinel-2 data. The results demonstrated the effectiveness of the proposed method in forest change detection in terms of precision, recall, and F1-score compared to state-of-the-art methods.
The main contributions of this paper are as follows: (1) A Siamese network model named DSNUNet was designed to achieve accurate forest change detection by combining optical and SAR images. The DSNUNet model uses optical and SAR image data as inputs directly and outputs the final change map, thus improving the forest change detection performance; (2) Two sets of Siamese branches with different widths were designed for feature extraction to achieve more effective use of the multi-sensor data. The feature balance of optical and SAR images was performed using different channel combinations. DSNUNet also can be generalized as a general change detection framework for any combination of two kinds of images with information differences.
The code used in this study is open-source and can be found on GitHub: https://github.com/NightSongs/DSNUNet. The rest of this paper is organized as follows. Section 2 introduces the training data sources and data preprocessing and describes the proposed DSNUNet model. Section 3 presents multiple sets of comparative experiments. Section 4 analyzes the experimental results and provides future research directions. Finally, Section 5 summarizes the paper.


Study Area
In this study, image data were collected from the Changsha, Zhuzhou, and Xiangtan areas in the central-eastern part of Hunan Province, with coordinates of 26°03′-28°40′N and 111°53′-114°17′E. These areas consist mainly of mountainous and hilly terrain, covering a total area of 2.8 × 10⁵ km², bordered mostly by forests and cities, making them suitable for forest change detection research. The location of the study area is shown in Figure 1.

Acquisition of Optical Image
The optical image data used in this study were the L1C level data collected by the Sentinel-2 satellite. Three bands, namely the NIR, red, and green (color infrared), were selected as RGB inputs of the detection model. The data were collected from the study area in autumn 2020 and 2021 (September to November), then used for the change detection analysis. The mean synthesis was performed on the quarterly image set using the Google Earth Engine (GEE) platform; cloud cover was controlled to below 5%. Table 1 shows the optical image information used in this study.

Acquisition of SAR Image
The SAR image data used in this study were derived from the Level-1 ground range detected (GRD) product of Sentinel-1 satellite data. The temporal phase of the SAR data was consistent with the optical images. To reduce the impact of speckle on data quality, the intra-quarter mean synthesis was applied. This process can effectively reduce the speckle noise in SAR data and facilitates alignment with the optical images on a time scale. The 10-m resolution VV and VH polarizations were used as model inputs. The edge regions were masked, and the radiation-corrected results from the GEE platform were converted to calibration results in dB using logarithmic scaling according to the following equation:

x_dB = 10 × log₁₀(x)

where x_dB is the final calibration result and x is the original image after platform radiation correction.
Table 2 describes the SAR image information used in this study.
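The logarithmic scaling step above can be sketched in a few lines. This assumes the standard 10 · log₁₀ conversion from linear backscatter to dB; the small epsilon guarding masked zero-valued edge pixels is our addition, not taken from the paper.

```python
import numpy as np

def to_db(x, eps=1e-10):
    """Convert linear backscatter values to dB via logarithmic scaling.
    `eps` (an assumption) avoids log(0) on masked edge pixels."""
    return 10.0 * np.log10(np.maximum(x, eps))

# Example: a backscatter value of 100 in linear units is 20 dB.
print(to_db(np.array([1.0, 100.0])))
```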

Label Annotation and Dataset Construction
In this study, manual annotation of change areas within the study area was performed based on the two synthetic optical images. Forest change was defined as a loss of forest vegetation caused by human activities and natural disasters, including logging and fire. The annotation was performed in ArcMap, and change areas were stored in a vector format. Finally, a total of 2441 change areas were used to produce the change detection dataset. The change labels used in the experiment are shown in Figure 2.

The image patches can be used in the proposed DSNUNet for forward calculation. The optical and SAR images of both dates were cropped at a resolution of 256 × 256 with an overlap rate of 0.25. Similarly, the change areas obtained from visual interpretation were first converted to raster data, in which 255 and 0 were assigned to changed and unchanged areas, respectively, and then cropped using the same approach. In total, 1951 patch pairs were obtained and used for model training and validation. The data were randomly divided into training, test, and validation datasets according to the ratio of 8:1:1.
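A minimal sketch of the patch-cropping step described above, assuming a stride of size × (1 − overlap) = 192 pixels and that windows not fully inside the image are discarded (the paper does not state how image borders were handled):

```python
import numpy as np

def crop_patches(img, size=256, overlap=0.25):
    """Crop an (H, W, C) array into size x size patches with the given overlap rate."""
    stride = int(size * (1 - overlap))  # 256 * 0.75 = 192 px between patch origins
    h, w = img.shape[:2]
    patches = []
    for top in range(0, h - size + 1, stride):
        for left in range(0, w - size + 1, stride):
            patches.append(img[top:top + size, left:left + size])
    return patches

# A 448 x 448 image yields a 2 x 2 grid of overlapping 256 x 256 patches.
patches = crop_patches(np.zeros((448, 448, 3)))
```

The same function would be applied identically to the bi-temporal optical images, the SAR images, and the rasterized labels so that all patch pairs stay spatially aligned.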

Model Structure
The proposed DSNUNet is a change detection model based on Siamese networks, which uses optical and SAR images jointly to achieve forest change detection. The DSNUNet model is derived from the SNUNet-CD model. The structure of the DSNUNet model is shown in Figure 3. In Figure 3, the left side shows the encoder part used for down-sampling and extracting semantic features, while the right side indicates the decoder part used for up-sampling and recovering feature scales. Unlike most change detection models, the DSNUNet model accepts both optical and SAR images (VV+VH) as inputs and can efficiently extract data features from different time phases and modalities. Finally, the DSNUNet outputs a fine change map.

Double Siamese Encoder Structure
Optical and SAR images were obtained using different sensors. The pixel values of optical and SAR images represent the spectral reflectance values and the backscattering coefficients, respectively. To extract differential features from different modal data, two branches are used in the DSNUNet model. An easily overlooked problem in the combined study of optical and SAR data is that optical images have richer semantic features compared to SAR images. In the deep learning field, different channels of features are usually considered as different patterns of features. Therefore, different combinations of convolutional kernels for each of the two types of data were used in this study. Indeed, more complex data use more convolutional kernels to extract more features.
As shown in Figure 4, after using optical images as inputs in the network, primary features were obtained by the initial convolution layer. The number of feature maps' channels gradually increased while the height and width were gradually halved after multiple down-sampling. VV and VH were inputted to the network as SAR images and stacked along the channel dimension to construct a 2 × H × W feature map. Afterward, a feature

Double Siamese Encoder Structure
Optical and SAR images were obtained using different sensors. The pixel values of optical and SAR images represent the spectral reflectance values and the backscattering coefficients, respectively. To extract differential features from different modal data, two branches are used in the DSNUNet model. An easily overlooked problem in the combined study of optical and SAR data is that optical images have richer semantic features compared to SAR images. In the deep learning field, different channels of features are usually considered as different patterns of features. Therefore, different combinations of convolutional kernels for each of the two types of data were used in this study. Indeed, more complex data use more convolutional kernels to extract more features.
As shown in Figure 4, after using optical images as inputs in the network, primary features were obtained by the initial convolution layer. The number of channels of the feature maps gradually increased, while the height and width were halved after each down-sampling step. VV and VH were input to the network as SAR images and stacked along the channel dimension to construct a 2 × H × W feature map. Afterward, a feature extraction backbone with a smaller number of convolution kernels was used to extract feature maps of the SAR images, following the same principle as the optical image branch. SAR and optical features of the same level were kept uniform at the spatial scale, ensuring the spatial alignment of different features. Assuming R_{i,0} is the feature node set obtained from an optical image and S_{i,0} is the feature node set obtained from a SAR image, the feature shapes of R_{i,0} and S_{i,0} can be expressed as follows:

shape(R_{i,0}) = (2^i · C, H / 2^i, W / 2^i)
shape(S_{i,0}) = (2^i · C′, H / 2^i, W / 2^i)

where C and C′ denote the initial channel numbers of the optical and SAR branches, respectively; H and W denote the height and width of an image; and i is the encoder level index.
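The two-branch encoder described above can be sketched in PyTorch as follows. The block layout (two 3 × 3 convolutions with batch normalization) and the max-pooling down-sampling are assumptions for illustration; only the differing branch widths (C = 32 for optical, C′ = 8 for SAR) and the channel-doubling/size-halving pattern follow the text.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two 3x3 convolutions; a stand-in for the paper's encoder block."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.body(x)

class DualBranchEncoder(nn.Module):
    """Sketch of the double Siamese encoder: a wide branch for the 3-band
    optical input (C = 32) and a narrow branch for the 2-band SAR input
    (C' = 8); channels double and spatial size halves at every level."""
    def __init__(self, c_opt=32, c_sar=8, levels=4):
        super().__init__()
        self.pool = nn.MaxPool2d(2)
        self.opt = nn.ModuleList(
            [ConvBlock(3 if i == 0 else c_opt * 2 ** (i - 1), c_opt * 2 ** i) for i in range(levels)])
        self.sar = nn.ModuleList(
            [ConvBlock(2 if i == 0 else c_sar * 2 ** (i - 1), c_sar * 2 ** i) for i in range(levels)])

    def forward(self, x_opt, x_sar):
        r, s = [], []  # feature node sets R_{i,0} and S_{i,0}
        for i, (block_opt, block_sar) in enumerate(zip(self.opt, self.sar)):
            if i > 0:  # halve H and W before every level except the first
                x_opt, x_sar = self.pool(x_opt), self.pool(x_sar)
            x_opt, x_sar = block_opt(x_opt), block_sar(x_sar)
            r.append(x_opt)
            s.append(x_sar)
        return r, s

enc = DualBranchEncoder()
r, s = enc(torch.randn(1, 3, 64, 64), torch.randn(1, 2, 64, 64))
```

Each phase (A and B) would pass through the same weight-shared branches, which is what makes the two branch pairs Siamese.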


Decoder Structure
Although the DSNUNet encoder consists of two sets of Siamese branches, the features extracted from the branches are independent. To ensure information integrity in the decoder part, concatenation along the channel dimension was used to combine the features of the same level from the different branches. The original information was introduced into the encoder and then down-sampled level by level; this processing results in a loss of the spatial information used for feature localization. To merge high-resolution spatial features with low-resolution semantic features, a continuous skip-connection mechanism for information interaction between the encoder and decoder was used in this study.
As shown in Figure 3, after introducing images "A" and "B" into the network, different branches for down-sampling were obtained according to the data types. The four features of the same level were combined with the up-sampling results of the lower-level features and transmitted together to the decoder block. Assuming X^{i,j} is a decoder node, R^{i,0} is the feature node obtained from an optical image, and S^{i,0} is the feature node obtained from a SAR image, the calculation of the decoder node can be expressed as follows:

X^{i,j} = H([R_A^{i,0}, R_B^{i,0}, S_A^{i,0}, S_B^{i,0}, Up(X^{i+1,j-1})]), j = 1
X^{i,j} = H([R_A^{i,0}, R_B^{i,0}, S_A^{i,0}, S_B^{i,0}, X^{i,1}, ..., X^{i,j-1}, Up(X^{i+1,j-1})]), j > 1

where H(·) denotes the calculation function of the convolution block; Up(·) denotes the up-sampling operation; [·] denotes concatenation along the channel dimension; and the subscripts A and B denote the two time phases.

The DSNUNet, as a variant of the SNUNet-CD model [19], returns four outputs of the same scale as the input image at the end of the network. A common practice is to use the deep supervision method to calculate the loss value of the four outputs separately and backpropagate them to calculate the full gradient [27]. To merge the shallow and deep features effectively, an ensemble channel attention module (ECAM) was introduced in SNUNet-CD to select effective features suitable for automatic change detection. ECAM is an extension of the channel attention module (CAM) [28]; its structure is shown in Figure 5. After adding ECAM to the end of the decoder, the pipeline of the DSNUNet is obtained, as shown in Figure 6. After processing the four outputs of the decoder {X^{0,j}, j ∈ {1, 2, 3, 4}} with ECAM to obtain valid fused features, a 1 × 1 convolutional layer is used to output the final change map.
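A hedged sketch of a single decoder node: the four same-level encoder features (optical/SAR, phases A and B) and the up-sampled lower-level feature are concatenated along the channel dimension and fused by a convolution block. The fusion block layout and bilinear up-sampling are assumptions for illustration; only the channel bookkeeping follows the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderNode(nn.Module):
    """Sketch of one decoder node X^{i,j}: concatenate the same-level encoder
    features, any earlier decoder nodes of the row, and the up-sampled
    lower-level feature, then fuse them with a convolution block."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True))

    def forward(self, same_level_feats, lower_feat):
        # Up(X^{i+1,j-1}): double the spatial size of the lower-level feature.
        up = F.interpolate(lower_feat, scale_factor=2, mode="bilinear", align_corners=False)
        # [.]: channel-wise concatenation, then H(.): the convolution block.
        return self.fuse(torch.cat(same_level_feats + [up], dim=1))

# Example: four 32-channel level-i features plus one 64-channel level-(i+1) feature.
node = DecoderNode(in_ch=4 * 32 + 64, out_ch=32)
feats = [torch.randn(1, 32, 16, 16) for _ in range(4)]
out = node(feats, torch.randn(1, 64, 8, 8))
```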


Loss Function
Sample imbalance is common in change detection tasks, where the percentage of pixels in a change region is much smaller than that in an invariant region. Therefore, to make the detection model focus more on change region extraction during training, it is necessary to balance the weights of change and invariant regions when calculating the loss. A hybrid loss function consisting of focal and dice losses is used to improve the change detection performance of the model. The focal loss is an improved loss function based on the standard cross-entropy loss, which is used to reduce the weights of easily classified samples and improve the model's focus on change regions [29]. The dice loss is a loss function used to balance positive and negative samples in image segmentation, but it can face instability in the training process [30]. The combination of these two loss functions can help improve the segmentation performance and training stability of the model for change region detection. The hybrid loss function is expressed as follows:

L = L_focal + L_dice (8)

with L_focal = −α(1 − p_t)^γ · log(p_t) and L_dice = 1 − 2|Y ∩ Y′| / (|Y| + |Y′|),

where α and γ are hyperparameters used to balance positive- and negative-sample weights (in DSNUNet, α and γ were set to 0.25 and 2, respectively); p_t is the predicted probability of the true class; Y is the ground truth; and Y′ is the change map output by the model.
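The hybrid loss can be sketched as follows. The dice smoothing constant and the numerical clamping are our additions for stability, not taken from the paper.

```python
import torch

def hybrid_loss(logits, target, alpha=0.25, gamma=2.0, smooth=1.0):
    """Focal + dice loss for binary change maps.
    `smooth` and the log clamp are assumed stabilizers, not from the paper."""
    p = torch.sigmoid(logits)
    # Focal term: down-weight easy pixels with the (1 - p_t)^gamma factor.
    pt = torch.where(target > 0.5, p, 1 - p)
    at = torch.where(target > 0.5,
                     torch.full_like(p, alpha),
                     torch.full_like(p, 1 - alpha))
    focal = (-at * (1 - pt) ** gamma * torch.log(pt.clamp(min=1e-8))).mean()
    # Dice term: overlap between predicted and ground-truth change regions.
    inter = (p * target).sum()
    dice = 1 - (2 * inter + smooth) / (p.sum() + target.sum() + smooth)
    return focal + dice

# An uncertain prediction (logits = 0) should lose to a confident correct one.
target = torch.ones(1, 1, 4, 4)
loss_uncertain = hybrid_loss(torch.zeros(1, 1, 4, 4), target)
loss_confident = hybrid_loss(torch.full((1, 1, 4, 4), 10.0), target)
```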

Evaluation Metrics
In change detection tasks, Precision, Recall, and F1-score are commonly used as evaluation metrics to evaluate the detection accuracy of a change region versus an invariant region. Since Precision and Recall are negatively correlated, the F1-score was used to evaluate the model performance in a comprehensive manner. The Precision, Recall, and F1-score are expressed as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 × Precision × Recall / (Precision + Recall)

where TP denotes the number of change regions correctly predicted by the model; FP is the number of invariant regions incorrectly predicted as change regions; and FN denotes the number of change regions incorrectly predicted as invariant regions. The computational complexity of different algorithms was compared using the number of parameters (Params) and the number of floating-point operations (FLOPs) of the models.
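The three metrics follow directly from the pixel-level counts, for example:

```python
def prf1(tp, fp, fn):
    """Precision, Recall, and F1-score from pixel-level counts;
    zero-division guards are our addition for empty predictions."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 8 correct change pixels, 2 false alarms, 2 misses -> P = R = F1 = 0.8.
p, r, f = prf1(tp=8, fp=2, fn=2)
```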

Implementation Details
The hardware platform used in the experiments consisted of an Intel Core i7-10700KF 8-core, 16-thread processor @ 3.80 GHz (Intel, Santa Clara, CA, USA), 32 GB of RAM, and an Nvidia GeForce RTX 3080 graphics card (Nvidia, Santa Clara, CA, USA). The software included Python and PyTorch as the programming language and deep learning framework, respectively.
The same hyperparameters and training approaches were used in all experiments. Simple data augmentation (random horizontal and vertical flips and diagonal mirroring) was adopted in the model training phase, and test-time augmentation (TTA) was used in the model testing phase to improve the prediction performance. The initial learning rate was set to 1 × 10⁻⁴ and decayed to 1 × 10⁻⁵ at the 50th epoch. The weights of each convolutional layer in DSNUNet were initialized using Kaiming initialization. To ensure the reproducibility of the model, the same random seeds were used in all experiments. Only model weights that improved the F1-score on the validation set were retained during the training phase.
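The reproducibility measure mentioned above, fixing the random seeds, typically looks like the following in PyTorch; the seed value itself is an arbitrary choice, not taken from the paper.

```python
import random

import numpy as np
import torch

def set_seed(seed=42):
    """Fix all random number generators so repeated runs are comparable.
    The seed value is arbitrary; the cuDNN flags trade speed for determinism."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op on CPU-only machines
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

# Two runs seeded identically draw identical random tensors.
set_seed(0)
a = torch.rand(3)
set_seed(0)
b = torch.rand(3)
```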

Results
The performance of the proposed DSNUNet model was validated using a series of experiments. This section presents the results of the evaluation metrics and several comparison methods, as well as the implementation details.

Comparison of Different Models
To assess the effectiveness of the proposed DSNUNet model, the obtained results were compared with those obtained using different deep learning-based change detection models. First, the change detection results obtained using the DSNUNet model were compared with those obtained using the FC-EF, FC-Siam-Conc, and FC-Siam-Diff models, which are classical models [14]. Next, LF UNet was used for forest change detection, which is an improved UNet deep model for revealing forest changes using Landsat imagery [26]. The attention-based STANet and SNUNet models, which use a spatiotemporal attention module and an integrated channel attention module [17,19], respectively, were then compared with the proposed DSNUNet. It should be noted that the proposed DSNUNet model represents an improved version of SNUNet, where a SAR branch is added to the model structure. Finally, the proposed model was compared with the transformer-based change detection model (BIT) [20]. BIT is an efficient model designed according to the advantages of the transformer model for effectively modeling global information. Previous experiments on high-resolution public datasets have demonstrated that BIT outperforms most traditional convolutional models.
Most change detection models have been developed using only optical images; thus, the performance of the proposed DSNUNet model, which also uses SAR imagery, was first compared with other models based only on optical imagery. The accuracy metrics and model complexity of DSNUNet and the other models in the test step are presented in Table 3 and Figure 7, where it can be seen that DSNUNet achieved more accurate results than the other models.

Table 3. Comparison of DSNUNet with other models based on optical images.

To assess the effectiveness of adding SAR data to the change detection models, input-side improvements were introduced into the other models: the initial convolutional layer parameters of each model were adjusted to accommodate the VV and VH channels of the SAR data, which were stacked with the optical image as additional bands to form a five-channel input for change feature extraction. The accuracy metrics and model complexity of DSNUNet and the other models incorporating SAR imagery in the test step are presented in Table 4 and Figure 8. The results showed that DSNUNet achieved more accurate results than the other models that incorporated SAR image information. * The initial number of channels of the optical image branch was set to 32 for both SNUNet and DSNUNet; the initial number of channels of the SAR image branch in DSNUNet was set to eight.
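The input-side adjustment described above, widening a model's first convolutional layer from three to five channels, can be sketched as follows. Initializing the two new SAR channels from the mean of the RGB kernels is one common heuristic, labeled here as an assumption rather than the procedure actually used in the experiments.

```python
import torch
import torch.nn as nn

def widen_input_conv(conv, new_in_ch=5):
    """Replace a first conv layer so it accepts extra SAR bands; existing RGB
    weights are kept, and new channels start from their per-filter mean
    (an assumed initialization heuristic)."""
    new_conv = nn.Conv2d(new_in_ch, conv.out_channels, conv.kernel_size,
                         stride=conv.stride, padding=conv.padding,
                         bias=conv.bias is not None)
    with torch.no_grad():
        new_conv.weight[:, :conv.in_channels] = conv.weight
        extra = new_in_ch - conv.in_channels
        new_conv.weight[:, conv.in_channels:] = (
            conv.weight.mean(dim=1, keepdim=True).repeat(1, extra, 1, 1))
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv

# Example: adapt a 3-channel stem to a 5-channel (RGB + VV + VH) input.
old = nn.Conv2d(3, 16, 3, padding=1)
new = widen_input_conv(old, new_in_ch=5)
out = new(torch.randn(1, 5, 32, 32))
```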

Figure 8. Prediction results of DSNUNet and other models based on SAR images added to the input.
After adding SAR images to the test data as additional information, the performance of most models was improved to different degrees, indicating that the SAR image data could be helpful in improving the performance of the change detection algorithm. However, DSNUNet still had the advantage in accuracy over the other models. This finding demonstrated that introducing independent SAR image feature extraction branches to the proposed model can effectively improve the information utilization of SAR images by the change detection model.


Comparison of Different Channel Combinations
In model design, a key problem that is often overlooked is the width value of a feature extraction network. Indeed, the feature complexity of optical and SAR images is different. Therefore, extracting the SAR image feature using the same width as for the optical branch may bring redundant features to the SAR branch. Conversely, extracting the optical image features using the same width as for the SAR branch may cause a lack of feature characteristics.
In DSNUNet, different initial channel numbers were set in the encoder for the optical and SAR images. To determine the optimal feature channel combination, several experiments were conducted to assess the performance of DSNUNet using different combinations of initial channel numbers in the two branches; in both branches, the candidate numbers of feature channels were increased linearly. The accuracy metrics and model complexity of DSNUNet in the test step for different initial channel number combinations are presented in Table 5. * The combination of initial channel numbers is presented in the form: Optical branch-SAR branch.
In terms of accuracy-related metrics, the "32-8" combination revealed the best performance (78.37% Precision, 74.53% Recall, and 76.40% F1-Score). As the number of initial channels of the optical image branch increased, the model performance improved significantly, reflecting the higher complexity of optical image features. As the initial channel number of the SAR image branch increased, the model performance improved up to an optimum at eight initial channels and then decreased, indicating redundant SAR branch features. The relationship between the number of initial channels and model performance is shown in Figure 9. The comparison of prediction results obtained in the test step for different channel numbers is shown in Figure 10.
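As a rough illustration, the per-stage branch widths implied by an initial channel number can be sketched as follows; the linear growth schedule and the stage count are illustrative assumptions based on the description above, not values stated in the paper:

```python
def branch_widths(c0, stages=5):
    # Per-stage channel widths for one encoder branch, assuming the
    # linear growth described in the text (schedule and stage count
    # are illustrative assumptions).
    return [c0 * (i + 1) for i in range(stages)]

# Best-performing combination reported: optical 32, SAR 8.
optical_widths = branch_widths(32)  # [32, 64, 96, 128, 160]
sar_widths = branch_widths(8)       # [8, 16, 24, 32, 40]
```

Under this sketch, the SAR branch carries a quarter of the optical branch's channels at every stage, matching the intuition that SAR features are less complex and need less capacity.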

Comparison of Different Feature Fusion Methods
DSNUNet uses ECAM to perform feature fusion on the four outputs of the decoder. To assess the effectiveness of ECAM, several experiments were conducted comparing different feature aggregation methods (i.e., deep supervision, 1 × 1 convolution, and ECAM). Deep supervision outputs shallow features during model training to calculate the loss, thereby supervising the shallow features of the backbone network. The decoder of DSNUNet is consistent with Nested Unet [27], so features at different scales can be trained jointly. However, this method returns only the final single feature map in the prediction phase, so the information in the intermediate feature maps cannot be fully used.
The 1 × 1 convolution is another common approach for a decoder backbone of the Nested Unet, which converts combined features into a final prediction map using a 1 × 1 convolution layer after merging multiple outputs in channel dimensions.
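A 1 × 1 convolution is simply a per-pixel linear map across channels. A minimal NumPy sketch of fusing concatenated decoder outputs this way follows; the shapes and names are illustrative, not taken from the paper's implementation:

```python
import numpy as np

def conv1x1(features, weights, bias):
    # features: (C, H, W) concatenated decoder outputs
    # weights:  (K, C) 1x1 convolution kernels; bias: (K,)
    # Returns (K, H, W): each output pixel is a linear combination
    # of the C input channels at that same pixel location.
    C, H, W = features.shape
    out = weights @ features.reshape(C, H * W) + bias[:, None]
    return out.reshape(-1, H, W)

# Merge four single-channel maps, then reduce them to one prediction map.
merged = np.concatenate([np.ones((1, 2, 2)) * k for k in range(4)], axis=0)
pred = conv1x1(merged, np.ones((1, 4)), np.zeros(1))  # shape (1, 2, 2)
```

Because the kernel has no spatial extent, this fusion mixes channel information without changing spatial resolution, which is why it is a common final layer for a Nested Unet decoder.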
The workflow of different methods in the training and prediction phases is shown in Figure 11. The accuracy metrics of DSNUNet in the test step obtained using different feature fusion methods are reported in Table 6.
Figure 11. The workflow of different methods in the training and prediction phases: (a) the original deep supervision method, which uses multiple prediction maps for joint back-propagation during the training step but only a single feature map for prediction; (b) the 1 × 1 convolution-based method, which narrows the training-prediction disparity of method (a); (c) the ECAM module-based method, which provides fine-grained change maps by weighting the attention of multiple feature maps and aggregating their information.
The results indicated that using ECAM as a tool for multi-scale feature aggregation achieved the optimal performance. The accuracy results obtained using ECAM were 74.53, 76.40, and 79.18% for Recall, F1-score, and Precision, respectively.
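The aggregation idea behind ECAM can be sketched as attention-weighted averaging of the decoder outputs. The simplified NumPy version below illustrates the principle only (pooling, softmax weighting, weighted sum) and is not the paper's exact module:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_aggregate(feature_maps):
    # feature_maps: list of (C, H, W) decoder outputs.
    # Global average pooling gives one descriptor per map, softmax
    # turns the pooled responses into attention weights, and the maps
    # are summed with those weights. A simplified illustration of
    # ECAM-style aggregation, not the exact module.
    pooled = np.array([fm.mean() for fm in feature_maps])
    w = softmax(pooled)
    return sum(wi * fm for wi, fm in zip(w, feature_maps))

maps = [np.full((2, 2, 2), v) for v in (1.0, 3.0)]
fused = attention_aggregate(maps)  # weights favour the stronger map
```

Unlike deep supervision, the same weighted aggregation runs in both training and prediction, so no intermediate feature map is discarded at inference time.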

Discussion
In the first experiment, most change detection models revealed less accurate boundaries and more pseudo-changes in forest change detection. The SNUNet and DSNUNet models provided good detection results; however, the SAR-aided DSNUNet produced predictions closer to the observed data. BIT is a transformer-based change detection model that requires a longer training time and a larger training dataset than CNN-based models.
The introduction of SAR image data can provide more accurate forest variation characteristics. The F1-Score, Recall, and Precision values of DSNUNet were 2.23, 3.62, and 0.63% higher than those obtained using the optical image-based SNUNet, respectively (Table 3). However, the information complexities of optical and SAR image data are different. An effective fusion of multi-source remote sensing data can improve change detection performance, yet simply stacking optical and SAR images into multiple input channels yielded only a slight improvement in most models. The F1-Score, Recall, and Precision values of DSNUNet were higher than those obtained using the optical and SAR image-based SNUNet by 1.65, 0.74, and 2.62%, respectively (Table 4). This is because those models merge different information prematurely, in the feature initialization step of the input, so the value of the SAR data is not fully used. Therefore, DSNUNet uses two sets of Siamese branches to extract features from optical and SAR images, which can effectively explore the spatial and semantic information of different data and improve detection performance.
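The late-fusion layout described above can be sketched structurally as follows. The stand-in encoder and the 4 × 4 feature size are purely illustrative assumptions; only the separate-branch-then-concatenate structure and the 32/8 widths come from the text:

```python
import numpy as np

def encode(image, width):
    # Stand-in for one Siamese encoder branch: returns a single
    # (width, 4, 4) "feature map" (purely illustrative).
    return np.full((width, 4, 4), image.mean())

def dual_branch_features(opt_t1, opt_t2, sar_t1, sar_t2):
    # Optical and SAR pass through separate branches (weights shared
    # across the two dates within each branch); per-date features are
    # then concatenated on the channel axis, giving 32 + 8 channels
    # for the best combination reported in the text.
    fused = []
    for opt, sar in ((opt_t1, sar_t1), (opt_t2, sar_t2)):
        fused.append(np.concatenate([encode(opt, 32), encode(sar, 8)], axis=0))
    return fused

t1, t2 = np.ones((3, 4, 4)), np.zeros((3, 4, 4))
s1, s2 = np.ones((2, 4, 4)), np.zeros((2, 4, 4))
f1, f2 = dual_branch_features(t1, t2, s1, s2)  # each has 40 channels
```

Contrast this with early fusion, which would concatenate the raw optical and SAR channels before any encoder and thus force both sources through the same feature initialization.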
In the second experiment, the proposed model's performance was found to be optimal at 32 and 8 initial channels in the optical and SAR image branches, respectively. With this 32-8 combination, DSNUNet revealed Precision, Recall, and F1-Score values of 78.37, 74.53, and 76.40%, respectively. As shown in Figure 9, too many initial channels in the SAR image branch can cause information redundancy, while a moderate number of initial channels offers a compromise between the number of parameters and model performance. Figure 12 shows the variation in loss during training of DSNUNet with different initial channel combinations. As the number of epochs increased, the loss value gradually declined and stabilized, which shows that the DSNUNet model fits the forest change data well.
DSNUNet also has a stronger tolerance to clouds in images, since SAR can provide high-resolution images even under cloudy conditions. As shown in Figure 13, DSNUNet suppressed the pseudo-changes caused by cloud layers more effectively than the other models. These cloud-covered images were not involved in training, indicating that the characteristics of SAR images made the model's change detection more robust. The following aspects could be addressed in future studies. First, the proposed model's structure could be improved.
Although DSNUNet uses two sets of branches to obtain different feature combinations, simple splicing is used to merge information in the decoding stage. Moreover, the number of feature channels can change dramatically during decoding, which may cause information losses. Second, the feature extraction backbone of DSNUNet is relatively simple; a more complex backbone could be used in the future to improve model performance. Moreover, an attempt could be made to classify different forest change types, such as changes in dominant tree species and the source of change (deforestation or fire). Finally, increases in forest area could be extracted to facilitate statistical analyses by related departments.

Conclusions
In this paper, a double Siamese network for forest change detection called DSNUNet was proposed. The DSNUNet model takes two types of inputs, namely optical and SAR images. SAR images, as a common data source in forest change detection, can effectively improve the effectiveness and robustness of a change detection model. Instead of simply combining the two types of input images, DSNUNet uses two sets of coding branches of different widths to extract features from optical and SAR images, effectively exploiting information from different data sources, and performs information fusion by convolutional operations in the decoder stage. Compared with other change detection models, the proposed model revealed higher evaluation metrics (Precision, Recall, and F1-Score values of 78.37, 74.53, and 76.40%, respectively). According to the obtained results, DSNUNet can effectively merge the features of optical and SAR images, which is important for improving the performance of forest change detection. For areas persistently under cloud cover, this study helps to improve the efficiency of forest resource surveys.
Author Contributions: J.J. wrote the manuscript and designed the comparative experiments; W.W. and D.M. supervised the study and revised the manuscript; Y.X. and E.Y. revised the manuscript.


Conflicts of Interest:
The authors declare that they have no conflicts of interest.