A Novel End-to-End Unsupervised Change Detection Method with Self-Adaptive Superpixel Segmentation for SAR Images

Change detection (CD) methods using synthetic aperture radar (SAR) data have received significant attention in the field of remote sensing Earth observation; they mainly comprise knowledge-driven and data-driven approaches. Knowledge-driven CD methods are based on physical theoretical models with strong interpretability, but they lack robust, deeply mined features. In contrast, data-driven CD methods can extract deep features, but they require abundant training samples, which are difficult to obtain for SAR data. To address these limitations, an end-to-end unsupervised CD network based on self-adaptive superpixel segmentation is proposed. Firstly, reliable training samples were selected using an unsupervised pre-task. Then, the superpixel generation and the Siamese CD network were integrated into a unified framework and trained end-to-end until the global optimal parameters were obtained. Moreover, the backpropagation of the joint loss function promoted the adaptive adjustment of the superpixels. Finally, the binary change map was obtained. Several public SAR CD datasets were used to verify the effectiveness of the proposed method. A transfer learning experiment was implemented to further explore the change detection ability and generalization performance of our network. The experimental results demonstrate that the proposed method outperformed seven other advanced deep-learning-based CD methods. Specifically, our method achieved the highest accuracy in OA, F1-score, and Kappa, and also showed superiority in suppressing speckle noise, refining change boundaries, and improving the detection accuracy of small-area changes.


Introduction
Natural and human activities have a continuous impact on Earth's resources and environment. The accurate detection of changes is of great significance in resource and environmental protection [1], agricultural survey [2], urban renewal [3,4], forest resource management [5], and other applications of Earth observation. Remote sensing Earth observation has the advantages of large-scale and periodic observations. By using image processing and pattern recognition techniques, change information can be identified from multi-temporal remote sensing data.
Synthetic aperture radar (SAR) is an advanced active remote sensing technology characterized by its penetration ability, all-weather and all-time operation, wide coverage, and other advantages. Therefore, SAR images provide crucial data support for acquiring ground information in harsh environments and are widely used in remote sensing change detection (RSCD).
To the best of our knowledge, change detection (CD) methods that use multi-temporal SAR images can be divided into two categories: traditional knowledge-driven methods and data-driven methods, both of which involve supervised and unsupervised designs. Supervised methods require a large number of labeled samples as prior knowledge, which are difficult to obtain for SAR images. Therefore, our research focuses on designing an automatic and efficient unsupervised CD method for SAR data.
In the early stages of SAR CD research, the majority of studies focused on developing pixel-based change detection (PBCD) methods. These methods mainly involve three steps: image preprocessing, difference image (DI) generation, and DI analysis. Specifically, image preprocessing includes speckle noise filtering, radiometric calibration, geometric correction, registration, etc., which generate comparable multi-temporal images with less noise. The traditional knowledge-driven methods mainly focus on DI generation and DI analysis. In the DI generation step, the most straightforward methods are based on image algebra, such as image difference [6] and image ratio [7]. More complex approaches include methods based on image transformation, such as principal component analysis (PCA) [8,9] and change vector analysis (CVA) [10,11]. Other methods include texture analysis [12], edge-based detection [13,14], machine learning [15,16], GIS analysis [17], and mixed techniques. After the DI is extracted, thresholding [18], clustering [19], or other advanced methods [20,21] are used to analyze it, and the binary change map is finally obtained.
However, the PBCD method is sensitive to speckle noise and thus produces change maps with false alarms, holes, and jagged boundaries. To solve these problems, object-based change detection (OBCD) methods have been proposed. These methods [22-24] segment pixels into image objects and take them as study units, which can smooth holes and improve the boundary detection accuracy. However, the proper setting of scale parameters in OBCD is complex and mechanical, and improper settings may cause important small changes to be missed. Moreover, the performance of OBCD methods depends on the accuracy of the segmentation algorithm. As a compromise, the CD method based on superpixel segmentation [25] has become a popular choice to generate uniform and homogeneous regions with the ability to perceive semantic information.
For the object or superpixel segmentation of multi-temporal images, three strategies are generally used [26,27]: (1) Only one phase image is segmented, and the other image is directly stacked with this segmentation result to perform CD. This strategy causes missed and false detections. (2) Multi-temporal images are segmented independently, which often produces sliver polygons due to the inconsistency of the segmentations. As a result, it is difficult to perform CD analysis, and segmentation errors propagate to the CD analysis step. (3) The multi-temporal images are segmented simultaneously by stacking them; this approach has low computational efficiency and often leads to over-segmentation and boundary fragmentation. Therefore, a better segmentation strategy still needs to be investigated.
Due to the continuous development of satellite sensors, the large number of accumulated images provides opportunities for data-driven deep learning change detection (DLCD) methods. Deep learning automatically extracts high-level features from images [28] and has been proven to be an effective feature-learning technique [29-31]. The end-to-end DLCD method can directly obtain CD results from multi-temporal images. Moreover, the extracted deep features are robust to speckle noise [32]. To the best of our knowledge, according to the strategy of fusing multi-temporal information, DLCD methods are of three types: (1) Early fusion [33], where multi-temporal information is fused before being input into the network. To increase the available information, other useful handcrafted features can be added for different tasks. (2) Multi-temporal information fusion based on the Siamese network [34,35], where the multi-temporal images are input into different branches of the Siamese network to learn the correlation or difference in the multi-temporal information. (3) Multi-temporal information fusion based on the recurrent neural network (RNN) [36], where the RNN is used to mine the dependence of sequential images acquired at different times. Among these three strategies, the Siamese network has proven to be more specific to CD and has great potential to improve detection accuracy [37,38]. Currently, DLCD methods generally input either sampled patches (patch-based) or the whole image (image-based) into the network. The patch-based method has low computational efficiency and loses a considerable amount of spatial context information. In contrast, image-based fully convolutional network methods, such as FCN [39], U-Net [40], DeepLab [41], and their variants, can accept an input image of any size (if the computing memory allows) and utilize global context information to generate dense pixel-wise predictions. These methods are efficient and accurate and have become the mainstream networks in the DLCD field [25].
Furthermore, existing DLCD methods mainly take a pixel as the basic analysis unit, which limits their ability to perceive object boundaries and model semantic information. Therefore, researchers have proposed hybrid methods combining deep learning with object-based methods or superpixel segmentation. For example, ref. [42] proposed an object-based method that used a convolutional neural network (CNN) to extract change features, which achieved higher accuracy and computational efficiency. In [43], a CD method combining a neural network and the extraction of superpixel-level change features was proposed, which can obtain a robust and high-contrast CD result. The authors in [44] proposed a CD method combining superpixel segmentation and a graph neural network. Bi-temporal superpixel maps were generated via simple linear iterative clustering (SLIC) [45], and the superpixel-level change features were extracted to generate the graph. However, the above methods isolate the superpixel or object generation from the deep network training. The generated segmentation cannot be dynamically adjusted during training, which greatly limits the CD performance and prevents the global optimal solution from being obtained.
To solve the above problems, a novel end-to-end unsupervised CD method combining a superpixel segmentation network and a Siamese deep convolutional network is proposed. Two weight-sharing superpixel sampling networks (SSNs) [46] are introduced in series with a Siamese deep convolutional CD network, and the overall framework still follows the Siamese architecture (Figure 1). Firstly, two SSNs are used to generate superpixels and deep features containing segmentation information with less noise. Then, the U-Net-based Siamese CD network is used to extract multi-scale change information. The proposed method can train the superpixel segmentation part and the CD network end-to-end under a unified framework and finally obtain the global optimal parameters. During the training process, the task-specific loss function promotes the adaptive attachment of the superpixels to the change boundary. The main contributions of this paper are as follows: (1) This study combines knowledge-driven and unsupervised learning to propose an end-to-end CD network. The incorporation of superpixel segmentation information is an interesting practice of integrating prior knowledge into the deep learning technique. The superpixels generated by our proposed method can be adjusted adaptively, which ensures better consistency in the superpixel segmentation of unchanged areas and closer segmentation to change boundaries in changed areas for the bi-temporal data. (2) This study is the first to explore the ability of the network to detect changes, which is crucial for the generalization performance of CD networks. We designed transfer learning experiments between homogeneous data and even heterogeneous data to explore the change detection ability and generalization performance. This information is of great importance for the future development of DLCD for SAR images with no or limited samples. (3) The proposed method is unsupervised and is well suited to SAR data with extremely limited labeled samples. Preprocessed SAR images of different sizes can be input into our network to obtain the change map with high accuracy. Furthermore, this method has the potential to be extended to more complex sequential image processing.
Algorithm 1 (excerpt, steps 5-8):
5. ... (7) and (9).
6. For each of the two acquisition times, calculate the task-specific reconstruction loss as in Equation (10), and take the positional pixel features of the input to calculate the compactness loss as in Equation (11).
7. Calculate the joint loss as in Equation (15).
8. Update the parameters of the networks based on the joint loss.
End for

Unsupervised Change Detection Workflow
The proposed method works in an unsupervised manner. The first step was to generate reliable training samples via an unsupervised pre-task to train the network.
Given two coregistered SAR intensity images acquired at different times, $t_1$ and $t_2$, over the same geographic area, the log-ratio operator [7] was used to generate the DI. Previous studies have shown that, for SAR images, the ratio operator is not only more robust to calibration errors but can also suppress multiplicative noise [47]. The log-ratio operator takes the logarithm of the ratio image, which further converts the residual multiplicative noise into additive noise that is easier to process, as shown in Equation (1), where LR is the log-ratio DI and log represents the natural logarithm.
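The log-ratio step can be illustrated with a minimal sketch. It assumes the commonly used absolute log-ratio form, |log(I2 / I1)|, and a small constant `eps` to avoid taking the logarithm of zero; both are assumptions made for illustration, as are the function and variable names, and the exact form of Equation (1) may differ.

```python
import math

def log_ratio(img1, img2, eps=1e-6):
    """Absolute log-ratio difference image of two co-registered intensity images.

    img1, img2: 2-D lists of non-negative intensities with the same shape.
    eps is an assumed small constant that avoids log(0) and division by zero.
    """
    return [
        [abs(math.log((b + eps) / (a + eps))) for a, b in zip(row1, row2)]
        for row1, row2 in zip(img1, img2)
    ]

# Toy example: only the pixels whose intensity changed get a large DI value.
t1 = [[10.0, 10.0], [20.0, 5.0]]
t2 = [[10.0, 40.0], [20.0, 20.0]]
di = log_ratio(t1, t2)
```

Because the ratio is taken before the logarithm, a fourfold intensity change gives the same DI value (about log 4) regardless of the absolute intensity level.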
After the DI was obtained, the hierarchical FCM (HFCM) [48] clustering algorithm was used to classify the DI into three clusters: changed class $\Omega_c$, unchanged class $\Omega_u$, and uncertain class $\Omega_i$. Pixels belonging to $\Omega_c$ and $\Omega_u$ could be considered reliable samples with a high probability of being changed or unchanged. Although the "uncertain" pixels were not used as training samples, their semantic information could still be utilized because we used an image-based fully convolutional network.
The above unsupervised design allowed us to perform CD even with only one target image pair. However, only a few pixels in this image pair were selected as samples, and the other pixels needed to be further classified, so training samples were extremely rare. Therefore, data augmentation was used to increase the number of samples and prevent overfitting [49,50]. We applied random crops and rotations in multiples of 90° (90°, 180°, and 270°) to the normalized bi-temporal images and the generated pseudo-label map with a 50% probability. Then, we used these samples to train the network and finally obtained the binary change map. Although the training of the network was supervised, the selection of training samples was unsupervised, so the whole CD flow was essentially unsupervised, as shown in Figure 2.
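The augmentation described above can be sketched as follows. This is an illustrative sketch, not the authors' code: it assumes that a square crop is always taken and that only the rotation is applied with 50% probability, and all function names are hypothetical. The key point is that the same transform must be applied to both images and to the pseudo-label map.

```python
import random

def rot90(grid, k):
    """Rotate a 2-D list counter-clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        grid = [list(row) for row in zip(*grid)][::-1]
    return grid

def crop(grid, top, left, size):
    """Cut a size x size window whose upper-left corner is (top, left)."""
    return [row[left:left + size] for row in grid[top:top + size]]

def augment(img1, img2, label, crop_size, rng, p=0.5):
    """Apply one random crop and, with probability p, one random rotation
    identically to both images and the pseudo-label map."""
    h, w = len(label), len(label[0])
    top = rng.randint(0, h - crop_size)
    left = rng.randint(0, w - crop_size)
    out = [crop(g, top, left, crop_size) for g in (img1, img2, label)]
    if rng.random() < p:
        k = rng.choice([1, 2, 3])  # 90, 180, or 270 degrees
        out = [rot90(g, k) for g in out]
    return out

rng = random.Random(0)
img = [[r * 4 + c for c in range(4)] for r in range(4)]
a1, a2, lab = augment(img, img, img, crop_size=2, rng=rng)
```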
Remote Sens. 2023, 15, x FOR PEER REVIEW

Superpixel Sampling Networks (SSNs)
The SSN [46] is the first end-to-end deep superpixel segmentation network that can be easily integrated with a downstream deep network to generate task-specific superpixels. This paper aims to use the SSN to generate CD-specific superpixels that adhere better to the change boundary, reduce the influence of speckle noise, and refine the boundary of the final change map.
Figure 3 shows the overall architecture of the SSN, which consists of two parts: (1) the CNN-based feature extractor, where a deep network is used to extract features for superpixel segmentation in place of manually designed features; (2) differentiable SLIC (DSLIC), where the features from the feature extractor are fed into the DSLIC to implement the superpixel segmentation. Given an image to be segmented as the input of the SSN, we obtain the pixel-superpixel soft association $Q \in \mathbb{R}^{n \times m}$ and the high-dimensional features $F^{pix} \in \mathbb{R}^{n \times c}$. $Q$ can be used to realize the mutual mapping between the pixel feature representation $P \in \mathbb{R}^{n \times c}$ and the superpixel feature representation $S \in \mathbb{R}^{m \times c}$, where $n$ is the number of pixels, $m$ is the number of superpixels, and $c$ is the number of channels.

The Input of the SSN
In [46], the SSN was used for RGB image segmentation, and its input was 5-dimensional scaled XYLab features, comprising three-channel CIELAB color and two-channel positional features (x, y). The position and color scales are denoted $\gamma_{pos}$ and $\gamma_{color}$, respectively. The value of $\gamma_{color}$ was selected by experience or trial, while the value of $\gamma_{pos}$ was determined by the number of superpixels:

$$\gamma_{pos} = \eta \max\left(\frac{m_w}{n_w}, \frac{m_h}{n_h}\right) \tag{2}$$

where $m_w$, $m_h$, $n_w$, and $n_h$ represent the initial number of superpixels and the number of pixels along the image width (w) and height (h), respectively, and $\eta$ is an empirical constant, set to 2.5 in [46].
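As a small worked example of the position scale used in [46], the sketch below evaluates η · max(m_w / n_w, m_h / n_h) for a toy image. The function name and the example sizes are hypothetical and chosen only for illustration.

```python
def position_scale(m_w, m_h, n_w, n_h, eta=2.5):
    """Position scale for XY features: eta times the larger of the
    superpixel-to-pixel ratios along the image width and height."""
    return eta * max(m_w / n_w, m_h / n_h)

# A 200 x 100 pixel image (width x height) with a 10 x 10 initial superpixel grid.
gamma_pos = position_scale(m_w=10, m_h=10, n_w=200, n_h=100)
```

Here the height ratio (10 / 100) dominates, so the scale is 2.5 × 0.1 = 0.25. A denser superpixel grid yields a larger scale, which weights position more heavily relative to color.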

Feature Extractor Based on CNN
As shown in Figure 3, the feature extractor is a common CNN-based network, consisting of a series of 3 × 3 convolutional layers, batch normalization (BN) layers, and rectified linear unit (ReLU) nonlinear layers. After the second and fourth convolution layers, 2 × 2 max-pooling layers were used for downsampling to expand the receptive field. Skip connections were used to fuse the multi-scale information from shallow and deep layers. The number of output channels of each hidden layer (i.e., the base channel) was set to 64. The number of feature channels of the output layer was set to (K − 1). Then, the one-channel input and the (K − 1)-channel output were concatenated to produce the final K-dimensional pixel features. This feature extractor can also be replaced by other networks. The resulting K-dimensional features are fed into the DSLIC and the downstream Siamese CD networks. The pixel-superpixel association $Q \in \mathbb{R}^{n \times m}$ is iteratively updated.

Differentiable SLIC
The core of differentiable SLIC is to replace the non-differentiable nearest-neighbor operation of SLIC [45] with a distance-based soft association $Q \in \mathbb{R}^{n \times m}$ defined by a Gaussian radial basis function. The SSN is initialized by dividing the image into a regular grid of initial superpixels; the clustering algorithm can be soft k-means or another method. For the pixel feature representation $P \in \mathbb{R}^{n \times c}$ and the superpixel feature representation $S \in \mathbb{R}^{m \times c}$, the soft association between pixel $i$ and superpixel $j$ at the $t$-th iteration is calculated as follows:

$$Q_{ij}^{t} = e^{-D\left(P_i,\, S_j^{t-1}\right)} \tag{3}$$

where $D$ denotes the distance computation. The new superpixel centers are computed as the weighted sum of pixel features:

$$S_j^{t} = \frac{\sum_{i=1}^{n} Q_{ij}^{t} P_i}{\sum_{i=1}^{n} Q_{ij}^{t}} \tag{4}$$

For convenience, denote the column-normalized $Q^t$ as $\hat{Q}^t$; the update of the superpixel centers can then be rewritten as

$$S^{t} = \left(\hat{Q}^{t}\right)^{\top} P \tag{5}$$

Through Equation (5), one can realize the mapping from the pixel to the superpixel representation, and the inverse mapping from the superpixel to the pixel representation can be achieved through Equation (6):

$$\tilde{P} = \tilde{Q}^{t} S^{t} \tag{6}$$

where $\tilde{Q}^t$ is the row-normalized $Q^t$ and $\tilde{P}$ is the pixel representation recovered from the superpixels. In the calculation of $Q \in \mathbb{R}^{n \times m}$, only the 9 superpixels surrounding each pixel are considered to improve the computational efficiency, that is, m = 9. This simplification is similar to the nearest-neighbor search of SLIC. The iterative update of $Q$ and $S$ is realized by Equations (3) and (4). Note that $P$ is updated through the training of the model rather than within these iterations.
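The alternating update of the soft association and the superpixel centers can be sketched in a few lines. This is an illustrative sketch rather than the SSN implementation: it assumes the squared Euclidean distance for D, computes the association against all centers instead of only the 9 surrounding superpixels, and uses hypothetical function names and toy one-dimensional features.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors (assumed choice of D)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def soft_slic(pixels, centers, n_iter=10):
    """pixels: n x c feature vectors; centers: m x c initial superpixel centers.
    Returns the last soft association Q (n x m) and the updated centers (m x c)."""
    n, c = len(pixels), len(pixels[0])
    for _ in range(n_iter):
        # Soft association: Q[i][j] = exp(-D(P_i, S_j)) with the current centers.
        q = [[math.exp(-sq_dist(p, s)) for s in centers] for p in pixels]
        # Center update: association-weighted mean of the pixel features.
        new_centers = []
        for j in range(len(centers)):
            z = max(sum(q[i][j] for i in range(n)), 1e-12)  # guard against underflow
            new_centers.append(
                [sum(q[i][j] * pixels[i][d] for i in range(n)) / z for d in range(c)]
            )
        centers = new_centers
    return q, centers

# Two well-separated groups of 1-D features and two rough initial centers.
pixels = [[0.0], [0.2], [5.0], [5.2]]
q, centers = soft_slic(pixels, [[1.0], [4.0]])
# Hard assignment (used only at inference): the most strongly associated superpixel.
labels = [max(range(len(row)), key=lambda j: row[j]) for row in q]
```

Because every step is built from exponentials, sums, and divisions, the whole loop is differentiable with respect to the pixel features, which is what allows a downstream loss to adjust the superpixels during training.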

End-to-End Change Detection Network with SSN

Overall Framework
As shown in Figure 1, SAR images acquired at $t_1$ and $t_2$ were fed into two weight-shared SSNs to obtain the pixel-superpixel associations $Q_1$ and $Q_2$ and the high-level features $F_1^{pix}$ and $F_2^{pix}$, which were passed to the two branches of the following Siamese CD network. This study used two Siamese CD networks with different designs, which we introduce in the next section. The joint loss, which combines multiple loss functions with different roles, was calculated between the predicted probability map and the label map; it also required $Q_1$ and $Q_2$ as well as the positional features of the input. Finally, the binary change map was obtained. It is worth noting that $F_1^{pix}$ and $F_2^{pix}$ were features generated specifically for superpixel segmentation; that is to say, these features were averaged according to the segmented superpixels to suppress the noise.

Siamese CD Network
In [34], three fully convolutional neural network (FCNN) architectures were proposed for the CD of Earth observation data, and two of these Siamese networks were used as our CD network. We connected the SSN in series with each of these two Siamese networks to verify the effectiveness of the proposed method.

• FC-Siam-conc
The first Siamese network (Figure 4a) is a fully convolutional network based on the encoder-decoder architecture. Ignoring one branch of this network, the backbone is actually a shallow version of U-Net [29]. This structure uses the Siamese network as its encoder to process images acquired at different times through two branches with shared weights. This design can fully mine bi-temporal information to generate bi-temporal high-level features. The multi-level features from the two branches of the encoder are concatenated with the output of the decoding layer at the corresponding scale using two skip connections. The purpose of this design is to use the decoder to mine the correlation and difference between the bi-temporal information. This structure is named fully convolutional Siamese concatenation (FC-Siam-conc).

• FC-Siam-diff
The second Siamese network (Figure 4b) differs from FC-Siam-conc only in that it uses one skip connection to obtain the absolute value of the difference between the bi-temporal features from the two encoding streams at each scale. This difference feature is then concatenated with the output of the decoding layer. This design mines the multi-scale change information by adding a robust and explicit difference, which is more specific to CD, and is named FC-Siam-diff.
These two Siamese networks have the same backbone, using 3 × 3 convolutional layers and 2 × 2 max-pooling layers for downsampling as well as BN layers to speed up model convergence. Each block uses residual connections to mitigate gradient vanishing. The skip connections between the encoder and the decoder are used to fuse multi-scale information. The implementation details are shown in Figure 4.
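The difference between the two skip-connection designs can be illustrated with a toy sketch in which a feature map is a list of channels and each channel is a flat list of values. The function names and the channel ordering of the concatenation are assumptions made for illustration; the actual networks operate on tensors inside the decoder.

```python
def fuse_conc(decoder, feat_t1, feat_t2):
    """FC-Siam-conc style: concatenate the decoder output with the encoder
    features of both acquisition times along the channel axis."""
    return decoder + feat_t1 + feat_t2

def fuse_diff(decoder, feat_t1, feat_t2):
    """FC-Siam-diff style: concatenate the decoder output with the absolute
    difference of the two encoder features."""
    diff = [[abs(a - b) for a, b in zip(c1, c2)] for c1, c2 in zip(feat_t1, feat_t2)]
    return decoder + diff

decoder = [[0.5, 0.5]]                 # 1 channel, 2 pixels
feat_t1 = [[1.0, 2.0], [3.0, 4.0]]     # 2 channels from the t1 branch
feat_t2 = [[1.0, 5.0], [0.0, 4.0]]     # 2 channels from the t2 branch

conc = fuse_conc(decoder, feat_t1, feat_t2)   # 1 + 2 + 2 = 5 channels
diff = fuse_diff(decoder, feat_t1, feat_t2)   # 1 + 2 = 3 channels
```

The diff variant hands the decoder an explicit change signal that is zero wherever the two encoder features agree, while the conc variant leaves it to the decoder to learn the comparison.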

Loss Function
The number of samples in the changed set $\Omega_c$ and the unchanged set $\Omega_u$ is often highly unbalanced, and $\Omega_c$ usually has fewer samples. If this imbalance is not considered, the detection accuracy for the changed class will be reduced. Therefore, we use the weighted cross-entropy (CE) function, defined in Equation (7), to deal with this problem. In Equation (7), $P$ and $G$ represent the predicted change map and the ground truth, respectively, $i$ is the pixel index, and $\omega_c$ and $\omega_u$ represent the weights of the changed and unchanged classes, respectively. $N$ is the number of pixels excluding ignored pixels. Dice loss is also an appropriate choice to further reduce the problem of sample imbalance. Dice similarity, defined in Equation (8), measures the similarity of the predicted change map $P$ and the ground truth $G$, and its value lies in [0, 1]. The dice loss is defined in Equation (9).
To generate task-specific and more compact superpixels, we use a combination of a task-specific reconstruction loss (Equation (10)) and a compactness loss (Equation (11)) to train the SSN, as in [46]. Here, $I^{xy}$ represents the positional pixel features of the input. First, $I^{xy}$ is mapped into the superpixel space to obtain $S^{xy}$ through Equation (12). Then, each pixel is assigned to a single superpixel through hard association rather than the soft association $Q$ to obtain $\bar{I}^{xy}$, which also belongs to the pixel space. As shown in Equation (11), the compactness loss is the L2 norm of the difference between $I^{xy}$ and $\bar{I}^{xy}$. Finally, the joint loss function is defined in Equation (15), and we used it to train our CD networks.
In Equation (15), $\lambda_1$ and $\lambda_2$ are the weight factors. The first term is the main component of the loss function and drives the overall network learning. The second and last terms are calculated for the two acquisition times, which encourages the network to mine the bi-temporal information simultaneously and as fully as possible to generate task-specific and compact superpixels. In addition, Algorithm 1 provides the training steps of our proposed method.
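The two change-detection loss terms can be sketched with their standard textbook forms. This is a hedged sketch, not a transcription of Equations (7)-(9): the label value used for ignored ("uncertain") pixels, the clamping constant `eps`, the function names, and the choice to skip ignored pixels in the dice term are all assumptions. The default class weights are the values reported in the experimental setting.

```python
import math

IGNORE = -1  # assumed label value for "uncertain" pixels excluded from the loss

def weighted_ce(pred, label, w_c=0.4, w_u=0.6, eps=1e-7):
    """Weighted binary cross entropy averaged over the non-ignored pixels.
    pred: change probabilities in [0, 1]; label: 1 changed, 0 unchanged."""
    total, n = 0.0, 0
    for p, g in zip(pred, label):
        if g == IGNORE:
            continue
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total -= w_c * math.log(p) if g == 1 else w_u * math.log(1.0 - p)
        n += 1
    return total / n if n else 0.0

def dice_loss(pred, label, eps=1e-7):
    """One minus the soft Dice similarity, computed over the non-ignored pixels."""
    pairs = [(p, g) for p, g in zip(pred, label) if g != IGNORE]
    inter = sum(p * g for p, g in pairs)
    denom = sum(p for p, _ in pairs) + sum(g for _, g in pairs)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

pred = [0.9, 0.2, 0.6, 0.1]
label = [1, 0, IGNORE, 0]
ce = weighted_ce(pred, label)
dl = dice_loss(pred, label)
```

The third pixel is labeled as uncertain, so it contributes to neither term, which mirrors the statement that N counts only the non-ignored pixels.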

Datasets and Evaluation Criteria
Several public CD datasets were used in our experiment, including the Ottawa dataset, Sulzberger dataset, Yellow River dataset, and San Francisco dataset, all of which comprise single-polarization SAR images and are commonly used in published papers. In addition, we also collected an optical CD dataset, the Mexico dataset.

• Ottawa dataset: two images were acquired by the Radarsat-1 satellite over Ottawa in May 1997 and August 1997, and the change was caused by summer flooding (Figure 5).
ETM+ images in band 4, the near infrared (NIR) band.This dataset shows the destruction of vegetation after a forest fire in Mexico city (Figure 9).
Refer to Table 1 for more information about the experimental datasets.
Five evaluation indicators were used to quantitatively evaluate the method: overall accuracy (OA), precision (Pre), recall, F1-score (F1), and the Kappa coefficient. Specifically, OA is the ratio of the correctly predicted pixels to all pixels. Precision is the ratio of the number of pixels correctly predicted as the changed class to the total number of pixels predicted as the changed class. Recall is the ratio of the number of pixels correctly predicted as the changed class to the total number of changed pixels in the ground truth. The F1-score combines precision and recall and is often used to evaluate binary classification accuracy. They are calculated as follows:

$$OA = \frac{TP + TN}{TP + TN + FP + FN}$$

$$Pre = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \times Pre \times Recall}{Pre + Recall}$$

where TP and TN are the numbers of true positives and true negatives, and FP and FN are the numbers of false positives and false negatives. The Kappa coefficient measures the overall consistency between the predicted map and the ground truth, and its value is a more reliable reference for CD with sample imbalance. It is calculated as follows:

$$Kappa = \frac{OA - PE}{1 - PE}$$

$$PE = \frac{(TP + FP)(TP + FN) + (FN + TN)(FP + TN)}{(TP + TN + FP + FN)^2}$$

where PE is the agreement expected by chance.
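The five indicators can be computed directly from the four confusion-matrix counts. The sketch below uses the standard definitions of these metrics; the function name and the example counts are hypothetical, and no guard is included for degenerate cases such as a map with no predicted changed pixels.

```python
def cd_metrics(tp, tn, fp, fn):
    """Overall accuracy, precision, recall, F1-score and Kappa coefficient
    from the counts of a binary change map compared with the ground truth."""
    total = tp + tn + fp + fn
    oa = (tp + tn) / total
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * pre * rec / (pre + rec)
    # Agreement expected by chance, from the row and column marginals.
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (oa - pe) / (1 - pe)
    return {"OA": oa, "Pre": pre, "Recall": rec, "F1": f1, "Kappa": kappa}

# Toy counts: 100 pixels, 45 of them truly changed.
m = cd_metrics(tp=40, tn=50, fp=5, fn=5)
```

With these counts OA is 0.90 while Kappa is about 0.80, which illustrates why Kappa is the more conservative indicator: it discounts the agreement that would be reached by chance.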

Experimental Setting
We implemented the proposed method in PyTorch v1.8, and the training was driven by the NVIDIA Quadro RTX 8000 GPU produced by Lenovo in Beijing, China.
The hyperparameter settings are shown in Table 2. For ease of reference, we divide the hyperparameters into four groups: universal deep learning hyperparameters, adjustable parameters of the feature extractor, parameters of the differentiable SLIC in the SSN, and hyperparameters of the loss function. The initial learning rate was set to 0.001 and halved every 100 epochs using Lr_scheduler. L2 regularization and the aforementioned data enhancement strategy were used to mitigate overfitting. The batch size could only be set to one. The crop size was adjusted according to the size of the image, and the number of superpixels in the SSN was adjusted according to the crop size. For the feature extractor of the SSN, we set the base channel to 64 and the output channel K to 20 for our data; both can be adjusted based on the complexity of the data. The number of iterations for the differentiable SLIC in the SSN was set to 10 for both training and prediction. In the loss function, $\omega_c$ and $\omega_u$ were set to 0.4 and 0.6, and $\lambda_1$ and $\lambda_2$ were set to 0.0001 and 1.0.
In addition, this study only involved single-polarization SAR data. When we tried to add two-channel positional features (x and y) to the input, the model had difficulty converging. We suspect that the model mistakenly treated the positional features as more important than the original image, as the original image has only one channel while the positional features have two. We also tried adding the positional features just before the differentiable SLIC of the SSN, which allowed the model to converge. However, the results of 100 experiments showed that adding positional features reduced the detection accuracy by 1-2%. We infer that the high-level features extracted by the deep network are already highly effective for superpixel segmentation, and that adding raw, low-level positional features may dilute them, resulting in a decrease in accuracy. Therefore, in this study, only normalized single-channel SAR data were fed into the SSN, without positional features. Consequently, the scaling factors $\gamma_{pos}$ and $\gamma_{color}$, as well as $\eta$, do not need to be set.
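As a compact restatement of the settings above, the following minimal sketch collects the quoted values in one place and reproduces the step decay of the learning rate. The dictionary keys and the function name are illustrative and do not come from the original implementation; in PyTorch, a schedule of this kind is commonly obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)`, which we assume is close to what was used.

```python
# Values quoted in the text; the key names are illustrative assumptions.
HPARAMS = {
    "initial_lr": 1e-3,
    "lr_decay_factor": 0.5,    # the rate is "halved"
    "lr_decay_every": 100,     # epochs between two decays
    "batch_size": 1,
    "ssn_base_channels": 64,
    "ssn_out_channels": 20,    # output channel K
    "slic_iterations": 10,     # training and prediction
    "omega_c": 0.4,
    "omega_u": 0.6,
    "lambda_1": 1e-4,
    "lambda_2": 1.0,
}


def lr_at_epoch(epoch: int, hp: dict = HPARAMS) -> float:
    """Step schedule: multiply the rate by the decay factor once per decay period."""
    completed_decays = epoch // hp["lr_decay_every"]
    return hp["initial_lr"] * hp["lr_decay_factor"] ** completed_decays


if __name__ == "__main__":
    for epoch in (0, 99, 100, 200, 299):
        print(epoch, lr_at_epoch(epoch))
```

Under this schedule, a 300-epoch run uses three learning rates: 0.001, 0.0005, and 0.00025.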
To verify the effectiveness of the proposed method, we connected the SSN to two Siamese CD networks, FC-Siam-conc and FC-Siam-diff, to obtain the SSN-Siam-conc and SSN-Siam-diff networks. For the readers' reference, we provide the running time of SSN-Siam-diff under the above hardware conditions and experimental settings: training for 300 epochs takes about 90 s, and prediction for a 290 × 350 image takes about 0.08 s.

Enhancement Effect in Series with SSN
Two SAR datasets were used to verify the enhancement effect in series with the SSN, i.e., the Ottawa (Figure 5) and Sulzberger (Figure 6) datasets. For each network and dataset, we conducted 100 experiments and recorded the accuracy matrix of the best model for each experiment. We present the best results of the 100 experiments in the visualized and quantitative CD results. Furthermore, the superpixel generation results are presented to explore the reasons for the superior performance of our method.

CD Results
Figure 10 shows two groups of results obtained with and without the SSN connected to FC-Siam-conc and FC-Siam-diff for the Ottawa (first row) and Sulzberger (second row) datasets. Columns 1 and 2 compare FC-Siam-diff with SSN-Siam-diff, and columns 3 and 4 compare FC-Siam-conc with SSN-Siam-conc. SSN-Siam-diff and SSN-Siam-conc win by a large margin. In the areas marked by the red box, the change boundary obtained by the network with the SSN is closer to the ground truth, and very small change areas are also detected more accurately. The quantitative results are shown in Tables 3 and 4. After being connected with the SSN in series, every accuracy indicator of the network improved, and the improvement is particularly significant for the Ottawa dataset. These results suggest that our proposed method with the SSN not only has a good boundary-preserving ability, but also improves the detection accuracy of small area changes.
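For readers who wish to reproduce the accuracy indicators reported in the tables, the sketch below computes OA, precision, recall, F1-score, and Kappa from the four entries of a binary confusion matrix. It uses the standard textbook definitions, which we assume coincide with those adopted in this paper; the function name and the example counts are illustrative only.

```python
def cd_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary change detection metrics from confusion matrix counts.

    tp: changed pixels detected as changed
    fp: unchanged pixels detected as changed
    fn: changed pixels detected as unchanged
    tn: unchanged pixels detected as unchanged
    """
    n = tp + fp + fn + tn
    oa = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Expected agreement by chance, used by Cohen's Kappa.
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (oa - pe) / (1 - pe) if pe != 1 else 0.0
    return {"OA": oa, "Precision": precision, "Recall": recall,
            "F1": f1, "Kappa": kappa}


if __name__ == "__main__":
    # Illustrative counts, not taken from the experiments.
    print(cd_metrics(tp=40, fp=10, fn=10, tn=40))
```

Because changed pixels are usually a small minority, OA alone can be misleadingly high; F1-score and Kappa are more sensitive to missed or false changes.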
Remote Sens. 2023, 15, x FOR PEER REVIEW 14 of 26
Figure 11 shows, in the form of boxplots, the distribution of the accuracy indicators over 100 experiments for each network and both datasets. The boxplots for both datasets show that all accuracy indicators of the networks with the SSN are significantly higher than those of the original networks without the SSN. It is worth noting that, over the 100 experiments, the networks without the SSN produce many outliers with very low accuracy, whereas for the networks with the SSN, the outliers are always much higher than the overall detection accuracy. This indicates that incorporating superpixel segmentation has great potential to improve accuracy and makes the detection more stable, since the detection accuracy is always maintained at a high level.
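The notion of an outlier in a boxplot can be made concrete with the usual 1.5 × IQR rule. The sketch below is a minimal illustration under that assumption (the whisker rule used to draw the figure is not stated in the text), and the accuracy values in the example are invented for demonstration.

```python
from statistics import quantiles


def tukey_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" interpolates linearly between order statistics.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


if __name__ == "__main__":
    # Invented per-run accuracies: six stable runs and one failed run.
    runs = [0.95, 0.96, 0.96, 0.97, 0.97, 0.98, 0.60]
    print(tukey_outliers(runs))
```

A run that collapses to very low accuracy falls below the lower fence and is drawn as an isolated point, which is the pattern described above for the networks without the SSN.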

Superpixel Segmentation Results for Bi-Temporal SAR Images
We cropped a 256 × 256 area of the Ottawa dataset to display the superpixel segmentation results from the SSN-Siam-diff network in Figure 12. The first column shows the bi-temporal SAR images, and the middle column shows the corresponding superpixel generation results. The third column shows the results where each pixel value is replaced by the mean value of the superpixel to which the pixel belongs. The Ottawa dataset contains winding coastlines and some narrow rivers, which are challenging for superpixel segmentation. For example, in the areas marked by the red box, there are narrow streams of water or land with complex boundaries. In these areas, it is difficult for superpixel segmentation to adhere perfectly to the boundaries, but the superpixel generation network in our method performs very well. The segmentation results for the two acquisition times demonstrate that our method fully mines the bi-temporal information and generates high-quality superpixels for both images. These high-quality results can be attributed to our proposed end-to-end unified framework for obtaining globally optimal solutions, as well as to the Siamese structure and the adaptive adjustment of superpixels. High-quality superpixel segmentation lays a foundation for boundary optimization and detail preservation in the final change map.
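The mean-replacement visualization described above can be sketched in a few lines. The example assumes that each pixel has already been assigned a single superpixel label (for instance, by taking the most likely superpixel from the soft pixel-superpixel associations, which is our assumption rather than a stated step), and it works on flattened lists for simplicity; the function name is illustrative.

```python
def superpixel_mean_image(pixels, labels):
    """Replace each pixel value with the mean value of its superpixel.

    pixels: flat sequence of intensities.
    labels: flat sequence of superpixel indices, same length as pixels.
    """
    if len(pixels) != len(labels):
        raise ValueError("pixels and labels must have the same length")
    sums, counts = {}, {}
    for value, label in zip(pixels, labels):
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    means = {label: sums[label] / counts[label] for label in sums}
    return [means[label] for label in labels]


if __name__ == "__main__":
    # Five pixels, two superpixels (labels 0 and 1).
    print(superpixel_mean_image([1.0, 3.0, 10.0, 20.0, 30.0], [0, 0, 1, 1, 1]))
```

When the superpixels adhere well to object boundaries, the mean image stays sharp along those boundaries while speckle inside each superpixel is averaged out.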

Transfer Learning Experiments
Deep learning relies heavily on a large amount of training data. In the RSCD field, although Earth observation data have been considerably enriched, labeled CD data remain scarce, especially for SAR. Transfer learning is an important tool for addressing the problem of insufficient training samples. We think that the transfer learning ability, or generalization performance, of a CD network is positively correlated with its ability to detect changes: the better the model is at detecting changes, the better it generalizes to other CD datasets. Research on how to design a model with strong transfer learning ability helps to fully utilize multi-source CD datasets.
Therefore, in this section, we designed a transfer learning experiment to explore the ability of the model to "learn how to detect change information" and, by comparison, to examine whether this ability and the generalization performance are enhanced by our proposed method. The pre-trained models were obtained by training on the Ottawa dataset, still following the unsupervised CD flow described above. These pre-trained models were then applied to other SAR CD datasets, and even to optical CD datasets, that were never seen during training. This is a simple "parameter sharing" type of transfer learning.


Transfer Learning for SAR Dataset
Figure 13 shows the results of applying the pre-trained models to other SAR datasets, and Tables 5-7 provide the quantitative results. For the San Francisco dataset (Figure 13, first row), the results obtained by SSN-Siam-diff are smoother, with less noise and fewer holes than those of FC-Siam-diff. These results show that the proposed method can effectively suppress noise and holes, thus improving the CD accuracy, which indicates that the proposed network can extract more robust deep features. The comparison between SSN-Siam-conc and FC-Siam-conc confirms these conclusions. Among the four networks, only SSN-Siam-conc can detect the narrow change area marked by the red box, which suggests that the "conc" design (Figure 4a) outperforms the "diff" design (Figure 4b); whether this is actually the case is discussed briefly in Section 3.4.2.
The accurate segmentation of farmland change boundaries is a challenge for CD. For the Yellow River Farmland-A dataset (Figure 13, second row), the change boundaries obtained by SSN-Siam-diff are unbroken and continuous compared with those of FC-Siam-diff (for example, in the area marked with the red box); they are closer to the ground truth and contain less noise. The false connectivity between independent change components is also significantly reduced. FC-Siam-conc suffers most from false boundary connectivity, which is probably related to the "conc" design. However, SSN-Siam-conc significantly alleviates this problem and suppresses the noise, and it achieved the best performance in terms of both visual presentation and accuracy indicators.
The results for the Farmland-B dataset (Figure 13, third row) strongly demonstrate the robustness of the proposed method to speckle noise. Furthermore, FC-Siam-diff and SSN-Siam-diff successfully detect the slender change area marked by the red box. SSN-Siam-conc yields a less noisy result than FC-Siam-conc, but neither of them detected the change in the red box. Therefore, it seems difficult to judge whether the "diff" or the "conc" design has more advantages in transfer learning.

Comparison of Generalization Performance between Conc and Diff Models
According to the above experimental results, it seems difficult to judge which design is better, "diff" or "conc". Initially, we found that the "conc" pre-trained models failed to detect changes in some datasets, which caught our attention. Therefore, we exchanged the order of the bi-temporal images, that is, we swapped the inputs of the two branches in Figure 4a. With the swapped inputs, the "conc" model produced completely different results.
FC-Siam-diff and FC-Siam-conc are not connected to the SSN, have a lower computational cost, and are sufficient to illustrate this problem clearly. Therefore, Figure 14 shows the CD results of exchanging the input order of FC-Siam-diff and FC-Siam-conc applied to the Yellow River coastline (first row), Farmland-A (second row), and inland water (third row) datasets. As the results for the coastline dataset show, FC-Siam-diff can effectively detect changes regardless of the input order. In contrast, when the input order follows the image acquisition time, FC-Siam-conc does not work at all, and no valid information is detected (Figure 14d, first row). After switching the order of the inputs, FC-Siam-conc can effectively detect the change (Figure 14c, first row). The results for Farmland-A show the same behavior. When FC-Siam-conc is applied to the inland water dataset, only part of the change is detected in either order. The two different change components shown in Figure 14c,d (third row) correspond, respectively, to the positive and negative changes in the water. These results indicate that the "conc" model can only learn to detect changes consistent with the change in the training data. In the Ottawa dataset, the flood in the image acquired at $t_2$ has receded compared with the image acquired at $t_1$, so the water body shows a negative change. As a result, the "conc" pre-trained models can only detect changes similar to the negative change in water. The "diff" model overcomes this problem. We infer that FC-Siam-diff and SSN-Siam-diff add explicit difference guidance to the model, which makes the model more specific to CD and gives it the ability to detect changes even with few samples. Therefore, from these results, the "diff" model shows better generalization performance.
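The order sensitivity discussed above can be illustrated with a toy version of the two fusion strategies. The sketch assumes that the "diff" design fuses the two branches through the absolute value of their feature difference, as in the original FC-Siam-diff, and that the "conc" design simply concatenates them; the function names and feature values are illustrative and are not taken from the trained networks.

```python
def fuse_conc(f_a, f_b):
    """'conc' fusion: concatenate the two branch features in input order."""
    return list(f_a) + list(f_b)


def fuse_diff(f_a, f_b):
    """'diff' fusion: element-wise absolute difference of the two features."""
    return [abs(a - b) for a, b in zip(f_a, f_b)]


if __name__ == "__main__":
    # Toy features of one location: bright at t1, dark at t2 (a "negative" change).
    f_t1, f_t2 = [0.9, 0.2], [0.1, 0.2]

    # Swapping the inputs leaves the 'diff' feature unchanged ...
    print(fuse_diff(f_t1, f_t2) == fuse_diff(f_t2, f_t1))
    # ... but gives the decoder a different 'conc' feature.
    print(fuse_conc(f_t1, f_t2) == fuse_conc(f_t2, f_t1))
```

Under these assumptions, a decoder fed with the "diff" feature sees the same input for positive and negative changes of equal magnitude, whereas a decoder fed with the "conc" feature must learn each direction of change from examples, which is consistent with the behavior observed in Figure 14.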
When training with few samples, it is important to consider whether the model can detect both positive and negative changes. However, many existing studies have ignored this problem. The methods in many published papers are based on little training data; for example, a small area is often clipped from a large image and labeled as training data. This small area may contain only positive changes or only negative changes, or the samples of the two types of change may be extremely unbalanced. Therefore, a model that is robust under all three of these conditions can be considered a useful, generalizable CD approach.

Transfer Learning for Optical Dataset
We further applied these pre-trained models to optical CD data, which is more challenging because the imaging mechanisms of SAR and optical sensors are completely different. Figure 15 shows the CD results and the superpixels generated by SSN-Siam-diff for the Mexico dataset. In both comparisons, SSN-Siam-diff vs. FC-Siam-diff and SSN-Siam-conc vs. FC-Siam-conc, the networks with the SSN were better at retaining the change details, such as in the areas marked by red circles. These results show that the proposed method exhibits strong generalization ability even when transferred to heterogeneous data, and that its ability to suppress noise and refine boundaries is maintained.
Speckle noise does not exist in optical images, so the superpixels generated by SSN-Siam-diff for the bi-temporal images of the Mexico dataset (Figure 15h-i) are tighter, smoother, and less broken than the superpixels of the SAR images. This makes it easier to observe and analyze the superpixel generation. Figure 15i shows the $t_2$ image with each pixel value replaced by the mean value of the superpixel to which the pixel belongs. As can be seen from Figure 15i,j, the generated superpixels in the changed region fit the change boundary better, such as in the area marked by the red circle, while the superpixel boundaries in the unchanged region are more consistent across the bi-temporal images. In conclusion, the proposed method shows promise for making full use of multi-source heterogeneous data to complete complex CD tasks.
In DBN [32], a pre-task was used to select reliable training samples, and a deep belief network (DBN) was used to detect changes in SAR images. In PCANet [48], a SAR CD algorithm based on PCANet that is more robust to speckle noise was presented. In CNN [53], a CNN based on patch sampling was used for SAR image CD for the first time. In LR-CNN [54], a more advanced method was adopted to select training samples, and a local restricted CNN (LR-CNN) was proposed to detect changes in polarized SAR data. DCNet [55] is a channel-weighting-based deep cascade network that has achieved competitive detection accuracy. SAFNet [56], like our proposed method, is a Siamese CD network with adaptive fusion for bi-temporal SAR images. RUSACD [57] adopted a multi-scale superpixel reconstruction method to generate the DI, then used a clustering algorithm to select training samples, and designed a model based on a convolutional wavelet neural network and a deep convolutional generative adversarial network to detect small area changes in SAR images.
The results of DBN, CNN, DCNet, SAFNet, and RUSACD were taken from the original papers at their reported optimal accuracy. PCANet was implemented using the default optimal parameters provided in the original paper. Since the original LR-CNN considered polarization information, we modified the LR-CNN to make it suitable for single-polarization SAR images.
Figure 16 shows the results of the different methods for the Ottawa dataset. Table 8 lists the quantitative evaluation indicators; the best results are in bold font and the second-best are underlined. We observed that many change pixels were missing for PCANet, LR-CNN, and DCNet. SAFNet and RUSACD obtained competitive results, but their detection accuracy for small area changes is poor, such as in the parts marked with the red circles. It can be seen from the visual results that the proposed SSN-Siam-conc shows the most abundant details, which are closer to the ground truth, and it achieved the highest values of OA, Recall, F1-score, and Kappa. The proposed SSN-Siam-diff also achieved very good performance. For this dataset, DBN obtained the second-best result in terms of evaluation indicators, but its visual map contained more noise than those of the other methods, which may be related to its limited use of neighborhood information. In conclusion, these results suggest that the proposed method can effectively improve the accuracy of change boundary segmentation and suppress speckle noise to a certain extent. Compared with the state-of-the-art methods, it is highly competitive and has considerable potential for exploitation.
Figure 17 shows the results for the Farmland-A dataset, and Table 9 shows the quantitative results. It is worth noting that, for this dataset, the results of SSN-Siam-diff and SSN-Siam-conc were generated in the transfer learning experiment, that is, by the models pre-trained on the Ottawa dataset. For DBN, PCANet, CNN, and LR-CNN, the results contain a lot of speckle noise and false positive pixels. DCNet is effective at suppressing noise but has large areas of false positive detection. In the results of RUSACD, we can observe a lot of false connectivity between different change components. The proposed SSN-Siam-conc and SSN-Siam-diff achieved very good performance. Both networks produce relatively little noise, and the continuity and details of the change boundary are well maintained. In particular, SSN-Siam-conc outperformed all of the other methods and achieved the best results with regard to OA, F1-score, and Kappa. SAFNet also achieved very competitive results and was robust against speckle noise. We infer that this is due to the design of the Siamese network and the appropriate bi-temporal information fusion strategy. In conclusion, our results are the most competitive even though they were obtained in the transfer learning experiment, which suggests that our proposed method significantly enhances noise suppression, boundary preservation, and generalization performance.

Discussion
The experimental results show that our method outperforms several existing advanced CD methods. It not only suppresses speckle noise effectively and refines the change boundary, but also has good generalization ability. Beyond these results, several points deserve further discussion.
First of all, the proposed method combines superpixel segmentation with a deep CD network and is a successful example of combining prior knowledge with deep learning techniques. In addition, it is difficult for existing methods to balance noise suppression and detail preservation, while our proposed method achieves a balance between the two because the parameters are optimized globally.
This study provides a better segmentation strategy for the superpixel generation of multi-temporal images in CD. On the one hand, different branches process the data acquired at different times, which ensures the independence of superpixel generation for multi-temporal data. On the other hand, weight sharing and the task-specific loss function introduce an implicit correlation between the bi-temporal data during superpixel generation. This correlation enables the bi-temporal information to be fully mined and combined, which not only ensures the segmentation consistency of the unchanged area, but also makes the segmentation of the changed area fit the change boundary better.
It is worth noting that the generated superpixels are not used directly in this article; instead, the high-level features produced by the SSN for superpixel generation are fed into the following CD network. The higher the quality of the visible superpixel generation, the more accurate the segmentation information contained in these features, and thus the more the performance of the downstream CD network can be improved.
Lastly, this paper is the first to investigate the CD network's ability to detect changes, which is closely related to its generalization performance. When only a few samples with a single change type (only positive or only negative changes) are available, which is often the case for SAR images, a model with a strong ability to detect changes can still identify both positive and negative changes effectively. To our knowledge, this issue has not been discussed in published papers. Future research could pay more attention to the ability to detect changes when designing CD models, which would promote the full utilization of multi-source CD datasets and support the design of models with a larger capacity for more complex CD tasks.
In conclusion, the proposed method is unsupervised and is easy to apply to other, more complex data for CD, such as fully polarized SAR data, which is worth studying in the future. However, due to a lack of validation data, the performance of the proposed method has not been demonstrated in heterogeneous scattering cases, such as buildings in SAR images. In addition, our study provides some heuristic information for many tasks involving the processing of time series image data.

Conclusions
In this paper, a novel end-to-end unsupervised CD method combining superpixel segmentation and a Siamese CD network was proposed for SAR images. Firstly, the pseudo-training samples were selected using an unsupervised pre-task. Then, under the unified framework, the superpixel segmentation network and the CD network were trained end-to-end to obtain the globally optimal parameters. The superpixel segmentation network generates task-adaptive superpixels and outputs features containing accurate semantic information. The Siamese CD network based on U-net was used to mine multi-scale change information. The design of the Siamese structure and the use of the joint loss function enabled the multi-temporal information to be fully mined and combined to obtain change information. Several public CD datasets were used to verify the effectiveness and robustness of our proposed method. In addition, the transfer learning experiment was designed to explore the generalization performance of the network. The experimental results show that the proposed method performs well in suppressing noise, refining boundaries, and improving the CD accuracy for small area changes. Furthermore, this paper explores the ability of the network to detect changes for the first time, which deserves further attention in future research. It would also be interesting to extend this method to the CD of more complex remote sensing data or sequential data in the future.

Figure 1. The overall framework of the proposed method. The input of the network is the SAR intensity image pairs and the output is the predicted probability. The joint loss function is used for backpropagation, and the binary change map is finally obtained. For the training steps, see Algorithm 1.

Figure 2. The flowchart of unsupervised change detection for SAR images.

Figure 3. Superpixel sampling network (SSN). The SSN consists of a feature extractor and the differentiable SLIC. Input: SAR image; output: the pixel-superpixel associations $Q$ and the high-level features $F_{pix}$ from the feature extractor. The arrows in the black circle illustrate upsampling.
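The pixel-superpixel associations Q produced by a differentiable SLIC can be sketched as a soft assignment of each pixel feature to the superpixel centers, followed by a weighted update of those centers. The code below is a minimal NumPy illustration under stated assumptions, not the network used in this paper: the function names are hypothetical, every pixel is compared with every center (SSN implementations usually restrict the comparison to neighbouring superpixels), and Q is normalised across superpixels so that each row sums to one.

```python
import numpy as np

def soft_associations(features, centers):
    """Soft pixel-superpixel associations.

    features: (N, C) array of per-pixel features.
    centers:  (K, C) array of superpixel centers.
    Returns Q with shape (N, K); each row sums to 1.
    """
    # Squared Euclidean distance between every pixel and every center.
    d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    q = np.exp(logits)
    return q / q.sum(axis=1, keepdims=True)

def update_centers(features, q):
    """New centers as Q-weighted means of the pixel features, shape (K, C)."""
    weights = q.sum(axis=0)[:, None]  # total soft mass per superpixel
    return (q.T @ features) / np.maximum(weights, 1e-12)

# Toy example: four 1-D pixel features and two initial centers.
features = np.array([[0.0], [0.1], [1.0], [0.9]])
centers = np.array([[0.0], [1.0]])
q = soft_associations(features, centers)
new_centers = update_centers(features, q)
hard_labels = q.argmax(axis=1)  # hard superpixel index per pixel
```

Because both steps are differentiable with respect to the features, a loss computed downstream can in principle adjust the superpixels through backpropagation, which is the property the end-to-end framework relies on.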

Figure 4. Two Siamese networks based on the encoder-decoder structure. (a) FC-Siam-conc; (b) FC-Siam-diff. The top subfigure shows the implementation details of the model. Two types of residual blocks (Res-I and Res-II) and decoder modules (Dec-I and Dec-II) are used. $X_e$ and $X_e'$ represent the input features from the last residual block and the output features, respectively, in the encoding stage. $X_d$ and $X_d'$ are defined analogously for the decoding stage. $X_e^{1,2}$ denotes the bi-temporal features extracted from the encoding module by using skip connections. Orange arrows illustrate weight sharing.

Figure 10. Comparison of CD results from networks with or without SSN in the Ottawa dataset (first row) and Sulzberger dataset (second row). (a) FC-Siam-diff; (b) SSN-Siam-diff; (c) FC-Siam-conc; (d) SSN-Siam-conc; and (e) ground truth. The areas marked by red boxes deserve more attention.

Figure 11. Accuracy indicator distribution of 100 experiments. (a,b) show the boxplots of FC-Siam-diff vs. SSN-Siam-diff and FC-Siam-conc vs. SSN-Siam-conc for the Ottawa dataset, respectively. Similarly, (c,d) correspond to the Sulzberger dataset.

Figure 12. Superpixel segmentation results from the SSN-Siam-diff network for the Ottawa dataset. The first row corresponds to the image captured at $t_1$. (a) $t_1$ image; (b) superpixel generation for the $t_1$ image; (c) the image in which each pixel value is replaced by the mean value of the superpixel to which the pixel belongs. Similarly, (d-f) correspond to the results of the image captured at $t_2$. The areas marked by red boxes deserve more attention.
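The superpixel-mean visualization described in this caption (each pixel value replaced by the mean value of its superpixel) can be reproduced from a hard label map with a few lines of NumPy. This is a minimal sketch, assuming a single-channel image and a non-negative integer label map of the same shape; `superpixel_mean_image` is a hypothetical helper name, not a function from the paper.

```python
import numpy as np

def superpixel_mean_image(image, labels):
    """Replace each pixel by the mean intensity of its superpixel.

    image:  2-D array of intensities.
    labels: 2-D array of non-negative integer superpixel indices,
            same shape as `image`.
    """
    flat_labels = labels.ravel()
    flat_values = image.ravel().astype(float)
    sums = np.bincount(flat_labels, weights=flat_values)  # per-superpixel sum
    counts = np.bincount(flat_labels)                     # per-superpixel size
    means = sums / np.maximum(counts, 1)                  # guard unused labels
    return means[flat_labels].reshape(image.shape)

image = np.array([[1.0, 3.0], [10.0, 20.0]])
labels = np.array([[0, 0], [1, 1]])
smoothed = superpixel_mean_image(image, labels)
print(smoothed)  # [[ 2.  2.] [15. 15.]]
```

Averaging within superpixels smooths speckle inside homogeneous regions while keeping the region boundaries defined by the label map.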

Figure 13. Change detection results of the transfer learning experiment for SAR datasets, including the San Francisco (first row), Farmland-A (second row), and Farmland-B (third row) datasets. (a) FC-Siam-diff; (b) SSN-Siam-diff; (c) FC-Siam-conc; (d) SSN-Siam-conc; and (e) ground truth. The areas marked by red boxes deserve more attention.

Figure 14. CD results of swapping the input sequence of bi-temporal images for FC-Siam-diff and FC-Siam-conc. (a,b) correspond to the results of FC-Siam-diff when switching the input sequence. Similarly, (c,d) correspond to the results of FC-Siam-conc. (e) Ground truth. The three datasets are the Yellow River coastline (first row), Farmland-A (second row), and inland water (third row).

Figure 15. Change detection results and superpixel generation of the transfer learning experiment for the Mexico (optical) dataset. (a) FC-Siam-diff; (b) SSN-Siam-diff; (c) FC-Siam-conc; (d) SSN-Siam-conc; (e) ground truth; (f) $t_1$ image; (g) $t_2$ image; (h,i) superpixel generation via SSN-Siam-diff for the $t_1$ and $t_2$ images. (j) The pixel value of the $t_2$ image is replaced by the mean value of the superpixel to which this pixel belongs. The areas marked by red circles deserve more attention.

Table 1. Details of the experimental datasets.

Table 3. Quantitative results on the Ottawa dataset for models with and without SSN.

Table 4. Quantitative results on the Sulzberger dataset for models with and without SSN.

Table 5. Quantitative results of the transfer learning experiment for the San Francisco dataset.

Table 6. Quantitative results of the transfer learning experiment for the Yellow River-Farmland-A dataset.

Table 7. Quantitative results of the transfer learning experiment for the Yellow River-Farmland-B dataset.

Table 8. Quantitative results of different methods used for the Ottawa dataset.

Table 9. Quantitative results of different methods used for the Farmland-A dataset.