Unsupervised SAR Image Change Detection Based on Histogram Fitting Error Minimization and Convolutional Neural Network

Synthetic aperture radar (SAR) image change detection is one of the most important applications in remote sensing. Before performing change detection, the original SAR image is often cropped to extract the region of interest (ROI). However, the size of the ROI often affects the change detection results. Therefore, it is necessary to detect changes using local information. This paper proposes a novel unsupervised change detection framework based on deep learning. The method proceeds as follows: First, we use histogram fitting error minimization (HFEM) to threshold a difference image (DI). Then, the DI is fed into a convolutional neural network (CNN). The proposed method is therefore called HFEM-CNN. We test three CNN architectures within the framework: Unet, PSPNet and a purpose-designed fully convolutional neural network (FCNN). The overall loss function is a weighted average of pixel loss and neighborhood loss, with the weight between them determined by the manually set parameter λ. Unlike other recently proposed methods, HFEM-CNN does not need a fragment removal procedure as post-processing. We conduct experiments on water and building change detection using three datasets. The experiments are divided into two parts: whole data experiments and randomly cropped data experiments. The randomly cropped data experiments perform local change detection using patches cropped from the whole datasets. In the whole data experiments, the performance of the proposed method is close to, and slightly better than, that of traditional methods. In the experiments with randomly cropped data, the average kappa coefficient of our method over 63 patches is more than 3.16% higher than that of the other methods. Experiments also show that the proposed method is suitable for local change detection and robust to randomness and the choice of hyperparameters.

Change detection techniques fall into two categories according to the availability of additional ground reference data: supervised methods and unsupervised methods [13,14]. Supervised methods require ground truth for a large training dataset [15], which is labor-intensive and time-consuming to produce [3,16]. In this paper, we focus on unsupervised change detection methods. Early unsupervised SAR image change detection methods are mainly traditional methods based on probabilistic models, multi-resolution analysis and clustering. Rignot et al. [17] assume that multi-look SAR intensities follow a gamma distribution and use the image differencing method for change detection. Bruzzone et al. [18] use Markov random fields to model the relationships between adjacent pixels. Bovolo et al. [19] utilize stationary wavelet decomposition to extract multi-resolution features. Aiazzi et al. [20] extract change features based on information theory. Zhang et al. [21] proposed a novel Contourlet fusion clustering algorithm for unsupervised change detection.
In the past decade, deep learning has developed at an unprecedented pace. Owing to its powerful feature extraction ability, deep learning has been widely used in SAR image change detection [4,13,22-26]. Because SAR change detection data have few labels, methods based on semi-supervised and unsupervised learning have been widely adopted. In semi-supervised learning, Wang et al. proposed a graph-based knowledge supplement network, which suppresses the adverse effects of noisy samples by adding discriminative information from a labeled dataset [27]. Zhao et al. proposed a semi-supervised SAR image change detection method based on a siamese variational autoencoder [28]. In unsupervised change detection, since there are no labels, how to train a neural network becomes a key issue. At present, there are mainly two types of solutions, as shown in Figure 1a,b.

The first is to use an existing pretrained model whose parameters are used for feature extraction without modification [13]. The pretrained model is trained on large external datasets. Essentially, this approach transfers a model trained on the semantic segmentation task to the change detection task. If the model is trained on an optical image dataset and the data to be tested are SAR images, image style transfer between SAR and optical images is considered [16,29].

The second approach does not use any other dataset. Only two registered SAR images acquired at different times are needed. First, pseudo-labels are generated using traditional methods, such as fuzzy c-means (FCM) clustering. Second, some reliable pseudo-labels are selected to train the neural network. Finally, the original images are input into the model to obtain the change detection results [4,23-25,30,31], as shown in Figure 1b. For example, Qu et al. [31] proposed a dual-domain network (DDNet) using the discrete cosine transform (DCT) for unsupervised SAR image change detection. Gao et al. [25] proposed a siamese adaptive fusion network (SAFNet) for unsupervised change detection. The pseudo-labels of DDNet and SAFNet are generated by the hierarchical FCM algorithm [32]. Each method yields good results [25,31].
However, both approaches have their own shortcomings. With the first, it is difficult to adaptively process various real data because the network parameters are fixed. With the second, the criteria for sample selection must be designed manually. The selected samples are then used to train the model, which is computationally expensive because of the time-consuming training process before testing. Furthermore, nearly all unsupervised methods need image binarization. Saha et al. [13] use the Otsu method for thresholding. Shen et al. [23] and Qu et al. [31] use the FCM method for pre-classification. Tang et al. [33] generate pseudo-labels using the expectation-maximization (EM) algorithm. However, the traditional thresholding methods cannot handle regions with very little change [34].
This paper studies local change detection. We break the image up into small patches to test the performance of the method when only local information is processed. Traditional segmentation methods may produce many false alarms if only locally cropped images are processed: if the original images are cropped into small patches, the final change detection results can differ greatly from the results obtained with the full images. To solve this problem, in our previous work [34] we proposed a novel thresholding method called histogram fitting error minimization (HFEM) for areas with little change. After thresholding with HFEM, conditional random fields (CRFs) are used to model the relationships between neighboring pixels. Finally, small fragments are removed. This fragment removal procedure was called post-processing in our previous work [34]. If the area of a fragment is less than a given value β, it is removed; otherwise, it is retained. This procedure has proven very effective in improving accuracy [31,34,35]. However, β needs to be set manually. If β is set too large, some truly changed areas will be removed. If it is set too small, some larger false alarms will not be removed. Therefore, the post-processing procedure is essentially a supervised manual trial-and-error procedure (MTEP). In the field of image binarization, many excellent adaptive methods have been proposed to replace MTEP [1,36,37]. It is likewise worthwhile to find a processing method that does not rely on manual settings to replace post-processing.

In summary, there are currently two problems. First, existing methods do not perform well for local region change detection. Second, in our previous work, the accuracy improvement largely depended on the fragment removal procedure. To solve these two problems, we propose a novel unsupervised change detection framework called HFEM-CNN. Unlike the two unsupervised deep learning frameworks above, it is an end-to-end deep learning-based unsupervised change detection method, as shown in Figure 1c. This framework does not need to select training samples. Training and testing are carried out at the same time, and the parameters are learned adaptively. Compared to Figure 1b, the three procedures of sample selection, training and testing are combined into one step: multi-objective learning. Compared to our previous work [34], we use a CNN to replace the CRF and post-processing. The CNN-based method does not require a manually set fragment size threshold. Although there are still some hyperparameters, such as the learning rate, momentum and number of iterations, these three are set to the same values as in [38], which suggests that such a setting is satisfactory for different tasks. Furthermore, the change detection results are improved compared to the combination of CRF and the fragment removal procedure. This framework is able to detect changes in cropped regions, and the proposed method outperforms previous methods on the task of change detection for locally cropped regions.

In general, the contributions of this paper can be summarized as follows:

1. We first consider local change detection. Local change detection is a challenge for unsupervised change detection. We find that the proposed method has great advantages in local change detection.

2. This paper proposes a novel unsupervised change detection framework called HFEM-CNN. This framework combines sample selection, training and testing into one step: multi-objective learning. The parameters of the network are learned adaptively. Compared with our previous work, the method proposed in this paper is more automatic. It does not require fragment removal as post-processing to achieve decent results.

3. The experiments are conducted on both whole images and cropped images. The encouraging results demonstrate that the proposed method is effective for the local change detection task.
Note that the specific structure of the CNN is not our particular concern. In this work, we design a simple fully convolutional neural network (FCNN) for the experiments. Its structure differs from that of the well-known FCN [39]; to make the distinction clear, we use FCNN to refer to the fully convolutional network designed in this paper. Furthermore, the classic Unet [40] and PSPNet [41] are also used for unsupervised learning. In this paper, HFEM-FCNN, HFEM-Unet and HFEM-PSPNet are collectively referred to as HFEM-CNNs. The experiments demonstrate that simple neural networks such as FCNN and Unet perform better in this situation. The high-pass filter (HPF) can likewise take many specific forms, such as the difference operator, the Sobel operator and the Laplace operator. This paper uses the difference operator as the HPF because it is simple and practical.
This paper is organized into six sections. Section 2 presents the definition of local change detection and the problem formulation of the proposed method. Section 3 elaborates the proposed method. The experimental results on real multi-temporal SAR images are reported in Section 4, which also contains the experimental design and evaluation criteria. Discussions concerning the proposed method are presented in Section 5. Finally, conclusions are drawn in Section 6.

Local Change Detection
Let us first consider whether change detection relies on local or global information. Figure 2 shows a natural optical image and a remote sensing image. For the natural optical image, patch one and patch two must be related because they both represent a part of the cat: patch one shows the cat's head and patch two shows one of its paws. That is, for ordinary optical images, a deeper neural network is necessary to enlarge the receptive field and obtain sufficient information. For the remote sensing image, patch one and patch two show different buildings some distance apart. Therefore, we make the reasonable assumption that patch one and patch two are uncorrelated.
Equation (1) gives the mathematical model of local change detection. Local change detection assumes that whether a pixel $x$ has changed depends only on its adjacent area $N_x$, and that all other pixels are irrelevant. Based on this locality assumption, we design experiments with cropped regions. The original image is randomly cropped and the cropped patches are then used for change detection. We aim to design an unsupervised change detection algorithm that achieves good accuracy on both whole and cropped images.
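One way to write the locality assumption is sketched below. This is only an illustrative formulation: the symbol $c_x$ (the change label of pixel $x$) is hypothetical notation introduced here, and the exact form of Equation (1) may differ.

```latex
% Illustrative statement of the locality assumption (notation assumed):
% c_x is the change label of pixel x, N_x its adjacent area,
% and X_1, X_2 are the two SAR images.
P\left(c_x \mid X_1, X_2\right) = P\left(c_x \mid N_x\right)
```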

Methods
Let $X_1$ and $X_2$ denote two SAR amplitude images of the same area acquired at different times. The change detection output is a binary map, with 1 representing changed regions and 0 representing unchanged regions.
The framework of the proposed method is shown in Figure 3. In general, the method has three steps:

1. Calculate the difference image and use HFEM for pixel-level segmentation.
2. Input the difference image into the CNN and then apply high-pass filtering to the CNN output to obtain its high-frequency components. The rationale is that, in a good segmentation result, adjacent pixels tend to be assigned the same label.
3. Calculate the weighted sum of the pixel loss and the neighborhood loss. The pixel loss is calculated between the output of the CNN and the pixel-level segmentation result. The neighborhood loss is calculated between the high-frequency component and the all-zero map.

The remainder of this section describes each part of the proposed method.

Difference Image Calculation
The difference map can be generated using the classical change vector analysis method or the log-ratio method. Generally, the log-ratio method is very effective for detecting changes in pixels with low amplitude [1], but it is not very effective for pixels with high amplitude. In a SAR amplitude image, the amplitude of water is the lowest, the amplitude of land is medium and the amplitude of buildings is the highest. In this paper, the log-ratio method is therefore applied to water change detection, and the image differencing method is applied to building change detection. Appendix A explains why we adopted the image differencing approach for building change detection.
In this paper, the log-ratio method includes a normalization factor; its mathematical expression is given in Equation (2), and that of the image differencing method in Equation (3). Here $D_{lr}$ is the difference map obtained with the log-ratio method and $D_d$ is the difference map obtained with image differencing. HFEM requires that the theoretical maximum value of the difference map be 255. Owing to the normalization factor, the theoretical maximum value of $D_{lr}$ is 255 for an 8-bit encoded input image. In this article, unless otherwise specified, $D_d$ and $D_{lr}$ are collectively referred to as the difference image (DI).
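The two difference images can be sketched as follows. This is a minimal illustration, not the paper's Equations (2) and (3): the `+1` offset and the `255 / ln(256)` normalization factor are assumptions chosen so that the log-ratio map reaches a theoretical maximum of 255 for 8-bit inputs, and the function names are hypothetical.

```python
import math

def log_ratio_di(x1, x2):
    """Normalized log-ratio difference image for 8-bit inputs (assumed form)."""
    # Assumed normalization: the largest possible |log ratio| is ln(256/1).
    scale = 255.0 / math.log(256.0)
    return [[scale * abs(math.log((b + 1.0) / (a + 1.0)))
             for a, b in zip(row1, row2)]
            for row1, row2 in zip(x1, x2)]

def differencing_di(x1, x2):
    """Absolute difference image; its maximum is 255 for 8-bit inputs."""
    return [[abs(b - a) for a, b in zip(row1, row2)]
            for row1, row2 in zip(x1, x2)]

# Toy 2 x 2 amplitude images (values are illustrative only).
x1 = [[0, 10], [200, 255]]
x2 = [[255, 10], [100, 0]]

d_lr = log_ratio_di(x1, x2)
d_d = differencing_di(x1, x2)
```

In this toy case the low-amplitude change from 0 to 255 saturates both maps, while the change from 200 to 100 produces a much smaller log-ratio response than differencing response, which is consistent with the remark that the log-ratio method is less effective for high-amplitude pixels.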

The Review of HFEM
Let us first review the HFEM method [34]. HFEM is based on the assumption that the unchanged pixels in the DI follow a half-normal distribution and the changed pixels follow a normal distribution. The rationale for this assumption is explained in Appendix B. Let $z$ denote a pixel value of the DI. The unchanged and changed pixels are described by the class-conditional densities $p(z|\omega_u)$ and $p(z|\omega_c)$, where $p(z|\omega_u)$ is the probability density function of the unchanged pixels, $p(z|\omega_c)$ is the probability density function of the changed pixels, $\sigma_u$ is the standard deviation of the unchanged pixels, $\sigma_c$ is the standard deviation of the changed pixels, $\mu_c$ is the mean of the changed pixels, and $\omega_u$ and $\omega_c$ denote the unchanged and changed classes. Let $p_1(z)$ denote the probability density function under the condition that changes exist in the region, and let $p_2(z)$ denote the probability density function when there is no change in the region.
Let $h(z)$ represent the histogram of the DI. The optimization goal is to minimize the fitting error while satisfying the above two conditions. The threshold $T$ is obtained by solving the resulting optimization model, in which eps represents the tolerance of the optimization. Each parameter in Equations (4)-(7) can be calculated by Equation (9). Readers can refer to [34] for details.
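For reference, the textbook forms of the two densities named in the assumption are given below. These are the standard half-normal and normal densities written with the symbols defined above; they are offered as an assumption about the shape of the model, not as a reproduction of the equations in [34]. Note that in the standard half-normal form, $\sigma_u$ acts as a scale parameter.

```latex
% Standard half-normal density for unchanged pixels (z >= 0):
p(z \mid \omega_u) = \frac{\sqrt{2}}{\sigma_u \sqrt{\pi}}
    \exp\left(-\frac{z^{2}}{2\sigma_u^{2}}\right), \qquad z \ge 0

% Standard normal density for changed pixels:
p(z \mid \omega_c) = \frac{1}{\sqrt{2\pi}\,\sigma_c}
    \exp\left(-\frac{(z-\mu_c)^{2}}{2\sigma_c^{2}}\right)
```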

CNN and Loss Function Design
In this paper, the CNN framework differs from traditional deep learning-based change detection methods. Traditional deep learning-based unsupervised change detection methods contain three steps, namely sample selection, supervised training and testing, as shown in Figure 1b. The method in this paper requires neither sample selection nor supervised training. The network is optimized directly against the thresholding result and the all-zero map, which play roles similar to the pixel energy and neighborhood energy in the CRF model [34]. The total loss function is the weighted sum of the pixel loss $\mathrm{Loss}_1$ and the neighborhood loss $\mathrm{Loss}_2$, with λ as the weight coefficient.

CNN Design
In this work, we test three different CNN structures. First, we design FCNN, a simple fully convolutional network. Second, we test two classic semantic segmentation networks, Unet [40] and PSPNet [41].
There are two primary considerations in the design of the CNN in this paper. First, the CNN takes on the role of filtering and denoising rather than the role of feature extraction that it plays in traditional CNNs. Note that this denoising is different from despeckling. It mainly covers two points: removing small fragments in the thresholding result, and connecting homogeneous changed areas. That is, the CNN has a smoothing effect on the thresholding result, as shown in Figure 4. Second, for change detection using remote sensing images, we want to reduce the influence of the ROI selection scale on the detection results. Therefore, regarding network depth, this paper uses a shallow convolutional neural network with fewer than 10 layers. A deeper neural network would make the results too smooth and blurred. This work does not use downsampling or pooling layers in the network because they reduce the resolution. On the one hand, more high-resolution information should be preserved. On the other hand, the receptive field of the network should be kept small to match the assumption that different regions of the image are uncorrelated.

In summary, the design of FCNN is shown in Figure 5. FCNN contains L + 2 convolution layers: L + 1 convolutional layers of size 3 × 3 and one of size 1 × 1. In this work, we set L = 8. The padding parameter is set to 1 for each convolution layer in order to maintain the image size. The structure of Unet is shown in Figure 6a. Slightly different from the original Unet, the padding parameter is set to 1 in order to maintain the image size before and after each convolution layer. The batchnorm (BN) layer [42] is used after the ReLU layer to accelerate convergence.
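To make the "small receptive field" argument concrete, the receptive field of the FCNN layer stack can be computed with the usual rule for stacked convolutions. The sketch below assumes stride 1 and no dilation (the text only states that no downsampling or pooling is used); the helper name is hypothetical.

```python
def receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1, dilation-1 convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1  # each k x k layer widens the field by k - 1 pixels
    return rf

L = 8
# L + 1 layers of 3 x 3 followed by a single 1 x 1 layer, as described above.
fcnn_kernels = [3] * (L + 1) + [1]

fcnn_layers = len(fcnn_kernels)
fcnn_rf = receptive_field(fcnn_kernels)
```

Under these assumptions each output pixel depends on a 19 × 19 window of the difference image, which is small relative to the 100 × 100 patches used in the cropped experiments.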
The structure of PSPNet is also slightly different from the original one. In this work, the batch size equals 1 because only one difference image is fed into the neural network. However, if we use the original PSPNet, batch normalization does not support the resulting 1 × 1 tensor because BN requires more than one value per channel [42]. Therefore, we slightly modify the pyramid pooling module. The pyramid pooling module in the original paper has four levels with bin sizes of 1 × 1, 2 × 2, 3 × 3 and 6 × 6 [41]. We change these sizes to 2 × 2, 3 × 3, 4 × 4 and 6 × 6, respectively, as shown in Figure 6b (adapted from [41]), which avoids the batch normalization failure. Other than that, no major modifications are made to the original network. ResNet-50 [43] is used as the backbone of PSPNet. Both pretrained and randomly initialized backbones are tested in Section 4.
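The batch normalization constraint can be illustrated with simple counting. In training mode, batch normalization estimates per-channel statistics over batch × height × width values, so a 1 × 1 pooled map from a single image leaves exactly one value per channel. The helper below is a hypothetical illustration of that count, not part of any library.

```python
def values_per_channel(batch_size, bin_size):
    """Number of values batch normalization sees per channel for a square pooled map."""
    return batch_size * bin_size * bin_size

original_bins = [1, 2, 3, 6]   # pyramid levels in the original PSPNet
modified_bins = [2, 3, 4, 6]   # pyramid levels used in this paper
batch_size = 1                 # a single difference image

original_counts = [values_per_channel(batch_size, b) for b in original_bins]
modified_counts = [values_per_channel(batch_size, b) for b in modified_bins]

# Training-mode batch normalization needs more than one value per channel.
original_ok = all(n > 1 for n in original_counts)
modified_ok = all(n > 1 for n in modified_counts)
```

Only the 1 × 1 level of the original module violates the requirement, which is why replacing it is sufficient.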

Loss Function Design
Let $Y = [y_{ijk}] \in \mathbb{R}^{M \times N \times C}$ represent the output of the convolutional neural network, and let $L \in \mathbb{R}^{M \times N}$ represent the segmentation result of the HFEM algorithm, where $C$ is the number of channels. If the classic cross entropy loss is used, the number of output channels equals the number of categories. In this paper, there are only two categories, the changed class and the unchanged class, so $C = 2$. However, binary cross entropy loss (BCELoss) is a loss function designed specifically for two-class problems and is therefore well suited to change detection. If BCELoss is used, the number of output channels is $C = 1$ and each pixel of the output image represents the probability of change. In this case, $Y = [y_{ij}] \in \mathbb{R}^{M \times N}$.
Let $i$ and $j$ denote the row and column coordinates of the image, where $i = 0, 1, \ldots, M-1$ and $j = 0, 1, \ldots, N-1$. Since change detection is a binary classification problem, BCELoss is used in this paper. Before calculating BCELoss, the output $Y$ must be normalized to (0, 1) through a sigmoid layer. The sigmoid function is as follows:

$$p_{ij} = \frac{1}{1 + e^{-y_{ij}}}$$

Let $P = [p_{ij}] \in \mathbb{R}^{M \times N}$ represent the output of the sigmoid function, where the element $p_{ij}$ can be regarded as the probability that the $(i, j)$th pixel has changed. The larger the value, the greater the probability of change. The BCELoss is as follows:

$$\mathrm{Loss}_1 = -\frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \left[ l_{ij} \log p_{ij} + (1 - l_{ij}) \log (1 - p_{ij}) \right]$$

where $l_{ij}$ is the $(i, j)$th element of $L$.

The L1Loss is adopted as the neighborhood loss function. BCELoss is not used here because it presupposes that each pixel of the output image represents the probability of belonging to a certain class. After high-pass filtering, each pixel describes the smoothness of the output image rather than a probability, so BCELoss cannot be used. Clearly, the optimal solution is not the one where $\mathrm{Loss}_2$ is zero, because even the ground truth, after high-pass filtering, is not an all-zero map. Therefore, the gradient must not become too small near $\mathrm{Loss}_2 = 0$, so that the optimization can leave the region where $\mathrm{Loss}_2$ approaches zero. The gradient of the L1 loss has the same magnitude everywhere except at zero, so we choose L1Loss as the neighborhood loss. The L1 loss is also selected for spatial continuity in [38].

Let $H = [h_{ij}]$ represent the output of the high-pass filter, and let $Z = [z_{ij}] \in \mathbb{R}^{M \times N}$ represent an all-zero map, that is, a matrix whose elements are all zero. The L1Loss used as the neighborhood loss is expressed as follows:

$$\mathrm{Loss}_2 = \frac{1}{MN} \sum_{i} \sum_{j} \left| h_{ij} - z_{ij} \right|$$

It is worth noting that the specific form of the neighborhood loss $\mathrm{Loss}_2$ varies slightly depending on the high-pass filter. Correspondingly, the all-zero map also has different sizes.
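The pixel loss described above can be sketched in plain Python as follows. This is an illustrative re-implementation of the standard sigmoid and mean binary cross entropy, not the authors' code; the clamping constant `eps` is an assumption added to avoid `log(0)`, and the toy values are made up.

```python
import math

def sigmoid(y):
    """Map a raw network output to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

def bce_loss(p, l, eps=1e-12):
    """Mean binary cross entropy between probabilities p and 0/1 labels l."""
    total, count = 0.0, 0
    for p_row, l_row in zip(p, l):
        for p_ij, l_ij in zip(p_row, l_row):
            p_ij = min(max(p_ij, eps), 1.0 - eps)  # assumed clamp for stability
            total -= l_ij * math.log(p_ij) + (1 - l_ij) * math.log(1.0 - p_ij)
            count += 1
    return total / count

y = [[2.0, -1.0], [0.0, 3.0]]   # raw CNN outputs (toy values)
l = [[1, 0], [0, 1]]            # HFEM thresholding result (toy values)

p = [[sigmoid(v) for v in row] for row in y]
loss1 = bce_loss(p, l)

# Flipping the labels makes the same prediction much worse.
l_flipped = [[1 - v for v in row] for row in l]
loss1_flipped = bce_loss(p, l_flipped)
```

The loss is small when the probabilities agree with the thresholding result and grows when they disagree, which is the behavior the pixel loss relies on.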

High Pass Filter
In this paper, we use the difference operator as the HPF. The difference operator is described below. The reason why we choose the all-zero map for calculating $\mathrm{Loss}_2$ is also discussed.
For the difference operator, the output of the high-pass filter consists of two parts: the row difference and the column difference. Let $H_x \in \mathbb{R}^{(M-1) \times N}$ represent the row difference image, each element of which is the difference between two vertically adjacent elements of the filter input. The column difference image $H_y$ is defined analogously along the columns. When the difference operator is used as the high-pass filter, the output therefore contains two items, namely $H_x$ and $H_y$. Because the sizes of $H_x$ and $H_y$ differ from that of the output image, the loss in this case is calculated as the sum of the L1 losses of $H_x$ and $H_y$, each against an all-zero map of matching size, as given in Equation (15).

The output of the high-pass filter is compared with the all-zero map to calculate the loss. We now discuss why the all-zero map is chosen as the learning target, using a simple test on the Ottawa dataset. First, we generate the DI using the log-ratio method and threshold it with the original HFEM. The thresholding result and the label are then passed through the difference operator. The inputs and outputs of the filter are shown in Figure 7, where the outputs are displayed as absolute values.
It can be seen from Figure 7b,c that there are many non-zero pixels in the high-pass filtering result of (a). The high-pass filtering result of the label contains only the edges of the changed area, and the number of non-zero pixels is relatively low, as shown in Figure 7e,f. The numbers of non-zero pixels in (b), (c), (e) and (f) are 11922, 10394, 3282 and 1743, respectively. We calculate $\mathrm{Loss}_2$ for the thresholding result and for the label using Equation (15): $\mathrm{Loss}_2 = 0.2206$ for the thresholding result and $\mathrm{Loss}_2 = 0.0497$ for the label. Therefore, we assume that a good change detection result should be relatively smooth; that is, its high-pass filtered output is closer to the all-zero map. Based on this idea, we use an all-zero map as the label for training. Obviously, the result after high-pass filtering cannot be exactly an all-zero map unless there is no change at all. Therefore, both $\mathrm{Loss}_1$ and $\mathrm{Loss}_2$ must be taken into account, which is why we designed the multi-objective learning framework shown in Figure 3.
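The difference operator and the resulting neighborhood loss can be sketched as below. The sketch makes three labeled assumptions: the row difference is taken between vertically adjacent pixels, each term is a mean absolute value (L1 loss against an all-zero map), and the total loss is combined as `loss1 + lam * loss2`. All function names are hypothetical. The final lines check the Ottawa numbers quoted above, assuming the image has 290 rows and 350 columns and that the filtered binary maps only contain values of magnitude 0 or 1.

```python
def row_diff(p):
    """H_x: differences between vertically adjacent pixels, size (M-1) x N."""
    return [[p[i + 1][j] - p[i][j] for j in range(len(p[0]))]
            for i in range(len(p) - 1)]

def col_diff(p):
    """H_y: differences between horizontally adjacent pixels, size M x (N-1)."""
    return [[row[j + 1] - row[j] for j in range(len(row) - 1)] for row in p]

def l1_to_zero(h):
    """Mean absolute value, i.e. the L1 loss against an all-zero map."""
    values = [abs(v) for row in h for v in row]
    return sum(values) / len(values)

def neighborhood_loss(p):
    return l1_to_zero(row_diff(p)) + l1_to_zero(col_diff(p))

def total_loss(loss1, loss2, lam):
    # Assumed combination: pixel loss plus lam times neighborhood loss.
    return loss1 + lam * loss2

# A compact changed block versus a checkerboard of isolated fragments.
smooth = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
noisy = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]
loss2_smooth = neighborhood_loss(smooth)
loss2_noisy = neighborhood_loss(noisy)

# Consistency check with the reported Ottawa counts of non-zero pixels.
M, N = 290, 350

def loss2_from_counts(nonzero_x, nonzero_y):
    return nonzero_x / ((M - 1) * N) + nonzero_y / (M * (N - 1))

loss2_threshold = loss2_from_counts(11922, 10394)
loss2_label = loss2_from_counts(3282, 1743)
```

Rounded to four decimals, `loss2_threshold` and `loss2_label` give 0.2206 and 0.0497, which agrees with the values reported for the thresholding result and the label and supports reading the loss as a sum of two mean absolute differences.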

Experiment
In this section, we conduct experiments to demonstrate the effectiveness of the proposed framework. First, we describe the datasets and the evaluation criteria. Then, the experimental design is specified. The detailed experimental results are presented in the final parts. In these experiments, the weight coefficient λ is set to 1.9 for water change detection and to 1.1 for building change detection. The number of convolution layers L in Figure 5 is set to 8. The difference operator is used as the high-pass filter, as shown in Equation (15). The proposed method and the DDNet method are implemented using PyTorch 1.11 and CUDA 11.3 on a single NVIDIA RTX 3090 GPU. FCMMRF, PCA-kmeans and HFEMCRF are implemented using MATLAB R2022a on an AMD Ryzen 9 5900X CPU. We choose stochastic gradient descent (SGD) as the optimizer. The learning rate is set to 0.1 and the momentum is set to 0.9. The number of iterations is set to 200.
For comparison, we experimented with four comparison methods: DDNet [31], SAFNet [25], FCMMRF [44] and HFEMCRF [34]. All four use the fragment removal procedure as post-processing. The parameter β in the fragment removal procedure is generally set between 20 and 30. β is set to 20 in the public code of SAFNet [25,45] and DDNet [31,46]. In our previous work, β is set to 25 for HFEMCRF [34]. To maintain uniformity, we set β to 20 for all four comparison methods in this experiment. The DDNet method is run using the code published by its original authors. We implemented the other methods ourselves according to the original papers. We test four HFEM-CNNs: HFEM-Unet, HFEM-FCNN, HFEM-PSPNet and HFEM-PSPNet with a pretrained backbone (HFEM-PSPNet-Pre).
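For readers unfamiliar with the fragment removal post-processing used by the comparison methods, a minimal sketch follows. It removes connected changed regions whose area is less than β. The use of 4-connectivity and the function name are assumptions made for illustration; the published implementations may use a different connectivity.

```python
from collections import deque

def remove_fragments(mask, beta=20):
    """Zero out connected changed regions whose area is below beta pixels."""
    rows, cols = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search collects one 4-connected component.
            component, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                i, j = queue.popleft()
                component.append((i, j))
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    inside = 0 <= ni < rows and 0 <= nj < cols
                    if inside and mask[ni][nj] == 1 and not seen[ni][nj]:
                        seen[ni][nj] = True
                        queue.append((ni, nj))
            if len(component) < beta:
                for i, j in component:
                    out[i][j] = 0
    return out

# Toy mask: a 3 x 3 changed block (area 9) and one isolated false alarm.
toy_mask = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
cleaned = remove_fragments(toy_mask, beta=4)
over_cleaned = remove_fragments(toy_mask, beta=20)
```

With `beta=4` only the isolated pixel is removed, whereas `beta=20` also deletes the genuine 3 × 3 block, which illustrates why the choice of β amounts to manual trial and error.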

Dataset Descriptions
We use three real SAR datasets in the experiment: the Bern dataset, the Ottawa dataset and the Tongzhou dataset.
The Bern dataset includes two 301 × 301 SAR images, which were acquired by the European Remote Sensing Satellite 2 (ERS 2) SAR satellite covering the city of Bern. The two images were acquired in April and May 1999. Between these two dates, the waters of the Aare River flooded parts of the cities of Thun and Bern, including Bern Airport. The two images of the Bern dataset are shown in Figure 8. The ground truth of the Bern dataset is from [1].

The Ottawa dataset, shown in Figure 9, was acquired by the RADARSAT SAR sensor over the Ottawa region. The two images (290 × 350) were acquired in May and August 1997, respectively. During this time, the area suffered from flooding. The ground truth of the Ottawa dataset is from [44].

The Tongzhou dataset contains two SAR images acquired by TerraSAR-X. These two images are cropped from two large SAR amplitude images without multi-look processing, which means that the range resolution is 0.9 m and the azimuth resolution is 1.9 m. The size of the cropped image is 510 × 510. Numerous houses were newly built in this area between January 2014 and August 2015, so we use this dataset for building change detection. The ground truth of the Tongzhou dataset is manually marked with reference to optical images from Google Earth. However, the two images of the Tongzhou dataset are single-look images with stronger speckle noise. Therefore, the multi-temporal SAR block-matching 3D (MSAR-BM3D) method [47] is used to remove the speckle noise for the Tongzhou dataset. The original and despeckled images of the Tongzhou dataset are shown in Figure 10.
We did not perform despeckling on the Bern and Ottawa datasets because they are relatively less noisy. To illustrate this, and to evaluate the effect of despeckling, we select homogeneous areas and calculate the equivalent number of looks (ENL) of each dataset. The $T_1$ image of each dataset is used to calculate the ENL. The selected patches of each image are shown in Figure 11a-c. The selected patch of the original Tongzhou image is the same as that of the despeckled image. It can be seen from Table 1 that the Bern and Ottawa datasets have relatively high ENL and do not require additional despeckling. For the Tongzhou dataset, the ENL of the despeckled image increases substantially, as shown in Table 1.

Evaluation Criteria
The confusion matrix is used as the basis for the evaluation criteria. For binary classification problems, the size of the confusion matrix is 2 × 2. The four elements in the confusion matrix are TP, FP, TN and FN. The definitions of these four variables are the same as in [34].
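In this paper, we consider five evaluation criteria: overall accuracy (OA), precision, recall, mIoU and the kappa coefficient (KC). The first four are defined as follows:

$$\mathrm{OA} = \frac{TP + TN}{TP + FP + TN + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{mIoU} = \frac{1}{2} \left( \frac{TP}{TP + FP + FN} + \frac{TN}{TN + FP + FN} \right)$$

The kappa coefficient is defined as follows:

$$\mathrm{KC} = \frac{\mathrm{OA} - \mathrm{PRE}}{1 - \mathrm{PRE}}$$

where

$$\mathrm{PRE} = \frac{(TP + FP)(TP + FN) + (FN + TN)(FP + TN)}{(TP + FP + TN + FN)^2}$$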
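A small helper that evaluates these criteria from the four confusion matrix entries is sketched below. It uses the standard definitions, with mIoU taken as the mean of the IoU of the changed and unchanged classes; that reading of mIoU, the function name and the toy counts are assumptions. The sketch does not guard against empty classes, which would cause a division by zero.

```python
def change_metrics(tp, fp, tn, fn):
    """OA, precision, recall, mIoU and kappa from confusion matrix counts."""
    n = tp + fp + tn + fn
    oa = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    iou_changed = tp / (tp + fp + fn)
    iou_unchanged = tn / (tn + fp + fn)
    miou = 0.5 * (iou_changed + iou_unchanged)
    # Expected agreement by chance, used by the kappa coefficient.
    pre = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (oa - pre) / (1 - pre)
    return {"OA": oa, "precision": precision, "recall": recall,
            "mIoU": miou, "KC": kappa}

# Toy counts for a 1000-pixel patch with few changed pixels.
metrics = change_metrics(tp=80, fp=20, tn=880, fn=20)
```

In this toy case OA is 0.96 while the kappa coefficient is about 0.78, which shows why kappa is the more informative criterion when changed pixels are rare.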

Experimental Design
In this work, we conduct real data experiments to demonstrate the effectiveness of our method. The real data experiments consist of two parts. In the first part, the three whole datasets are used to demonstrate that the proposed method is suitable for whole datasets. The second part is the core of the experiment. We randomly crop the raw images and labels. Each image is cropped into 20 small patches of size 100 × 100. The number of changed pixels in a cropped area is very small. The experiments show that our method is superior to other algorithms in this extreme case. The datasets in this experiment contain a total of 63 image pairs, of which the first three pairs are the complete images and the last 60 pairs are randomly cropped images. We use all 63 pairs to perform the experiment in the second part. The purpose is to examine the algorithm's average performance in various scenarios.
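The cropping protocol can be sketched as follows. The text does not specify how crop positions are drawn, so uniform sampling of the top-left corner with a fixed seed is an assumption, and the function names are hypothetical. The same box would be applied to both images of a pair and to the label.

```python
import random

def random_crops(height, width, patch=100, count=20, seed=0):
    """Draw `count` patch boxes (top, left, bottom, right) inside the image."""
    rng = random.Random(seed)
    boxes = []
    for _ in range(count):
        top = rng.randint(0, height - patch)
        left = rng.randint(0, width - patch)
        boxes.append((top, left, top + patch, left + patch))
    return boxes

def crop(image, box):
    """Cut one box out of an image stored as a list of rows."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]

# Image sizes (rows, columns) as reported in the dataset descriptions.
sizes = {"Bern": (301, 301), "Ottawa": (290, 350), "Tongzhou": (510, 510)}
boxes = {name: random_crops(h, w) for name, (h, w) in sizes.items()}

# Three whole image pairs plus 3 x 20 cropped pairs.
total_pairs = len(sizes) + sum(len(b) for b in boxes.values())

# Example: crop a dummy Bern-sized image with the first Bern box.
dummy = [[0] * 301 for _ in range(301)]
patch = crop(dummy, boxes["Bern"][0])
```

Here `total_pairs` equals 63, matching the number of image pairs used in the second part of the experiment.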
The procedure for the whole data and cropped data experiments is shown in Figure 12. The first row illustrates the experiment on whole datasets and the second row illustrates the experiment on the cropped datasets.

Experiment on Whole Datasets
In experiment 1, the datasets contain numerous changed pixels. The results on the three whole datasets are shown in Figures 13-15. We can see that HFEM-PSPNet does not perform well on the three datasets. PSPNet uses ResNet as the backbone. Because the ResNet network is relatively deep, the denoising effect of the model is too strong and the output image is too smooth. Therefore, relatively shallow networks are better suited for unsupervised SAR change detection; previous research [23,31] also supports this view.

Experiment on Whole and Cropped Datasets
Both whole datasets and cropped datasets are used in experiment 2. The aim of experiment 2 is to test the overall performance of the algorithm in various situations. The total number of image pairs is 63, including three whole datasets and 60 cropped datasets. In order to better evaluate the performance, we first calculate the mean value of each criterion and then give visualized results for seven selected patches. The mean values of the numerical results are shown in Table 5. The kappa coefficients of each dataset are shown in Figure 16. The acronym 'WP' means 'without post-processing', where post-processing refers to the fragment removal procedure explained in Section 1. The proposed method is compared with each of the other methods. As can be seen from Figure 16, HFEM-Unet and HFEM-FCNN are better than DDNet, FCMMRF and HFEMCRF. Even though HFEMCRF greatly improves accuracy through post-processing, the proposed HFEM-Unet is still 3.16% higher than HFEMCRF in kappa coefficient and 2.87% higher in mIoU. If HFEMCRF has no post-processing, the performance of HFEM-CNNs is much stronger than that of HFEMCRF. Therefore, the proposed framework is a good substitute for CRF and the fragment removal procedure.
For a subset of patches, our detection results have high kappa coefficients. However, for some special patches, the kappa coefficient of HFEM-CNNs is very small. In order to visually demonstrate the effect of our method on different patches, we deliberately select seven patches: four with good results and three with poor results. The indices of the four patches with good results are 8, 11, 30 and 43. The indices of the three patches with poor results are 23, 39 and 45. Figure 17 illustrates the selected patches. The blue dots represent the patches with good results and the red dots represent the patches with poor results. We give the visualized results and analysis of these selected patches.
In the subsequent analysis, unless stated otherwise, we use the row number in Figure 18 or Figure 19 as the patch number. For example, patch 1 in Figure 18 refers to the patch shown in the first row of Figure 18.
Figure 18 illustrates the results of the four cropped patches with good results. Each row of Figure 18 shows the data and detection results of one patch. Patch 1 and patch 2 are cropped from the Bern dataset, patch 3 is cropped from the Ottawa dataset and patch 4 is cropped from the Tongzhou dataset. The selected patches include areas without change as well as changed areas. As can be seen from Figure 18, our method performs well on patches with significant changes, such as patch 1 and patch 3. Nearly all methods perform well on patches with a lot of change. However, for patches without change, such as patch 2 and patch 4 in Figure 18, traditional methods do not perform well, whereas the HFEM-CNN method better avoids false alarms.
However, the proposed method can also lead to some unsatisfactory results, as shown in Figure 19, for which we select three patches with poor results. From Figure 19, we can clearly see that the proposed approach leaves out some of the details of the change. Due to the strong denoising ability of the CNN, some tiny changes are removed as noise. This is the reason for the low kappa coefficient on these patches. Nevertheless, our method largely avoids false alarms at the cost of missing some details, and we think it is still a good method. In terms of the overall average kappa coefficient, the proposed method performs better than the other methods.

Discussion
In this section, we first discuss the effect of random initialization. We do not fix the random seed and repeat the experiment 15 times. The hyperparameter λ is set to 2.5. These repeated experiments are intended to illustrate that the proposed method is little affected by randomness. Then, we discuss the influence of the hyperparameter λ, which aims to demonstrate that the proposed method is robust with respect to the selection of λ. The Unet is used as the CNN structure. Finally, we summarize the strengths and weaknesses of the proposed method.

The Effect of Random Initialization
We set the same hyperparameter λ = 2.5 and repeated the experiment 15 times on each of the three datasets to see how much the performance is affected by randomness. The purpose of this experiment is to demonstrate that the performance of our method is reproducible and little affected by randomness. Let R represent the difference between the maximum and minimum values of the kappa coefficient over the repeated experiments. As we can see from Figure 20, R = 0.026 for the Bern dataset, R = 0.023 for the Ottawa dataset and R = 0.021 for the Tongzhou dataset. The values in Tables 2-4 are the results obtained with the random seed set to 2022 for repeatability. Compared with the results in Tables 2-4, the proposed method is relatively stable and little affected by randomness.
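The statistic R is simply the spread between the best and worst kappa coefficient over a set of runs. A minimal sketch is given below; the helper name `kappa_range` and the listed kappa values are hypothetical and are not results from the paper.

```python
def kappa_range(kappas):
    """R: difference between the maximum and minimum kappa over a set of runs."""
    if not kappas:
        raise ValueError("need at least one kappa value")
    return max(kappas) - min(kappas)


# Hypothetical kappa values from repeated runs with unfixed random seeds.
runs = [0.861, 0.872, 0.850, 0.868, 0.859]
r = kappa_range(runs)
print(f"R = {r:.3f}")
```

Because R depends only on the two extreme runs, it is a conservative (worst-case) measure of sensitivity to randomness rather than an average one.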

The Effect of λ
In the experiment, the weight coefficient λ is set to 2.5. If λ is too large, the final output would be very close to the HFEM result, which means that the neighborhood information is not considered. In order to quantitatively discuss the influence of λ, we show the effect of different values of λ on the final output using the three whole datasets, as shown in Figure 21. To confirm that the performance of the algorithm is not greatly affected by randomness, we do not fix the random seed. The performance shown in Tables 2-4 is one of the results of multiple experiments, so it is slightly different from the result in Figure 21, but the difference does not exceed one percent.
It can be seen from Figure 21 that the setting of parameter λ does affect the detection accuracy. Because of this, in the cropped data experiment, we uniformly set λ to 2.5 for all cropped datasets. If λ is too small, the output of the CNN will be an all-zero map. From the point of view of optimization, an all-zero output means that Loss_2 in Equation (10) approaches zero. Because of the small weight of Loss_1, the overall loss is dominated by Loss_2, so the optimization falls into a locally optimal solution. Furthermore, as long as λ is set to about 2 to 3, the basic performance of the algorithm can still be guaranteed. We calculated the difference between the maximum and minimum kappa coefficients of the detection results when λ equals 1.9, 2.1, ..., 3.7. Let R denote this difference. For the Bern dataset, R = 0.054, for the Ottawa dataset, R = 0.014 and for the Tongzhou dataset, R = 0.039, i.e., differences of 5.44%, 1.40% and 3.90%, respectively. Considering the impact of random initialization, the impact of changes in λ is even more negligible. For the Ottawa dataset, the influence of λ on the kappa coefficient is even lower than the influence of randomness, which shows that the influence of λ on the result is completely submerged in random noise. This shows that in such an interval, the method is relatively robust with respect to the selection of parameter λ.
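The trade-off controlled by λ can be illustrated with a toy sketch. Equation (10) is not reproduced in this section, so the code assumes a total loss of the form λ · Loss_1 + Loss_2, with Loss_1 taken as a binary cross-entropy against the HFEM thresholding result and Loss_2 taken as the mean absolute difference between adjacent pixels. These specific forms, the function names and the 4 × 4 example are assumptions for illustration only, not the paper's definitions.

```python
import math


def pixel_loss(prob, target, eps=1e-7):
    """Assumed Loss_1: mean binary cross-entropy against the HFEM threshold map."""
    total, count = 0.0, 0
    for p_row, t_row in zip(prob, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
            total -= t * math.log(p) + (1 - t) * math.log(1 - p)
            count += 1
    return total / count


def neighborhood_loss(prob):
    """Assumed Loss_2: mean absolute difference between adjacent pixels."""
    h, w = len(prob), len(prob[0])
    diffs = []
    for i in range(h):
        for j in range(w):
            if i + 1 < h:
                diffs.append(abs(prob[i + 1][j] - prob[i][j]))
            if j + 1 < w:
                diffs.append(abs(prob[i][j + 1] - prob[i][j]))
    return sum(diffs) / len(diffs) if diffs else 0.0


def total_loss(prob, target, lam=2.5):
    """Assumed overall loss: lam weights the pixel term against the neighborhood term."""
    return lam * pixel_loss(prob, target) + neighborhood_loss(prob)


# Toy HFEM map with a 2 x 2 changed region in the centre.
target = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
flat = [[0.05] * 4 for _ in range(4)]                         # near all-zero output
copy = [[0.95 if t else 0.05 for t in row] for row in target]  # follows the HFEM map

for lam in (0.1, 2.5):
    print(lam, total_loss(flat, target, lam), total_loss(copy, target, lam))
```

Under these assumptions, the near all-zero output has the lower total loss when λ = 0.1, while the output that follows the HFEM map has the lower total loss when λ = 2.5, which mirrors the behaviour described above. The crossover value in this toy example has no relation to the values reported in Figure 21.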
Experiments show that the proposed method is of great help in reducing false alarms. The proposed method has only one main hyperparameter, λ. The above two discussions also show that random initialization has little impact on HFEM-Unet and that, when λ is between 2 and 3, its value has little impact on the final performance of the method. Therefore, HFEM-Unet is a relatively stable method that does not rely too much on hyperparameters. In contrast, our previous work, HFEMCRF, is greatly affected by post-processing. Moreover, even if HFEMCRF uses post-processing, HFEM-CNN still slightly outperforms HFEMCRF on average over all 63 datasets, so the proposed method is superior to HFEMCRF. Compared with the work of other scholars, such as DDNet, FCMMRF and SAFNet, the proposed method shows a great advantage in reducing the false alarm rate due to HFEM. However, the proposed method also has shortcomings: the detection of detailed changes needs to be improved. In order to reduce false alarms, the proposed method ignores some changes in details. As a result, the change detection results of the proposed method are not as good as those of other methods for some images with detailed changes.

Conclusions
This paper proposed the concept of local change detection. Local change detection assumes that whether a pixel has changed depends only on its adjacent area, while the other pixels are irrelevant. Based on this locality assumption, it is possible to implement change detection using only local patches. For better local change detection, we develop a novel change detection framework, called HFEM-CNNs. We tested three different CNN architectures, namely Unet, PSPNet and our own designed FCNN. The experiments were conducted using both whole datasets and cropped datasets. Experiments show that simple shallow convolutional networks such as Unet and FCNN are more suitable for the proposed framework.
The fragment-removal procedure is generally used as a post-processing step for traditional methods. However, this procedure is basically an MTEP process. The performance of the HFEMCRF method decreases rapidly without post-processing. It is therefore meaningful to find a processing method that is not based on manual settings to replace post-processing. The proposed CNN-based framework can solve this problem.
Experiments show that the proposed method performs slightly better than other methods on the whole datasets. This demonstrates that HFEM-CNNs are suitable for normal change detection. Furthermore, the second part of the experiment demonstrates that the proposed method is effective for the local change detection task. We can see from Section 5 that the proposed method is robust with regard to randomness and the choice of hyperparameters.

Figure 1. The three frameworks of the unsupervised change detection method using deep learning. (a) Pretrained model and change vector analysis [13]. (b) Using pseudo label selection for training and then testing [4,23,24,30,31]. (c) The proposed framework using multi-objective optimization.

Figure 2. (a) Ordinary optical image; (b) Remote sensing image.
Let X ∈ R^{2×H×W} represent two single-polarization SAR images acquired at different times. Let x ∈ R^2 represent a certain pixel. N_x represents the area around the pixel x and y ∈ {0, 1} indicates whether the pixel changes, where 0 indicates no change and 1 indicates change.
Finally, perform backpropagation to update the CNN parameters. Compared with other deep learning-based unsupervised methods, our method combines sample selection, training and testing into one step, which greatly reduces the computation time.

Figure 3. The framework of the proposed change detection method.

Figure 4. The performance of the CNN in removing small fragments and smoothing. (a) HFEM thresholding result; (b) the output of the CNN.

Figure 6. The Unet and PSPNet tested in this paper. (a) The Unet structure used in the proposed change detection method. This figure is adapted from [40]. A cyan rectangle in the figure represents the side view of a tensor and the number above the rectangle represents the number of channels. (b) The pyramid pooling module of PSPNet used in the original paper and in this paper. This figure is adapted from [41]. The pyramid pooling module in the original paper is a four-level one with bin sizes of 1 × 1, 2 × 2, 3 × 3 and 6 × 6, respectively [41]. In this paper, we change these sizes to 2 × 2, 3 × 3, 4 × 4 and 6 × 6, respectively.

Figure 7. The thresholding result and the label of the Ottawa dataset, as well as the output of the high pass filter. (a) The thresholding result; (b) The row difference image of (a); (c) The column difference image of (a); (d) The label; (e) The column difference image of (d); (f) The row difference image of (d).

Figure 12. Schematic diagram of the two parts of the real data experiment. The first row shows the experiment using the complete data and the second row shows the experiment using images cropped from the complete data.

Figure 17. The indices of the selected patches for visualization. Seven patches are selected, including four patches with good results and three patches with poor results. The indices of the four patches with good results are 8, 11, 30 and 43, which are represented by blue dots. The indices of the three patches with poor results are 23, 39 and 45, which are represented by red dots.

Table 1. The ENL of each dataset.

Table 2. The numerical results of the Bern dataset.

Table 3. The numerical results of the Ottawa dataset.

Table 4. The numerical results of the Tongzhou dataset.

Table 5. The average numerical results of all 63 datasets.