A Feature Space Constraint-Based Method for Change Detection in Heterogeneous Images

With the development of remote sensing technologies, change detection in heterogeneous images becomes much more necessary and significant. The main difficulty lies in how to make input heterogeneous images comparable so that the changes can be detected. In this paper, we propose an end-to-end heterogeneous change detection method based on the feature space constraint. First, considering that the input heterogeneous images are in two distinct feature spaces, two encoders with the same structure are used to extract features, respectively. A decoder is used to obtain the change map from the extracted features. Then, the Gram matrices, which include the correlations between features, are calculated to represent different feature spaces, respectively. The squared Euclidean distance between Gram matrices, termed as feature space loss, is used to constrain the extracted features. After that, a combined loss function consisting of the binary cross entropy loss and feature space loss is designed for training the model. Finally, the change detection results between heterogeneous images can be obtained when the model is trained well. The proposed method can constrain the features of two heterogeneous images to the same feature space while keeping their unique features so that the comparability between features can be enhanced and better detection results can be achieved. Experiments on two heterogeneous image datasets consisting of optical and SAR images demonstrate the effectiveness and superiority of the proposed method.


Introduction
Change detection (CD) is the process of identifying differences in remote sensing images acquired on the same location, but at different times [1]. It has been widely applied in many fields, such as disaster assessment [2,3], environmental monitoring [4,5], urbanization research [6] and so on.
According to the sources of the utilized images, CD can be divided into homogeneous CD and heterogeneous CD. The images used for homogeneous CD refer to images coming from the sensors with the same or a similar imaging modality, e.g., synthetic aperture radar (SAR) images (SAR-to-SAR) or optical images (optical-to-optical). For heterogeneous CD, the images are acquired by different sensors with different imaging modalities such as SAR and optical images (SAR-to-optical). Different from homogeneous CD, the pixels in heterogeneous images are in different distinct feature spaces [7], and the change map (CM) cannot be obtained by simple linear operations or some homogeneous methods, which is also the main difficulty for heterogeneous CD. Over the past several decades, much attention has been paid to homogeneous CD [8], and many excellent methods have been explored [9][10][11][12][13]. With the increase of different types of satellite sensors, however, CD based on the approximated SAR image to get the final CM. However, during the transfer process, both of these methods will lose some unique features in their original feature spaces (e.g., some geometric structural features of the shape and boundary in optical images will be lost when they are directly transferred from the optical feature space to the SAR feature space), which may cause some missing detections.
In this paper, an end-to-end heterogeneous CD method based on the feature space constraint is proposed. Compared with the domain transfer-based methods mentioned above, the proposed method can constrain the extracted features to the same feature space while keeping the unique features during the training process. The proposed method first uses Gram matrices, which include the correlations between features, to represent the feature spaces of two heterogeneous images, respectively. Then, the squared Euclidean distance between Gram matrices, termed as feature space loss, is calculated to constrain the features. Similar to the style loss in the style transfer, the feature space loss mainly focuses on constraining the correlations of the features, but not including the content of the features. After that, a combined loss function, consisting of feature space loss and binary cross entropy loss, is designed for training the model. In this way, we can keep as many features of two heterogeneous images as possible during the training process and enhance the comparability between features so that we can get better detection results. Furthermore, considering that the input images are in two distinct feature spaces, we use two encoders with the same structure, but non-shared weights to extract features separately. When training the model, to solve the problem that the numbers of changed and unchanged pixels vary greatly in CD datasets [28], the class balancing is used to weight different parts in the loss function.
The rest of this paper is organized as follows. The background of the proposed method will be introduced in Section 2, and Section 3 will introduce the proposed method in detail. The experimental results on two real heterogeneous CD datasets will be analyzed in Section 4. Finally, the conclusion of this paper will be drawn in Section 5.

Homogeneous Transformation
To make input heterogeneous images comparable, homogeneous transformation is a necessary and significant step for heterogeneous CD, and it can be performed at different levels of the feature spaces.
In the low-level feature spaces, Liu et al. [7,29] used the pixel transformation based on the mapping relationship between unchanged pixel pairs to transfer one image from its original feature space (e.g., gray space) to another feature space (e.g., spectral space) so that the changes between heterogeneous images can be detected. The O-PCC [19], P-PCC [18], and cGAN-based [27] methods mentioned above also usethe homogeneous transformation to make input heterogeneous images comparable in the low-level feature spaces. In the high-level feature spaces, Gong et al. [30] proposed an iterative coupled dictionary learning (CDL) model to establish a pair of coupled dictionaries, which can transfer input heterogeneous images to a common high-dimensional feature space. Jiang et al. [31,32] utilized the deep-level features for homogeneous transformation and built a deep homogeneous feature fusion (DHFT) model to detect the changes in heterogeneous images.
The low-level features contain more spatial information, but they cannot accurately describe the semantic content that is abstract at the high level, especially in the regions with massive ground objects and complex scenes [32]. The high-level features contain more semantic information, but their resolution is usually lower and cannot help us locate where changes occurred precisely. In our method, the homogeneous transformation is realized by the proposed feature space constraint. To make full use of the features at different levels, we perform the proposed feature space constraint for all the extracted features of different levels in input heterogeneous images.

Style Transfer
Style transfer is usually considered as a generalized problem of texture synthesis, which is to extract and transfer the texture from the source to the target [33]. It is always used to modify the style of an image while still preserving its content.
As shown in Figure 1, the input for style transfer is a pair of images. The content image p is usually of a nature scene and is responsible for providing content information, while the style image a is usually an art image and is responsible for providing style information. With some style transfer algorithms, it will finally generate an image x with the provided content information and style information at the same time. Among these algorithms, to produce the transferred images, Gatys et al. [34] used a pre-trained CNN to extract features and optimized the model with the following loss function: where p, a, and x are the content image, the style image, and the generated image. The content loss L content ( p, x) is used to minimize the distance of content representation between p and x.
The style loss L style ( a, x) is used to minimize the distance of style representation between a and x. The hyperparameters α and γ are the weighting factors for content and style loss, respectively.

Content image
Style image Generated image Style Transfer In style transfer, to obtain the style representation without including the content information, Gatys et al. [34] used the Gram matrix, which is the correlations between the different filter responses, to represent the feature space of the style image a. It should be noted that the content and style representations of images are not completely independent, which means the distance of the style or content representation between source images and target images cannot be minimized separately. Even so, we can still make a tradeoff between style and content representations and produce the desired images by adjusting the weighting factors α and γ.
Inspired by the main idea of style transfer, in this paper, we design a loss function to constrain the correlations between different feature spaces in the form of Gram matrices. Different levels of feature spaces for input paired heterogeneous images are represented by Gram matrices, and a loss function is designed to constrain the correlations between different feature spaces, which is similar to the style loss in style transfer. By minimizing the Gram matrix-based loss function, the features of input heterogeneous images can be constrained to the feature space with the same style (correlations), while the content information is kept, which means the unique features mentioned above.

Proposed Method
In this section, how the features are encoded (extracted) and decoded from two heterogeneous images will be first introduced. After that, how the feature space constraint is constructed will be described in three parts: how to represent a feature space while keeping the unique features, how to design the loss function for the feature space constraint, and how to build the final combined loss function for training. Finally, we will summarize the proposed method and give the overall schematic as shown in Figure 2.
is the feature space loss among features F X , F Y , and F Z , and it is used to constrain the features to the same feature space. L bce (O, T) is the binary cross entropy between the output and ground-truth, and it is used to train the model to detect the changed pixels in input image pairs. L = L bce + L f eature means L f eature (F X , F Y , F Z ) and L bce (O, T) are jointly used for training the model. For homogeneous CD, Daudt et al. [11] proposed a fully convolutional Siamese network with an autoencoder architecture, which was similar to U-Net [35]. Because the feature spaces of two homogeneous images are very similar, the fully convolutional Siamese network used two encoders with the same structure and shared weights to extract features from the input images X and Y. However, different from homogeneous CD, the images X and Y for heterogeneous CD are in two distinct feature spaces. Therefore, in the encoding part, as shown in Figure 2, we use two encoders with the same structure, but non-shared weights (NSWs) to extract features F X = F l X |1 ≤ l ≤ L and F Y = F l Y |1 ≤ l ≤ L from images X and Y, where F l X , F l Y ∈ R N l ×M l , N l is the number of features at the lth level, and M l is the product of the height and width of features at the lth level. It should be noted that, to simplify the calculation, all the extracted features were vectorized, which means F l X (as well as the others) is a matrix composed of the flattened features at the lth level. For the same reason, in the decoding part, the absolute value of the difference between features F X and F Y cannot be concatenated directly with the features F Z = F l Z |1 ≤ l ≤ L − 1 , which are generated during the decoding process. The l in F l Z does not mean that F l Z is generated by the Decoder-l, but rather that F l Z has the same scale as F l X and F l Y at the lth level. For example, at the fourth level, F 4 X and F 4 Y are generated by Encoder-4, and F 4 Z is obtained by up-sampling the output of Decoder-5. As shown in Figure 2, the L mentioned above is 5, which means the number of feature scales is 5 in the encoders and 4 in the decoders. This is because we do not have Decoder-6 and therefore cannot get F 5 Z at the fifth level.
The details of the encoders and decoders in the proposed network are shown in Table 1. It should be noted that, in practice, Encoder-5 is not adopted, and the output of Encoder-4 is down-sampled twice, then directly passed to Decoder-5. This is because the datasets used in the experiments are relatively small and simple, and the network must be simplified to avoid overfitting during the training process. As shown in Table 1, the output size of 112 × 112 pixels means that the height and width of the output features in Encoder-1 and Decoder-1 are both 112 pixels. The kernel of 3 × 3 × 1 × 16 means that the size of the kernel is 3 × 3 pixels and the number of input channels and output channels is 1 and 16. Table 1. Details of the encoders and decoders in the proposed network.

Block Output Size Kernel
Encoder To detect the changes based on the extracted features F X , F Y , and F Z , we must constrain them to the same feature space, so that they are comparable. However, for two heterogeneous images, they both have their own unique features in their respective feature spaces, which are helpful for the model to detect the changes, and these unique features may be lost if we directly constrain them to the same feature space. For example, the geometric structural features, such as the shape and boundary, in optical images can help locate where the change happened and make the boundary clearer. If we directly transfer the optical images from their original feature space to that of SAR images, these features will be lost during the transfer process. To this end, inspired by the style loss in style transfer, we use the Gram matrices, which calculate the correlations between features, to represent different feature spaces and design a combined loss function based on the Gram matrices. After that, we can constrain the features to the same feature space by minimizing the squared Euclidean distance between Gram matrices in the combined loss function while keeping their own content information. During the training process, the statistical characteristics of features will not be changed, and only the correlations between features are changed, which means as many unique features mentioned above will be kept as possible. In this way, some missing detections will not be caused, and therefore, the CD results can be improved theoretically.

Feature Space Representation
Before constraining the extracted features F X , F Y , and F Z to the same feature space, we must characterize their respective feature spaces in the same way. In the proposed method, the Gram matrix, which was introduced in [36] and includes the correlations between features, is applied to represent the feature space. The Gram matrix for features F X at the lth level is G l X ∈ R N l ×N l : where G l X,ij is the correlation between the ith feature and the jth feature: Similarly, the representations for the feature spaces of F Y and F Z at the lth level are G l Y and G l Z : We can see that the Gram matrix does not include the statistical properties of features such as the mean and variance, but only represents the correlations between features (e.g., the correlation between Features A and B can be regarded as the probability that when Feature A is extracted, Feature B is also extracted). Therefore, how the encoders extract features from two heterogeneous images will not be changed, which means the unique features in their original feature space will not be changed. During the training process, only the correlations between features in F X , F Y , and F Z are constrained, which is very similar to the style representations in style transfer [34].

Feature Space Loss
Inspired by the style loss proposed in [34], we use the squared Euclidean distance between Gram matrices to judge whether two feature spaces of heterogeneous images are constrained to the same feature space. Then, the distance between feature spaces of F X and F Y can be calculated as follows: Similarly, we can get E(F X , F Z ) and E(F Y , F Z ). It should be noted that in the Lth layer, we can only To constrain the features to the same feature space while training the model, we add an optimization term, which is related to the feature space distance, into the loss function. The added term is called feature space loss (FSL), and its definition is as follows: If there is no change between input heterogeneous images, the correlations between features F X , F Y , and F Z should be the same, which means Gram matrices G X , G Y , and G Z should also be the same, and the FSL should be zero theoretically. However, if there are some changes between input images, the FSL should not be 0, then L f eature should not be used to train the model theoretically. It should be noted that some changes happened, but were not labeled in the ground-truth, and some noise in the images can also affect the extracted features, so that the FSL may not be 0. Nevertheless, as the training goes on, the features corresponding to those useless changes will be ignored, and the noise will be suppressed by the encoders, which can be viewed as a few filters. If the FSL is used to train the model regardless of whether changes happened or not, the model will tend not to detect changes at all, which is not conducive to CD. Because of this, the FSL must be combined with some other loss functions for CD to build the final loss function for training the CD models.

Combined Loss Function
The most commonly used loss function in CD is BCELoss, and the BCELoss between output O and target (ground-truth) T is: To solve the imbalance between changed and unchanged pixels in CD datasets, we use the class balancing to weight the BCELoss. The weighted BCELoss is: where N c is the number of changed pixels and N u is the number of unchanged pixels. By weighting the BCELoss, we can expand the positive samples in the training set by w p times, which is consistent with the number of negative samples, when the number of positive samples is much less than that of negative ones. Theoretically, when there is no change (T = 0) in input images, the FSL should be 0, and in contrast (T = 0), FSL should be greater than 0. Therefore, we can define the final combined loss: where α is the weighting factor for FSL, and it is also used to ensure that the orders of the magnitude of BCELoss and FSL are consistent. When training the model, mini-batch gradient descent is often used to update the weights of different modules in the model. However, it is very difficult to find a situation where there is no change at all (T = 0) in the input batch when mini-batch gradient descent is used during the training process. To this end, a threshold γ is introduced to replace the original T, and the combined loss function is adjusted as follows: where p is the proportion of changed pixels in the input batch. The threshold γ is determined in such a way that the frequency proportion of training with L bce + αL f eature in an epoch is basically the same as the proportion of unchanged images in the training set.

Detailed Change Detection Scheme
The schematic of the proposed method is shown in Figure 2. First, two encoders with the same structure, but non-shared weights, are used to extract features F X and F Y from input heterogeneous images X and Y. Next, the absolute value of the difference between the extracted features F X and F Y is fed to the decoder to output a two-channel probability map, during which the features F Z are generated in the decoding part. For example, Decoder-3 will concatenate the absolute value of the difference between F 3 X and F 3 Y with the up-sampled output of Decoder-4 as the input and pass the up-sampled output to Decoder-2. Then, a combined loss function consisting of BCELoss and FSL is designed for training the model. BCELoss is the traditional binary cross entropy loss function between the output of the decoder and the one-hot encoded ground-truth. FSL is the feature space loss between the Gram matrices of features F X , F Y , and F Z . BCELoss will be used throughout the training model, and the use of FSL will be determined by the proportion of changed pixels in the input image pairs. After the model is trained well, the final CM can be obtained by the argmax operation on the channel dimension of the two-channel probability map. Specifically, as shown in Figure 3, the first channel (Channel-0) represents the probabilities that pixels have not changed, while the second channel (Channel-1) represents the probabilities that pixels have changed. In the final CM, 0 represents an unchanged pixel and 1 a changed pixel. Therefore, for pixels in the two channel probability map, we can choose the channel dimension, where the maximum probability is, as the final change detection result. That is the argmax operation on the channel dimension.

Dataset Description
The source images of the datasets that we perform experiments on are from [22]. Considering the distribution of objects and the size of images, as shown in Figure 4, we just select the first and fourth image pair of the four image pairs to make our datasets. It should be noted that the coregistration is a significant step before we detect the changes between images. The first and fourth image pairs are first loosely registered with geographic coordinates and then finely registered by the software ENVI (the environment for visualizing images) with the control points manually selected [22]. To construct the test set, we crop the top-left corner of each image pair to 560 × 560 pixels, and the constructed test set (SAR images, optical images, and ground-truth) is shown in Figure 5a-c. The rest of the two images are divided into the training set by a sliding window whose step size is 56 pixels, and the height and width are both 112 pixels, respectively. The details of the two datasets are shown in Table 2.   The first image pair (1200 × 1271 pixels in size) consists of two images that describe the urban areas in Hebei, China. The panchromatic optical image was captured by QuickBird with a 0.6 m resolution, and it was collected on the Google Earth platform. The SAR image was acquired by GaoFen-3 with HH polarization, spotlight (SL) imaging mode, and 1 m resolution. They were registered and resampled to 5 m resolution. The ground-truth is manually defined by integrating expert knowledge with the high resolution historical images in Google Earth (changes are presented in white) [22]. As for the sensors, Google Earth is a platform for Earth science data access and analysis. It integrates the satellite images, aerial photography images, digital elevation model (DEM), and geographic information systems (GISs) onto a 3D globe and then creates giant, multi-terabyte, and high resolution images of the entire Earth [37]. The GaoFen-3 satellite is a C-band multi-polarized SAR satellite in the China High Resolution Earth Observation System [38]. With the mentioned sliding window, we can get 320 training image pairs with a size of 112 × 112 pixels and 1 test image pair with a size of 560 × 560 pixels. The number of image pairs with no changes entirely is 161, which accounts for 50.3% in the training set. This dataset was created from the first image pair in [22], and to be consistent with the source image pair, we call it Dataset-1.

Dataset-4
The fourth image pair describes the areas near Jiangsu, China. The optical image was a panchromatic image acquired by GaoFen-1 with 2 m resolution. The SAR image was acquired by GaoFen-3 with DH polarization, UFSimaging mode, and 3 m resolution. They were also registered and resampled to 2 m resolution. As for the sensors, GaoFen-1 is a Chinese civil optical remote sensing satellite, which provides panchromatic and multispectral data. The image of GaoFen-1 used in the fourth image pair is the Level-1C product that has been processed by radiation correction [22]. The characteristics of GaoFen-3 were introduced above. Similarly, we can get 376 training samples with a size of 112 × 112 pixels and 1 test sample with a size of 560 × 560 pixels. It should be noted that, as shown in Table 2, 266 samples (70.7%) in the training set are completely unchanged. This is the common problem mentioned above on CD datasets, that is the imbalance between positive (changed) and negative (unchanged) samples. This dataset was created from the fourth image pair in [22], and to be consistent with the source image pair, we call it Dataset-4.

Data Augmentation
As shown in Table 2, the number of images in two training sets is relatively small for training the model, and the size of each image is only 112 × 112 pixels, which makes it easier for the model to overfit the datasets. Therefore, data augmentation is necessary for each input image pair in the training sets.
Before each input training pair is input into the network, as shown in Figure 6, it will be first randomly flipped horizontally and vertically, then the flipped image pair will be randomly rotated by a random angle from 0 degrees to 360 degrees. The probability of random horizontal (vertical) flipping is set to 0.5. This is to ensure that the number of samples that have been flipped horizontally (vertically) is basically the same as the number of samples that have not been flipped horizontally (vertically). The probability of random rotation is set to 0.8, which is to ensure that the ratio for the number of unrotated images, rotated images from 0 degrees to 90 degrees, 90 degrees to 180 degrees, 180 degrees to 270 degrees, and 270 degrees to 360 degrees is 1:1:1:1:1. Theoretically, the operation of flipping can expand the dataset to four times its original size, and the operation of rotating can further help augment the training image pairs.  Figure 6. The augmentation process for each input training pair, where the p is randomly generated between 0 and 1.

Parameter Setting
The proposed method is implemented using the PyTorch framework, and stochastic gradient descent (SGD) with momentum is applied for optimization. During the training process, the mini-batch size, the base learning rate, the momentum, and the weight decay are set to 32, 0.001, 0.9, and 0.005, respectively. To reduce the training time, the weights of each convolutional layer are initialized with the Kaiming algorithm [39]. To speed up the convergence of the model, the cosine annealing schedule is used to dynamically adjust the learning rate. Besides, according to the analysis above in our approach (see Section 3 for details), for Dataset-1, w p , α, and γ are set to 12.0, 1.0 × 10 7 , and 1/40, respectively. For Dataset-4, they are set to 8.0, 5.0 × 10 7 , and 1/35, respectively.

Evaluation Criteria
To assess the CD results more quantitatively and comprehensively, with respect to the changed pixels, we use overall accuracy (OA), precision (Pr), recall (Re), F1 scores (F1), and the Kappa coefficient (Kappa) [40] as the evaluation criteria. They can be calculated as follows: where TP is the number of pixels detected by the model and included in the ground-truth images and FP is the number of pixels detected by the model, but not included in the ground-truth images. TN is the number of pixels not detected by the model and not included in the ground-truth images, and FN is the number of pixels not detected by the model, but included in the ground-truth images [41]. p o represents the percentage of the correct classification, which is equivalent to OA. p e denotes the proportion of expected agreement between the ground-truth and predictions with given class distributions [42]. Pr and Re refer to the positive predictive value and true positive rate, respectively. F1 is the harmonic mean of Pr and Re. OA shows the classification accuracy of the model, and Kappa is used to evaluate the extent to which the classification results outperform random classification results.
Second, to verify the validity of the NSW in encoders and the FSL in the combined loss function, we perform some experiments on Dataset-1 and Dataset-4 by training the model with or without the NSW or FSL. Furthermore, to investigate why the FSL can improve the CD results, we visualize the curves of the FSL and BCELoss when training the model only with the BCELoss, which aims to find the deeper optimization goal of the FSL.
Finally, to investigate the effect of hyperparameters α, γ, and w p on the CD results, some experiments are performed with different values of α, γ, and w p on Dataset-4. Considering the randomness of the experiments and in order to make the results more convincing, each experiment is repeated ten times with different random seeds, and the mean of each criterion is calculated for evaluation. Figure 5 shows the visualized results of all testing methods, and the corresponding quantitative evaluation is presented in Table 3. The best values of each metric on Dataset-1 and Dataset-4 are marked in bold. It should be mentioned that we directly use the results in [22] and crop the top-left with the size of 560 × 560 pixels to make a comparison. Although the proposed method is end-to-end, which means it does not need some preprocessing such as segmentation and classification, it can still outperform the other mentioned methods in terms of OA, Kappa, Pr, and F1 on Dataset-1 and Dataset-4. In terms of OA scores, the proposed method can reach 95.7% on Dataset-1 and 95.2% on Dataset-4, which is higher than 95.1% for CMS-HCC (L2) on Dataset-1 and 93.9% for CMS-HCC (L1) on Dataset-4. Although the OA scores of the CD results for the proposed method and CMS-HCC (L2) on Dataset-1 are very similar, the result of the proposed method is 3.3% higher in Kappa than that of CMS-HCC (L2), which means the proposed method can get better performance when the CD datasets have imbalanced changed and unchanged pixels. Besides, for the F1 scores, the results of the proposed method are nearly 3% and 1% better than other methods on Dataset-1 and on Dataset-4, respectively. Although the O-PCC on Dataset-1 and the CMS-HCC (L3) on Dataset-4 can get the best Re scores (93.3% and 94.8%), they both have many FP pixels, as shown in Figure 5e,h and thus perform very poorly on the Pr scores, which also reduces the F1 scores. According to the results in Table 3, we can see that the CMS-HCC focuses on improving the Pr scores while keeping better Re scores. When compared with the ground-truth in Figure 5c, the CD results of CMS-HCC, as shown in Figure 5f-h, have more TN and FP pixels, which make it able to get better Re scores, but poor Pr scores. Even so, when the Re scores of the proposed method and the CMS-HCC both reach about 80.0% on Dataset-1, the Pr scores of the proposed method can reach 73.5%, while the CMS-HCC can only reach 69.2%. This further demonstrates the superiority of the proposed method. However, compared with the CD results of CMS-HCC, as shown in Figure 5f-h, the boundary of CD results by the proposed method, as shown in Figure 5i, is blurry and discontinuous. This is because the proposed method has some down-sampling and up-sampling modules, which will cause some spatial information to be lost during the feature extraction process. This problem may be solved by some boundary enhancement modules in semantic segmentation [43] and post-processing operations [44]. Table 4 shows the results of training the model with and without the NSW or the FSL on Dataset-1 and Dataset-4. The best values of each metric on Dataset-1 and Dataset-4 are marked in bold. The NSW means the non-shared weights in encoders, and when it is used, the encoders for input paired heterogeneous images will be trained separately. The FSL is the feature space loss in the combined loss function, and whether it is used determines that the feature space constraint is used or not during the training process. In the proposed method, if we use the encoders with shared weights, but not NSW to extract features from two heterogeneous images, the CD results will be very poor. When the NSW is adopted in encoders, as shown in Table 4, all the criteria of the model are greatly improved. Specifically, the OA scores increase from 84.7 to 95.4%, and the Kappa scores increase from 40.1 to 72.0%. The Pr scores increase from 34.0 to 72.1%, and the F1 scores increase from 47.5 to 74.5%. However, the Re scores decrease slightly from 78.8 to 77.3%. This is because the model without the NSW cannot extract compared features from two different feature spaces, and many pixels are regarded as changed; therefore, the number of FN pixels decreases, and the Re scores increase.

Experiments on Different Modules of the Proposed Method
Besides, as shown in Table 4, the CD results can be further improved when the FSL is adopted in the combined loss function. Specifically, on Dataset-1, the Kappa and Re scores are improved by 2%. The F1 and Pr scores are also improved by 1 to 2%. On Dataset-4, the Re scores decrease slightly, while the Pr scores are improved by 4%, which makes the F1 scores increase from 72.1 to 73.6%. In terms of the Kappa and OA scores, the detection results are also improved by 0.8% and 0.5%. Based on the performance of the NSW in encoders and the proposed FSL in the combined loss function, were are convinced that they are both effective for the model and can improve the final CD results. Finally, after visualizing the detection results with error bars, in terms of the standard deviations, on Dataset-1, as shown in Figure 7, we can see that the proposed FSL can also improve the stability of the model, especially when we use the OA, F1, and Kappa scores to assess the model. To sum up, the NSW in encoders makes it possible to detect changes in heterogeneous images with the designed model, and the proposed FSL can further improve the CD results and make the model more stable.   Figure 8, when the optimization goal is to minimize the BCELoss, the BCELoss certainty keeps decreasing as the training process goes on. However, it should be noted that the FSL also continues to decrease during the training process even if the feature space constraint is not added to the optimization goal. According to the results in Table 4, we can see that the model still has the ability of mapping two heterogeneous images to the same feature space so that the changes between them can be detected when the model is trained well only with the BCELoss. Thinking about the results in Table 4 and Figure 8, we can conclude that the feature space constraint is implicitly included in the optimization goals even if the model is trained only with the BCELoss. This means that the underlying optimization goal of BCELoss and FSL is very similar. When the FSL is added into the combined loss function, as shown in Equation (14), the implicit constraint becomes explicit. This will further reduce the distance between two feature spaces and exploit the mapping capability of the convolutional neural network, which can therefore improve the final CD results.

Experiments on Different Hyperparameters of the Proposed Method
In the designed loss function, as shown in Equation (14), the hyperparameter γ is the value of the threshold. It is used to determine whether to use the combined loss function according to the input images. The constant one and hyperparameter α are the weighting factors for BCELoss and FSL in the combined loss function. Tables 5 and 6 show the CD results on Dataset-4 from training with different values of γ and α. According to Equation (14), the greater the value of γ is, the more likely the combined loss function will be used in the whole training process. Similarly, the greater the value of α is, the stricter the feature space constraint between extracted features will become when the combined loss function is used. In other words, the hyperparameters α and γ jointly determine the proportion of the FSL and the BCELoss during the training process. After visualizing the Pr and Re scores of the CD results, as shown in Figures 9 and 10, we can see that the Re scores keep increasing, while the Pr scores keep decreasing when the proportion of FSL becomes higher. For example, when the value of γ increases from 1/55 to 1/15, the Re scores decrease from 75.9 to 66.5%, while the Pr scoress increase from 71.3 to 79.1%. As introduced in Section 3, the FSL is designed to constrain the features to the same feature space. When the proportion of FSL in the combined loss function becomes higher, it will make the model tend not to detect the changes. If so, the number of FN pixels will decrease while the number of TP pixels will increase. Then, the Re scores, as shown in Equation (21), will decrease, while the corresponding Pr scores will increase. It should be noted that when the value of α increases from zero to 1.0 × 10 7 , almost all the metrics of CD results are improved, but the changes of Re and Pr are irregular. This may be due to the use of the mini-batch gradient descent mentioned above.    To achieve the better performance of the model when training with the designed loss function, we must make a tradeoff between the FSL and the BCELoss, which is similar to the content loss and style loss in style transfer.
In addition to the hyperparameters α and γ, we also perform some experiments on the parameter w p . As shown in Equation (11), w p is the weighting factor in the BCELoss and used to solve the imbalance between changed and unchanged pixels on CD datasets. The CD results of training with different values of w p are shown in Table 7, and the visualized Pr and Re scores of the CD results are shown in Figure 11. As the value of w p increases, the Re scores keep increasing, while the Pr scores keep decreasing. Specifically, on Dataset-4, when the value of w p increases from 4.0 to 12.0, the Re scores increase from 67.8 to 77.2%, while the Pr scores decrease from 75.7 to 69.7%. This is because the model will be trained to be more sensitive to the changed pixels when the parameter w p becomes larger. As shown in Equation (11), when the weight of T log O is increased, the number of TP pixels will increase, while the number of FN pixels will decrease. As shown in Equation (21), the Re scores will be improved, which may lead to a decrease in the Pr scores. To sum up, the model can get better CD results when the balance between the Pr and Re is achieved, which can be made by adjusting the parameter w p as shown in Equation (12).

Discussion
As mentioned above, the main difficulty for heterogeneous CD lies in how to compare the extracted features in input heterogeneous images. First of all, before comparing the features, we must extract them from heterogeneous images. However, different from homogeneous CD, the images for heterogeneous CD are in different feature spaces. Therefore, similar to the symmetric structure in BDNN [25], two encoders with the same structure, but NSWs, are used to extract the features in the proposed method. When the model is trained well, two encoders can extract useful features from input heterogeneous images for CD, separately. As shown in Table 4, we can find that when the NSW is applied in the encoders, the CD results can be highly improved.
Secondly, as described in the Introduction, for the domain transfer-based methods such as cGAN [27] and SCCN [16], some unique features may be lost during the transfer process, which may cause some missing detections. To this end, after the features are extracted, the proposed method uses the Gram matrices, which include the correlations between features, to represent different feature spaces. The features will be constrained to the same feature space by minimizing the squared Euclidean distance between Gram matrices, termed as the FSL. Different from some existing methods of heterogeneous CD [14,29,45], we construct the final loss function by combining the traditional BCELoss and the designed FSL. In this way, we can constrain the correlations between features of different feature spaces to the same feature space, which is similar to the style representations in style transfer. Meanwhile, as many of the unique features of input heterogeneous images will be kept as possible during the training process, which is similar to the content representation in style transfer. As shown in Table 4, when the FSL is used together with the BCELoss for optimizing the model, the final CD results can be further improved. In addition, as shown in Figure 7, we can find that the stability of the model can also be improved when the FSL is used in the optimization goal.
Thirdly, there are three hyperparameters, α, γ, and w p , in the proposed method. The weighting factor w p is used in the BCELoss to deal with the imbalance between changed pixels and unchanged pixels in CD datasets. The hyperparameters α and γ in the combined loss function are used to determine the proportions of the BCELoss and the FSL. According to the experimental analysis on these hyperparameters, we can find that the training process, in fact, can be viewed as a game between the FSL and the BCELoss, which is similar to the style loss and content loss in style transfer [34]. When the balance between the FSL and the BCELoss is achieved, the best CD results can be obtained. According to the results of the comparison experiments in Table 3, we can find that our proposed method is superior to other compared methods in terms of OA, Kappa, Pr, and F1 scores. As shown in Figure 5d-h, although the compared methods can get better Re scores, there are so many FP pixels in their CD results that they perform very poorly on the OA, Kappa, and Pr scores, which also reducing the F1 scores.
Finally, as shown in Figure 2, there are some down-sampling and up-sampling modules in the network, which may make some spatial information lost during the feature extraction process. This lost spatial information will make the boundary of objects in the final CM become blurry and discontinuous, as shown in Figure 5i. This problem may be solved by some boundary enhancement modules in semantic segmentation [43] and post-processing operations [44].

Conclusions
In this paper, based on the feature space constraint, an end-to-end CD method for heterogeneous images is proposed. For the domain transfer-based methods of heterogeneous CD, there is a problem that some unique features may be lost during the transfer process. In order to solve the problem, inspired by the style transfer, we design a combined loss function based on the FSL and the BCELoss in the proposed method. It can help the model constrain the features of heterogeneous images to the same feature space while keeping as many unique features as possible. In this way, some missing detections will be decreased, and the CD results can be further improved. Additionally, the stability of the model can also be improved when the combined loss function is used. Experiments on two heterogeneous image datasets consisting of optical and SAR images demonstrate the effectiveness and superiority of the proposed method.
In the future, our research will mainly include two aspects. First, as shown in Figure 5, there is the problem of a blurry and discontinuous segmentation boundary, which may be caused by some lost spatial information during the down-sampling. Therefore, in the future work, we will try to use some other architectures such as DeeplabV3 [46] to replace the U-Net architecture or use some boundary enhancement modules in semantic segmentation [43] to make the boundary clearer. Second, according to the analysis in preparing the datasets, we can find that there is a serious imbalance between changed and unchanged samples. The number of changed samples is so small that we must use some methods to augment the change samples, such as the weighting factor w p in the BCELoss. However, most of these methods cannot actually expand the changed samples. Inspired by some GAN-based methods, we plan to first use GAN to generate some changed samples, then train the model with the generated samples and original datasets. In this way, we can achieve a balance between changed and unchanged samples and maybe improve the CD results.