Change Capsule Network for Optical Remote Sensing Image Change Detection

Abstract: Change detection based on deep learning has made great progress recently, but some challenges remain, such as the small data size of open-labeled datasets, the different viewpoints in image pairs, and the poor similarity measures of feature pairs. To alleviate these problems, this paper presents a novel change capsule network for optical remote sensing image change detection, taking advantage of the fact that a capsule network can better deal with different viewpoints and can achieve satisfactory performance with small training data. First, two identical non-shared-weight capsule networks are designed to extract the vector-based features of image pairs. Second, an unchanged region reconstruction module is adopted to keep the feature space of the unchanged region more consistent. Third, vector cosine and vector difference are utilized to compare the vector-based features in a capsule network efficiently, which can enlarge the separability between changed and unchanged pixels. Finally, a binary change map can be produced by analyzing both the vector cosine and the vector difference. Thanks to the unchanged region reconstruction module and the vector cosine and vector difference module, the extracted feature pairs in a change capsule network are more comparable and separable. Moreover, to test the effectiveness of the proposed change capsule network in dealing with different viewpoints in multi-temporal images, we collect a new change detection dataset from images taken over the Al Udeid Air Base (AUAB) using Google Earth. The results of experiments carried out on the AUAB dataset show that a change capsule network can better deal with different viewpoints and can improve the comparability and separability of feature pairs. Furthermore, a comparison of the experimental results on the AUAB dataset and the SZTAKI AirChange Benchmark Set demonstrates the effectiveness and superiority of the proposed method.


Introduction
Change detection is the process of identifying differences in the state of an object or phenomenon by observing it at different times [1]. As one of the important technologies for remote sensing image analysis, change detection has played an important role in the military and in civilian life, such as military strike effect evaluation [2][3][4], land use [5][6][7][8][9], and natural disaster evaluation [10][11][12][13].
Recently, deep learning (DL) has been widely applied to the field of change detection [14][15][16][17][18][19] thanks to its simple process, strong feature representation ability, and excellent application performance. However, there are still many challenges in change detection. First, DL-based methods usually require a large number of labeled samples to optimize the network. However, the available open-labeled datasets for remote sensing change detection are extremely scarce and predominantly very small compared to those of other remote sensing image-interpretation fields [20]. For example, the Vaihingen dataset, which is widely used in remote sensing image classification [21,22], contains only 33 patches, and each patch is about 1900 × 2500 pixels, so the effective sample size of Vaihingen is about 1.5 × 10^8. The SZTAKI AirChange Benchmark Set [23,24], which is extensively used to evaluate the performance of change detection algorithms [14][15][16]18,25], is composed of 13 aerial image pairs with a size of 952 × 640 pixels, so the effective sample size of SZTAKI is only about 7.9 × 10^6. In comparison, the data size of the change detection dataset is roughly 20 times smaller than that of the remote sensing image-classification dataset. Second, the image pairs or image sequences used for change detection are often obtained from different viewpoints [26][27][28][29]. In other words, it is difficult to capture a scene from similar viewpoints every time in remote sensing change detection. As shown in Figure 1, the buildings were shot at different times. Due to the different viewpoints, the buildings cast different shadows even though the image pair has been registered, which makes the comparison of image pairs more difficult. In order to alleviate the problems caused by different viewpoints, Sakurada et al. [26] designed a dense optical flow-based change detection network. Palazzolo et al.
[30] relied on 3D models to identify scene changes by re-projecting images onto one another. Park et al. [27] presented a novel dual dynamic attention model to distinguish viewpoint differences from semantic changes. Therefore, if an algorithm does not pay attention to viewpoints, the change detection result is affected. Finally, the similarity measurement methods of existing change detection approaches are relatively simple. The study of similarity measurement in change detection has a long history. Traditional similarity measurement methods include image difference, image ratio, and change vector analysis (CVA) [31]. For DL-based methods, similarity measurement also plays an important role in improving model performance, such as the Euclidean distance used in [14], the improved triplet loss function applied in [15], the difference skip connections adopted in [25], and the feature space loss designed in [32]. Similarity measurement is one of the important factors affecting the separability of sample pairs in change detection, and a sufficient and effective comparison of the features between sample pairs is beneficial to change detection performance. To deal with the small training data size in change detection datasets, some scholars chose unsupervised methods [16,33,34]. These methods do not require labeled training samples, but their performance leaves room for improvement. To cope with this problem, Xu et al. [18] took advantage of the capsule network [35] and designed the pseudo-siamese capsule network. A siamese network has two branches that share exactly the same architecture and the same set of weights; a pseudo-siamese network has two identical branches, but the weights of the two branches are not shared. The capsule network uses vectors for feature extraction and dynamic routing technology for feature aggregation.
As shown in much existing literature [35][36][37], a capsule network can use fewer training samples to reach performance comparable to that of traditional convolutional neural networks (CNNs). Moreover, the pseudo-siamese capsule network achieved satisfactory results on small open-labeled remote sensing change detection datasets, which confirmed that a capsule network is very suitable for change detection.
Unfortunately, the pseudo-siamese capsule network still has some shortcomings. First, the pseudo-siamese capsule network [18] did not analyze the experimental results of image pairs obtained from different viewpoints. The vector-based features and the dynamic routing technology in a capsule network help it deal with pose information (e.g., translation, rotation, viewpoint, and scale). In other words, the pseudo-siamese capsule network may alleviate the problem caused by different viewpoints in image pairs to some degree, but this was not investigated thoroughly in its experiments, possibly because the problem of different viewpoints is not obvious in the open datasets. Second, the weights of the two branches in the pseudo-siamese network were not shared in order to maintain the flexibility of the model [38], which may cause feature space offset; therefore, the features extracted by the pseudo-siamese network are not directly comparable. Finally, the extracted features were directly concatenated in the pseudo-siamese capsule network, which led to an insufficient feature comparison.
In order to alleviate the problems mentioned above, we carried out the following works. First, the AUAB dataset, in which the sequence images differ in terms of illumination, season, weather, and viewpoint, was collected from Google Earth. Second, a reconstruction module on the unchanged region was designed. As a regularization method, this module drives the network to maintain feature consistency on the unchanged region, which keeps feature pairs comparable. Finally, in order to make the similarity measurement more efficient, the vector-based features output by the capsule network are compared in terms of both direction and length in the forms of vector cosine and vector difference, which alleviates the insufficient feature comparison of the pseudo-siamese capsule network.
The main contributions of this paper are summarized as follows:
• This paper proposes a novel change capsule network for optical remote sensing image change detection. Compared with other DL-based change detection methods, the proposed change capsule network has good performance and robustness.
• In order to make the extracted feature pairs in a change capsule network more comparable and separable, this paper designs an unchanged region reconstruction module and a vector cosine and vector difference module, respectively.
• The AUAB dataset, which simulates practical applications, is collected to further analyze the viewpoint problem in change detection. Moreover, experiments on the AUAB dataset and the SZTAKI dataset show the effectiveness and robustness of the proposed method.
The rest of this paper is organized as follows. The background of the proposed method is introduced in Section 2, and Section 3 introduces the proposed method in detail. In Section 4, the datasets are described and experimental results are presented to validate the effectiveness of the proposed method. In Section 5, we discuss the results of the proposed method. Finally, conclusions are drawn in Section 6.

Capsule Network
Sabour et al. [35] introduced the idea of a capsule network. Different from the scalar-based features in a CNN, the feature extracted by a capsule network is a vector. The length of the vector represents the probability that the entity exists, and its orientation represents the instantiation parameters. Furthermore, the capsule network replaced the pooling layer used in convolutional networks with dynamic routing technology because the pooling layer may lose some information. As shown in much existing literature [18,[35][36][37], a capsule network can use fewer training samples to reach performance comparable to that of a traditional CNN. Furthermore, a capsule network can better deal with pose information (e.g., translation, rotation, viewpoint, and scale). It is worth noting that the image pairs or image sequences used for change detection are often obtained from different viewpoints. Therefore, a capsule network may alleviate the problem caused by different viewpoints in the field of remote sensing change detection.
Let the input of the capsule layer be a vector $u_i$, and use the matrix $W_{i,j}$ to perform an affine transformation on the input vector:

$$\hat{u}_{i,j} = W_{i,j} u_i. \quad (1)$$

Then, a weighted summation is performed on the outputs of the affine transformation:

$$s_j = \sum_i c_{i,j} \hat{u}_{i,j}, \quad (2)$$

where $c_{i,j}$ is updated using the dynamic routing algorithm. The output vectors in the capsule network are computed using a nonlinear squashing function:

$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \frac{s_j}{\|s_j\|}. \quad (3)$$

It can be seen from Equation (3) that the output of the capsule network is a vector whose length lies in [0, 1). Figure 2 shows the forward propagation of the capsule network. The capsule network uses dynamic routing technology to aggregate shallow-layer capsules into higher-level capsules by calculating the intermediate value $c_{i,j}$. As shown in Table 1, dynamic routing consists of seven steps. In step 1, the affine transformation is applied to the input $u_i$ to obtain $\hat{u}_{i,j}$; in step 2, the variable $b_{i,j}$ is initialized; in step 3, the intermediate value $c_{i,j}$ is computed as the softmax of $b_{i,j}$; in step 4, a weighted summation on $\hat{u}_{i,j}$ is performed using the intermediate value $c_{i,j}$; in step 5, the output of the capsule layer $v_j$ is obtained by applying the nonlinear squashing function to the result of the weighted summation; in step 6, $b_{i,j}$ is updated using the dot product of $\hat{u}_{i,j}$ and $v_j$; and in step 7, the number of remaining iterations is checked: if it reaches zero, the algorithm terminates and outputs $v_j$; otherwise, it jumps back to step 3.
Table 1. Dynamic routing algorithm (with $r$ routing iterations).
1. Apply the affine transformation: $\hat{u}_{i,j} = W_{i,j} u_i$.
2. Initialize $b_{i,j} \leftarrow 0$.
3. Compute $c_{i,j} = \mathrm{softmax}(b_{i,j})$.
4. Compute the weighted sum $s_j = \sum_i c_{i,j} \hat{u}_{i,j}$.
5. Compute $v_j = \mathrm{squash}(s_j)$.
6. Update $b_{i,j} \leftarrow b_{i,j} + \hat{u}_{i,j} \cdot v_j$, where $\hat{u}_{i,j} \cdot v_j$ is the dot product of $\hat{u}_{i,j}$ and $v_j$.
7. Compute $r = r - 1$. If $r > 0$: jump to step 3; else: end.
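The squashing function and the seven routing steps described above can be sketched in NumPy. The capsule counts, vector dimensions, and the three routing iterations below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Non-linear squashing: scales vector lengths into [0, 1) while keeping direction."""
    norm_sq = np.sum(s ** 2, axis=axis, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

def dynamic_routing(u, W, num_iters=3):
    """u: input capsules (num_in, dim_in); W: transforms (num_in, num_out, dim_in, dim_out)."""
    num_in, num_out = W.shape[0], W.shape[1]
    # Step 1: affine transformation u_hat[i, j] = u[i] @ W[i, j]
    u_hat = np.einsum('id,ijde->ije', u, W)           # (num_in, num_out, dim_out)
    # Step 2: initialize the routing logits
    b = np.zeros((num_in, num_out))
    for _ in range(num_iters):                        # step 7: iterate
        # Step 3: coupling coefficients as a softmax over b
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        # Step 4: weighted sum over the input capsules
        s = np.einsum('ij,ije->je', c, u_hat)         # (num_out, dim_out)
        # Step 5: squash to obtain the output capsules
        v = squash(s)
        # Step 6: agreement update b_ij += u_hat_ij . v_j
        b = b + np.einsum('ije,je->ij', u_hat, v)
    return v

rng = np.random.default_rng(0)
v = dynamic_routing(rng.normal(size=(8, 4)), rng.normal(size=(8, 3, 4, 6)))
print(v.shape)  # (3, 6)
```

Every output capsule's length stays below 1, consistent with the squashing function in Equation (3).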
Nowadays, the capsule network has been widely used in image classification [37,39,40], video processing [41,42], image generation [43,44], and change detection [18,33], etc. It is reasonable to explore the characteristics of the capsule network to make it more suitable for change detection.

Change Vector Analysis
The output of the capsule network is vector-based features. Change vector analysis (CVA) [31] is the most widely used method in the field of change detection to analyze vector-based features. The similarity measurement method proposed in this paper is inspired by the CVA. Therefore, we introduce the CVA in this section.
CVA generally includes the following steps. First, some basic preliminary data processing steps, such as geometric registration and radiometric normalization, are needed. Second, algorithms are applied to the image pairs to extract effective features of the images. Third, the change vector is obtained by calculating the difference of the feature pair. Finally, binary change detection is performed based on the length of the change vector, and the direction of the change vector is used to distinguish different kinds of change. In the past few decades, a series of CVA techniques have been developed and explored, including selecting suitable thresholds [45][46][47] and feature domains [48]. A flow diagram of CVA is shown in Figure 3. The framework of CVA is very effective for low-/medium-resolution multitemporal images. For very high spatial resolution (VHR) images, it is necessary to consider the spatial contextual information [49]. Therefore, Saha et al. [34] designed deep change vector analysis (DCVA). DCVA used a pretrained multi-layer CNN [50] to obtain deep features. To make sure that only change-relevant features are retained, a layer-wise feature comparison and selection mechanism was applied to the extracted features. The deep change vector was obtained by concatenating the selected features from different layers of the CNN. The length of the deep change vector represents whether the corresponding pixel changed, and the different kinds of change can be obtained by identifying the direction of the deep change vector.
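The basic binary CVA steps above can be sketched as follows. The feature extraction step is abstracted away (pre-computed feature images are assumed), and the threshold value is an illustrative assumption:

```python
import numpy as np

def cva(features_t1, features_t2, threshold):
    """Minimal change vector analysis on co-registered feature images of shape (H, W, C)."""
    change_vec = features_t2 - features_t1            # per-pixel change vector
    magnitude = np.linalg.norm(change_vec, axis=-1)   # length decides changed / unchanged
    binary_map = magnitude > threshold
    # The direction (here: angle of the first two components) can separate kinds of change.
    direction = np.arctan2(change_vec[..., 1], change_vec[..., 0])
    return binary_map, magnitude, direction

rng = np.random.default_rng(1)
t1 = rng.random((64, 64, 3))
t2 = t1.copy()
t2[20:30, 20:30] += 1.0                               # inject a synthetic 10 x 10 change
binary, mag, _ = cva(t1, t2, threshold=0.5)
print(binary.sum())  # 100
```

Only the injected 10 × 10 block exceeds the threshold, so exactly 100 pixels are flagged as changed.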
It can be seen from the above introduction that the vector-based change detection method mainly considers length and direction. The length and direction are also two important attributes of the capsule in the capsule network. Therefore, fully considering a comparison of the length and direction in the capsule network may be beneficial to improving the performance of change detection.

Proposed Method
In this section, the proposed change detection algorithm based on change capsule network is detailed. The framework is illustrated in Figure 4. First, the features of image pairs are extracted using two identical non-shared weight capsule networks to maintain the flexibility of the model. The shape of vector-based features output by the backbone is W × H × 1 × 16. Each capsule represents the feature of a pixel. Second, the unchanged region reconstruction module is adopted to make the feature space of the unchanged region more consistent. This module takes the features (the shape is W × H × 1 × 16) of image 1 as input to reconstruct the unchanged region in image 2. Third, the vector-based features output by capsule network are compared for both direction and length in the forms of vector cosine and vector difference. The outputs of vector cosine and vector difference are both change probability maps that can be optimized using the ground truth. Finally, a binary change map can be produced by analyzing the result of vector cosine and vector difference.

Capsule Network as Backbone
The backbone used in the change capsule network is modified from SegCaps [51]. SegCaps improved the traditional capsule network by implementing a convolutional capsule layer and a deconvolutional capsule layer. Unlike the original capsule network [35], which only outputs the category of the entire image, SegCaps implements the classification of each pixel of the input image. The structure of SegCaps is similar to that of U-net [52], including an encoder-decoder structure with skip connections. SegCaps and its improved versions have been applied to the fields of image segmentation [51], image generation [43], and change detection [18]. The change capsule network uses SegCaps as the backbone to make full use of semantic and contextual information. The detailed parameter setting of the backbone is illustrated in Figure 5. Let the size of the input image be W × H; then, the shape of the output vector-based features is W × H × 1 × 16.

Unchanged Region Reconstruction Module
A change capsule network was designed based on the pseudo-siamese network [18]. The pseudo-siamese network provides more flexibility than a restricted siamese network does because the weights of the pseudo-siamese network are not shared [38]. However, the unshared weights of the two branches may cause the feature spaces to be inconsistent, which leads to a lack of comparability in the features extracted by the network. Therefore, a reconstruction module on the unchanged region was designed. As a regularization method, this module drives the network to maintain feature consistency on the unchanged region, which improves the comparability between feature pairs. As shown in Figure 6, the vector-based features (of shape W × H × 1 × 16) output by the backbone are reshaped to scalar-based features (of shape W × H × 16). Then, two convolutional layers with a kernel size of 1 × 1 are applied to the scalar-based features to obtain the global feature map (W × H × 3). In order to obtain the unchanged region map and the unchanged region features, a merge mechanism is designed.
Let $G = \{g_{i,j} \mid 1 \le i \le W, 1 \le j \le H\}$ be the ground truth, where $g_{i,j} = 0$ means the corresponding pixel pair is unchanged and $g_{i,j} = 1$ means it is changed, and let $F = \{f(i, j)\}$ denote the global feature map. The unchanged region features $P = \{p(i, j) \mid 1 \le i \le W, 1 \le j \le H\}$ can be obtained as follows:

$$p(i, j) = (1 - g_{i,j}) \cdot f(i, j). \quad (4)$$

Similarly, the unchanged region map $M = \{m(i, j) \mid 1 \le i \le W, 1 \le j \le H\}$ can be obtained as follows:

$$m(i, j) = (1 - g_{i,j}) \cdot x_2(i, j), \quad (5)$$

where $x_2(i, j)$ is the pixel of input image 2 at location $(i, j)$. In Figure 6, a black mask is used to cover the changed region, so the unchanged region map is input image 2 without the black mask.
The unchanged region reconstruction module is optimized using the mean squared error (MSE) loss:

$$L_{mse}(P, M) = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} \| p(i, j) - m(i, j) \|^2. \quad (6)$$

It can be seen from Equation (6) that, as the loss decreases, the unchanged region features become more and more similar to the unchanged region map.
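The masking and MSE optimization can be sketched as follows. The 1 × 1-convolution decoder is abstracted as a given reconstruction, and the array shapes are illustrative assumptions:

```python
import numpy as np

def unchanged_region_targets(reconstruction, image2, gt):
    """Mask out changed pixels (gt == 1) so only the unchanged region is compared.
    reconstruction, image2: (H, W, 3); gt: (H, W) binary ground truth."""
    mask = (1 - gt)[..., None]          # 1 on unchanged pixels, 0 on changed ones
    p = reconstruction * mask           # unchanged region features
    m = image2 * mask                   # unchanged region map (changed area blacked out)
    return p, m

def mse_loss(p, m):
    """Mean squared error over all pixels; changed pixels contribute zero."""
    return np.mean((p - m) ** 2)

rng = np.random.default_rng(2)
recon = rng.random((32, 32, 3))
img2 = rng.random((32, 32, 3))
gt = np.zeros((32, 32), dtype=int)
gt[8:16, 8:16] = 1                      # the changed block is excluded from the loss
p, m = unchanged_region_targets(recon, img2, gt)
loss = mse_loss(p, m)
```

Because both targets are zeroed on the changed block, the loss only pushes the reconstruction toward image 2 on the unchanged region.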

Comparison of Vector-Based Features
It is known that the output of the backbone is a vector whose length is between [0, 1). Let the output vectors of the two branches be $a_{i,j}$ and $b_{i,j}$, respectively, where $(i, j)$ represents the coordinates:

$$0 \le \|a_{i,j}\| < 1, \quad (7)$$
$$0 \le \|b_{i,j}\| < 1. \quad (8)$$

The vector difference between $a_{i,j}$ and $b_{i,j}$ is as follows:

$$d_{i,j} = a_{i,j} - b_{i,j}. \quad (9)$$

Therefore, the length of the difference vector $d_{i,j}$ is as follows:

$$\|d_{i,j}\| = \sqrt{\|a_{i,j}\|^2 + \|b_{i,j}\|^2 - 2\|a_{i,j}\|\|b_{i,j}\|\cos\theta}, \quad (10)$$

where $\theta$ is the angle between the two vectors.
Then, the linear function $f$ is applied to scale $\|d_{i,j}\|$ into [0, 1). Since both vector lengths are less than 1, we have $\|d_{i,j}\| < 2$, so the linear function $f$ is as follows:

$$f(\|d_{i,j}\|) = \frac{\|d_{i,j}\|}{2}. \quad (11)$$

Therefore, the output of the vector difference similarity comparison is $f(\|d_{i,j}\|)$, for which the value range is as follows:

$$0 \le f(\|d_{i,j}\|) < 1. \quad (12)$$

The larger $f(\|d_{i,j}\|)$ is, the more likely the corresponding pixel pair has changed. Therefore, the output of the vector difference similarity comparison can be used to optimize the network parameters.
To analyze the direction of the output vectors in a capsule network, we use the vector cosine. For any two vectors, the cosine value is between [−1, 1]; therefore, $-1 \le \cos\theta \le 1$, where $\theta$ is the angle between $a_{i,j}$ and $b_{i,j}$. We also utilize a linear function $g$ to scale $\cos\theta$ into [0, 1]. The expression is as follows:

$$g(\cos\theta) = \frac{1 - \cos\theta}{2}. \quad (13)$$

Therefore, the output of the vector cosine similarity comparison is $g(\cos\theta)$, for which the value range is as follows:

$$0 \le g(\cos\theta) \le 1. \quad (14)$$

The larger $\theta$ is, the larger $g(\cos\theta)$ is. In other words, the larger the angle between the two vectors, the more likely the corresponding pixel has changed. Therefore, the output of the vector cosine similarity comparison can be used to optimize the network parameters.
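Both comparisons can be sketched together. The specific linear scalings f(x) = x/2 and g(cos θ) = (1 − cos θ)/2 are assumptions chosen to match the stated value ranges and monotonicity (capsule lengths below 1, larger angle giving a larger output), not necessarily the paper's exact expressions:

```python
import numpy as np

def similarity_maps(a, b, eps=1e-8):
    """a, b: capsule feature maps of shape (H, W, D) with vector lengths in [0, 1).
    Returns the scaled difference length f(||d||) and scaled cosine g(cos theta)."""
    d = a - b                                          # per-pixel difference vector
    f_d = np.linalg.norm(d, axis=-1) / 2.0             # ||d|| < 2, so f(||d||) lies in [0, 1)
    cos_theta = np.sum(a * b, axis=-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps)
    g_cos = (1.0 - cos_theta) / 2.0                    # larger angle -> larger value
    return f_d, g_cos

# Orthogonal vectors of length 0.5: same lengths, maximal angular disagreement 90 degrees.
a = np.array([[[0.5, 0.0]]])
b = np.array([[[0.0, 0.5]]])
f_d, g_cos = similarity_maps(a, b)
print(round(float(f_d[0, 0]), 4), round(float(g_cos[0, 0]), 4))  # 0.3536 0.5
```

For identical vectors both outputs are 0 (unchanged); increasing either the length gap or the angle pushes the respective map toward 1 (changed).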
There are four situations when we use vector cosine and vector difference to optimize the network parameters: (1) the angle between the two vectors is large, and the length of the difference vector is large; (2) the angle is small, but the length of the difference vector is large; (3) the angle is large, but the length of the difference vector is small; and (4) the angle is small, and the length of the difference vector is small. Figure 7 shows the four situations, where the unit circle can represent the feature space because the lengths of the output vectors in a capsule network lie in the range [0, 1). It can be seen from the above that the vector cosine and the vector difference are not contradictory during network optimization. During the test time, only a simple threshold (0.5) is needed to obtain the binary results of the vector cosine similarity comparison and the vector difference similarity comparison.
First, we only use the result of the vector cosine similarity comparison:

$$B_{cos}(i, j) = \begin{cases} 1, & g(\cos\theta) > 0.5, \\ 0, & \text{otherwise.} \end{cases} \quad (16)$$

Second, we only use the result of the vector difference similarity comparison:

$$B_{diff}(i, j) = \begin{cases} 1, & f(\|d_{i,j}\|) > 0.5, \\ 0, & \text{otherwise.} \end{cases} \quad (17)$$

Third, the result of the OR gate operation on the vector cosine similarity comparison and the vector difference similarity comparison is regarded as the final binary change map. In other words, if either the output of the vector cosine or the output of the vector difference is changed, the final result is changed. The expression is as follows:

$$B_{or}(i, j) = B_{cos}(i, j) \lor B_{diff}(i, j). \quad (18)$$

Finally, we use the result of the AND gate operation on the vector cosine similarity comparison and the vector difference similarity comparison as the final binary change map. That is, only when both the vector cosine and the vector difference results are changed is the corresponding pixel marked as changed:

$$B_{and}(i, j) = B_{cos}(i, j) \land B_{diff}(i, j). \quad (19)$$
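The four fusion strategies can be sketched as threshold-then-combine operations on the two probability maps. The 0.5 threshold follows the text; the mode names and example maps are illustrative:

```python
import numpy as np

def fuse(f_d, g_cos, threshold=0.5, mode="and"):
    """Fuse the two change probability maps into one binary change map."""
    b_diff = f_d > threshold
    b_cos = g_cos > threshold
    if mode == "cos":
        return b_cos
    if mode == "diff":
        return b_diff
    if mode == "or":                      # changed if either comparison says changed
        return b_cos | b_diff
    return b_cos & b_diff                 # "and": both must agree, filtering noise

f_d = np.array([[0.9, 0.2], [0.7, 0.1]])
g_cos = np.array([[0.8, 0.6], [0.3, 0.1]])
print(fuse(f_d, g_cos, mode="and").sum())  # 1
print(fuse(f_d, g_cos, mode="or").sum())   # 3
```

The OR gate accumulates the noise of both maps (three pixels fire), while the AND gate keeps only the pixel where both comparisons agree, matching the behavior reported in the ablation experiments.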
In this paper, all four fusion methods can obtain satisfactory results, but we use the AND gate operation. We analyze the reason in the experimental section.

Loss Function
The loss function of a change capsule network consists of two parts: the MSE loss used in the unchanged region reconstruction module and the margin-focal loss [18] applied to the similarity comparison. The margin-focal loss takes advantage of focal loss [53] and margin loss [35], which can effectively alleviate the sample imbalance in a capsule network. The margin-focal loss is defined as follows:

$$\mathrm{MFL}(p, y) = \sum_{i,j} \left[ \alpha\, y_{ij} (1 - p_{ij})^{\gamma} \max(0,\, m^{+} - p_{ij})^{2} + (1 - y_{ij})\, p_{ij}^{\gamma} \max(0,\, p_{ij} - m^{-})^{2} \right], \quad (20)$$

where $p_{ij}$ is the output of the vector difference similarity comparison or the vector cosine similarity comparison at spatial location $(i, j)$ and $y_{ij}$ is the label; $\gamma$ is a focusing parameter, $\alpha$ is a balance parameter, and $m^{+}$ and $m^{-}$ are the margins. The final loss function is defined as follows:

$$L(p_f, p_{cos}, p_{diff}, y_f, y_l) = \mathrm{MFL}(p_{cos}, y_l) + \mathrm{MFL}(p_{diff}, y_l) + \beta L_{mse}(p_f, y_f), \quad (21)$$

where $p_{cos}$ is the output of the vector cosine similarity comparison, $p_{diff}$ is the output of the vector difference similarity comparison, $y_l$ is the binary label, $p_f$ is the unchanged region features, $y_f$ is the unchanged region map, and $\beta$ is a balance parameter.
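As a rough sketch only: the exact margin-focal formulation is given in [18]; the combination below (a focal modulating factor applied to margin-loss terms) is an assumed form that uses the stated parameters m+, m−, γ, and α:

```python
import numpy as np

def margin_focal_loss(p, y, m_pos=0.9, m_neg=0.1, gamma=1.0, alpha=0.85):
    """Sketch of a margin-focal-style loss (ASSUMED form; the paper follows [18]).
    p: predicted change probabilities (H, W); y: binary labels (H, W).
    Focal factors (1-p)^gamma / p^gamma down-weight easy pixels; the margin terms
    only penalize positives below m_pos and negatives above m_neg."""
    pos = alpha * y * (1 - p) ** gamma * np.maximum(0.0, m_pos - p) ** 2
    neg = (1 - y) * p ** gamma * np.maximum(0.0, p - m_neg) ** 2
    return np.sum(pos + neg)

# Confident, correct predictions incur no loss; a missed change is penalized.
perfect = margin_focal_loss(np.array([[1.0, 0.0]]), np.array([[1.0, 0.0]]))
missed = margin_focal_loss(np.array([[0.0]]), np.array([[1.0]]))
```

Weighting only the positive term with α keeps the sketch usable with α values above 1 (as in the AUAB setting), whereas an α/(1 − α) split would not.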

Detailed Change Detection Scheme
The change detection scheme in this paper is composed of two stages: training and inference.
• Training: First, we use two identical non-shared-weight capsule networks to extract the vector-based features of image pairs. Second, the features of image one are sent to the unchanged region reconstruction module to reconstruct the unchanged region of image two, which makes the feature space more consistent. Third, the vector-based features output by the capsule network are compared in terms of both direction and length in the forms of vector cosine and vector difference. Finally, we use Equation (21) to optimize the network parameters.

Dataset Description
The experiments were carried out on the AUAB dataset and the SZTAKI dataset. Both AUAB and SZTAKI are optical RGB remote sensing image datasets. It is worth noting that the AUAB dataset is used to perform the ablation study. An ablation study evaluates the performance of an AI system by removing certain components in order to understand the contribution of each component to the overall system; the term originated from an analogy with biology (the removal of components of an organism) and is used particularly in the analysis of artificial neural networks, analogous to ablative brain surgery (Source: Wikipedia, accessed on 1 July 2021). Comparative experiments with other methods were carried out on both datasets.

AUAB Dataset
The AUAB dataset was collected from Google Earth. The dataset contains four registered optical images taken over the Al Udeid Air Base in the years 2002, 2006, 2009, and 2011. The size of each image in the dataset is 1280 × 1280 pixels with a 0.6 m/pixel resolution. The co-registered sequence images are illustrated in Figure 8, and the change maps, which were manually labeled by outsourced annotators and verified by domain experts, are shown in Figure 9. One of the image pairs is used as the test data to evaluate the performance of the model. It can be determined by calculation that the effective training sample size of AUAB is about 4.9 × 10^6, which is at the same level as the data size of SZTAKI. Though the AUAB dataset was collected from multiple time series in the same region, the sequence images differ in terms of illumination, season, weather, and especially viewpoint. Therefore, it is convincing to use this dataset to evaluate the performance of the model.

SZTAKI Dataset
The SZTAKI AirChange Benchmark Set [23,24] is widely used in change detection [14][15][16]18,25]. This dataset contains three sets of labeled image pairs, named SZADA, TISZADOB, and ARCHIVE, containing 7, 5, and 1 image pairs, respectively. The size of each image in the dataset is 952 × 640 pixels with a 1.5 m/pixel resolution. Following the literature [14][15][16]18,25], the top-left 784 × 448 rectangle of each image pair is cropped for testing, and the rest of the region is used for training data construction. For convenience of comparison, SZADA and TISZADOB are treated completely separately as two different datasets to train and test the model in this paper (ARCHIVE is ignored), and the first pair of the SZADA testing dataset and the third pair of the TISZADOB testing dataset are used to evaluate the proposed method. SZADA/1 and TISZADOB/3 are illustrated in Figure 10. The numbers of changed and unchanged pixels in the AUAB dataset and the SZTAKI dataset (SZADA and TISZADOB) are shown in Table 2. Since the datasets were collected in different regions, the ratios of changed to unchanged pixels in training and testing are quite different.

Data Augmentation
As in the literature [14,15,18,25], data augmentation was applied to avoid model overfitting. We introduce the operations on the AUAB dataset and the SZTAKI dataset separately. The proposed method was implemented with [54] and an Nvidia GTX1060 GPU with 6 GB of memory. We used Adam [55] with an initial learning rate of 0.00001 to optimize the network parameters. In Equation (20), m+ = 0.9, m− = 0.1, and γ = 1.0; α is set to around 0.85 for the SZTAKI dataset and around 1.5 for the AUAB dataset, where α can be adjusted according to the dataset. In Equation (21), we set β = 0.5. Kaiming initialization [56] was applied to initialize the convolutional layer parameters. The batch size (the number of training samples in one forward/backward pass) was set to 1 due to the memory limitation of the GPU; the higher the batch size, the more memory is needed. The code is available at https://github.com/xuquanfu/capsule-change-detection (accessed on 1 July 2021).

Evaluation Criterion
To evaluate the performance of the proposed method, we calculated the precision, the recall, the F-measure rate (F-rate), and the Kappa [57] with respect to the changed class, where precision refers to positive predictive value, recall refers to true positive rate, F-measure is the harmonic mean of precision and recall, and Kappa is used to evaluate the extent to which the classification results outperform random classification results. The expressions are as follows.
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$
$$F = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \mathrm{Kappa} = \frac{p_o - p_e}{1 - p_e},$$

where TP is the number of pixels detected by the model and included in the ground-truth images, FP is the number of pixels detected by the model but not included in the ground-truth images, and FN is the number of pixels not detected by the model but included in the ground-truth images [58]. $p_o$ represents the percentage of correct classifications, and $p_e$ denotes the proportion of expected agreement between the ground truth and the predictions with the given class distributions [59].
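These criteria can be computed directly from a binary prediction and the ground truth; the small example arrays below are illustrative:

```python
import numpy as np

def change_metrics(pred, gt):
    """Precision, recall, F-measure, and Kappa for a binary change map."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)                 # detected and truly changed
    fp = np.sum(pred & ~gt)                # detected but actually unchanged
    fn = np.sum(~pred & gt)                # missed changes
    tn = np.sum(~pred & ~gt)               # correctly left unchanged
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_rate = 2 * precision * recall / (precision + recall)
    p_o = (tp + tn) / n                    # observed agreement
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2  # chance agreement
    kappa = (p_o - p_e) / (1 - p_e)
    return precision, recall, f_rate, kappa

pred = np.array([[1, 1, 0], [0, 1, 0]])
gt = np.array([[1, 0, 0], [0, 1, 1]])
p, r, f, k = change_metrics(pred, gt)
print(round(p, 3), round(r, 3))  # 0.667 0.667
```

Here TP = 2, FP = 1, FN = 1, and TN = 2, giving precision = recall = 2/3 and Kappa = 1/3.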

Results
Ablation experiments and comparison experiments are designed to evaluate the effectiveness of the model.

• Ablation experiments: Three experiments are designed on the AUAB dataset. First, the unchanged region reconstruction module is applied to the pseudo-siamese capsule network. Second, we train the pseudo-siamese capsule network with both the vector cosine similarity comparison and the vector difference similarity comparison. Finally, we analyze how to obtain a better binary map from the results of the vector cosine and the vector difference.
• Comparison experiments: We carried out the comparison experiments on the AUAB dataset and the SZTAKI dataset. We compared the proposed algorithm with two other methods: (1) FC-Siam-diff, proposed in [25], and (2) the pseudo-siamese capsule network, presented in [18]. FC-Siam-diff, a fully convolutional siamese-based network for change detection, has achieved satisfactory performance. It effectively reduces the number of parameters by reducing the number of channels in the network, so it is suitable for change detection, where open-source datasets are extremely scarce and the amount of data is small. In contrast to FC-Siam-diff, which is a representative convolution-based change detection method, the pseudo-siamese capsule network is a representative capsule-based change detection method. Moreover, the pseudo-siamese capsule network is the baseline, which can be used to evaluate whether the improvements in this paper are effective.

Ablation Experiments
The effectiveness of the unchanged region reconstruction module. We apply the unchanged region reconstruction module to the pseudo-siamese capsule network [18]. According to the results listed in Table 3, the unchanged region reconstruction module can effectively improve the performance of the model in terms of recall, F-measure, and Kappa. The result of our improvement is 2.5% higher in both F-measure and Kappa than the baseline (the pseudo-siamese capsule network), and 6.5% higher in recall. The improvements in recall, F-measure, and Kappa show that our improved method can effectively reduce the number of changed samples that are incorrectly judged as unchanged by the model. The reason may be that the unchanged region reconstruction module improves the comparability between feature pairs, which promotes the performance of the model.

The effectiveness of the designed similarity comparison method. As shown in Table 4, when we use the vector cosine similarity comparison and the vector difference similarity comparison in the pseudo-siamese network, the results show a significant improvement in terms of recall, F-measure, and Kappa. Especially for recall, the result of our improvement is 7.9% higher than the baseline, which indicates that some changed samples that the baseline cannot distinguish are correctly separated. Therefore, the vector cosine similarity comparison and the vector difference similarity comparison can effectively increase the inter-class difference between sample pairs, which enlarges the separability between the changed pixels and the unchanged pixels.

The fusion of vector cosine and vector difference. As shown in Section 3.3, there are four fusion methods for a change capsule network in inference. First, only the result of the vector cosine similarity comparison is used. Second, only the result of the vector difference similarity comparison is used.
Third, the final binary change map is obtained by the OR gate operation. Finally, the AND gate operation is used to obtain the final binary change map. The results of the four methods of fusion are listed in Table 5 and Figure 11. It can be seen from Table 5 and Figure 11 that both the result of the vector cosine similarity comparison and the result of the vector difference similarity comparison are satisfactory. Precision, F-measure, and Kappa are effectively improved when the AND gate operation is applied on two similarity comparison results. For the OR gate, the obtained result is the worst in terms of F-measure and Kappa, although recall is improved. In fact, both the result of the vector cosine similarity comparison and the result of the vector difference similarity comparison may misjudge due to noise. When we use the OR gate operation, more noise may be accumulated. For the AND gate operation, noise can be partially filtered. Therefore, we use the AND gate operation for fusion to obtain the final result.
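To make the two similarity measures and their fusion concrete, the following sketch computes the vector cosine and the vector difference norm for a pair of capsule output vectors and fuses the two binary decisions with an AND gate. It is a minimal illustration in plain Python; the function names and thresholds are illustrative and do not appear in the paper.

```python
import math

def vector_cosine(u, v):
    """Cosine similarity: compares the *direction* of two capsule vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def vector_difference_norm(u, v):
    """Length of the difference vector: compares the magnitude discrepancy."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def changed_pixel(u, v, cos_thresh=0.8, diff_thresh=0.5):
    """AND-gate fusion: flag a pixel as changed only when BOTH comparisons
    agree, which partially filters noise present in a single measure.
    The thresholds here are illustrative, not taken from the paper."""
    by_cosine = vector_cosine(u, v) < cos_thresh
    by_difference = vector_difference_norm(u, v) > diff_thresh
    return by_cosine and by_difference
```

Replacing `and` with `or` yields the OR-gate variant, which, as the ablation shows, accumulates the noise of both comparison results instead of filtering it.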

Comparison Experiments
AUAB dataset. As shown in Table 6, the proposed method achieves the best recall, F-measure, and Kappa. Compared with FC-Siam-diff, the change capsule network improves greatly in terms of precision, recall, F-measure, and Kappa. The pseudo-siamese capsule network obtains the best precision, but its recall is the lowest, which shows that it cannot distinguish the category of sample pairs well on the AUAB dataset, so some changed samples are not correctly detected. The change capsule network uses both the vector cosine similarity comparison and the vector difference similarity comparison to improve the separability of the sample pairs, which effectively increases recall while maintaining high precision. As shown in Figure 12, the change map obtained by FC-Siam-diff contains a lot of noise, even though most of the changed region is detected. The result obtained by the pseudo-siamese capsule network contains less noise, but some changed samples are not correctly detected; its recall is therefore the lowest among the three methods. For the change capsule network, the change map is smooth, with little noise and few missed detections. In Figure 12, the different viewpoints in the image patches marked with red boxes are obvious. Figure 13 shows patch-based change maps at a suitable scale: FC-Siam-diff produces some false detections, especially in regions where shadows are generated by the different viewpoints, whereas the pseudo-siamese capsule network and the change capsule network handle the different viewpoints better and produce more reliable change maps. This confirms that the capsule network can better deal with pose information and can alleviate the problem caused by different viewpoints in image pairs. In other words, the proposed method can deal with different viewpoints to some extent.

SZTAKI dataset. Table 7 lists the results of different methods on the SZTAKI dataset. Compared with the pseudo-siamese capsule network, the change capsule network obtains better recall, F-measure, and Kappa on both SZADA/1 and TISZADOB/3, which proves once again that the pseudo-siamese capsule network, in which the extracted features are concatenated directly, cannot effectively improve the separability of sample pairs. In the change capsule network, the vector-based features output by the capsule network are compared in both direction and length, in the forms of vector cosine and vector difference, which effectively measures the feature dissimilarity and improves the separability of sample pairs. Moreover, the change capsule network achieves the best F-measure and Kappa on both SZADA/1 and TISZADOB/3, which further confirms the robustness of our method. Figures 14 and 15 show the results of different methods on SZADA/1 and TISZADOB/3, respectively. On SZADA/1, no method performs very well because the changed regions are small and scattered; nevertheless, compared with FC-Siam-diff and the pseudo-siamese capsule network, the proposed method produces a smoother change map. On TISZADOB/3, the change map produced by FC-Siam-diff has many false detections and missed detections. The pseudo-siamese capsule network produces a satisfactory change map, but it still contains some noise compared with the change capsule network, especially in the region marked by the red box. Figure 16 shows patch-based change maps at a suitable scale and confirms that the change capsule network is less affected by noise. Therefore, the method proposed in this paper is effective.
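For reference, the four scores reported throughout these tables can all be derived from a pixel-level confusion matrix. The sketch below (illustrative, plain Python) follows the standard definitions of precision, recall, F-measure, and Cohen's kappa:

```python
def change_detection_scores(tp, fp, fn, tn):
    """Standard scores from pixel counts: true/false positives for the
    changed class and false/true negatives for the unchanged class."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = (tp + tn) / total
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return precision, recall, f_measure, kappa
```

Kappa is the most conservative of the four because a high chance agreement (e.g., when unchanged pixels dominate, as is typical in change detection) lowers the score even when overall accuracy is high.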

Discussion
The experimental results of the ablation and comparison experiments prove that the proposed method can effectively improve the performance of a change detection network. In the ablation experiments, the unchanged region reconstruction module and the vector cosine and vector difference module were applied to the baseline separately. The vector cosine and vector difference module measures the difference between the vector-based features in both length and direction, which effectively filters noise and enlarges the separability between the changed pixels and the unchanged pixels. The unchanged region reconstruction module drives the network to maintain feature consistency in the unchanged region when the image features are extracted. In the comparison experiments, the proposed method obtains better results in terms of recall, F-measure, and Kappa while maintaining high precision compared with the other methods. In other words, the proposed method is particularly suitable for application scenarios in which high recall is required.
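One way such a module can enforce this consistency is to penalize reconstruction error only on unchanged pixels. The sketch below operates on flattened per-pixel values in plain Python; the function name and formulation are an illustrative simplification, not the paper's exact loss:

```python
def unchanged_region_reconstruction_loss(reconstruction, reference, change_mask):
    """Mean squared error restricted to unchanged pixels (mask value 0),
    so the network is pushed toward feature consistency only where no
    change occurred. All inputs are flattened per-pixel sequences."""
    errors = [(r - t) ** 2
              for r, t, m in zip(reconstruction, reference, change_mask)
              if m == 0]
    return sum(errors) / len(errors) if errors else 0.0
```

Because changed pixels (mask value 1) contribute nothing to the loss, the network is free to produce divergent features there while staying consistent elsewhere.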
Although the proposed method achieves satisfactory change detection results, the inference time and the number of model parameters still need improvement. Compared with FC-Siam-diff, the change capsule network is time-consuming and has many parameters: it has about 2.8 × 10⁶ trainable parameters and takes about 2 seconds to infer an image pair of size 784 × 448, whereas FC-Siam-diff has about 1.3 × 10⁶ trainable parameters and an inference time under 0.1 seconds. Therefore, reducing the number of model parameters and the inference time would make the proposed change detection method more widely applicable.

Conclusions
This paper presented a novel change capsule network in which the extracted feature pairs have better comparability and separability for optical remote sensing image change detection. On the one hand, the unchanged region reconstruction module is designed to improve the comparability between feature pairs extracted by the capsule network. On the other hand, vector cosine and vector difference are adopted to compare the vector-based features in the capsule network efficiently, which enlarges the separability between the changed pixels and the unchanged pixels. Moreover, the change capsule network takes advantage of the capsule network's ability to better deal with different viewpoints. We carried out experiments on the AUAB dataset and the SZTAKI dataset. The results of the ablation and comparison experiments showed that the change capsule network can better deal with different viewpoints and can improve the comparability and separability of feature pairs. Therefore, the method designed in this paper is effective.
Author Contributions: Q.X. proposed the algorithm and performed the experiments. K.C. gave insightful suggestions for the proposed algorithm. X.S. and G.Z. provided important suggestions for improving the manuscript. All authors read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement: The data presented in this study are available on request from the authors.