Change Detection in Remote Sensing Images Based on Image Mapping and a Deep Capsule Network

Homogeneous image change detection has been well studied, and many methods have been proposed. However, change detection between heterogeneous images is challenging because the images lie in different domains, so direct comparison of heterogeneous images is difficult. In this paper, a method for change detection between a heterogeneous synthetic aperture radar (SAR) image and an optical image is proposed, based on a pixel-level mapping method and a capsule network with a deep structure. The proposed mapping method transforms an image from one feature space to another, after which the images can be compared directly in the transformed space. In the mapping process, some image blocks in unchanged areas are selected; these blocks cover only a small part of the image. Weighting parameters are then acquired by calculating the Euclidean distances between the pixel to be transformed and the pixels in these blocks. The Euclidean distance calculated from the weighted coordinates is taken as the pixel gray value in the other feature space. The other image is transformed in a similar manner. In the transformed feature spaces, the images are compared, and the two difference images are fused. The two experimental images are then input into a capsule network with a deep structure, with the fusion result serving as the training labels. Training samples are selected according to the ratio between the label of the center pixel and the labels of its neighboring pixels. The capsule network improves the detection result and suppresses noise. Experiments on remote sensing datasets show that the proposed method achieves satisfactory performance.


Introduction
With the development of satellite technology, a large number of remote sensing images of the same region can easily be acquired at different times. Different kinds of remote sensing images are obtained by different imaging sensors, including very high resolution (VHR) images [1,2], multi-spectral [3] or hyperspectral images [4], synthetic aperture radar (SAR) images [5], polarimetric synthetic aperture radar images [6], etc. In this paper, we mainly focus on change detection between SAR images and general optical images. These images are convenient to acquire and are commonly used as homogeneous and heterogeneous image pairs in experiments.
Change detection is defined as the process of identifying variations of an object or a phenomenon by observing it at different times [7]. It is applied in many fields [8][9][10]. Given two acquired images [11], the changed and unchanged regions can be found through comparison. Many image preprocessing technologies, including denoising and coregistration, are applied to the data, aiming to decrease the noise and make the images easy to compare [12]. These technologies are necessary owing to the existence of speckle noise [13], which may cause false alarms. Coregistration aims to align the pixels of the two images in a common coordinate system [14]. Changed and unchanged regions are detected after comparing and analyzing the preprocessed images.
In general, the final result map contains two sorts of pixels, black and white, representing unchanged and changed regions, respectively. The change detection result reveals where the changed regions are and whether a given region has changed. This can be determined from the images instead of visiting the location, which greatly improves efficiency. There are currently two main change detection directions: one is homogeneous image change detection, and the other is heterogeneous image change detection [15][16][17].
Homogeneous image change detection is change detection on remote sensing images in the same or a similar feature space. SAR images are traditional experimental data for homogeneous image change detection. The pixels in homogeneous images possess the same or similar properties; in other words, they are linearly related and easy to compare. Thus, many methods compare the two images directly, such as the log-ratio method [18,19], the difference method [20], and the mean-ratio method, among other traditional approaches, and they can generally achieve fairly good performance. Usually, homogeneous image change detection is performed in one of two ways: one is to compare first and then classify, while the other is to classify the images into different types first and then compare. The classification can often be done by threshold segmentation methods such as the Kittler and Illingworth (KI) threshold method [21], the Otsu threshold method [22], and other automatic threshold segmentation methods, or by clustering methods such as FCM and K-means [23,24]. A fairly good result can usually be obtained, but the image is often polluted by noise. A common remedy is to take the difference image as the training labels and input them into a neural network such as a convolutional neural network (CNN) [25,26], a generative adversarial net (GAN) [27], a deep neural network (DNN) [28], a deep belief network (DBN), or a restricted Boltzmann machine (RBM) [29]. The similarity level can be calculated, and the ratio between the center pixel label and its neighboring labels can be used to select reliable training samples [28].
However, we may meet a more complex situation in which the two images are in different feature spaces. Some remote sensing images are less costly to obtain, and the images are not always taken from one kind of satellite. They are in different, or in other words unrelated, feature spaces, so direct comparison is not feasible. Change detection based on heterogeneous images is more promising and necessary in some situations, though more challenging, and several technologies for it have been researched in recent years. In [1], a method was proposed for the damage assessment of buildings before and after an earthquake. It mainly predicts the post-earthquake parameters from the parameters acquired before the earthquake; the damage level can then be obtained from the predicted image and the reference image. In [30], this problem was solved by a classification-based method, which can be applied to both homogeneous and heterogeneous image change detection. Post-classification comparison (PCC) classifies each image independently [31]. After obtaining the two classifications, it is easy to obtain the changed and unchanged regions. The accuracy of this method depends strongly on the performance of the classification algorithm, and misclassification may cause error accumulation. In [32], a symmetric convolutional coupling network (SCCN) was proposed to detect image differences. SCCN has several characteristics: it has a symmetric structure [33], and both sides are made up of convolutional layers and coupling layers, which are used to extract feature-level information. Network parameters are learned and updated through the feature extractor; the coupling function is then minimized according to pixels selected from the unchanged regions.
In this paper, SAR and optical images are used for heterogeneous image change detection experiments. In general, the unchanged region is much larger than the changed region, and the pixels in the unchanged region can be utilized to map the image. First, some small unchanged image blocks are picked and put into a self-organizing map (SOM) network [34]. SOM clusters those pixels into groups or, in other words, obtains a set of classes. From these image blocks, k pixels whose gray values are close to the pixel to be transformed are selected. Weights are obtained from these pixels' Euclidean distances, and the weighted pixel coordinates are summed to give an expected position. The Euclidean distance between the pixel positions before and after transformation is taken as the pixel gray value. Simple mapping images and difference images can then be acquired. Due to the influence of noise [35], the difference value at a noisy location may be too large; however, it will be less affected in the other feature space, so fusing the difference images reduces the difference value and thus the influence of noise. The fused difference image is classified to obtain a binary image, which is used as the labels for network training. The selected samples are concatenated and input into a deep capsule neural network to obtain the classification results.
The subsequent sections of this paper are organized as follows: In the second section, the motivation and the related background knowledge are introduced. The methods of this paper are described in detail in Section 3. The fourth section is the experimental part. In the final section, we summarize the proposed method and discuss possible improvements.

Motivation
The purpose of heterogeneous image change detection is to identify the changed areas from two images obtained at different times over the same geographical area. However, because the feature information of the two images differs, direct comparison is very difficult. A pixel-level transformation method is used to deal with this problem. Transformation methods based on the object and feature levels do not preserve details very well; the pixel-level transformation method retains more details and makes full use of the pixel information in the image to obtain a more reliable change detection result. In recent years, representation learning based on neural networks has been widely applied in many fields. Basic models such as the DBN, the sparse denoising autoencoder, and the CNN have achieved good performance in image processing. In the task of change detection, a neural network is used to extract key information and suppress irrelevant changes caused by the environment or noise. The deep capsule network uses a CNN to extract feature information effectively, deepens the structure in a certain way, and processes the feature information beyond the simple classification of the basic capsule network to obtain better detection results.

Self-Organizing Maps
The self-organizing map (SOM) is an unsupervised learning algorithm proposed by Teuvo Kohonen for clustering [36] and visualization. The SOM network consists of two layers, the input layer and the competition layer. The nodes of the input layer receive the training samples, and the number of nodes equals the dimensionality of the training samples. The output layer is a topological diagram consisting of a set of neurons. The SOM network clusters the training samples into groups. It can find the neighbors of the pixels that need to be transformed by calculating the distance between the clustering centers and the pixels. Compared with other clustering algorithms, the SOM network can update the weights of adjacent neurons while updating the current neuron, which reduces noise.

Capsule Neural Network
The capsule network was first introduced by Hinton [37] in 2017. The capsule network is a three-layer structure. The first layer is a simple convolutional layer without pooling. The low-level features are extracted by the convolutional layer, which performs pixel-level local feature detection on the image. The capsule's output vector [38] is used to represent an instance of a certain object; the more advanced the capsule, the more advanced the instances it can represent. If features were not extracted through a convolutional layer, the capsules would receive the image content directly, which does not provide ideal low-level features. A shallow CNN is good at extracting low-level features [39], so a convolutional layer is used; however, one CNN layer is not sufficient to extract enough appropriate features, so another CNN layer is added. The second layer is the capsule layer called primaryCaps, which includes several simple convolutional layers. Different convolutional kernels obtain different information, and the acquired data are joined to form vectors. The resulting vectors are input into the third layer, called the digitCaps layer, which can be regarded as a fully-connected layer. Each input is a vector, and each output is also a vector. The length of the output vector, its L2 norm [40], represents the probability of a classification. The length characterizes the probability of a certain category, while the length-independent part characterizes graphical properties of the object such as position, color, direction, and shape, providing more information to help with classification. This layer uses the dynamic routing algorithm [41]. Routing [42] operates only between the primaryCaps layer and the digitCaps layer and provides a selective layer-to-layer connection.
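The role of the output vector's length as a class probability comes from the squash nonlinearity used in the capsule-network literature. The following is a minimal illustrative sketch (not the paper's implementation): it shrinks a vector's norm into [0, 1) while preserving its direction, so the norm can be read as a probability.

```python
import numpy as np

def squash(s, eps=1e-8):
    """Squash nonlinearity: v = (|s|^2 / (1 + |s|^2)) * (s / |s|).
    The output keeps the direction of s, but its length lies in [0, 1)."""
    norm_sq = np.sum(s ** 2)
    norm = np.sqrt(norm_sq) + eps
    return (norm_sq / (1.0 + norm_sq)) * (s / norm)

s = np.array([3.0, 4.0])      # raw capsule output, length 5
v = squash(s)
prob = np.linalg.norm(v)      # length of the squashed vector = class probability
```

Long raw vectors map to probabilities near 1 and short ones to probabilities near 0, which is what the margin loss described later operates on.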
The capsule network combines the advantages of the CNN in image processing and feature extraction. At the same time, the capsule can obtain more information: important feature information is encapsulated in vector form instead of scalar form. These vectors are processed by a dynamic routing algorithm and a newly-proposed activation function for classification. Further applications and modifications of the capsule network [43] can also enhance the network's capabilities and expand its applicable scenarios.

The Whole Process Structure
The entire flowchart of the proposed method is shown in Figure 1. After clustering by SOM, the two images are compared, and some rough unchanged regions are obtained. In these regions, some image blocks are selected randomly. The first step is to transform each pixel in the image according to the mapping method with these image blocks. The SAR image is converted into the optical feature space pixel by pixel by the mapping method, and similarly, the optical image is converted into the SAR feature space. After that, the images are compared directly in the two feature spaces, respectively. The classification image, obtained by fusing the two difference maps, is used as the labels for training. Samples for the network are selected from the fused image, which provides the black and white training labels. The selected samples are concatenated and input into the network, and finally a binary classification map is obtained.

SOM Clustering and Block Selection
In this paper, each pixel is considered to be the basic element in the image, and the proposed change detection method is performed at the pixel level. By clustering, the pixels can be divided into several groups whose members have similar characteristics. SOM clustering is used mainly for selecting image blocks from the detected unchanged regions. The input nodes are connected to the competing-layer neurons by weights, and the neurons are connected to their adjacent neurons. The nodes of the input layer depend on the training samples, i.e., the input data; the number of nodes is equal to the dimensionality of the input data. The output layer is a topological structure composed of a group of neurons, the number of which is set to 100 (10 × 10). SOM adjusts the weights of the network adaptively through the training samples, and the update formula is as follows:

w_i(t + 1) = w_i(t) + α(t, N)(x − w_i(t)),

where i is the neuron index, the learning rate α is a function of the training time t and the topological distance N, and x is the training sample. The new weight w_i(t + 1) is calculated from the previous value w_i(t). These parameters can be obtained by the method in [44]. The output layer of the trained network can not only determine the category of an input pattern, but also reflect the approximate distribution of the input data. Thus, the input data can be clustered according to certain characteristics.
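The update rule above can be sketched as follows. This is an illustrative toy SOM, not the authors' implementation: the 10 × 10 grid matches the paper's 100-neuron competition layer, while the data dimensionality, decay schedule, and Gaussian neighborhood are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy SOM: a 10 x 10 grid of weight vectors (dimensions illustrative).
grid_h, grid_w, dim = 10, 10, 2
weights = rng.random((grid_h, grid_w, dim))
coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w),
                              indexing="ij"), axis=-1)

def som_step(x, weights, t, n_steps, lr0=0.5, sigma0=3.0):
    """One update w_i(t+1) = w_i(t) + a(t, N) * (x - w_i(t)).
    The effective rate a decays with training time t and with the
    topological distance N from the best-matching unit (BMU)."""
    # BMU: the neuron whose weight vector is closest to the sample x
    bmu = np.unravel_index(
        np.argmin(np.sum((weights - x) ** 2, axis=-1)), (grid_h, grid_w))
    lr = lr0 * (1.0 - t / n_steps)                    # time decay
    dist2 = np.sum((coords - np.array(bmu)) ** 2, axis=-1)
    h = np.exp(-dist2 / (2.0 * sigma0 ** 2))          # neighborhood decay
    weights += (lr * h)[..., None] * (x - weights)
    return weights

data = rng.random((200, dim))
for t, x in enumerate(data):
    weights = som_step(x, weights, t, len(data))
```

Because neighbors of the winning neuron are also pulled toward the sample, nearby grid cells end up representing similar pixels, which is the property the block-selection step relies on.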

Image Mapping Method
We can transform the images according to the obtained image blocks. There is a pair of heterogeneous images, taken before and after the event, representing Experimental Images 1 and 2, respectively. In heterogeneous images, some pixels have very close values in the pre-event image, while their corresponding pixel gray values differ more or less in the post-event image, even though they are not affected by the event. This is mainly caused by noise effects and differences in imaging modality, so it is hard to compare the images directly to detect changes. Therefore, a transformation from the original feature space to the other is performed [45]: the image is converted into a feature space similar to that of the post-event image for direct comparison.
The mapping method is shown in Figure 2. In this mapping method, the first step is to select k pixels from the unchanged regions. These k pixels are considered as potential values of the mapping pixel. The pixels whose gray values are nearest to the mapping pixel are used to estimate the missing attribute values, such as the pixel gray value in the optical feature space, according to the known attributes, such as the Euclidean distance and the pixel gray value in the SAR feature space. If the known attribute values are very close in one space, the missing portions should also be close in the other. Therefore, the nearest neighbors are found according to the known attribute, and the missing attribute is filled by the weighted average of the k neighbor pixels. The strategy uses the weighted average of the k nearest similar pixel positions as the expected mapping coordinates. Images 1 and 2 are represented in each other's feature spaces, respectively, such that their pixel gray values can be compared. According to the Euclidean distance of the pixel positions, the k nearest pixel points [46,47] in the space are found; then, the reliable neighbors are selected according to the pixel gray value differences, which are sorted for selection. The difference is obtained according to the corresponding position in Image 2.
The weight values are obtained through the difference values. The pixel mapping equation is:

ŷ_i = Σ_{j=1}^{k} w_j ẏ_j,

where k is the number of selected pixels used for transformation, ẏ_j is the value of the j-th of the k pixels, and ŷ_i is the transformed value, which is viewed as the pixel gray value in the other feature space. The weight w_j is obtained by the equations below:

w_j = (1 − d_j) / Σ_{l=1}^{k} (1 − d_l),   d_j = ||x_i − x_j|| / max_{l=1,...,k} ||x_i − x_l||,

where d_j is the ratio of two Euclidean distances: the numerator is the Euclidean distance between the pixel to be transformed and the selected pixel, and the denominator is the maximum Euclidean distance between the k pixels and the pixel to be transformed.
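The neighbor selection and weighting above can be sketched as follows. This is an illustrative one-pixel version under simplifying assumptions (scalar gray values, absolute differences as the distances, and the `map_pixel`/array names are hypothetical), not the authors' code.

```python
import numpy as np

def map_pixel(x, neighbors_src, neighbors_dst, k=5):
    """Estimate a pixel's value in the other feature space as the
    weighted average of its k nearest unchanged-region pixels.

    x             -- gray value of the pixel to transform (source space)
    neighbors_src -- gray values of candidate pixels in the source space
    neighbors_dst -- the same pixels' gray values in the target space
    """
    d = np.abs(neighbors_src - x)                # distances in the source space
    idx = np.argsort(d)[:k]                      # k nearest neighbors
    dk = d[idx] / (d[idx].max() + 1e-12)         # ratio of distances, in [0, 1]
    w = 1.0 - dk                                 # nearer pixels get larger weight
    w = w / w.sum() if w.sum() > 0 else np.full(k, 1.0 / k)
    return np.sum(w * neighbors_dst[idx])        # mapped gray value

src = np.array([10., 12., 50., 52., 90.])        # unchanged pixels, source space
dst = np.array([100., 104., 20., 22., 200.])     # same pixels, target space
mapped = map_pixel(11.0, src, dst, k=2)          # averages the 10 and 12 pixels
```

A pixel with source value 11 lands between its two nearest unchanged neighbors in the target space, which is the intended behavior of the weighted-average mapping.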
The expected position is obtained through the weighted coordinates, and the Euclidean distance between the expected and actual positions is taken as the transformed pixel gray value:

d(ŷ_i, x_i) = sqrt((X_{ŷ_i} − X_{x_i})² + (Y_{ŷ_i} − Y_{x_i})²),   i = 1, ..., n,

where ŷ_i and x_i both represent spatial locations, X and Y represent the abscissa and ordinate, respectively, in the coordinate system, and n is the total number of pixels in the image. Parameter ŷ_i is the expected spatial location, and x_i represents the position in the other image. For c = 1 or c = 2, the difference values are calculated as follows:

d^1_i = |ŷ^1_i − y^2_i|,   d^2_i = |ŷ^2_i − y^1_i|,

where d^1_i is the difference value between the transformed Image 1 and Image 2, and d^2_i is obtained by the contrary process; they are both pixel difference values between one feature space and the image transformed into that space. Finally, we integrate the difference images [48]:

d_i = d^1_i + d^2_i.

If the comparison is based on one feature space only, it is likely to cause wrong detections. However, if we make the opposite transformation, the pixels in Image 2 are associated with the feature space of Image 1. Some pixels in Image 2 that have close values may be closer to the pixels in Image 1. Thus, if the difference value d^2_i is too large, the difference value d^1_i will typically be somewhat smaller, and the sum of d^1_i and d^2_i will not be too large. This fusion process utilizes the information of the two feature spaces to suppress noise [49].
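The two-way comparison and fusion can be sketched as follows. This is a minimal sketch with made-up 2 × 2 arrays standing in for the mapped images (`t1`, `t2` are hypothetical names for the outputs of the mapping step), using absolute differences.

```python
import numpy as np

img1 = np.array([[10., 10.], [10., 200.]])   # image 1 in feature space 1
img2 = np.array([[12., 11.], [10., 50.]])    # image 2 in feature space 2
t1 = np.array([[11., 10.], [10., 60.]])      # image 1 mapped into space 2
t2 = np.array([[ 9., 10.], [10., 190.]])     # image 2 mapped into space 1

d1 = np.abs(t1 - img2)      # compare in feature space 2
d2 = np.abs(t2 - img1)      # compare in feature space 1 (contrary process)
fused = d1 + d2             # a pixel scores high only if both spaces agree,
                            # which suppresses single-space noise
```

In this toy example, only the bottom-right pixel changed, and the fused map assigns it by far the largest value while the unchanged pixels stay near zero.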

Sample Selection
This section introduces how to select reliable training samples so as to obtain a well-trained network. The label map we obtained before contains correct labels as well as many false, i.e., unreliable, labels. The higher the reliability of the selected training labels, the more correct the final training result will be. Suppose that the value of a training label in the label map is 1 and that this pixel has an n × n neighborhood, as shown in Figure 3. Obviously, if the pixel labels in this neighborhood are all 1, then this label is reliable. Conversely, if the other pixel labels in this neighborhood are all 0, then the central pixel is considered a noise point. Therefore, the number of pixel labels in this neighborhood that are the same as the central pixel's can be used as a parameter to judge whether the sample is trustworthy. It can be judged according to the following formula:

α = num(Ω_ξη = Ω_ij) / (n × n),   p_ξη ∈ N_ij,

where N_ij is the neighborhood, p_ξη is a pixel in it, Ω_ij is the central pixel label, Ω_ξη is a neighbor pixel label, num(·) is the number of pixel labels equal to the central pixel's, and n is the neighborhood size. Therefore, α is the ratio of neighborhood pixels whose labels are the same as the central label. The parameter α should be set appropriately: if it is too large, too few samples will be selected, giving less diversity for training the network; if it is too small, too many samples, including many false labels, will be chosen, resulting in worse training results.
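The reliability ratio above can be sketched as follows (an illustrative sketch; the function name and the clipping at image borders are assumptions, not the authors' code).

```python
import numpy as np

def label_reliability(labels, i, j, n=7):
    """Fraction of pixels in the n x n neighborhood of (i, j) whose
    label equals the center pixel's label (clipped at image borders)."""
    r = n // 2
    patch = labels[max(i - r, 0):i + r + 1, max(j - r, 0):j + r + 1]
    return np.mean(patch == labels[i, j])

labels = np.zeros((9, 9), dtype=int)
labels[4, 4] = 1                      # isolated "changed" pixel = likely noise
alpha = label_reliability(labels, 4, 4, n=3)
keep = alpha >= 0.5                   # threshold matching the paper's alpha = 0.5
```

The isolated pixel scores 1/9 and is rejected, while a pixel inside a uniform region scores 1.0 and is kept as a training sample.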

Deep Capsule Network and Parameter Settings
The deep capsule network is used to process the fused difference image obtained by the mapping method. Based on the reliable samples, we can finally obtain a well-trained network. The method of deepening the capsule network and the related parameter settings are shown in Figure 4. Change detection with the deep capsule network can be accomplished by the following steps: (a) Select two n × n samples and connect them directly into an n × 2n input. (b) Put the input into the Conv layer, using many convolution kernels to extract different simple feature information; the primaryCaps layer further selects the extracted feature information and combines it into vectors; the digitCaps layer normalizes these vectors and classifies them into a set of vectors. (c) Reshape these vectors into a one-dimensional vector, reshape it into image blocks of a certain size, and input them into the network as before. (d) Compute the L2 norm of the vectors for classification and obtain the final classification results. The parameters involved in the network are as marked in Figure 4. The neurons in each layer are divided into groups, i.e., capsules. The output of the traditional neuron is extended and reshaped into a vector, which is rich in representing the features and the direction of the entity. The routing consistency algorithm preserves the location information and other information of the entity. Traditionally, network training is evaluated using variance functions; here, the function below is used instead:

L_c = T_c max(0, m⁺ − ||v_c||)² + λ(1 − T_c) max(0, ||v_c|| − m⁻)²,

where c represents a class and T_c is an indicator parameter: when class c is present, T_c is 1, else 0. The margins m⁺ = 0.9 and m⁻ = 0.1 penalize missing an existing class and falsely detecting an absent one, respectively. L_c is called the margin loss. Since there is a reconstruction process in the capsule network, we combine the margin loss and the reconstruction loss so as to make the training result more precise.
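The margin loss can be sketched as follows, using the standard values from the capsule-network literature (m⁺ = 0.9, m⁻ = 0.1, down-weighting factor λ = 0.5; the λ value is an assumption, since the paper does not state it).

```python
import numpy as np

def margin_loss(v_norms, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Margin loss L_c summed over classes (illustrative sketch).
    v_norms -- lengths ||v_c|| of the output capsules, shape (classes,)
    targets -- indicator T_c: 1 if class c is present, else 0"""
    present = targets * np.maximum(0.0, m_pos - v_norms) ** 2
    absent = lam * (1 - targets) * np.maximum(0.0, v_norms - m_neg) ** 2
    return np.sum(present + absent)

# Two classes (changed / unchanged); the true class is class 0.
loss = margin_loss(np.array([0.95, 0.05]), np.array([1, 0]))
```

A confident, correct prediction (lengths 0.95 and 0.05) incurs zero loss, since both lengths clear their margins; a wrong prediction is penalized on both terms.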

Experiment Study
The experiments were mainly performed on two kinds of datasets. The first part of the experiments was performed on the homogeneous datasets and was based on the deepened capsule network. The second part was performed on the heterogeneous datasets and was based on the proposed method.

Homogeneous Datasets
The first dataset consisted of two SAR images of the same size, 306 × 291 pixels, as shown in Figure 5a,b. The second dataset, the Ottawa dataset, was a group of two SAR images over the city of Ottawa acquired with Radarsat SAR sensors, with a size of 290 × 350 pixels. The ground truth (reference image), shown in Figure 6c, was acquired by integrating prior information with Figure 6a,b. The experiment on the Ottawa dataset evaluated a water disaster; the white areas represent the changed, i.e., affected, areas.

Heterogeneous Datasets
The third dataset consisted of a SAR image and an optical image of the same size, 291 × 343 pixels, as shown in Figure 7a,b, respectively. The SAR image was also acquired with Radarsat-2 sensors over the Yellow River Estuary in June 2008. The optical image, obtained in September 2010, was captured from Google Earth, covering the same region. The data provided by Google Earth integrate imagery from both satellite and aerial photography; these satellite images were obtained from the Landsat-7 and QuickBird sensors. This dataset was used to study the change of the Yellow River caused by flooding. Figure 7c is the reference image that reveals the actual changed regions.

Evaluation Criteria
The final classification results not only show the final change detection binary map, but also provide the values of several evaluation criteria to help with analyzing the performance of the change detection results.
The parameters of the evaluation criteria are as follows: (1) the number of all pixels in the image, N; (2) the actual number of changed pixels in the reference image, NC; (3) the actual number of unchanged pixels in the reference image, NU (both can be calculated from the reference image); (4) the number of changed pixels taken as unchanged pixels, FN (false negatives); and (5) the number of unchanged pixels taken as changed pixels, FP (false positives). These last two parameters can be calculated by comparing the reference image with the resulting image. The overall error (OE) is calculated as follows:

OE = FP + FN.

Another two parameters, TP and TN, which have the opposite meanings of FP and FN, respectively, are calculated as follows:

TP = NC − FN,   TN = NU − FP,

where TP (true positives) is the number of changed pixels correctly detected in both the reference image and the final resulting image, and TN (true negatives) is the number of unchanged pixels correctly detected in both images.
For a further evaluation of the resulting image, we can calculate the percentage of correct classification (CA) [50] as follows:

CA = (TP + TN) / N,

where CA shows the correct rate of the results. However, since the value of N is usually large, the CA values obtained by different methods may be very similar in some situations, so CA alone is not enough to distinguish the quality of detection. Thus, we introduce the Kappa coefficient (KC) [51] as another overall evaluation criterion: the higher the KC value, the better the detection result. KC is calculated as follows:

KC = (CA − PRE) / (1 − PRE),   PRE = ((TP + FP) · NC + (TN + FN) · NU) / N²,

where CA depends on the sum of TP and TN. KC relies on parameters containing more detailed classification information, so it can further explain the quality of the change detection map.
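The evaluation criteria above can be computed as follows. This is a self-contained sketch (the function name is ours) that derives all counts from a binary result map and a binary reference map.

```python
import numpy as np

def change_detection_metrics(result, reference):
    """Compute OE, CA, and KC from binary maps (1 = changed, 0 = unchanged)."""
    N = reference.size
    TP = np.sum((result == 1) & (reference == 1))   # changed, correctly detected
    TN = np.sum((result == 0) & (reference == 0))   # unchanged, correctly detected
    FP = np.sum((result == 1) & (reference == 0))   # false alarms
    FN = np.sum((result == 0) & (reference == 1))   # missed changes
    NC, NU = TP + FN, TN + FP                       # changed / unchanged in reference
    OE = FP + FN                                    # overall error
    CA = (TP + TN) / N                              # percentage correct classification
    PRE = ((TP + FP) * NC + (TN + FN) * NU) / N ** 2  # chance agreement
    KC = (CA - PRE) / (1 - PRE)                     # Kappa coefficient
    return OE, CA, KC

ref = np.array([[1, 1], [0, 0]])
res = np.array([[1, 0], [0, 0]])    # one changed pixel was missed
OE, CA, KC = change_detection_metrics(res, ref)
```

On this toy 2 × 2 example, one missed pixel gives OE = 1 and CA = 0.75, while KC drops to 0.5, illustrating why KC separates methods that CA cannot.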

Parameter Settings
The relevant parameters should be set appropriately before evaluating the effectiveness of the proposed method. In a deep learning method, the structure of the whole network is very important. It is generally believed that more useful features can be learned with more network layers; however, a more complex structure also leads to extra computation time, so proper settings are important. For change detection, the image scale is relatively small, and the structure does not need to be large; in this setting, 3 layers are sufficient. In fact, too few units in the hidden layers will affect the results, and too many units will also bring high computational costs. In the network, the user-selected window size n and the sample selection parameter α have important effects on the result. The value of n determines the amount of information extracted from the two original images, and the value of α determines the appropriate number of training samples. When n is too large, the classification of the central pixel is overly affected by its neighboring pixels, and the calculation costs much more. In general, n is chosen in the vicinity of 5, from n = 3, 5, 7, 9, 11. Figure 9 shows the criteria for the parameter α and the neighborhood size n, respectively; lines of different colors represent different criteria on different datasets. The results show that n = 7 was the best choice. When n < 7, the extracted information was not enough; on the contrary, when n > 7, the accuracy of the result was no better than with n = 7. The reason may be that too much local information is extracted, so that the characteristics of the local pixels are covered. Figure 10 shows the resulting maps based on these parameters.
Overall, when n = 7, the performance was better than the others, so all of the following experiments selected n as 7. Figure 11 depicts the effect of the parameter α on the resulting image; lines of different colors represent performance on different datasets. The results show that α near 0.5 was a good choice, so the following experiments were implemented with α = 0.5. When α was too small, the sample reliability was not strong enough, and it was impossible to get good results through training. Conversely, when α was too large, the accuracy of the result was no better than with α = 0.5; the selected samples contained only reliable ones, but they were not abundant enough, and the sample size was insufficient. Therefore, α = 0.5 performed much better than the other values and was selected for all datasets in the following experiments. Figure 12 shows the resulting maps based on these parameters.

Experiments on Homogeneous Datasets
In the homogeneous image experiments, we compared the log-ratio (LR) method [52], the mean-ratio (MR) method [53], and the SCCN method, together with their deep capsule network counterparts, D_LR, D_MR, and D_SCCN. The LR and MR methods are the most commonly-used homogeneous remote sensing image change detection methods and are simple and effective. The SCCN method is a heterogeneous change detection method that is also suitable for change detection on homogeneous datasets.
In the experiment on the farmland dataset, the results obtained by the methods mentioned above are shown in Figure 13. The difference maps obtained by the different methods differed. Based on the comparison between the resulting maps and the reference image (ground truth), the evaluation criteria listed in Table 1 were obtained. It can be seen that the deep capsule network improved the results. In the experiment on the Ottawa dataset, several different difference maps and binary resulting images were obtained, as shown in Figure 14. Comparing the resulting images with the reference map, we list the evaluation criteria of the different methods in Table 2.
Similarly, the deep capsule network improved the results. Both experiments showed its capacity for processing and improving the difference image.

Experiment Performance on Heterogeneous Datasets
In the next experiments, we compare our method (PROPOSED) with the change vector analysis (CVA) [54,55], ASDNN [56], SCCN, and PCC methods. The CVA method is a very effective method for multi-spectrum change detection. The ASDNN method is a heterogeneous image change detection method based on the idea of SCCN; it is an improved version of SCCN and has a strong capability in heterogeneous image processing.

Experiment on the Yellow River Dataset
In the experiment on the Yellow River dataset, the image blocks were chosen randomly. When selected, they should be distributed as reasonably as possible: if only a certain block or part of the image is selected, the blocks cannot contain sufficient information, and the results will be accidental rather than general. In this experiment, k was selected as 1300, namely 13 small image blocks containing 100 pixels each. Figure 15 shows six difference maps obtained by selecting different numbers of pixels from these image blocks. In Figure 16, the OE of each result was obtained by the simple threshold segmentation method. It was suitable to select a small number of pixels for this dataset: when the number of selected pixels was k/20 = 65, OE was the smallest, and other good results were based on numbers around 65.
In this experiment, the PCC method was used to generate the classifications shown in Figure 17a,b. Each of these two images contains two identifiable categories, representing land and rivers. The final binary resulting map, shown in Figure 17c, is obtained by direct pixel-by-pixel comparison: pixels with the same category label are considered unchanged, and pixels with different labels are considered changed. The difference images generated by the CVA, ASDNN, and SCCN methods and the corresponding resulting images are shown in Figure 18. Figure 18d,h shows the difference image and resulting map generated by our proposed method, and the reference image is shown in Figure 18j. The quality of the difference image produced by our proposed method is visibly higher, and the proposed method had the fewest false alarms. Table 3 lists the values of the evaluation criteria obtained by the five methods. The CVA method can exploit different spectral information; however, it produced a wrong change detection result because the spectra in this dataset carry more gray information than color information. The accuracy of the proposed method was the best overall. PCC is a simple change detection approach; its performance depends on the classification algorithm, and it ignores much detailed information. SCCN is an innovative method based on symmetric coupled deep convolutional neural networks. It exhibited fairly high accuracy in detecting changes in heterogeneous images, with its training samples selected from unchanged regions, but it blurred some locations belonging to the changed class. ASDNN performed better on this dataset: it was better in the main detection regions and suppressed much of the noise. However, our proposed method balanced these two aspects: it can suppress as much of the noise as possible and 
detect the main regions in detail. In the Shuguang Village dataset experiment, the image blocks were again selected randomly and, as above, distributed as evenly as possible. In this experiment, k was set to 4500, namely 15 small image blocks, each containing 300 pixels. Figure 19 shows six simple difference maps, obtained with different numbers of pixels selected from the image blocks. Figure 20 shows the OE of the different results. When the number of selected pixels was k/25 = 180, OE was the smallest; if the number was too small, such as 80, the result was somewhat worse. A good choice for the number of pixels on this dataset was about 200: fairly good mapping images were obtained when the number was set around 200.
In this experiment, the PCC method was used to generate the classifications and resulting maps shown in Figure 21a,b. There are two identifiable types in the SAR image, namely the farmland and water regions. In fact, there are also some buildings, but they are hard to identify with unsupervised classifiers. There are three identifiable categories in the optical image, namely farmland, water, and building regions; however, some farmland areas were not correctly classified, and this error led to a change detection result that was not good enough. The resulting map was obtained by direct comparison. The difference images generated by the proposed method and the other methods, together with the corresponding resulting images, are shown in Figure 22. According to Table 4, the proposed method obtained the best result among these methods. ASDNN and SCCN performed better than PCC, and the evaluation criteria of the proposed method were better than those of these two network methods. ASDNN was the best in the main region we wanted to detect. CVA also achieved a fairly good result: it can make full use of the color information in this kind of image, though it cannot detect the region of interest well. As before, PCC detected nearly all of the regions to be detected while including too many unnecessary regions. The difference image obtained by the proposed method was better. With the SCCN method, most of the changed regions were detected, but some small detailed regions considered changed actually belong to the unchanged regions in the reference map. The proposed method was superior to the other methods in terms of accuracy and detail.
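The PCC comparison used in both experiments reduces to a per-pixel inequality test on the two independently produced label maps; the sketch below shows this step (the classification itself, e.g. by an unsupervised clusterer, is assumed to have been done beforehand).

```python
import numpy as np

def pcc_change_map(labels_t1, labels_t2):
    """Post-classification comparison (PCC).

    Each image is first classified independently; pixels whose class
    labels differ between the two dates are marked as changed (1),
    pixels with the same label as unchanged (0).
    """
    return (np.asarray(labels_t1) != np.asarray(labels_t2)).astype(np.uint8)

l1 = np.array([[0, 0], [1, 1]])  # e.g. land / river labels at time t1
l2 = np.array([[0, 1], [1, 1]])  # labels at time t2
print(pcc_change_map(l1, l2))    # [[0 1]
                                 #  [0 0]]
```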

Conclusions
In this paper, two heterogeneous images were transformed in feature space at the pixel level and then compared in their respective transformed feature spaces. Finally, the resulting classified images were sampled and input into the improved neural network to obtain the final classification result. The results obtained were better than those of several current methods, but a drawback of the proposed method is that it is limited to SAR and certain optical images rather than multi-spectral images, hyperspectral images, natural images, etc. Future work will explore the feasibility of this method on multi-spectral, natural, or other kinds of images.

Figure 1 .
Figure 1. Flowchart of the proposed method for remote sensing image change detection.
Compute the Euclidean distances and the weights according to the selected image blocks. Compute the coordinates and add them together based on the weights and the selected pixels. Take the Euclidean distance between the pixel before and after the transformation as the transformed gray value.
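The mapping steps above can be sketched for a single pixel as follows. This is only one plausible reading of the procedure under stated assumptions: the paper says the weights are derived from the Euclidean distances to the selected unchanged pixels, so we take them inversely proportional to those distances here; the function name and the exact weighting rule are ours, not the authors'.

```python
import numpy as np

def map_pixel(x_val, x_sel, y_sel, eps=1e-6):
    """Sketch of the pixel-level mapping (assumptions noted above).

    x_val: gray value of the pixel to transform in image X.
    x_sel: gray values of the k selected unchanged pixels in X.
    y_sel: gray values at the same positions in the other image Y.
    """
    # Euclidean distances in the source space to each selected pixel
    d = np.abs(np.asarray(x_sel, dtype=np.float64) - x_val)
    # weights inversely proportional to distance (an assumption; the
    # paper only states the weights come from these distances)
    w = 1.0 / (d + eps)
    w /= w.sum()
    # weighted combination of the corresponding pixels in the target space
    y_hat = np.dot(w, np.asarray(y_sel, dtype=np.float64))
    # the distance between the pixel before and after the transformation
    # is taken as its gray value in the other feature space
    return abs(y_hat - x_val)
```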

Figure 2 .
Figure 2. The process of image mapping.

Figure 3 .
Figure 3. Different label neighbor information and sample selection according to the threshold. (a) All the labels are the same as the central pixel label. (b) All the labels are different from the central pixel label. (c) More than half of the labels are the same as the central pixel. (d) More than half of the labels are different from the central pixel.
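The sample-selection rule of Figure 3 can be sketched as below: a pixel is kept as a training sample only when enough of its 8-neighborhood shares its pseudo-label. The ratio value and function name are illustrative assumptions, not the paper's exact threshold.

```python
import numpy as np

def select_samples(labels, ratio=0.5):
    """Select reliable training samples from a pseudo-label map.

    A pixel is kept only when more than `ratio` of its 8 neighbors share
    its label (cases (a)/(c) in Figure 3); pixels whose neighborhood
    mostly disagrees (cases (b)/(d)) are rejected as likely noise.
    """
    h, w = labels.shape
    keep = np.zeros((h, w), dtype=bool)
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            patch = labels[i - 1:i + 2, j - 1:j + 2]
            same = np.sum(patch == labels[i, j]) - 1  # exclude the center
            keep[i, j] = same > ratio * 8
    return keep

lab = np.array([[1, 1, 1],
                [1, 1, 0],
                [1, 1, 1]])
# Only the interior pixel has a full neighborhood; 7 of its 8
# neighbors agree with its label, so it is kept.
print(select_samples(lab))
```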

Figure 4 .
Figure 4. The flowchart of the capsule network deepened in a certain way.

Figure 7. Figure 8.
Figure 7. Yellow River dataset. (a) SAR image acquired in May 1997. (b) Optical image acquired in August 1997. (c) Reference image. The last dataset contains a SAR image and an optical image, shown in Figure 8a,b, respectively. This dataset covers a piece of farmland in Shuguang Village, Dongying City, China. The new buildings corresponding to the changed regions, shown in Figure 8c, were built on the farmland. The SAR and optical images are the same size, 921 × 593 pixels; they were obtained in June 2008 and in September 2012, respectively.

Figure 9. Figure 10. Figure 11. Figure 12.
Figure 9. (a) Relationship between the size of the neighborhood and FP, FN, and OE on the farmland dataset. (b) Relationship between the parameter α and the criteria on the farmland dataset.

Figure 13 .
Figure 13. Difference maps and resulting maps of the farmland dataset obtained by different methods. (a) Difference map by the log-ratio (LR). (b) Difference map by the mean-ratio (MR). (c) Difference map by SCCN. (d) Resulting map by LR. (e) Resulting map by MR. (f) Resulting map by SCCN. (g) Resulting map by DCAPS based on LR (D_LR). (h) Resulting map by DCAPS based on MR (D_MR). (i) Resulting map by DCAPS based on SCCN (D_SCCN). (j) Reference map.

Figure 14 .
Figure 14. Difference maps and resulting maps of the Ottawa dataset obtained by different methods. (a) Difference map by LR. (b) Difference map by MR. (c) Difference map by SCCN. (d) Resulting map by LR. (e) Resulting map by MR. (f) Resulting map by SCCN. (g) Resulting map by DCAPS based on LR (D_LR). (h) Resulting map by DCAPS based on MR (D_MR). (i) Resulting map by DCAPS based on SCCN (D_SCCN). (j) Reference map.

Figure 15 .
Figure 15. The difference images for the Yellow River dataset obtained according to different numbers of selected pixels. (a) Difference image when the number is k/2. (b) Difference image when the number is k/4. (c) Difference image when the number is k/6. (d) Difference image when the number is k/10. (e) Difference image when the number is k/13. (f) Difference image when the number is k/20.

Figure 16. Figure 17. Figure 18.
Figure 16. The overall error (OE) on the Yellow River dataset according to different numbers of pixels selected.

Figure 19 .
Figure 19. The difference images for the Shuguang Village dataset obtained according to different numbers of selected pixels. (a) Difference image when the number is k/3. (b) Difference image when the number is k/4. (c) Difference image when the number is k/11. (d) Difference image when the number is k/23. (e) Difference image when the number is k/25. (f) Difference image when the number is k/65.

Figure 20. Figure 21. Figure 22.
Figure 20. The overall error (OE) on the Shuguang Village dataset according to different numbers of pixels selected.

Figure 22.
Figure 22. Difference images and resulting images for the Shuguang Village dataset obtained by different methods. (a) Difference image by CVA. (b) Difference image by ASDNN. (c) Difference image by SCCN. (d) Difference image by our proposed method. (e) Resulting image by CVA. (f) Resulting image by ASDNN. (g) Resulting image by SCCN. (h) Resulting image by our proposed method. (i) Resulting image by PCC. (j) Reference image.

Table 1 .
Values of the evaluation criteria on the farmland dataset by different methods and these methods based on the deep capsule network.

Table 2 .
Values of the evaluation criteria on the Ottawa dataset by different methods and these methods based on the deep capsule network.

Table 3 .
Values of the evaluation criteria on the Yellow River dataset by different methods.

Table 4 .
Values of the evaluation criteria on the Shuguang Village dataset by different methods.