Dual Learning-Based Siamese Framework for Change Detection Using Bi-Temporal VHR Optical Remote Sensing Images

Abstract: As a fundamental and profound task in remote sensing, change detection from very-high-resolution (VHR) images plays a vital role in a wide range of applications and attracts considerable attention. Current methods generally focus on simultaneously modeling and discriminating the changed and unchanged features. In practice, for bi-temporal VHR optical remote sensing images, temporal spectral variability tends to exist in all bands throughout the entire paired images, making it difficult to distinguish non-changes from changes with a single model. Motivated by this observation, we propose a novel hybrid end-to-end framework named the dual learning-based Siamese framework (DLSF) for change detection. The framework comprises two parallel streams: dual learning-based domain transfer and Siamese-based change decision. The former stream aims to reduce the domain differences between the two paired images while retaining their intrinsic information by translating them into each other's domain, while the latter stream aims to learn a decision strategy that identifies the changes in the two domains, respectively. By training the proposed framework with change map references, the method learns a cross-domain translation that suppresses the differences of unchanged regions and highlights the differences of changed regions in the two domains, and then focuses on the detection of changed regions. To the best of our knowledge, the idea of incorporating a dual learning framework and a Siamese network for change detection is novel. The experimental results on two datasets and the comparison with other state-of-the-art methods verify the efficiency and superiority of our proposed DLSF.


Introduction
Change detection, one of the most important tasks in remote sensing, mainly concerns the process of comparing remote sensing images that are acquired over the same geographic area but at different times, and then identifying the changed regions [1][2][3]. It is widely used in a large number of applications, for example, land cover change mapping [4,5], resource and environment monitoring [6][7][8], disaster monitoring [9], vegetation studies [10], and urban planning [11,12]. Along with the development of imaging sensors, the spatial resolution and spectral space of acquired images have significantly improved. As one of the most common and accessible types of remote sensing data, current very-high-resolution (VHR) optical remote sensing images provide considerable detailed information due to their high resolution and image quality; however, they also bring more redundancy and noise. Therefore, change detection using VHR optical remote sensing images is both fundamental and challenging [13].
Technically speaking, current change detection methods have evolved by considering voluminous information in order to selectively extract positive and meaningful information from paired images. These change detection methods generally comprise two major parts: feature extraction and decision making. The former aims to pursue positive and meaningful features such as color distribution, texture characteristics, and contextual information. The latter aims to analyze these features to identify the changed regions in bi-temporal remote sensing images with certain technical algorithms.
Conventional change detection methods mainly take paired pixels or their simple differences and ratios [14,15] as input features, and detect changes by determining a threshold, as done by Otsu [16] and Kullback-Leibler (KL) [17]. Wu et al. (2014) [18] transformed paired images into a new feature space retaining the invariant components and performed slow feature analysis (SFA) to detect changes. References [19][20][21] produced pixel-level change vectors and performed change vector analysis (CVA) on them for change detection. These types of methods are advantageous due to their simplicity and directness. Nevertheless, the individual changes of paired pixels mostly do not clearly reflect whether the region has actually changed or not. With the development of machine learning, many strategies have been proposed and applied to extract region- and object-based features from registered bi-temporal images. Along with these strategies, certain advanced decision making algorithms have been proposed to analyze the features and detect the changes. Nielsen et al. (1998) [22] first made an orthogonal transformation, namely the multivariate alteration detection (MAD) transformation, on paired images and then analyzed them with canonical correlation analysis (CCA). By integrating the expectation-maximization (EM) algorithm with CCA, Nielsen et al. (2007) [23] proposed an iteratively reweighted MAD (IRMAD) to improve MAD. References [24,25] took image saliency and object-based segments as input features and detected the changes using the random forest (RF) algorithm. References [26,27] applied a wavelet transformation to paired images and then detected the changed regions using a Markov random field (MRF). Volpi et al. (2013) [28] proposed including contextual information through local textural statistics and mathematical morphology in order to extract features, and adopted the support vector machine (SVM) to determine the changed regions. References [29][30][31] first performed a principal component analysis and segmented images into object-based superpixels or regions, respectively, and then selected multiple classifiers to decide the changes and produce the final predicted change map using weighted voting. These types of methods are able to take into consideration the relationship of neighboring pixels and make complex nonlinear decisions for change detection, which effectively improves the detection accuracy and resists negative influences from redundant information and noise. However, most of the time, human involvement is still required to facilitate the machine learning models.
Recently, the development of deep learning technology has provided new ideas, and progress has accelerated remarkably due to its high efficiency and outstanding performance. In comparison to machine learning-based methods, deep learning-based methods exploit considerably more implicit features from optical images. Although most of the features of a deep neural network seem to have no visual significance, they may practically benefit change detection. Lyu et al. (2016) [32] applied an end-to-end recurrent neural network (RNN) with long short-term memory (LSTM) to learn a transferable change rule for land cover change detection. Gong et al. (2016) [33] trained a deep belief network to classify changed and unchanged regions in synthetic aperture radar (SAR) images. Wang et al. (2019) [34] proposed an end-to-end two-dimensional convolutional neural network (CNN) framework for hyperspectral image change detection. References [35][36][37] applied a generative adversarial network (GAN) to detect changes in multispectral and other heterogeneous images, respectively. Zhan et al. (2017) [38] proposed a deep Siamese convolutional network (DSCN), derived from the Siamese network, to detect changed regions with contrastive loss. Liu et al. (2018) [39] expanded the DSCN by proposing a symmetric convolutional coupling network (SCCN) to detect changes between optical and SAR images. These types of methods are able to learn considerable decision rules for identifying changes without any manual intervention; their primary barrier is the lack of sufficient labeled change map samples and open benchmarks for training models [40]. All the aforementioned feature extractors and decision makers are summarized in Table 1.
Table 1. Feature extraction methods and decision making algorithms for change detection.

Category | Feature extraction | Decision making
Conventional | Image differences [14], ratios [15] | Otsu [16], KL [17]
Conventional | Image transformation | SFA [18]
Conventional | Pixel vectors | CVA [19][20][21]
Machine learning based | MAD [22], IRMAD [23] | CCA, EM
Machine learning based | Image saliency | RF [24,25]
Machine learning based | Wavelet transform | MRF [26,27]
Machine learning based | Contexts | SVM [28]
Machine learning based | PCA, Segments | Multiple classification [29][30][31]
Deep learning based | LSTM [32] | Regression
Deep learning based | CNN [34], GAN [35][36][37] | Softmax loss
Deep learning based | DSCN [38], SCCN [39] | Contrastive loss

For registered bi-temporal VHR optical remote sensing images, in an ideal situation, the features of the two images in unchanged regions remain theoretically invariant, and distinctive discrepancies exist between the features of the paired images in truly changed regions. Nevertheless, in reality, the spectral and spatial context features of the image domains in unchanged regions may be tremendously different, mainly because of differences in imaging time, illumination and atmospheric conditions, and imaging sensors. Therefore, simultaneous modeling of distinctive features for the changed and unchanged regions is usually not feasible. Regarding this problem, it is meaningful to perform alternate feature modeling of unchanged and changed regions, in other words, to iteratively make the unchanged regions as similar as possible and the changed regions as different as possible. For this approach, it is essential to design a model that can translate paired images to each other's domain to eliminate the domain differences while retaining intrinsic information. In this model, the original features of the unchanged regions in one temporal image are able to act as a reliable reference for those in the other temporal image. This internal relationship is taken as an auxiliary for the subsequent decision makers, and therefore improves the detection accuracy and effectiveness.
According to the above analyses, in this paper, we propose a novel hybrid end-to-end method named dual learning-based Siamese framework (DLSF) specifically for change detection using bi-temporal VHR optical remote sensing images. This framework is an integration of one conditional dual learning framework (CDLF) and two fully convolutional Siamese networks (FCSN). The CDLF is aimed at generating cross-domain images in order to suppress the differences between the paired unchanged regions and separate the changed regions, while the two FCSNs are aimed at determining the changes in the two domains, respectively. The primary contributions of our research are summarized as follows:

• We propose a novel hybrid end-to-end framework integrating strategies of dual learning and Siamese network to directly achieve supervised change detection using bi-temporal VHR optical remote sensing images without any pre- or post-processing.
• To the best of our knowledge, this is the first application of the idea of dual learning to change detection, achieving a cross-domain translation between bi-temporal images.
• The CDLF with two conditional discriminators is designed to ensure the complete translation of paired images from the source domain to the target domain, specifically in the unchanged regions.
• We adopt a weight-sharing strategy on discriminators and detectors to improve the training speed and efficiency.
• We design a new loss function comprising adversarial, cross-consistency, self-consistency, and contrastive losses as the decision maker to better train the DLSF for change detection.

The remainder of this paper is organized as follows. Related work on DLF and FCSN is briefly described in Section 2. The theory and implementation of our proposed DLSF are introduced in detail in Section 3. The results of experiments on two different datasets are presented in Section 4. Further analyses that verify the effectiveness and robustness of our models are provided in Section 5. Finally, the conclusion is summarized in Section 6.

Dual Learning Framework
The dual learning framework (DLF) makes a loop translation between two types of data by setting a primal task f : x → y and a dual task g : y → x. With these two models, the original signals are mapped forward, y = f(x), and backward, x = g(y), and then the feedback signals are reconstructed. Based on the deviation between the original and feedback signals, x − g(f(x)), the primal and dual models are improved together to achieve better translations via a policy gradient algorithm, as shown in Equation (1).
where η is the training rate and T is the preset threshold. ∆f and ∆g are the gradients of the two models f and g. This mechanism was first applied to natural language processing [41]. The DLF conducts reinforcement learning and sets up a primal-dual pair to simultaneously train two "opposite" language translators by minimizing the reconstruction loss in a closed loop. Notably, the reconstruction loss measured over monolingual words generates information feedback to train a bilingual translator. Similarly, in image processing, the DLF is often used in paired image style transfer and unpaired image-to-image translation, as in DualGAN [42], DiscoGAN [43], and CycleGAN [44].
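The closed-loop idea can be illustrated with a toy sketch. It is not the policy gradient update of Equation (1): as a stand-in, it assumes two hypothetical linear maps f and g and applies plain gradient descent to the mean squared reconstruction error x − g(f(x)). The function name, the matrix sizes, and the step size eta are illustrative assumptions.

```python
import numpy as np

def train_dual_toy(steps=1000, eta=0.02, seed=0):
    """Jointly fit a primal map f and a dual map g by closing the loop x -> f -> g -> x."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(64, 4))             # samples from the source domain
    f = rng.normal(scale=0.5, size=(4, 4))   # primal task f: x -> y (assumed linear)
    g = rng.normal(scale=0.5, size=(4, 4))   # dual task g: y -> x (assumed linear)
    history = []
    for _ in range(steps):
        y = x @ f                             # forward translation y = f(x)
        residual = y @ g - x                  # feedback signal minus original signal
        history.append(float(np.mean(np.sum(residual ** 2, axis=1))))
        # Gradients of the mean squared reconstruction error w.r.t. g and f.
        grad_g = 2.0 * y.T @ residual / len(x)
        grad_f = 2.0 * x.T @ (residual @ g.T) / len(x)
        f -= eta * grad_f                     # both models are updated together
        g -= eta * grad_g
    return history

history = train_dual_toy()
print(history[0], history[-1])                # reconstruction error before and after
```

Neither map receives a target in the other domain; the only training signal is how well the round trip reproduces the input, which is the property the DLF relies on.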

Fully Convolutional Siamese Network
The Siamese network is a similarity measurement strategy rather than a specific network. When the number of categories is large and the number of samples for certain categories is small, the Siamese network is used to achieve identification and classification without predicting all the categories of samples in advance. Derived from the Siamese network, the FCSN mainly focuses on pixel-level identification and classification, and thus replaces the CNN and the distance measurement with a fully convolutional network (FCN) and a pixel-wise distance measurement. Specifically, the FCN retains the dimensions of the input image and eliminates discontinuities on pixel outlines, and thus achieves high synchronism and accuracy in pixel-level feature extraction. The pixel-wise distance measurement aims to measure the similarities of paired pixels over the entire two paired images. The pixel-wise Euclidean distance D is expressed as shown in Equation (2).
Considering that certain pixels belong to the same category and that others belong to different categories for certain tasks, the contrastive loss function is designed to improve the model f and thereby adapt it to the preset rules, as shown in Equation (3).
where L is the preset binary labeled map in which pixel values equal to 0 indicate that the paired pixels are similar, and values equal to 1 indicate that the paired pixels are dissimilar. m is the distance margin for dissimilar pixel pairs. The FCSN was first proposed for object tracking [45]. It is used to determine whether the paired pixels belong to the same category in order to train the FCN by minimizing the contrastive loss. On the basis of this mechanism, the FCSN is used for change detection in remote sensing [46,47].
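A minimal NumPy sketch of the two quantities described above follows. It assumes the widely used contrastive form of Hadsell et al. (squared distance for similar pairs, squared hinge on the margin m for dissimilar pairs, with a factor of 1/2); the exact constants of Equation (3) may differ, and the function names are illustrative.

```python
import numpy as np

def pixelwise_distance(feat_a, feat_b):
    """Euclidean distance per pixel between two (H, W, C) feature maps -> (H, W)."""
    return np.sqrt(np.sum((feat_a - feat_b) ** 2, axis=-1))

def contrastive_loss(dist, label, margin=2.0):
    """label == 0: similar pair (pull together); label == 1: dissimilar pair (push apart)."""
    similar_term = (1.0 - label) * dist ** 2
    dissimilar_term = label * np.maximum(margin - dist, 0.0) ** 2
    return float(0.5 * np.mean(similar_term + dissimilar_term))

feat_a = np.zeros((2, 2, 3))
feat_b = np.zeros((2, 2, 3))
feat_b[0, 0] = [3.0, 4.0, 0.0]               # one pixel differs, with distance 5
dist = pixelwise_distance(feat_a, feat_b)
label = np.zeros((2, 2))
label[0, 0] = 1.0                            # that pixel is marked dissimilar
print(contrastive_loss(dist, label))         # already beyond the margin, so no penalty
```

A dissimilar pair whose distance already exceeds the margin contributes nothing, so the loss only pushes apart pairs that are still too close.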

Methodology
In this section, the problem formulation for change detection on registered bi-temporal VHR optical remote sensing images is first presented and described in detail, followed by an overview of the proposed framework architecture. Thereafter, we interpret our new loss functions. Finally, additional implementation details regarding the training and predicting processes are described.

Problem Formulation
The primary goal of change detection is to identify the changes between registered bi-temporal VHR optical remote sensing images I^(T1) and I^(T2), which are acquired over the same geographic area but at different times T1 and T2. As a result of the different times, illumination and atmospheric conditions, and imaging sensors, the bi-temporal images are regarded as paired images in two different domains with varying appearances. To ensure the consistency of the representations of the unchanged regions in these paired images, we introduce two models, G_1to2 and G_2to1, to simulate the mapping procedures between the two domains, I^(T1) to I^(T2) and I^(T2) to I^(T1), respectively. Logically, with the two models, the unchanged regions of the original image and the translated image have to be completely the same. Therefore, the relationship of the paired images is formulated as expressed in Equations (4) and (5).
where C is the binary change map with the same width and height as I^(T1) and I^(T2) but with only one channel, in which the value 1 means that the pixel is part of a changed region and the value 0 means that it is part of an unchanged region. The operation ⊗ denotes element-wise multiplication.
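As a hedged reading of the constraint just described, the relationship can be written as a masked equality in which each translated image agrees with the image of the other date wherever C is 0. The form below is an assumption reconstructed from the definitions of C and ⊗ in the text, not necessarily the authors' exact notation for Equations (4) and (5):

```latex
\begin{align}
(1 - C) \otimes G_{1to2}\left(I^{(T_1)}\right) &= (1 - C) \otimes I^{(T_2)}, \\
(1 - C) \otimes G_{2to1}\left(I^{(T_2)}\right) &= (1 - C) \otimes I^{(T_1)}.
\end{align}
```

Because C has a single channel, the mask 1 − C is understood here to be applied to every spectral band of the image.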
To learn the mapping models G_1to2 and G_2to1 with ground reference data, we propose two conditional adversarial discriminators, D_1 and D_2, to evaluate the domain consistency of paired real and fake images. As expressed in Equations (6) and (7), the discriminator D_1 aims to distinguish between the two images I^(T1) and G_2to1(I^(T2)) in the unchanged regions in domain T1, while the discriminator D_2 aims to distinguish between the two images I^(T2) and G_1to2(I^(T1)) in the same unchanged regions in domain T2.
where TRUE and FALSE are matrices with Boolean values 1 and 0, respectively, which denote real image pixels and fake image pixels as judged by the discriminators, and RANDOM is a matrix with random values between 0 and 1. Unlike the traditional adversarial discriminator, the conditional adversarial discriminator restricts the distance between the original image and the translated image only in the unchanged regions.
After the training process, the original and translated images are difficult for the discriminators to separate in the unchanged regions, which indicates that the generation models are able to realize the domain transfer of bi-temporal images between domains T1 and T2. Finally, to make the decision for changed regions, we introduce two Siamese detectors, S_1 and S_2, to perform a pixel-level comparison of the original images and the translated ones in the two domains, respectively, as expressed in Equations (8) and (9).
where P(i, j) is the change probability of the pixel located at (i, j): a value close to M means that the pixel is part of a changed region, while a value close to 0 means that it is part of an unchanged region. M is the change threshold, which is set to 1 here, and ‖·‖_2 is the L2 distance.

Framework Architecture
Motivated by the problem formulation, the framework architecture is designed as shown in Figure 1. Our framework contains three main parts, namely, mapping generation, conditional discrimination, and Siamese detection. There are two mapping generators, two conditional discriminators, and two Siamese detectors. These six neural networks make up two parallel streams: dual learning-based domain transfer and Siamese-based change decision. The two streams are discussed in detail in the following.

Domain Transfer Stream
Domain transfer for bi-temporal VHR optical remote sensing images is aimed at eliminating the domain differences, including color distribution, texture characteristics, and contextual information, between paired bi-temporal images, which are mainly caused by different times, illumination and atmospheric conditions, and imaging sensors. Concatenating Figure 1b to Figure 1a, the domain transfer stream is achieved by two mapping generators and two conditional discriminators, as illustrated in Figure 2.

Note that the two generators translate images from one domain to the other and together make up a closed loop, while the two discriminators distinguish translated images from original images in the unchanged regions in the two domains, respectively. This adversarial learning continuously improves the generators and the discriminators, and thereby boosts the performance of the domain transfer stream.
Although the DLF was first designed for unpaired image-to-image translation, applying it to paired image-to-image domain transfer provides the correct direction of gradient descent when training the models, thus stabilizing and shortening the training process. In practice, the CDLF doubles the number of training datasets, but it also further enhances stability and increases the fault tolerance of this method for change detection.

Change Decision Stream
Change detection on paired images in the same domain is more accurate and efficient than that performed in two different domains. Concatenating Figure 1c to Figure 1a, the change decision stream is achieved by two mapping generators and two Siamese detectors. Taking one pair of the original image and the translated image in the same domain as an example, the dataflow of the Siamese detector is illustrated in Figure 3. Paired images are used as input of the same FCN, and two multichannel feature maps serve as the output. Thereafter, we compute the pixel-wise Euclidean distance between the two feature maps and produce a change map that has the same width and height as the input images but only one channel. By comparing the change map with the corresponding reference map, the FCN is updated under the guidance of the contrastive loss, which boosts the performance of the change decision stream.

Loss Function
We propose a new loss function comprised of four types of terms: (1) adversarial loss to match the information distribution between fake images and real ones in two domains respectively; (2) cross-consistency loss to represent the distance from translated images to original ones in two domains, respectively; (3) self-consistency loss to prevent two mapping generators from contradicting each other; (4) contrastive loss to bring similar pixels closer and push dissimilar pixels apart.

Adversarial Loss
As the basic loss function in GANs, the adversarial loss was first proposed by Goodfellow et al. [48]; here it drives the generators to fool the conditional discriminators so that they cannot distinguish translated images from original images. In this research, we apply two adversarial losses in the two domains, respectively, to facilitate transforming image domains while retaining intrinsic features, as expressed in Equations (10) and (11).
where G_1to2 and G_2to1 try to map the images to the other domain and make them appear similar to the real images in that domain, while D_1 and D_2 try to distinguish between fake and real images in domains T1 and T2, respectively. Therefore, the generators aim to minimize these losses against the discriminators, which aim to maximize them, in a min-max game.
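The Training Procedure section states that a least-squares loss is used for L_GAN instead of the negative log likelihood. The sketch below shows the standard least-squares GAN losses, averaged only over unchanged pixels to mimic the conditional restriction. The masking scheme, the function names, and the factor of 1/2 are assumptions for illustration, not the authors' Equations (10) and (11).

```python
import numpy as np

def masked_mean(values, unchanged_mask):
    """Average `values` over pixels where unchanged_mask == 1."""
    count = max(float(np.sum(unchanged_mask)), 1.0)
    return float(np.sum(values * unchanged_mask) / count)

def lsgan_discriminator_loss(d_real, d_fake, unchanged_mask):
    """Discriminator target: score 1 for real pixels, score 0 for translated pixels."""
    real_term = masked_mean((d_real - 1.0) ** 2, unchanged_mask)
    fake_term = masked_mean(d_fake ** 2, unchanged_mask)
    return 0.5 * real_term + 0.5 * fake_term

def lsgan_generator_loss(d_fake, unchanged_mask):
    """Generator target: translated pixels should be scored as real (1)."""
    return 0.5 * masked_mean((d_fake - 1.0) ** 2, unchanged_mask)

mask = np.array([[1.0, 1.0], [0.0, 1.0]])    # 1 = unchanged pixel, i.e. 1 - C
d_real = np.ones((2, 2))                     # per-pixel discriminator scores
d_fake = np.full((2, 2), 0.5)
print(lsgan_discriminator_loss(d_real, d_fake, mask), lsgan_generator_loss(d_fake, mask))
```

The two losses pull the score on translated pixels in opposite directions, which is the adversarial game described above.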

Cross-Consistency Loss
Cross-consistency loss is derived from the LogSoftmax function in [49], which is a special type of cross entropy loss often used in semantic labeling with CNN or FCN. As interpreted in Section 3.1, in the unchanged regions, the paired real image is the reference map of the paired fake image. Therefore, we set two cross-consistency losses here to facilitate training the two mapping generators with L1 distance losses, as expressed in Equations (12) and (13).
where ‖·‖_1 is the L1 distance, used to strictly represent distance. Minimizing these losses makes the generators achieve a good mapping between the two domains.
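A minimal sketch of a masked L1 cross-consistency term follows, assuming the loss compares the translated image with the real image of the target date only where the change map C is 0. The normalization (a mean over all elements) and the function name are assumptions; Equations (12) and (13) may be normalized differently.

```python
import numpy as np

def cross_consistency_loss(translated, reference, change_map):
    """L1 distance between translated and reference images, restricted to unchanged pixels.

    translated, reference: (H, W, bands) arrays; change_map: (H, W) with 1 = changed.
    """
    unchanged = 1.0 - change_map[..., np.newaxis]   # (H, W, 1), broadcast over bands
    return float(np.mean(np.abs(unchanged * (translated - reference))))

translated = np.ones((2, 2, 3))
reference = np.zeros((2, 2, 3))
change_map = np.array([[1.0, 0.0], [0.0, 0.0]])    # top-left pixel is a real change
print(cross_consistency_loss(translated, reference, change_map))
```

The changed pixel is excluded from the penalty, so the generator is not forced to erase a real change in order to match the reference.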

Self-Consistency Loss
As a result of the powerful expression capacity of deep neural networks, the mappings between two domains are generally stochastic and not unique. To facilitate the adversarial losses and reduce the randomness of the mapping generators, the self-consistency losses are set here to guarantee that the mappings bring translated images back to the original images, as illustrated in Figure 2. Similar to the cycle-consistency losses in CycleGAN [44], the self-consistency losses in the two domains are expressed in Equations (14) and (15).
where ‖·‖_1 is the L1 distance. Minimizing these losses reduces the randomness of the generators and provides a positive direction for the convergence procedure.
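For orientation, the cycle-consistency form used in CycleGAN, written with this paper's generator names, is sketched below. It is offered as an assumption about the general shape of Equations (14) and (15); whether the authors add a mask or a different normalization is not recoverable from the text here.

```latex
\begin{align}
\mathcal{L}_{self}(T_1) &= \mathbb{E}\left[\left\| G_{2to1}\left(G_{1to2}\left(I^{(T_1)}\right)\right) - I^{(T_1)} \right\|_1\right], \\
\mathcal{L}_{self}(T_2) &= \mathbb{E}\left[\left\| G_{1to2}\left(G_{2to1}\left(I^{(T_2)}\right)\right) - I^{(T_2)} \right\|_1\right].
\end{align}
```

Each term sends an image through both generators in turn and penalizes the difference from the starting image, which is what closes the loop shown in Figure 2.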

Contrastive Loss
The FCSN plays an important role in ensuring effective decision making. It aims to bring the feature points of unchanged pixel pairs closer to each other and to push those of changed pixel pairs considerably further apart in the output change map. In order to process the relationship between paired data in Siamese-based networks, Hadsell et al. [50] introduced a contrastive loss. In our proposed DLSF, the contrastive losses are set to evaluate whether the FCSN is trained well, as expressed in Equations (16) and (17).
where m is a distance threshold, which is set to 2 here. µ_U and µ_C denote the relative importance of the unchanged and changed pixel distributions, respectively, which are designed with global average frequency balancing, as shown in Equations (18) and (19).
where N is the total number of training sample pairs, and W and H are the width and height of one image patch, respectively. pD(·) denotes the evaluation of the pairwise Euclidean distance between two feature maps without changing the shape of the input image patches. As Equations (20) and (21) show, pD(T1) denotes the pixel-wise distance of the two paired images in domain T1, while pD(T2) denotes that in domain T2.
where ‖·‖_2 is the L2 distance. At the pixel level, the contrastive losses are expressed as Equations (22) and (23).
In general, our full objective is an integration of the aforementioned loss functions, as expressed in Equation (24).
where λ denotes the relative importance of each of these four loss functions. Therefore, our main solutions are expressed as Equations (25) and (26).
Guided by certain supervised references, we train all the networks on the same timeline.When predicting, however, only two trained mapping generators and two trained Siamese detectors are used for change detection.

Implementation
Since bi-temporal VHR optical remote sensing images are of large scale, in this paper, we make global normalizations in their own domains, respectively, and then crop them into small patches of size 256 × 256 for later training and predicting. The training dataset is produced by randomly cutting patches from the original training areas. Certain overlapped and rotated samples enhance the training effect. When predicting, we first make predictions on all small patches and then splice them together into the entire prediction image.
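The prediction-time tiling and splicing can be sketched as follows. This is a minimal illustration that assumes non-overlapping 256 × 256 tiles and edge padding for images whose sides are not multiples of 256; the padding mode and the function names are assumptions, and the random, overlapped, and rotated sampling used for training is not covered.

```python
import numpy as np

def split_into_patches(image, size=256):
    """Pad an (H, W, ...) image up to a multiple of `size` and cut it into tiles."""
    h, w = image.shape[:2]
    pad_h, pad_w = (-h) % size, (-w) % size
    pad_spec = [(0, pad_h), (0, pad_w)] + [(0, 0)] * (image.ndim - 2)
    padded = np.pad(image, pad_spec, mode="edge")
    patches = [
        padded[top:top + size, left:left + size]
        for top in range(0, padded.shape[0], size)
        for left in range(0, padded.shape[1], size)
    ]
    return patches, padded.shape[:2]

def splice_patches(patches, padded_hw, original_hw, size=256):
    """Place row-major tiles back on a canvas and crop away the padding."""
    cols = padded_hw[1] // size
    canvas = np.zeros(tuple(padded_hw) + patches[0].shape[2:], dtype=patches[0].dtype)
    for index, patch in enumerate(patches):
        top, left = (index // cols) * size, (index % cols) * size
        canvas[top:top + size, left:left + size] = patch
    return canvas[:original_hw[0], :original_hw[1]]

image = np.arange(300 * 500 * 3, dtype=np.float32).reshape(300, 500, 3)
patches, padded_hw = split_into_patches(image)
restored = splice_patches(patches, padded_hw, image.shape[:2])
print(len(patches), restored.shape)
```

In practice each tile would pass through the trained generators and detectors before splicing; the round trip here only checks that the tiling itself loses no pixels.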

Network Architecture
As illustrated in Figure 4, our two mapping generators are deep FCNs, and each one consists of two down-sampling convolutional blocks, nine residual blocks, and two up-sampling transposed convolutional blocks, which are followed by a convolutional layer and a TanH activation function. Unlike in the generators, only one down-sampling and up-sampling convolutional block is present in the conditional discriminators and Siamese detectors. The conditional discriminators comprise four convolutional blocks, followed by a discriminant transposed convolutional layer. The Siamese detectors comprise seven convolutional blocks, followed by a transposed convolutional layer and a TanH activation.
Here, setting down-sampling and up-sampling blocks facilitates the networks' learning of high-level features and enhances the generalization of the mapping models in domain transfer. In order to prevent losing image information, no pooling layer is set in our neural networks. Instead, the down-sampling and up-sampling convolutional layers consist of convolution kernels of size 3 × 3 with reflection padding 1 and stride 2, while the common convolutional layers consist of convolution kernels of size 3 × 3 with reflection padding 1 and stride 1. The activation function in the blocks of the generators is the Rectified Linear Unit (ReLU), while the one in the blocks of the discriminators and detectors is the Leaky Rectified Linear Unit (Leaky ReLU) with a negative slope of 0.2. Instance normalization is used in all the blocks of these six networks. Notably, the first four convolutional blocks of the discriminator and the detector in the same domain share the same weights. In general, the memory occupations of these three types of neural networks are 11.378M, 0.964M and 1.973M, respectively.
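The effect of the stated kernel, padding, and stride settings on a 256 × 256 patch can be checked with the usual output-size formulas. The convolution formula is standard; for the transposed convolution, the extra `output_padding=1` needed to exactly double the size is an assumption, since the text does not state how the up-sampling layers restore the resolution.

```python
def conv_output_size(size, kernel=3, padding=1, stride=1):
    """Spatial size after a convolution (the padding type does not affect the size)."""
    return (size + 2 * padding - kernel) // stride + 1

def transposed_conv_output_size(size, kernel=3, padding=1, stride=2, output_padding=1):
    """Spatial size after a transposed convolution; output_padding is an assumed setting."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

size = 256
for _ in range(2):                            # two down-sampling blocks in a generator
    size = conv_output_size(size, stride=2)
print(size)                                   # resolution seen by the residual blocks
for _ in range(2):                            # two up-sampling blocks
    size = transposed_conv_output_size(size)
print(size)                                   # back to the input patch size
```

Under these assumptions the nine residual blocks operate at one quarter of the input resolution per side, and the stride-1 layers leave the size unchanged.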

Training Procedure
With our proposed DLSF, bi-temporal VHR remote sensing images are the direct inputs of the mapping generation, without any pre- or post-processing. The two sets of translated images that serve as the outputs are recombined with the original images according to the domain. These sets of images are then used as the inputs of the conditional discriminators and Siamese detectors. For forward propagation, we carry out generation, discrimination, and detection in sequence. For backward propagation, we update the detectors, discriminators, and generators in the opposite order.
For all the experiments, we aim to obtain well trained models through a training procedure that minimizes the full objective $L(G_{1to2}, G_{2to1}, D_1, D_2, S_1, S_2)$. Following the implementations of CycleGAN [44] and conditional pixel-to-pixel GAN [51], we set both $\lambda_{GAN}$ and $\lambda_{con}$ to $10^{-2}$, $\lambda_{self}$ to $10^{-1}$, and $\lambda_{cross}$ to 1 in Equation (24). In order to stabilize the training procedure, we choose the least-squares loss for $L_{GAN}$ instead of the negative log likelihood loss of the traditional GAN. For the six neural networks, all the weights and biases of the layers are first initialized with random values drawn from a zero-mean Gaussian distribution with a standard deviation of 0.1, and then optimized with the Adaptive Moment Estimation (Adam) [52] solver during training.
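For readers unfamiliar with the least-squares adversarial loss, the sketch below shows its generic form, in which discriminator scores are regressed towards 1 for real samples and 0 for fake samples. This is only the textbook formulation with hypothetical function names; the conditional targets that DLSF derives from the change map references are not reproduced here.

```python
def mean(values):
    return sum(values) / len(values)


def lsgan_discriminator_loss(real_scores, fake_scores):
    """Least-squares discriminator loss: real scores pulled to 1, fake scores to 0."""
    real_term = mean([(s - 1.0) ** 2 for s in real_scores])
    fake_term = mean([s ** 2 for s in fake_scores])
    return 0.5 * (real_term + fake_term)


def lsgan_generator_loss(fake_scores):
    """Least-squares generator loss: scores of generated samples pulled to 1."""
    return mean([(s - 1.0) ** 2 for s in fake_scores])


# Toy scores standing in for discriminator outputs.
d_loss = lsgan_discriminator_loss(real_scores=[1.0, 0.5], fake_scores=[0.0, 0.5])
g_loss = lsgan_generator_loss(fake_scores=[0.0, 0.5])
```

Because the penalty grows quadratically with the distance from the target, samples that are already classified correctly but lie far from the target still contribute gradients, which is the usual argument for the more stable training mentioned above.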
Given the complexity of this framework, our optimization is designed to train all the networks in the same process. The detailed procedures of forward and backward propagations in one epoch are presented in Table 2.

Table 2. The overview of the optimization sequence in one epoch.

Inputs:
Paired images in domain $T_1$: $I(T_1) \sim pair(T_1)$
Paired images in domain $T_2$: $I(T_2) \sim pair(T_2)$
Corresponding binary change map references: $C$
for $i \leftarrow 1$ to $N$
Forwards:
Backwards:
Update $S_1$ and $S_2$ with $L_{con}(T_1)$ and $L_{con}(T_2)$
Update $D_1$ and

Predicting Detail
Although our proposed DLSF produces six well trained models after training, only the two mapping generators and the two Siamese detectors are used when predicting on the test dataset. Similar to the training procedure, the generators, $G^*_{1to2}$ and $G^*_{2to1}$, transform the representations of the paired images between the two domains and produce two sets of patches as the inputs of the detectors. The Siamese detectors, $S^*_1$ and $S^*_2$, then detect the changed regions in the two domains, respectively, and produce two change probability maps in which the pixel values lie between 0 and 1. Finally, the prediction result is obtained by binarizing the mean of these two change probability maps at the middle threshold of 0.5.

Experiments
Shenzhen dataset: This dataset contains two registered large-scale bi-temporal VHR remote sensing images that cover approximately 182 square kilometers, with sizes of 5233 × 8677 and 10466 × 17354, respectively. They were acquired over the same district of Shenzhen, Guangdong Province, China, but at different times. As shown in Figure 6, the image captured by SPOT 6 in 2014 has a resolution of 2 meters per pixel, while the image captured by GeoEye-1 in 2015 has a resolution of 1 meter per pixel. We classify the land cover of this dataset into the following six primary categories: (1) building group, (2) road and highway, (3) tree and vegetation, (4) cultivated land, (5) barren land, and (6) others.
According to the classification rules above, we derive the binary ground reference change maps from whether the regions at the same geographic location belong to different categories or not. We take three quarters of the area in the paired images as the training areas, which leaves one quarter of the area as the testing areas. Compared with the SZTAKI dataset, the Shenzhen dataset covers a larger area of land surface and has more types of ground targets with a complex distribution.

Methods Comparison
In this research, the performance of the proposed DLSF is compared with that of several state-of-the-art conventional, machine learning-based and deep learning-based methods as follows:
• CVA [19]: Derived from the simple difference algorithm, CVA is a classic method for unsupervised change detection in remote sensing. By using the magnitude of the difference vectors, CVA is able to achieve pixel-level change detection. For this comparative method, we apply K-means clustering to the pixel-level change vectors to calculate the threshold for change detection.
• SVM [28]: As the most typical machine learning method, SVM aims to make a generalized linear classification of the dataset by finding the decision boundary in a high-dimensional space. It is used for both supervised and unsupervised change detection. For this comparative method, we perform the experiments using a Gaussian radial basis function (RBF) kernel, and the SVM hyper-parameters are selected by three-fold cross-validation.
• CNN [34]: This network is used not only for image classification but also to extract positive and meaningful features. With a purposefully designed network architecture and loss function, the features produced by the CNN provide guidance for supervised change detection.
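As an illustration of the CVA baseline described above, the sketch below computes the magnitude of the spectral difference vector for each pixel pair and derives a threshold from a two-cluster K-means on those magnitudes. Using K = 2 on the scalar magnitudes and taking the midpoint of the two cluster centres is one plausible reading of the description, not a confirmed detail of the compared implementation; the function names and pixel values are made up.

```python
import math


def change_magnitudes(pixels_t1, pixels_t2):
    """Euclidean length of the spectral difference vector for each pixel pair."""
    return [math.dist(a, b) for a, b in zip(pixels_t1, pixels_t2)]


def two_means_threshold(values, iterations=50):
    """1-D K-means with K=2; returns the midpoint between the two cluster centres."""
    low, high = min(values), max(values)
    for _ in range(iterations):
        cut = (low + high) / 2.0
        low_group = [v for v in values if v <= cut]
        high_group = [v for v in values if v > cut]
        if not high_group:  # all magnitudes identical, nothing to separate
            break
        new_low = sum(low_group) / len(low_group)
        new_high = sum(high_group) / len(high_group)
        if (new_low, new_high) == (low, high):  # centres stopped moving
            break
        low, high = new_low, new_high
    return (low + high) / 2.0


# Four RGB pixels at time T1 and T2; the second and fourth change strongly.
t1 = [(10, 10, 10), (200, 200, 200), (50, 60, 70), (0, 0, 0)]
t2 = [(12, 10, 10), (20, 20, 20), (50, 61, 70), (100, 100, 100)]
magnitudes = change_magnitudes(t1, t2)
threshold = two_means_threshold(magnitudes)
changed = [m > threshold for m in magnitudes]
```

Each pixel is classified independently of its neighbours, which is why such pixel-based results tend to be noisy on VHR imagery.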

Evaluation Metrics
To assess the validity and effectiveness of our proposed DLSF for change detection, the following three indices are used to evaluate the accuracy of the final results.
Overall Accuracy (OA): The total accuracy is often used to assess the overall capacity of the change detection method, as expressed in Equation (27).

$$OA = \frac{TP + TN}{TP + TN + FP + FN} \tag{27}$$

where TP is the number of changed pixels correctly detected, TN is the number of unchanged pixels correctly detected, FP is the number of unchanged pixels incorrectly detected as changed, and FN is the number of changed pixels incorrectly detected as unchanged.
Kappa Coefficient (KC): This index is a statistical measure that reflects the consistency between the experimental result and the reference, as expressed in Equation (28).

$$KC = \frac{p_0 - p_e}{1 - p_e} \tag{28}$$

where $p_0$ indicates the true consistency, which equals OA here, and $p_e$ indicates the theoretical consistency, as expressed in Equation (29).

$$p_e = \frac{(TP + FP)(TP + FN) + (TN + FN)(TN + FP)}{(TP + TN + FP + FN)^2} \tag{29}$$

F1 Score (F1): This statistic is often used to evaluate neural network models and is calculated from the precision rate $P = TP / (TP + FP)$ and the recall rate $R = TP / (TP + FN)$, as expressed in Equation (30).

$$F1 = \frac{2 \times P \times R}{P + R} \tag{30}$$

For all three evaluation indices, larger values of OA, KC, and F1 indicate a better change detection method.
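The three indices can be computed directly from the four confusion counts. The sketch below uses the standard definitions of overall accuracy, Cohen's kappa, and the F1 score, which are assumed here to match Equations (27) to (30); the function name and the example counts are invented for illustration.

```python
def change_detection_metrics(tp, tn, fp, fn):
    """Return (OA, KC, F1) from the confusion counts of a binary change map."""
    total = tp + tn + fp + fn
    oa = (tp + tn) / total
    # Chance agreement estimated from the row and column marginals.
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kc = (oa - pe) / (1 - pe)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return oa, kc, f1


# Hypothetical counts for a 100-pixel map.
oa, kc, f1 = change_detection_metrics(tp=40, tn=50, fp=5, fn=5)
```

Because unchanged pixels usually dominate a scene, OA alone can look high even when few changes are found; KC and F1 are less sensitive to that imbalance, which is why the three indices are reported together.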

Experimental Setup
We conduct two experiments on the aforementioned two datasets to verify the accuracy and efficiency of this method, using training and testing paired samples with a ratio of approximately 3:1 as described in Section 4.1. All the paired samples and the corresponding references are of size 256 × 256, with three channels and one channel, respectively.
For the optimization procedure, we train for 200 epochs so that the models converge and apply the Adam solver with a batch size of 1. All the networks are trained from scratch with a learning rate of $2 \times 10^{-4}$. The learning rate is kept constant for the first 100 epochs and is then linearly decayed to 0 over the next 100 epochs. The decay rates for the moment estimates are 0.9 and 0.999, respectively, and the epsilon is $10^{-8}$.
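The learning-rate schedule described above can be written as a small function. This is a sketch under stated assumptions: epochs are taken to be 0-indexed, the decay is taken to reach exactly 0 at epoch 200, and the function name is hypothetical; the exact per-epoch values used by the authors are not given in the text.

```python
def learning_rate(epoch, base_lr=2e-4, constant_epochs=100, decay_epochs=100):
    """Learning rate at a 0-indexed epoch: flat, then linear decay towards 0."""
    if epoch < constant_epochs:
        return base_lr
    progress = (epoch - constant_epochs) / decay_epochs
    return base_lr * max(0.0, 1.0 - progress)


# One value per training epoch (200 epochs in total).
schedule = [learning_rate(e) for e in range(200)]
```

In PyTorch, a multiplier of this shape could be supplied to `torch.optim.lr_scheduler.LambdaLR` on top of `torch.optim.Adam(params, lr=2e-4, betas=(0.9, 0.999), eps=1e-8)`, although the source does not state how the schedule was actually implemented.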
In the present research, the proposed DLSF is implemented in a PyTorch environment, which offers an effective programming interface written in Python. The experiments are performed on a computer with an Intel Core i7, 16 GB RAM and an NVIDIA GTX1080 GPU. The time for one forward and backward propagation on one sample patch pair is approximately 0.8 seconds, and the times for training one epoch on the SZTAKI benchmark and the Shenzhen dataset are approximately 495 and 790 seconds, respectively. With 200 epochs, the total times for training the DLSF on the SZTAKI benchmark and the Shenzhen dataset are approximately 27 and 44 hours, respectively. On the testing datasets, processing one sample patch pair of size 256 × 256 takes just 0.25 seconds.
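The reported timings can be cross-checked with a few lines of arithmetic. The per-epoch and per-step times below are taken from the text; the implied number of patch pairs per epoch is only a rough estimate that assumes the whole epoch time is spent on training steps.

```python
SECONDS_PER_HOUR = 3600
epochs = 200
seconds_per_epoch = {"SZTAKI": 495, "Shenzhen": 790}
seconds_per_step = 0.8  # one forward and backward pass on one patch pair

# Total training time in hours for each dataset.
training_hours = {
    name: epochs * seconds / SECONDS_PER_HOUR
    for name, seconds in seconds_per_epoch.items()
}

# Rough number of patch pairs processed per epoch (ignores any overhead).
steps_per_epoch = {
    name: seconds / seconds_per_step
    for name, seconds in seconds_per_epoch.items()
}
```

The totals come out at about 27.5 and 43.9 hours, consistent with the approximate figures of 27 and 44 hours quoted above.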

Results Presentation
The predicted binary change maps of our proposed DLSF and all the competitors on the SZTAKI AirChange benchmark and the Shenzhen dataset are depicted in Figures 7 and 8, respectively, where the black and white regions indicate the unchanged and changed regions.
As can be seen, since optical images have only three bands, the detection result of CVA contains numerous errors and considerable noise. Specifically, substantial numbers of unchanged pixels are predicted as changed ones, while many changed regions are not detected or are detected as a few discrete regions. This result confirms that pixel-based methods do not consider the relationship between neighboring pixels, and hence their prediction results are not ideal. The SVM-based method gives a better detection result than CVA does, as it incorporates contextual information as auxiliary data. Nevertheless, due to the weak generalization ability of the SVM model, the result of SVM is still unsatisfactory. With the embedding of deep learning technology, CNN learns certain implicit features and gives a better detection result, as shown in Figures 7c and 8c. With the same quantity of training samples, the convergence of GAN is not as good as that of CNN, but its prediction results on the testing samples are better than those of CNN, as presented in Figures 7d and 8d, which indicates that the GAN model generalizes better in change detection. Notably, the results of DSCN and SCCN are largely free from noise, since they are designed specifically for change detection with consideration of the correlation of paired images and the contextual information of paired pixels. The outstanding performance of DSCN and SCCN indicates that Siamese network-based methods are effective and robust for change detection tasks. As Figures 7g and 8g show, our proposed DLSF, which integrates CDLF and FCSN, achieves the best detection result. Compared with the results of DSCN and SCCN, the contours of the changed regions predicted by DLSF are more accurate and smoother, and there is hardly any negative influence from image noise.
With the change detection results of our proposed DLSF and the other comparative methods, the evaluation metrics OA, KC, and F1 for the two datasets are computed and summarized in Table 3. Compared to CVA, SVM, CNN, GAN, DSCN, and SCCN, our proposed DLSF achieves the highest OA, KC, and F1 values of 0.8672, 0.7905, and 0.8066 on the SZTAKI benchmark, and of 0.8986, 0.7716, and 0.8149 on the Shenzhen dataset.

Discussion
Among the methods based on deep learning technology, we believe that the decisive factors are mainly the model architecture and the loss function. Therefore, in this section, we discuss these two factors to verify the uniqueness of our design.

Effect of Model Architectures
Deep neural networks have considerably diverse structures for different image processing tasks. Large and complex network structures have a strong ability for feature representation and extraction, but they may induce overfitting to some extent. In contrast, small and simple network structures improve the generalization and efficiency of the models, but their limited expressiveness may reduce the utilization of image information. Therefore, the final performance depends on choosing a well-balanced model architecture. With regard to change detection in VHR optical remote sensing images, we design the models specifically to suit the training samples and goals, using quantitative experiments. In this subsection, the three main parts of the DLSF are analyzed.

Mapping Generator
With the residual network as the baseline, we design a mapping generator that comprises two down-sampling layers, two up-sampling layers, and several residual blocks. In VHR optical remote sensing images, clear image details provide a tremendous amount of information and complicate change detection. Here, the down- and up-sampling processes not only help the networks learn high-level features, but also reduce the negative impact of tiny ground targets in domain transfer, for example, the layout of cars, seasonal variations in crown size, and minor landslides. We conducted several comparative experiments with diverse mapping generators and the same conditional discriminators and Siamese detectors. The generators differ in the number of down- and up-sampling processes and in the number of residual blocks. The detection results on a typical paired sample are illustrated in Figure 9.
As Figure 9 shows, the mapping generators that comprise more residual blocks give better change detection performance. With the same number of residual blocks, for VHR remote sensing images with resolutions of 1 to 2 meters per pixel, two down- and up-sampling processes are able to filter out most tiny objects and noise. It is noteworthy that mapping generators with more than two down- and up-sampling processes induce certain detection errors.

Conditional Discriminator
In the conventional GAN, both global and patch-based discriminators pursue the same goal of processing real patches into binary maps with all pixel values of 1, and fake patches into binary maps with all pixel values of 0. The backward propagation for these types of discriminators drives the generators to translate all the patch information from fake representations to real ones. This process transforms the domain and revises the information of the original image, and then misleads the Siamese detector afterward. We conducted several experiments with diverse patch-based, global, and our conditional discriminators, and the results are illustrated in Figure 10.
It is noted that the global 256 × 256 discriminator helps the generator produce far more realistic images than the 32 × 32 patch-based discriminator. Nevertheless, all the conventional discriminators revised the original features and suppressed the differences of the paired images simultaneously in the unchanged and changed regions, which induced poor detection results. Our proposed conditional discriminator is able to drive the generators to only translate the unchanged regions from fake representations to real ones, without preserving the changed areas. As has been demonstrated, the adversarial learning with our proposed conditional discriminator is the most effective.

Siamese Detector
In terms of working mechanisms, FCSN has inputs, outputs, and architectures that are similar to those of the conditional discriminator, but requires additional network layers to recognize changed regions. In an ideal situation, perfect discriminators in CDLF will process the unchanged regions of paired patches into the same value of 1, and therefore the pairwise Euclidean distances of the paired pixels are close to 0. In contrast, the changed regions of paired patches will be processed into different random values, and therefore the pairwise Euclidean distances of the paired pixels are greater than 0. This suggests that the weights of the conditional discriminator can provide guidance when shared with the first several layers of the Siamese detector. We conducted two experiments with two frameworks on the Shenzhen dataset. The former framework trains the discriminators and detectors separately, while the latter framework trains the discriminators and detectors with shared weights. The change detection results are illustrated in Figure 11.

Effect of Loss Functions
As the representation of the training goal, the loss function guides the convergence of the models. In order to verify the effectiveness and uniqueness of our loss functions, we conducted several comparative experiments on the SZTAKI and Shenzhen datasets with different losses. The training and testing OA with five different losses are computed and summarized in Table 4. In the following, we interpret in detail the effects of the adversarial, cross-consistency, and self-consistency losses.
As can be seen, when the loss function becomes more complex, the training OAs on the two datasets gradually decline while the testing OAs continuously grow. This confirms that complex loss functions have a weaker fitting ability on the training samples but a stronger generalization ability on the testing samples.
Without the adversarial loss, the feature extractor and decision maker reduce to a simple Siamese network trained with a contrastive loss. As shown in the first and second rows of Table 4, with the addition of GAN, the training and testing OAs have significantly increased, which indicates that using one model to simultaneously detect unchanged and changed regions is difficult.
For the cross-consistency loss, we carefully considered the working mechanism of GAN. The goal of adversarial learning between the generator and discriminator is to pursue the Nash equilibrium of these two networks. The convergence of the objective in this adversarial learning indicates that the models have reached a mutually stable stage, but not necessarily the best ones. It is noteworthy that certain conventional deep learning models such as CNN and FCN perform the best classification and feature extraction due to the participation of references. Therefore, we set the cross-consistency loss here as a direct guidance that helps the adversarial loss learn the best models. As shown in the third and fifth rows of Table 4, the overall accuracy of change detection on the two datasets has prominently increased with the addition of cross-consistency.
As a technical trick, the self-consistency loss aims to reduce the randomness of the mapping generators when training the DLSF. Without the self-consistency loss, the CDLF can be regarded as two "opposite" image-to-image translation models based on the conditional pixel-to-pixel GAN [51]. As shown in the fourth and fifth rows of Table 4, the addition of self-consistency has slightly improved the training and testing OAs on the two datasets. Meanwhile, during the training procedure, self-consistency enables the DLSF to converge rapidly.

Conclusions
In this paper, we propose a DLSF for change detection using bi-temporal VHR optical remote sensing images. With the proposed CDLF, we successfully reduced the domain differences between paired images, and then suppressed the differences of unchanged regions and highlighted the differences of changed regions. Meanwhile, the proposed FCSN successfully detected the changes on bi-temporal images and achieved better detection results than other state-of-the-art methods. Extensive experiments on the SZTAKI benchmark and the Shenzhen dataset confirmed that our proposed method is advantageous with regard to fast processing speed, small model size, and high accuracy.
Nevertheless, the proposed DLSF has two major limitations. First, because the DLSF process is complex, training is slow and the convergence curves are full of oscillations (see Appendix A Figure A1 for more details). Second, in the early period of training, the updates on the two Siamese detectors are ineffective, because the CDLF cannot yet achieve the cross-domain translations at that time. In future studies, on the premise of ensuring detection accuracy, we plan to simplify the full objective and to design an intelligent optimization strategy for training the models.

Figure 2. Dataflow of the domain transfer stream. The values of white and black pixels are 1 and 0, and the values of gray pixels are random between 0 and 1.

Figure 3. Dataflow of the Siamese detector. The changed pixels are white, while the unchanged pixels are black.
Well trained mapping generators: $G^*_{1to2}$ and $G^*_{2to1}$
Well trained Siamese detectors: $S^*_1$ and $S^*$

Figure 5. Sample image pairs and the corresponding change map references of SZTAKI benchmark: (a) images acquired at time $T_1$, (b) images acquired at time $T_2$, (c) reference map.

Figure 6. The overview of Shenzhen dataset and the corresponding change map reference: (a) image acquired at time $T_1$, (b) image acquired at time $T_2$, (c) reference map.

Figure 9. The representative change detection results affected by generators with diverse structures: (a) one down- and up-sampling layer, (b) two down- and up-sampling layers, (c) three down- and up-sampling layers.

Figure 10. The representative domain transfer results affected by diverse discriminators: (a) the fake image translated from domain $T_1$ to $T_2$, (b) the fake image translated from domain $T_2$ to $T_1$, (c) change detection results.
Remote Sens. 2019, 11, x FOR PEER REVIEW 20 of 24

Figure 11. The representative change detection results affected by diverse detectors: (a) image 1, (b) image 2, (c) the changes detected by separate training, (d) the changes detected by weights shared training, (e) reference map.
As Figure 11 shows, the third row gives the change detection results when the discriminators and detectors are trained separately, while the fourth row gives the results when the discriminators and detectors are trained with shared weights. The two performances are almost the same, but the times consumed are different. For the SZTAKI benchmark and the Shenzhen dataset, the times for separate training are approximately 638 and 935 seconds per epoch, respectively, while those for weights shared training are approximately 495 and 790 seconds per epoch, respectively. It is noteworthy that the latter way of training saves nearly 145 seconds per epoch compared with the former.
• GAN [35]: On top of a basic CNN, GAN adds a discriminator for adversarial learning. In many image processing tasks, GAN has better generalization ability than CNN and achieves almost the same performance as CNN with fewer input training samples.
• DSCN [38]: Derived from the Siamese network, DSCN aims to extract robust features from two paired images with one CNN, which has no down- or up-sampling layers. With the convergence of the contrastive loss, the model is able to detect the changed regions by calculating the pairwise Euclidean distance.
• SCCN [39]: As an extension of the Siamese network, SCCN is specifically designed for supervised change detection on heterogeneous remote sensing images. It maps two input images into the same feature space with a deep neural network comprising one convolutional layer and several coupling layers, and then detects the changed regions by calculating the distance between the paired images in the target feature space.

Table 3. Overall accuracy, Kappa coefficients and F1 score over state-of-the-art methods on SZTAKI and Shenzhen datasets, and the size and processing rate of their models. The best values are in bold.

Table 4. Training and testing overall accuracy on SZTAKI and Shenzhen dataset for different losses.