A Deep Siamese Network with Hybrid Convolutional Feature Extraction Module for Change Detection Based on Multi-sensor Remote Sensing Images

Information extraction from multi-sensor remote sensing images has attracted increasing attention with the development of remote sensing sensors. In this study, a supervised change detection method based on a deep Siamese convolutional network with a hybrid convolutional feature extraction module (OB-DSCNH) is proposed for multi-sensor images. The proposed architecture, which is built on dilated convolution, extracts deep change features effectively, and its "network in network" structure increases the depth and width of the network while keeping the computational budget constant. A change decision model detects changes from the differences between the extracted features. Finally, a change detection map is obtained via an uncertainty analysis that combines multi-resolution segmentation with the output of the Siamese network. To validate the effectiveness of the proposed approach, we conducted experiments on multispectral images collected by the ZY-3 and GF-2 satellites. Experimental results demonstrate that the proposed method achieves comparable or better performance than mainstream methods in multi-sensor image change detection.


Introduction
The detection of changes on the surface of the earth has become increasingly important for monitoring the local, regional, and global environment [1]. It has been studied in a number of applications, including land use investigation [2,3], disaster evaluation [4], ecological environment monitoring, and geographic data update [5].
Classical classification algorithms, such as the support vector machine (SVM) [6], extreme learning machine (ELM) [7], and multi-layer perceptron (MLP) [8], as well as unsupervised methods such as change vector analysis (CVA) [9,10] and its integration with Markov random fields (MRF) [11,12], are widely utilized in change detection. With the improvement in spatial resolution, more spatial details are recorded. Therefore, object-based methods are often utilized in change detection tasks, as pixel-based change detection methods may generate high commission and omission errors due to high within-class variation [10]. In this regard, the object-oriented technique has recently attracted increasing attention. The experimental results, obtained on two multi-sensor remote sensing datasets, are presented in Section 3, Section 4 is the discussion, and our conclusions are presented in Section 5.


Data Description
In order to verify the effectiveness of the proposed method, the changes in three areas are investigated. The first area covers part of Tongshan district, China, shown in Figure 1a,b. The second area is located near Dalong lake in Yunlong district, China, shown in Figure 1d-f. Figure 1g,h show the third area, which is located at Yunlong lake in Xuzhou, China. These three datasets represent three types of region: urban, rural-urban fringe, and non-urban areas. The date-1 images were acquired by ZY-3 on 1 October 2014, and the date-2 images were acquired by GF-2 on 5 October 2016. All three datasets comprise blue, green, red, and near-infrared bands, with different resolutions and imaging conditions. Figure 1 shows the images and reference maps of the three datasets. The key technical specifications of the ZY-3 and GF-2 satellites are given in Table 1. The differences between these sensors present challenges for change detection using multi-sensor data.
All datasets were resampled to the same resolution of 5.8 m, and the geometric registration root-mean-square error (RMSE) is 0.5 pixels. The pseudo-invariant feature (PIF) method was applied to achieve relative radiometric correction. The reference maps were obtained via visual interpretation, with the aid of prior knowledge and Google Earth images from the corresponding period, and are shown in Figure 1c,f,i. The manual selection of training samples is a time-consuming process, and the selected samples often provide an incomplete representation. Therefore, in this work the training samples are selected with an automatic analysis process based on differences in multi-feature images.
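The PIF-based relative radiometric correction mentioned above can be sketched as a per-band linear mapping fitted on pseudo-invariant pixels. This is a minimal illustration, not the authors' implementation; the function name, the least-squares fit, and the toy gain/offset values are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def pif_normalize(target_band, reference_band, pif_mask):
    """Fit a linear gain/offset on pseudo-invariant pixels (PIFs) and
    apply it to the whole target band (relative radiometric correction)."""
    x = target_band[pif_mask].astype(float)
    y = reference_band[pif_mask].astype(float)
    gain, offset = np.polyfit(x, y, 1)   # least-squares line through PIFs
    return gain * target_band + offset

# Toy example: the reference band is an exact linear transform of the target,
# so normalization should recover it from the (hypothetical) PIF pixels.
target = rng.uniform(0, 255, size=(8, 8))
reference = 1.8 * target + 12.0
pif_mask = np.zeros((8, 8), dtype=bool)
pif_mask[::2, ::2] = True
normalized = pif_normalize(target, reference, pif_mask)
```

In practice the PIF mask would be derived from temporally stable targets (roads, roofs, bare rock) rather than a regular grid.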
Initial selection of changed and unchanged pixels is conducted by combining the individual detection results from spectral and texture features. Firstly, the Gabor features are constructed in the 0°, 45°, 90°, and 135° directions, with kernel sizes of [7, 9, 11, 13, 15, 17], as the transform-based texture features. Consider original images with X spectral bands; the multi-kernel Gabor feature of one band in one direction is calculated as Equation (1),

g_x = [g_x^7, g_x^9, g_x^11, g_x^13, g_x^15, g_x^17],  (1)

where g_x^k denotes the Gabor feature of the x-th spectral band with kernel size k. 4 × X multi-kernel Gabor features are then obtained.
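The multi-kernel, multi-direction filter bank above can be sketched as follows. The paper specifies only the four orientations and six kernel sizes; the Gabor parameters here (wavelength, sigma, gamma) and the real-part-only formulation are illustrative assumptions.

```python
import numpy as np

def gabor_kernel(size, theta, wavelength=8.0, sigma=3.0, gamma=0.5):
    """Real part of a Gabor kernel with a given size and orientation theta
    (radians). Parameter defaults are illustrative, not from the paper."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)       # rotated coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + (gamma * yr)**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * xr / wavelength)
    return envelope * carrier

# One multi-kernel bank per direction, as in Equation (1):
sizes = (7, 9, 11, 13, 15, 17)
directions = (0, 45, 90, 135)
bank = {d: [gabor_kernel(k, np.deg2rad(d)) for k in sizes] for d in directions}
```

Convolving each of the X spectral bands with one bank per direction yields the 4 × X multi-kernel Gabor features described in the text.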
The difference image D is generated from the two temporal images, with the dataset consisting of the spectral features and the Gabor texture features. Consider the feature images F_1 and F_2, with B dimensions, at times t_1 and t_2; D is calculated as follows:

D^b = |F_2^b - F_1^b|, b = 1, 2, ..., B.  (2)

Each dimension of D must be normalized to the range [0, 1], and the data in the b-th dimension D^b are normalized as follows:

D^b_{i,j} = (D^b_{i,j} - D^b_min) / (D^b_max - D^b_min),  (3)

where D^b_min and D^b_max are the minimum and maximum values of the difference image in the b-th dimension. Equation (4) obtains the initial pixel-based change detection map CD^b on each band:

cd^b_{i,j} = 1 if D^b_{i,j} > T^b, and 0 otherwise,  (4)

where cd^b_{i,j} indicates whether the pixel at position (i, j) in CD^b belongs to the unchanged (0) or changed (1) part, and T^b is calculated according to the mean m^b and standard deviation s^b of the pixels in the b-th dimension. In order to select reliable training and testing samples, the uncertainty of each b-th dimensional difference image is considered, and a conservative decision is made as follows:

p_{i,j} = sum over b of cd^b_{i,j},  (5)

where L_{i,j} = 0, 1 indicates that the label at position (i, j) belongs to the unchanged or changed part, and p_{i,j} is the score counting how many dimensions consider the pixel at position (i, j) to be changed. If the score p is greater than the threshold 0.7 × B, the pixel is labelled "changed"; likewise, if p is less than 0.3 × B, it is labelled "unchanged"; pixels in between are treated as uncertain and excluded. Training samples were selected randomly from these "certain" cases. Patches of a fixed size ω centered on each selected pixel are taken as the input samples; therefore, the inputs to our proposed method are [patch_1, patch_2, label].
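The conservative voting scheme above can be sketched in a few lines. The paper does not give the exact formula for T^b beyond "calculated according to the mean and standard deviation", so the choice T^b = m^b + s^b below is an assumption, as are the function name and the -1 marker for uncertain pixels.

```python
import numpy as np

def select_certain_samples(diff_stack, alpha=0.7, beta=0.3):
    """Conservative labelling from a (B, H, W) stack of difference bands.
    Each band votes via a threshold (assumed here to be mean + std); a pixel
    is labelled changed (1) if more than alpha*B bands agree, unchanged (0)
    if fewer than beta*B agree, and uncertain (-1) otherwise."""
    B = diff_stack.shape[0]
    # Per-band min-max normalization to [0, 1], as in Equation (3)
    mins = diff_stack.min(axis=(1, 2), keepdims=True)
    maxs = diff_stack.max(axis=(1, 2), keepdims=True)
    norm = (diff_stack - mins) / (maxs - mins)
    thresholds = norm.mean(axis=(1, 2)) + norm.std(axis=(1, 2))  # assumed T^b
    votes = (norm > thresholds[:, None, None]).sum(axis=0)       # score p
    labels = np.full(diff_stack.shape[1:], -1)
    labels[votes > alpha * B] = 1      # "changed"
    labels[votes < beta * B] = 0       # "unchanged"
    return labels

# Toy difference stack: one pixel differs strongly in all 4 bands.
diff = np.zeros((4, 5, 5))
diff[:, 0, 0] = 10.0
labels = select_certain_samples(diff)
```

Training patches would then be cut only around pixels labelled 0 or 1, never around the uncertain ones.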

Hybrid Convolutional Feature Extraction Module
When an image patch is input into the model, such as FCN [40], it is firstly convolved and then pooled to reduce the size, and increase the receptive field at the same time. After that, the size of the patch is expanded by up-sampling and deconvolution operations. However, the pooling process gives rise to partial loss of image information. In this case, understanding how to achieve a larger receptive field without pooling has become a new question in the field of deep learning.
Dilated convolution (or atrous convolution) was originally developed for wavelet decomposition [41]; the main idea is to insert "holes" (zeros) between the elements of convolutional kernels to improve the resolution. Expanding the receptive field without loss of resolution or coverage enables deep CNNs to extract effective features [42]. For a kernel of size k and dilation rate r, the effective receptive field is (k - 1)r + 1. As shown in Figure 2a, standard convolution with kernel size 3 × 3 is equal to dilated convolution with rate = 1. Figure 2b illustrates dilated convolution with rate = 2, whose receptive field is larger than that of the standard convolutional operation. Figure 2c shows dilated convolution with rate = 5, where the receptive field reaches 11 × 11.
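The receptive-field growth can be verified with a minimal dilated convolution on an impulse image: the nonzero footprint of a 3 × 3 kernel spans (k - 1)r + 1 pixels per side, i.e. 3, 5, and 11 for rates 1, 2, and 5. This naive loop implementation is for illustration only; a real network would use a framework's dilated convolution.

```python
import numpy as np

def dilated_conv2d(img, kernel, rate):
    """'Same'-padded 2-D cross-correlation with a dilated kernel, stride 1."""
    k = kernel.shape[0]
    eff = (k - 1) * rate + 1            # effective kernel size
    pad = eff // 2
    padded = np.pad(img, pad)
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            for u in range(k):
                for v in range(k):
                    out[i, j] += kernel[u, v] * padded[i + u * rate, j + v * rate]
    return out

impulse = np.zeros((15, 15))
impulse[7, 7] = 1.0
ones = np.ones((3, 3))
for rate in (1, 2, 5):
    resp = dilated_conv2d(impulse, ones, rate)
    rows = np.where(resp.any(axis=1))[0]
    span = rows.max() - rows.min() + 1  # 3, 5, 11 for rates 1, 2, 5
```

Note that the number of learned weights stays at 9 for every rate; only the footprint grows.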
Before the architecture of Inception [35], convolutional layers were simply stacked on top of each other, making the CNN deeper and deeper in the pursuit of better performance. The advent of Inception made the structure of CNNs wider and more diverse. Based on the "network in network" structure, the hybrid convolutional feature extraction module (HCFEM) is developed in this work for extracting effective features from multi-sensor images. As shown in Figure 3, HCFEM includes two units: a feature extraction unit and a feature fusion unit.
Four channels with different convolutional operations compose the extraction unit: (1) a 1 × 1 convolution kernel to increase the nonlinearity of the network and change the dimension of the image matrix; (2) block 1 uses a convolutional layer with dilation rate r = 1; (3) block 2 uses a convolutional layer with dilation rate r = 2; (4) block 3 uses a convolutional layer with dilation rate r = 5. The three blocks apply 3 × 3 convolutions. After the convolution operations of the four channels, feature fusion is carried out: Add 1 refers to the fusion of the results from block 1 and block 2, and Add 2 refers to the fusion of the results from Add 1 and block 3.
Based on dilated convolution and the "network in network" structure, HCFEM can encode objects at multiple scales. With dilated convolution, a deep convolutional neural network (DCNN) is able to control the resolution at which feature responses are computed without learning extra parameters [43]. Moreover, the "network in network" structure increases the depth and width of the network without any additional computational budget.
Figure 4 shows a traditional Siamese neural network, which has two inputs and two branches. In a Siamese neural network, the two inputs are fed into two networks (Network1 and Network2) concurrently, and the similarity of the two inputs is evaluated by a contrastive loss [44]. Based on the architecture of the Siamese network, a change decision approach is proposed with a Siamese convolutional neural network.
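The contrastive loss used to compare the two Siamese branches can be sketched as below. This is one common formulation (Hadsell et al.), not necessarily the exact variant of [44]; the label convention (y = 1 for dissimilar/changed pairs) and the margin value are assumptions.

```python
import numpy as np

def contrastive_loss(f1, f2, y, margin=1.0):
    """Contrastive loss over a batch of embedding pairs f1, f2 of shape
    (batch, dim). y = 0 marks similar (unchanged) pairs, y = 1 dissimilar
    (changed) pairs; dissimilar pairs are pushed at least `margin` apart."""
    d = np.linalg.norm(f1 - f2, axis=1)                     # Euclidean distance
    loss = (1 - y) * d**2 + y * np.maximum(0.0, margin - d)**2
    return loss.mean()
```

Identical embeddings of an unchanged pair contribute zero loss, as does a changed pair already separated by more than the margin.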
Combining the architecture of "network in network" with HCFEM, we design a deep Siamese convolutional network based on HCFEM (DSCNH) for supervised change detection on multi-sensor images. The network consists of two components: an encoding (feature extraction) network and a change decision network. The layers in the encoding network are divided into two streams with the same structure and shared weights, as in a traditional Siamese network. As shown in Figure 5a, each image patch is input into one of these identical streams. Each stream is composed of heterogeneous convolution groups. In each group, the former convolutional module transforms the spatial and spectral measurements into a high-dimensional feature space, from which the subsequent HCFEM (colored yellow in Figure 5) extracts abundant features.
Through two heterogeneous convolution groups and another two normal convolutional modules, the absolute differences of the multiple-layer features are concatenated and input into the change decision network, in which three normal convolutional modules are used to extract difference features. A global average pooling (GAP) layer is applied to decrease the number of parameters and avoid overfitting. The change result is obtained after a fully connected layer. Figure 5a shows the designed deep Siamese convolutional neural network, and Figure 5b shows the change decision network.
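The interface between the two streams and the change decision network can be sketched as follows: per-layer absolute feature differences are concatenated over channels, and GAP reduces each map to one value per channel. The function names are illustrative, and the sketch assumes the compared layers share a spatial size.

```python
import numpy as np

def decision_input(feats1, feats2):
    """Concatenate the per-layer absolute feature differences over channels,
    as fed to the change decision network (assumes equal spatial sizes)."""
    return np.concatenate([np.abs(a - b) for a, b in zip(feats1, feats2)],
                          axis=-1)

def global_average_pool(fmap):
    """Global average pooling: one value per channel of an (H, W, C) map."""
    return fmap.mean(axis=(0, 1))

# Two layers of features per stream, (H, W, C) each:
f1 = [np.ones((4, 4, 2)), np.full((4, 4, 3), 2.0)]
f2 = [np.zeros((4, 4, 2)), np.zeros((4, 4, 3))]
d = decision_input(f1, f2)       # shape (4, 4, 5)
v = global_average_pool(d)       # shape (5,)
```

GAP replaces a large flatten-plus-dense stage with C values, which is where the parameter saving and the reduced overfitting risk come from.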


Bootstrapping and Sampling Method for Training
To train the model properly with limited labelled samples, we introduce a sampling method based on the bootstrapping strategy, which is implemented by constructing a number of resamples, with replacement, of the training samples [45]. Specifically, random sampling is performed to extract a certain number of samples, which are reused together with newly drawn samples in the next training iteration.
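Sampling with replacement per training iteration can be sketched in a few lines; batch size, iteration count, and the generator function are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_iterations(samples, batch_size, n_iters):
    """Yield training subsets drawn with replacement (bootstrapping), so
    individual samples recur alongside newly drawn ones across iterations."""
    for _ in range(n_iters):
        idx = rng.integers(0, len(samples), size=batch_size)
        yield [samples[i] for i in idx]

batches = list(bootstrap_iterations(list(range(100)), batch_size=32, n_iters=5))
```

Because draws are with replacement, some labelled patches appear in several batches while others are skipped in a given iteration, which is the intended effect with a small labelled pool.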

Multi-Resolution Segmentation
The images acquired by multiple sensors often present great variations due to different imaging conditions, which introduces strong noise into change detection. Object-oriented change detection (OBCD) can effectively restrain the influence of this noise. Image segmentation is the primary step in OBCD, and the fractal net evolution approach (FNEA) is an effective and widely used segmentation method for remote sensing imagery [46]. It merges neighboring pixels with similar spectral measurements into homogeneous image objects following the principle of minimum average heterogeneity [47]. In the proposed method, the two temporal images are combined into one dataset by band stacking. The stacked image is then segmented at an over-segmented scale using FNEA, and the segmented objects are merged into multiple scales based on their heterogeneity.
In this work, the optimal segmentation scale S_l according to the GS value is obtained first [18]; then five segmentation scales, [S_{l-2}, S_{l-1}, S_l, S_{l+1}, S_{l+2}], are selected. The optimal image segmentation scale, S_l, is defined as the scale that maximizes the inter-segment heterogeneity and the intra-segment homogeneity [48]. The global Moran's I [49], which measures spatial autocorrelation, is used as the inter-segment heterogeneity measure and is calculated as

MI = n × (sum over i,j of w_ij (y_i - ȳ)(y_j - ȳ)) / ((sum over i,j of w_ij) × (sum over i of (y_i - ȳ)²)),

where w_ij is the spatial adjacency measure of R_i and R_j: if regions R_i and R_j are neighbours, w_ij = 1; otherwise, w_ij = 0. y_i and y_j are the mean values of R_i and R_j, respectively, and ȳ is the mean value of the band over the whole image. Low Moran's I values indicate a low degree of spatial autocorrelation and high inter-segment heterogeneity. The area-weighted average variance is used as the global intra-segment homogeneity measure, calculated as

V = (sum over i of a_i v_i) / (sum over i of a_i),

where a_i and v_i represent the area and variance of segment R_i, respectively, and n is the total number of objects in the segmentation map.
Both measurements are rescaled to the range (0, 1). To assign an overall "global score" (GS) to each segmentation scale, V and MI are combined as the objective function GS = V + MI. For each segmentation, the GS is calculated on every feature dimension, and the average GS over all feature bands is used to determine the best image segmentation scale, identified as the one with the lowest average GS value. For the experimental data, the segmentation scales of the three datasets are set to [30, 35, 40, 45, 50], [25, 30, 35, 40, 45], and [25, 30, 35, 40, 45], respectively. The results at the different segmentation scales are shown in Figure 6.
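The two scale-selection measures can be sketched directly from their definitions. The ring-adjacency example is illustrative; with alternating object means, neighbouring values are maximally anti-correlated and Moran's I reaches -1.

```python
import numpy as np

def global_morans_i(y, w):
    """Global Moran's I from object mean values y (n,) and a symmetric
    0/1 adjacency matrix w (n, n) with zero diagonal."""
    z = y - y.mean()
    return len(y) * (w * np.outer(z, z)).sum() / (w.sum() * (z**2).sum())

def weighted_variance(areas, variances):
    """Area-weighted average variance V (intra-segment homogeneity)."""
    return np.sum(areas * variances) / np.sum(areas)

# Four objects on a ring, neighbouring means alternating in sign:
y = np.array([1.0, -1.0, 1.0, -1.0])
w = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
mi = global_morans_i(y, w)
```

After rescaling both measures to (0, 1) per scale, the scale with the lowest V + MI average over the feature bands is kept.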

Change Detection Framework Combined with Deep Siamese Network and Multi-Resolution Segmentation
Patches of the high-resolution remote sensing images are utilized in DSCNH to extract deep context features and analyze changes in the feature space. However, the learned spatial features are restricted to a fixed neighborhood region. In this regard, we introduce the multi-resolution segmentation algorithm to fully explore each object's spatial information. The pixel-based result obtained by DSCNH can be refined by an additional constraint within the same object, so as to make better use of the spatial information of multi-sensor images.
Suppose the category set ϑ = {C, U}, where C and U represent the changed and unchanged classes, respectively. The inputs are divided into these two categories through DSCNH, and the pixel-based change detection result is obtained. For each scale level l, an object is represented as R_i, i = 1, 2, ..., N, where N denotes the number of objects in level l, and a threshold T is set to classify the objects R_i using Equations (9) and (10):

p_c = n_c / n,  (9)

CD_i = 1 if p_c > T, and 0 otherwise,  (10)

where p_c represents the probability of object R_i belonging to C in level l, and n_c and n are the number of changed pixels and the total number of pixels in object R_i, respectively. If p_c > T, the object R_i is labeled as a changed object; CD_i = 0, 1 indicates that R_i belongs to the unchanged or changed class, respectively.
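The object-level decision can be sketched as a vote over each segment of the pixel-wise map. The default T = 0.5 below is an illustrative placeholder; the paper sets T separately.

```python
import numpy as np

def object_decision(pixel_cd, segments, T=0.5):
    """Object-level refinement of a pixel-wise change map: an object R_i is
    labeled changed when its changed-pixel fraction p_c = n_c / n exceeds T."""
    refined = np.zeros_like(pixel_cd)
    for label in np.unique(segments):
        mask = segments == label
        p_c = pixel_cd[mask].mean()      # fraction of changed pixels in R_i
        refined[mask] = 1 if p_c > T else 0
    return refined

pixel_cd = np.array([[1, 1, 0],
                     [1, 0, 0],
                     [0, 0, 0]])
segments = np.array([[0, 0, 1],
                     [0, 1, 1],
                     [1, 1, 1]])
refined = object_decision(pixel_cd, segments)
```

Because every pixel in a segment receives the same label, isolated misclassified pixels inside a homogeneous object are suppressed, which is how the "salt-and-pepper" noise is reduced.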
The proposed method can thus be regarded as a combination of deep learning and multi-resolution segmentation (OB-DSCNH), comprising image pre-processing, sample selection, change detection based on DSCNH, and decision fusion. The flow chart of the procedure is shown in Figure 7.



Results
In order to demonstrate the effectiveness of OB-DSCNH, images from two sensors on two dates were utilized at three locations. The factors that may impact the performance of the model were explored, including the influence of different patch sizes, which is linked to the size of the receptive field. Five hundred changed and one thousand unchanged regions (patches) were chosen as the labelled dataset; fifty percent were randomly selected as the training set and the rest for testing. The threshold for the uncertainty analysis was set to 0.70 by trial and error. The segmentation scales of the three datasets were set to 40, 30, and 45, respectively, based on cross-validation. All experiments were implemented in Python 3.7.


Experimental Results
We compared the proposed OB-DSCNH with state-of-the-art methods to demonstrate its superiority. The supervised pixel-wise change detection methods of Multiple Linear Regression (MLR), Extreme Learning Machine (ELM), the Artificial Neural Network (ANN), and Support Vector Machine (SVM) were chosen as comparative methods. Moreover, CD based on the deep Siamese multi-scale convolutional network (DSMS-CN) [36], the deep convolutional neural network (DCNN), and the traditional Siamese convolutional neural network (TSCNN) [25] were chosen as representatives of the deep learning methods in the comparison experiments. The patch size used in the deep learning comparison experiments is the same as that of OB-DSCNH. The hyper-parameters of each method were chosen empirically. Figures 8-10 show the change detection results of all compared methods. The unchanged and changed classes are colored in black and white, respectively. It can be seen from the change maps that the changed regions in the first dataset mainly comprise the increased land and roads, and the decreased buildings. The changed regions in the last two datasets are mainly constructions. Compared with the reference change maps shown in Figures 8i, 9i and 10i, the change detection results of OB-DSCNH are the most consistent with the references.
The detection results on the first dataset show that the change maps obtained by MLR, ELM, and SVM contain a large number of falsely detected pixels. As shown in Figure 8d, ANN presents a result similar to the preceding methods, which demonstrates the insufficiency of these classifiers on multi-sensor images. For the second dataset, there is a large area of cultivated land in the southwest of the image. The convention in change detection is that such an area should be judged as unchanged when it is covered by crops. As shown in Figure 9a-d, the common change detection methods fail to extract useful features for the classification task, and there is significant "salt-and-pepper" noise due to the lack of spatial context. As shown in Figures 8e-h and 9e-h, deep convolutional neural networks have a powerful ability to extract spectral and spatial context information. The third dataset contains fewer changes than the first two. From Figure 10a-d, it can be clearly seen that the change maps obtained by ELM, MLR, SVM, and ANN contain many falsely detected pixels in the water area. OB-DSCNH and the other deep learning methods correctly identify the unchanged areas, as shown in Figure 10e-h. Some of the "salt-and-pepper" noise in the change detection results is eliminated after including the segmented object information constraint.
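To make the object constraint concrete, the sketch below relabels each segment by a majority vote of the network's pixel-wise decisions. This is our own simplified illustration, not the authors' implementation (which combines the segmentation with the network output via an uncertainty analysis); the function name and the 0.5 threshold are assumptions:

```python
import numpy as np

def object_constrained_map(pixel_labels, segments):
    """Relabel each segment with the majority pixel-wise decision inside it.

    pixel_labels: binary change map (1 = changed, 0 = unchanged).
    segments: integer segment ids from a multi-resolution segmentation.
    Both arrays share the same shape.
    """
    out = np.zeros_like(pixel_labels)
    for seg_id in np.unique(segments):
        mask = segments == seg_id
        # majority vote of the pixel-wise decisions within the object
        out[mask] = 1 if pixel_labels[mask].mean() >= 0.5 else 0
    return out
```

Because isolated misclassified pixels rarely dominate a segment, this kind of object-level vote suppresses "salt-and-pepper" noise in the pixel-wise map.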

Accuracy evaluation
In order to assess the performance of the proposed approach, four indicators are adopted by comparing the detection results with the ground truth: (1) overall accuracy (OA); (2) Kappa coefficient; (3) commission error; and (4) omission error, which are defined as:

OA = (N11 + N00)/N (8)

Kappa = (OA − Pe)/(1 − Pe), with Pe = [(N11 + N01)(N11 + N10) + (N00 + N10)(N00 + N01)]/N² (9)

Commission error = N01/(N01 + N11) (10)

Omission error = N10/(N10 + N00) (11)

where N11 and N00 are the numbers of changed pixels and unchanged pixels correctly detected, respectively; N10 denotes the number of missed changed pixels; N01 is the number of unchanged pixels in the ground reference that are detected as changed in the change map; and N is the total number of labelled pixels.
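For concreteness, the four indicators can be computed from a binary change map and its reference as follows. This is a minimal sketch under our own naming conventions (the function name and array layout are not from the paper); it follows the counts N11, N00, N10, and N01 defined above:

```python
import numpy as np

def change_detection_metrics(pred, ref):
    """Compute OA, Kappa, commission and omission errors.

    pred, ref: binary arrays (1 = changed, 0 = unchanged).
    """
    pred = np.asarray(pred).ravel()
    ref = np.asarray(ref).ravel()
    n11 = np.sum((pred == 1) & (ref == 1))  # changed pixels correctly detected
    n00 = np.sum((pred == 0) & (ref == 0))  # unchanged pixels correctly detected
    n10 = np.sum((pred == 0) & (ref == 1))  # missed changed pixels
    n01 = np.sum((pred == 1) & (ref == 0))  # unchanged pixels detected as changed
    n = n11 + n00 + n10 + n01
    oa = (n11 + n00) / n
    # expected agreement by chance, used by the Kappa coefficient
    pe = ((n11 + n01) * (n11 + n10) + (n00 + n10) * (n00 + n01)) / n**2
    kappa = (oa - pe) / (1 - pe)
    commission = n01 / (n01 + n11)
    omission = n10 / (n10 + n00)
    return oa, kappa, commission, omission
```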
The accuracies of the change detection for the three datasets are listed in Tables 2-4. It can be clearly seen that the proposed OB-DSCNH obtains a higher change detection accuracy than the other methods, achieving the highest OAs of 0.9715, 0.9468, and 0.9792 on the three datasets. On the first dataset, the OA of OB-DSCNH is superior to that of DSMS-CN by 3.24%, and the Kappa coefficient by 13%. On the second dataset, the OA and Kappa of OB-DSCNH are increased by 2.21% and 5%, respectively, compared with those of DSMS-CN. On the third dataset, the OA of OB-DSCNH is superior to the other deep learning methods by more than 2.9%, and the Kappa coefficient is increased by 16.6% compared with that of DSMS-CN. These results demonstrate the effectiveness and generalizability of the proposed method.

Atmospheric and illumination variations may lead to complicated feature statistics in multi-sensor images, resulting in poor change detection performance for some classical methods. It is evident that the proposed method can extract deep and separable features from the training data for the change detection task. OB-DSCNH outperforms the classical methods, such as SVM and ELM, which can be ascribed to the features extracted by the deep Siamese convolutional network.
Although its omission error is higher than that of DSMS-CN, the proposed method still presents stronger robustness than DSMS-CN on the three datasets.

Discussion
In the proposed network, the input consists of a pair of satellite image patches with an adjustable size. To detect land-cover changes using fine-grained features, the size of the input patch needs to be considered carefully. In this study, five input patch sizes are chosen to analyze their influence on accuracy. The candidate sizes for the three datasets are set to [5,7,9,11,13], [7,9,11,13,15], and [5,7,9,11,13], respectively. The experimental results in this part are obtained without the segmentation constraint, in order to eliminate the influence of the segmentation scale.
The accuracies under different patch sizes are listed in Tables 5-7. It can be seen that, for the first dataset, the model yields the highest OA when the patch size is 5, while the omission ratio is also higher than for the other sizes; the aggregative indicators show that the optimum is 7. For the second dataset, the method achieves the best performance when the patch size is 13. When the patch size is 9, the method performs best on the third dataset.

Generally, most of the changes in the first dataset come from buildings. The relatively single change category and the regular shapes of the changes should be the main reason why the patch size has no significant impact on accuracy for the first dataset. As shown in Table 6, due to the complexity of the surface features in the second dataset, the change detection accuracy improves noticeably when the patch size increases from 7 to 13. Compared with the first dataset, the change scenarios in this area are more complex, such as a large number of buildings being demolished and turned into land. If the patch size is too small, the network cannot fully learn the change information of a surface feature and its surrounding area, which results in the inability to accurately detect these changes.
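The patch-size experiment above relies on extracting a co-registered patch pair of a given odd size around each pixel. The following is a minimal sketch of that step under our own assumptions (the paper does not describe its padding strategy; here edges are handled by reflect-padding, and the function name is ours):

```python
import numpy as np

def extract_patch_pair(img_t1, img_t2, row, col, patch_size):
    """Extract co-registered patches centered at (row, col) from both dates.

    img_t1, img_t2: arrays of shape (H, W, C); patch_size must be odd so the
    pixel of interest sits at the patch center. Edge pixels are handled by
    reflect-padding both images by the patch radius.
    """
    assert patch_size % 2 == 1, "patch size must be odd"
    r = patch_size // 2
    pad = ((r, r), (r, r), (0, 0))
    p1 = np.pad(img_t1, pad, mode="reflect")
    p2 = np.pad(img_t2, pad, mode="reflect")
    # after padding, original pixel (row, col) maps to (row + r, col + r),
    # so the window starting at (row, col) is centered on it
    patch1 = p1[row:row + patch_size, col:col + patch_size, :]
    patch2 = p2[row:row + patch_size, col:col + patch_size, :]
    return patch1, patch2
```

Sweeping `patch_size` over the candidate lists (e.g. 5 to 13) and retraining the network on the resulting samples reproduces the kind of sensitivity analysis reported in Tables 5-7.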

Conclusions
In this paper, we propose a supervised change detection method based on the deep Siamese convolutional network for multi-sensor images. The hybrid convolutional feature extraction module (HCFEM) is designed based on dilated convolution and the "network in network" structure. The proposed method is capable of extracting hierarchical features from the input image pairs that are more abstract and robust than those of the comparative methods. To demonstrate the performance of the proposed technique, multi-sensor datasets at three locations were utilized. Experimental results demonstrate that the proposed method significantly outperforms mainstream methods in multi-sensor image change detection.
However, when the central pixel and its neighborhood are not in the same category, they are still treated as the same class because of the indivisibility of the square input patch, which is a limitation of OB-DSCNH. In future work, taking segmentation objects as training samples will be explored. In addition, unsupervised representation learning methods will also be considered in the detection process.