A Multi-Dimensional Deep Siamese Network for Land Cover Change Detection in Bi-Temporal Hyperspectral Imagery

Abstract: In this study, an automatic Change Detection (CD) framework based on a multi-dimensional deep Siamese network is proposed for CD in bi-temporal hyperspectral imagery. The proposed method has two main steps: (1) automatic generation of training samples using the Otsu algorithm and the Dynamic Time Warping (DTW) predictor, and (2) binary CD using a multi-dimensional Convolutional Neural Network (CNN). Two bi-temporal hyperspectral datasets of the Hyperion sensor with a variety of land cover classes were used to evaluate the performance of the proposed method. The results were also compared to reference data and state-of-the-art hyperspectral change detection (HCD) algorithms. It was observed that the proposed method achieved relatively higher accuracy and a lower False Alarm (FA) rate, where the average Overall Accuracy (OA) and Kappa Coefficient (KC) were more than 96% and 0.90, respectively, and the average FA rate was lower than 5%.


Introduction
Earth's surface constantly changes due to different factors, such as climate change and anthropogenic activities [1]. Detecting these changes helps to understand the relationship between optimum management of various resources and the Earth system [2,3]. Therefore, timely and accurate Change Detection (CD) is essential for understanding the effects of anthropogenic activities on Earth's objects [4][5][6][7]. CD can be performed using different datasets, such as bi-temporal remote sensing imagery [8]. So far, different types of remote sensing datasets have been effectively applied to many applications [9,10] due to several advantages, such as large coverage, low cost, and the availability of archived consistent datasets [11,12]. In this regard, among the most valuable datasets are those collected by hyperspectral sensors. Hyperspectral imagery is collected at very narrow spectral sampling intervals [13][14][15]. The availability of a high number of spectral bands in hyperspectral data considerably facilitates the detection of targets with similar spectral responses [16][17][18]. However, one of the limitations of hyperspectral datasets is their low temporal resolution, which can be efficiently resolved using recent and future series of hyperspectral sensors (e.g., the Hyperspectral Infrared Imager (HyspIRI), PRISMA (PRecursore IperSpettrale della Missione Applicativa), and HypXIM). Moreover, due to the unique content of hyperspectral imagery, the extraction of multitemporal information is challenging. Finally, atmospheric effects, noise, and data redundancy can negatively affect the results of Hyperspectral CD (HCD) [19].
So far, many algorithms have been developed for HCD [20]. It is widely reported that Deep Learning (DL) algorithms have the highest accuracies [21][22][23]. A DL approach can automate the learning of features from input data at several abstraction levels [24,25]. Among different DL frameworks, CNN methods have been extensively applied to remote sensing applications. The quality and quantity of the reference dataset are paramount. To this end, this study utilized two benchmark datasets for HCD. The reference data for both datasets were created using visual analysis of the previous studies which employed these hyperspectral datasets (i.e., [19,29,36,38,42,43]). Figure 2 demonstrates the flowchart of the proposed HCD framework. The proposed method has two main phases: (1) sample data generation; and (2) end-to-end CNN-based binary classification and CD.

Pre-Processing
An essential step in producing accurate CD results is data pre-processing. In the proposed method, the pre-processing step starts with spectral and spatial corrections. Spectral correction of the Hyperion L1R data includes removing no-data bands, correcting striping and noise effects, radiometric correction, and atmospheric correction. Finally, 154 spectral bands were used for HCD.

Sample Data Generation
The primary purpose of the sample data generation phase is to generate reliable samples for the CNN algorithm. To this end, a new framework for sample data generation is proposed in this study. This framework has three main steps: (1) determining sample regions (sub-regions) from which to select the training samples; (2) highlighting change and no-change areas within these sub-regions using the DTW predictor; and (3) selecting reliable change and no-change samples using hierarchical Otsu thresholding.

Sub-Regions Determination
Since the proposed framework utilizes a CNN algorithm for CD, it requires sample data to optimize the hyperparameters. The sample data can be generated from specific regions and then utilized to train the CNN algorithm to generate a change map of the other areas. To this end, it is required to define sub-regions for sample data generation and, subsequently, to extract reliable sample data from those sub-regions. In this study, the sub-regions were randomly identified from different regions well distributed over the study area.

Change/No-Change Prediction
It was required to highlight change and no-change areas over the selected sub-regions. The predictor phase was mainly used to discriminate change areas from no-change areas. To this end, the DTW algorithm was utilized in this study due to its high potential and robustness [44].
The DTW algorithm transforms the global optimization problem into a local optimization problem using a sequence matching method [44,45]. This procedure is defined for the bi-temporal images (X1 and X2, respectively) as follows. The Euclidean distance (Equation (3)) is employed as a metric to measure the similarity of pixels in the bi-temporal hyperspectral images:

d(x_1i, x_2i) = sqrt( Σ_{b=1..B} (x_1ib − x_2ib)² )    (3)

in which x_1ib and x_2ib are the ith pixel of the first and second hyperspectral images at the bth spectral band, and B is the total number of spectral bands.

DTW for two pixels acquired at two different times is defined using Equation (4):

D(i, j) = d(x_1i, x_2j) + min{ D(i − 1, j), D(i, j − 1), D(i − 1, j − 1) }    (4)

where D(i − 1, j), D(i, j − 1), and D(i − 1, j − 1) are the upper, left, and upper left neighboring elements of the cumulative distance matrix computed over the bi-temporal hyperspectral images.
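As a concrete illustration, the recurrence of Equation (4) can be sketched in a few lines of Python; the absolute-difference local cost used here is a simplification of the band-wise Euclidean distance of Equation (3):

```python
import numpy as np

def dtw_distance(s1, s2):
    """Dynamic Time Warping distance between two sequences.

    Sketch of Equation (4): each cell accumulates the local distance plus
    the minimum of the upper, left, and upper-left neighboring elements.
    """
    n, m = len(s1), len(s2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s1[i - 1] - s2[j - 1])       # local distance d(i, j)
            D[i, j] = cost + min(D[i - 1, j],       # upper
                                 D[i, j - 1],       # left
                                 D[i - 1, j - 1])   # upper-left
    return D[n, m]
```

Identical sequences yield a distance of zero, while a large accumulated distance flags a likely change between the two acquisition dates.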

Hierarchical Otsu Thresholding
The Otsu method is a global clustering-based image thresholding method that can automatically separate the objects of an image from the background using a threshold value [46]. The main idea behind this algorithm is to divide the histogram of the input into two segments by minimizing the within-cluster variance [46,47].
Multiple factors, including pixels affected by atmospheric effects and noise, as well as mixed pixels, can cause issues in the results produced by the Otsu algorithm. These factors cause a mix of change and no-change pixels in the histogram. In this study, a new framework was designed to perform thresholding based on a hierarchical Otsu method to resolve these issues. The use of hierarchical thresholding increased the reliability of the training samples. The Otsu thresholding method was first applied to the output of the DTW to discriminate the change and no-change pixels, resulting in an initial change/no-change map. The no-change training pixels were selected from the no-change areas in this initial map. In the next step, the Otsu thresholding method was applied to the output of the DTW only on the change areas of the initial change/no-change map. This resulted in a new change/no-change map in which the change areas were selected with better accuracy. The change training pixels were selected from the change areas of this second change/no-change map.
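The two-level thresholding described above can be sketched as follows; the histogram bin count and the exact refinement scheme are illustrative assumptions:

```python
import numpy as np

def otsu_threshold(values, nbins=256):
    """Otsu threshold: maximizes between-class variance over histogram bins."""
    hist, edges = np.histogram(values, bins=nbins)
    p = hist.astype(float) / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    best_t, best_var = centers[0], -1.0
    for k in range(1, nbins):
        w0, w1 = p[:k].sum(), p[k:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (p[:k] * centers[:k]).sum() / w0
        mu1 = (p[k:] * centers[k:]).sum() / w1
        between = w0 * w1 * (mu0 - mu1) ** 2     # between-class variance
        if between > best_var:
            best_var, best_t = between, centers[k]
    return best_t

def hierarchical_change_mask(dtw_map):
    """Two-level (hierarchical) Otsu on a DTW magnitude map.

    Level I separates change from no-change; Level II re-thresholds only
    the Level-I change pixels to keep the most confident change samples.
    """
    t1 = otsu_threshold(dtw_map.ravel())
    change1 = dtw_map > t1                       # initial change/no-change map
    t2 = otsu_threshold(dtw_map[change1])        # Level II: change pixels only
    confident_change = dtw_map > t2
    return change1, confident_change
```

No-change training pixels would be drawn from outside `change1`, and change training pixels from `confident_change`, mirroring the sample selection described above.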

End-to-End CNN-Based Binary CD
A CNN framework generally contains two main parts, a feature extractor and a class label assignment, which a softmax layer can perform [31]. A CNN includes multiple layers, such as convolution layers, pooling layers, and fully connected layers [48].
The proposed CNN algorithm was based on a Siamese framework and has two main differences compared to the original deep Siamese networks: (1) utilizing multi-dimensional kernel convolutions (i.e., 3-D, 2-D, and 1-D kernel convolutions); and (2) employing depth-wise dilated convolution for investigating spectral information. Figure 3 presents the architecture of the proposed deep Siamese CNN for HCD. The proposed architecture has three 3D dilated convolution layers, a 2D convolution layer, and a 1D convolution layer (Figure 4). The features were then passed to two fully connected layers. Finally, the softmax layer classified the two classes of change and no-change.
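A minimal sketch of such a Siamese network with 3D dilated, 2D, and 1D convolutions is given below (PyTorch); all filter counts, kernel sizes, and dilation rates are hypothetical placeholders, not the paper's actual hyperparameters:

```python
import torch
import torch.nn as nn

class Branch(nn.Module):
    """One branch of the Siamese network (weights shared between dates)."""
    def __init__(self, bands=154):
        super().__init__()
        # three 3D spectrally dilated convolutions (hypothetical sizes)
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 4, (7, 3, 3), dilation=(2, 1, 1), padding=(6, 1, 1)), nn.ReLU(),
            nn.Conv3d(4, 8, (5, 3, 3), dilation=(2, 1, 1), padding=(4, 1, 1)), nn.ReLU(),
            nn.Conv3d(8, 8, (3, 3, 3), dilation=(2, 1, 1), padding=(2, 1, 1)), nn.ReLU(),
        )
        self.conv2d = nn.Sequential(nn.Conv2d(8 * bands, 32, 3, padding=1), nn.ReLU())
        self.conv1d = nn.Sequential(nn.Conv1d(32, 16, 3, padding=1), nn.ReLU())

    def forward(self, x):                    # x: (N, 1, bands, 11, 11)
        y = self.conv3d(x)                   # (N, 8, bands, 11, 11)
        n, c, b, h, w = y.shape
        y = self.conv2d(y.reshape(n, c * b, h, w))   # (N, 32, 11, 11)
        y = self.conv1d(y.reshape(n, 32, h * w))     # (N, 16, 121)
        return y.flatten(1)

class SiameseCD(nn.Module):
    """Weight-shared branches followed by two FC layers for binary CD."""
    def __init__(self, bands=154):
        super().__init__()
        self.branch = Branch(bands)
        feat = 16 * 11 * 11
        self.fc = nn.Sequential(nn.Linear(2 * feat, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, x1, x2):
        f1, f2 = self.branch(x1), self.branch(x2)
        return self.fc(torch.cat([f1, f2], dim=1))   # logits; softmax at inference
```

Both dates pass through the same weight-shared branch, and the concatenated features feed the two fully connected layers, matching the overall structure described above.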

The optimization algorithm used in this study was the Adam optimizer, an adaptive learning rate optimization algorithm. The cost function was cross-entropy (Equation (5)):

CE = − Σ_{i=1..N} y_i log(p(y_i))    (5)
where y is the real label, p(y) is the value predicted by the model, and N refers to the number of classes. The convolution layer consists of a set of filters that automatically generate informative features from the raw input data. The kernels of the convolution layers were designed in different dimensions (1D, 2D, and 3D); Figure 4 presents the main differences between kernel convolutions in different dimensions. The convolution layer uses x as input and provides y in the φth layer using Equation (6):

y_φ = ρ(w_φ * x + b_φ)    (6)

where ρ is the nonlinear activation function, w_φ is the weight of the φth layer, and b_φ is defined as the bias vector of the current layer.
The new output of the ωth feature map in the φth layer (v_φ,ω) at position (x, y, z) for the 3D convolution is defined based on Equation (7):

v_φ,ω^{x,y,z} = ρ( b_φ,ω + Σ_q Σ_{a=0..α−1} Σ_{c=0..β−1} Σ_{e=0..γ−1} w_φ,ω,q^{a,c,e} · v_{φ−1},q^{x+a, y+c, z+e} )    (7)

where q indexes the feature cubes connected to the current feature cube in the (φ − 1)th layer, and α, β, and γ are the length, width, and depth of the convolution kernel, respectively. The output of the 2D convolution layer can be computed analogously using Equation (8):

v_φ,ω^{x,y} = ρ( b_φ,ω + Σ_q Σ_{a=0..α−1} Σ_{c=0..β−1} w_φ,ω,q^{a,c} · v_{φ−1},q^{x+a, y+c} )    (8)
The computation of the 1D convolution at spatial position x can be expressed based on Equation (9):

v_φ,ω^{x} = ρ( b_φ,ω + Σ_q Σ_{a=0..α−1} w_φ,ω,q^{a} · v_{φ−1},q^{x+a} )    (9)
An activation function and a batch normalization function are included in each convolutional layer.
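For a single input and output channel, the 1D convolution of Equation (9) reduces to a short loop; the following naive sketch is illustrative only:

```python
import numpy as np

def conv1d_layer(x, w, b, rho=np.tanh):
    """Naive single-channel 1D convolution in the style of Equation (9).

    y[t] = rho(b + sum_a w[a] * x[t + a]), computed at 'valid' positions.
    Illustrative only; a deep learning framework is used in practice.
    """
    k = len(w)
    out = np.empty(len(x) - k + 1)
    for t in range(len(out)):
        out[t] = rho(b + np.dot(w, x[t:t + k]))
    return out
```

Passing an identity activation makes the sliding dot product easy to inspect; the multi-channel case simply adds the sum over the q feature maps of the previous layer.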

Accuracy Assessment and Comparison
The results of CD were compared with the reference data (see Section 2) by calculating several accuracy indices, including the OA, F1-Score, Precision, KC, Recall, Miss-Detection (MD), and False Alarm (FA) extracted from a confusion matrix (Table 1). The confusion matrix is generated based on four components of True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN).
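All of the listed indices follow from the four confusion-matrix components; the sketch below uses their standard definitions (Table 1 in the paper remains authoritative):

```python
def cd_metrics(tp, fp, tn, fn):
    """Accuracy indices from the binary change-detection confusion matrix.

    FA is the false-alarm rate over true no-change pixels and MD the
    miss-detection rate over true change pixels.
    """
    n = tp + fp + tn + fn
    oa = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    fa = fp / (fp + tn)
    md = fn / (fn + tp)
    # Kappa: observed agreement corrected for chance agreement
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (oa - pe) / (1 - pe)
    return {"OA": oa, "Precision": precision, "Recall": recall,
            "F1": f1, "FA": fa, "MD": md, "KC": kappa}
```

For example, 50 TP, 5 FP, 40 TN, and 5 FN pixels give an OA of 0.90 and a KC of about 0.80.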
Our proposed method was compared with four other advanced HCD methods to assess its performance. Liu et al. [49] proposed an unsupervised Multiple HCD framework using the Spectral Unmixing (MSU) method. The second method was introduced by Jafarzadeh and Hasanlou [39], which utilized the Spectral Unmixing (SU) method and a similarity measure index. The third HCD method was designed by Hou et al. [50] based on a combination of deep, slow features and differencing (DSFA-Diff) methods. Finally, Du et al. [51] introduced an HCD method based on multiple morphological attributes and spectral angle weighted-based local absolute distance (SALA) methods.

Experiments and Results
The weights of the proposed CNN algorithm were initialized using the Glorot normal initializer technique [52]. Moreover, the CNN was trained in a backpropagation manner with the Adam optimizer. The input patch size was 11 × 11 × 154, and the mini-batch size was 500. Additionally, the initial learning rate and the number of epochs were set to 10−4 and 750, respectively.
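The 11 × 11 × 154 input patches can be extracted as sketched below; reflect padding at the image borders is an assumption, since the paper does not state its border handling:

```python
import numpy as np

def extract_patch(cube, row, col, size=11):
    """Extract a size × size × bands patch centred on (row, col).

    The cube is reflect-padded so that border pixels also yield full
    patches (a common choice; the paper's border handling is unstated).
    """
    half = size // 2
    padded = np.pad(cube, ((half, half), (half, half), (0, 0)), mode="reflect")
    # (row, col) in the original cube maps to (row + half, col + half) in
    # the padded cube, so this slice is centred on the requested pixel.
    return padded[row:row + size, col:col + size, :]
```

One such patch per date, stacked into mini-batches of 500, would form the network input described above.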
The numbers of sub-regions for the Farmland-1 and Farmland-2 datasets were set to 2 and 7, respectively. The sample data generation results of the proposed method are demonstrated in Figure 5. As can be seen, the generated samples had a high correlation with the reference data.
The results of the statistical accuracy assessment of the sample data generation method for the sample regions (Figure 5a,d) are provided in Table 2. Based on the results, utilizing the hierarchical Otsu algorithm increased the reliability of the sample data generation. For example, for the Farmland-1 dataset, the OA was 93.97% when the original Otsu method was used (i.e., Level-I), and it increased by 4.5% when the hierarchical Otsu thresholding method was employed (i.e., Level-II). The generated sample data were randomly divided into three groups. Figure 5b,e illustrates the distribution of the reference data, and Table 3 provides the number of generated samples.
The results of the HCD using the proposed framework and the comparison methods for the Farmland-1 dataset are presented in Figure 6. The results differ especially in the areas where no changes occurred (i.e., there are no FN pixels). Relatively more no-change areas were considered as change pixels by the MSU, DSFA-Diff, and SU methods. However, the proposed CD method showed a high performance compared to the reference data (i.e., Figure 6d). Furthermore, there were many FN pixels in the result of the SALA method. Figure 7 shows the errors of the different HCD algorithms for the Farmland-1 dataset. The SU, DSFA-Diff, and MSU CD methods had significant amounts of FP and FN pixels (i.e., red and blue colors, respectively). Overall, the proposed HCD method had the fewest errors. The results of the numerical analysis of the HCD methods for the Farmland-1 dataset are presented in Table 4. Our framework had a higher performance than the other methods: the OA and KC of the proposed framework were more than 96% and 0.91, respectively. The SU method also had good performance in detecting TP pixels, although it had a limitation in detecting TN pixels. The binary change maps for the Farmland-2 dataset using the different HCD methods are also presented in Figure 8. The proposed method provided relatively better performance and lower FA rates.
The CD errors using different methods for the Farmland-2 dataset were also evaluated, and the results are demonstrated in Figure 9. It was observed that the SALA and proposed method had lower FP and MD pixels than the other methods. Overall, the proposed method resulted in the lowest FP pixels but more FN pixels than SALA.
The statistical accuracy indices obtained by the different HCD methods for the Farmland-2 dataset are also presented in Table 5. Based on the results, all HCD algorithms had relatively lower performance than for the Farmland-1 dataset. This issue was more severe for the MSU and SU methods in terms of OA, KC, and PCC. For this dataset, the FA and PCC were considerably different compared to the results of the Farmland-1 dataset. Although the SALA method had the lowest MD rate, it had a high FA rate (more than 4%). Generally, the proposed method provided the highest accuracy and lowest error compared to the other four HCD methods.
An ablation analysis in deep learning frameworks measures a network's performance after removing one or more components to understand how the ablated components contribute to the overall performance. We removed the 1D/2D/3D convolution layers to assess the effects of these layers in our proposed method. In this regard, four scenarios were considered: (S#1) without 1D convolution layers; (S#2) without 2D convolution layers; (S#3) without 3D convolution layers; and (S#4) the proposed method with all components. The results of the ablation analysis of the proposed method for the Farmland-2 dataset are presented in Table 6. The 2D and 3D convolution layers had the highest and lowest impact on the proposed HCD method, respectively.

Discussion
Based on the results, hyperspectral imagery is an excellent resource for CD purposes, providing an average OA of more than 90% using different HCD algorithms. The SU method had a low FA (2.25%) rate and a high MD rate (16.85%) for the Farmland-1 dataset. There is, however, a trade-off between detecting change and non-changed classes. For example, the SALA method provided lower MD rates in both datasets but had higher FA rates than the proposed method. Ideally, a method should be able to detect both change and no-change pixels with the lowest error.
The DTW model is a robust predictor and can efficiently predict the change and no-change areas. One of the main limitations of the original DTW predictor is its long processing time for bi-temporal hyperspectral datasets (about 5 h). However, our proposed method reduced the processing time of HCD to less than one hour. Another limitation of the original DTW predictor is that it utilizes only the spectral dataset for HCD; in this study, both spatial and spectral features were combined to enhance the results of HCD.
DL methods need a high number of sample datasets, which is usually a challenge for bi-temporal datasets. Additionally, the sample dataset's quality and quantity are other challenges for the supervised methods. The proposed method in this study did not require setting parameters and collecting user training data. This could significantly reduce the cost and time.
Recently, several HCD methods were proposed to generate sample data using an unsupervised framework [53]. Mainly, these methods generate the sample data with a traditional predictor (i.e., principal component analysis, change vector analysis) and thresholding methods. Although they obtained promising results, the generation of reliable sample data is still challenging. The unreliable sample data result in low accuracies because the supervised classifiers are trained with false sample data. One of the achievements of the proposed method was refining sample data in a hierarchical thresholding method to improve the reliability of sample data. Additionally, utilizing a robust predictor (i.e., DTW algorithm) helped achieve promising results.
Although hyperspectral imagery contains rich spectral information, spatial features should also be included in the HCD process to obtain more accurate results. Most HCD methods have utilized only spectral features, neglecting the importance of spatial features. On the other hand, although there are many methods to extract spatial features, these features need to be optimized, and optimizing them with optimization algorithms is a big challenge and time-consuming. This issue can reduce the accuracy of HCD using both traditional and advanced methods. The proposed method could automatically extract deep features containing spatial and spectral information. In conclusion, the proposed method yielded accurate and reliable results for HCD.
The proposed method contained multi-dimensional convolution layers to improve the results of HCD. Although there are many DL methods based on only 2-D convolution filters, the proposed method used 3D dilated convolution to employ both spatial features and the relationship between spectral bands and, consequently, improved the accuracy of HCD. Additionally, the dilated convolution increased the receptive field without missing information. Furthermore, this research replaced the 2D convolution layer with 1D kernel convolution to decrease the number of parameters of the proposed network. This also decreased the time of the learning process.
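The receptive-field benefit of dilation mentioned above can be verified with a small helper (illustrative, not from the paper): each stride-1 layer adds dilation × (kernel − 1) to the field, so dilation widens coverage without adding parameters.

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field of stacked stride-1 (dilated) convolutions.

    Each layer grows the field by dilation * (kernel - 1); with dilation 1
    this reduces to the usual kernel stacking.
    """
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += d * (k - 1)
    return rf
```

For instance, three 3-tap layers cover 7 samples without dilation but 13 with dilation 2, at an identical parameter count.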
The proposed HCD method effectively extracted the changes in an automatic framework. In this study, we focused on binary change detection, though multi-change detection could provide more details of changes. Thus, our future work will be focused on developing a new HCD method for multiple change detection.

Conclusions
Developing robust and reliable HCD algorithms is a relatively challenging task due to the specific content of the hyperspectral data, such as a high number of spectral bands and noise conditions. It is also essential to develop an HCD algorithm that can effectively utilize spectral and spatial features within hyperspectral imagery to improve the result of CD. Although DL methods have proved to be efficient for HCD, they require a large number of samples to produce accurate CD results. In this study, a new HCD method based on deep Siamese CNN was proposed for HCD to resolve some of the HCD challenges. The hierarchical Otsu thresholding within the proposed framework improved the performance of the sample generation by producing a high number of reliable sample data. The proposed CNN architecture also improved the HCD by employing spatial and spectral deep features. In addition to comparing the results with the reference data and state-of-the-art CD methods, the proposed method was applied to two bi-temporal hyperspectral datasets. According to the results, hyperspectral imagery has a high potential for CD purposes but requires special techniques to extract change information accurately. Based on visual and statistical accuracy analyses, in comparison with other state-of-the-art HCD methods, the proposed method has the following advantages: (1) it provides a higher accuracy (more than 93%) as well as low MD and FA rates; (2) it is an automatic framework and did not require collecting training data; (3) it is robust to a variety of datasets and land cover classes, (4) it can effectively extract robust deep features using multi-dimensional kernel convolution. All of these advantages illustrate the high potential of the proposed DL framework for different HCD applications.