Unsupervised Deep Noise Modeling for Hyperspectral Image Change Detection

Abstract: Hyperspectral image (HSI) change detection plays an important role in remote sensing applications, and considerable research has focused on improving change detection performance. However, the high dimensionality of hyperspectral data makes it hard to extract discriminative features for hyperspectral processing tasks. Although deep convolutional neural networks (CNNs) have superior capability in high-level semantic feature learning, they are difficult to employ for change detection: the ground truth map is usually reserved for evaluating change detection algorithms, so it cannot be used directly for supervised learning. To better extract discriminative CNN features, this paper presents a novel noise modeling-based unsupervised fully convolutional network (FCN) framework for HSI change detection. Specifically, the proposed method uses the change detection maps of existing unsupervised change detection methods to train the deep CNN, and then removes the noise in those maps during the end-to-end training process. The main contributions of this paper are threefold: (1) a new end-to-end FCN-based deep network architecture for HSI change detection is presented, which learns powerful features; (2) an unsupervised noise modeling method is introduced for the robust training of the proposed deep network; (3) experimental results on three datasets confirm the effectiveness of the proposed method.


Introduction
Hyperspectral images (HSIs) acquired by hyperspectral imaging sensors have been available for research since the early 1980s [1]. With hundreds of nearly continuous spectral bands, HSIs help to distinguish subtle differences between ground objects [2]. This rich spectral information makes HSIs useful in many applications, such as band selection [3][4][5], anomaly detection [6][7][8], image classification [9][10][11], hyperspectral unmixing [12][13][14][15], and change detection [16][17][18][19]. Among them, HSI change detection provides a timely and powerful means of observing our changing planet, which makes it an important research topic. Specifically, change detection is the process of identifying differences in the same ground objects by observing them at different times [20]. Its use in various geologic applications is well documented, for example in disaster monitoring [21], resource and environment management [22,23], and land cover mapping [24,25]. In general, a complete HSI change detection pipeline has three main steps: image preprocessing, change detection map generation, and evaluation. Among these steps, generating the final change detection map is the central problem in change detection research.
Many change detection methods have been proposed in recent years. Simple methods to generate change detection maps are image subtraction and the use of ratios and other mathematical the representation of features. Thirdly, pretrained networks on other classification datasets are used for change detection in an unsupervised way. Pretrained CNN was employed to extract features of zooming levels [46] and these concatenated features were compared to obtain the change results. However, the features extracted from pretrained CNN designed for other datasets are suboptimal for the task of change detection. Furthermore, Kevin et al. [47] utilized the CNN model pretrained for semantic segmentation to detect change areas in an unsupervised way. However, this work heavily depends on the ability of a trained model to perform semantic segmentation. In other words, this method becomes invalid when the datasets do not contain segmentation labels.
Additionally, change detection datasets are scarce because labeling each pixel in an HSI is labor-intensive and time-consuming. Although supervised methods exist, some of them rely on pseudo-training datasets rather than truly labeled data. Unsupervised methods, in contrast, are usually independent of the dataset and have more practical applications: they need no labeled data and can be applied under various conditions. This raises the question of how to learn the change detection map with deep learning, without labeled data, while obtaining results competitive with supervised methods.
Considering these problems, how to effectively exploit the rich spectral information in high-dimensional data and distinguish useful information from noise in an unsupervised way is an important research question [15,48,49,50]. In this work, a novel noise modeling-based unsupervised fully convolutional network (FCN) framework is proposed for the HSI change detection task. The proposed end-to-end framework learns the latent change detection map by excluding the noise in the change detection maps of existing unsupervised methods. We consider how to improve performance by training the FCN without labeled data in order to obtain the final change detection map. To this end, this paper develops an end-to-end unsupervised framework consisting of three modules: the FCN-based feature learning module, the two-stream feature fusion module, and the unsupervised noise modeling module. These three modules work together to improve the change detection accuracy. The main contributions of this work are summarized as follows.
(1) A new FCN-based deep network architecture is designed to learn powerful features for the task of change detection. The proposed architecture works in an end-to-end manner, which minimizes the final change detection cost function to avoid error accumulation.
(2) An unsupervised noise modeling module is introduced for the robust training of the proposed deep network in the task of HSI change detection. By excluding the noise in an unsupervised way, the performance is improved effectively.
(3) Extensive experimental results on three datasets demonstrate the proposed method's superior performance. It not only achieves a better performance than common unsupervised approaches, but is also competitive with some supervised approaches.
The rest of this paper is organized as follows. The details of the proposed method are introduced in Section 2, the performance of the method is evaluated in Section 3, and Section 4 concludes the paper.

Methodology
In this section, we introduce the proposed unsupervised HSI change detection framework in detail. As illustrated in Figure 1, the proposed method consists of three main modules. The first is the FCN-based feature learning module, which is designed to learn discriminative features from high-dimensional data. As HSI change detection is treated as a segmentation task, we employ an FCN-based deep learning framework as the backbone. The second is the two-stream feature fusion module, which fuses the features of the two input images. Unlike traditional image classification, change detection involves two HSIs, and how to fuse the two-stream features is still an open problem. The third is the unsupervised noise modeling module, which reduces the influence of noise and enables robust training of the proposed network.
In this module, the change detection maps produced by traditional unsupervised change detection methods are used as noisy training targets, and the noise they contain is excluded during training. After these three modules, the final change detection map is obtained with improved accuracy. The three modules are described in the following sections.

FCN-Based Feature Learning Module
Deep CNN-based HSI change detection methods usually sample patches for feature extraction and classification. However, patch-wise training and testing are inefficient and require extra pre- and post-processing. Since change detection requires assigning a label to every pixel of the input HSIs, it is essentially a segmentation task. Motivated by this, we design an unsupervised FCN-based HSI change detection framework.
To clearly present the proposed FCN-based network architecture, we first introduce the convolution and deconvolution layers. Convolution is a basic feature learning component of CNNs, which operates on local input regions. Suppose the input data is $x_{ij}$, where $(i, j)$ is the spatial coordinate. Then, the convolution output $y_{ij}$ is computed by

$$y_{ij} = f_l\left(\{x_{si+\delta_i,\, sj+\delta_j}\}_{0 \le \delta_i, \delta_j < k}\right),$$

where $s$ is the convolution stride and $k$ is the kernel size. $f_l$ is the specific operation of the $l$-th layer; for convolution it is a matrix multiplication. The output $y_{ij}$ then becomes the input of the next layer. By stacking these convolution layers, features at a high semantic level are extracted. For a convolution layer, if the stride $s$ is set to 2, the spatial size of the output is half of the input size. Deconvolution, also named transposed convolution, is a special kind of convolution layer that is used to enlarge the output spatial size. As the weight of the deconvolution layer is transposed, the output spatial size is twice the input size if the stride $s$ is set to 2.
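The halving and doubling behavior described above follows from the usual output-size arithmetic for strided and transposed convolutions. The sketch below illustrates it in plain Python. The kernel sizes and paddings are hypothetical values chosen only so that stride 2 halves and doubles the size exactly; they are not taken from the paper's Table 1.

```python
def conv_output_size(n, kernel, stride, padding=0):
    """Spatial size after a strided convolution (floor convention)."""
    return (n + 2 * padding - kernel) // stride + 1


def deconv_output_size(n, kernel, stride, padding=0, output_padding=0):
    """Spatial size after a transposed convolution (deconvolution)."""
    return (n - 1) * stride - 2 * padding + kernel + output_padding


# Hypothetical layer settings (assumptions, not the paper's configuration).
h = 64
h_down = conv_output_size(h, kernel=3, stride=2, padding=1)        # encoder step
h_up = deconv_output_size(h_down, kernel=4, stride=2, padding=1)   # decoder step
print(h, h_down, h_up)  # prints "64 32 64"
```

With these settings, one stride-2 convolution followed by one stride-2 deconvolution restores the input size, which is the property the decoder relies on to produce a full-resolution map.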
The FCN architecture of the proposed work, which consists of an encoder module and a decoder module, is illustrated in detail in Figure 2. Convolution layers are used in the encoder module to distill the essential information from the original data, while deconvolution layers are used in the decoder module to recover the input spatial size. We also compare the FCN-based change detection architecture with the patch-wise CNN change detection method. No pre- or post-processing is needed in the FCN-based framework, and it can be trained in an end-to-end manner. In the change detection task, the two HSIs are captured at the same position but at different times. Thus, the same FCN-based network is applied to both HSIs so that their deep features lie in the same space. Considering $HSI_1$ and $HSI_2$ as the two input HSIs, their feature extraction with the FCN-based network can be formulated as

$$F_{HSI_1} = f_{decoder}(f_{encoder}(HSI_1)), \qquad F_{HSI_2} = f_{decoder}(f_{encoder}(HSI_2)),$$

where $f_{encoder}$ and $f_{decoder}$ are the encoder and decoder networks, respectively. The final convolution feature maps of the two input HSIs are denoted as $F_{HSI_1} \in \mathbb{R}^{N \times C \times H \times W}$ and $F_{HSI_2} \in \mathbb{R}^{N \times C \times H \times W}$, which are used in the following feature fusion module. $N$ is the batch size, and $C$ is the channel size. $H$ and $W$ are the height and width of the final feature maps, respectively, which are the same as those of the input HSIs. The architecture details of the proposed network are given in Table 1.

Two-Stream Feature Fusion Module
Because HSI change detection takes two images as the input, it is different from the traditional image segmentation task which only uses one. In the change detection task, the main target is comparing the features of two HSIs in order to determine whether one pixel has changed or not. Thus, how to fuse these two branch features together is critical for the performance of change detection.
In order to fuse the features extracted by the two CNN branches, a variety of strategies have been proposed. These methods can be roughly divided into three types: (1) image-level fusion; (2) feature-level fusion; (3) score map-level fusion. In this work, image-level fusion is not suitable because the large number of HSI channels dramatically increases the computational complexity. Moreover, image-level pre-processing may suffer from the noise in raw HSIs, which decreases the performance. Considering the subsequent noise modeling module, score map-level fusion is also not applicable [51]. Thus, we employ the feature-level fusion strategy in this work to fuse the features extracted from two HSIs at different times. Feature-level fusion is illustrated in the accompanying figure. We perform feature-level fusion on the final layer of the FCN network [52], i.e., on $F_{HSI_1} \in \mathbb{R}^{N \times C \times H \times W}$ and $F_{HSI_2} \in \mathbb{R}^{N \times C \times H \times W}$. The reason is that the final feature maps have the highest spatial resolution and abstraction level. Under this setting, we propose three types of feature fusion [53]. The first is concatenation, which is widely used to fuse multiple features and is effective in many applications. We concatenate the two feature maps along the second (channel) axis into one, $F_{HSI} \in \mathbb{R}^{N \times 2C \times H \times W}$, which is then used to generate the change detection map. The second is element-wise summation: by adding the two feature maps together, we obtain the final feature maps $F_{HSI}$ for the subsequent noise modeling module. Similarly, the third is element-wise subtraction. All the aforementioned feature fusion strategies can be formulated as

$$F_{HSI} = \begin{cases} [F_{HSI_1}, F_{HSI_2}] & \text{(concatenation)} \\ F_{HSI_1} + F_{HSI_2} & \text{(element-wise summation)} \\ F_{HSI_1} - F_{HSI_2} & \text{(element-wise subtraction)} \end{cases}$$

After the feature fusion module, the fused feature map $F_{HSI} \in \mathbb{R}^{N \times 2C \times H \times W}$ or $F_{HSI} \in \mathbb{R}^{N \times C \times H \times W}$ is generated, and it is passed to the following noise modeling module to estimate the 'true' change detection map.
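The three feature-level fusion strategies can be sketched with NumPy arrays standing in for the two feature maps. The function name `fuse_features` and the mode strings are hypothetical names introduced for this illustration, and the tensor sizes are arbitrary.

```python
import numpy as np


def fuse_features(f1, f2, mode="concat"):
    """Fuse two (N, C, H, W) feature maps at the feature level."""
    if f1.shape != f2.shape:
        raise ValueError("both feature maps must have the same shape")
    if mode == "concat":
        # Concatenate along the channel axis: (N, 2C, H, W).
        return np.concatenate([f1, f2], axis=1)
    if mode == "sum":
        return f1 + f2  # element-wise summation: (N, C, H, W)
    if mode == "sub":
        return f1 - f2  # element-wise subtraction: (N, C, H, W)
    raise ValueError(f"unknown fusion mode: {mode}")


# Random stand-ins for the final feature maps of the two branches.
rng = np.random.default_rng(0)
f1 = rng.standard_normal((2, 8, 16, 16))
f2 = rng.standard_normal((2, 8, 16, 16))
for mode in ("concat", "sum", "sub"):
    print(mode, fuse_features(f1, f2, mode).shape)
```

Only concatenation changes the channel count, so the layer that consumes the fused map needs 2C input channels in that case and C otherwise.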

Unsupervised Noise Modeling Module
Since ground truth labels cannot be used for training in the HSI change detection task, we propose to use the change detection results of existing unsupervised change detection methods to train the FCN network. These samples, which are obtained without manual labeling and are therefore not real labeled data, are used to train the network. However, the change detection results of existing unsupervised methods such as CVA [26], PCA-CVA [30], and IR-MAD [34] contain errors, which means that directly training the network with these samples will limit the performance. Inspired by the work of [54], we aim to improve the performance by excluding the noise in the existing change detection maps. Specifically, the noisy but informative change detection maps of existing unsupervised methods are used to train the proposed network.
The two input HSIs are denoted as $HSI_1$ and $HSI_2$, and the $K$ existing change detection maps produced by unsupervised change detection methods form the training dataset $Y_p = \{y_i \in \mathbb{R}^{H \times W}, i = 1, ..., K\}$. These maps are not 100% accurate, which means that they contain noise that is difficult to remove. The output of the proposed FCN-based network is employed to estimate the 'true' change detection map. This can be represented as

$$\bar{y}_t = f_d(F_{HSI}),$$

where $f_d$ is the decoder network. With the two input HSIs, the estimated change map $\bar{y}_t$ can also be formulated as $\bar{y}_t = f(HSI_1, HSI_2; \Theta)$, where $f$ denotes the whole network. Specifically, to exclude the noise in the training dataset $Y_p$, each pixel in $y_i$ is modeled as the sum of $\bar{y}_t$ and a Gaussian noise $n_i$: $y_i = \bar{y}_t + n_i$. In practice, given the FCN network parameters $\Theta$ and the two input HSIs, the noise module can be formulated as

$$y_i = f(HSI_1, HSI_2; \Theta) + n_i,$$

where $n_i$ is a noise map that follows a zero-mean Gaussian distribution. A zero-mean Gaussian is assumed for $n_i$ because Gaussian noise may be the best approximation of real noise when the real noise is particularly complex, and because it is simple and easy to compute. The parameter $\Sigma$ of the distribution of $n_i$ can be estimated with the following steps during training. Firstly, a noise map $n_i \in \mathbb{R}^{H \times W}$ is generated from a prior Gaussian distribution $p_i(0, \Sigma)$. Then, the computed noise map $\hat{n}_i$ is denoted as

$$\hat{n}_i = y_i - \bar{y}_t,$$

and the corresponding distribution $q(\Sigma_i)$ can be estimated from $\hat{n}_i$. Finally, the Kullback-Leibler (KL) divergence loss is used to enforce the computed noise map to approximately obey the prior Gaussian distribution $p_i(0, \Sigma)$. The loss function can be formulated as

$$L_{KL} = \sum_{i=1}^{K} KL\left(q(\Sigma_i) \,\|\, p_i(0, \Sigma)\right).$$

With these steps, the noise is separated from $Y_p$ and the change detection performance is improved in an unsupervised manner.
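The noise-modeling steps can be illustrated with a small NumPy sketch. It makes simplifying assumptions that go beyond the text: each noise map is summarized by a single scalar variance, the fitted distribution q is a zero-mean Gaussian with that variance, and the closed-form KL divergence between two zero-mean univariate Gaussians is used. The function names and the synthetic data are hypothetical; only the prior variance of 0.1 comes from the parameter setup reported later in the paper.

```python
import numpy as np


def kl_zero_mean_gaussians(var_q, var_p):
    """Closed-form KL( N(0, var_q) || N(0, var_p) ) for scalar variances."""
    return 0.5 * (np.log(var_p / var_q) + var_q / var_p - 1.0)


def noise_kl_loss(noisy_maps, y_t, prior_var=0.1, eps=1e-8):
    """Sum over the K noisy maps of the KL term between fitted and prior noise."""
    total = 0.0
    for y_i in noisy_maps:
        n_hat = y_i - y_t                    # computed noise map: y_i - y_t
        var_q = np.mean(n_hat ** 2) + eps    # variance under the zero-mean assumption
        total += kl_zero_mean_gaussians(var_q, prior_var)
    return float(total)


# Synthetic stand-ins: one latent map and K = 3 noisy versions of it.
rng = np.random.default_rng(0)
y_t = rng.uniform(0.0, 1.0, size=(32, 32))
noisy_maps = [y_t + rng.normal(0.0, np.sqrt(0.1), size=y_t.shape) for _ in range(3)]
print(noise_kl_loss(noisy_maps, y_t))
```

Because the synthetic noise is drawn with the same variance as the prior, the printed loss is close to zero; residuals whose variance differs from the prior would give a larger value.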

End-to-End Training
As mentioned above, the proposed FCN-based network first learns the latent change detection map, and is then trained to generate the 'true' change detection map with the training dataset $Y_p$ and the noise modeling module. Since no ground truth labels are used in this work, the framework remains unsupervised and works in an end-to-end manner. For the change detection map generation, the cross-entropy loss is employed for supervision. Given the predicted change detection map $\bar{y}_t$ and the training dataset generated by existing unsupervised change detection methods, $y_i$, $i = 1, ..., K$, the cross-entropy loss $L_{cls}$ can be computed by

$$L_{cls} = -\sum_{i=1}^{K} \sum_{(h,w)} \left[ y_i^{(h,w)} \log \bar{y}_t^{(h,w)} + \left(1 - y_i^{(h,w)}\right) \log\left(1 - \bar{y}_t^{(h,w)}\right) \right].$$

The aforementioned KL divergence loss $L_{KL}$ is optimized jointly with the cross-entropy loss $L_{cls}$. Thus, the total loss for the training of the whole framework is denoted as

$$L = L_{cls} + \lambda L_{KL},$$

where the parameter $\lambda$ balances the two loss functions. In practice, $\lambda$ is set to 0.001. Moreover, in order to make the training process more stable, the FCN network is first trained until it converges. Then, the whole framework is trained with the loss $L$. The training details of the proposed denoising module are depicted in Figure 4.
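A minimal NumPy sketch of how the two losses combine is given below. It assumes a per-pixel binary cross-entropy averaged over pixels and summed over the K noisy maps; the exact normalization used by the authors is not stated, so this is an assumption. The function names are hypothetical, and the KL value is passed in as a plain number.

```python
import numpy as np


def binary_cross_entropy(y_pred, y_target, eps=1e-7):
    """Mean per-pixel binary cross-entropy between a prediction and one target map."""
    p = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(y_target * np.log(p) + (1.0 - y_target) * np.log(1.0 - p)))


def classification_loss(y_pred, noisy_maps):
    """Cross-entropy of the prediction against each of the K noisy maps, summed."""
    return sum(binary_cross_entropy(y_pred, y_i) for y_i in noisy_maps)


def total_loss(l_cls, l_kl, lam=0.001):
    """Total loss L = L_cls + lambda * L_KL, with lambda = 0.001 as in the text."""
    return l_cls + lam * l_kl


# Toy example: a confident prediction scored against two binary pseudo-label maps.
y_pred = np.array([[0.9, 0.1], [0.8, 0.2]])
noisy_maps = [np.array([[1.0, 0.0], [1.0, 0.0]]), np.array([[1.0, 0.0], [0.0, 0.0]])]
l_cls = classification_loss(y_pred, noisy_maps)
print(total_loss(l_cls, l_kl=2.0))
```

With such a small weight, the KL term acts as a mild regularizer on the estimated noise, while the cross-entropy term drives the prediction.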

Experiments
In this section, extensive experiments are carried out on three HSI datasets to evaluate the proposed method. Firstly, the employed HSI change detection datasets are introduced. Then, we describe the experimental details, including the evaluation measures and the parameter setup. Finally, the experimental results of the proposed method and other state-of-the-art methods are analyzed in detail.

Datasets
In order to evaluate the performance of the proposed method, three HSI change detection datasets are employed, which are from Earth Observing-1 (EO-1) Hyperion hyperspectral sensor. The EO-1 Hyperion sensor offers imagery at 30 meter spatial resolution [55] and the spectral range is from 0.4 to 2.5 µm [56]. Additionally, it provides 10 nm spectral resolution and 7.7-km swath width [57]. Each dataset consists of three images, which are HSI of time 1, HSI of time 2, and a ground-truth map. The two real HSIs are photographed at the same place but different times and the corresponding ground-truth map is a binary image. The white pixels in the ground-truth map indicate the changed part while the black pixels mean the unchanged objects. The detailed descriptions of three HSI change detection datasets are shown as follows:

Evaluation Measures
Evaluation measures are essential for analyzing the performance of change detection methods. In this work, the overall accuracy (OA) and the kappa coefficient are employed to evaluate different change detection methods. Four counts are used in the calculation of OA and the kappa coefficient: (1) true positives (TP), the number of changed pixels that are correctly detected; (2) true negatives (TN), the number of unchanged pixels that are correctly detected; (3) false positives (FP), the number of unchanged pixels that are wrongly detected as changed; (4) false negatives (FN), the number of changed pixels that are wrongly detected as unchanged. Specifically, the OA is defined as

$$OA = \frac{TP + TN}{TP + TN + FP + FN}.$$

The kappa coefficient is a consistency measure that is used to evaluate classification accuracy. In the change detection task, it indicates the consistency between the change detection map and the ground-truth map. The larger the kappa coefficient, the better the performance of the corresponding method. The kappa coefficient is denoted as

$$\kappa = \frac{OA - PRE}{1 - PRE},$$

where

$$PRE = \frac{(TP + FP)(TP + FN) + (FN + TN)(FP + TN)}{(TP + TN + FP + FN)^2}.$$
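The two measures can be computed directly from the four counts. The sketch below uses the standard definitions of overall accuracy and Cohen's kappa for a binary map; the function names and the toy counts are illustrative only.

```python
import numpy as np


def confusion_counts(pred, truth):
    """Return (TP, TN, FP, FN) for two binary maps where 1 marks changed pixels."""
    pred = np.asarray(pred).astype(bool)
    truth = np.asarray(truth).astype(bool)
    tp = int(np.sum(pred & truth))
    tn = int(np.sum(~pred & ~truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    return tp, tn, fp, fn


def overall_accuracy(tp, tn, fp, fn):
    """Fraction of pixels that are classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def kappa_coefficient(tp, tn, fp, fn):
    """Cohen's kappa: agreement beyond what is expected by chance (PRE)."""
    n = tp + tn + fp + fn
    oa = (tp + tn) / n
    pre = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return (oa - pre) / (1.0 - pre)


# Toy counts: 100 pixels, 90 of them classified correctly.
print(overall_accuracy(40, 50, 5, 5))
print(kappa_coefficient(40, 50, 5, 5))
```

For these counts the chance agreement PRE is 0.505, so kappa is noticeably lower than OA. Kappa is undefined when PRE equals 1, for example when both maps contain a single class.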

Parameter Setup
All of the experiments are conducted on Ubuntu 18.04 with four Nvidia TITAN X Pascal cards. In this work, change detection is treated as a segmentation task. The training details are as follows. ImageNet [58] pre-trained parameters are used to initialize the FCN encoder network. The Adam [59] optimizer with a decay of 1e-8 is employed for training. The initial learning rate is set to 4e-4 for all three datasets, and no learning rate reduction strategy is used during training. For each dataset, we train the whole network for 1200 iterations. For the first 500 iterations, we train only the FCN network. For the remaining 700 iterations, the whole network, including the denoising module, is trained and updated. In the training of the denoising module, the initial variance of the prior Gaussian distribution is set to 0.1. It is worth mentioning that all the parameters are the same for the three change detection datasets.
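The two-stage schedule described above can be written as a small helper that returns the weights applied to the two losses at a given iteration. This is a sketch under the assumption that the first stage uses only the cross-entropy loss and that the second stage adds the KL term with weight 0.001; the function name `loss_weights` is hypothetical.

```python
def loss_weights(iteration, warmup_iters=500, total_iters=1200, lam=0.001):
    """Return (weight of L_cls, weight of L_KL) for a 0-based training iteration."""
    if not 0 <= iteration < total_iters:
        raise ValueError("iteration is outside the training schedule")
    if iteration < warmup_iters:
        return 1.0, 0.0   # stage 1: train the FCN only
    return 1.0, lam       # stage 2: train the whole framework, denoising included


# Count how many iterations fall into each stage.
stage_one = sum(1 for it in range(1200) if loss_weights(it)[1] == 0.0)
stage_two = 1200 - stage_one
print(stage_one, stage_two)  # prints "500 700"
```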
The parameter values of the other tested methods are as follows. For the CVA method, a MATLAB implementation of CVA is used for HSI change detection, and K-means is used to output the final change detection map. For the PCA-CVA method, PCA pre-processing with a contribution rate of 0.75 is employed for dimensionality reduction, and the CVA method is then applied to the PCA outputs. For IR-MAD, we adopt the MATLAB code of Reference [34]. The SVM uses the same pseudo-training dataset as GETNET. We set C to 2.0 for the SVM classifier and use the linear kernel. For the CNN method, a patch size of 5 is employed to sample pseudo-training patches, and the network backbone is the same as that of GETNET but without the affinity matrix. For GETNET, we follow the parameter settings described in Reference [44].
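To make the CVA baseline concrete, the sketch below computes the per-pixel magnitude of the spectral change vector and splits the magnitudes into two clusters with a simple one-dimensional 2-means procedure. It is an illustrative NumPy stand-in, not the MATLAB implementation used in the experiments; the function names and the toy data are hypothetical.

```python
import numpy as np


def cva_magnitude(hsi1, hsi2):
    """Per-pixel Euclidean norm of the spectral difference; inputs are (H, W, B)."""
    diff = hsi2.astype(np.float64) - hsi1.astype(np.float64)
    return np.sqrt(np.sum(diff ** 2, axis=-1))


def two_means_mask(values, n_iter=100):
    """Cluster scalar values into two groups; True marks the high (changed) group."""
    flat = values.ravel()
    c_low, c_high = flat.min(), flat.max()   # initialize centroids at the extremes
    for _ in range(n_iter):
        high = np.abs(flat - c_high) < np.abs(flat - c_low)
        if high.all() or not high.any():
            break                            # degenerate case: only one cluster
        new_low, new_high = flat[~high].mean(), flat[high].mean()
        if new_low == c_low and new_high == c_high:
            break                            # centroids stopped moving
        c_low, c_high = new_low, new_high
    return np.abs(values - c_high) < np.abs(values - c_low)


# Toy bitemporal pair: a 2x2 block changes in every one of 10 bands.
hsi1 = np.zeros((6, 6, 10))
hsi2 = hsi1.copy()
hsi2[1:3, 1:3, :] = 5.0
change_map = two_means_mask(cva_magnitude(hsi1, hsi2))
print(int(change_map.sum()))  # prints "4"
```

A binary map of this kind is the sort of noisy pseudo-label that the proposed framework takes as a training target.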

Comparison Results
In this section, extensive experiments are conducted to prove the effectiveness of the proposed method. Specifically, we compare it with other change detection methods including unsupervised approaches such as CVA [26], PCA-CVA [30], and IR-MAD [34], and supervised approaches such as support vector machines (SVMs) [60], patch-based CNN and GETNET [44]. The detailed results of the three datasets of seven different methods are presented in Table 2. These methods are divided into 'Pixel based', 'Patch based', and 'FCN based'. For each dataset, the proposed method is compared with the other six methods. In addition, the learned change detection maps and the estimated noise maps are further visualized.

Farmland Dataset
For the farmland dataset, the visualization results of the seven different methods are presented in Figure 6. Specifically, the proposed unsupervised method achieves competitive performance with the supervised method GETNET, and outperforms the other methods. Since the proposed method uses the results of 'CVA' and 'PCA-CVA', we compare the performance of them with ours. From the results we can see that the proposed denoising module can effectively exclude the noise with the results of 'CVA' and 'PCA-CVA', and improve the performance in an unsupervised manner. Although GETNET can outperform our method, it employs patch-based CNN network architecture, which is more time-consuming. Since there exists noise in the pseudo-dataset used in SVM and CNN, their performances are lower than the unsupervised CVA, PCA-CVA, and IR-MAD methods. Thus, the noise in the training set can do harm to the accuracy of these methods. However, with the denoising module of the proposed method, our method outperforms the CVA, PCA-CVA, and IR-MAD methods. In addition, the kappa coefficient of the proposed method is also high, indicating that the consistency between our method and the ground-truth map is almost perfect. To sum up, our method is competitive with GETNET and achieves a better performance than the other methods on this dataset.

Countryside Dataset
The countryside dataset is more complicated than the farmland dataset, as shown in Figure 7. For this dataset, our method achieves a result similar to that of the supervised GETNET method and outperforms the other methods, both supervised and unsupervised. Specifically, the PCA-CVA, IR-MAD, and patch-based CNN methods yield a lower OA than on the farmland dataset, because this dataset contains more scattered points that degrade these noise-sensitive methods. It is worth mentioning that our method achieves a better performance than the patch-based CNN, which is a supervised deep learning method. This result indicates that the proposed method is robust to noise and clearly improves the performance. Additionally, regarding the kappa coefficient, our method is second only to GETNET. In summary, the proposed framework achieves the second-best OA and is superior to the other five change detection methods.

Poyang Lake Dataset
The Poyang lake dataset contains more scattered points and is more challenging for pseudo-dataset-based methods. For this dataset, our method performs well and is competitive with GETNET. Furthermore, IR-MAD performs worst on both OA and the kappa coefficient. Notably, the proposed method achieves the largest kappa coefficient, which indicates almost perfect consistency between its result and the ground-truth map. In particular, our method exceeds the best unsupervised method, CVA, by about 1%, which indicates the effectiveness of the proposed denoising module. As can be seen in Figure 8g, some of the obvious noise is removed compared to (a), (b), and (c). Although GETNET performs best, it is more complex to train and test. Overall, our method has obvious advantages over the other methods. To better visualize the denoising module, we show the estimated noise map and the latent change detection map in Figure 9.

Ablation Study
To evaluate the effect of different feature fusion methods, we conducted several experiments, and the results are shown in Table 3. The results show that feature-level concatenation outperforms the other fusion methods. In fact, element-wise summation and subtraction are special cases of concatenation: when the weights of the next convolution layer are specially structured, the concatenation operation is equivalent to summation or subtraction. In contrast, image-level subtraction obtains the worst performance. The reason may be that some important information is lost after the HSI subtraction. Furthermore, we also conducted experiments to evaluate the effect of the number of unsupervised change detection maps, which are presented in Table 4. The results reveal that using all three unsupervised change detection maps achieves the best performance. Even with only one unsupervised change detection map, our method can still improve the performance compared to the noisy training change detection maps. This indicates the effectiveness of the proposed method.
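The remark that summation and subtraction are special cases of concatenation can be checked numerically. The sketch below assumes the layer after fusion is a 1x1 convolution, written here with `np.einsum`; the same argument holds for any linear convolution. The helper `conv1x1` and the array sizes are hypothetical.

```python
import numpy as np


def conv1x1(x, w):
    """1x1 convolution: x has shape (N, C_in, H, W), w has shape (C_out, C_in)."""
    return np.einsum("oc,nchw->nohw", w, x)


rng = np.random.default_rng(0)
f1 = rng.standard_normal((2, 4, 8, 8))
f2 = rng.standard_normal((2, 4, 8, 8))
w = rng.standard_normal((3, 4))

concat = np.concatenate([f1, f2], axis=1)   # (N, 2C, H, W)
w_sum = np.concatenate([w, w], axis=1)      # weights [W, W] act on f1 + f2
w_sub = np.concatenate([w, -w], axis=1)     # weights [W, -W] act on f1 - f2

same_as_sum = np.allclose(conv1x1(concat, w_sum), conv1x1(f1 + f2, w))
same_as_sub = np.allclose(conv1x1(concat, w_sub), conv1x1(f1 - f2, w))
print(same_as_sum, same_as_sub)  # prints "True True"
```

Since a network trained on concatenated features is free to learn these structured weights or any others, concatenation is at least as expressive as the other two strategies, which is consistent with its stronger results in Table 3.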
As for the running time, the proposed framework takes 0.403 s for one inference on the described hardware. However, it takes 139 s for the patch-based method GETNET to generate the final change detection map, which is dramatically slower than the proposed method. Thus, the proposed method is time efficient compared to the patch-based method GETNET.

Conclusions
This paper proposes a novel noise modeling-based unsupervised deep FCN framework for HSI change detection. Because the high dimensionality of hyperspectral data adversely affects change detection performance, an effective deep framework is necessary to deal with this problem. Unlike common CNNs, which learn features under supervision, the presented deep FCN framework is unsupervised and requires no labeled data. It makes use of the results of existing unsupervised change detection methods to train the network in an end-to-end manner. Specifically, the proposed method consists of three main modules: the FCN-based feature learning module, the two-stream feature fusion module, and the unsupervised noise modeling module. Firstly, the FCN-based feature learning module learns discriminative features from bitemporal HSIs. Then, the two-stream feature fusion module fuses the extracted features for the next step. Finally, the unsupervised noise modeling module handles the influence of noise and enables robust training of the proposed network. These three modules work collaboratively to improve the performance of change detection. Extensive experimental results show that the proposed method is superior to unsupervised methods and to some supervised methods.
Author Contributions: All authors make contributions to proposing the method, performing the experiments and analyzing the results. All authors contributed to the preparation and revision of the manuscript.