MV-CDN: Multi-Visual Collaborative Deep Network for Change Detection of Double-Temporal Hyperspectral Images



Introduction
Change detection of hyperspectral images is a popular application in remote sensing. To detect earth-surface changes resulting from natural causes or human activities at the same geographic location over time, double-temporal images are needed: a former image captured at a certain time and a detected image acquired at a later point in time. This technique is widely applied in many fields, e.g., eco-environmental protection [1,2], urban sprawl [3,4], land application [5,6], farmland changes [7,8], geological disaster monitoring [9,10], as well as forest and wetland conservation [11]. With the development of multitemporal real-time monitoring, intelligent scene surveillance [12-14] and warning forecast systems have been pushed to fresh highs.
In the remote-sensing field, although multispectral images [6] and synthetic aperture radar (SAR) images [15] demonstrate good performance in the domain of change detection, they show some weaknesses in terms of limited spectral information. Fortunately, hyperspectral images (HSIs) can carry more detailed descriptions of real physical objects because of their wider spectrum, which favors improved detection performance.

Table 1. The relationship among the learning collaborator (LC), the output collaborator (OC), the first hidden layer (HL-1), the second hidden layer (HL-2), and the output layer (OL) in the three subdivision approaches: CDN with one collaborator (CDN-C), CDN with two collaborators (CDN-2C), and CDN with three collaborators (CDN-3C). The "×" means not available.

Approach   HL-1   HL-2   OL
CDN-C      ×      ×      OC
CDN-2C     ×      LC     OC
CDN-3C     LC     LC     OC
The major contributions of our work are summarized as follows: (1) an MV-CDN is introduced to mine more robust features from multi-visual deep expressiveness, and it can achieve a better balance in detecting positive and negative pixels; (2) CDN-C, of greatest efficiency, is proposed, and because of its three loose and incoherent outputs, it has much room for improvement in model performance, e.g., sample importance analysis and derivation of a weighted network; and (3) CDN-2C is mainly designed for actual ensemble learning with model compactness. The results tested on three real-world hyperspectral image datasets demonstrate its superiority in comparison to the state-of-the-art benchmarks. We organize the rest of this paper as follows. Section 2 is a review of the related work. Then we demonstrate the process of the proposed schema in Section 3. Section 4 describes the details of the tested datasets, the performance of the proposed scheme, as well as the comparison with state-of-the-art studies. Section 5 discusses the issues encountered during the experiments. Finally, in Section 6 we draw conclusions from the comparison of the proposed schema and the benchmarks.

Related Work
With the explosive growth of double- and multitemporal hyperspectral images, and confronting the complexity of background features, traditional algorithms cannot effectively detect the changes in the spectral domain. Deep learning frameworks, meanwhile, have shown their powerful expressiveness in change-detection tasks. Deep networks can be grouped by their training data into supervised and unsupervised networks. As a supervised model for change detection in remotely sensed images, Feng et al. [29] extended the encoding path of U-Net into a weight-sharing bilateral encoding path without introducing additional parameters when extracting independent features of bitemporal images. The testing results on two real-world aerial image datasets confirmed the effectiveness and robustness of the novel method in comparison to other state-of-the-art methods. Yuan et al. [30] proposed a weakly supervised method for change detection in hyperspectral images. With the distance metric learning, evolution regular framework, and Laplacian regularized metric learning methods, the tested results demonstrated the superiority of the proposed schema compared with the novel methods under both "ideal" and "noisy" conditions. Shi et al. [31] proposed a deeply supervised attention metric-based network. In this work, convolutional block attention modules were integrated in the deep metric learning to produce more discriminative features; also, the deeply supervised module was applied to assist the feature extractors in generating more useful features. Compared with other state-of-the-art work, the proposed method achieved the highest performance on both tested datasets.
Since the quality of label data largely determines the detection performance of a supervised model and ill labels can cause uncertainty in the detection results, unsupervised models, which have superiority in this regard, are being used by many researchers for change detection in hyperspectral images. Lei et al. [32] proposed a novel change-detection method for hyperspectral images. Based on unsupervised adversarial learning for spectral mapping and spatial attribute optimization for discriminant analysis, the results tested on two real datasets showed competitive performance over other state-of-the-art methods. Li et al. [33] proposed to combine two complementary model-driven methods, structural similarity and change vector analysis, to generate credible labels as training samples of a subsequent CNN. The experimental results confirmed the effectiveness of the proposed method.
Although many unsupervised methods perform well in change detection of hyperspectral images, the deep expressiveness of an individual deep network is limited. In fact, few studies have employed ensemble DNNs to improve model robustness and the detection balance between changed and unchanged pixels. Thus, in this paper, we propose to apply three similar light-weight collaborative network members with sensitivity disparity as an MV-CDN, which acts as the main part of our novel multi-visual collaborative deep network for change detection of double-temporal hyperspectral images.
As a baseline, DSFA was proposed to detect the changes of hyperspectral images. The major contribution of DSFA lies in the derivation of the loss function from the theory of SFA, which results in the suppression of unchanged pixels to enhance the changed pixels. The testing results have demonstrated its superiority in change-detection performance in comparison to the state-of-the-art algorithms. To support our proposed MV-CDN, three collaborative network members are introduced: the prototypic model FCNet used in DSFA and two other fine-tuned networks with similar structure and sensitivity disparity. Based on the FCNet, we construct the other two collaborative network members, which are sensitive to unchanged pixels and changed pixels and are named USNet and CSNet, respectively. With collaborators, the MV-CDN translates the multi-vision of the three collaborative network members into a more robust field of vision and a more balanced performance in detecting the changed and unchanged pixels.

Proposed Schema
This section describes the procedure of the proposed schema, which is composed of four modules: the MV-CDN module, the collaborator module, the SFA reprocessing module, and the change-analysis module.

• The MV-CDN consists of three subdivision approaches: CDN-C, CDN-2C, and CDN-3C, with CDN-C denoting the CDN with one collaborator on the OL, CDN-2C denoting the CDN with two collaborators on the OL and HL-2, and CDN-3C denoting the CDN with three collaborators on the OL, HL-2, and HL-1.

• We first feed the double-channel MV-CDN with symmetric training pixels X and Y, which are selected from specific areas of the pre-detected binary change map (BCM) obtained from DSFA [25] with SFA reprocessing omitted; the specific areas are detailed in the experiment section. The MV-CDN model can be well-trained following Sections 3.1 and 3.2 under the hyperparameter settings in the experimental section. In the deep-learning process, three light-weight collaborative network members, FCNet [25], USNet [26], and CSNet [26], are employed to serve the MV-CDN; the SFA algorithm is applied to construct the loss function for extracting invariant features. As the central idea of this paper, collaborators are utilized to translate the group-thinking of the collaborative network members into a more robust field of vision. The collaborators carried by the different network layers of the three subdivision approaches are shown in Table 1, where OC denotes the output collaborator and LC denotes the learning collaborator.
• Figure 1 shows the procedure of the proposed schema for detecting the changes of double-temporal hyperspectral images. Given the reference image denoted by R and the query image denoted by Q, we reshape both into N × 1 × B-dimensional data, with N and B respectively denoting the number of pixels and the number of bands of an image. Note that N is equivalent to H × W, with H and W respectively representing the height and width of the image. All paired pixels pass the well-trained model to obtain the CPF with a spatial dimension of H × W × b, with b denoting the number of feature bands. Next, the SFA reprocessing module is employed to further inhibit the unchanged features and enhance the changed features of the spatial data model of N × 1 × b. Finally, in the change-analysis module, the Euclidean distance [34] and K-means [28] algorithms are successively applied to compute the change-intensity map and the final binary change map.
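As an illustration of the data model above, the reshaping from an H × W × B image cube to N × 1 × B pixel vectors can be sketched in NumPy (toy dimensions, not those of the tested datasets):

```python
import numpy as np

# Toy dimensions: height H, width W, bands B (illustrative values only).
H, W, B = 4, 5, 3
image = np.arange(H * W * B, dtype=np.float32).reshape(H, W, B)

# Reshape to the N x 1 x B data model described in the text, with N = H * W.
N = H * W
pixels = image.reshape(N, 1, B)
print(pixels.shape)   # (20, 1, 3)

# The spatial layout is recoverable for rendering the change maps later.
restored = pixels.reshape(H, W, B)
```

The same reshaping is applied symmetrically to the reference image R and the query image Q before they are fed to the model.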

Architecture and Training Process of Proposed MV-CDN
In this section, we describe the architecture of the MV-CDN and explain it mathematically in detail. The MV-CDN consists of three light-weight collaborative network members with similar structures and sensitivity disparity. In this work, the FCNet [25] is regarded as the prototype of the collaborative network members. Extensive experiments have confirmed that the double-cycle of internal parameters W and b in HL-1/HL-2 is more conducive to detecting changed pixels/unchanged pixels, respectively. On this basis, the USNet and CSNet, which we proposed in our previous work [26], are designed and implemented. Supposing there is no collaborator, each collaborative network member can independently generate the projection features for the change-detection analysis; with the MV-CDN mechanism, the collaborators are selectively applied to translate the group-thinking of the collaborative network members into a more robust field of vision.
We illustrate the architecture of the collaborative network members, FCNet, USNet, and CSNet, in Figure 2 and present their corresponding parameter settings in Table 2. In Figure 2, the white nodes on the far left, denoted by X and Y, indicate the input variables (IV) of the double-temporal samples; the rightmost white nodes, denoted by X_f/X_u/X_c and Y_f/Y_u/Y_c, indicate the symmetric projection features of the collaborative network members, with FCNet denoted as f, USNet as u, and CSNet as c; the rightmost green nodes stand for the output layer; consequently, the remaining two groups of green nodes represent the two hidden layers. In Table 2, the values 128, 6, and 10 indicate numbers of nodes; B and b indicate the number of bands of each detected image and the number of bands of the corresponding mapped features, respectively. Table 2 also lists the cycle layers, the activation function of each layer, and the dropout rate.
Additionally, the learning rate, epoch, sampling range, and sample size, etc., are detailed in the experimental section.

Figure 2. The architecture of the collaborative network members: FCNet, USNet, and CSNet.

Table 2. Cycle layers of internal parameters and structural configuration of some hyperparameters for the collaborative network members: FCNet, USNet, and CSNet.

[Table 2 layout: rows FCNet, USNet, and CSNet; columns Settings, Activation Function, Nodes, Double-Cycle, and Dropout.]

We feed the symmetric samples X and Y to train the MV-CDN. The process of HL-1 is formulated in proper order as (1)-(3):

(x^f_{HL-1}, y^f_{HL-1}) = (φ(W^f_{HL-1} X + b^f_{HL-1}), φ(W^f_{HL-1} Y + b^f_{HL-1})),   (1)

(x^u_{HL-1}, y^u_{HL-1}) = (φ(W^u_{HL-1} X + b^u_{HL-1}), φ(W^u_{HL-1} Y + b^u_{HL-1})),   (2)

(x^c_{HL-1}, y^c_{HL-1}) = (φ(W^c_{HL-1} X + b^c_{HL-1}), φ(W^c_{HL-1} Y + b^c_{HL-1})),   (3)

where φ denotes the activation function of the layer; the superscript and subscript of the internal parameters W and b indicate the collaborative network member and the current layer, respectively; and the paired results (x, y) are the three outputs of the corresponding layer of the collaborative network members.
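The layer transform of (1)-(3) can be sketched as follows; this is a minimal NumPy illustration with toy sizes, a stand-in tanh activation, and randomly initialized parameters (in the actual model, each temporal channel learns its own parameters via the DSFA training procedure, and the settings of Table 2 apply):

```python
import numpy as np

def symmetric_layer(X, Y, W, b, act=np.tanh):
    """One fully connected layer applied to both temporal channels.

    X, Y: (n, d_in) pair-wise samples; W: (d_in, d_out); b: (d_out,).
    Returns the paired outputs of this layer for one network member.
    """
    return act(X @ W + b), act(Y @ W + b)

rng = np.random.default_rng(0)
n, B, nodes = 8, 6, 128   # toy sample and band counts; 128 nodes per Table 2
X = rng.standard_normal((n, B))
Y = rng.standard_normal((n, B))

# One (W, b) pair per collaborative member f/u/c; random stand-ins here.
members = {m: (0.1 * rng.standard_normal((B, nodes)), np.zeros(nodes))
           for m in ("f", "u", "c")}
hl1 = {m: symmetric_layer(X, Y, W, b) for m, (W, b) in members.items()}
```

Each entry of `hl1` holds the paired HL-1 outputs of one member, i.e., the quantities on the left-hand sides of (1)-(3).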
In the CDN-3C approach, the outputs of HL-1 then go through the LC of the collaborator process module to obtain the pair-wise data (x^upd_{HL-1}, y^upd_{HL-1}), which are taken as the input of HL-2, where 'upd' means 'updated'. However, in the case of the other two subdivision approaches, the output of HL-1 is regarded directly as the input of HL-2. With LC, the HL-2 process can be represented as (4)-(6), which take the same form as (1)-(3) with the subscript HL-1 replaced by HL-2 and the input pair (X, Y) replaced by (x^upd_{HL-1}, y^upd_{HL-1}). In the CDN-2C and CDN-3C approaches, the outputs of HL-2 then pass the LC to obtain (x^upd_{HL-2}, y^upd_{HL-2}), which go through the output layer to obtain the projection features; in the case of the CDN-C approach, the output of HL-2 is regarded directly as the input of the output layer. With LC, the process of the output layer can be expressed with (7)-(9), analogous in form to (4)-(6). To train the MV-CDN model, we follow the loss function of DSFA [25], which is derived from the feature-invariance extraction known as SFA theory on double-temporal images. The SFA theory is summarized into an objective function and three restrictions [25], which can be reconstructed into a generalized eigenproblem as (10):

A W = B W L,   (10)
where W and L stand for the generalized eigenvector matrix and the diagonal matrix of eigenvalues, respectively; A and B denote the expectation of the covariance matrix of the first-order derivative of the double-temporal features and the expectation of the covariance matrix of the double-temporal features, as (11) and (12), respectively:

A = (1/n) Σ_{i=1}^{n} (x_i − y_i)(x_i − y_i)^T,   (11)

B = (1/2)(Σ_XX + Σ_YY) = (1/2n) Σ_{i=1}^{n} (x_i x_i^T + y_i y_i^T),   (12)
where x_i and y_i are regarded as the ith pair-wise pixels, and T and n indicate the transpose operation and the number of pixels of a whole image, respectively. In conditions where both Σ_XX and Σ_YY are non-negative and invertible, the generalized eigenproblem can be reformulated as (13):

B^{-1} A W = W L,   (13)
where the square of B^{-1} A should be minimized to meet the feature invariance of SFA theory; thus, the loss function could be designed as (14):

loss = tr((B^{-1} A)^T (B^{-1} A)),   (14)
where tr denotes the trace of the matrix. Through the gradient descent algorithm detailed in part B of the methodology section of reference [25], both pair-wise internal parameters (W_X, b_X) and (W_Y, b_Y), which result from the learning of X and Y, are obtained.
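Under the definitions above, the loss can be sketched in NumPy; this assumes the squared-trace reading of (14), approximates the first-order derivative by the temporal difference, and adds a small `eps` regularizer (our addition) so that B is always invertible:

```python
import numpy as np

def sfa_loss(Xp, Yp, eps=1e-6):
    """SFA-style loss on projected double-temporal features Xp, Yp: (n, b)."""
    n, b = Xp.shape
    diff = Xp - Yp                          # first-order temporal derivative
    A = diff.T @ diff / n                   # covariance of the derivative
    B = (Xp.T @ Xp + Yp.T @ Yp) / (2 * n)   # mean covariance of both dates
    B = B + eps * np.eye(b)                 # hypothetical regularizer
    M = np.linalg.inv(B) @ A
    return float(np.trace(M.T @ M))         # trace of the "square" of B^-1 A

rng = np.random.default_rng(1)
Xp = rng.standard_normal((100, 6))
print(sfa_loss(Xp, Xp))   # identical dates: zero temporal change, loss 0.0
```

Minimizing this quantity suppresses the unchanged components, as described above; in the paper the minimization is carried out by gradient descent over the network parameters.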

Collaborator Process
In this section, Figure 3 shows the collaborator process of the subdivision approaches. In HL-1 and HL-2, the LC is applied for the collaborative task, which is formulated as (15)-(17); before that, the data of tensor type to be processed should be converted to array type:

x^upd_{HL-j} = Mean(x^f_{HL-j}, x^u_{HL-j}, x^c_{HL-j}),   (15)

y^upd_{HL-j} = Mean(y^f_{HL-j}, y^u_{HL-j}, y^c_{HL-j}),   (16)

where Mean represents the arithmetic average operator and HL-j signifies the jth hidden layer; naturally, the pair-wise data (x^upd_{HL-j}, y^upd_{HL-j}) are fed back to the three collaborative network members as the common input of the next layer, which constitutes (17) in the collaborator process.

Regarding the output layer, since the projection features no longer transmit in the network, (15) and (16) are applied directly to the outputs of the OL to produce the final collaborative projection features; this is the OC.
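The Mean operator of (15) and (16) reduces to a per-element arithmetic average of the three members' outputs, which can be sketched as follows (toy arrays standing in for the tensor-to-array converted layer outputs):

```python
import numpy as np

def collaborate(out_f, out_u, out_c):
    """Arithmetic mean of the three members' outputs (the Mean operator)."""
    return np.mean(np.stack([out_f, out_u, out_c]), axis=0)

# Toy layer outputs of FCNet, USNet, and CSNet for one temporal channel.
x_f = np.zeros((4, 3))
x_u = np.ones((4, 3))
x_c = 2 * np.ones((4, 3))
x_upd = collaborate(x_f, x_u, x_c)   # every entry is (0 + 1 + 2) / 3 = 1.0
```

For the LC, `x_upd` (and its counterpart `y_upd`) is fed back to all three members as the next layer's common input; for the OC, the averaged outputs of the OL are taken directly as the collaborative projection features.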

SFA Reprocessing
In the tests, we feed the well-trained model with a reference image R and query image Q to obtain the double-temporal CPF (R1, Q1). To take the SFA efficiency one step further, the weight vector matrix W resulting from USNet is required, because the USNet is more sensitive to unchanged pixels and thus matches the SFA objectives. The SFA reprocessing can further inhibit the unchanged features to highlight the changed features, which is conducive to the segmentation of thresholding. As shown in Figure 4a, before SFA reprocessing with the effective network, the red dots and blue dots of reverse scattering would undoubtedly lead to false segmentation.
To execute the SFA reprocessing, we multiply the transpose of W with (R1, Q1) as in (18) to produce the feature sets (R2, Q2):

(R2, Q2) = (W^T R1, W^T Q1).   (18)

The result is shown in Figure 4b. It is noteworthy that the effectiveness of SFA reprocessing depends on the expressiveness of the network used in the previous step; thus, deep features of low quality would reinforce the error description, as shown in Figure 4c,d.
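A sketch of (18) with pixel features stored as rows (so the left-multiplication by W^T corresponds to a right-multiplication by W here); W is a random stand-in for the eigenvector matrix obtained from USNet:

```python
import numpy as np

def sfa_reprocess(R1, Q1, W):
    """Project the CPF (N, b) through the eigenvector matrix W (b, b)."""
    return R1 @ W, Q1 @ W

rng = np.random.default_rng(2)
N, b = 10, 4
R1 = rng.standard_normal((N, b))
Q1 = rng.standard_normal((N, b))
W = rng.standard_normal((b, b))   # stand-in for the learned eigenvectors
R2, Q2 = sfa_reprocess(R1, Q1, W)
```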

Change Analysis
In effect, it is impossible to artificially recognize the change areas from the double-temporal features (R2, Q2). Therefore, the Chi-square distance [35], Euclidean distance [34], and improved Mahalanobis distance [36], etc., could be selectively applied to the calculation of the change-intensity map (CIM). In the tests, the Euclidean distance is employed to compute the CIM using (19) and (20). The computed result of the Euclidean distance is regarded as the CIM, which can be applied for the initial detection of changes. Then the K-means clustering method is employed as automatic thresholding for image segmentation, and finally the binary change map is generated, in which the white and black marks uniquely identify the changed and unchanged areas, respectively. The pseudocode of the proposed schema is summarized in Algorithm 1; an excerpt follows:

…: Apply OC on OL, and LC on HL-2;
9: Go to line 13;
10: Case CDN-3C:
11: Apply OC on OL, and LC on HL-2 and HL-1;
12: Go to line 13;
13: while i < epochs do
14: Compute the double-temporal projection features of pair-wise samples X and Y: X′ = f(X, P_X) and Y′ = f(Y, P_Y);
15: Compute the gradient of the loss function with respect to (P_X, P_Y);

In the tests, we found that other distance methods have little influence on the detection results in comparison to the use of the Euclidean distance; in addition, the K-means clustering algorithm could be replaced by other threshold algorithms such as Otsu [37]. In particular, the designed example uniformly adopts the Euclidean distance and K-means threshold algorithm.
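The change-analysis steps can be sketched end to end; the two-cluster 1-D K-means below is a minimal stand-in for the clustering used in the paper:

```python
import numpy as np

def change_intensity_map(R2, Q2):
    """Pixel-wise Euclidean distance over the feature bands (the CIM)."""
    return np.sqrt(((R2 - Q2) ** 2).sum(axis=-1))

def kmeans_binarize(cim, iters=50):
    """Two-cluster 1-D K-means threshold; 1 = changed, 0 = unchanged."""
    centers = np.array([cim.min(), cim.max()], dtype=float)
    labels = np.zeros(cim.shape, dtype=np.uint8)
    for _ in range(iters):
        labels = np.abs(cim[..., None] - centers).argmin(axis=-1).astype(np.uint8)
        for k in (0, 1):
            if (labels == k).any():
                centers[k] = cim[labels == k].mean()
    return labels

cim = np.array([0.1, 0.2, 0.15, 5.0, 5.2])   # toy change intensities
bcm = kmeans_binarize(cim)
print(bcm.tolist())   # [0, 0, 0, 1, 1]
```

On a real image, `cim` would be the H × W intensity map and `bcm` the binary change map rendered in white and black.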

Results
To test the performance of the comparison methods, three hyperspectral image datasets acquired from the web address [38] are employed. We detail the tested datasets as follows. As shown in Figure 5, all three datasets are double-temporal data models. Among them, the scene images of "Hermiston" were taken in 2004 and 2007. The scene covers Hermiston City, Oregon, with a 30 m ground resolution using the HYPERION sensor and a size of 390 × 200 pixels. There are 242 spectral bands selected for change-detection tasks. The full image of Hermiston is labelled with 78,000 pixels, including 9986 positive pixels and 68,014 negative pixels. The two hyperspectral images of the dataset "Santa Barbara" were obtained from the AVIRIS sensor in 2013 and 2014 over Santa Barbara, California, USA, with a 20 m ground resolution and a spatial dimension of 984 × 740 × 224, indicating the height and width in pixels and the number of bands, respectively. The "Bay Area" scenes were acquired using the AVIRIS sensor in 2013 and 2015 over Patterson City, California, USA, with a 20 m ground resolution and a dimensional size of 600 × 500 × 224. The full image of the Santa Barbara dataset has 728,160 pixels, including 52,134 positive pixels, 80,418 negative pixels, and 595,608 unlabeled pixels, while in the "Bay Area" dataset, there are 300,000 pixels in each image, including 39,270 positive pixels, 34,211 negative pixels, and 226,519 unlabeled pixels. Figure 5(C3) demonstrates the ground-truth maps of the tested datasets. Among them, the white marks and black marks of (R2, C3) indicate the positives and negatives, respectively, while in (R1, C3) and (R3, C3), the silver marks and white marks represent the labelled areas, with silver marks denoting the negatives and white marks representing the positives.
Based on the ground-truth maps with confidence of absoluteness, five metric coefficients, OA_CHG, OA_UN, OA [39], Kappa, and F1 [40], as defined in (21)-(26), are employed to quantify the comparison among the change-detection methods, where OA_CHG and OA_UN are two metric coefficients on positive pixels and negative pixels, respectively, while OA, Kappa, and F1 are three comprehensive coefficients of different measurement modes. In the equations, ALL stands for the total number of labelled pixels; TP and FN indicate the number of true positives and false negatives, respectively, and their sum is equivalent to LP, the number of labelled positives; TN and FP represent the number of true negatives and false positives, respectively, and their sum is equivalent to LN, the number of labelled negatives.
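Since the bodies of (21)-(26) are not reproduced here, the following sketch assumes the standard definitions of OA, Kappa, and F1 built from the confusion counts named in the text:

```python
def detection_metrics(TP, FN, TN, FP):
    """OA_CHG, OA_UN, OA, Kappa, and F1 from confusion-matrix counts."""
    LP, LN = TP + FN, TN + FP          # labelled positives / negatives
    ALL = LP + LN
    oa_chg = TP / LP                   # accuracy on changed pixels
    oa_un = TN / LN                    # accuracy on unchanged pixels
    oa = (TP + TN) / ALL               # overall accuracy
    pe = ((TP + FP) * LP + (FN + TN) * LN) / (ALL ** 2)   # chance agreement
    kappa = (oa - pe) / (1 - pe)
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)
    return oa_chg, oa_un, oa, kappa, f1
```

A perfect detector (FN = FP = 0) scores 1.0 on all five coefficients, while a heavy imbalance between OA_CHG and OA_UN lowers Kappa and F1 even when OA remains high.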

Measurement Coefficients
With respect to the hyperparameters, the number of nodes of each hidden layer denoted by NoH, the number of nodes of the output layer represented by NoO, and the learning rate (LR) are, respectively, 128, 10, and 5 × 10−5. In practice, the pair-wise pixels matched to the specific areas are regarded as training samples. Hence, in the tests, the pre-detected binary change map (PD-BCM) resulting from the change-detection model using the fully connected network (CD-FCNet) [25], with SFA reprocessing cancelled, is applied as the sampling reference object. Furthermore, because of the synchronous collaboration, all collaborative network members train with 2000 epochs. The dataset Santa Barbara requires 3000 paired pixels selected from the changed areas of the PD-BCM, while 3000 paired pixels of unchanged areas are selected for the other two datasets, Hermiston and Bay Area. We especially note that the visualization and quantization results of a particular dataset are generated from the same sampling strategy; therefore, the comparison is fair.
When the training of the MV-CDN is completed, the two detected images are fed to the well-trained model to generate three couples of CPF for the three subdivision approaches, CDN-C, CDN-2C, and CDN-3C. Since an RGB image has three bands, we use the fourth, third, and second bands to synthesize the pseudo-color feature map from the CPF. By performing the comparison methods on the three tested datasets, we demonstrate the double-temporal feature maps of the CPF in Figure 6, where the left column, right column, top row, and bottom row are assigned to the feature maps resulting from the reference image and query image with and without SFA reprocessing, respectively. (C1)-(C6) correspond respectively to the comparison methods: DSFA [25], the change-detection model using the unchanged sensitivity network (CD-USNet) [26], the change-detection model using the changed sensitivity network (CD-CSNet) [26], and the proposed CDN-C, CDN-2C, and CDN-3C. In Figures 7-9, (R1) demonstrates the divergence maps; (R2) shows the change-intensity maps (CIM), with very bright marks indicating a change of high probability and very dark marks denoting a small change or no change. In the grayscale CIM, it is difficult to determine the pixel-wise change states. Therefore, the K-means clustering algorithm is then applied to generate the BCM, as shown in (R3), where the white marks and black marks represent the detected changed areas and unchanged areas, respectively. To identify the detected states of TP, FN, TN, and FP, as well as the unlabeled domain, we recolor the BCM with white, red, green, yellow, and black, respectively, to generate the hitting state map (HSM), as shown in (R4). Since the dataset Hermiston does not contain an unlabeled domain, there are no black marks in the corresponding HSM.

Comparison with State-of-the-Art Work
This part analyzes and compares the proposed scheme and the state-of-the-art work, DSFA [25], CD-USNet [26], and CD-CSNet [26], from the perspectives of visualization and quantification. Figures 7-9 show the visualization results on the three datasets Hermiston, Santa Barbara, and Bay Area, respectively.
In Figure 7, (R4, C4), (R5, C4), and (R6, C4), and in Figures 8 and 9, (R4, C4), (R4, C5), and (R4, C6) are the three images with the smallest area of red marks and the largest area of white marks, representing the minimum FN values and maximum TP values, respectively. This indicates that the proposed scheme obtains a high hit-rate of changed pixels in comparison to the benchmarks. Moreover, a closer examination reveals that, even if the proposed schema does not always have superiority in detecting unchanged pixels, due to its distinct advantage on changed pixels, it always outperforms the benchmarks in overall performance.
The five well-formulated coefficients OA_CHG, OA_UN, OA, Kappa, and F1 given in (21)-(26) were calculated to show the performance of the change-detection methods.
Reliable and high-quality data are provided for analysis in Tables 3-5, where the more recent advanced algorithms, CD-SDN-AM and CD-SDN-AL [26], are also compared, with the best result of each quantization coefficient marked in bold. We summarize the data analysis as follows. (1) The collaborative network members serving the MV-CDN are theoretically effective: based on the DSFA algorithm, the CD-CSNet model outperforms DSFA in OA_CHG while underperforming in OA_UN, and the CD-USNet model proves to be the contrary. (2) Due to the characteristics of multi-vision, the MV-CDN model achieves a better balance in the detection performance of positive and negative pixels. (3) The proposed schema has intense, comparative, and slight superiority over the other comparison methods on the datasets Bay Area, Santa Barbara, and Hermiston, respectively. (4) Compared with CDN-2C, the CDN-3C with one more collaborator cannot further improve the detection performance. (5) Even though the proposed CDN-C and CDN-2C may not always be in the top two for the coefficients OA_CHG and OA_UN, their performance in the comprehensive coefficients OA, Kappa, and F1 is not inferior to any other method on any tested dataset.

Figure 10.

Discussion
The proposed MV-CDN is an effective deep model for change-detection tasks on double-temporal hyperspectral images. The subdivision approaches CDN-2C and CDN-3C are two collaborative deep models of compactness, because their members restrict each other in the hidden layer(s). As we have shown, CDN-3C takes the most time in the learning process without improving performance and can therefore be eliminated. The less time-consuming CDN-2C is not inferior to CDN-3C in terms of detection performance. On this basis, considering both performance and model compactness, CDN-2C is desirable. The CDN-C carries only one collaborator and takes the least time. In this case, the three collaborative network members have no mutual restriction before the output layer, and they do not even require a synchronized process or similar structure; thus, there is much variability in the collaborative network members when using CDN-C.
As shown, CDN-3C demonstrates no performance superiority, whether in terms of detection accuracy or time consumption, compared to the other two approaches. The reasons can be explained as follows: (1) The mechanism of the proposed schema lies in combining effective perspectives; however, in HL-1, the transformation processes of FCNet and USNet are the same, as reflected in Figure 2 and Table 2. In this case, the additional collaborator on the first hidden layer of CDN-3C may prevent the MV-CDN from generating better description features. (2) The sensitivity disparity between FCNet and USNet is not formed until HL-2, which therefore achieves a better multi-vision effect. (3) Within the effective range of perspectives (the second hidden layer and the output layer), CDN-2C outperforms CDN-C because it has more collaborators.

Conclusions
In this paper, we propose an MV-CDN for double-temporal hyperspectral image change detection. In the proposed schema, three light-weight collaborative network members, the prototypical FCNet, USNet, and CSNet, are employed to serve three subdivision approaches: CDN-C, CDN-2C, and CDN-3C. In the CDN-C approach, an output collaborator is applied on the output layer; based on CDN-C, an additional learning collaborator is applied on HL-2 for CDN-2C, while two additional learning collaborators are applied on HL-1 and HL-2 for CDN-3C. The collaborators integrate the multi-vision of the three collaborative network members. When the collaborative projection features are acquired, the SFA reprocessing and Euclidean distance are successively applied to enlarge the difference between the changed and unchanged pixels and generate the change-intensity map. Finally, the K-means method is employed to split the change-intensity map into a binary change map, which uniquely identifies the detected changes. We implemented the proposed schema on three open double-temporal hyperspectral image datasets. The tested results show that our proposed scheme outperforms any change-detection model with a single collaborative network member and achieves a better balance in the detection performance of positive and negative pixels in comparison to the benchmarks.