Differentially Deep Subspace Representation for Unsupervised Change Detection of SAR Images

Temporal analysis of synthetic aperture radar (SAR) time series is a basic and significant issue in the remote sensing field. Change detection, like other interpretation tasks of SAR images, often involves non-linear/non-convex problems. Complex (non-linear) change criteria or models have thus been proposed for SAR images, instead of the direct difference (e.g., change vector analysis) with/without linear transforms (e.g., Principal Component Analysis, Slow Feature Analysis) used in optical image change detection. In this paper, inspired by powerful deep learning techniques, we present a deep autoencoder (AE) based non-linear subspace representation for unsupervised change detection with multi-temporal SAR images. The proposed architecture is built upon an autoencoder-like (AE-like) network, which non-linearly maps the input SAR data into a latent space. Unlike normal AE networks, a self-expressive layer performing like principal component analysis (PCA) is added between the encoder and the decoder, which further transforms the mapped SAR data into mutually orthogonal subspaces. To make the proposed architecture more efficient at change detection tasks, the parameters are trained to minimize the representation difference of unchanged pixels in the deep subspace. The proposed architecture is therefore named the Differentially Deep Subspace Representation (DDSR) network for multi-temporal SAR image change detection. Experimental results on real datasets validate the effectiveness and superiority of the proposed architecture.


Introduction
Change detection with remote sensing images is the process of identifying and locating differences in regions of interest by observing them at different dates [1]. It is of great significance for many applications of remote sensing images, such as rapid disaster mapping, land-use and land-cover monitoring and so on. Wessels et al. [2] use optical images with the reweighted multivariate alteration detection method to identify changed areas, and then update the land-cover mapping. A multi-sensor change detection method between optical and synthetic aperture radar (SAR) imagery is proposed in [3] for earthquake damage assessment of buildings. Taubenbock et al. [4] propose a post-classification based change detection using optical and SAR data for urbanization monitoring. Multi-temporal airborne laser data is used to monitor forest change in [5]. In this paper, we tackle the issue of change detection using SAR images. Unlike optical remote sensing images, SAR images can be acquired under any weather condition, day or night; however, they usually pose more challenges (i.e., non-linear/non-convex problems) for visual and machine interpretation due to the coherent imaging mechanism (speckle).
For change detection using remotely sensed optical images, the most widely used criterion is the difference operator [1] (for single-channel images) or change vector analysis [6][7][8] (for multi-band/spectral images). Due to the temporal spectral variance caused by different atmospheric conditions, illumination and sensor calibration, image transformation has been widely used to yield robust change detection criteria. The core idea of image transformation is to transform the multi-band/spectral image into a specific feature space, in which the unchanged temporal pixel pairs have similar representations while the changed ones differ from each other. Principal component analysis (PCA) [9][10][11] is one of the state-of-the-art operators for modeling the temporal spectral difference of unchanged pixels. Beyond PCA, the Kauth-Thomas transformation [12], the Gram-Schmidt orthonormalization process [13,14], multivariate alteration detection [15,16] and slow feature analysis [17,18] have been used for optical image change detection. However, these algorithms are mainly designed for optical images and usually fail to deal with SAR images affected by speckle.
Given SAR images, we may meet a more complex situation in which the multi-temporal images are in different feature spaces and changed/unchanged pixels are linearly non-separable, due to the coherent imaging mechanism. Two main approaches have been developed in the literature: coherent change detection and incoherent change detection. The former uses the phase information of SAR time series to study the coherence map, which has strict limitations on the input multi-temporal SAR images [19]. Incoherent change detection relies more on the amplitude or intensity values of SAR data, for instance, the amplitude ratio or log-ratio [20]. Improvements have been proposed thanks to automatic thresholding methods [21] and multi-scale analysis to preserve details [22]. Lombardo and Oliver [23] propose a generalized likelihood ratio test given by the ratio between geometric and arithmetic means for SAR images. Quin et al. [24] extend the SAR ratio to more general cases with an adaptive and nonlinear threshold, which can be applied not only to SAR image pairs but also to long-term SAR time series. Beyond change detection, Su et al. [25] propose a generalized likelihood ratio test based spectral clustering for temporal behaviour analysis of long-term SAR time series. Clearly, non-linear change criteria have been widely used for SAR images in the literature. However, these change criteria usually yield noisy results due to SAR speckle, or face a trade-off between spatial resolution and smoothness of the detection results.
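As a concrete illustration of the classic incoherent criterion mentioned above, the log-ratio can be sketched in a few lines (the function name and the stabilizing epsilon are our own illustrative choices, not taken from [20]):

```python
import numpy as np

def log_ratio(img1, img2, eps=1e-6):
    """Classic log-ratio change criterion for a bi-temporal SAR pair.

    Taking the logarithm of the amplitude/intensity ratio turns the
    multiplicative speckle into an additive perturbation, which makes
    the criterion more symmetric between positive and negative changes.
    """
    return np.abs(np.log((img1 + eps) / (img2 + eps)))
```

A threshold (fixed or automatic, as in [21]) on this map then separates changed from unchanged pixels.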
Recently, deep learning techniques have been experiencing rapid growth and have achieved remarkable success in various fields. For the change detection issue using remotely sensed data, a large number of deep network architectures have been proposed. An improved UNet++ [26] is proposed to solve the error accumulation problem in deep feature based change detection. Ji et al. [27] apply a Mask R-CNN based building change detection network with self-training ability, which does not need high-quality training samples. The dual learning-based Siamese framework in [28] can reduce the domain differences of bi-temporal images by retaining the intrinsic information and translating the images into each other's domain. A set of convolutional neural network features [29] has been used to compute difference indices. Similarly, a sparse autoencoder is applied in [30] to extract robust SAR features for change detection.
In this paper, we propose a differentially deep subspace representation (DDSR) for multi-temporal SAR images. The proposed network consists of a non-linear mapping network followed by a linear transform layer to deal with the complex patterns of changed and unchanged pixels in SAR images. The non-linear mapping network is built upon an autoencoder-like (AE-like) deep neural network, which can non-linearly map the noisy SAR data to a low-dimensional latent space. Contrary to a normal autoencoder (AE) network, the proposed architecture is trained to minimize the representation difference of unchanged pixel pairs, instead of the reconstruction error of the decoder. To better separate the unchanged and changed pixels in the latent space, a single-layer self-expressive network linearly transforms the mapped SAR data into mutually orthogonal subspaces. In the transformed subspace, the unchanged pixel pairs have similar representations, while the temporally changed ones are comparatively different from each other. Changed pixels are finally identified by the unsupervised K-Means clustering method [31]. Note that a similar idea has been proposed in [32], in which slow feature analysis [18] is applied to perform the linear transform, instead of our self-expressive network trained with the backpropagation algorithm.
This paper is organized as follows. Section 2 briefly recalls the non-linear/linear subspace approaches. The proposed network is presented in Section 3, which is followed by the evaluation (Section 4) and the conclusion (Section 5).

Related Work
To deal with the nonlinearities in the SAR change detection task, the proposed DDSR maps the bi-temporal SAR data into a subspace using a non-linear AE-like network followed by a linear self-expressive layer. The change criterion is computed from the DDSR difference of the input bi-temporal SAR images. Similar ideas have been proposed in the literature.

Deep Subspace Clustering
Ji et al. [33] propose a deep autoencoder framework for subspace clustering, in which a self-expressive layer is introduced between the encoder and the decoder to learn the pairwise affinities of the input data through a standard backpropagation procedure. Figure 1a gives a brief illustration of this deep subspace clustering network. It provides an explicit non-linear mapping for the complex input data that is well adapted to subspace clustering, which yields significant improvement over state-of-the-art subspace clustering solutions. A structured autoencoder in [34] introduces a global structure prior into the non-linear mapping. These deep subspace approaches mainly focus on clustering or recognition problems, in which the network weights are trained to exploit the similarity information among the input data, instead of the differential information used for the change detection task. Even though these approaches can be easily adapted to the change detection task, the performance might not be optimal. In this paper, the proposed architecture discards the decoder network and redesigns the network loss to adapt to SAR image change detection.

Deep Slow Feature Analysis Network
In [32], Du et al. present a slow feature analysis (SFA) theory based deep neural network for optical remote sensing change detection. This network non-linearly maps the input bi-temporal data into a higher dimensional space, as shown in Figure 1b. The classic SFA algorithm is then applied to suppress the unchanged components and highlight the changed components of the mapped data. In our work, the non-linear mapping is performed by a sparse AE-like network, which compacts the input data into a lower dimensional space. In addition, compared with the SFA based linear transformation, the self-expressive layer can be trained by the backpropagation algorithm to adapt well to the given task and dataset.

Differentially Deep Subspace Representation (DDSR) for Change Detection
In our view, non-linear transformations for change detection generally outperform linear ones, since they can handle the complex patterns of the input data. Non-linear kernel based methods have also been proposed [35][36][37]; however, it is not clear whether the pre-defined kernels are suitable for SAR image change detection tasks. In this work, our goal is to learn an explicit mapping that makes the changed and unchanged pixel pairs more separable in the transformed subspaces. This section builds our architecture, namely the differentially deep subspace representation (DDSR), based on the classic autoencoder network. As shown in Figure 2, the non-linear part (the AE-like network) first maps the input bi-temporal SAR data into a low-dimensional latent space. The linear part (the self-expressive layer) further transforms the mapped SAR data to a subspace. Contrary to minimizing the reconstruction error, the proposed architecture is trained to compact the unchanged pixel pairs and separate the changed ones in the subspace.

AE-Like Network Based Non-Linear Mapping
Basically, the encoder of the AE network is a classical multi-layer deep neural network. Each stage, consisting of an input layer, a hidden layer and an output layer, non-linearly transforms the input data into latent features. Given a pair of pixels {x, y}, X ∈ R^N and Y ∈ R^N are the corresponding patches with pixels x and y as centers, respectively. In the proposed AE-like network, I, H and Z denote the input, hidden and output layers of the neural network. At the first stage, the patch pair {X, Y} corresponding to the pixel pair {x, y} is reshaped to form the input vectors I (i.e., I_X and I_Y). The hidden layer can be computed by

H = f(W_H I + B_H),

where W_H ∈ R^{M×N} denotes the weight matrix of the hidden layer, B_H ∈ R^M denotes the bias and f denotes the activation function performing the non-linear mapping. At the second stage, the latent feature H is mapped to the output by

Z = f(W_O H + B_O),

where W_O and B_O denote the weight matrix and bias of the output layer.
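The two-stage mapping above can be sketched numerically as follows (the patch/latent sizes, the weight scaling and the tanh activation are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 25, 10                          # flattened 5x5 patch, M latent units (illustrative)

I_X = rng.normal(size=(N, 1))          # reshaped input patch I_X
W_H = rng.normal(size=(M, N)) * 0.1    # hidden-layer weight matrix W_H
B_H = np.zeros((M, 1))                 # hidden-layer bias B_H
f = np.tanh                            # activation f (tanh is an assumption)

H = f(W_H @ I_X + B_H)                 # first stage: H = f(W_H I + B_H)

W_O = rng.normal(size=(M, M)) * 0.1    # output-layer weight matrix
B_O = np.zeros((M, 1))                 # output-layer bias
Z = f(W_O @ H + B_O)                   # second stage: Z = f(W_O H + B_O)
```

The same computation is applied to I_Y, so that each patch of the bi-temporal pair obtains its own latent representation Z.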

Self-Expressive Layer Based Linear Transformation
As shown in Figure 2, the main motivation of the self-expressive layer is based on the PCA and SFA theories. However, unlike PCA or SFA, the linear transformation of our DDSR is learned by the backpropagation algorithm, instead of the classic or generalized eigenvalue decomposition. This data-driven strategy can make the self-expressive layer more adaptive to the given datasets than PCA and SFA. Let Z ∈ R^{M×1} and Z' ∈ R^{M×1} denote the input (i.e., the output of the AE-like network) and the output of the self-expressive layer, respectively:

Z' = W_SE Z,

where W_SE ∈ R^{M×M} denotes the weights of the self-expressive layer. To form a mutually orthogonal subspace, each row vector in W_SE has to be orthogonal to every other row vector in W_SE.
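A small sketch of this linear transform follows. For illustration we obtain mutually orthogonal rows directly via a QR decomposition; in DDSR the orthogonality is instead encouraged by a term in the loss, so this is a stand-in, not the paper's training procedure:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 8

# Hypothetical W_SE with mutually orthogonal (here even orthonormal) rows.
W_SE, _ = np.linalg.qr(rng.normal(size=(M, M)))

Z = rng.normal(size=(M, 1))   # output of the AE-like encoder
Z_prime = W_SE @ Z            # linear subspace transform Z' = W_SE Z
```

Because the rows are orthonormal here, the transform preserves the norm of Z; a learned W_SE with merely orthogonal rows would preserve the direction structure but rescale each component.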

Network Architecture of DDSR
Since pixel-wise change detection is strongly affected by speckle, a patch-wise strategy is applied in this paper, i.e., a square image patch formed by a pixel and its surrounding pixels. Each patch pair {I_X, I_Y} with center pixels {x, y} is reshaped to vectors X ∈ R^{N×1} and Y ∈ R^{N×1} (N = 5 × 5 in this paper), as shown in Figure 2. Through the AE-like network (Section 3.1), the input bi-temporal SAR patches X and Y are non-linearly mapped to a lower dimensional latent space, denoted by Z_X ∈ R^{M×1} and Z_Y ∈ R^{M×1} (where N > M). Z_X and Z_Y are then linearly transformed to Z'_X and Z'_Y by the self-expressive layer. The change criterion r between pixels x and y can be calculated by

r = ||Z'_X − Z'_Y||_2, (4)

To identify the changed pixels, unsupervised K-Means clustering is applied to classify {r} into the changed and unchanged groups.
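The final clustering step can be sketched with a tiny two-class K-Means on the 1-D criteria {r} (a minimal numpy stand-in for the K-Means method of [31]; initialization and iteration count are our own choices):

```python
import numpy as np

def change_map(r, iters=50):
    """Two-class K-Means on 1-D change criteria {r}.

    Returns 0 for the cluster whose centre has the lower magnitude
    (unchanged) and 1 for the other cluster (changed).
    """
    r = np.asarray(r, dtype=float).ravel()
    c = np.array([r.min(), r.max()])                 # initialise the two centres
    for _ in range(iters):
        # assign each r to its nearest centre
        labels = (np.abs(r - c[0]) > np.abs(r - c[1])).astype(int)
        for k in (0, 1):                             # update centres
            if np.any(labels == k):
                c[k] = r[labels == k].mean()
    if abs(c[0]) > abs(c[1]):                        # ensure 1 = higher-magnitude centre
        labels = 1 - labels
    return labels
```

Pixels whose criterion falls in the higher-magnitude cluster are declared changed.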

Training Strategy
As shown in Figure 2, the classic AE network is adapted to handle the change detection task. The whole network is trained by minimizing the loss computed from the differential representation of the bi-temporal SAR patches:

Loss = Diff(Z'_X, Z'_Y) + λ_1 Norm(Z_X, Z_Y) + λ_2 Regl(W_AE, W_SE), (6)

where Diff(Z'_X, Z'_Y) = ||Z'_X − Z'_Y||_2^2 denotes the representation differential, Norm(Z_X, Z_Y) is the data constraint term and Regl(W_AE, W_SE) is the weight regularization term. The weights Λ = {λ_1, λ_2} balance the terms in the loss function. The data constraint term ensures that the output of DDSR carries significant information (avoiding the meaningless solution W_AE = 0) by enforcing unit variance of the mapped data, i.e., Var(Z) = E, where E ∈ R^{M×1} is a column vector whose elements are all 1. Note that theoretically a non-zero variance constraint is enough; however, for the sake of simplification, the unit-variance constraint is used in this paper. The weight regularization term is calculated by

Regl(W_AE, W_SE) = ||W_AE||_2^2 + ||W_SE||_2^2 + Σ_{i≠j} Cov(w^i_SE, w^j_SE)^2,

where ||W_AE||_2^2 and ||W_SE||_2^2 are classic regularization terms. The third term controls the orthogonality of W_SE, in which Cov(w^i_SE, w^j_SE) is the correlation coefficient between the i-th and j-th row vectors of W_SE. Theoretically, the self-expressive layer performs like a PCA or SFA approach, for which orthogonality is needed to obtain a complete and non-redundant representation. Without this orthogonality term, the output of DDSR would collapse to a constant vector.
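The three-term loss above can be sketched as follows. The exact forms of the Norm and Regl terms are our reconstruction from the text (unit-variance constraint, weight decay plus row-correlation penalty), so treat this as an assumed sketch rather than the paper's exact implementation:

```python
import numpy as np

def ddsr_loss(Zp_X, Zp_Y, Z_X, Z_Y, W_AE, W_SE, lam1=1.0, lam2=1.0):
    """Sketch of the DDSR loss: Diff + lam1 * Norm + lam2 * Regl.

    Z_X, Z_Y hold a batch of mapped features (one column per patch,
    an assumed layout); Zp_X, Zp_Y are the self-expressive outputs.
    """
    # representation differential of the bi-temporal pair
    diff = np.sum((Zp_X - Zp_Y) ** 2)
    # unit-variance data constraint (assumed form of the Norm term)
    norm = (np.sum((np.var(Z_X, axis=1) - 1.0) ** 2)
            + np.sum((np.var(Z_Y, axis=1) - 1.0) ** 2))
    # weight decay plus orthogonality penalty on the rows of W_SE
    C = np.corrcoef(W_SE)                      # pairwise row correlations
    orth = np.sum(C ** 2) - C.shape[0]         # off-diagonal squared correlations
    regl = np.sum(W_AE ** 2) + np.sum(W_SE ** 2) + orth
    return diff + lam1 * norm + lam2 * regl
```

Minimizing the first term alone would admit the trivial zero solution; the variance and orthogonality terms are what rule it out.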

Implementation Details
Since no labeled data is needed in the training stage, our DDSR is unsupervised. However, DDSR makes the assumption that unchanged pixel pairs are much more numerous than changed ones, since theoretically only unchanged pixel pairs meet the minimization of the proposed loss (Equation (6)). A similar assumption has also been used in the slow feature analysis (SFA) based unsupervised change detection approach in [18]. This assumption might not hold when the given bi-temporal SAR images have a very long time interval (changed pixels/regions outnumber unchanged ones). However, one can easily discard this assumption by introducing a pre-detection strategy (e.g., the classic log-ratio change detection approach) providing some unchanged pixel pairs as training samples. A similar strategy has been used in [30].
Since the proposed network focuses on the change detection task instead of the representation learned by a classic AE network, the network parameters are initialized randomly, not by a pre-trained AE network. In the training stage, all the patch pairs are fed into the DDSR network. The Adam optimization algorithm is applied to minimize the loss (Equation (6)) and obtain the optimal parameters W_AE and W_SE with a learning rate of 0.1. The number of iterations is 1500. In the testing stage, the change criterion r (Equation (4)) is computed pixel by pixel. The classic K-Means clustering method is then performed on {r} to group the pixel pairs into two groups, in which the group with the lower magnitude of cluster center |r| is the unchanged group.
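The optimizer settings above (Adam, learning rate 0.1, 1500 iterations) can be illustrated with a minimal hand-rolled Adam loop. The toy quadratic loss is purely illustrative; only the optimizer hyperparameters come from the paper:

```python
import numpy as np

def adam_minimise(grad_fn, w0, lr=0.1, iters=1500,
                  b1=0.9, b2=0.999, eps=1e-8):
    """Minimal Adam loop mirroring the paper's settings
    (learning rate 0.1, 1500 iterations); grad_fn returns dLoss/dw."""
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)                    # first-moment estimate
    v = np.zeros_like(w)                    # second-moment estimate
    for t in range(1, iters + 1):
        g = grad_fn(w)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        mh = m / (1 - b1 ** t)              # bias correction
        vh = v / (1 - b2 ** t)
        w -= lr * mh / (np.sqrt(vh) + eps)
    return w

# toy quadratic loss (w - 3)^2 as a stand-in for the DDSR loss
w_star = adam_minimise(lambda w: 2.0 * (w - 3.0), np.array([0.0]))
```

In practice one would replace the toy gradient with the gradient of Equation (6) obtained by backpropagation through the AE-like network and the self-expressive layer.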

Experiment
In this section, we investigate the effectiveness of the non-linear part (i.e., the AE-like network) and test our DDSR network with different parameters, e.g., the number of hidden neurons and the weights in the loss. Four real SAR datasets are tested in the experiment to evaluate the superiority and advantages of our proposed method.

Datasets and Evaluation Metrics
Four SAR datasets are used in this experiment; they are described in detail with Figure 4. In order to verify the validity of our proposed method, five metrics are computed to quantitatively evaluate the detection results, i.e., Precision (P), Recall (R), Overall Accuracy (OA), Kappa coefficient and F_1.
P = TP / (TP + FP), R = TP / (TP + FN), (10)

OA = (TP + TN) / (TP + FP + TN + FN), F_1 = 2PR / (P + R),

where TP, FP, TN and FN denote the number of true positives, false positives, true negatives and false negatives respectively, as defined in Table 1. The Kappa coefficient measures the agreement between the detection result and the reference map beyond chance and is computed from the same confusion matrix.
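These metrics follow directly from the confusion matrix of Table 1 and can be computed as below (the function name and dictionary layout are our own; the formulas are the standard definitions):

```python
import numpy as np

def metrics(pred, ref):
    """Precision, Recall, OA, Kappa and F1 from binary change maps
    (1 = changed, 0 = unchanged)."""
    pred = np.asarray(pred).ravel().astype(bool)
    ref = np.asarray(ref).ravel().astype(bool)
    TP = np.sum(pred & ref)        # changed, detected as changed
    FP = np.sum(pred & ~ref)       # unchanged, detected as changed
    FN = np.sum(~pred & ref)       # changed, missed
    TN = np.sum(~pred & ~ref)      # unchanged, detected as unchanged
    n = TP + FP + FN + TN
    P = TP / (TP + FP)
    R = TP / (TP + FN)
    OA = (TP + TN) / n
    # chance agreement for Cohen's Kappa
    pe = ((TP + FP) * (TP + FN) + (FN + TN) * (FP + TN)) / n ** 2
    kappa = (OA - pe) / (1 - pe)
    F1 = 2 * P * R / (P + R)
    return dict(P=P, R=R, OA=OA, Kappa=kappa, F1=F1)
```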

Analysis of Parameter Setting
As described in Section 3, the hyperparameters are selected before running the proposed network, i.e., the number of hidden neurons in the AE-like network, the number of layers of the AE-like network and the weights in the loss. The efficiency of the learned features may be affected by the number of hidden neurons and the number of hidden layers. The weights in the loss function reflect the influence of the balance between the different constraint and objective terms on the detection results. Thus, comparison experiments are conducted here to investigate the proper hyperparameter setting. Besides, there is a strong link between the patch size and the image resolution or the size of the changed regions. Considering the SAR datasets tested in our experiments, we choose the patch size as 5 × 5 by some comparative experiments and keep this patch size in the following experiments.

Number of Hidden Layers and Hidden Neurons
We argue that the number of hidden layers and the number of hidden neurons interact with each other. To choose the parameters of the network, we adopt a grid search method to avoid blindness and randomness, i.e., the number of hidden layers in {0, 1, 2, 3} and the number of hidden neurons in {10, 25, 50, 100}. The weights in the loss function are λ_1 = 1.0 and λ_2 = 1.0. The results are evaluated by Kappa and F_1.
The change detection performance against the number of non-linear hidden layers and hidden neurons in the AE-like network is shown in Figure 5. It can be seen that the detection accuracy increases significantly with the introduction of the non-linear mapping by the AE-like network. In addition, the accuracy of change detection gradually increases with the number of layers, while the number of neurons in the hidden layers has only a slight effect on the detection results. Consequently, in the following experiments, we use the AE-like network with 3 hidden layers and 25 hidden neurons, balancing detection accuracy and computational complexity.
Tables 2-5 list the Kappa and F_1 of the change detection results on the Huangshi, Daye, San Francisco and Guangdong datasets. It can be found that λ_1 and λ_2 have a great influence on the change detection results. An extremely small λ_1 tends to neglect the variance constraint, which leads to failure of the network training (yielding the zero weights W_AE = 0, W_SE = 0). An extremely small λ_2 neglects the covariance constraint, leaving a lot of redundant information among the channels of the output. The change detection results may drop by up to 5% in terms of Kappa and F_1 given unbalanced settings of λ_1 and λ_2. However, this drop only takes place in extreme cases, e.g., {λ_1 = 10, λ_2 = 0.01} and {λ_1 = 0.1, λ_2 = 10}. λ_1 = 1.0 and λ_2 = 1.0 have thus been chosen in the following experiments.

Parameter Setting and Comparison Methods
In order to verify the superiority and efficiency of our proposed method, different change detection approaches are tested as reference methods in this experiment, i.e., (1) the classic mean ratio operator (MR) [1], (2) NORCAMA [25], a generalized likelihood ratio test based change criterion, and (3) SAE + FCM + CNN [30], deep feature based change detection. In our proposed approach, we convert each 5 × 5 patch into a vector as the input of our network. The number of neurons in each of the 3 hidden layers of the AE-like network is 25. Consequently, the number of neurons in the self-expressive layer is 25 as well. The weights in the loss are λ_1 = 1.0 and λ_2 = 1.0.

Experimental Results
The change detection maps are shown in Figures 6-9 and the quantitative metrics are presented in Tables 6-9. From the results, we can see that the classic MR produces noisy detection results and its detection accuracy is lower than that of the other approaches. NORCAMA, with the help of a pre-denoising operation, yields less noisy detection results; however, its detection accuracy is highly dependent on the pre-denoising performance. SAE + FCM + CNN achieves a balance between precision and recall and has less noise than the classic MR. However, it relies heavily on pseudo labels, which may make the final detection accuracy very low when the pre-detection/classification results are poor; moreover, the edges of its detection results are indistinct. Generally, our DDSR network outperforms the reference methods with higher detection accuracy, smoother detection results and clearer edges.

Figure 2 .
Figure 2. The differentially deep subspace representation for synthetic aperture radar (SAR) image change detection. The network consists of an encoder with 3 layers (non-linear mapping), a self-expressive layer (linear transform) and classic K-Means clustering.

Figure 3 .
Figure 3. The encoder network diagram. (a) A simple encoder network consisting of only an input layer, a hidden layer and an output layer. (b) A multi-layer encoder network including an input layer, two hidden layers and an output layer.
(1) Huangshi dataset, as shown in Figure 4a: Sentinel-1 SAR images of Huangshi, Hubei, China, acquired on 8 October 2014 and 19 December 2014. The spatial resolution is 5 m and the image size is 1024 × 1024. (2) Daye dataset (Figure 4b): Sentinel-1 SAR images of Daye, Hubei, China, acquired on 8 October 2014 and 19 December 2014, with an image size of 1024 × 1024. (3) San Francisco dataset (Figure 4c): TerraSAR images of San Francisco, USA, acquired on 5 December 2007 and 16 December 2007. The spatial resolution is 1 m and the image size is 1024 × 1024. (4) Guangdong dataset (Figure 4d): TerraSAR images of Guangdong, China, acquired on 24 May 2008 and 19 December 2008, with an image size of 1024 × 1024. The corresponding ground truth maps are labeled manually, as shown on the right of Figure 4.

Figure 4 .
Figure 4. Datasets tested in the experiments. (a) Huangshi dataset. (b) Daye dataset. (c) San Francisco dataset. (d) Guangdong dataset. From left to right, the bi-temporal SAR images and the corresponding reference change maps. In the reference change maps, the unchanged and changed pixels are gray and white respectively (black is not defined).

Figure 5 .
Figure 5. The influence of the number of non-linear layers and hidden neurons in the AE-like network on the change detection results. The vertical axis represents the Kappa and F1 metrics of the detection results. One horizontal axis denotes the proposed network without the autoencoder (AE)-like network and with the AE-like network containing 1, 2 or 3 hidden layers; the other horizontal axis denotes the number of hidden neurons. Different colors denote different numbers of hidden layers. (a) Change detection results on Huangshi dataset. (b) Change detection results on Daye dataset. (c) Change detection results on San Francisco dataset. (d) Change detection results on Guangdong dataset. The left column shows the Kappa metric and the right column the F1 metric.

Figure 6 .
Figure 6. Change detection results of Huangshi dataset by (a) mean ratio (MR), (b) NORCAMA, (c) SAE + FCM + CNN, (d) our proposed approach. The left shows the detection result with the ground truth mask; the right shows it without the mask.

Figure 7 .
Figure 7. Change detection results of Daye dataset by (a) MR, (b) NORCAMA, (c) SAE + FCM + CNN, (d) our proposed approach. The left shows the detection result with the ground truth mask; the right shows it without the mask.

Figure 8 .
Figure 8. Change detection results of San Francisco dataset by (a) MR, (b) NORCAMA, (c) SAE + FCM + CNN, (d) our proposed approach. The left shows the detection result with the ground truth mask; the right shows it without the mask.

Figure 9 .
Figure 9. Change detection results of Guangdong dataset by (a) MR, (b) NORCAMA, (c) SAE + FCM + CNN, (d) our proposed approach. The left shows the detection result with the ground truth mask; the right shows it without the mask.

Table 1 .
Confusion matrix of change detection results.

Table 2 .
Change detection results of Huangshi dataset with different weights in the loss function.

Table 3 .
Change detection results of Daye dataset with different weights in the loss function.
"-" denotes failure of the network training.

Table 4 .
Change detection results of San Francisco dataset with different weights in the loss function.
"-" denotes failure of the network training.

Table 5 .
Change detection results of Guangdong dataset with different weights in the loss function.
"-" denotes failure of the network training.