Learning Rotation Domain Deep Mutual Information Using Convolutional LSTM for Unsupervised PolSAR Image Classification

Abstract: Deep learning can achieve state-of-the-art performance in polarimetric synthetic aperture radar (PolSAR) image classification when plenty of labeled data are available. However, obtaining a large number of accurately labeled PolSAR samples is very hard, which limits the practical use of deep learning. Unsupervised PolSAR image classification based on deep learning is therefore worthy of further investigation. Inspired by the superior performance of deep mutual information in natural image feature learning and clustering, an end-to-end Convolutional Long Short Term Memory (ConvLSTM) network is used to learn the deep mutual information of polarimetric coherent matrices in the rotation domain with different polarimetric orientation angles (POAs) for unsupervised PolSAR image classification. First, for each pixel, paired "POA-spatio" samples are generated from the polarimetric coherent matrices with different POAs. Second, a specially designed ConvLSTM network, along with deep mutual information losses, is used to learn a discriminative deep mutual information feature representation of the paired data. Finally, the classification results are output directly by the trained network model. The proposed method is trained in an end-to-end manner and does not have cumbersome pipelines. Experiments on four real PolSAR datasets show that the proposed method outperforms several state-of-the-art deep learning unsupervised classification methods.


Introduction
Polarimetric synthetic aperture radar (PolSAR) is a side-looking active imaging system with the advantages of working day and night and under all weather conditions, wide coverage, and a certain penetration capacity. PolSAR has developed rapidly in recent years, and it plays a significant role in Earth observation applications such as land use planning, disaster prevention, environment monitoring, and target detection [1][2][3].
PolSAR image classification is one of the fundamental applications in PolSAR image interpretation. Supervised PolSAR image classification has achieved excellent performance. Many traditional statistical model-based methods and non-neural machine learning [4] methods can achieve good results, such as the CoAS model [5], random forest (RF) [6], support vector machine (SVM) [7], and XGBoost [8]. In [9], two mixture models were proposed for modeling heterogeneous regions in single-look and multi-look polarimetric SAR images, along with their corresponding maximum likelihood classifiers for land cover classification. Feng et al. [10] proposed a classification scheme for forest growth stage types and other cover types using an SVM based on Polarimetric SAR Interferometric (PolInSAR) data. The interferometric polarimetric SAR multi-chromatic analysis (MCA-PolInSAR) signal processing method proposed in [11] confirmed the feasibility of resolving the volume-oriented indetermination problem. Deep learning is a branch of machine learning and it provides state-of-the-art solutions to many problems in the natural image processing field [12,13]. It also shows excellent performance in supervised PolSAR image classification [14][15][16]. Chen et al. [17] used the roll-invariant polarimetric features and hidden polarimetric features in the rotation domain to drive a deep convolutional neural network and improved the classification performance. Liu et al. [18] proposed a polarimetric convolutional neural network based on a new polarimetric scattering coding method to classify PolSAR images by making full use of polarimetric information. Deep learning can automatically learn a discriminative feature representation of the input data. With sufficient labeled training samples, the performance of deep learning based PolSAR image classification methods far surpasses that of traditional machine learning methods [17,19].
Labeled PolSAR data are often insufficient, and labeling PolSAR data accurately is expensive and time consuming [20]. Unsupervised image classification is one of the fundamental problems in information processing and it does not need labeled data. Central grouping approaches, such as k-means, were popular in early computer vision, since they could be computed efficiently. Lee et al. [21] proposed the iterative Wishart classifier, which is the most widely used classifier for PolSAR covariance matrix data. Spectral clustering, which is based on the eigendecomposition of matrices, performs well on clusters of arbitrary shape and is also often used for unsupervised SAR image classification [22,23]. Song et al. [24] designed a computationally tractable and memory-saving affinity matrix for spectral clustering that can be used for large size PolSAR image clustering. With the development of machine learning, an increasing number of machine learning based methods have been proposed for unsupervised PolSAR image classification. Hua et al. [25] presented an unsupervised classification algorithm with an adaptive number of classes for PolSAR data, which is capable of automatically estimating the number of classes. Zou et al. [26] proposed an unsupervised classification framework for PolSAR images by combining superpixel segmentation, Gaussian kernels, consensus similarity network fusion, spectral clustering, and a new post-processing procedure. The non-neural machine learning based methods have achieved promising results for unsupervised PolSAR image classification [27,28], but the current methods still suffer from some problems. First, some methods have cumbersome pipelines, such as pre-processing, feature extraction, clustering, post-processing, and so on. For example, superpixel segmentation is usually used to take advantage of the spatial information of pixels [24,26,28,29]. Some methods over-cluster PolSAR images and manually merge the similar classes to improve the performance [24]. Second, the separation of feature extraction and clustering makes the solution sub-optimal [30]. Third, some methods require huge computing resources and cannot classify large size PolSAR images [31].
Image clustering methods based on deep learning have developed rapidly in natural image processing and they can be coarsely divided into three categories. (1) The combination of traditional clustering algorithms and deep learning. Deep Subspace Clustering (DSC) [32] introduced a novel self-expressive layer between the encoder and decoder of a deep auto-encoder to mimic the "self-expressiveness" property. Subsequently, from the parameters of the self-expressive layer, an affinity matrix was constructed in order to perform spectral clustering to obtain the final clusters. Zhou et al. [33] combined DSC with Generative Adversarial Networks (GAN) to faithfully evaluate the clustering quality. However, these methods often lead to degenerate solutions and they have cumbersome pipelines, such as pre-training, feature post-processing, and clustering mechanisms external to the network [34]. (2) Deep discriminative feature representation learning methods. Donahue et al. [35] added an encoder to GANs for better visual feature extraction. Hjelm et al. [36] performed unsupervised learning of representations by maximizing the mutual information between an input and the output of a deep neural network encoder. DeepCluster [37] used the pseudo-labels computed by k-means as supervision to train deep neural networks. These methods still require extra classifiers or clustering algorithms to output classification results. (3) Deep mutual information based methods. Invariant Information Clustering (IIC) [34] involved a simple mutual information objective function for paired data in a neural network, which was end-to-end and did not need any labels. An end-to-end classification method in deep learning can classify images with a single neural network model. The input of the model is the image or a low level feature and the output is the classification result. To explore and take full advantage of the various kinds of correlations behind the unlabeled data, Deep Comprehensive Correlation Mining (DCCM) [30] combined the Deep Adaptive Clustering (DAC) [38] architecture with pseudo-graph supervision, pseudo-label supervision, and triplet mutual information for unsupervised image clustering.
Deep learning provides a new way for unsupervised PolSAR image classification. Bi et al. [39] proposed an unsupervised PolSAR image classification method that incorporated polarimetric image factorization and deep convolutional networks into a principled framework. At present, deep learning has great potential to further improve the performance of unsupervised PolSAR image classification, and deep learning based unsupervised PolSAR image classification methods are worthy of further investigation. Some methods [25,40] can obtain the optimal number of clusters within the clustering algorithm. However, most unsupervised classification methods use a predefined number of classes in both the natural image processing and remote sensing image interpretation fields [24,26,30,34,41]. In this paper, we also focus on unsupervised PolSAR image classification with a predefined number of classes.
The mutual information based methods IIC and DCCM are end-to-end deep clustering frameworks, which could be used for unsupervised PolSAR image classification. Different methods use different deep features in order to compute mutual information. IIC used the deep prediction features (the output of the softmax layer for the classification task) of sample x and its randomly geometry-transformed version x' in order to compute mutual information. DCCM constructed the positive pairs and negative pairs based on the pseudo-graph and extracted the shallow layer and deep layer features to compute the triplet mutual information. The geometry transformations, which are suitable for natural image processing, are used in both DCCM and IIC. However, because of the imaging mechanism of SAR, it is hard to learn the deep mutual information representation of PolSAR data via geometry transformations alone. For unsupervised PolSAR image classification, IIC struggles to converge when mutual information is used alone. DCCM also cannot learn a discriminative feature representation of PolSAR images and its performance is unsatisfactory. This paper aims to propose an end-to-end unsupervised PolSAR image classification method, which does not have cumbersome pipelines and is simple to apply in practice. In order to further improve the unsupervised classification performance, state-of-the-art unsupervised deep learning algorithms are adapted to unsupervised PolSAR image classification. Therefore, an unsupervised PolSAR image classification method based on a Convolutional Long Short Term Memory (ConvLSTM) [42] network and the Rotation Domain Deep Mutual Information (RDDMI) of the polarimetric coherent matrix is proposed in this paper. Two improvements are introduced in the proposed method in order to better learn the deep mutual information of PolSAR data.
First, the mutual information algorithms of IIC and DCCM are combined to better learn the PolSAR image feature representation and improve the unsupervised PolSAR image classification performance. Second, as a data transformation algorithm unique to the PolSAR data interpretation field, the polarimetric matrix rotation is used in order to improve the deep mutual information learning. The polarimetric response of a target is related to the orientation of the target. The hidden features in the rotation domain can provide useful polarimetric information and be used to improve the classification performance [43][44][45]. Different Polarimetric Orientation Angles (POAs) are used in order to generate a sequence of polarimetric coherent matrices, and then the ConvLSTM is used to learn the rotation domain features of the sequence. ConvLSTM can process long-term dependent sequential data with spatial-temporal information and it has been applied to remote sensing image interpretation [46][47][48]. The advantages of the proposed method are summarized as follows: (1) The deep mutual information in the rotation domain is introduced for unsupervised PolSAR image classification. (2) The proposed method is an end-to-end model and does not have cumbersome pipelines. The input is the low level polarimetric features and the output is the class label of a pixel in a PolSAR image. Extra pre-processing or post-processing is not required. By introducing the mutual information algorithms of IIC and DCCM and the unique polarimetric matrix rotation, the proposed method can extract a more discriminative feature representation and the performance of unsupervised PolSAR image classification is improved. Figure 1 shows the proposed architecture, which consists of three modules: the input module, the network module, and the loss function module.
Figure 1. The architecture of the proposed method. It consists of three modules: the input module, the network module, and the loss function module. The paired sequences of polarimetric coherent matrices of each pixel's neighborhood window, for example the three red boxes in the input module, are the inputs of the network module. The network module is a stack of deep learning layers, such as Convolutional Long Short Term Memory (ConvLSTM) (CL) layers, Convolutional (C) layers, and so on. The loss function module is used for network back propagation.

Methods
The input data are the sequence of polarimetric coherent matrices in the rotation domain; the polarimetric coherent matrix is one of the most commonly used low level features for PolSAR image classification. For each pixel in a PolSAR image, the polarimetric matrix rotation along with geometry transformations is used in order to generate paired sequences of polarimetric coherent matrices, x and x', as the inputs of the deep neural network.
The network used in the proposed method is a convolutional LSTM network. It mainly contains two ConvLSTM layers (CL1 and CL2), three convolutional layers (C1, C2, and C3), three max-pooling layers (M1, M2, and M3), two fully connected layers (FC1 and FC2), a softmax layer, and other auxiliary layers, such as ReLU and batch normalization [49].
The loss functions are the pseudo-label loss, the pseudo-graph loss, and two mutual information losses, which are used to guide the network training. The computation of the IIC mutual information is based on the prediction features of x and x'. The pseudo-label supervision loss, pseudo-graph supervision loss, and triplet mutual information loss were first introduced in DCCM for image clustering. The prediction features of the network are used in order to compute the similarity matrix among samples. Subsequently, the similarity matrix is used to construct the pseudo-graph and pseudo-labels that guide the network training. Based on the pseudo-graph, positive pairs and negative pairs are selected to construct triplet correlations. Finally, the shallow layer and deep layer features of the samples that have triplet correlations are used in order to compute the triplet mutual information loss. The following sections give the details.

Input Polarimetric Features
The deep mutual information learning of each pixel in a PolSAR image requires paired data x and x', which are both low level features of the pixel. A simple way to generate paired features for a pixel is to transform the low level feature x to x' by two transformation algorithms. One is the random geometry transformation algorithm, which is used in IIC and DCCM. The other is the polarimetric matrix rotation algorithm, which is unique to PolSAR data. It can further improve the deep mutual information learning and the performance of unsupervised PolSAR image classification. In this paper, x and x' are two sequences of polarimetric coherent matrices in the rotation domain with different POAs.
The polarimetric information of PolSAR data can be expressed by the polarimetric coherent matrix T. The polarimetric matrix rotation on T is defined as follows [43]:
$$T(\theta) = R_3(\theta)\, T\, R_3^{H}(\theta)$$
where θ denotes the POA, the superscript H denotes the conjugate transpose, and the rotation matrix is
$$R_3(\theta) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos 2\theta & \sin 2\theta \\ 0 & -\sin 2\theta & \cos 2\theta \end{bmatrix}$$
We change the POA from 0 to π/2 with step π/18 and obtain nine POAs $\theta_x$. Subsequently, $\theta_x$ are used to generate nine polarimetric coherent matrices. The polarimetric coherent matrices have different POAs and contain spatial information, so they are named "POA-spatio" sequences. For example, Figure 2 shows the Pauli pseudo-color images of the polarimetric coherent matrices of the RADARSAT-2 Flevoland dataset. The polarimetric coherent matrices in the rotation domain have different polarimetric properties, which can be used to improve the PolSAR image classification performance. The rotation domain polarimetric data of each pixel can be defined as a real-valued vector $t_p$ built from the diagonal elements of T(θ) together with $\mathrm{Re}[T_{ij}]$ and $\mathrm{Im}[T_{ij}]$ of its off-diagonal elements, where Re[·] and Im[·] denote the real part and imaginary part of a complex $T_{ij}$, respectively. For each pixel, the neighborhood window data are used to better preserve the spatial information. The vectors of all pixels in a neighborhood window are used in order to generate the sample x from the nine polarimetric coherent matrices. The size of each sample x is 9 × 9 × w × w, where w denotes the window size, the first nine is the number of POAs, and the second nine is the number of channels, as shown in Figure 3. Each pseudo-color image patch presented in Figure 3 is generated from T(θ), and it denotes the neighborhood window data of a pixel. The deep mutual information learning requires paired data, so the other sample x' is also shown in Figure 3. There are two steps to generate sample x'. First, sample x' is also generated from polarimetric coherent matrices with nine POAs $\theta_{x'}$, which are different from $\theta_x$. Sample x' should be similar to x, so the nine POAs $\theta_{x'}$ are kept close to $\theta_x$: they are obtained by making small changes to $\theta_x$. Second, the random geometry transformations are applied to x'. The geometry transformations include rotation, skewing, scaling, flipping, channel shifting, and so on.
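The rotation and the nine-POA sequence described above can be sketched in plain Python as follows. This is a minimal illustration for a single pixel, not the authors' implementation: the function names, the channel order inside `to_real_vector`, the choice of POAs k·π/18 for k = 0, ..., 8, the `offset` parameter used to mimic the small POA change for x', and the example matrix are all assumptions made for this sketch.

```python
import math

def rotate_coherency(T, theta):
    """Return R3(theta) T R3(theta)^H for a 3x3 complex matrix T given as nested lists."""
    c, s = math.cos(2 * theta), math.sin(2 * theta)
    R = [[1, 0, 0], [0, c, s], [0, -s, c]]
    RT = [[sum(R[i][k] * T[k][j] for k in range(3)) for j in range(3)] for i in range(3)]
    # R is real, so its conjugate transpose is its transpose:
    # (RT R^T)[i][j] = sum_k RT[i][k] * R[j][k]
    return [[sum(RT[i][k] * R[j][k] for k in range(3)) for j in range(3)] for i in range(3)]

def to_real_vector(T):
    """Nine real channels of a Hermitian 3x3 matrix (channel order is an assumption)."""
    return [T[0][0].real, T[1][1].real, T[2][2].real,
            T[0][1].real, T[0][1].imag,
            T[0][2].real, T[0][2].imag,
            T[1][2].real, T[1][2].imag]

def poa_sequence(T, offset=0.0, num_poas=9, step=math.pi / 18):
    """Rotate T by num_poas POAs (k * step + offset) and return one 9-channel vector per POA."""
    thetas = [k * step + offset for k in range(num_poas)]
    return [to_real_vector(rotate_coherency(T, th)) for th in thetas]

# Hypothetical Hermitian coherency matrix of one pixel.
T_example = [[2.0, 0.5 + 0.2j, 0.1 - 0.3j],
             [0.5 - 0.2j, 1.0, 0.2 + 0.1j],
             [0.1 + 0.3j, 0.2 - 0.1j, 0.5]]

x_pixel = poa_sequence(T_example)                              # 9 POAs x 9 channels
x_prime_pixel = poa_sequence(T_example, offset=math.pi / 180)  # slightly shifted POAs
```

Stacking these per-pixel sequences over a w × w neighborhood window would give a sample of size 9 × 9 × w × w; the random geometry transformations applied to x' are not shown here.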
In other words, for each pixel, sample x is generated from the rotation domain polarimetric coherent matrices, and then x is converted to x' by two procedures: the polarimetric matrix rotation and random geometry transformations. By introducing the unique polarimetric matrix rotation, the network can better learn the deep mutual information of PolSAR data.

Network Architecture
Figure 4 shows the network used in our method, where N denotes the number of classes. The input sample x or x' is sequence data and it has spatial information, so the first two ConvLSTM layers are used in order to capture the rotation domain features from the sequence of polarimetric coherent matrices. Subsequently, some convolutional layers, as well as fully connected layers, are used in order to further learn deep features of the input sample. Finally, the softmax layer outputs a class probability vector and the argmax function is used in order to compute the class of the sample. The shallow and deep layer features, which are used to compute the triplet mutual information loss, are the outputs of the first max-pooling layer and the second fully connected layer, respectively.
LSTM can model long-range dependencies, but it contains too much redundancy for spatial data [42]. ConvLSTM was proposed in order to solve this problem. ConvLSTM replaces the fully connected gate layers of the LSTM with convolutional layers, so it is capable of encoding sequence data that have spatial information [47]. In the ConvLSTM equations, * denotes the convolution operation, ∘ denotes the Hadamard product, W denotes learnable weights, b denotes the bias, $x_t$ denotes the input at time t, $f_t$ denotes the forget gate, $C_t$ denotes the cell state, $o_t$ denotes the output gate, and $h_t$ denotes the final output state at time t. σ and tanh are the activation functions.
The "POA-spatio" polarimetric coherent matrices can be regarded as the time steps of ConvLSTM, and sample x or x' is "POA-spatio" sequence data. Therefore, ConvLSTM fits well with the deep feature representation learning of the PolSAR "POA-spatio" sequence data.
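For reference, a widely used form of the ConvLSTM update that is consistent with the symbols listed above is given below. It is a sketch of the common variant without peephole connections and is not necessarily the exact variant used by the authors; the input gate $i_t$ and the candidate cell state $\tilde{C}_t$ are symbols introduced here for completeness, and $[h_{t-1}, x_t]$ denotes channel-wise concatenation.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f * [h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i * [h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_C * [h_{t-1}, x_t] + b_C\right) \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tilde{C}_t \\
o_t &= \sigma\left(W_o * [h_{t-1}, x_t] + b_o\right) \\
h_t &= o_t \circ \tanh\left(C_t\right)
\end{aligned}
```

With nine POAs per sample, t would run over the nine rotated coherent matrices, so each "time step" is one w × w patch with nine channels.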

IIC Mutual Information
Let $X = \{x_1, \ldots, x_K\}$ and $X' = \{x'_1, \ldots, x'_K\}$ be paired samples in a PolSAR image, where $x_i$ or $x'_i$ is the i-th sample and K is the total number of sample pairs. IIC aims to learn a representation Φ that preserves what is in common between paired data x and x', while discarding instance-specific details [34], which can be achieved by maximizing the mutual information (MI):
$$\max_{\Phi} I(\Phi(x), \Phi(x')) \tag{5}$$
where Φ is a neural network. The output of the softmax layer z = Φ(x), which is the prediction feature of sample x, can be interpreted as the distribution of a discrete random variable y over N classes, formally given by $P(y = n \mid x) = \Phi_n(x)$ [34]. Let y and y' be the cluster assignment variables of x and x', respectively. Because y and y' are not independent after marginalization over the dataset [34], the joint probability distribution of y and y' is given by an N × N matrix P. The element at row n and column n' of the matrix P is $P_{nn'} = P(y = n, y' = n')$. Subsequently, P has the following form:
$$P = \frac{1}{K} \sum_{i=1}^{K} \Phi(x_i)\, \Phi(x'_i)^{T}$$
The marginals $P_n = P(y = n)$ and $P_{n'} = P(y' = n')$ can be computed by summing over the rows and columns of the matrix P [34]. Plugging the matrix P into the mutual information expression [50], Equation (5) can be computed equivalently as:
$$I(y, y') = \sum_{n=1}^{N} \sum_{n'=1}^{N} P_{nn'} \ln \frac{P_{nn'}}{P_n\, P_{n'}}$$
Subsequently, the IIC mutual information loss can be formulated as the negative of this mutual information, $-I(y, y')$, so that minimizing the loss maximizes the mutual information.
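The computation of the joint matrix P and the mutual information from paired softmax outputs can be sketched as below. This is a minimal plain-Python illustration under stated assumptions: the function names and the `eps` guard against log(0) are hypothetical, and any additional processing used in the IIC reference implementation is not reproduced here.

```python
import math

def joint_distribution(z, z_prime):
    """P = (1/K) * sum_i z_i z'_i^T for K paired probability vectors of length N."""
    K, N = len(z), len(z[0])
    P = [[0.0] * N for _ in range(N)]
    for zi, zpi in zip(z, z_prime):
        for n in range(N):
            for m in range(N):
                P[n][m] += zi[n] * zpi[m] / K
    return P

def mutual_information(P, eps=1e-12):
    """I(y, y') = sum_{n, n'} P[n][n'] * ln(P[n][n'] / (P_n * P_n'))."""
    N = len(P)
    row = [sum(P[n]) for n in range(N)]                       # marginal of y
    col = [sum(P[n][m] for n in range(N)) for m in range(N)]  # marginal of y'
    mi = 0.0
    for n in range(N):
        for m in range(N):
            if P[n][m] > eps:  # terms with zero joint probability contribute nothing
                mi += P[n][m] * math.log(P[n][m] / (row[n] * col[m]))
    return mi

def iic_loss(z, z_prime):
    """Negative mutual information, so that minimizing it maximizes I(y, y')."""
    return -mutual_information(joint_distribution(z, z_prime))

# Two confident, consistent, balanced predictions: the mutual information equals ln(2).
z_a = [[1.0, 0.0], [0.0, 1.0]]
mi_consistent = mutual_information(joint_distribution(z_a, z_a))
# Uninformative predictions: the mutual information is zero.
z_b = [[0.5, 0.5], [0.5, 0.5]]
mi_uniform = mutual_information(joint_distribution(z_b, z_b))
```

The two toy cases show why the objective avoids degenerate solutions: predictions that are both confident and balanced across classes score highest, while uniform predictions score zero.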

Pseudo Graph and Pseudo Label Supervision Loss
The pseudo graph and pseudo label are used in order to guide the network training and the computation of triplet mutual information. The pseudo-graph is used to explore the binary correlation between samples and the pseudo-label loss is used to make full use of category information behind the data [30].
The neural network Φ(x) outputs the prediction feature of the input data x. Based on the prediction feature vectors $z_i$ and $z_j$ of the softmax layer, the cosine similarity of two samples can be calculated by the following equation:
$$S_{ij} = \frac{z_i \cdot z_j}{\|z_i\|_2\, \|z_j\|_2}$$
In [30], a large threshold $thresh_w$ is applied to the similarity matrix S in order to construct the pseudo-graph W:
$$W_{ij} = \begin{cases} 1, & S_{ij} > thresh_w \\ 0, & \text{otherwise} \end{cases} \tag{10}$$
If the cosine similarity of two samples is larger than $thresh_w$, then the two samples are considered to be of the same class, and during the network training the cosine similarity of these samples will be maximized. Otherwise, the samples are considered to be of different classes and the cosine similarity will be minimized. The pseudo-graph supervision loss has the following form:
$$L_{PG} = \sum_{x_i, x_j \in X} l_g(\Phi(x_i), \Phi(x_j); W_{ij})$$
where $l_g$ is the binary cross-entropy loss [38]. If the entries $S_{ij}$ of the similarity matrix S are assumed to be distinct from each other, then X can be divided into exactly N partitions $\{P_1, P_2, \ldots, P_N\}$ by a threshold t [30]. The samples that have high cosine similarity will be in the same partition, and the partition index n can be set as the pseudo-label of each x in that partition. The pseudo-label can be formulated as:
$$y_i = \arg\max_{n} [\Phi(x_i)]_n \tag{12}$$
where $[\cdot]_n$ denotes the n-th component of the prediction vector. The probability of the predicted pseudo-label is $p_i = \max_n [\Phi(x_i)]_n$. Because the optimization problem is not solved exactly, a large threshold $thresh_l$ is set so that only highly confident pseudo-labels are selected to train the network [30].
The pseudo-label supervision loss is formulated as:
$$L_{PL} = \sum_{x_i \in X} V_i \cdot l_l(\Phi(x_i), y_i)$$
where $l_l$ is the cross-entropy loss and $V_i$ equals 1 if $p_i > thresh_l$ and 0 otherwise.
The local robustness assumption is used in [30]. The feature representations of x and its transformed version x' should be invariant, which means Φ(x) ≈ Φ(x'), and the labels of x and x' should be the same. Accordingly, the feature invariant loss has the following form:
$$L_{r} = \sum_{x_i \in X} l_r(\Phi(x_i), \Phi(x'_i))$$
where $l_r$ denotes the $l_2$-norm, which measures the distance between the deep features of x and x'. Subsequently, the pseudo-graph and pseudo-label information computed from the original samples should also be consistent with the transformed samples. The corresponding loss functions can be formulated as:
$$L'_{PG} = \sum_{x'_i, x'_j \in X'} l_g(\Phi(x'_i), \Phi(x'_j); W_{ij}), \qquad L'_{PL} = \sum_{x'_i \in X'} V_i \cdot l_l(\Phi(x'_i), y_i)$$
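The construction of the similarity matrix, the pseudo-graph, and the confident pseudo-labels can be sketched as follows. This is a minimal plain-Python illustration, not the DCCM code: the function names and the toy prediction vectors are hypothetical, the default thresholds of 0.9 follow the values reported in the experiments, and the losses themselves are not implemented.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two prediction vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pseudo_graph(Z, thresh_w=0.9):
    """W[i][j] = 1 if the cosine similarity of samples i and j exceeds thresh_w, else 0."""
    return [[1 if cosine_similarity(zi, zj) > thresh_w else 0 for zj in Z] for zi in Z]

def pseudo_labels(Z, thresh_l=0.9):
    """Return (label, is_confident) per sample: the argmax class and whether its probability exceeds thresh_l."""
    result = []
    for z in Z:
        p = max(z)
        result.append((z.index(p), p > thresh_l))
    return result

# Toy softmax outputs for four samples and three classes.
Z = [[0.95, 0.03, 0.02],
     [0.92, 0.05, 0.03],
     [0.04, 0.93, 0.03],
     [0.40, 0.35, 0.25]]

W = pseudo_graph(Z)
labels = pseudo_labels(Z)
```

In this toy batch only the first two samples are linked in W, and the fourth sample receives a label but is not confident enough to contribute to the pseudo-label supervision.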

Triplet Mutual Information
The instance-level mutual information between the shallow layer and deep layer features of the same sample should be maximized. The instance-level mutual information of two random variables (D, S) can be estimated by the Jensen-Shannon divergence (JSD) between samples coming from the joint distribution J and their product of marginals M [36,51]. Different layer features of the same sample should follow the joint distribution. If the features are from different samples, then they should follow the marginal product distribution [36]. The JSD version of mutual information is defined as:
$$I^{JSD}(D, S) = \mathbb{E}_{J}\left[-sp(-T(d, s))\right] - \mathbb{E}_{M}\left[sp(T(d, s))\right] \tag{18}$$
where d denotes the deep layer features, s denotes the shallow layer features, and sp(z) is the softplus function $sp(z) = \log(1 + e^{z})$. The discriminator T is used to distinguish whether d and s are sampled from the joint distribution or not. It is a convolutional neural network, which uses the deep layer and shallow layer features of samples as input. The output feature maps of T are the inputs of sp(z). The detailed implementation of the discriminator T is introduced in [36]. For two different samples, $x_1$ and $x_2$, which belong to the same class, the mutual information between $x_1$'s shallow layer representation and $x_2$'s deep layer representation should also be maximized. Therefore, in [30], the pseudo-graph W in Equation (10) is used to select positive pairs and negative pairs to construct triplet correlations. When two samples are of the same class, their features are positive pairs; otherwise, the features are negative pairs. In this way, the deep neural network can learn triplet-level mutual information rather than only instance-level mutual information. Equation (18) can be expanded in order to calculate the triplet mutual information. Let $d^i_j$ and $s^i_j$ denote the deep layer features and shallow layer features of sample j, whose class is i. Subsequently, $D^i = \{d^i_1, d^i_2, \cdots, d^i_n\}$ and $S^i = \{s^i_1, s^i_2, \cdots, s^i_n\}$ are the feature sets of class i. Variables D and S are defined by $D = \{D^1, D^2, \cdots, D^N\}$ and $S = \{S^1, S^2, \cdots, S^N\}$, respectively. Triplet mutual information can be formulated as:
$$I^{JSD}(D, S) = \mathbb{E}_{(D, S)}\left[-sp(-T(d, s))\right] - \mathbb{E}_{D \times S}\left[sp(T(d, s))\right]$$
where the first expectation is taken over positive pairs and the second over negative pairs. Afterwards, the triplet mutual information loss is the negative of this quantity, so that minimizing the loss maximizes the triplet mutual information. The triplet mutual information has the advantage that, for two different samples $x_1$ and $x_2$ of the same class, the mutual information between $x_1$'s shallow layer representation and $x_2$'s deep layer representation is also maximized. Subsequently, the deep neural network can learn more discriminative representations through the triplet mutual information.
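Given discriminator scores, the JSD-based estimate can be sketched as below. This is a minimal plain-Python illustration: the discriminator T itself is not implemented, and the function names and the lists of scores for positive (joint) and negative (marginal) pairs are hypothetical inputs.

```python
import math

def softplus(z):
    """Numerically stable sp(z) = log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def jsd_mutual_information(joint_scores, marginal_scores):
    """E_J[-sp(-T)] - E_M[sp(T)] from discriminator scores of positive and negative pairs."""
    positive_term = sum(-softplus(-t) for t in joint_scores) / len(joint_scores)
    negative_term = sum(softplus(t) for t in marginal_scores) / len(marginal_scores)
    return positive_term - negative_term

# A discriminator that cannot separate the pairs (all scores 0) gives -2 ln 2.
mi_uninformative = jsd_mutual_information([0.0, 0.0], [0.0, 0.0])
# A discriminator that separates the pairs well pushes the estimate towards its upper bound of 0.
mi_separated = jsd_mutual_information([10.0, 10.0], [-10.0, -10.0])
```

For the triplet version, the positive scores would come from shallow and deep features of samples linked in the pseudo-graph W, and the negative scores from unlinked samples; a loss would then be the negative of the returned value.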

Model Optimization
By combining the investigations of IIC and DCCM, the final objective of the proposed method, Equation (21), minimizes a weighted combination of the loss functions described above, where α, β, and γ control the importance of the corresponding loss functions, $\bar{L}_{PG} = L_{PG} + L'_{PG}$, and $\bar{L}_{PL} = L_{PL} + L'_{PL}$. The proposed method is trained in a minibatch based end-to-end manner. After the model is trained, the cluster label can be computed by Equation (12) using the output of the softmax layer. The overall training steps are similar to [30], as shown below:
• generate paired "POA-spatio" samples X and X' from the polarimetric coherent matrices in the rotation domain;
• initialize the parameters of network Φ randomly;
• for each randomly selected minibatch $X_B$ and $X'_B$, compute the shallow layer, deep layer, and softmax layer features of $\Phi(x_i)$ and $\Phi(x'_i)$;
• compute the similarity matrix S, pseudo-graph W, and the pseudo-labels;
• select positive and negative pairs based on W;
• compute the final loss by Equation (21);
• update the parameters of Φ; and,
• calculate the unsupervised classification label by Equation (12) after the network is well trained.

Datasets
An airborne PolSAR image and three spaceborne PolSAR images are used as the experiment datasets in this section in order to fully demonstrate the performance of the proposed method. The detailed data information is presented in Table 1. For each pixel in a PolSAR image, paired data x and x' with a size of 9 × 9 × w × w are generated as the training samples. Except for the pixels in the edge areas of a PolSAR image, all pixels are used for training.
Figure 5 shows the Pauli pseudo-color image and ground truth map. The scattering power of vegetation 1 is stronger than that of vegetation 2. Vegetation 1 is shrubs or trees, and vegetation 2 is grasslands or reeds. The ground truth map was first used in [8].

RADARSAT-2 Flevoland Dataset
The other spaceborne dataset is a C-band RADARSAT-2 (RS-2) PolSAR dataset, which is well known and widely used for classification performance evaluation. It covers the scene of Flevoland in the Netherlands, with an image size of 1400 × 1200 pixels. The spatial resolution is 12 m in the range direction and 8 m in the azimuth direction. According to high resolution optical images, this dataset contains four land cover classes, which are water, farmland, building, and forest. Figure 6 shows the ground truth map and Pauli pseudo-color image.

AIRSAR Flevoland Dataset
The airborne dataset was acquired by the NASA/JPL AIRSAR L-band system and it is a four-look fully polarimetric image. This dataset also covers the scene of Flevoland, the Netherlands. The spatial resolution is 6 m in the range direction and 12 m in the azimuth direction. The image size is 750 × 1024 pixels. This dataset has a well-established ground truth map [52] and it is widely used for supervised PolSAR image classification. It contains 11 land cover types, which are forest, water, bare soil, and eight crop types. Figure 7 shows the ground truth map and Pauli pseudo-color image.

RADARSAT-2 Wuhan Dataset
The last spaceborne dataset was acquired by the RADARSAT-2 C-band PolSAR system in fine quad-pol mode and it covers the central area of Wuhan city, China. The image size is 5500 × 2400 pixels and the spatial resolution is 12 m in the range direction and 8 m in the azimuth direction. A total of three main land cover types are identified in this scene, which are water, building, and forest. The Pauli pseudo-color image and ground truth map are shown in Figure 8a,b, respectively. Wuhan is a big city and it contains high density buildings with different orientations. For example, the two red boxes in the Pauli pseudo-color image are building areas and the corresponding optical images are shown in Figure 8c,d. The optical images clearly show high density buildings with different orientations. This dataset and ground truth map were first used in [20].

Experiments
The implementation of the proposed method is based on the PyTorch implementations of DCCM (https://github.com/Cory-M/DCCM) and IIC (https://github.com/xu-ji/IIC), and most of the training parameters are not changed. The RMSprop optimizer with lr = 1e-4 is used. The thresholds $thresh_l$ and $thresh_w$ are both set to 0.9. The parameters that control the importance of different losses are α = 1, β = 5, and γ = 0.05. The probability outputs of the softmax layer are used in order to compute the IIC mutual information and the similarity matrix S. The shallow layer feature and deep layer feature that are used to compute the triplet mutual information are the outputs of M1 and FC2, respectively. The discriminator for triplet mutual information estimation is the same as in DCCM. For all of the PolSAR datasets, the neighborhood window size w is 15.
The overall accuracy (OA), kappa coefficient, purity, and entropy [28,41] are used in order to evaluate the classification performances.
• OA is one of the most commonly used measures for classification performance evaluation and it can be formulated as
$$OA = \frac{\hat{n}}{K} \tag{22}$$
where K denotes the total number of samples and $\hat{n}$ denotes the number of correctly classified samples.
• Kappa is an indicator of consistency and it can be computed by the following equation
$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
where $p_o$ is OA, $p_e = \frac{\sum_i n_i \tilde{n}_i}{K \times K}$, $n_i$ is the number of samples that belong to class i, and $\tilde{n}_i$ is the number of samples that are classified to class i.
• Purity and entropy are two commonly used measures for clustering performance evaluation. The higher the purity, the better, and the lower the entropy, the better. The two measures are defined as follows [53]
$$Purity = \sum_{r=1}^{k} \frac{1}{n} \max_{i} n_r^i, \qquad Entropy = \sum_{r=1}^{k} \frac{n_r}{n} \left( -\frac{1}{\log q} \sum_{i=1}^{q} \frac{n_r^i}{n_r} \log \frac{n_r^i}{n_r} \right)$$
where n is the total number of data points, q is the number of classes, k is the number of clusters, $n_r$ is the size of cluster r, and $n_r^i$ is the number of data points in class i clustered in cluster r.
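The four measures can be sketched in plain Python as follows. This is a minimal illustration with hypothetical function names and toy labels; it assumes at least two classes and that, for OA and kappa, the cluster indices have already been mapped to class indices, a step that is not shown here.

```python
import math
from collections import Counter

def overall_accuracy(y_true, y_pred):
    """Fraction of correctly classified samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def kappa(y_true, y_pred):
    """(p_o - p_e) / (1 - p_e) with p_e = sum_i n_i * n~_i / (K * K)."""
    K = len(y_true)
    p_o = overall_accuracy(y_true, y_pred)
    n_true, n_pred = Counter(y_true), Counter(y_pred)
    p_e = sum(n_true[c] * n_pred[c] for c in n_true) / (K * K)
    return (p_o - p_e) / (1 - p_e)

def purity_and_entropy(y_true, clusters):
    """Purity and normalized entropy of a clustering with respect to the true classes."""
    n = len(y_true)
    q = len(set(y_true))                     # number of classes (assumed >= 2)
    counts = Counter(zip(clusters, y_true))  # (cluster r, class i) -> n_r^i
    sizes = Counter(clusters)                # cluster r -> n_r
    purity = sum(max(c for (r, _), c in counts.items() if r == rr) for rr in sizes) / n
    entropy = 0.0
    for rr, n_r in sizes.items():
        h = -sum((c / n_r) * math.log(c / n_r) for (r, _), c in counts.items() if r == rr)
        entropy += (n_r / n) * h / math.log(q)
    return purity, entropy

# Toy example: six samples, two classes, one sample of class 0 assigned to cluster 1.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]

oa = overall_accuracy(y_true, y_pred)
k = kappa(y_true, y_pred)
purity, entropy = purity_and_entropy(y_true, y_pred)
```

In the toy example, cluster 0 is pure and contributes zero entropy, while cluster 1 mixes both classes and accounts for all of the entropy.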
DAC [38], DCCM, DCCM+IIC, and three traditional unsupervised PolSAR image classification methods, namely k-means, Wishart clustering, and ASC-S [31], are compared with the proposed method. The original implementation of ASC-S over-clustered the PolSAR images and merged similar classes manually. In this paper, ASC-S directly clusters the PolSAR image into the number of classes defined in the ground truth map. DCCM+IIC simply combines DCCM with the IIC mutual information loss. The input data of DCCM+IIC are generated from the polarimetric coherent matrix T, and only random geometric transformations are used to generate paired samples. No rotation domain data are used. In this way, we can show more clearly that the polarimetric matrix rotation in the proposed method improves deep mutual information learning and the performance of unsupervised PolSAR image classification. Random forest (RF) is used as the baseline and it is trained in a supervised manner. For each PolSAR dataset, 25,000 labeled samples are randomly selected to train RF. We run all of the methods multiple times and report the best classification results.

Results of GF-3 Wuhan
Figure 9 shows the classification results of the GF-3 Wuhan dataset and Table 2 shows the performance evaluation. Because the backscattering property of some pixels in the farmland area is similar to that of the building type, the farmland and building areas are difficult to cluster in this dataset. The classification accuracies of k-means, Wishart, and ASC-S are low. ASC-S splits the water area into two land cover types and cannot distinguish the farmland and building types. The results of DAC and DCCM are comparable. The result of DCCM+IIC is much better. The proposed RDDMI achieves the best classification result. In particular, the classification accuracies of the farmland and building types are both good, and the OA is 4.75% higher than that of DCCM+IIC. The kappa coefficient, purity, and entropy of the proposed method are also the best. Therefore, the polarimetric features in the rotation domain can improve unsupervised PolSAR data classification.

Results of RS-2 Flevoland
In the RS-2 Flevoland dataset, the farmland type has varied backscattering properties. The backscattering property of some small farmland areas is similar to that of the forest type. Besides, some small forest areas are interspersed among the building areas, so the forest and building types are hard to cluster. Figure 10 and Table 3 illustrate that k-means, Wishart, ASC-S, and DAC achieve poor results for the farmland, forest, or building types. The classification accuracies of DCCM, DCCM+IIC, and the proposed method are much better than those of the above four methods, and the OA of the proposed method is 1.07% higher than that of DCCM+IIC. Furthermore, the performance of the proposed method is very close to that of the supervised RF, which shows the superiority of the proposed method.

Results of AIRSAR Flevoland
A total of 11 land cover types are identified in the AIRSAR Flevoland dataset, which makes it challenging for unsupervised classification. This dataset contains many unidentified areas, which may contain new land cover types; hence, only the labeled areas are used for training and evaluation. In this dataset, some land cover types have similar backscattering properties. Besides, the observed polarimetric matrices from the same land cover type may also be quite different. None of the seven methods achieves impressive classification results, as shown in Figure 11 and Table 4. The classification accuracies of many land cover types in the results of k-means, Wishart, ASC-S, DAC, and DCCM are zero or very low. The backscattering property of the water type is similar to that of other types, so DCCM+IIC and the proposed method also do not classify the water type correctly. However, the OA, kappa, purity, and entropy of the proposed method are still much better than those of the other six methods. The OA of RDDMI is 5.11% higher than that of DCCM+IIC. This again shows that the polarimetric matrix rotation is helpful for deep mutual information learning and that it can improve the performance of unsupervised PolSAR image classification.

Results of RS-2 Wuhan
The RS-2 Wuhan dataset contains buildings with different orientations, which makes it difficult to classify the building type correctly. Figure 12 and Table 5 illustrate the classification results and the performance evaluation, respectively. The result of k-means is very poor. The results of Wishart, DAC, DCCM, and DCCM+IIC are better, but the accuracy of the building type or the forest type is not high. The OA of RDDMI is the best and the accuracies of the building and forest types are both high. The kappa, purity, and entropy of RDDMI are also the best. The proposed method uses rotation domain polarimetric coherent matrices and shows high robustness to buildings with different orientations. Besides, the OA of the proposed method is only 2.14% lower than that of the supervised RF, which again shows that the proposed method performs well for unsupervised PolSAR image classification.

Semi-Supervised Classification
The proposed method can learn a discriminative feature representation, which can be used for semi-supervised classification. RF is used as the classifier for the semi-supervised classification. Three kinds of data are used as the input of RF. One is the polarimetric coherent matrix T, and the other two are the features from the two fully connected layers FC 1 and FC 2 on the top of the network. The output of FC 1 is a 64-dimensional vector and the output of FC 2 is an N-dimensional vector, where N is the number of classes. The GF-3 Wuhan, RS-2 Flevoland, and AIRSAR Flevoland datasets are used for the semi-supervised classification performance evaluation. For each dataset, the classification results with different numbers of labeled training samples are illustrated. When fewer than 1000 labeled training samples are used, the experiments are considered to be semi-supervised classification [54]. We run all of the experiments multiple times and report the best results.

Table 6 shows the semi-supervised classification results of the GF-3 Wuhan dataset. When the number of labeled training samples is 25,000, the OA of FC 1 is the best and the OA of T is a little lower. When the number of samples reduces to 1000, the OA of T decreases a lot. The OA of FC 1 is the best and the OA of FC 2 is the lowest. When the number of samples is 100, the OAs of T and FC 2 are close to each other, and the OA of FC 1 is the best.

Table 7 shows the classification results of the RS-2 Flevoland dataset. When the number of samples is 25,000, the OA of FC 1 is high, and the OAs of T and FC 2 are also good. When the number of samples reduces to 1000 and 100, the OA of T decreases a lot. In particular, when only 100 labeled training samples are used, the OA of T is very low, only 66.98%. The OAs of FC 1 and FC 2 decrease a little and the performance of FC 1 is the best.
For the AIRSAR dataset, the classification result of FC 1 is very good when the number of samples is 25,000, and the OAs of T and FC 2 are much lower, as shown in Table 8. When the number of samples reduces to 110, all of the classification accuracies decrease. The OA of FC 1 is still good, but the OA of T decreases to 49.44%, which is very low. The above analysis shows that the feature representation of FC 1 can achieve good results in semi-supervised classification for all three datasets with a commonly used RF classifier. The classification results of the FC 2 feature representation are also respectable. When the number of samples is very low, the performance of the low-level polarimetric feature T is poor, and it is not suitable for semi-supervised PolSAR image classification. Therefore, the proposed method can learn a discriminative feature representation in an unsupervised manner and improve the performance of semi-supervised PolSAR image classification.

Training Accuracy Trend
In this section, the classification accuracies of the four datasets during the training process are discussed, as shown in Figure 13. As the number of training steps increases, the cluster accuracy also increases and reaches a relatively stable value. For the GF-3 Wuhan, AIRSAR Flevoland, and RS-2 Wuhan datasets, the accuracies of the proposed method are always higher than those of DAC, DCCM, and DCCM+IIC during training. For the RS-2 Flevoland dataset, the accuracies of DCCM, DCCM+IIC, and the proposed method are close to each other, but the accuracy of the proposed method is still a little higher than that of DCCM+IIC. Besides, for the four datasets, the performance of DCCM+IIC is better than that of DCCM, and the performances of DCCM and DAC are comparable. The combination of the deep mutual information algorithms of IIC and DCCM can improve the classification accuracies. By learning the rotation domain deep mutual information while using the IIC and DCCM mutual information algorithms, the proposed method is superior to DAC, DCCM, and DCCM+IIC, and it can achieve good performance for unsupervised PolSAR image classification.
As shown in Figure 13, the proposed method also has a limitation. As with other unsupervised methods, its cluster accuracy still fluctuates within a small range even after long training. However, this fluctuation is very small, smaller than that of the other methods, and it has limited impact on the cluster performance. Therefore, the proposed method has promising value in practical applications.

Conclusions
In this paper, the convolutional LSTM and two mutual information losses are used to learn the rotation domain deep mutual information of "POA-spatio" polarimetric coherent matrices for unsupervised PolSAR image classification. The proposed method achieves good performance on four real PolSAR image datasets and outperforms the six unsupervised classification methods it is compared with. The proposed method is end-to-end, and no extra preprocessing or post-processing is needed. The deep mutual information of PolSAR images in the rotation domain is very helpful for discriminative feature representation learning, and the proposed method also shows good performance in semi-supervised PolSAR image classification.