Geometry-Aware Discriminative Dictionary Learning for PolSAR Image Classiﬁcation

: In this paper, we propose a new discriminative dictionary learning method based on Riemann geometric perception for polarimetric synthetic aperture radar (PolSAR) image classiﬁcation. We made an optimization model for geometry-aware discrimination dictionary learning in which the dictionary learning (GADDL) is generalized from Euclidian space to Riemannian manifolds, and dictionary atoms are composed of manifold data. An efﬁcient optimization algorithm based on an alternating direction multiplier method was developed to solve the model. Experiments were implemented on three public datasets: Flevoland-1989, San Francisco and Flevoland-1991. The experimental results show that the proposed method learned a discriminative dictionary with accuracies better those of comparative methods. The convergence of the model and the robustness of the initial dictionary were also veriﬁed through experiments.


Introduction
Polarimetric synthetic aperture radar (PolSAR) is a powerful tool in remote sensing, which transmits and receives electromagnetic waves in different states. Unlike 2D images, SAR complex images containing four polarized matrices could provide more detailed information using different polarimetric channels. Due to increasing demands of disaster assessment, field interpretation, and environmental monitoring, PolSAR image classification attracts more and more attention, in which the core problem is the feature representation of PolSAR images.
Until now, the representation of PolSAR images has still been challenging. The polarimetric decomposition methods [1][2][3], the informative signature methods [4][5][6][7][8][9], the dimensional reduction methods [10][11][12][13] and the sparse representation methods [14][15][16][17][18] are four ways to represent PoISAR images. Generally, the decomposition methods could not represent the original data perfectly because some information is lost while decomposing, and the classification performance is not desired. The polarimetric SAR response contains three real and three complex parameters, and signatures contain the inherent characteristics of PolSAR data. As the informative signatures are correlated with each other, these informative signature methods result in the curse of dimensionality and the high computational complexity of classifiers. Moreover, the existing dimensional reduction methods are pixel-wise, which neglects the structure of PolSAR images.
Recently, inspired by the success of sparse representation in image classifications and image restoration, sparse representation has been used for PolSAR image classification, and sparse representation has achieved promising results [15,16] in Euclidean space. The classical descriptors of polarimetric SAR, covariance and coherency matrices, are of Hermitian semidefinite and form a Riemannian manifold. These sparse representation-based methods [17,18] implement sparse representations of PolSAR images under a Riemannian manifold, and then train a classifier to achieve superior classification results. Admittedly, these results show that the Riemannian manifold is a better representation space for PolSAR images.
However, the limitations of the above-mentioned sparse representation methods are three-fold: (1) they are implemented on vector-valued data; (2) the Riemannian structure is neglected; (3) the classification model is not jointly optimized-that is, sparse representation is separated from the classification.
In order to solve the problems, in this paper, we propose the geometry-aware discriminative dictionary learning method (GADDL). In contrast to the existing vector-valued sparse representation method, we made a tensor-valued dictionary with which the data in the form of symmetric positive definite (SPD) matrices are represented as sparse conic combinations of SPD atoms. Moreover, we made a joint optimization model which unifies the sparse representation and classifier. Concretely, in order to avoid losing implicit information caused by extracting features from a Hermitian positive definite (HPD) matrix, each of the dictionary atoms is described as an HPD matrix directly. Considering that conventional Euclidean metrics are not suitable for a Riemannian manifold, various divergences and metrics are implemented. In fact, this framework is robust in classifying different types of land cover, and gives perfect performance in all experiments. We highlight the main contributions of this paper as follows: (1) We propose a novel geometry-aware discriminative dictionary learning framework for PolSAR image classification. Each data point is represented as a nonnegative linear combination of HPD atoms from the learned dictionary with a large margin of constraint, such that the coding coefficient for the original data point is characterized by encoding the category information and intrinsic Riemannian geometry information.
(2) We present an efficient optimization algorithm to solve the proposed model. All the variables, including the atoms of the HPD dictionary, the coding coefficients and the large margin hyperplanes, can be jointly training in a unified framework.
(3) We conducted the extensive evaluation of our method on two challenging datasets, where significant improvements over state-of-the-art PolSAR classification methods were achieved.

Related Work
Many methods represent PolSAR images, divided into four classes: the polarimetric decomposition methods, the informative signature methods, the dimensional reduction methods, and the sparse representation methods.
Polarimetric decomposition method. The polarimetric decomposition methods use different polarimetric decomposition methods with a physical scattering mechanism such as statistical, scattering, texture, spatial, and color information. Cloude-Pottier [1] employed a three-level Bernoulli statistical model to generate estimates of the average target scattering matrix parameters from the data. Yamaguchi [2] extended the three-component decomposition method introduced by Freeman and Durden [3] to a four-component decomposition method dealing with a general scattering case, such as surface scattering, double-bounce scattering, volume scattering, and helix scattering from objects. Hence, the target's structure information can be deduced as the sum of all four scattering components. However, the existing decomposition methods could not represent the original data perfectly because some information is lost while decomposing, and the classification performance is not desired.
Informative signature method. Informative signatures are used in supervised PolSAR image classification and are selected by different classifiers, such as neural networks [4], SVMs [5][6][7], Adaboost [8] and random forest [9]. For each pixel, the polarimetric SAR response contains three real and three complex parameters, and signatures contain the inherent characteristics of PolSAR data. As the informative signatures are correlated with each other, these methods result in the curse of dimensionality and the high computational complexity of classifiers.

Dimensional reduction method.
A dimensional reduction is a popular tool in PolSAR image classification. PCA and independent component analysis are implemented on the high-dimension polarimetric data for dimension reduction to form the feature vectors [10][11][12]. Laplacian eigenmaps are used to process the, and nonlinear dimensionality reduction in [13]. The existing dimensional reduction methods are pixel-wise, which neglects the structure of PolSAR images.
Sparse representation method. Sparse representation is used for PolSAR image classification and sparse representation has achieved promising results. He et al. [14] firstly employed a sparse coding algorithm to transform the features extracted from the wavelet domain as the sparse representation vectors for classification. Zhang et al. [15] combined the multi-dictionary algorithm with the simplified matching pursuit (SMP) algorithm to simplify the procedure and achieved higher accuracy. Xie et al. [16] applied the D-KSVD algorithm under the non-subsampled contourlet transform (NSCT)-domain to obtain more useful information. However, PolSAR image classification is a high-dimensional, nonlinear mapping problem. The sparse representation with the Euclidean distance does not favor this problem, because the classical descriptors of polarimetric SAR, covariance, and coherency matrices are of Hermitian semidefinite and form a Riemannian manifold. Some non-Euclidean distance is combined with sparse representation. Fan et al. [17] proposed a Stein-sparse, representation-based classification method, which employed a Stein kernel on a Riemannian manifold instead of Euclidean metrics in sparse representation among different frequency bands. Zhong et al. [18] implemented the sparse coding on covariance matrices under the circumstances of the Riemannian manifold. The dictionary atoms were formed by k-means, and SVM was learned for class prediction.
Differently from the above methods, we propose a novel, geometry-aware discriminative dictionary learning framework under the Riemannian metric and a joint-training method for PolSAR image classification. This method directly extracts features from the HPD matrix in Riemannian space, which can avoid losing implicit information. The presented optimization algorithm can solve the proposed model in which the atoms of the HPD dictionary, the coding coefficients, and the large margin hyperplanes can be jointly trained.

PolSAR Coherence Matrices
Compared to single-polarization SAR, the fully PolSAR transmit and receive electromagnetic waves in different states, whose signals consist of the amplitude and phase, form a complex matrix instead of a simple value. Therefore, each resolution cell of the PolSAR image can be described as a complex scattering matrix S.
Consider the reciprocal backscattering S V H = S HV ; the Pauli scattering vector of the polarization matrix is expressed as: where the superscript T denotes the matrix transpose. In general, the scattering properties of complex targets are determined by different independent sub-scatterers with their interactions, and the spatial speckle must be used to reduce the inherent speckle in the SAR data. Therefore, for a complex target, such as a multi-look PolSAR image, the scattering properties used to be described as statistical coherence matrix T, which is a 3 × 3 nonnegative definite Hermitian matrix.
where A = S HH + S VV , B = S HH − S VV and C = 2S HV . < · > denotes the ensemble average in the data processing, N is the number of looks, the superscript * denotes complex conjugation, and T denotes transpose operation of vector or matrix.

Discriminative Dictionary Learning
Assume that x ∈ R m is an m dimensional vector with class label y ∈ {1, 2, ..., C}, where C denotes the number of classes. The training set with n samples is denoted as X = [x 1 , x 2 , ..., x n ] ∈ R m×n , and it also can be denoted as X = [X 1 , X 2 , ..., X C ], where X c is the subset of n c training samples of class c. We can denote the learned dictionary as D = [d 1 , d 2 , ..., d K ] ∈ R m×K , in which d i represents the atom. Let Z = [z 1 , z 2 , ..., z n ] denote the coding vector of X over dictionary D; then a general discriminative dictionary learning (DDL) model can be formulated as: where R(X, D, Z) is the reconstruction term, and L(Z) denotes the discrimination term for Z. p is the parameter of the l p − norm regularizer. λ 1 and λ 2 are the trade-off parameters. By using a single dictionary shared among all classes, we can further get the following model: Intuitively, the discrimination can be induced by using a large margin criterion. In this case, we introduce a discriminant function S(z, y) ∈ R that measures the correctness of the association between coding vector z and class label y. Then, the general large margin discriminant term can be described as: where S(z i , y i ) ∆ = max y∈Y\{y i } S(z i , y i ), which means, for each coding pattern z i , we want to make sure that S(z i , y i ) of the correct association is greater than all the scores S(z i , y) of the incorrect associations, where y = y i . R(S) is a regularization term to constraint the complexity of function S. The slack variables ξ i , following the standard SVM derivation, are introduced to account for the potential violation of the constraints. Recently, the SVGDL [19] was introduced as a special case of general large margin DDL. By setting S(z i , y i ) = y i (ω T z i + b), S(z i , y i ) = 0 and R(S) = ω 2 2 , the discrimination term of two-class classification problem becomes: For multi-class classification, SVGDL simply adopts the one-vs-all strategy by learning C hyperplanes W = [w 1 , w 2 , ..., w C ] and corresponding biases b = [b 1 , b 2 , ..., b C ]. We can formulate SVGDL as: where y c = [y c 1 , y c 2 , ..., y c n ], y c i = 1 if y i = c, or y c i = −1. || · || F is the Frobenius norm.

Sparse Coding on Riemannian Manifold
There are some internal relations between elements in the HPD matrices, which may be dropped in the case of extracting features by decomposing the original data directly. Although these symmetric positive definite matrices form an open subset of Euclidean space, it is much easier to capture the internal logic while observing in the Riemannian manifold. Chetat et al. [20] extended the dictionary learning and sparse coding to the Riemannian space where the representation loss is computed via the affine invariant Riemannian metric (AIRM).
For a dataset X = {X 1 , X 2 , · · ·, X n }, where X i is a HPD matrix, assume that we obtain the third-order tensor dictionary B = {B 1 , B 2 , · · ·, B m }; the goal is to find a list of nonnegative vectors A = {α 1 , α 2 , · · ·, α n }, which makes each X i approximate to Bα i under the AIRM. Thus, the sparse coding problem can be described as: where In [20], the convex constraint of objective function can be described as:

Proposed Method
In contrast to most methods which extract many features via various decomposed functions and further reduce their dimensions, we only implement the original coherent matrix without any preprocessing, except speckle filters. Then, we cluster the initial geometry-aware dictionary of each category under the Riemannian metric instead of Euclidean metric, which retains vital discriminative information as much as possible. Moreover, we merge these initial dictionaries to form a discriminative dictionary. Finally, we propose a joint optimization strategy to optimize the discriminative dictionary learning and classifier training alternately. The framework is shown in Figure 1. Different from other methods that generate the dictionary only once and then optimize the classification model, the proposed joint optimization method can make the dictionary more robust and suitable to the current classification task. In the following, we derive our optimization equation and the details to solve the equation.

Riemannian Discriminative Dictionary Learning for PolSAR Data
The existing dictionary learning approaches usually apply only to vector data in Euclidean space. However, the typical representations of PolSAR data are HPD covariance matrices, which forms an open subset of space H d of d × d Hermitian matrices. Since the PolSAR data are sampled from the Riemannian manifold instead of Euclidean space, the proposed method extends DDL into Riemannian DDL in the following two ways to accommodate PolSAR data. Firstly, the PolSAR data could be kept in matrix form, avoiding losing information when treating them as vectors. Secondly, instead of direct use of Euclidean distance, HPD matrices are usually found to be inferior in performance. The intrinsic Riemannian distance corresponds to a geodesic distance on the manifold of HPD matrices. The intrinsic Riemannian distance is a more reasonable similarity measure and can be introduced to reformulate the reconstruction term in Equation (4).
n be the product manifold achieved by the Cartesian product of n HPD manifolds-i.e., Given the labels y i ∈ {1, 2, ..., C}(i = 1, ..., N) of the training set X , the proposed model aims to learn a third-order tensor (dictionary) B ∈ M d n . Each frontal slice of B denotes a HPD dictionary atom B j ∈ H d + (j = 1, ..., M), and we represent each X i approximately by a conic combination of atoms in B, i.e., Based on that notation, the objective function of Riemannian discriminative dictionary learning (RDDL) for HPD data can be defined as: where the function Ω(·) represents the regularizer on the dictionary tensor. Here, we use the trace regularization, i.e., Tr(B i ), as it is simpler and performs well empirically. The geodesic distance d 2 R (X, Y) is referred to as the affine invariant Riemannian metric, which has been proven to be invariant to affine transformations of the input matrices. With this objective function, the proposed method can not only effectively capture the Riemannian geometric structure of the HPD manifold, but also properly encodes the support vector-induced, large margin discriminative information into the learned dictionary to guide the classification better.

Model Optimization
The solution of our model can be summarized in two key steps: Riemannian discriminative dictionary learning and classifier training. Two steps can be trained in a joint way in an iterative manner.

Discriminative Dictionary Learning
In contrast to the vectorial DDL formulations in Equation (8), for which the subproblems are convex with respect to each variable, the RDDL model in Equation (11) is neither a jointly convex problem nor separately convex for its subproblems. Hence, we adopt an alternative minimization scheme for updating B, Z, and W, b respectively. The detailed optimization procedure can be partitioned into three steps alternatively.
Optimize Z: When B and W, b are all fixed, for a given data matrix X i ∈ H d + , the minimization of Z can be formulated as the following subproblem: (7). We also can set the hinge loss to zero directly due to its computational simplicity and the better smooth property. Lemma 1 ([20]). Let B, C and X be fixed SPD matrices. Consider the function f ( According to the Lemma 1, we can derive the partial derivative of Θ(z j ) with regard to z i j as follows: where Given the above derivative, the subproblem Equation (12) can be efficiently solved by using the spectral projected gradient (SPG) method, which is described in detail in [21]. An important issue of the proposed model is that the choice of l 1 norm or l 2 norm regularizes the coding vector z. It is a common way for the existing dictionary learning method to take the sparsity as a primary principle for learning a discriminative dictionary. Nevertheless, in the experiments, we grant a l 2 norm regularizer.
We repeat Equation (12) until convergence in order that the optimization of each z i has a closed-form solution. From Figure 2a, the eigenvalue of Equation (11) becomes lower with iterations, and the curve approximates to parallel to the c-axis after updating z nearly 500 times. Optimize B: Assuming Z and W, b are all fixed, the minimization of B can be formulated as the following nonconvex optimization problem: According to [22], the Riemannian conjugate gradient (CG) method [23] is adopted in our implementation since it is empirically more stable and faster than other first-order methods, such as steepest-descent and trust-region approaches [24]. For the non-linear function Θ(B i ), B i ∈ H d + , the CG method uses the following recurrence at step k + 1: where γ k is the step-size found via an efficient line-search method [25], and the direction of descent ξ (k) is defined as: where in which the map Φ A (B) defines the vector transport for two points A, B ∈ T P M as: Lemma 2 ([20]). For a dictionary tensor B ∈ M d n , let Θ(B) be a differentiable function. Then, the Riemannian gradient gradΘ(B) satisfies: where ∇Θ(B) is the Euclidean gradient of Θ(B). The Riemannian gradient for the j−th dictionary atom is given by: i , and given the above Lemma.2, the derivative ∇ B j Θ(B) of Equation (20) can be calculated as: As shown in Figure 2b, Let Z and W, b be all fixed, the eigenvalue of the optimized equation (Equation (11)) is decreasing with iterations of dictionary updating, and becomes almost smooth in 10 times.
Optimize W and b: By fixing B and Z, the minimization problem for W and b can be formulated as a multi-class linear SVM problem, which can be further separated into C linear one-against-all SVM subproblems. Due to the better smooth property and the computational simplicity, we adopt the quadratic hinge loss function [26] in our implementation to replace the traditional hinge loss function; i.e., In conclusion, the Riemannian discriminative dictionary learning can be divided into five steps described as follows: (1) Learning an initial dictionary consisting of HPD matrices by k-means clustering under the Riemannian metric; (2) transforming each data point to a nonnegative linear combination of HPD atoms in the initial dictionary; (3) finding out the best parameters of the multi-class linear SVM model according to the given sparse coding and category information; (4) updating the dictionary and parameters by mining the objective function (Equation (12) and (14), respectively) until the optimized dictionary only changes a little compared to the previous one; (5) Establishing the final model with the variables obtained, including the optimized dictionary and the best matching hyperplanes.

Classifier Training
Once the dictionary B and the large margin model W, b are learned, the classification task can be performed as follows. Given a test sample X, its coding vector z with respect to dictionary B can be achieved by solving the following coding problem via SPG [27] method: Then, we can apply the C linear classifier ω c , b c , in which c ∈ {1, 2, ..., C}, on the coding vector z to predict the label of the sample X by:

Experimental Results and Analysis
In order to evaluate the effectiveness of the proposed classification algorithm, we applied the proposed method to two real PolSAR images. The proposed algorithm is compared with the classical and the state-of-the-art supervised algorithms herein. Furthermore, the performance of classification in terms of select parameters is analyzed.

Description of Datasets
Flevoland-1989 was obtained from a subset of an L-band, and a multi-look PolSAR image, acquired by the AIRSAR airborne platform in 1989. It is an agricultural area from Flevoland in the Netherlands consisting of 750 × 1024 pixels. In total, 11 types of land cover are labeled in pixels, including bean, forest, potato, alfalfa, wheat, bare land, beet, rape, pea, grass and water. The ground truth map is shown in Figure 3b. The other pixels without ground truth are filled with black. We visualize it as a composite RGB image on a Pauli basis shown in Figure 3a, where |S HH − S VV | is normalized as red, |S HV | is normalized as green and |S HH + S VV | is normalized as blue.  San Francisco consists of four-look NASA/JPL AIRSAR L-band data of the San Francisco area in 1992. These PolSAR data with dimensions of 900 × 1024 pixels cover San Francisco Bay and California, as shown in Figure 4. However, this dataset was one of the most widely used datasets in PolSAR image classification experiments in the past few years and had different ground-truth maps referring to the previous research. We used the ground truth given in [28], where four terrain classes are considered, consisting of the sea, mountains, grass, and buildings.  Flevoland-1991 was obtained from the Flevoland test site in 1991 and contains a variety of crops and artificial targets, and the pseudo-RGB image synthesized by its L,P,Cband SPANs is shown in Figure 5a. The ground truth was inherited from Hoekman [29] and CRPM-Net [30] shown in Figure 5b. In the ground truth map, the black pixels are those that were not involved in the experiment. For the three PolSAR datasets, each class indicates a type of land cover and is identified by one color. It is noted that the unlabeled pixels were categorized as void and removed from our experiments. Many approaches have shown the harm of speckle and proposed lots of useful filters; we first applied a Boxcar [31] filter with the window size of 7 × 7. In order to clean the original data further, we replaced some outliers whose traces are smaller to 10 −5 with the average of rounding pixels. Considering the discrepancy of the number, for each class, we choose five percent of the total randomly as the training data and treat the rest as testing data. Given the random selection of training data, we independently conducted each experiment 10 times. The overall accuracy (OA), mean of the 10 total classification accuracies (average accuracy-AA), and kappa are used to evaluate the performance of each method.

Evaluation on Flevoland-1989
To demonstrate the superiority of the proposed method, we compare it here with other classical and state-of-art methods, including the classical maximum likelihood classifier based on Wishart distance [32] (denoted as Wishart-ML), the Laplacian Eigenmaps and nonlinear dimensionality for representation [33] (denoted as LE-NDR), the D-KSVD model based on an NSCT-domain [16] (denoted as ND-KSVD) and the SVM model based on Riemannian sparse coding [18] (denoted as RSC-SVM). Figure 6d-h shows the visual classification results from all the algorithms on the Flevoland image. It can be seen that Wishart-ML and K-SVD both made some obvious classification errors. For example, in Figure 6d classified by Wishart-ML, the wheat growing at the middle and bottom was mistaken as rape, and the bare patch on the left was classified as water. As for Figure 6f classified by ND-KSVD, the Khaki grass in the image was hardly found. The LE-NDR also mistook peas growing along the bottom as alfalfa, as did the Wishart-ML method, which ND-KSVD mistook as wheat. RSC-SVM and the proposed method were roughly correct for most parts, and the proposed method achieved higher accuracy in almost all types of land cover. However, there were many wrong classification points distributed among the correct blocks randomly, which can be simply amended by morphological open and close operations. Given the accuracy shown in Table 1, the proposed method achieved the highest values of AC and kappa among all five methods. RSC-SVM, which also used Riemannian sparse coding, was exceeded by the proposed method by 3.0 in AC and 4.1 in kappa. As for ND-KSVD, which used sparse coding in Euclidean space, the AC and kappa were 13.5 and 16.0 better for our method, respectively.

Evaluation on SanFransco
For the SanFransco image, the visual classification results of each algorithm are shown in Figure 7c-h and the classification accuracies are in Table 2. It can be seen clearly that all the methods worked better than on the Flevoland-1989 data due to the SanFransco image having fewer classes. Significantly, our method was also better than the others both regarding AC and kappa, except for LE-NDR. As shown in Figure 7d-h, the sea which occupies half of the image was classified well with all methods, but the isle was classified wrong by all. From Figure 7d,f, the Wishart-ML method clearly mistakenly classified most of the land as mountains and the ND-KSVD method classified the Golden Gate Bridge badly. In Figure 7e, some line targets which are a boulevard in truth were wrongly labeled as urban buildings by RSC-SVM.
From Table 2, our GADDL achieved the second highest performance on the whole. Almost all algorithms could not distinguish the grass well, which can be seen on the right of the Figure 7d-h. Our GADDL got low accuracy for grass, which was probably due to the classification ability mainly relying only on target decomposition. LE-NDR is a method based on polarization target decomposition which relies on the prior knowledge of the designer. For the SanFransco dataset with a small number of categories, this method can easily obtain more discriminative features and achieve the best classification results. Compared with RSC-SVM, our GADDL is more robust to category imbalances. The category of "Mountain" ( Table 2) only made up 6% of samples, which is a small sample category. Our method still achieved a higher classification accuracy than RSC-SVM by 14.1% in terms of OA. (a)

Sea Mountain
Grass Building

Evaluation on Flevoland-1991
To further verify the performance of our method, experiments on another fully PolSAR image with a far more unbalanced number of categories were implemented. Given Table 3, our GADDL achieved the best performance. In our chosen region, the number of pixels for the two categories (i.e., Maize and Buildings) were only 378 and 961, accounting for only 0.48% and 1.21% of the image, respectively. The accuracy of the comparison method in these categories was significantly lower than that for other categories, especially ND-KSVD. However, our method can still achieve high accuracy. This also further shows that our method is robust to class imbalance. The confusion matrices of the compared methods and GADDL are shown in Figure 8a-e. From the comparison of these confusion matrices, it can be seen that our GADDL can distinguish categories well, even the categories with small numbers of samples.

Computational Cost
We tested the performance and efficiency of GADDL and the compared methods on three datasets. The results of the test time and OA are summarized in Table 4. All the experiments were implemented using Matlab 2014b on a standard computer with I7 8700k CPU and 64 GB RAM. According to the comparison results, the performance and efficiency of each method had the same trends on the three datasets. For example, on the Flevoland-1991 dataset, Wishart-ML, LE-NDR and ND-KSVD were faster than ours, but the accuracy was lower. GADDL achieved 24.5%, 7.5% and 18.5% better speed, respectively. In the testing phase, GADDL was the same as RSC-SVM, but we still gained a 2.3% improvement. It can be concluded that our method achieved a good trade-off in terms of accuracy and efficiency.

Convergence Analysis
The proposed GADDL is based on a sparse dictionary, which is used to illustrate the convergence by calculating the eigenvalue. We randomly selected 5% points for each type of Flevoland image; in all, 8203 matrices were used as experimental samples, and λ 1 , λ 2 , λ 3 and θ, represent 0.4, 10 −3 , 0.1, 10 −7 and 50 atoms respectively. We calculated the sum value of the reconstructed term, sparse regular term and discriminant term after each iteration. As can be seen from Figure 9, as the number of iterations increased, the curve decreased continuously and finally stayed nearly level.

Parameter Analysis
In our method, there are several parameters closely related to the final results which need to be set, including the trade-off coefficients λ 1 , λ 2 and λ 3 related to the vector regular term, the discriminant term and the sparsity of dictionary, respectively. Moreover, the learning rate θ of the linear multi-svm classifier and the number of atoms for each type in dictionary are two key parameters. We conducted experiments in turn to prove the superiority of the selected parameters. The impacts assessed by comparing the total classification of the proposed method with different parameters are drawn out as curves for visualization.
According to the optimization strategy, the parameter θ, which stands for the learning rate of the classifier, is only used in the step optimizing W, b and has no effect on other parts of the experiment. From [34] we can find that with a decrease of the learning rate, the hyperplane obtained can be more suitable but more time consuming. We simply let θ be equal to a small value, 10 −7 , which will be refined later. Then, the number of atoms for the dictionary for each type was first set as 30, which we decided by referring to other articles [18].
In general, due to the reconstructed terms and the regularized terms being the main parts of the dictionary learning, we first set λ 2 and λ 3 to relatively small values, such as 10 −7 and 0.1, respectively, to quantify the influences of different λ 1 . As shown in Figure 10, the weight coefficient of the vector of sparse constraint has a great influence on the final result. When λ 1 is set to between 0.1 and 1, we obtained better precision than for any other range. Furthermore, we changed the value of λ 1 from 0.1 to 1 with an interval of 0.1 to explore the influence on classification accuracy; the fluctuation of the resulting curve was less than 0.02. Then we set λ 1 to 0.4 in the experiments later.
Similarly, we performed experiments on the same data to investigate the impact of the parameter λ 2 , which is the weight coefficient of the discriminant term in the optimization target. It also is an important component of the Equation (20) optimizing z. As shown in Figure 11a, the value of λ 2 has a greater impact on the final accuracy. A value bigger than 10 −5 makes the total classification accuracy lower; this may be caused due to an unbalanced contribution to the objective function. Additionally, a small value of λ 2 seems to lower the final result as well. To further verify the improvement of taking the discriminant term into consideration, we tried to set λ 2 to 0 fixing other parameters: the final precision was only 0.84; meanwhile, the highest precision was 0.86 as the value of λ 2 was set to 10 −7 . Furthermore, we implemented the same experiment to confirm the best value of parameter λ 3 . λ 3 is a parameter used to weigh the regularizer term of the dictionary tensor and constrain the sparsity of the dictionary, which can contribute to the gradient of dictionary in optimizing dictionary B. From Figure 11b, we can observe that the curve has a little fluctuation of within 0.01. In other words, the accuracy seems to make no difference with variable values of λ 3 . This may be because the initial dictionary obtained by clustering is better. It may also be attributed to the Riemann conjugate gradient method for direction and the line-search method for step-size. Therefore, the parameter λ 3 can be taken to be any integer from 1 to 0.001. We set it as 0.01 in the later experiments.
After that, we analyzed the influences of different numbers of dictionary atoms on the results. As we know, too small a size of the dictionary makes the model underfit, while a redundancy of dictionary atoms increases the computational burden. Thus, we need to set the size of the dictionary to as small as possible while the total classification accuracy is acceptable. In our experiment, the number of atoms for each class was set as the same value from 10 to 100 with an interval of 20 to observe a better balance between the dictionary size and the final accuracy. According to the experimental results shown in Figure 12, we found that when the number of dictionaries was set as about 50, the curve reached its peak value. However, considering the uneven distribution of the number of sample categories, the number of atoms in the dictionary for each class do not have to be the same. In this case, we assume that the number of atoms in the dictionary for a larger number of samples should be high, and vice versa. Therefore, we used the number of atoms of each type as 1 5 , 1 10 , 1 20 or 1 30 of the training data in each experiment, which showed higher accuracy in the same dataset in contrast to a similar dictionary size with an equal number of atoms for each class. We implemented the following experiment while setting the number of atoms to 1 10 of the training data to trade-off efficiency and precision.  Finally, we verified the optimal range of θ. θ is the learning rate when solving the optimal linear multi-SVM classification problem, which affects the balance between accuracy and time consumption. In our experiments, we had the parameter θ vary from 10 3 to 10 8 with average precision and time cost consumption. In Figure 13, we can see that the accuracy was almost the same when θ was bigger than 10 6 , and the time consumption increased in a geometrical progression. Meanwhile, when we set θ to 10 8 , the final results decreased a little. This may be explained that in this dataset, some classes such as water may have so few points that the classifier would overfit with a small θ. In conclusion, we can set θ = 10 −7 for better performance and simplicity.

Robustness Analysis
For the optimization problem, the l 0 norm is impossible to calculate, and the l 1 norm, l 2 norm may have some different influences on the final result. We statistically analyze the effects of different norms in Figure 14a.
The objective function's value with l 2 norm is generally half a point higher than the one with l 1 norm with the same parameters on the same datasets. The results using quadratic hinge loss were much better than those using squared loss, which further emphasizes the importance of the sparse weight matrix. Thus, we chose the l 2 norm regularizer in the later discussion due to its computational efficiency.
As a discriminant dictionary learning model, the initial dictionary is a vital subproblem. Although some effective but complex algorithms have been proposed to obtain an initial dictionary, we just applied the k-means algorithm and extended it to the Riemannian manifold to obtain cluster centers as dictionary atoms. However, the traditional k-means cluster method implemented on Euclidean space works badly on the HPD matrix dataset. Therefore, different distance metrics for the Riemannian manifold were proposed. Assume two SPD matrices, X, Y ∈ S d + ; for statistical measures, the log-determinant divergence has the following form: d B (X, Y) = Tr(XY −1 ) − log|XY −1 | − d. As for differential geometric schemes, one of the most popular is the log-Euclidean metric defined as d le = Log(X) − Log(Y) F . As for kernelized schemes, the Stein divergence is defined as d S (X, Y) = log| 1 2 (X + Y)| − 1 2 log(XY). We simply replaced the distance metric in the k-means algorithm with the Riemannian metric to cluster center points for each class as the atoms of sub-dictionary, and then combined them to create our initial discriminant dictionary. In all, eight different distance calculation methods were used to prove the robustness of the proposed method. As shown in Figure 14b, the Riemannian metric achieved high values of classification accuracy. The range of difference in the classification accuracy with different initial dictionaries was less than 0.5%, which is very small. Therefore, the proposed algorithm is robust for the different Riemannian metrics. Then, we used log-Euclidean in the following experiments.

Conclusions
In this paper, we propose a novel, geometry-aware discriminative dictionary learning framework for classifying land covers in PolSAR data. For each pixel in the PolSAR image being described as an HPD matrix, in contrast to traditional sparse coding approaches, which use extracted features from HPD matrices as atoms of a dictionary, we directly create the dictionary with an HPD matrix. The initial dictionaries are obtained utilizing k-means algorithm under the Riemannian metric, so that we obtain a list of nonnegative linear combinations of dictionaries for each point, which is named sparse coding. We first attempted to optimize the dictionary and match the large hyperplanes respectively, and then to obtain more suitable sparse coding. We repeat this step so that a more consummate model is generated. Experimental results on the real PolSAR datasets demonstrate that the proposed method outperforms many state-of-the-art methods in terms of accuracy and kappa.
The proposed algorithm also has limitations. As shown in the SanFransico dataset, the boundary between two labeled classes is not accurate. Additionally, for the correct division, some outliers need post-processing. Our randomly selected training data also contains some outliers, which damage the final average accuracy. In this case, we can improve the initial clustering method to reduce the impact.