Geodesic Flow Kernel Support Vector Machine for Hyperspectral Image Classification by Unsupervised Subspace Feature Transfer

Alim Samat 1,2,*, Paolo Gamba 3, Jilili Abuduwaili 1,2, Sicong Liu 4 and Zelang Miao 5
1 State Key Laboratory of Desert and Oasis Ecology, Xinjiang Institute of Ecology and Geography, Chinese Academy of Sciences, Urumqi 830011, China; jilil@ms.xjb.ac.cn
2 Chinese Academy of Sciences Research Center for Ecology and Environment of Central Asia, Urumqi 830011, China
3 Department of Electrical, Computer and Biomedical Engineering, University of Pavia, 27100 Pavia, Italy; paolo.gamba@unipv.it
4 College of Surveying and Geoinformatics, Tongji University, Shanghai 200092, China; sicongliu.rs@gmail.com
5 Department of Land Surveying and Geo-Informatics, Hong Kong Polytechnic University, Kowloon, Hong Kong 999077, China; cumtzlmiao@gmail.com
* Correspondence: alim.smt@gmail.com; Tel.: +86-991-788-5432


Introduction
In the past few years, pattern recognition (PR) and machine learning (ML) techniques have been employed to derive land use/land cover (LULC) maps from remotely sensed (RS) data in automatic or semi-automatic ways [1-3]. Supervised classification methods ideally require a large number of labeled samples with good statistical properties, as they need to be class unbiased and allow a good discrimination among classes. However, collecting such labeled samples is a challenging task, especially for large-area mapping and for the classification of multi-temporal high-resolution and hyperspectral remote sensing images [4,5]. Aiming at a more accurate classification with small training sets, many supervised [6], hybrid [7] and semi-supervised learning (SSL) [8,9] methods have been proposed for multi/hyper-spectral image classification in recent years. Furthermore, in real applications one is often faced with scenarios where the training data used to deduce a model and the validation data have different statistical distributions. As a matter of fact, radiometric differences, atmospheric and illumination conditions, seasonal variation, and variable acquisition geometries may cause such distribution shifts in single or multi-temporal RS images [5,7,10]. Existing inconsistencies between the source and target data distributions include covariate shift [11] and sampling selection bias [12]. Since recollecting enough training data to tackle such distribution shifts is difficult and expensive, traditional solutions, including image-to-image normalization [13], absolute and relative image normalization [14,15], histogram matching (HM) [16], and a multivariate extension of univariate matching [17], have been considered. However, a more efficient solution is to transfer knowledge between the domains. In this regard, transfer learning (TL) and domain adaptation (DA) methods have recently attracted ever-increasing attention [18-21].
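Among the traditional normalization solutions listed above, histogram matching is simple enough to sketch. The following NumPy function is an illustrative implementation (not the exact procedure of [16]): it maps the values of a source image so that its empirical distribution matches that of a reference image.

```python
import numpy as np

def histogram_match(source, reference):
    """Map the values of `source` so its empirical CDF matches `reference`."""
    _, s_idx, s_counts = np.unique(source.ravel(),
                                   return_inverse=True, return_counts=True)
    r_values, r_counts = np.unique(reference.ravel(), return_counts=True)
    # empirical CDFs of the two images
    s_cdf = np.cumsum(s_counts) / source.size
    r_cdf = np.cumsum(r_counts) / reference.size
    # for each source quantile, take the reference value at the same quantile
    matched = np.interp(s_cdf, r_cdf, r_values)
    return matched[s_idx].reshape(source.shape)
```

After matching, the transformed source image has (approximately) the same gray-level distribution as the reference, which is the goal of relative radiometric normalization.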
According to the technical literature, transfer learning can be categorized into inductive transfer learning (ITL), transductive transfer learning (TTL) and unsupervised transfer learning (UTL) [22], depending on whether labeled data are available only for the target domain, only for the source domain, or not at all. Multi-task learning (MTL) is a special case of ITL where labeled data are available for both the source and target domains, and the system is trained simultaneously on both [23]. Similarly, domain adaptation is a special case of TTL where labeled data are available only in the source domain, but there is only a single task to be learned [22,24]. Finally, based on the kind of "knowledge" transferred across domains, transfer learning can be roughly clustered into instance-based transfer learning (e.g., instance reweighting [25,26] and importance sampling [27]), feature-based transfer learning [28], parameter transfer approaches [29], and relational knowledge transfer techniques [30].
In this work, we focus on domain adaptation (DA). Seminal works, such as the adjustment of the parameters of a maximum-likelihood classifier within a multiple cascade classifier system by retraining, were carried out in [5,7] to update a land-cover map using multi-temporal RS images. Multivariate alteration detection (MAD) was introduced in [31], and proved to be effective in detecting invariant regions in bi-temporal images. Other DA methods, such as the feature selection approach [32], manifold transformation [33], the multi-domain method [34] and the domain adaptation support vector machine (DASVM) [18], have also been proposed. A semi-supervised transfer component analysis (SSTCA) method for DA [35] was thoroughly investigated in [21] for hyperspectral and high-resolution image classification. Finally, considering that labeled samples may be insufficient in some cases, active learning (AL) has also been combined with DA. For instance, an SVM-based boosting of AL strategies for efficient DA in very high resolution (VHR) and hyperspectral image classification was proposed in [36].
As mentioned above, one possible way to transfer knowledge between two domains is to look for a feature-based representation of the source and target domains that minimizes cross-domain differences [22,35,37]. To this aim, feature extraction (FE) methods are used to project the two domains into a common latent space, as in transfer component analysis (TCA) [21,35]. Alternatively, sampling geodesic flow (SGF) and the geodesic flow kernel (GFK) were proposed in [38,39] to exploit subspaces in a Grassmannian manifold [40].
Notwithstanding the many studies in this area, challenging issues remain open. For example, due to the high computational complexity of the empirical estimate of the maximum mean discrepancy (MMD) between the source and target distributions, TCA is only suitable for small images. Similarly, the SGF approach requires selecting a priori a sampling strategy, the transformed features to consider, and the dimensionality of the transformed subspaces. Instead, GFK has the advantage of being computationally efficient and of automatically inferring its parameters without extensive cross-validation [39], but it has not yet been studied for remote sensing image classification. Moreover, the original GFK uses a linear kernel function, which may limit its performance in nonlinear feature transfer tasks. Therefore, the major objectives and contributions of this paper are: (1) to propose and investigate the geodesic Gaussian flow kernel SVM for hyperspectral image classification tasks requiring domain adaptation; (2) to investigate the performance of different subspace feature extraction techniques, such as factor analysis (FA) [41], randomized nonlinear principal component analysis (rPCA) [42] and non-negative matrix factorization (NNMF) [43], with respect to the standard PCA originally used in GFKSVM; and (3) to compare the proposed methodology with conventional support vector machines (SVMs) and state-of-the-art DA algorithms, such as information-theoretical learning of discriminative clusters for domain adaptation (ITLDC) [44], joint distribution adaptation (JDA) [45], and joint transfer matching (JTM) [46].
The rest of this paper is organized as follows. In Section 2, we formalize the framework used for unsupervised domain adaptation using the geodesic flow, and briefly recall the basics of geodesic flow Gaussian kernel support vector machines. Section 3 describes the datasets used and the setup of our experiments. The experimental results are then reported, analyzed and discussed for evaluation and comparison purposes in Section 4. Finally, in Section 5, we summarize the main findings of this paper and discuss several possible future research directions.

Related Works
Let us consider that a specific domain $D$ consists of a feature space $\mathcal{X}$ and the corresponding marginal distribution $P(X)$, where $X = \{x_1, x_2, \ldots, x_n\} \subset \mathcal{X}$, so that $D = \{\mathcal{X}, P(X)\}$. Furthermore, let us consider that a "task" $\mathcal{T}$ is the joint set of a label space $\mathcal{Y}$ and a predictive function $f$, so that $\mathcal{T} = \{\mathcal{Y}, f\}$. In general, if two domains are different, it means that either $\mathcal{X}_S \neq \mathcal{X}_T$ or $P_S(X) \neq P_T(X)$, or both. Similarly, the condition $\mathcal{T}_S \neq \mathcal{T}_T$ implies that either $\mathcal{Y}_S \neq \mathcal{Y}_T$ or $P(Y_S|X_S) \neq P(Y_T|X_T)$, or both, from a probabilistic point of view. In this scenario, TL aims at improving the learning of the predictive function $f_T$ in the target domain $D_T$ using the knowledge available in the source domain $D_S$ and in the source learning task $\mathcal{T}_S$, when either $D_S \neq D_T$ or $\mathcal{T}_S \neq \mathcal{T}_T$.
As mentioned above, and according to the availability of source domain labeled data, supervised, semi-supervised or unsupervised feature construction methods can be adopted. Unsupervised domain adaptation is the only option when no labeled correspondence between the domains is available to estimate the transformation. In this area, dimensionality reduction via maximum mean discrepancy embedding (MMDE) and TCA were recently proposed [47,48]. The goal of unsupervised DA approaches is to first obtain meaningful intermediate subspaces connecting $D_S$ and $D_T$, and then to predict the target labels by exclusively using the labeled source training data. In this work, we focus on using GFK for unsupervised subspace feature transfer in hyperspectral image classification, following the works in [38,39].

Geodesic Flow for DA
In a general metric space, the length of a path between two points is defined as the supremum of the lengths of its finite polygonal approximations; in a Euclidean space, the shortest such path is a straight line (dashed line in Figure 1). In a geodesic metric space, such as a Riemannian manifold, the distance between two points is instead realized by the length of a geodesic (solid red line in Figure 1).
According to the theory of manifold learning (ML), the collection of all $d$-dimensional subspaces of the original $D$-dimensional data space forms the Grassmannian manifold $\mathcal{G}(d, D)$, a smooth Riemannian manifold with geometric, differential and probabilistic structures [38-40,49]. Since a full-fledged explanation of Grassmannian and Riemannian manifolds is beyond the scope of this paper, we refer the interested reader to [38-40,49,50]. Each $d$-dimensional subspace of $\mathbb{R}^D$ can be identified with a single point in $\mathcal{G}(d, D)$. If the two points mapped from the original subspaces of $D_S$ and $D_T$ are close in $\mathcal{G}(d, D)$, then the two domains can be made similar in this manifold, i.e., $P_S(X) \approx P_T(X)$.

Let $P_S, P_T \in \mathbb{R}^{D \times d}$ denote the sets of orthonormal bases obtained by a principal component analysis (PCA) in the source and target domains, and let $R_S \in \mathbb{R}^{D \times (D-d)}$ represent the orthogonal complement to $P_S$, i.e., $R_S^T P_S = 0$. In a Grassmannian manifold, using the canonical Euclidean metric for a Riemannian manifold, and with the constraints $\Phi(0) = P_S$ and $\Phi(1) = P_T$, the geodesic flow $\Phi : t \in [0,1] \rightarrow \Phi(t) \in \mathcal{G}(d, D)$ from $P_S$ to $P_T$ can be formed as [39]:

$$\Phi(t) = P_S U_1 \Gamma(t) - R_S U_2 \Sigma(t) \tag{1}$$

where $U_1 \in \mathbb{R}^{d \times d}$ and $U_2 \in \mathbb{R}^{(D-d) \times d}$ are orthonormal matrices given by the singular value decompositions (SVDs):

$$P_S^T P_T = U_1 \Gamma V^T, \qquad R_S^T P_T = -U_2 \Sigma V^T \tag{2}$$

Here $^T$ denotes matrix transposition, and $\Gamma$ and $\Sigma$ are diagonal matrices whose elements are $\cos\theta_i$ and $\sin\theta_i$ for $i = 1, \ldots, d$, where the $\theta_i$ are the principal angles measuring the degree of "overlap" between the two subspaces $P_S$ and $P_T$.
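The construction above can be sketched numerically. In the snippet below (an illustrative NumPy implementation, not the authors' code), $U_2$ is recovered from the same right factor $V$ as $U_1$, so that the two SVDs of Equation (2) are consistently paired:

```python
import numpy as np

def geodesic_flow(PS, PT, t):
    """Evaluate Phi(t), t in [0, 1], on the geodesic from span(PS) to span(PT).

    PS, PT: (D, d) matrices with orthonormal columns (e.g. PCA bases).
    Sketch of Eq. (1); the small epsilon guards against sin(theta) = 0.
    """
    D, d = PS.shape
    # orthogonal complement of PS: remaining left-singular vectors of PS
    RS = np.linalg.svd(PS, full_matrices=True)[0][:, d:]
    U1, gam, Vt = np.linalg.svd(PS.T @ PT)          # P_S^T P_T = U1 Gamma V^T
    theta = np.arccos(np.clip(gam, 0.0, 1.0))       # principal angles
    # R_S^T P_T = -U2 Sigma V^T  =>  U2 = -(R_S^T P_T V) Sigma^{-1}
    sin_t = np.maximum(np.sin(theta), 1e-12)
    U2 = -(RS.T @ PT @ Vt.T) / sin_t
    return PS @ U1 @ np.diag(np.cos(t * theta)) - RS @ U2 @ np.diag(np.sin(t * theta))
```

By construction, $\Phi(0)$ spans the source subspace and $\Phi(1)$ spans the target subspace, while every intermediate $\Phi(t)$ keeps orthonormal columns.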

Geodesic Flow Kernel SVM
According to Equation (1), the geodesic flow can be seen as a collection of infinitely many subspaces gradually varying from $D_S$ to $D_T$. Starting from the original $D$-dimensional feature vector $x$ and projecting it along the flow $\Phi : t \in [0,1]$, it is possible to form a feature set of infinite dimension, $z^{\infty} = \{\Phi(t)^T x : t \in [0,1]\}$. The inner product between two such infinite-dimensional projected feature vectors $z_i^{\infty}$ and $z_j^{\infty}$ defines the GFK [39]:

$$\langle z_i^{\infty}, z_j^{\infty} \rangle = \int_0^1 \left(\Phi(t)^T x_i\right)^T \left(\Phi(t)^T x_j\right) dt = x_i^T G\, x_j \tag{3}$$

where $G \in \mathbb{R}^{D \times D}$ is the positive semidefinite matrix

$$G = \int_0^1 \Phi(t)\, \Phi(t)^T\, dt \tag{4}$$

Substituting Equation (1) into Equation (4), where $\Gamma(t)$ and $\Sigma(t)$ are diagonal matrices whose elements are $\cos(t\theta_i)$ and $\sin(t\theta_i)$, respectively, and ignoring for the moment the constant part, the integral can be computed in closed form:

$$\lambda_{1,i} = 1 + \frac{\sin(2\theta_i)}{2\theta_i}, \qquad \lambda_{2,i} = \frac{\cos(2\theta_i) - 1}{2\theta_i}, \qquad \lambda_{3,i} = 1 - \frac{\sin(2\theta_i)}{2\theta_i}$$

where $i = 1, 2, \ldots, d$, and $\lambda_{1,i}$, $\lambda_{2,i}$, $\lambda_{3,i}$ are the $i$-th diagonal elements of the diagonal matrices $\Lambda_1$, $\Lambda_2$ and $\Lambda_3$, respectively. Accordingly, matrix $G$ assumes the form:

$$G = \Omega \begin{bmatrix} \Lambda_1 & \Lambda_2 \\ \Lambda_2 & \Lambda_3 \end{bmatrix} \Omega^T \tag{9}$$

where $\Omega = [P_S U_1,\; R_S U_2]$ collects the constant parts of Equation (1). It can be noted that Equation (3) is exactly the "kernel trick", where a linear kernel function induces inner products between infinite-dimensional features.
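The closed-form computation of $G$ can be sketched as follows (an illustrative NumPy implementation of Equations (4)-(9); the 1/2 factor makes the result equal the integral exactly, whereas an overall scaling of $G$ is immaterial for the kernel):

```python
import numpy as np

def gfk_matrix(PS, PT):
    """Closed-form geodesic flow kernel matrix G = integral of Phi(t) Phi(t)^T.

    PS, PT: (D, d) orthonormal bases; sketch under the notation of the text.
    """
    D, d = PS.shape
    RS = np.linalg.svd(PS, full_matrices=True)[0][:, d:]   # complement of PS
    U1, gam, Vt = np.linalg.svd(PS.T @ PT)
    theta = np.arccos(np.clip(gam, 0.0, 1.0))
    U2 = -(RS.T @ PT @ Vt.T) / np.maximum(np.sin(theta), 1e-12)

    # closed-form integrals of cos^2(t*theta), cos*sin and sin^2 over [0, 1]
    two_t = np.maximum(2.0 * theta, 1e-12)
    lam1 = 1.0 + np.sin(two_t) / two_t          # Lambda_1
    lam2 = (np.cos(two_t) - 1.0) / two_t        # Lambda_2
    lam3 = 1.0 - np.sin(two_t) / two_t          # Lambda_3

    Omega = np.hstack([PS @ U1, RS @ U2])       # constant parts of Eq. (1)
    Lam = np.block([[np.diag(lam1), np.diag(lam2)],
                    [np.diag(lam2), np.diag(lam3)]])
    return 0.5 * Omega @ Lam @ Omega.T
```

Two sanity checks follow from the definition: since $\Phi(t)$ has orthonormal columns, $\mathrm{tr}(G) = d$; and for $P_T = P_S$ the flow is constant, so $G$ reduces to the projector $P_S P_S^T$.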
However, we may be faced with non-linear patterns in the data, patterns that a linear kernel function cannot handle. Nonlinear kernel functions have instead proved effective in these situations, not only for classification but also for feature extraction [51]. Hence, we apply to Equation (3) a non-linear transformation, and more specifically a kernel method [52]. Writing the geodesic-induced squared distance between two samples as $d_G^2(x_i, x_j) = (x_i - x_j)^T G (x_i - x_j)$, a generic exponential kernel can be represented as $K(x_i, x_j) = \exp(-\gamma\, d_G^q(x_i, x_j)/\sigma)$. Assuming $\gamma = 1$ and $q = 2$, the geodesic flow Gaussian kernel mapping function is obtained:

$$K(x_i, x_j) = \exp\left(-\frac{(x_i - x_j)^T G\, (x_i - x_j)}{\sigma}\right) \tag{12}$$

where the parameter $\sigma$ is a scaling factor. Let now $\{X_S, Y_S\} = \{(x_i^S, y_i^S)\}_{i=1}^{n_S}$, with $x_i^S \in \mathbb{R}^D$ and $y_i^S \in \{1, \ldots, M\}$, denote the set of $n_S$ labeled source training samples from $M$ classes, and let $X_T = \{x_j^T\}_{j=1}^{n_T}$, $x_j^T \in \mathbb{R}^D$, represent the set of $n_T$ unlabeled samples from $D_T$. Using a Lagrangian formulation of the classic linearly constrained optimization problem, the final dual problem can be rewritten as:

$$\max_{\alpha} \sum_{i=1}^{n_S} \alpha_i - \frac{1}{2} \sum_{i=1}^{n_S} \sum_{j=1}^{n_S} \alpha_i \alpha_j y_i^S y_j^S K\left(x_i^S, x_j^S\right) \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \;\; \sum_{i=1}^{n_S} \alpha_i y_i^S = 0 \tag{13}$$

where the kernel matrix $K(x_i^S, x_j^S)$ is computed either by Equation (3) or by Equation (12). Thereby, the decision function for any test vector $x_j^T$ from $X_T$ is finally given by:

$$f\left(x_j^T\right) = \mathrm{sign}\left(\sum_{i=1}^{N_{SVs}} \alpha_i y_i^S K\left(x_i^S, x_j^T\right) + b\right) \tag{14}$$

where the $\alpha_i$ are the Lagrange multipliers, $N_{SVs}$ is the total number of support vectors (SVs), and the bias term $b$ can be obtained by averaging over the $k$ unbounded Lagrange multipliers:

$$b = \frac{1}{k} \sum_{i=1}^{k} \left( y_i^S - \sum_{j=1}^{N_{SVs}} \alpha_j y_j^S K\left(x_j^S, x_i^S\right) \right)$$

In summary, the steps of the GFKSVM algorithm are:
(1) Extract features able to represent the source and target domain subspaces: $S_{sub} = FE_S(S_{img})$ and $T_{sub} = FE_T(T_{img})$;
(2) Compute the geodesic flow kernel $G = G(S_{sub}, T_{sub}, d)$ according to Equation (9);
(3) Compute the kernel matrix $K_{X_S}$ for $X_S$ by means of Equation (3) or Equation (12), according to the kernel type $K$;
(4) Compute the SVM model parameters by solving Equation (13);
(5) Compute the kernel matrix $K_{X_T}$ for $X_T$ by Equation (3) or Equation (12), according to the kernel type $K$;
(6) Classify: return the predicted labels for $X_T$ according to Equation (14).
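The steps above can be sketched end-to-end. The snippet below is illustrative (function and parameter names are ours, not the paper's code): PCA subspaces, closed-form $G$, and a precomputed-kernel SVM from scikit-learn standing in for steps (1)-(6).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def gfk_fit_predict(Xs, ys, Xt, d=10, kernel="linear", sigma=1.0):
    """Sketch of GFKSVM: train on labeled source data, predict target labels."""
    # (1) subspace feature extraction in each domain (PCA here)
    PS = PCA(n_components=d).fit(Xs).components_.T        # (D, d)
    PT = PCA(n_components=d).fit(Xt).components_.T
    # (2) closed-form geodesic flow kernel matrix G (Eq. (9))
    RS = np.linalg.svd(PS, full_matrices=True)[0][:, d:]
    U1, gam, Vt = np.linalg.svd(PS.T @ PT)
    th = np.arccos(np.clip(gam, 0.0, 1.0))
    U2 = -(RS.T @ PT @ Vt.T) / np.maximum(np.sin(th), 1e-12)
    tt = np.maximum(2.0 * th, 1e-12)
    l1, l2, l3 = 1 + np.sin(tt) / tt, (np.cos(tt) - 1) / tt, 1 - np.sin(tt) / tt
    Om = np.hstack([PS @ U1, RS @ U2])
    G = 0.5 * Om @ np.block([[np.diag(l1), np.diag(l2)],
                             [np.diag(l2), np.diag(l3)]]) @ Om.T
    # (3)/(5) kernel matrices via Eq. (3) (linear) or Eq. (12) (Gaussian)
    def K(A, B):
        lin = A @ G @ B.T
        if kernel == "linear":
            return lin
        qa = np.einsum("ij,jk,ik->i", A, G, A)            # diag(A G A^T)
        qb = np.einsum("ij,jk,ik->i", B, G, B)
        return np.exp(-(qa[:, None] + qb[None, :] - 2 * lin) / sigma)
    # (4) train the SVM on the source domain, (6) classify the target domain
    svm = SVC(kernel="precomputed").fit(K(Xs, Xs), ys)
    return svm.predict(K(Xt, Xs))
```

With `kernel="precomputed"`, scikit-learn expects the $n_S \times n_S$ source kernel matrix at training time and the $n_T \times n_S$ cross-domain matrix at prediction time, which maps directly onto steps (3)-(6).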

Datasets Descriptions
For the experimental analysis, two hyperspectral datasets with different degrees of shift between source and target domains were considered.
For the first test case, two hyperspectral images collected by the Reflective Optics Spectrographic Image System (ROSIS) sensor over the University of Pavia and the Pavia City Centre were considered (see Figure 2). The ROSIS optical sensor provides up to 115 bands covering the spectral range from 0.43 µm to 0.86 µm, with a spatial resolution of 1.3 m per pixel. The Pavia City Centre image contains 102 spectral bands and has a size of 1096 × 1096 pixels. The Pavia University image contains instead 103 spectral bands and has a size of 610 × 340 pixels. Both images are provided with a ground truth of nine classes each. Seven classes are shared by both images, and were considered in our experiments (Table 1). In the experiments, the Pavia University image was considered as the source domain and the Pavia City Centre image as the target domain, or vice versa. Differences in imaging conditions, roof materials, and vegetation types cause remarkable variations in the spectral and statistical properties of these land-cover classes across the images, as observed in Figure 2. Moreover, since GFKSVM is not suitable for situations where the source and target domains are represented by a different number of features, only 102 spectral bands of the Pavia University image were used in the experiments.

The second dataset is a hyperspectral image with 2.5 m spatial resolution, consisting of 144 spectral bands in the 0.38-1.05 µm region. The data were acquired by the National Science Foundation (NSF)-funded Center for Airborne Laser Mapping (NCALM) over the University of Houston campus and the neighboring urban area on 3 June 2012, and are freely provided by the Hyperspectral Image Analysis Lab of the University of Houston's Electrical and Computer Engineering Department. Originally, the data set has a size of 1905 × 349 pixels, and its ground truth includes 15 land cover types. Similarly to the previous case, we considered two disjoint sub-images with 750 × 349 pixels (Figure 3a) and 1155 × 349 pixels (Figure 3b), respectively. These sub-images share eight classes in the ground truth: healthy grass, stressed grass, trees, soil, residential, commercial, road and parking lots, listed in Table 2 with the corresponding sample sizes.
To better illustrate the differences between the statistical properties of the same classes in the two domains, Figure 4 presents the principal component distributions of different classes from the source and target images for the Pavia and Houston data sets. Clear distribution shifts for Shadows, Bitumen, Asphalt, Bricks and Bare soil can be observed in Figure 4a,c for Pavia. Similarly, note the large distribution shifts for Soil, Commercial and Parking lot 1 between the two domains in Figure 4e,g for Houston.

Experimental Setup
In the experiments, we compare the results of SVMs with radial basis function (RBF) kernel applied to the following features: "All bands", the original spectral bands; "(S)pca-(T)pca", PCA/FA/NMF/rPCA transformations applied separately to the source and target images; "(S + T)pca", PCA/FA/NMF/rPCA transformations applied to the combination of the source and target images; and "GFK", unsupervised subspace features transferred via the geodesic flow kernel. The free SVM parameters γ and C were tuned in the ranges (0.01-10) and (1-100), respectively, by a grid search criterion with 10-fold cross-validation.
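As an illustration of this tuning protocol, a scikit-learn grid search over γ and C with 10-fold cross-validation might look like the following (the exact grid values and the toy data are our assumptions, not the paper's setup):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# gamma in (0.01-10) and C in (1-100), searched on a log-spaced grid
param_grid = {"gamma": np.logspace(-2, 1, 7), "C": np.logspace(0, 2, 5)}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)

# toy stand-in for the labeled source training samples
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (60, 5)), rng.normal(2, 1, (60, 5))])
y = np.array([0] * 60 + [1] * 60)
search.fit(X, y)
print(search.best_params_)   # best (gamma, C) pair found by 10-fold CV
```

`GridSearchCV` refits the SVM on the full training set with the best-scoring parameter pair, which is then used for the actual classification.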
As mentioned in the Introduction, we also compare results with the ITLDC, JDA and JTM algorithms. For ITLDC and JDA, the free parameter λ was tuned in the range (0.01-10) with 3-fold cross-validation. In JTM, an RBF kernel function was applied to build the joint kernel matrix, and both the γ factor of this kernel and the λ parameter used for the eigenvalue decomposition were selected in the (0.01-10) range. For a more objective comparison, an SVM with RBF kernel was used in the target domain classification phase of ITLDC, JDA and JTM.
All the experiments were carried out using MATLAB™ on a Windows 7 64-bit operating system, with an Intel® Core™ i7-4790 CPU @ 3.60 GHz and 32 GB of RAM.

Kernel Parameter Evaluation for GFKSVM
Generally, an RBF kernel is expected to be more suitable than a linear kernel for a nonlinear classification task. However, the free parameters of the RBF kernel tuned in the source domain may not be optimal for the target domain, hence limiting or even decreasing the performance of GFKSVM. While GFKSVM with a linear kernel has been proved to perform excellently in computer vision adaptation tasks [38,39], its performance on real hyperspectral data with an RBF kernel remains to be tested. Hence, we first comparatively investigated the performance of GFKSVM with RBF and linear kernels.
Figure 5 presents the Overall Accuracy (OA) values for GFKSVM with RBF and linear kernels applied to the ROSIS data with different subspace feature transfer approaches, but always with 10 transferred features. The selection of 10 features is completely arbitrary, as here we focus on the comparison of linear and non-linear kernels. According to the graphs in Figure 5, GFKSVM with RBF kernel shows higher OA values than with the linear kernel in most cases when training and testing are carried out only in the source domain (see Figure 5a,c,e,i,m). However, this advantage decreases when training and testing are carried out in the source and target domains, respectively (see Figure 5b,d,f,j,n). These results indicate that the tuned free parameters make the RBF kernel capable of reaching higher OA values than the linear kernel.


Parameter Evaluation for rPCA
Randomized nonlinear principal component analysis (rPCA) is a nonlinear variant of PCA recently proposed as a low-rank approximation of kernel principal component analysis (KPCA) [42]. In comparison with PCA, rPCA is capable of revealing nonlinear patterns in the data with a computational complexity of O(m²n), as opposed to O(n³) for KPCA and O(d²n) for PCA, where m, n and d represent the number of random features, the sample size and the number of extracted features, respectively, with m ≥ d. According to [42], rPCA performance is mainly controlled by the number of random features (m) and of extracted features (d). In the following, we investigate the performance of GFKSVM using rPCA for feature transfer. Figure 6 shows the OA results as a function of the parameters m and d.
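A minimal sketch of this scheme, assuming the random-feature construction of [42] can be approximated by random Fourier features followed by linear PCA (function name, defaults and the RBF feature map are our assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.kernel_approximation import RBFSampler
from sklearn.pipeline import make_pipeline

def rpca(X, m, d, gamma=1.0, seed=0):
    """Randomized nonlinear PCA sketch: map the data to m random Fourier
    features (a linear-cost approximation of the RBF kernel map), then keep
    the top d linear principal components of the mapped data."""
    assert m >= d, "need at least as many random features as extracted components"
    rff = RBFSampler(gamma=gamma, n_components=m, random_state=seed)
    return make_pipeline(rff, PCA(n_components=d)).fit_transform(X)
```

This keeps the m-dimensional random map explicit, so the eigendecomposition is performed on an m × m problem rather than the n × n one of exact KPCA, matching the O(m²n) versus O(n³) comparison above.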
From the results in Figure 6, it can easily be observed that, in general, the larger the number of extracted features d, the higher the OA values. The role of the number of random features m is less clear-cut, although it evidently should be large. Accordingly, this parameter was set to five times the original dimensionality of the data in all the experiments.


Parameter Evaluation for ITLDC
Additionally, a performance analysis of ITLDC was conducted. In order to interpret the results, recall that ITLDC identifies a feature space where data from the source and target domains are similarly distributed [44]. Its performance is controlled by the dimensionality of the transferred feature subspace d and by the regularization coefficient λ. Similarly to the previous experiments, Figure 7 depicts the OA values obtained by ITLDC as a function of its free parameters.
According to the results in Figure 7, the regularization coefficient does not significantly influence the OA values. The effect of the transferred feature subspace dimensionality (d) is instead quite obvious, and differs across data sets. Specifically, the optimal d for the relatively heterogeneous ROSIS data is no larger than 26 for University (source) → Center (target) and no larger than 10 for Center (source) → University (target) DA learning. Instead, for the more homogeneous Houston data set, the larger the transferred feature subspace (d), the higher the OA value.


Parameter Evaluation for JDA
The joint distribution adaptation (JDA) algorithm simultaneously reduces the differences in both the marginal and the conditional distributions between domains [45]. Specifically, JDA first uses an extended nonparametric maximum mean discrepancy (MMD) [53] to measure the differences in both marginal and conditional distributions, and then integrates it with PCA to construct effective and robust feature representations. According to [45], the performance of JDA is controlled by the regularization parameter λ and by the transferred feature subspace dimensionality d. JDA has a high computational cost of O(TdD² + TCn² + TDn), where T is the number of iterations, C is the number of classes, and D denotes the original data dimensionality.
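The marginal-distribution MMD term at the core of JDA-style objectives can be sketched as follows (an illustrative implementation; variable names are ours). For the stacked data X = [Xs; Xt], the discrepancy between the domain means is tr(XᵀM₀X) with the coefficient matrix M₀ below:

```python
import numpy as np

def mmd_matrix(ns, nt):
    """MMD coefficient matrix M0: tr(X^T M0 X) equals the squared distance
    between the source and target sample means of the stacked data X."""
    e = np.concatenate([np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)])
    return np.outer(e, e)

def marginal_mmd(Xs, Xt):
    """Squared MMD between source and target means in input space."""
    M0 = mmd_matrix(len(Xs), len(Xt))
    X = np.vstack([Xs, Xt])
    return np.trace(X.T @ M0 @ X)
```

JDA minimizes this quantity (plus class-conditional analogues built from pseudo-labels) over a learned projection, which is where the iteration count T in the cost estimate comes from.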
According to the results in Figure 8, a larger transferred feature number (d ≥ 10) leads to higher and more stable OA values, but the main control power is retained by the regularization parameter λ. Indeed, with a constant and large set of transferred features, only the regularization parameter needs to be tuned.

Parameter Evaluation for JTM
Joint transfer matching (JTM) aims at reducing the domain shift by jointly matching the features and reweighting the instances in a principled dimensionality reduction procedure [46]. Briefly, the feature matching step is executed by minimizing the nonparametric MMD [53] in an infinite-dimensional reproducing kernel Hilbert space (RKHS), while instance reweighting and domain-invariant feature representation are performed by minimizing the ℓ2,1-norm using PCA. The overall computational complexity of JTM is O(TdD² + TCn²), and its free parameters are the regularization coefficient λ, the factor γ of the RBF kernel, and the dimensionality of the transferred feature subspace d. Figure 9 shows the OA values as a function of these free parameters. It can be observed that, with a constant and large set of transferred features (d ≥ 6), the performance of JTM on the ROSIS data is mainly controlled by the regularization coefficient λ and the factor γ (Figure 9b-d).
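For reference, the ℓ2,1-norm used for instance reweighting is simply the sum of the Euclidean norms of the matrix rows; minimizing it drives entire rows, i.e., entire instances, towards zero. A minimal sketch:

```python
import numpy as np

def l21_norm(A):
    # l2,1-norm: sum of the Euclidean (l2) norms of the rows of A.
    # Minimizing it encourages whole rows (instances) to vanish, which is
    # how JTM down-weights source instances irrelevant to the target domain.
    return float(np.sqrt((A ** 2).sum(axis=1)).sum())

A = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [5.0, 12.0]])
print(l21_norm(A))  # 18.0 (= 5 + 0 + 13)
```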
According to the results in Figure 10, once again larger numbers of transferred features (d ≥ 6, see Figure 10b-d), combined with smaller values for both the regularization coefficient (λ ~ 0.01) and the gamma factor (γ ~ 0.01) (Figure 10b-e,i), lead to the most effective results for the Houston data.

Evaluation of GFKSVM with Different Feature Transfer Approaches
To complete our analysis, in Figures 11-13 we report the results of a sensitivity analysis of GFK with respect to its critical parameters: the number of transferred features d and the adopted unsupervised feature transfer approach (PCA, rPCA, FA and NNMF). According to Equations (1) and (2), the dimensionality of the transferred subspace (d) can be at most half the original dimensionality D. Hence, the maximum d for ROSIS and Houston in Figures 11-13 was set to 50 and 72, respectively. Moreover, according to the results shown in the previous subsection, the number of random features m in rPCA was set to five times the dimensionality of the original data. Finally, all results are compared with those obtained by directly applying the RBF kernel based SVM to the original data; these results are considered as the evaluation benchmark.
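The unsupervised subspace constructions compared here are all available in scikit-learn. A hedged sketch of how a d-dimensional basis could be extracted for each of PCA, FA and NMF follows (rPCA, i.e., PCA on a random feature expansion, is omitted for brevity; the `method` dispatch is an illustrative assumption of this sketch, not the authors' code):

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis, NMF

def subspace_basis(X, d, method="pca"):
    # Return a (D, d) basis of an unsupervised d-dimensional subspace of X,
    # where X has shape (n_samples, D). NMF additionally requires X >= 0,
    # which holds for radiance/reflectance hyperspectral values.
    if method == "pca":
        model = PCA(n_components=d)
    elif method == "fa":
        model = FactorAnalysis(n_components=d)
    elif method == "nmf":
        model = NMF(n_components=d, init="nndsvd", max_iter=500)
    else:
        raise ValueError(f"unknown method: {method}")
    model.fit(X)
    return model.components_.T  # columns span the transferred subspace

# Toy non-negative "spectra": 60 pixels with 20 bands.
X = np.random.default_rng(1).random((60, 20))
for m in ("pca", "fa", "nmf"):
    print(m, subspace_basis(X, 4, m).shape)  # (20, 4) each
```

In the GFK setting, such a basis is computed separately for the source and target images (or for their combination), and the pair of bases parameterizes the geodesic flow.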
As illustrated in Figure 11 for the ROSIS dataset (University→Centre), the largest overall accuracy values are achieved by GFKSVM, while the second best performance is obtained in most cases by using all bands with the RBF kernel based SVM. Furthermore, the sensitivity of GFK with respect to the subspace dimensionality is much lower for conventional PCA than for the rPCA, FA and NNMF transformations applied to the source and target images in either a separate or a combined way. The optimum d for rPCA and NNMF based GFKSVM is ≥ 20 (see Figure 11f,p), while PCA based GFKSVM is much more stable with respect to the number of transferred features.

Remote Sens. 2016, 8, 234 15 of 23
Figure 12 presents the results for GFKSVM with different subspace feature transfer approaches for Pavia, and Figure 13 presents those for Houston. Once again, all these results show that PCA based GFKSVM is more robust with respect to the dimensionality of the transferred features (d) than the rPCA, FA and NMF approaches (Figures 12a,f,k,p and 13a,f,k,p). Indeed, rPCA and NMF require a larger d (Figures 12f,p and 13f,p). Instead, the poor performance of GFKSVM when rPCA is applied to both domains with a small d can be significantly improved by replacing rPCA with PCA, FA or NNMF for one domain (see for instance Figure 11f vs. Figure 11e,g,h, Figure 12f vs. Figure 12e,g,h, or Figure 13f vs. Figure 13e,g,h).

GFKSVM vs. State-of-the-Art DA Algorithms
Another set of experiments was devoted to the comparison of GFKSVM with ITLDC, JDA and JTM. In consideration of the high computational complexity of the latter algorithms, the training samples of the Pavia Center data set were used to validate the algorithms trained with the training samples of the Pavia University data set, and vice versa. Figure 14 reports the OA curves as a function of the d parameter for all algorithms. Note that GFKSVM uses the same subspace feature reconstruction approach (PCA/FA/NMF/rPCA) for both the source and the target domain.
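Since GFK ultimately yields a domain-invariant kernel matrix G, GFKSVM can be realized with any SVM implementation that accepts precomputed kernels. A hedged scikit-learn sketch with synthetic data; the identity matrix stands in for G here, as the geodesic-flow computation itself is not reproduced:

```python
import numpy as np
from sklearn.svm import SVC

D = 10
G = np.eye(D)  # placeholder for the (D, D) geodesic flow kernel matrix

rng = np.random.default_rng(0)
Xs = rng.normal(size=(100, D))         # labeled source samples (synthetic)
ys = (Xs[:, 0] > 0).astype(int)        # toy source labels
Xt = rng.normal(size=(40, D))          # unlabeled target samples

clf = SVC(kernel="precomputed")
clf.fit(Xs @ G @ Xs.T, ys)             # source-vs-source kernel for training
pred = clf.predict(Xt @ G @ Xs.T)      # target-vs-source kernel for prediction
print(pred.shape)                      # (40,)
```

Training needs only source labels; the target samples enter solely through the kernel, which is exactly the unsupervised DA setting studied in this section.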
According to the results in Figure 14, GFKSVM after PCA, rPCA or NNMF shows in general larger OA values than ITLDC, JDA and JTM. Instead, for the Houston data set (Figure 14c,d), JDA achieves the largest OA value. We may summarize that GFKSVM is more suitable for statistically heterogeneous data like the ROSIS data set, while JDA is preferable for statistically homogeneous data like the Houston data set. Looking at this figure, we can also state that ITLDC is the best choice when a small number of features for heterogeneous data, or a large number of features for homogeneous data, is considered.
For a quantitative and visual comparison, Tables 3 and 4 report the user accuracy (UA), kappa (κ) and average accuracy (AA) for the best result of every transformation, while Figure 15 presents the classification maps obtained by GFKSVM with different subspace feature construction approaches for the ROSIS dataset. Please note that, since the validation procedures for JDA, ITLDC and JTM are carried out using the training samples of the Center image while training on the training samples of the University image, or vice versa, thematic maps are not presented.

Influence of the Source Domain Sample Size
Finally, in Figure 16 we present the results of experiments aimed at gauging the influence of the source domain sample size used to train the considered DA algorithms. To this end, curves of the overall classification accuracy as a function of the training sample size, with a fixed subspace dimensionality (d = 30), are presented for both the Pavia and the Houston datasets. For a more objective evaluation, each experiment was executed 10 times, randomly selecting the training samples from the pool. According to the results in Figure 16, one can observe that the proposed GFKSVM with PCA or rPCA shows the best OA values for the ROSIS data, while JDA and ITLDC achieve the best results for the Houston dataset. This indicates again that GFKSVM is more suitable to handle heterogeneous data, while JDA and ITLDC are more suitable for homogeneous data.
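The averaging protocol described above (10 random training draws per size, reporting mean and standard deviation of the OA) can be sketched as follows; `run_trial` is a hypothetical stand-in for sampling a training set of the given size, training the DA classifier, and returning the OA on the fixed test set:

```python
import numpy as np

def repeated_oa(run_trial, train_sizes, n_runs=10, seed=0):
    # Mean/std of overall accuracy over n_runs random training draws per size,
    # matching the shaded-curve protocol of Figure 16.
    # run_trial(rng, n): hypothetical callback -> OA for a random n-sample draw.
    rng = np.random.default_rng(seed)
    stats = []
    for n in train_sizes:
        oas = [run_trial(rng, n) for _ in range(n_runs)]
        stats.append((n, float(np.mean(oas)), float(np.std(oas))))
    return stats
```

The mean traces the accuracy curve and the standard deviation gives the shaded band around it.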

Conclusions
In order to deal with the data distribution shift problem in the classification of hyperspectral images, in this paper we have experimentally studied the suitability of the geodesic flow Gaussian kernel for unsupervised domain adaptation. For comparison purposes, state-of-the-art domain adaptation methods, namely JDA, ITLDC and JTM, were also considered.
Experimental results with two real hyperspectral datasets show that, for relatively statistically "heterogeneous" data sets, the RBF kernel based GFKSVM is more stable with respect to the dimensionality of the transferred subspaces, and is capable of learning from small training sets, achieving a better classification performance, especially in the PCA transformed domain. Instead, JDA is more suitable to handle statistically homogeneous datasets, such as the Houston data set, using a large transferred feature dimensionality. Finally, the performance of GFKSVM with rPCA applied to both the source and target domains and a small number of transferred features is not acceptable. Instead, excellent results can be obtained by replacing rPCA with PCA, FA or NMF in one domain.
Future works will be focused on testing the proposed approach in different scenarios, such as domain adaptation between images of different multispectral sensors, as well as in the case of hyperspectral and multispectral sensors as source and target images. Additionally, semi-supervised approaches will also be considered as another interesting line of study.

If the original subspaces DS and DT are closer in the Grassmann manifold G(d, D), then the two domains could be made similar in this manifold, i.e., PS(X) ≅ PT(X).

Figure 1. Illustration of the main idea of geodesic flow for domain adaptation.

Figure 2. Color composite images of the (a) ROSIS Pavia University with the corresponding ground truths: (b) train map; (c) test map; (d) Pavia City Centre with the corresponding ground truths; (e) train map and (f) test map.

Figure 3. Color composite of the Source (a) and Target (b) images of Houston with the corresponding ground truths: (c) and (d) train maps for Source and Target; (e) and (f) test maps for Source and Target.

Figure 6. OA versus the parameters m and d of rPCA in GFKSVM for ROSIS (a,b) and Houston (c,d) data sets (Source→Target).

Figure 7. OA versus the number of subspaces d and the parameter λ of ITLDC for the ROSIS (a,b) and Houston (c,d) data sets (Source→Target).

Figure 8. OA versus the number of subspaces d and the parameter λ of JDA (trained with primal) for the ROSIS (a,b) and Houston (c,d) data sets (Source→Target).

Figure 14. OA curves for GFKSVM and other considered state-of-the-art DA methods on ROSIS (a,b) and Houston (c,d) data sets.

Figure 16. Classification accuracy curves for GFKSVM and other DA algorithms applied to ROSIS (a,b) and Houston (c,d) data. Each curve shows the average overall accuracy value with respect to the increasing size of training sets over 10 independent runs. The shaded areas show the standard deviation of the overall accuracy within the independent runs.

Table 1. Class legend for ROSIS University and City Center datasets.

Table 2. Class legend for Figure 3.

Table 4. Classification accuracies (%) and kappa statistic (κ) for considered DA algorithms applied to Houston data (Right→Left).