A Comparative Review of Manifold Learning Techniques for Hyperspectral and Polarimetric SAR Image Fusion

In remote sensing, hyperspectral and polarimetric synthetic aperture radar (PolSAR) images are two of the most versatile data sources for a wide range of applications such as land use land cover classification. However, the fusion of these two data sources receives less attention than that of many others, because of scarce data availability and the relatively challenging fusion task caused by their distinct imaging geometries. Among the existing fusion methods, including manifold learning-based, kernel-based, ensemble-based, and matrix factorization-based methods, manifold learning is one of the most celebrated techniques for the fusion of heterogeneous data. Therefore, this paper aims to promote research in hyperspectral and PolSAR data fusion by providing a comprehensive comparison of existing manifold learning-based fusion algorithms. We conducted experiments on 16 state-of-the-art manifold learning algorithms that embrace two important research questions in manifold learning-based fusion of hyperspectral and PolSAR data: (1) in which domain should the data be aligned—the data domain or the manifold domain; and (2) how to make use of existing labeled data when formulating a graph to represent a manifold—supervised, semi-supervised, or unsupervised. The performance of the algorithms was evaluated via multiple accuracy metrics of land use land cover classification over two data sets. Results show that the algorithms based on manifold alignment generally outperform those based on data alignment (data concatenation). Semi-supervised manifold alignment fusion algorithms perform the best of all. Experiments using multiple classifiers show that they outperform the benchmark data alignment-based algorithms by ca. 3% in terms of overall classification accuracy.


Related Work
Multi-modal data fusion [1][2][3][4][5][6][7] continuously draws attention in the remote sensing community. The fusion of optical and synthetic aperture radar (SAR) data, two important yet intrinsically different data sources, has also begun to appear frequently in the context of multi-modal data fusion [8][9][10][11][12][13][14]. With the rapid development of Earth observation missions, such as Sentinel-1 [15], Sentinel-2 [16], and the upcoming EnMAP [17], the availability of both data sources creates huge potential for Earth-oriented information retrieval. Among all optical data [18][19][20], hyperspectral data are well known for the distinguishing power that originates from their rich spectral information [21][22][23][24]. Similarly, polarimetric SAR (PolSAR) data are a popular choice for classification tasks in the field of SAR because they reflect the geometric and dielectric properties of the scatterers [25][26][27][28][29]. It is therefore of great interest to investigate the fusion of hyperspectral and PolSAR images, especially with application to land use land cover (LULC) classification.
Few studies have attempted to address the challenge of fusing hyperspectral and PolSAR data. Jouan and Allard [30] proposed a hierarchical fusion strategy for land cover classification using PolSAR and hyperspectral images. In their work, hyperspectral images are first used to distinguish vegetation from non-vegetation areas. The PolSAR data are then used to classify the non-vegetation areas into man-made objects, water, or bare soil. Li et al. [31] applied feature-level and decision-level fusion using hyperspectral and PolSAR data. They combined the scattering mechanism parameters of the PolSAR data and the features of the hyperspectral image to create a concatenation of features. The classification results of multiple classifiers are then merged using decision fusion. An application of oil spill detection was studied by Dabbiru et al. [32] using hyperspectral and PolSAR data. They applied pixel-level concatenation to the data and employed a support vector machine (SVM) as the classifier. Hu et al. [33] proposed a framework for fusing hyperspectral and PolSAR data based on segmented objects that provide spatial information. A two-stream convolutional neural network (CNN) was introduced in [34] that takes advantage of the feature extraction power of CNNs.
Among the existing fusion methods, including manifold learning-based [35,36], kernel-based [37], ensemble-based [38,39], tensor-based [40,41], and matrix factorization [42], manifold learning is one of the most celebrated techniques. However, although it has been proven to be a powerful technique in the field of data fusion, it is barely studied in the scope of fusing hyperspectral and PolSAR data. Generally, manifold-based fusion techniques attempt to find a shared latent space where the original data sets can be fused or aligned. Wang and Mahadevan [35,[43][44][45] proposed several manifold-based techniques that find the correspondence between data sets describing the same object from different aspects via the latent space. A kernel-based manifold alignment [46] searches for the latent space in a kernel space of the original data, because the kernel space represents the data better than the original feature space. In remote sensing, it was shown in [36,47] that the manifold latent space is able to align multiple optical data sets and improve LULC classification. A manifold-based data alignment technique was introduced in [48] for the fusion of hyperspectral and LiDAR data with application to classification. Besides data fusion, various manifold techniques can be found in the remote sensing field for detection [49], visualization [50], and dimension reduction [51].

Scope of This Paper
When fusing data with manifold techniques, one technical question is: in which domain should the fusion be carried out? We categorized the existing techniques into two types: (1) the data alignment-based approach and (2) the manifold alignment-based approach. As shown on the left of Figure 1, the data alignment approach carries out the fusion in the data domain. In the simplest example, it fuses the data by concatenation and then carries out a manifold-based dimension reduction. Essentially, this approach assumes that an intrinsic manifold exists in the concatenated data. Representatives of this approach are the locality preservation projection (LPP) [52] and the generalized graph fusion (GGF) [48]. On the contrary, the manifold alignment-based approach carries out the fusion on manifolds that are separately derived from the different data sources. This is demonstrated on the right of Figure 1.
The assumption of this approach is that a different manifold exists in each data source, and that those manifolds can be aligned in a latent space. Representative algorithms are manifold alignment (MA) [36] and MAPPER-induced manifold alignment (MIMA) [53]. The other essential research question of manifold-based fusion is: how should the manifold be extracted? We categorize the existing techniques into three learning strategies in terms of the usage of labeled data: unsupervised, semi-supervised, and supervised. When modeling a manifold, a general assumption is that, hidden in the data representation, there exists an underlying lower dimensional manifold where the data truly distribute [54]. Early studies [54][55][56][57] model manifolds by following a geometric assumption that the Riemannian manifold can be locally approximated by Euclidean measures. The geometric assumption models the manifold in an unsupervised manner by using k-nearest-neighbors (kNN). With the presence of labeled data, the manifold can be jointly modeled by the Riemannian manifold and the labeled data. For example, one can construct the manifold in a semi-supervised fashion by using both kNN and the labeled data [35,36]. The manifold can also be modeled in a supervised manner by using only the labeled data. One of the main goals of this paper is to investigate the impact of these learning strategies on the classification performance on the fused data.

Contribution of This Paper
This paper investigates the performance of manifold learning techniques on the fusion of hyperspectral and PolSAR data, based on four state-of-the-art algorithms: locality preservation projection (LPP) [52], generalized graph fusion (GGF) [48], manifold alignment (MA) [36,44], and MAPPER-induced manifold alignment (MIMA) [53]. We implemented 16 variants of the four algorithms, which involve the above-mentioned two alignment approaches and three manifold learning strategies. These algorithms were tested on two study areas for a LULC classification task with five classifiers: one nearest neighbor (1NN) [58], linear SVM (LSVM) [59,60], Gaussian kernel SVM (KSVM) [59,60], random forest (RF) [61], and canonical correlation forest (CCF) [62]. We avoided any deep network classifiers, because the goal of this article is to solely evaluate the performance of multi-sensory data fusion. In total, 80 classification maps were produced for each study area, based on which a comprehensive discussion was carried out. The main contributions of this paper are as follows:

•
An exhaustive investigation of existing manifold learning techniques. A sufficient number of manifold techniques and classifiers were tested on the fusion of hyperspectral and PolSAR data in terms of classification. This provides a reliable demonstration of the performance of manifold techniques regarding hyperspectral and PolSAR data fusion.

•
An objective comparison of the performance of different manifold data fusion algorithms. To avoid fortuitous results, five classifiers were applied for the classification. A grid search was applied to all tunable hyperparameters of those algorithms. The best classification accuracies are compared.

•
A comprehensive analysis of the results. The experiment results were analyzed with regard to the two fusion approaches, three manifold learning strategies, four basic algorithms, and five classifiers.

Structure of This Paper
The second section recalls the theory of the manifold technique and the four selected state-of-the-art algorithms. The third section describes the data sets used in the experiments, introduces the experiment setting, and carries out the discussion. The fourth section concludes the paper. Table 1 lists the symbols used in this article for a better understanding of its content.

X_i: the ith data source
x_i^p, x_i^q: the pth and qth instances of X_i
m_i: the number of dimensions of X_i
n_i: the number of instances of X_i
E_i: the labeled subset of X_i
K: the total number of data sources
M_i: the manifold of X_i
Y_i: the data representation of M_i
y_i^p, y_i^q: the pth and qth instances of Y_i
l_i: the number of dimensions of Y_i
f: the projection Y = f^T X
A: the binary connection matrix that models a manifold
W: the weight matrix that models a manifold
σ: the filtering parameter of the weight matrix
D: the degree matrix of a graph
L: the Laplacian matrix of a graph
D̃: the pairwise distance matrix
λ: the eigenvalue of the generalized eigenvalue decomposition
L(·): the loss function
dn: the dimension of the underlying manifold
k: the number of local neighbors
µ: the weighting of the topology structure in MA
F: the filter function in MAPPER
b: the number of bins in MAPPER
c: the overlap rate in MAPPER
•: the fusion in a certain form

Materials and Methods
In this section, the general concept of the manifold technique is introduced with the help of the necessary mathematical notation. Meanwhile, the theoretical impact of different learning strategies on the fusion result is discussed. Then, the following sub-sections recall the principles of the four selected state-of-the-art manifold fusion techniques, namely LPP [52], GGF [48], MA [36,44], and MIMA [53]. Pseudo-codes of these four algorithms are listed in Appendices A-D, which provide the technical details. Finally, the data sets and the experiment settings are introduced in detail.

Manifold Technique, Learning Strategy, and Notations
Let X_i = [x_i^1, ..., x_i^p, ..., x_i^{n_i}] ∈ R^{m_i × n_i} be a matrix representing the ith data source, with m_i dimensions by n_i instances. For simplicity, the subscript i is omitted in the following content when only one data source is involved. The m_i-dimensional data space is referred to as the feature space of the data X_i in this paper. The term x_i^p denotes the pth instance of the ith data source. Let K denote the total number of data sources.
A manifold M is a smooth hyper-surface embedded in a higher dimensional space [56]; e.g., the surface of a sphere is a 2D manifold in a 3D space. The underlying assumption of the manifold technique is that, for data X ∈ R^{m×n} with redundant m dimensions, there exists a low dimensional intrinsic manifold M where the data distribute [54,57,63,64]. The goal of a manifold technique is to pursue a representation of the manifold M, realized by a projection Y = [y^1, ..., y^p, ..., y^n] ∈ R^{l×n}, l < m, of the original data. In order to approximate Y, the bridging property is that a data point y^p on the manifold is locally homeomorphic to its counterpart x^p in the feature space [56]. This means that a data point has identical local structures in its intrinsic manifold and in its feature space. With this property, a variety of methods [54,55,57,65] extract the local structure of the data [52,[66][67][68][69] in its feature space as an estimation of the local structure in its intrinsic manifold, with different locality criteria. All of those methods pursue an optimized projection f which maps data from the feature space to a representation (Y = f^T X) of the intrinsic manifold M. In terms of the manifold technique for data fusion [36,44,48], the aim is to find the projections which map multiple data sources {X_1, X_2, ..., X_K} into a fused manifold M where the fused data lie.
The centerpiece of the abovementioned algorithms is the modeling of the manifold. Usually, an intrinsic manifold of the data is modeled by an n × n symmetric binary matrix A that describes the connections among the data points: A(p, q) = 1 for a confirmed connection between x_i^p and x_i^q, while A(p, q) = 0 otherwise. A can be generalized to an n × n symmetric weight matrix W. Different from A, W(p, q) takes a real value in [0, 1], which describes the strength of the connection between x_i^p and x_i^q. Essentially, A and W are the adjacency and weight matrices of a graph that captures the topology of the manifold. As introduced in [52], the manifold structure (A or W) can be defined from different perspectives. In this paper, we categorize these perspectives based on how the labeled data are utilized for modeling the manifold, namely unsupervised, supervised, and semi-supervised learning.

•
Unsupervised learning takes the original geometric assumption that the manifold and the original data space share the same local properties. Besides the geometric measure, model-based similarity measurements can also be used to build up the structure of the manifold. The key point is that the chosen similarity measurement must be capable of revealing the underlying distribution of the data or the physical information in the data.

•
Supervised learning assumes that a given set of labeled data includes a sufficient number of inter- and intra-class connections among the data points, so that they can well capture the topology of the manifold. As a result, the underlying manifold is directly defined by the label information. Thus, the quality of the labels has a great impact.

•
Semi-supervised learning pursues a manifold where the data distribution partially follows the label information and partially follows a distribution predefined by a similarity measurement. This manifold implicitly propagates the label information to the unlabeled data.
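As a concrete illustration, the three strategies differ only in how the graph weight matrix W is built. The following is a minimal numpy sketch, not the implementation used in this paper; the function names, the blend weight alpha, and the convention that -1 marks unlabeled points are our own assumptions:

```python
import numpy as np

def knn_weights(X, k=3, sigma=1.0):
    """Unsupervised: heat-kernel weights over kNN pairs. X is m x n (features by instances)."""
    n = X.shape[1]
    D = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)  # pairwise distances
    W = np.zeros((n, n))
    for p in range(n):
        nbrs = np.argsort(D[p])[1:k + 1]           # k nearest neighbors of x^p (skip self)
        W[p, nbrs] = np.exp(-D[p, nbrs] ** 2 / sigma ** 2)
    return np.maximum(W, W.T)                      # symmetrize the graph

def label_weights(labels):
    """Supervised: connect points that share a class label (-1 marks unlabeled)."""
    labels = np.asarray(labels)
    W = (labels[:, None] == labels[None, :]).astype(float)
    W[labels == -1, :] = 0.0                       # unlabeled points get no edges
    W[:, labels == -1] = 0.0
    np.fill_diagonal(W, 0.0)
    return W

def semi_supervised_weights(X, labels, k=3, sigma=1.0, alpha=0.5):
    """Semi-supervised: convex blend of the geometric and label-derived graphs."""
    return alpha * knn_weights(X, k, sigma) + (1 - alpha) * label_weights(labels)
```

Label information then propagates to unlabeled points through the geometric edges of the blended graph.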

Locality Preservation Projection (LPP)
LPP aims to find a lower dimensional representation Y of the original data X which reflects the intrinsic manifold M. According to the geometric assumption that the intrinsic manifold and the original data share the same local properties, the lower dimensional representation Y achieved by LPP preserves the local structure of the original data X. The locality is defined by either the k-nearest-neighbors or the ε-neighborhoods [52] and is mathematically described by a weight matrix W as in Equation (1):
W(p, q) = e^{−‖x^p − x^q‖² / σ²} if x^p and x^q are local neighbors; W(p, q) = 0 if x^p and x^q are not local neighbors, (1)

where σ is a filtering parameter.
LPP pursues an optimized projection f which maps the data X to a lower dimensional representation Y = f^T X. As the local structure of the intrinsic manifold is modeled by Equation (1), minimizing the objective function expressed by Equation (2) encourages the preservation of the derived local structure in the intrinsic manifold:

L(f) = ∑_{p,q} (f^T x^p − f^T x^q)² W(p, q). (2)

Thus, the optimization is formulated as follows:

min_f f^T X L X^T f subject to f^T X D X^T f = 1. (3)

As proven in [52], the solution that minimizes the objective function L(f) is given by the minimum eigenvalue solutions of the generalized eigenvalue problem expressed in Equation (4):

X L X^T f = λ X D X^T f, (4)
where D is the degree matrix, with D(p, p) = ∑_{q=1}^{n} W(p, q) and D(p, q) = 0 for p ≠ q, and L is the Laplacian matrix, L = D − W.
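The whole pipeline (weight matrix, Laplacian, generalized eigenproblem) condenses into a few lines. Below is a minimal sketch rather than the reference implementation; the small ridge term added to the right-hand side is our own numerical-stability choice:

```python
import numpy as np
from scipy.linalg import eigh

def lpp(X, W, l):
    """Minimal LPP sketch. X is m x n (features by instances), W is the n x n
    weight matrix of Equation (1), l is the target dimension. Solves the
    generalized eigenproblem X L X^T f = lambda X D X^T f (Equation (4)) and
    keeps the eigenvectors of the l smallest eigenvalues."""
    D = np.diag(W.sum(axis=1))                    # degree matrix
    L = D - W                                     # graph Laplacian, L = D - W
    A = X @ L @ X.T
    B = X @ D @ X.T + 1e-8 * np.eye(X.shape[0])   # small ridge for stability
    _, vecs = eigh(A, B)                          # eigenvalues in ascending order
    f = vecs[:, :l]                               # projection matrix, m x l
    return f.T @ X                                # Y = f^T X, l x n
```

Keeping the smallest eigenvectors corresponds to the minimization in Equation (3): directions along which locally connected points stay close.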
As briefly described above, LPP was originally designed as a dimension reduction algorithm, not a data fusion algorithm. However, it is essential to include this algorithm in the scope of this paper, because: (1) when conducting manifold fusion, dimension reduction is also accomplished as a side effect. Due to the well-known curse of dimensionality [70], classification on a selective subset of dimensions can result in better performance than using the data with all dimensions [71]. LPP can thus serve as a baseline algorithm to reduce the dimension of the data; and (2) LPP is essentially a manifold learning technique, and some data fusion algorithms [48,72] are developed on the idea of data alignment using LPP.

Generalized Graph-Based Fusion (GGF)
GGF was originally proposed to fuse hyperspectral and LiDAR data for land cover classification [48]. Its fusion strategy comprises a joint LPP dimension reduction and an additional constraint that captures the common local structure that exists in all data sources.
Technically, GGF concatenates the K data sources, which are treated as one data source in a high dimensional feature space. Therefore, GGF is essentially an LPP carried out on the data stack X̃, with an additional constraint. The constraint assumes that the connectivity Ã of the fused intrinsic manifold should be a complete subset of the connectivity matrices of the manifolds M_i of the individual data sources X_i, i ∈ {1, 2, ..., K}. Thus, the assumption is formulated as Equation (5):

Ã = A_1 ⊙ A_2 ⊙ ... ⊙ A_K, (5)

where ⊙ indicates element-wise multiplication.
The manifold constraint Ã is embedded into an n by n pairwise distance matrix D̃ (D̃(p, q) = ‖x̃^p − x̃^q‖), as expressed by Equation (6):

D_GGF = D̃ ⊙ Ã + max(D̃) · (¬Ã), (6)

where ¬ denotes the logical negation operator and max(·) means the maximum value over all elements of '·'. The distance between any two data points that are not connected according to Ã is penalized with the maximum distance value of D̃. The final distance matrix is denoted D_GGF.
The weight matrix W of the intrinsic manifold is then as follows.
W(p, q) = e^{−D_GGF(p,q)} if x^p and x^q are local neighbors; W(p, q) = 0 if x^p and x^q are not local neighbors. (7)

After obtaining the weight matrix W, similarly to LPP, the optimized projection f is given by the minimum eigenvalue solutions of the generalized eigenvalue problem in Equation (8):

X̃ L X̃^T f = λ X̃ D X̃^T f, (8)
where D is the degree matrix, with D(p, p) = ∑_{q=1}^{n} W(p, q) and D(p, q) = 0 for p ≠ q, and L is the Laplacian matrix, L = D − W.
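The GGF graph construction can be sketched in a few lines of numpy (our own simplification, with a hypothetical sigma parameter): per-source kNN adjacencies are intersected element-wise as in Equation (5), non-edges are penalized with the maximum distance as in Equation (6), and the exponential weighting of Equation (7) is applied:

```python
import numpy as np

def ggf_weights(adjacencies, X_stack, sigma=1.0):
    """adjacencies: list of n x n binary kNN adjacencies, one per data source.
    X_stack: the feature-stacked data, m x n. Returns the fused weight matrix."""
    A_fused = adjacencies[0].copy()
    for A in adjacencies[1:]:
        A_fused = A_fused * A                        # Eq. (5): element-wise product
    D = np.linalg.norm(X_stack[:, :, None] - X_stack[:, None, :], axis=0)
    D_ggf = np.where(A_fused == 1, D, D.max())       # Eq. (6): penalize non-edges
    return np.where(A_fused == 1, np.exp(-D_ggf / sigma), 0.0)  # Eq. (7)
```

An edge survives only if every data source agrees that the two pixels are local neighbors, which is exactly the "complete subset" constraint.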

Manifold Alignment (MA)
Manifold alignment [35,36,44] aims to learn a set of projections {f_1, ..., f_K} that (1) apply to the individual data sources X_i in order to obtain their individual manifolds M_i, and (2) align the obtained manifolds {M_1, ..., M_K} with each other.
As designed in [36,44], three properties hold in the fused manifold: (a) data of the same class should be located close to each other; (b) data of different classes should be located far from one another; and (c) the intrinsic manifolds of the individual data sources are preserved. These three properties are respectively formulated by the three connection matrices Ã_s (Equation (9)), Ã_d (Equation (10)), and Ã_g (Equation (11)).
The connection matrix of similarity, Equation (9), is computed from the label information to pursue property (a):

Ã_s^{i,j}(p, q) = 1 if x_i^p and x_j^q are labeled with the same class; 0 otherwise. (9)

The connection matrix of dissimilarity, Equation (10), is also computed from the labels and accomplishes property (b):

Ã_d^{i,j}(p, q) = 1 if x_i^p and x_j^q are labeled with different classes; 0 otherwise. (10)

The connection matrix of geometry, Equation (11), describes the manifolds of the individual data sources by using kNN, which aims at property (c):

Ã_g^{i,i}(p, q) = 1 if x_i^p and x_i^q are kNN within the same data source; Ã_g^{i,j} = 0 for i ≠ j. (11)

All of the matrices (9)-(11) have the size (n_1 + n_2 + ... + n_K) × (n_1 + n_2 + ... + n_K). In each matrix, the superscript i, j (e.g., A^{i,j}) represents the relationship between the ith and jth data sources.
Minimizing Equation (12) pulls data of the same class together, which meets property (a):

L_s = ∑_{i,j} ∑_{p,q} ‖f_i^T x_i^p − f_j^T x_j^q‖² Ã_s^{i,j}(p, q). (12)

Maximizing Equation (13) pushes data of different classes away from each other, which meets property (b):

L_d = ∑_{i,j} ∑_{p,q} ‖f_i^T x_i^p − f_j^T x_j^q‖² Ã_d^{i,j}(p, q). (13)

Minimizing Equation (14) preserves the geometric structure of the individual data sources, which corresponds to property (c):

L_g = ∑_i ∑_{p,q} ‖f_i^T x_i^p − f_i^T x_i^q‖² Ã_g^{i,i}(p, q). (14)

The terms (12)-(14) jointly construct the objective function (15):

L(f_1, ..., f_K) = (L_s + µ L_g) / L_d, (15)

and hence the optimization problem (16) can be written as

min_{f_1, ..., f_K} L(f_1, ..., f_K). (16)

As proven in [35], the solution {f_1, ..., f_K} that minimizes the cost function L(f_1, ..., f_K) is given by the eigenvectors of the smallest non-zero eigenvalues of the generalized eigenvalue decomposition in Equation (17):

X (L_s + µ L_g) X^T f = λ X L_d X^T f, (17)

where X is the joint block-diagonal data matrix of {X_1, ..., X_K} and f stacks {f_1, ..., f_K}. The degree matrices satisfy D_{s,d,g}(p, p) = ∑_q Ã_{s,d,g}(p, q) and D_{s,d,g}(p, q) = 0 for p ≠ q; the matrices D and L with subscripts s, d, and g are the degree matrices and the Laplacian matrices (L = D − Ã), respectively.
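For two data sources, the three constraint matrices can be assembled from the labels and the per-source kNN graphs, as in the following numpy sketch (our own minimal version; -1 marks unlabeled instances):

```python
import numpy as np

def ma_connection_matrices(labels_list, knn_graphs):
    """Sketch of the three MA constraint matrices (Equations (9)-(11)).
    labels_list[i]: class labels of source i (-1 = unlabeled).
    knn_graphs[i]: the n_i x n_i kNN adjacency of source i.
    Returns A_s (same class, within and across sources), A_d (different class),
    and A_g (block-diagonal per-source geometry)."""
    y = np.concatenate(labels_list)
    n = len(y)
    labelled = y != -1
    both = np.outer(labelled, labelled)            # both endpoints must be labeled
    A_s = ((y[:, None] == y[None, :]) & both).astype(float)
    A_d = ((y[:, None] != y[None, :]) & both).astype(float)
    np.fill_diagonal(A_s, 0.0)
    A_g = np.zeros((n, n))
    start = 0
    for G in knn_graphs:                           # geometry only within each source
        m = G.shape[0]
        A_g[start:start + m, start:start + m] = G
        start += m
    return A_s, A_d, A_g
```

Note that A_s and A_d connect instances across sources, while A_g is block-diagonal, matching the i = j restriction in Equation (11).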

MAPPER-Induced Manifold Alignment (MIMA)
MIMA is designed to fuse optical and PolSAR data for the purpose of LULC classification [53]. It follows the framework of MA [36,44] yet introduces a novel constraint term which originates from the recent field of topological data analysis (TDA). TDA has emerged as a new mathematical sub-field of big data analysis that aims to derive relevant information from the topological properties of data [73][74][75][76][77]. One TDA tool, named MAPPER [78], has been proven capable of revealing unknown insights in medical studies by interpreting the topological structures of data sets [79][80][81][82]. As a brief introduction, MAPPER requires a filter function as an input, which projects the data into a parameter space.
The original data are sorted into overlapping bins guided by the projected parameter. MAPPER then carries out clustering of the data points in each of the data bins, respectively. Afterwards, MAPPER models a graph where a node represents a cluster and an edge links two clusters that share common data points. Finally, a simplified graph is built up to represent the shape of the data. Such a graph is an approximation of the Reeb graph [83].
Technically, MIMA pursues the solution {f_1, ..., f_K} by solving the same generalized eigenvalue decomposition as in Equation (17), except that the connection matrix of geometry Ã_g (Equation (11)) is replaced by the MAPPER-derived connection matrix Ã_MIMA, where A_MIMA^{i,i}(p, q) = 1 if x_i^p and x_i^q belong to the same cluster or to two separate but linked clusters, and A_MIMA^{i,i}(p, q) = 0 elsewhere. Compared to Ã_g, Ã_MIMA introduces some unique properties that are listed as follows:

•

Field knowledge. Expert knowledge is introduced by the selection of the filter function. It defines a perspective from which the data are viewed while deriving the structure.

•

A regional-to-global structure. Clustering in each data bin provides a regional structure. The design of overlapping bins combines the regional structures into a global one. This makes the derived structure more robust to outliers than one derived by kNN.

•
A data-driven regional structure. Spectral clustering is applied in the clustering step, which is capable of detecting the number of clusters via the concept of the eigen-gap [84]. This allows the derived structure to conform to the data distribution.
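The bin-cluster-link pipeline described above can be illustrated with a toy 1-D MAPPER sketch. This is not the MIMA implementation: the spectral clustering step is replaced here by a simple epsilon-neighborhood connected-components clustering, and all parameter names are our own:

```python
import numpy as np

def _threshold_clusters(idx, X, eps):
    """Toy clustering: connected components of the eps-neighborhood graph."""
    comps, unvisited = [], set(idx)
    while unvisited:
        stack, comp = [unvisited.pop()], []
        while stack:
            p = stack.pop()
            comp.append(p)
            near = [q for q in list(unvisited)
                    if np.linalg.norm(X[:, p] - X[:, q]) < eps]
            for q in near:
                unvisited.remove(q)
                stack.append(q)
        comps.append(sorted(comp))
    return comps

def mapper(X, f_vals, b=4, overlap=0.25, eps=0.5):
    """MAPPER sketch: cover the filter range with b overlapping bins, cluster
    the points falling into each bin, and link clusters that share points.
    X is m x n (features by instances), f_vals the 1-D filter values."""
    lo, hi = f_vals.min(), f_vals.max()
    width = (hi - lo) / b
    clusters = []
    for i in range(b):
        a = lo + i * width - overlap * width       # overlapping bin [a, c]
        c = lo + (i + 1) * width + overlap * width
        idx = [p for p in range(X.shape[1]) if a <= f_vals[p] <= c]
        clusters += _threshold_clusters(idx, X, eps)
    edges = [(i, j)
             for i in range(len(clusters)) for j in range(i + 1, len(clusters))
             if set(clusters[i]) & set(clusters[j])]   # shared points -> edge
    return clusters, edges
```

In MIMA, the resulting node-edge structure is what populates the connection matrix Ã_MIMA.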

Data Description
Two sets of real data were used to investigate the manifold learning techniques for the fusion of hyperspectral and PolSAR data. The two data sets cover the cities of Berlin and Augsburg, Germany.

The Berlin Data Set
In the Berlin data set, the hyperspectral image is a synthetic spaceborne EnMAP scene synthesized from airborne HyMap data. It has a size of 817 by 220 pixels, a 30-m ground sampling distance (GSD), and 244 spectral bands ranging from 400 nm to 2500 nm [85]. The dual-channel PolSAR data are VH-VV polarized Sentinel-1 data acquired in interferometric wide swath mode. The Sentinel-1 SLC data were preprocessed using the ESA SNAP toolbox and filtered by a non-local mean filter [86]. The PolSAR data have a GSD of 13 m and a size of 1723 by 476 pixels. The ground truth is a land use land cover map derived from Open Street Map (OSM) data [87]. The ground truth labels are spatially separated into a training data set and a testing data set, shown in Figure 2. The details of the training and testing data sets are summarized in Table 2.

Experiment Setting
We start with a reasonable feature selection and extraction strategy on the original data, since it is well known that feature selection and extraction promote the classification performance of remote sensing data. Spectral-spatial feature extraction was employed for the hyperspectral image because of its excellent performance on classification tasks [88][89][90][91]. Specifically, the first four and six principal components (PCs), which occupy 99% of the variance of the data, were extracted from the hyperspectral images of Berlin and Augsburg, respectively. Morphological profiles with radii of one, two, and three were employed to extract the spatial information of each PC. Thus, in total, 28 features and 42 features were extracted from the hyperspectral images of Berlin and Augsburg, respectively. For the feature extraction of the Sentinel-1 dual-Pol data, four polarimetric features were extracted: the intensity of the VH channel, the intensity of the VV channel, the coherence of VV and VH, and the intensity ratio of VV and VH. Since the morphological profile has been proven to promote the classification of PolSAR data [92,93], it was also used to extract spatial information from the four polarimetric features with radii equal to one, two, and three. In addition, local statistics including the mean and standard deviation were extracted using a sliding window of 11 by 11 pixels on those four polarimetric features. In total, 36 features were extracted from the dual-Pol SAR data for both the Berlin and Augsburg data sets.
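The 99%-variance rule for selecting principal components can be sketched as follows (a generic PCA via SVD, not the authors' exact preprocessing chain; the morphological profiles are omitted):

```python
import numpy as np

def pca_99(X, threshold=0.99):
    """Keep the leading principal components that explain `threshold` of the
    variance. X is bands x pixels; returns the component scores (r x pixels)."""
    Xc = X - X.mean(axis=1, keepdims=True)         # center each band
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    var = S ** 2 / np.sum(S ** 2)                  # explained variance ratios
    r = int(np.searchsorted(np.cumsum(var), threshold) + 1)
    return U[:, :r].T @ Xc                         # project onto the r leading PCs
```

On the data sets above this rule retained four (Berlin) and six (Augsburg) components.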
To carry out a comprehensive comparison of the fusion algorithms, in total 16 algorithms were implemented. Listed in Table 4, they are: (1) PolSAR data only (POL), (2) hyperspectral image only (HSI), (3) feature stacking of hyperspectral and PolSAR data (HSI+POL), (4) data alignment using the original locality preserving projections (LPP) [52], (5) the supervised version of LPP (LPP_SU), (6) the semi-supervised version of LPP (LPP_SE), (7) the generalized graph-based fusion (GGF) [48], (8) the supervised version of GGF (GGF_SU), (9) the semi-supervised version of GGF (GGF_SE), (10) manifold alignment (MA) [36,44], (11) the unsupervised version of MA (MA_UN), (12) the supervised version of MA (MA_SU), (13) MAPPER-induced manifold alignment with the first two principal components as filter functions (MIMA) [53], (14) unsupervised MIMA (MIMA_UN), (15) MIMA with local density as the filter function (MIMA-D), and (16) unsupervised MIMA with local density as the filter function (MIMA-D_UN). Table 4 gives a technical summary of the selected algorithms: 'SU', 'UN', and 'SE' represent the supervised, unsupervised, and semi-supervised learning strategies, respectively; W and A represent the weight matrix and the connection matrix, respectively; and the hyperparameter set {k, dn, µ, b} indicates the number of neighbors, the number of dimensions, the topology weighting parameter, and the number of bins. The manifold algorithms listed in Table 4 are categorized into the two approaches (data alignment or manifold alignment) mentioned in Section 1.2. LPP and GGF belong to the category of data alignment, which concatenates the data as a stack and applies manifold learning on the stacked data. MA and MIMA belong to the category of manifold alignment, which independently projects the K data sources to a latent space where the data are aligned.

The hyperparameters of each algorithm were tuned via a grid search, so that each algorithm reaches its best performance. The number of neighbors k was set in a range of 10 to 120 with an interval of 10. The number of dimensions dn was set in a range of 5 to 50 with an interval of 5. The topology weighting parameter µ was set in a range of 0.5 to 3 with an interval of 0.5. The number of bins b was set in a range of 5 to 55 with an interval of 5.
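The grid search over these ranges amounts to an exhaustive sweep. A minimal sketch follows; the `evaluate` callback, standing in for a full fuse-and-classify run, is hypothetical:

```python
import itertools

# Hyperparameter grids as described in the text.
GRID = {
    "k": list(range(10, 121, 10)),          # number of neighbors
    "dn": list(range(5, 51, 5)),            # number of dimensions
    "mu": [0.5 * i for i in range(1, 7)],   # topology weighting, 0.5 .. 3.0
    "b": list(range(5, 56, 5)),             # number of MAPPER bins
}

def grid_search(evaluate, grid):
    """Exhaustive grid search: returns (best_score, best_params)."""
    keys = list(grid)
    best_score, best_params = float("-inf"), None
    for values in itertools.product(*(grid[key] for key in keys)):
        params = dict(zip(keys, values))
        score = evaluate(params)            # e.g., overall accuracy of one run
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params
```

Since not every algorithm uses all four hyperparameters, in practice each algorithm would sweep only its own subset of GRID.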
After the data are fused, five different shallow classifiers were applied to the fused data set in the classification step: one nearest neighbor (1NN) [58], linear SVM (LSVM) [59,60], Gaussian kernel SVM (KSVM) [59,60], random forest (RF) [61], and canonical correlation forest (CCF) [62]. The parameter tuning of the LSVM was done in a heuristic procedure [60]. LIBSVM [94] was employed for the implementation of the KSVM. The number of trees was set to 40 for both RF and CCF.

Experiment Results
The discussion of the experiment results mainly focuses on the following three aspects: • Manifold learning strategy. The experiment results support the discussion of the impact caused by the different learning strategies: unsupervised, supervised, and semi-supervised learning.

•
Data fusion approach. The results support the discussion of the two fusion approaches, data alignment-based and manifold alignment-based, for the fusion of hyperspectral images and PolSAR data.
• Performance on classification. The experiment results reveal how manifold techniques perform on fusing hyperspectral images and PolSAR data, and how differently these manifold techniques perform.
The classification results are quantitatively evaluated by the class-specific accuracy, the average accuracy, the overall accuracy, and the kappa coefficient. The class-specific accuracy provides the percentage of correct predictions for a specific class. The average accuracy is the mean value of the class-specific accuracies. The overall accuracy indicates the percentage of correct predictions among all predictions. The kappa coefficient also evaluates the overall correctness, yet it is more robust than the overall accuracy [95].
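All four metrics can be computed from a confusion matrix; the following is a straightforward numpy sketch (not tied to any particular toolbox):

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Returns (overall accuracy, average accuracy, kappa). Rows of the
    confusion matrix are true classes, columns are predicted classes."""
    classes = np.unique(y_true)
    n = len(y_true)
    C = np.zeros((len(classes), len(classes)))
    for i, a in enumerate(classes):
        for j, b in enumerate(classes):
            C[i, j] = np.sum((y_true == a) & (y_pred == b))
    per_class = np.diag(C) / C.sum(axis=1)               # class-specific accuracies
    oa = np.trace(C) / n                                 # overall accuracy
    aa = per_class.mean()                                # average accuracy
    pe = np.sum(C.sum(axis=0) * C.sum(axis=1)) / n ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)                         # Cohen's kappa
    return oa, aa, kappa
```

The kappa value discounts the agreement expected by chance (pe), which is why it is the more robust of the two overall measures.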

Experiment on the Berlin Data Set
As shown in Figure 4 and Table 5, for the data alignment-based fusion algorithms (LPP and GGF), the unsupervised versions outperform the supervised and semi-supervised versions. However, for the manifold alignment-based fusion algorithms (MA, MIMA, and MIMA-D), the semi-supervised versions have the best performance compared to the supervised and unsupervised ones. Surprisingly, in both types of fusion algorithms, the fully supervised strategy performs the worst. Taking the result of the simple concatenation (HSI+POL in Table 5) as reference, the data alignment-based fusion algorithms (LPP and GGF) only marginally improve the classification accuracy. Sometimes their performance even drops below the reference accuracy. On the contrary, the manifold alignment-based fusion algorithms (MA, MIMA, and MIMA-D) provide a more consistent improvement of the classification accuracy, by ca. 3%. In fact, MIMA and MIMA-D show a considerable improvement compared to LPP, GGF, and MA, especially when RF and CCF are employed as the classifier. This can be seen in Figure 4. Among all the algorithms, MIMA and MIMA-D have the best overall performance. As shown in Table 5, their best performances reach over 0.66, 65%, and 79% for the kappa coefficient, the average accuracy, and the overall accuracy, respectively. For a visual comparison, Figure 5 plots the ground truth and the classification maps predicted by the 16 algorithms with CCF.

Experiment on Augsburg Data Set
The findings on the Augsburg data set are consistent with those on the Berlin data set. For the data alignment-based fusion algorithms, the unsupervised learning strategy works best among the three learning strategies. For the manifold alignment-based fusion algorithms, the semi-supervised learning strategy performs best. Compared to simple concatenation (HSI+POL), the data alignment-based fusion (LPP and GGF) barely brings any improvement, while the manifold alignment-based fusion improves over LPP and GGF by about 2%. These findings can be seen in Figure 6 and Table 6. Among all the algorithms, combining MIMA or MIMA-D with RF or CCF provides the best classification performance: their kappa coefficient, average accuracy, and overall accuracy reach 0.56, 62.5%, and 62.5%, respectively. A visual comparison of the results is shown in Figure 7; similar to Figure 5, the classification maps were predicted by CCF.

The Setting of the Training and Testing Samples
As shown in Figures 2 and 3, the training and testing samples are spatially separated, which is a standard machine learning practice. However, the distributions of the training samples of the Berlin and the Augsburg data sets differ slightly. For each class of the Berlin data set, the training data are scattered block-wise over the whole area. For the Augsburg data set, the training data only cover the western part of the area; there is no training sample from the eastern half of the site, where the testing data are located. Both scenarios are common in remote sensing applications, but the latter is naturally more challenging. This is why the overall accuracy on the Augsburg data set fluctuates around 56%, while it is around 76% on the Berlin data set.
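An Augsburg-style split, where training pixels come only from the western half of the scene, can be mimicked with a simple column threshold. This is a hypothetical sketch; the function name, the `split_col` parameter, and the convention that 0 marks unlabeled pixels are assumptions, not part of the paper:

```python
import numpy as np

def spatial_split(label_map, split_col):
    """Spatially disjoint train/test split: training pixels strictly
    west of column `split_col`, testing pixels east of it.
    label_map: 2-D integer array of class ids, 0 = unlabeled."""
    rows, cols = np.nonzero(label_map)
    train = [(int(r), int(c)) for r, c in zip(rows, cols) if c < split_col]
    test = [(int(r), int(c)) for r, c in zip(rows, cols) if c >= split_col]
    return train, test
```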

The Data Alignment Fusion
An unsupervised data alignment-based fusion, in this article, pursues an intrinsic manifold of the concatenation of the hyperspectral and the PolSAR data. Intuitively, making use of additional label information in the manifold learning (semi-supervision) should improve the classification accuracy. However, we observed the exact opposite in our experiments. We believe this is due to the misalignment of image pixels between the optical and the SAR image, caused by their distinct imaging geometries. This pixel misalignment makes learning a joint manifold more difficult, and adding one more manifold defined by the misaligned labels only has a destructive effect. Therefore, the data alignment-based fusion algorithms are not competent for fusing hyperspectral and PolSAR data at resolutions similar to our data sets. This finding should also generalize to high-resolution optical and SAR data, although we have not conducted such an experiment.
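The data-alignment pipeline discussed above can be summarized as: concatenate the co-registered per-pixel features, then learn a single manifold, e.g., with LPP. The following is a minimal sketch under stated assumptions; the heat-kernel weighting, the bandwidth heuristic, and the ridge term are illustrative choices, not the exact settings of the compared algorithms:

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def data_alignment_lpp(X_hsi, X_pol, k=5, d=4):
    """Unsupervised data-alignment fusion: concatenate the per-pixel
    hyperspectral and PolSAR features (rows = co-registered pixels),
    then learn one joint manifold with Locality Preserving Projections."""
    X = np.hstack([X_hsi, X_pol])        # alignment in the data domain
    D2 = cdist(X, X, "sqeuclidean")
    sigma = np.median(D2)                # heat-kernel bandwidth (heuristic)
    W = np.zeros_like(D2)
    nbrs = np.argsort(D2, axis=1)[:, 1:k + 1]   # k nearest neighbors, self excluded
    for i, nb in enumerate(nbrs):
        W[i, nb] = np.exp(-D2[i, nb] / sigma)
    W = np.maximum(W, W.T)               # symmetrize the graph
    Deg = np.diag(W.sum(axis=1))
    L = Deg - W                          # graph Laplacian
    # LPP generalized eigenproblem: X^T L X a = lambda X^T Deg X a
    A = X.T @ L @ X
    B = X.T @ Deg @ X + 1e-6 * np.eye(X.shape[1])   # small ridge for stability
    _, vecs = eigh(A, B)
    return X @ vecs[:, :d]               # project onto the smallest eigenvectors
```

Note that the labels never enter this construction; the semi-supervised variants add a label graph on top of the same concatenated data, which is exactly where the pixel misalignment becomes destructive.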

The Manifold Alignment Fusion
Different from the data alignment-based fusion, the semi-supervised manifold alignment-based fusion outperforms its unsupervised counterpart. This fusion concept is able to exploit label information while pursuing the intrinsic manifold, because it models the manifold of each data source independently, which suits the fact that hyperspectral and PolSAR data are severely dissimilar in geometry and content. The label information is merged into the two manifolds in such a way that each manifold is separately linked to the labels, and the two manifolds are then aligned to each other via the labels. In this manner, the benefit of the labeled data shows in the classification results. Compared to the data alignment-based fusion, the manifold alignment-based fusion introduces considerable improvements to the classification accuracy, which demonstrates its competence for the fusion of hyperspectral and PolSAR data.
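The alignment idea above can be sketched as a joint Laplacian-eigenmaps problem: each source keeps its own neighborhood graph, and labeled samples of the same class are linked across sources. This is a simplified, nonparametric illustration of the concept (the compared MA/MIMA algorithms learn linear projections instead), and the graph-construction details are assumptions:

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def knn_graph(X, k):
    """Symmetric k-nearest-neighbor adjacency with heat-kernel weights."""
    D2 = cdist(X, X, "sqeuclidean")
    sigma = np.median(D2) + 1e-12
    W = np.zeros_like(D2)
    nbrs = np.argsort(D2, axis=1)[:, 1:k + 1]
    for i, nb in enumerate(nbrs):
        W[i, nb] = np.exp(-D2[i, nb] / sigma)
    return np.maximum(W, W.T)

def semi_supervised_alignment(X1, X2, y1, y2, k=3, d=2, mu=1.0):
    """Each source keeps its own neighborhood graph; labeled samples of
    the same class are linked across sources with weight mu; a joint
    Laplacian-eigenmaps embedding places both sources in one space.
    y1, y2: label vectors with -1 marking unlabeled samples."""
    n1, n2 = len(X1), len(X2)
    W = np.zeros((n1 + n2, n1 + n2))
    W[:n1, :n1] = knn_graph(X1, k)       # manifold of source 1
    W[n1:, n1:] = knn_graph(X2, k)       # manifold of source 2
    for i in range(n1):                  # cross-source links align the manifolds
        if y1[i] < 0:
            continue
        for j in range(n2):
            if y2[j] == y1[i]:
                W[i, n1 + j] = W[n1 + j, i] = mu
    Deg = np.diag(W.sum(axis=1))
    L = Deg - W
    _, vecs = eigh(L, Deg + 1e-9 * np.eye(n1 + n2))
    Y = vecs[:, 1:d + 1]                 # drop the trivial constant eigenvector
    return Y[:n1], Y[n1:]                # embeddings of both sources, aligned
```

Because the two within-source blocks never mix raw features, pixel misalignment between the sensors only affects the sparse label links, not the geometry of either manifold.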

The Filter Function of MIMA
As introduced in Section 2.5, the filter function of MIMA introduces expert knowledge while deriving the manifold structure of the data. MIMA and MIMA-D in this paper employ PCA and a density estimate as the filter function, respectively. Principal components are frequently used in classification and have proven effective [82,96]; the density is likewise an important property for classification and clustering tasks [97,98]. However, from the experiments in this paper, it is inconclusive which of the two is more suitable as the filter function.
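The two filter choices can be illustrated as follows. These are generic sketches of a first-principal-component score and a Gaussian kernel density estimate; the exact estimators and bandwidths used by MIMA and MIMA-D may differ:

```python
import numpy as np

def pca_filter(X):
    """Filter value of each sample: its score on the first principal
    component (the type of filter used by MIMA)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[0]

def density_filter(X, sigma=1.0):
    """Filter value of each sample: a Gaussian kernel density estimate
    (the type of filter used by MIMA-D)."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-D2 / (2.0 * sigma ** 2)).mean(axis=1)
```

Both functions map every sample to a scalar, which is all the Mapper-style construction in MIMA requires of a filter.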

Conclusions and Outlook
This paper compares 16 variants of four state-of-the-art multi-sensory data fusion algorithms based on manifold learning. The comparison was carried out via a rigorous evaluation of the 16 algorithms on land use land cover classification over two sets of spaceborne hyperspectral and PolSAR data. To make the comparison objective, the hyperparameters of the 16 algorithms were optimized via a grid search, and five different shallow classifiers were applied to the data fused by each of the 16 algorithms. We avoided deep network classifiers, because the goal of this article is to evaluate solely the performance of the fusion algorithms. The experiments lead to the following conclusions: (1) data alignment-based (data concatenation) manifold techniques are less competent for the fusion of hyperspectral and PolSAR data, or in general for optical and SAR image fusion, because concatenating two data sets with distinct imaging geometries causes difficulties, even destructive effects, when optimizing the target manifold; in contrast, manifold alignment-based techniques are more competent for this task, because the manifolds of the two data sources are separately modeled and then aligned; (2) among the manifold alignment-based techniques, semi-supervised methods are able to effectively exploit both the structure of the data and the existing label information; (3) the MIMA algorithm combined with the CCF classifier provides the best classification accuracy among all the algorithms.
Building on the current research, our future research directions include:

• In the current algorithms, the learned manifold is specific to the given input data sets. We would like to study the generalization of such a manifold to other data sets from the same sensors. Eventually, we aim at big data processing, where one common manifold can be applied to all data sets of the same type.

• Graph CNNs have been an emerging field in deep learning. It is also of great interest to combine them with the traditional manifold learning techniques described in this article.

• Because of the limited availability of spaceborne hyperspectral and PolSAR data, they have not been extensively applied to real-world problems. We would like to address more real-world applications, especially those for social good, using these two data types, for example, contributing to the monitoring of the United Nations' sustainable development goals.
project website: www.so2sat.eu), and the Helmholtz Association under the framework of the Young Investigators Group Signal Processing in Earth Observation (SiPEO) with the grant number VH-NG-1018 (project website: www.sipeo.bgu.tum.de).
Algorithm A5: MIMA(X_1, X_2, E_1, E_2, k)
Input:
  X_1: the data source X_1 ∈ R^(m_1 × n_1) with n_1 instances and m_1 dimensions
  X_2: the data source X_2 ∈ R^(m_2 × n_2) with n_2 instances and m_2 dimensions
  E_1: E_1 ∈ R^(1 × n*_1) with n*_1 < n_1, the labels for the first n*_1 instances of X_1
  E_2: E_2 ∈ R^(1 × n*_2) with n*_2 < n_2, the labels for the first n*_2 instances of X_2
  k: the number of local neighbors
Output:
  Ỹ_1: the projected data of X_1
  Ỹ_2: the projected data of X_2

Figure 1 .
Figure 1. Frameworks of manifold learning fusion techniques. Left: the data alignment fusion; right: the manifold alignment fusion. The blue arrow indicates the fusion step. The yellow arrow indicates where the modeling of the manifold takes place. The black arrow indicates the feature extraction. X_i: the ith data source; A or W: the mathematical modeling of the manifolds; Y: the fused feature; f: the learned projection; •: a fusion of a certain form.

Figure 2 .
Figure 2. The Berlin data set. From left to right: RGB components of the simulated EnMAP data; Sentinel-1 dual-Pol data; the training data; the testing data.

2.6.2. The Augsburg Data Set
Similar to the Berlin data set, the hyperspectral image in the Augsburg data set is a synthetic spaceborne image simulated from airborne HySpex data. It has a GSD of 30 m, a size of 332 by 485 pixels, and 180 bands ranging from 400 nm to 2500 nm. As in the Berlin data set, the PolSAR data is a VH-VV polarized Sentinel-1 image with a GSD of 10 m and a size of 997 by 1456 pixels. The training data and the testing data, which are spatially separated, are shown in Figure 3. The details of the training and testing data sets are summarized in Table 3.

Figure 3 .
Figure 3. The Augsburg data set. From left to right, top to bottom: RGB components of the hyperspectral image; Sentinel-1 dual-Pol data; the training data; the testing data.

Figure 4 .
Figure 4. Comparison of the classification accuracies of different classifiers applied to the Berlin data set. Each chart results from the corresponding classifier. The bottom-right chart shows the overall accuracies obtained by applying the five classifiers to the data fused by each of the 16 selected algorithms. The y-axis reports the overall accuracy in percentage (%). 'SU', 'SE', and 'UN' stand for supervised, semi-supervised, and unsupervised, respectively.

Figure 5 .
Figure 5. Visualization of the classification maps and the ground truth. The 16 classification maps are obtained by applying CCF to the data fused by the 16 algorithms, for the Berlin data set. The classification maps achieved by the manifold alignment fusion methods are more accurate than those achieved by the data alignment fusion methods.

Figure 6 .
Figure 6. Comparison of the classification accuracies of different classifiers applied to the Augsburg data set. Each chart results from the corresponding classifier. The bottom-right chart shows the overall accuracies obtained by applying the five classifiers to the data fused by each of the 16 algorithms. The y-axis reports the overall accuracy in percentage (%). 'SU', 'SE', and 'UN' stand for supervised, semi-supervised, and unsupervised, respectively.

Figure 7 .
Figure 7. Visualization of the classification maps and the ground truth. The 16 classification maps are obtained by applying CCF to the data fused by the 16 algorithms, for the Augsburg data set. The classification maps achieved by the manifold alignment fusion methods, especially the MIMA-based ones, are more accurate than those achieved by the data alignment fusion methods.

Algorithm A6 (excerpt):
3: Ã_(i,i)^MIMA = MIMA-MAPPER(X_i, b, c)
construct the degree matrices D_s, D_d, and D_MIMA with Ã_s, Ã_d, and Ã_MIMA, respectively
7: construct the Laplacian matrices L_s, L_d, and L_MIMA as instructed in Equation (17)
8: organize the data matrix X as instructed in Equation (17)
9: solve the generalized eigenvalue decomposition X(µL_g + L_s)X^T f = λ X L_MIMA X^T f, obtaining f_1 and f_2 with f = [f_1; f_2]
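The final step above reduces to one generalized eigendecomposition. Below is a minimal SciPy sketch; the function name, the small ridge term, and the convention of keeping the eigenvectors with the smallest eigenvalues are assumptions for illustration:

```python
import numpy as np
from scipy.linalg import eigh

def solve_mima_projection(X, L_s, L_g, L_mima, mu, d):
    """Solve X (mu*L_g + L_s) X^T f = lambda X L_MIMA X^T f and keep the
    d eigenvectors with the smallest eigenvalues as the columns of
    f = [f_1; f_2].  X: (m, n) stacked data matrix as in Equation (17)."""
    A = X @ (mu * L_g + L_s) @ X.T
    B = X @ L_mima @ X.T + 1e-6 * np.eye(X.shape[0])  # ridge keeps B positive definite
    _, vecs = eigh(A, B)                               # eigenvalues in ascending order
    return vecs[:, :d]
```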

Table 1 .
The notations used in this article.

Table 2 .
Summary of the training data and the testing data for the scene of city Berlin.

Table 3 .
Summary of the training data and the testing data for the scene of city Augsburg.

Table 5 .
Quantitative performance comparison on the Berlin data set, in terms of class-specific accuracy, kappa coefficient, average accuracy, overall accuracy, and mean overall accuracy. The mean overall accuracy is calculated from the overall accuracies achieved by the five classifiers. The listed results are achieved after hyperparameter tuning; the hyperparameters of each algorithm are given under the name of the algorithm, and their values are listed in the table. Kappa coefficients, average accuracies, and overall accuracies larger than 0.66, 65%, and 79%, respectively, are marked in bold, as are the three highest mean overall accuracies.

Table 6 .
Quantitative performance comparison on the Augsburg data set, in terms of class-specific accuracy, kappa coefficient, average accuracy, overall accuracy, and mean overall accuracy. The mean overall accuracy is calculated from the overall accuracies achieved by the five classifiers. The listed results are achieved after hyperparameter tuning; the hyperparameters of each algorithm are given under the name of the algorithm, and their values are listed in the table. Kappa coefficients, average accuracies, and overall accuracies larger than 0.56, 62.5%, and 62.5%, respectively, are marked in bold, as are the three highest mean overall accuracies.