A Collaborative Superpixelwise Autoencoder for Unsupervised Dimension Reduction in Hyperspectral Images

Abstract: The dimension reduction (DR) technique plays an important role in hyperspectral image (HSI) processing. Among various DR methods, superpixel-based approaches offer flexibility in capturing spectral-spatial information and have shown great potential in HSI tasks. Superpixel-based methods divide the samples into groups and apply the DR technique to each small group. Nevertheless, we find that these methods increase the intra-class disparity by neglecting the fact that samples from the same class may reside in different superpixels, resulting in performance decay. To address this problem, a novel unsupervised DR method named the Collaborative superpixelwise Auto-Encoder (ColAE) is proposed in this paper. The ColAE begins by segmenting the HSI into different homogeneous regions using a superpixel-based method. Then, a set of Auto-Encoders (AEs) is applied to the samples within each superpixel. To reduce the intra-class disparity, a manifold loss is introduced to restrict the samples from the same class, even if located in different superpixels, to have similar representations in the code space. In this way, compact and discriminative spectral-spatial features are obtained. Experimental results on three HSI data sets demonstrate the promising performance of ColAE compared to existing state-of-the-art methods.


Introduction
A hyperspectral image (HSI) records the light reflected from an object's surface in hundreds of wavelength bands, enabling the detection of subtle variations in the color, texture, and shape of objects within a scene. It provides valuable information about specific materials and their properties. Owing to its ability to capture both spectral and spatial information, HSI has been widely used in many fields, such as agriculture and environmental monitoring. HSI classification, which assigns a label to each pixel to identify ground classes such as trees, buildings, or grassland, is a crucial task in hyperspectral technology applications and a highly active research area in remote sensing.
The abundance of spectral information in HSI enables accurate classification based on spectral signatures. However, it also introduces challenges due to the high dimensionality of each pixel. These challenges include (1) redundant and noisy information in high-dimensional data, (2) the "curse-of-dimensionality" problem in machine learning, which arises with an increasing number of features, and (3) the higher computational and storage requirements associated with high-dimensional data. These challenges can degrade the performance of subsequent HSI processing steps. To address these issues, dimension reduction (DR) techniques are employed to obtain a compact representation with significantly fewer dimensions, which is beneficial for subsequent procedures.
Band selection [1] and feature extraction [2] are two families of popular DR techniques for HSI classification. Band selection methods reduce dimensionality by selecting a small subset of hyperspectral bands that retain the wavelength information. However, they often struggle to find the optimal subset of bands, according to [3]. In this paper, our focus is on feature extraction DR methods for HSI classification. These methods aim to find a compact representation of the data in a transformed feature space, effectively addressing the limitations of band selection approaches.
Over the past few decades, numerous feature extraction DR methods have been developed, which can be categorized into supervised, unsupervised, and semi-supervised methods. Supervised DR methods utilize the labels of the samples during the training process. For instance, Schwaller et al. first applied Linear Discriminant Analysis (LDA), a well-known DR method in machine learning, to HSI classification [4]. Studies [5][6][7] proposed methods to address the limited training sample problem in HSIs, studies [8][9][10] tackled the challenge of the nonlinearly separated problem in LDA, and studies [11,12] jointly combined LDA and sparse learning to capture the underlying structure of HSI samples. Unsupervised DR methods, in contrast, do not require label information during training. Lim et al. applied Principal Component Analysis (PCA) to HSI and observed that most energy was concentrated on a few eigenvalues [13]. Then, studies [14,15] utilized PCA for efficient feature extraction in the HSI classification task. Studies [16][17][18] employed a local manifold model to capture the geometric structure relationship within the data. Semi-supervised methods make use of both labeled and unlabeled samples for model training. Examples of such methods include studies [19][20][21]. In recent years, Deep Learning (DL) has gained popularity in various applications, including HSI classification tasks [22][23][24]. While DL methods have shown promise, their performance in unsupervised settings, where label information is not utilized, may not meet the requirements of real-world applications. Hence, this paper focuses on the unsupervised setting in the context of HSI classification.
In recent years, there has been a growing interest in utilizing both the spatial and spectral information of HSI to extract more discriminative features, since an HSI is a typical multi-channel image whose spatial domain also contains rich information. These methods can be broadly categorized into pixel neighbor-based and superpixel-based approaches. Pixel neighbor-based methods consider local pixel patches to incorporate spatial information. For example, He et al. [25] applied LPNPE to the spatial neighbors of each pixel to capture spatial relationships; Fang et al. [26] computed the local covariance matrix for a pixel using its spatial neighbor pixels and used it as a representation for classification; Li et al. [27] used a spatial window of size s × s to formulate a local neighbor space, and defined a new distance measure between samples; and Chen et al. [22] directly flattened each sample with its neighbors, and employed a stacked Auto-Encoder (AE) to extract spectral-spatial features. The superpixel-based methods divide the HSI into homogeneous regions (superpixels) and apply DR methods to each region separately. Studies [28,29] used PCA to extract features from each superpixel; Zhang et al. [30] re-weighted the pixels belonging to the same superpixel and evaluated sparse representation for classification; and Zhang et al. [31] employed kernel PCA on samples within a superpixel and boosted the results from multi-scale segmentation to improve performance. Compared to the pixel neighbor-based methods, superpixel-based methods follow a "divide-and-conquer" approach, offering more flexibility. In this paper, we focus on the superpixel-based method, due to its flexibility and potential for improved performance from leveraging both spatial and spectral information in HSI.
The existing superpixel-based methods, such as SuperPCA [28], S³PCA [29], and S-RAE [32], extract features from each superpixel region individually. While these methods can provide feature extractors for each superpixel region, they often neglect the relationship between samples from different superpixel regions. This can be problematic because samples from the same category may be located in different regions, leading to a loss of intra-class structure in the data. To illustrate this issue, we consider the samples from the woods category in the Indian Pines data set. Measuring the disparity of multi-dimensional data remains a challenging problem that lacks a definitive solution; thus, t-SNE [33] is applied to the samples for visualization to assess the disparity problem. In the original space (Figure 1a), we can observe that the samples from the woods category are located close to each other, indicating a high level of intra-class consistency. However, after applying SuperPCA (Figure 1b), we can see that the intra-class consistency of the data is completely destroyed. The loss of intra-class structure can have negative consequences for subsequent tasks in HSI. Therefore, there is a need to develop methods that not only capture the features that maintain the structure within individual superpixel regions, but also preserve the relationships between samples from different superpixel regions.
To solve the aforementioned problem, we propose a novel unsupervised DR method that considers the relationship between samples from different superpixels. To be more specific, the Entropy Rate Segmentation (ERS) [34] is first adopted to generate a 2D superpixel map. Then, Locally Linear Embedding (LLE) is applied to capture the underlying manifold structure of the mean vectors of the superpixels. A collaborative superpixelwise Auto-Encoder (ColAE) model is proposed to learn the compact representations, which preserve the structure of the data within each superpixel by minimizing their reconstruction error, while maintaining the learned manifold structure among superpixels by minimizing the graph loss. The representations are finally fed into a Support Vector Machine (SVM) to determine their categories. To evaluate the effectiveness of the proposed ColAE, experiments are conducted on three hyperspectral data sets. We compare our method with state-of-the-art DR techniques; the results validate that the proposed ColAE can improve the classification performance of the extracted features.
The remainder of this paper is organized as follows. Section 2 provides a review of several related works. In Section 3, the details of our proposed method are presented. The experimental setup, comparison results, result analysis, and the influence of the parameters are presented in Section 4. Finally, Section 5 concludes the paper and discusses potential future research directions.

Related Works
In this section, we briefly review entropy rate superpixel segmentation, locally linear embedding, and Autoencoder models.

Entropy Rate Superpixel Segmentation Model
In computer vision, superpixels are defined as compact regions consisting of adjacent pixels with similar characteristics, such as color, brightness, and texture. In HSI, where each pixel represents a distinct spectral signature, samples belonging to the same category also tend to exhibit spatial similarities. Consequently, existing superpixel segmentation methods can be effectively employed to partition an HSI into a collection of homogeneous regions. By considering both spectral and spatial characteristics, superpixel segmentation enables the grouping of pixels with shared properties, facilitating the extraction of meaningful features.
ERS [34] is adopted in our method due to its promising performance in HSI classification tasks [28,29], as well as its inherent capabilities in adaptive region generation and texture preservation. ERS is a graph-based method. Given a graph $G = (V, E)$ for an HSI, where the vertex set $V$ denotes the pixel set and the edge set $E$ encodes the pairwise similarities, ERS chooses a subset of edges $A \subseteq E$ so that the resulting graph $G^* = (V, A)$ contains exactly $K$ connected subgraphs. The objective function of ERS is
$$\max_{A} \; H(A) + \alpha B(A) \quad \text{s.t.} \quad A \subseteq E, \tag{1}$$
where $H(A)$ is an entropy rate term, which favors homogeneous and compact clusters, $B(A)$ is a balancing term, which encourages clusters of similar sizes, and $\alpha$ is a weight that tunes the contributions of $H(A)$ and $B(A)$. A greedy algorithm is used to solve the problem in (1).

Locally Linear Embedding Model
Researchers in machine learning have found that real-world data may not follow a Gaussian distribution but instead reside on a manifold. Locally linear embedding (LLE) [35], an algorithm insensitive to global variations and characterized by parameter flexibility, was proposed to preserve the manifold structure of the data in the low-dimensional space. Denote $n$ samples in $d$-dimensional space as $X = \{x_1, x_2, \dots, x_n\}$; LLE first finds the $K$-nearest neighbors of each sample, where $K$ is the number of nearest neighbors and $K \ll n$. LLE assumes that the samples within a small neighborhood are linearly located, and the manifold structure of the data is then captured by minimizing the reconstruction error
$$\min_{W} \sum_{i=1}^{n} \Big\| x_i - \sum_{j \in N_i} w_{ij} x_j \Big\|_2^2 \quad \text{s.t.} \quad \sum_{j \in N_i} w_{ij} = 1, \tag{2}$$
where $N_i$ stands for the $K$-nearest neighbor set of $x_i$, and $w_{ij} = 0$ if $x_j \notin N_i$. Problem (2) can be solved as a constrained least-squares problem [35].
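With the weighting matrix $W$, LLE maps each $x_i$ onto an $l$-dimensional representation $y_i$ by minimizing the following cost function:
$$\min_{Y} \sum_{i=1}^{n} \Big\| y_i - \sum_{j=1}^{n} w_{ij} y_j \Big\|_2^2. \tag{3}$$
With $Y = [y_1, y_2, \dots, y_n] \in \mathbb{R}^{l \times n}$, the problem in (3) is equivalent to
$$\min_{Y} \; \mathrm{tr}\big(Y Z Y^T\big) \quad \text{s.t.} \quad Y Y^T = I, \tag{4}$$
which can be solved by finding the $l$ eigenvectors of $Z = (I - W)^T (I - W)$ corresponding to the $l$ smallest eigenvalues. Due to the fact that the smallest eigenvalue is not stable, LLE takes the eigenvectors starting from the second smallest eigenvalue.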
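As an illustration of the two LLE steps described above, the following NumPy sketch computes the reconstruction weights of Problem (2) and the embedding from the eigenvectors of $Z = (I - W)^T (I - W)$. It is a minimal sketch rather than the implementation used in this paper: the function names `lle_weights` and `lle_embed`, the row-wise sample layout, and the small regularization constant `reg` (added so that the local Gram matrix is invertible) are assumptions made here.

```python
import numpy as np

def lle_weights(X, k, reg=1e-3):
    """Reconstruction weights of Problem (2).

    X: (n, d) array, one sample per row (layout assumed for this sketch).
    Returns W of shape (n, n); row i is non-zero only on the k nearest
    neighbors of sample i and sums to one.
    """
    n = X.shape[0]
    W = np.zeros((n, n))
    # Pairwise squared Euclidean distances; a sample is not its own neighbor.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)
    for i in range(n):
        nbrs = np.argsort(d2[i])[:k]
        D = X[nbrs] - X[i]                      # neighbors centered on x_i
        G = D @ D.T                             # local Gram matrix, (k, k)
        # Regularization is an assumption of this sketch, for stability.
        G = G + (reg * np.trace(G) + 1e-12) * np.eye(k)
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs] = w / w.sum()                # enforce sum-to-one constraint
    return W

def lle_embed(W, l):
    """Embedding from the eigenvectors of Z = (I - W)^T (I - W)."""
    n = W.shape[0]
    M = np.eye(n) - W
    Z = M.T @ M
    _, vecs = np.linalg.eigh(Z)                 # eigenvalues in ascending order
    return vecs[:, 1:l + 1].T                   # skip the smallest; shape (l, n)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(12, 5))
W_demo = lle_weights(X_demo, k=4)
Y_demo = lle_embed(W_demo, l=2)
```

In ColAE only the weight step is needed, since the weights are later reused as a constraint on the codes rather than to compute an explicit embedding.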

Auto-Encoder Model
Auto-Encoder (AE) [36] is a well-known neural network architecture used for various tasks. It consists of two main parts: an encoder and a decoder. In the context of a shallow AE, as illustrated in Figure 2a, the encoder takes an input vector $a_i \in \mathbb{R}^d$ and maps it to a lower-dimensional code $f_i \in \mathbb{R}^l$ by $f_i = f(W^{(1)} a_i + b^{(1)})$, where $f(\cdot)$ is an activation function. The decoder then reconstructs the input vector $a_i$ from the code $f_i$ by $\hat{a}_i = g(W^{(2)} f_i + b^{(2)})$, where $g(\cdot)$ is another activation function. The commonly used activation functions for the encoder and decoder are the nonlinear Tanh and Sigmoid functions. The parameters of the AE, denoted as $\Theta = \{W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}\}$, are learned during the training process. These parameters, including the weights $\{W^{(1)}, W^{(2)}\}$ and biases $\{b^{(1)}, b^{(2)}\}$, can be optimized by minimizing the reconstruction error $R(\Theta)$, defined as
$$R(\Theta) = \sum_{i=1}^{n} \big\| a_i - \hat{a}_i \big\|_2^2, \tag{5}$$
which sums up the squared differences between the input vectors $a_i$ and their reconstructions $\hat{a}_i$ over all the samples. To perform the optimization, the Backpropagation algorithm (BP) and stochastic gradient descent are commonly used.
Once the AE is trained, the encoder has learned to map the input vector $a_i \in \mathbb{R}^d$ to a new, lower-dimensional representation $f_i \in \mathbb{R}^l$, where $l$ is typically chosen to be smaller than $d$.
The shallow AE, with only one encoder layer and one decoder layer, has a limited capacity to learn complex and high-level representations. To overcome this limitation, deep AEs are proposed, which have multiple encoder and decoder layers. Increasing the number of layers in the AE architecture enhances its learning ability, and allows for the extraction of more intricate features. A deep AE example is presented in Figure 2b. In a deep AE model, the input passes through a series of hidden layers in the encoder, where each layer applies a non-linear transformation to capture the different levels of relevant features. The final hidden layer produces the encoded representation $f_i$. The encoder can be expressed mathematically as
$$h_i^{(k)} = f^{(k)}\big(W^{(k)} h_i^{(k-1)} + b^{(k)}\big), \quad k = 1, \dots, m, \qquad h_i^{(0)} = a_i, \quad f_i = h_i^{(m)}, \tag{6}$$
where $m$ represents the depth of the encoder and $f^{(k)}$ is the activation function of the $k$-th layer. By adding more layers, the deep AE can learn increasingly complex representations of the input data, helping to capture intricate patterns and structures. This results in improved generalization capabilities and potential efficiency gains compared to shallow AEs. The additional layers allow for a more hierarchical and abstract representation of the data, enabling the model to discover more meaningful and discriminative features. In a deep AE, the decoder takes the code $f_i$ and passes it through a series of hidden layers. The final output layer of the decoder produces the reconstructed data $\hat{a}_i$. The parameters $\Theta$ in a deep AE include the weights $\{W^{(1)}, W^{(2)}, \dots, W^{(2m)}\}$ and biases $\{b^{(1)}, b^{(2)}, \dots, b^{(2m)}\}$.
To optimize the parameters $\Theta$, the aim is to minimize the reconstruction error $R(\Theta)$ defined in Equation (5). There are two methods to optimize $\Theta$ through $R(\Theta)$. The first method trains $m$ shallow AEs individually and then stacks them together to form a deep AE [36]. Each shallow AE is trained layer by layer, where the output of one layer is used as the input for the next layer. This approach is also known as the Stacked AE (SAE). By pretraining the shallow AEs and fine-tuning the entire deep AE, this method allows for the gradual learning of increasingly complex representations. The second method is to initialize the parameters $\Theta$ and then use BP and stochastic gradient descent to iteratively optimize the parameters. This method is known as end-to-end training. In the early stages of deep learning, training deep AEs using this method was challenging because gradients could not propagate effectively to the bottom layer. However, with the development of more effective initialization strategies, such as He initialization [37] and Xavier initialization [38], this issue has been largely mitigated, and it is now possible to directly train deep networks.
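To make the encoder recursion and the reconstruction error of Equation (5) concrete, the following NumPy sketch builds a deep AE with a mirrored decoder, initializes it with Xavier (uniform) initialization, and evaluates the forward pass. It is illustrative only: the layer sizes, the use of tanh in every hidden layer with a linear output layer, and the function names are assumptions made here, not the configuration reported in this paper.

```python
import numpy as np

def xavier_uniform(rng, fan_in, fan_out):
    """Xavier/Glorot uniform initialization for a (fan_out, fan_in) weight."""
    bound = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-bound, bound, size=(fan_out, fan_in))

def init_autoencoder(rng, sizes):
    """sizes = [d, d_1, ..., l] for the encoder; the decoder mirrors it.

    Returns a list of 2m (W, b) pairs, m for the encoder and m for the decoder.
    """
    sizes = list(sizes)
    full = sizes + sizes[-2::-1]                # e.g. [d, d_1, l, d_1, d]
    return [(xavier_uniform(rng, a, b), np.zeros(b))
            for a, b in zip(full[:-1], full[1:])]

def forward(params, A):
    """A: (n, d) inputs, one sample per row. Returns (code, reconstruction)."""
    m = len(params) // 2
    h, code = A, None
    for k, (W, b) in enumerate(params, start=1):
        z = h @ W.T + b
        # tanh in hidden layers, linear output layer (assumption of this sketch)
        h = z if k == 2 * m else np.tanh(z)
        if k == m:
            code = h                            # output of the last encoder layer
    return code, h

def reconstruction_error(A, A_hat):
    """Sum of squared differences over all samples, as in Equation (5)."""
    return float(((A - A_hat) ** 2).sum())

rng = np.random.default_rng(0)
params_demo = init_autoencoder(rng, [200, 64, 30])
A_demo = rng.uniform(size=(10, 200))
code_demo, A_hat_demo = forward(params_demo, A_demo)
err_demo = reconstruction_error(A_demo, A_hat_demo)
```

Training would then minimize `reconstruction_error` with respect to all `(W, b)` pairs by backpropagation, which an automatic-differentiation framework handles in practice.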

Collaborative Superpixelwise Auto-Encoder
In this section, we present the details of ColAE, a method designed for extracting spectral-spatial features for HSI. ColAE consists of two key steps: superpixel segmentation and collaborative AE learning, as depicted in Figure 3. During the superpixel segmentation step, the ERS-based superpixel method is employed to partition the HSI into homogeneous regions. This division creates compact and meaningful regions by grouping pixels with similar characteristics. In the collaborative learning step, LLE is first adopted to learn the underlying manifold structure among samples from different superpixels. This allows us to capture the global structure of the HSI. Next, AE models are applied to each superpixel independently. These models seek representations that minimize the local reconstruction error for samples within the same superpixel, while simultaneously minimizing the manifold reconstruction error for samples from different superpixels. In our proposed ColAE approach, the AE models exchange information among different superpixel regions, leveraging a collaborative learning approach. This encourages similar samples from different superpixels to have similar representations in the code space. In this way, ColAE can alleviate intra-class disparities. Figure 1c provides empirical evidence supporting this claim.
Figure 3 (caption, partially recovered): the loss within AEs denotes the first term in Equation (11), which sums up the reconstruction loss within each individual superpixel; the loss between AEs denotes the second term in Equation (11), which maintains the manifold structure between superpixels.
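In this paper, HSI data are denoted by $X \in \mathbb{R}^{B \times W \times H}$, where $B$, $W$, and $H$ represent the number of spectral bands, width, and height, respectively. To process the 3D data $X$, we flatten it into a 2D form, denoted as $X2 = [x_1, x_2, \dots, x_N] \in \mathbb{R}^{B \times N}$, where $N = W \times H$ and $x_i = [x_{i1}, x_{i2}, \dots, x_{iB}]^T$ represents a pixel in the HSI.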

Superpixel Segmentation
Traditional spectral-spatial methods often use fixed-size spatial windows to incorporate spectral and spatial information. However, these methods do not fully explore the spatial information available in the image. Superpixel segmentation, on the other hand, offers a more effective way to divide the image into homogeneous regions based on appearance information, thereby considering spatial structures more effectively. This is why we have chosen to employ superpixel segmentation in our proposed work.
The Entropy Rate Segmentation (ERS) algorithm is capable of efficiently segmenting grayscale (1 channel) or color (3 channels) images into superpixel regions. However, the HSI typically consists of hundreds of spectral bands. To address this, we first reduce the dimensionality of the HSI data to one channel using PCA before applying ERS.
PCA is applied to the 2D form $X2$ of the HSI $X$. The covariance matrix of the data can be calculated using the formula $C = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^T$, where $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$ is the mean vector of all samples. The eigenvector $v_1$ corresponding to the largest eigenvalue of $C$ forms the projection matrix $V = [v_1]$ for the grayscale image. Next, the one-dimensional data $Y2$ is obtained by performing the transformation $Y2 = V^T X2$. Finally, $Y2$ is reshaped into a grayscale image, on which ERS is performed to obtain the superpixel segmentation.
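The PCA step that prepares the input of ERS can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: the function name `first_pc_image` is hypothetical, the projection is applied to the mean-centered data, and the result is rescaled to [0, 1] for use as a grayscale image; the segmentation itself is left to an external ERS implementation.

```python
import numpy as np

def first_pc_image(X):
    """Project an HSI cube onto its first principal component.

    X: (B, W, H) array. Returns a (W, H) image rescaled to [0, 1].
    """
    B, W, H = X.shape
    X2 = X.reshape(B, W * H).astype(float)      # 2D form, one pixel per column
    mu = X2.mean(axis=1, keepdims=True)         # mean vector of all samples
    Xc = X2 - mu
    C = (Xc @ Xc.T) / X2.shape[1]               # (B, B) covariance matrix
    _, vecs = np.linalg.eigh(C)                 # eigenvalues in ascending order
    v1 = vecs[:, -1]                            # eigenvector of largest eigenvalue
    y = v1 @ Xc                                 # centered projection (assumption)
    # Rescaling to [0, 1] is an assumption of this sketch.
    y = (y - y.min()) / (y.max() - y.min() + 1e-12)
    return y.reshape(W, H)

rng = np.random.default_rng(0)
cube_demo = rng.uniform(size=(20, 6, 7))
gray_demo = first_pc_image(cube_demo)
```

Because $C$ is only $B \times B$, the eigendecomposition is cheap even for images with hundreds of thousands of pixels.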

Collaborative AEs
After performing superpixel segmentation on the HSI, the resulting 2D representation can be expressed as $X2 = \{X_1, X_2, \dots, X_J\}$, where $X_i = \{x_1^i, x_2^i, \dots, x_{N_i}^i\}$ represents the samples in the $i$-th superpixel, and $N_i$ indicates the number of samples in that particular superpixel.
To capture the underlying manifold structure of the data, samples from different superpixels are used. Then, an AE model is proposed to preserve this manifold structure among superpixels while simultaneously minimizing the reconstruction error within each superpixel. By jointly considering the manifold structure and the within-superpixel reconstruction, our proposed ColAE allows for the efficient extraction of spectral-spatial features while ensuring the preservation of important relationships between superpixels.

Learning the Manifold Structure among Superpixels
In order to preserve the relations among samples from different superpixels, it is crucial to define and obtain such relations. The manifold structure is commonly employed to model the underlying geometric structure of high-dimensional data, which aligns with our requirements. In our method, we adopt LLE, a classical and efficient manifold learning technique, to capture the manifold structure.
Samples within the same superpixel exhibit similarity; hence, the manifold structure is measured using only the mean vectors of each superpixel. The mean vector is calculated by
$$\mu_i = \frac{1}{N_i} \sum_{j=1}^{N_i} x_j^i. \tag{7}$$
With the mean vectors $\{\mu_1, \mu_2, \dots, \mu_J\}$, where $J$ is the number of superpixels, the weighting matrix $W$ can be obtained by minimizing the reconstruction error in Equation (2). Denoting the representations in the $i$-th superpixel in code space as $Y_i = \{y_1^i, y_2^i, \dots, y_{N_i}^i\}$, the manifold loss over the current code is
$$L(Y) = \sum_{i=1}^{J} \Big\| \tilde{\mu}_i - \sum_{j=1}^{J} w_{ij} \tilde{\mu}_j \Big\|_2^2, \tag{8}$$
where $\tilde{\mu}_i = \frac{1}{N_i} \sum_{j=1}^{N_i} y_j^i$ represents the mean vectors in the code space. The lower the value of $L(Y)$, the better the code preserves the manifold structure.
It should be noted that $K$, the number of nearest neighbors in the LLE algorithm, needs to be predefined when calculating the weighting matrix $W$, and it is commonly chosen such that $K \ll J$.
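The manifold loss over the superpixel means in code space can be evaluated with a few lines of NumPy, as sketched below. The sketch assumes that the loss takes the LLE reconstruction form over the code-space means, that codes are stored one per row, and that every superpixel label in `[0, J)` occurs at least once; the function names are hypothetical.

```python
import numpy as np

def superpixel_means(V, labels, J):
    """Mean vector of each superpixel.

    V: (N, p) array, one vector per row; labels: (N,) ints in [0, J).
    Every superpixel must be non-empty. Returns a (J, p) array.
    """
    sums = np.zeros((J, V.shape[1]))
    np.add.at(sums, labels, V)                  # accumulate rows per superpixel
    counts = np.bincount(labels, minlength=J).astype(float)
    return sums / counts[:, None]

def manifold_loss(Y, labels, W):
    """Squared error of reconstructing each code mean from the others.

    Y: (N, l) codes; W: (J, J) LLE weights learned on the input-space means.
    """
    M = superpixel_means(Y, labels, W.shape[0])
    R = M - W @ M                               # row i: mean_i - sum_j w_ij mean_j
    return float((R ** 2).sum())

labels_demo = np.array([0, 0, 1, 1])
W_demo = np.array([[0.0, 1.0], [1.0, 0.0]])
Y_demo = np.array([[0.0], [0.0], [2.0], [2.0]])
loss_demo = manifold_loss(Y_demo, labels_demo, W_demo)   # means 0 and 2 -> 8.0
```

In the toy example each superpixel mean is "reconstructed" by the other one, so the residuals are -2 and 2 and the loss is 8; identical codes in all superpixels give a loss of zero.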

AE Model with Manifold Constraints
Following previous works [28,29,32], we employ a single AE for each superpixel. In this way, multiple AEs are used to efficiently capture the local structure within each superpixel, and low-dimensional representations of a given HSI can be obtained. The loss function for the $i$-th AE is defined as
$$R_i(\Theta_i) = \sum_{j=1}^{N_i} \big\| x_j^i - \hat{x}_j^i \big\|_2^2, \tag{9}$$
where $\hat{x}_j^i$ is the output of the $i$-th deep AE for the $j$-th sample $x_j^i$ in the $i$-th superpixel. The parameters in this AE are denoted as $\Theta_i = \{W_i^{(1)}, \dots, W_i^{(2m)}, b_i^{(1)}, \dots, b_i^{(2m)}\}$, where $m$ is the number of layers in the encoder. The reconstruction error for all the samples can be expressed as
$$R(\Theta) = \sum_{i=1}^{J} R_i(\Theta_i), \tag{10}$$
where $\Theta = \{\Theta_1, \Theta_2, \dots, \Theta_J\}$ represents the parameters for all AEs.
To preserve the relations among superpixels, the manifold loss in Equation (8) can be added to Equation (10). This results in the following loss function:
$$\min_{\Theta} \; R(\Theta) + \eta L(Y). \tag{11}$$
In Equation (11), the first term preserves the structure within each superpixel, while the second term maintains the structure between superpixels. The parameter $\eta$ balances the two terms. By incorporating the two terms in Equation (11), the proposed ColAE ensures that each AE preserves the structure of the data within its assigned superpixel, while exchanging information between superpixels by considering the manifold structure. In this way, the AEs from each superpixel are collaboratively learned. It should be noted that, in Equation (11), the first term relates to all the parameters in $\Theta$, while the second term only relates to the parameters in the encoder part.
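The remark that the manifold term only involves the encoder can be made explicit with a short derivation. The following is a sketch under the assumption that the manifold loss has the LLE reconstruction form over the code-space means; the symbols $\tilde{\mu}_i$ (mean code of superpixel $i$), $\tilde{M}$, and $z_i$ are notation introduced here for illustration.

```latex
% Stack the code-space means as columns: \tilde{M} = [\tilde{\mu}_1, \dots, \tilde{\mu}_J] \in \mathbb{R}^{l \times J}.
\begin{align}
L(Y) &= \sum_{i=1}^{J} \Big\| \tilde{\mu}_i - \sum_{j=1}^{J} w_{ij} \tilde{\mu}_j \Big\|_2^2
      = \big\| \tilde{M} (I - W)^T \big\|_F^2
      = \mathrm{tr}\big( \tilde{M} Z \tilde{M}^T \big),
      \qquad Z = (I - W)^T (I - W), \\
\frac{\partial L}{\partial \tilde{M}} &= 2 \tilde{M} Z
      \qquad \text{(since $Z$ is symmetric)}, \\
\frac{\partial L}{\partial y_j^i} &= \frac{1}{N_i} \cdot 2 \tilde{M} z_i
      \qquad \text{(since $\tilde{\mu}_i = \tfrac{1}{N_i} \sum_{j} y_j^i$, with $z_i$ the $i$-th column of $Z$)}.
\end{align}
```

The gradient reaches the parameters only through the codes $y_j^i$, i.e., through the encoder layers; every sample in superpixel $i$ receives the same manifold gradient, scaled by $1/N_i$, and the column $z_i$ couples superpixel $i$ to its LLE neighbors, which is how information is exchanged between the AEs.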
To find the parameters that best fit the data, we first initialize each AE using the Xavier method [38]. Then, we backpropagate the gradient of $\Theta$ to each layer according to Equation (10). Since the number of samples in a superpixel is not large, we feed all the samples of each superpixel at once to calculate the loss. After several hundred iterations, the value of $R(\Theta)$ converges to a small value.

Computational Analysis of ColAE
The procedure of the proposed ColAE is outlined in Algorithm 1. The time complexity of the proposed ColAE can be analyzed as follows. The superpixel segmentation step has a time complexity of $O(\max(B^3, B^2 N) + N \log N)$. The manifold structure modeling procedure has a time complexity of $O(KJ)$, and the calculation of the loss in Equation (10) has a time complexity of $O(NBd_1)$, where $d_1$ is the dimensionality of the first hidden representation $h^{(1)}$. The gradient descent method used to optimize the parameters $\Theta$ has a time complexity of $O(TNBd_1)$, where $T$ is the number of iterations. In HSI, the number of bands $B$ is typically much smaller than the number of samples $N$. Additionally, $K$ and $J$ are also much smaller than $N$. Therefore, the overall time complexity of ColAE is $O(TNBd_1)$.

Algorithm 1 Procedures of ColAE.
Input: An HSI $X \in \mathbb{R}^{B \times W \times H}$, the number of superpixels $J$, the number of nearest neighbors $K$ in LLE, the balancing weight $\eta$, the dimensionality $L$ of the code, the number of iterations $T$.
Output: The code $Y \in \mathbb{R}^{L \times W \times H}$.
1: Reshape $X$ into its 2D form $X2 \in \mathbb{R}^{B \times N}$. Use PCA to reduce the dimensionality of $X2$ to 1, and reshape it into the image with three channels;
2: Apply the ERS algorithm to segment the image into $J$ non-overlapping regions;
3: Use Equation (7) to compute the mean vector $\mu_i$ for each superpixel. Then, calculate the weights for each mean vector according to Equation (2);
4: Use Xavier initialization to initialize the parameters $\Theta^{(0)}$;
5: for $t = 0$ to $T - 1$ do
6:   Calculate the loss at $\Theta^{(t)}$ by Equation (11);
7:   Calculate the gradient $g^{(t)}$ using an existing optimizer, and update the parameters by $\Theta^{(t+1)} = \Theta^{(t)} - \alpha g^{(t)}$;
8: end for
9: Compute the code using $\Theta^{(T)}$, then reshape the code into $Y \in \mathbb{R}^{L \times W \times H}$.
10: return $Y$.

Experimental Results
In this section, to validate the performance of the proposed ColAE, we carry out extensive experiments on several HSIs in comparison with state-of-the-art methods.

Data Sets
Three HSI data sets are used to evaluate the ColAE in our experiments, which are Indian Pines, the University of Pavia, and Salinas. The details of each data set are as follows.
To test the proposed method, three metrics are used to evaluate the performance of different dimension reduction methods, which are overall accuracy (OA), average accuracy (AA), and kappa. The HSIs are used in their original form without any further preprocessing. We apply the DR algorithms to the HSI, then feed their outputs into an SVM to determine the categories of the samples. The RBF kernel is used to boost the performance of the SVM for non-linearly distributed data, and the parameters of the RBF are determined by a grid search, as was performed in [28]. Our experiments are conducted on a Windows 10 64-bit platform, with an Intel Core i5-12400F CPU (2.5 GHz) and 32 GB memory. The proposed approaches are implemented mainly using Python 3.6, PyTorch 1.8.0, Scikit-learn 1.2.1 (Sklearn), and Shogun (https://github.com/shogun-toolbox/shogun, accessed on 8 December 2020), which is a well-known machine learning toolbox that provides interfaces for Matlab, R, Python, and so on. With this feature, Shogun offers a convenient way to implement various machine learning algorithms.
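For reference, the three metrics can be computed from a confusion matrix as in the following NumPy sketch. It uses the standard definitions of OA, AA (mean of the per-class accuracies), and Cohen's kappa; the function name `classification_scores` is hypothetical, labels are assumed to be integers in `[0, n_classes)`, and every class is assumed to occur at least once in the ground truth.

```python
import numpy as np

def classification_scores(y_true, y_pred, n_classes):
    """Return (OA, AA, kappa) for integer labels in [0, n_classes)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    cm = np.zeros((n_classes, n_classes), dtype=float)
    np.add.at(cm, (y_true, y_pred), 1.0)        # rows: truth, columns: prediction
    total = cm.sum()
    oa = np.trace(cm) / total                   # overall accuracy
    aa = (np.diag(cm) / cm.sum(axis=1)).mean()  # mean per-class accuracy
    # Chance agreement from the row and column marginals.
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa

oa_demo, aa_demo, kappa_demo = classification_scores(
    [0, 0, 0, 1], [0, 0, 1, 1], n_classes=2)   # OA 0.75, AA 5/6, kappa 0.5
```

OA weights every test sample equally, whereas AA weights every class equally, so a method that favors large classes can rank higher on OA and kappa than on AA.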
To test the proposed method, the 10 random splits in [28] (https://github.com/junjun-jiang/SuperPCA/tree/master/datasets, accessed on 31 October 2020) are used for training and testing. For each class in the three data sets, T = 3, 5, 7, 10, 15, 20 samples are selected to train the SVM, and the rest of the samples are used as testing sets, where T denotes the number of training samples. For the classes that possess too few samples, such as Grass-pasture-mowed and Oats in Indian Pines, we select at most half of the total samples in the class. The PCA, KPCA, and SVM are implemented with the Sklearn library. KPCA utilizes the RBF kernel, and its best parameter is determined through a grid search based on the reconstruction error of the pre-image [44]. Moreover, LPP is implemented using the Shogun library, and the optimal number of nearest neighbors (K) and the τ for the heat kernel are also determined using a grid search. The implementations of CAE and ContrastNet are available online (https://github.com/jjwwczy/ContrastNet-Unsupervised-Feature-Learning-by-Autoencoder-and-Prototypical-Contrastive-Learning, accessed on 8 March 2034). In our experiments, the architectures of AE and ColAE remain consistent, and are listed in Table 2. For Equation (6), tanh is used as the activation function when m = 1, 4, and the linear function is used when m = 2 and m = 3. Furthermore, Xavier initialization is employed to initialize the parameters of both AE and ColAE. The ERS implementation is also available online (https://github.com/mingyuliutw/EntropyRateSuperpixel, accessed on 19 August 2015). The SuperPCA, SuperNPE, SuperLPP, and SuperAE are applied based on the superpixel results obtained from ERS, according to their definitions as mentioned.
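The experiments above reuse the fixed splits of [28]; the sampling rule itself (T training samples per class, capped at half of a small class) can be sketched as follows. This is an illustrative sketch only: the function name `split_per_class` is hypothetical, and the caller is assumed to pass the labels of labeled pixels only.

```python
import numpy as np

def split_per_class(labels, t, rng):
    """Draw up to t training samples per class, at most half of each class.

    labels: (N,) class labels of the labeled pixels. Returns sorted index
    arrays (train, test) that partition range(N).
    """
    labels = np.asarray(labels)
    train = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        n_train = min(t, len(idx) // 2)         # cap for classes with few samples
        train.extend(rng.choice(idx, size=n_train, replace=False).tolist())
    train = np.sort(np.array(train, dtype=int))
    test = np.setdiff1d(np.arange(len(labels)), train)
    return train, test

rng = np.random.default_rng(0)
labels_demo = np.array([0] * 100 + [1] * 6)
train_demo, test_demo = split_per_class(labels_demo, t=5, rng=rng)
# class 0 contributes 5 samples, class 1 only 3 (half of 6)
```

Because the DR models are unsupervised, they are fitted on all pixels before this split; only the SVM sees the training labels.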

Comparisons with Other Algorithms
Table 3 presents the performance of the features acquired by 13 methods on the three data sets with different numbers of training samples when L = 30, where L is the dimensionality of the low-dimensional representation. The best classification results in each setting are highlighted in bold. It is worth noting that KPCA consumes too much memory to be executed on the University of Pavia and Salinas data sets. From the results in Table 3, several observations can be made as follows.
1. In nearly all tested scenarios, the proposed ColAE surpasses the other approaches. It is important to note that, in the Indian Pines data set, SuperPCA exhibits better average accuracy (AA) results than ColAE. However, when evaluated based on overall accuracy (OA) and kappa, ColAE outperforms SuperPCA. Upon further analysis of the classification outcomes, we observe that ColAE consistently exhibits superior performance on categories with larger sample sizes, while its performance diminishes on categories with fewer samples, as illustrated in Tables 4-6. This phenomenon arises mainly because the proposed ColAE utilizes LLE to model the manifold structure between superpixels. LLE employs the concept of K-nearest neighbors, where K is often set to a value much smaller than the total number of samples, to capture the local structure of the data. However, categories with only a few samples tend to be confined within a limited number of superpixels. Consequently, when modeling the manifold structure, LLE might incorrectly associate these small categories with others, leading to a lower classification accuracy for categories with a small sample size. In cases where a category has sufficient samples, these samples are located in a set of superpixels whose number typically surpasses the value of K. Consequently, the inherent structure can be effectively modeled and preserved by ColAE. In this way, the disparity problem is alleviated, leading to a higher classification accuracy.
2. ColAE consistently outperforms SuperAE, which indicates that the proposed regularization term in Equation (11) can efficiently address the class disparity problem caused by the superpixel-based method. To validate our findings, we randomly select one split from each data set and map the classification results onto the corresponding images, as shown in Figures 4-6. A comparison between Figure 4n,o reveals that ColAE improves the accuracy mainly by correctly classifying the large regions of Soybean-min-till, indicated in pink. Remarkably, based on the superpixel segmentation, it is observed that SuperAE misclassifies samples belonging to the Soybean-min-till class within a superpixel into Soybean-not-till (indicated in blue). In contrast, ColAE successfully minimizes the misclassification rate within the same region, highlighting the efficiency of the proposed graph-regularization term in Equation (11).
3. The performance of the features obtained solely from the spectral domain is significantly inferior to that of the features obtained from the spectral-spatial domain, substantiating the importance of incorporating information from the spatial domain for classification purposes. Both SuperAE and ColAE outperform ContrastNet and CAE, despite the fact that the network architectures of ContrastNet and CAE are more complex than those of SuperAE and ColAE. This outcome validates the superiority of superpixel-based methods. Additionally, the superpixel-based method consumes far fewer computational resources. Because the unsupervised methods process all the data with the DR models and only then split the data into training and testing sets, KPCA consumes 207,400 × 207,400 × 4 ≈ 160 GB memory for the University of Pavia and 111,104 × 111,104 × 4 ≈ 46 GB memory for Salinas, with single-precision floating point, when constructing the kernel matrix. SuperKPCA consumes significantly less memory compared to traditional KPCA, further emphasizing the flexibility of superpixel-based approaches.
4. SuperPCA demonstrates surprisingly strong performance across all settings, which indicates that the underlying data structure within a superpixel is relatively simple. That finding justifies our use of an AE with only two layers in both the encoder and decoder. The superior performance of both SuperAE and ColAE, compared to SuperPCA, further emphasizes the enhanced generalization ability. Additionally, it is worth noting that SuperKPCA and SuperLPP do not consistently outperform SuperPCA. We attribute this to the fact that the grid-search strategy employed in parameter tuning requires the inclusion of the best parameters within the search space. However, as the data distribution varies from one superpixel to another, it is challenging to accurately tune the parameters of SuperKPCA and SuperLPP to achieve optimal performance.
5. The performance of PCA on the three data sets is observed to be comparable to that of the raw features. The proportion of retained principal components in PCA is 99.25% for Indian Pines, 99.96% for the University of Pavia, and 99.99% for Salinas. These results indicate that PCA can remove the components without valuable information, resulting in little accuracy loss. LPP and KPCA outperform PCA due to the inherent complexity of the underlying data structure in the HSIs. LPP and KPCA can preserve the nonlinear structure of the data, thus yielding improved classification performance. It is interesting that AE performs slightly worse than the raw features and PCA. This can be attributed to the limited capacity of a two-layer encoder with only a single nonlinear function to capture the intricate data structure. Utilizing neural networks with more complex architectures can improve the accuracy of the AE. It is important to highlight that we intentionally maintained a uniform architecture across AE, SuperAE, and ColAE, aiming to discern the influence of the superpixel-based technique and the introduced regularization term specified in Equation (11). Consequently, we do not 
design a distinct structure for AE within our experimental setup.
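The memory figures quoted for KPCA in the third observation can be checked with a few lines of arithmetic. The sketch below is only illustrative: the function name `kernel_matrix_gib` is hypothetical, and it assumes that "GB" in the text means 2^30 bytes and that the dense N × N kernel matrix is the only large allocation.

```python
def kernel_matrix_gib(n_samples: int, bytes_per_entry: int = 4) -> float:
    """Memory of a dense n x n kernel matrix, in units of 2**30 bytes.

    bytes_per_entry = 4 corresponds to single-precision floats.
    """
    return n_samples * n_samples * bytes_per_entry / 2**30


# Pixel counts follow from the image sizes reported in the paper.
n_pavia = 610 * 340    # 207,400 samples
n_salinas = 512 * 217  # 111,104 samples

pavia_gib = kernel_matrix_gib(n_pavia)
salinas_gib = kernel_matrix_gib(n_salinas)

print(f"University of Pavia: {pavia_gib:.1f} GiB")
print(f"Salinas: {salinas_gib:.1f} GiB")
```

Under these assumptions the two values round to the 160 GB and 46 GB stated above. A superpixelwise kernel matrix only needs one block per superpixel, which is why its footprint is far smaller.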

Parameter Analyses
In the proposed ColAE, several parameters need to be predefined: the number of superpixels J, the number of nearest neighbors K in the LLE, the balancing weight η, and the dimensionality L of the code. J is intertwined with K, since K is usually far smaller than J. To tie the two parameters together, we introduce a ratio R that connects K and J as K = ⌊J × R⌉, where ⌊·⌉ denotes the round operator, ensuring that K is an integer. In this way, K is directly proportional to the number of superpixels J, with the factor determined by R. The weight η is also influenced by J and K, since it depends on the number of samples within a superpixel, which in turn affects the loss values of the terms in Equation (11). Therefore, our analysis starts with a discussion of L, then examines K, J, and η by considering their interconnected relationship.
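As a small illustration of the relation between K, J, and R, the following sketch tabulates K for the values of J and R that are examined in the parameter study of this paper. The helper name `neighbors_from_ratio` is hypothetical, and rounding halves upward is an assumption, since the text only states that a round operator is used.

```python
import math


def neighbors_from_ratio(num_superpixels: int, ratio: float) -> int:
    """Return K = round(J * R).

    Halves are rounded up here; the tie-breaking rule is an assumption.
    """
    return int(math.floor(num_superpixels * ratio + 0.5))


J_VALUES = [20, 50, 70, 100, 120, 150]  # numbers of superpixels examined
R_VALUES = [0.1, 0.2, 0.3, 0.4]         # ratios examined

for j in J_VALUES:
    ks = [neighbors_from_ratio(j, r) for r in R_VALUES]
    print(f"J = {j:3d} -> K = {ks}")
```

Because K is derived from J, only J, R, and η need to be searched, and K stays smaller than J for every R below 1.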

The Effect of the Dimensionality of the Code
In our experiments, we set J = 100 for Indian Pines and Salinas, and J = 20 for the University of Pavia. We use a fixed ratio R = 0.2 to determine the value of K, and we choose η = 0.75. To investigate the effect of the dimensionality of the code L, we vary L from 5 to 50 with an interval of 5, and examine the resulting overall classification accuracies with T = 20 for the SVM. For the Indian Pines data set, the highest overall accuracy (OA) of 89.98% is achieved when L = 45, whereas the lowest OA of 45.34% is observed at L = 5. Similar trends are found for the University of Pavia data set, where the OA ranges from 84.01% to 95.30%, and for the Salinas data set, where the OA varies between 86.09% and 98.14%. These results are illustrated in Figure 7.
It is evident from the figure that when L = 5, the OAs are low across all three data sets. This is expected, since a small number of features cannot carry sufficient discriminative information for effective classification. As L increases, the OAs steadily improve. Beyond a relatively large value of L, the growth of the OAs becomes slow, indicating that the available discriminative information is already well utilized. A larger L would increase the complexity and computational requirements of the classifier without yielding significant performance gains. Based on these observations, we choose L = 30 for all the subsequent experiments.

A notable observation is that the number of superpixels J emerges as the primary factor influencing the performance of ColAE. On Indian Pines, the classification accuracy of ColAE initially increases and then decreases as J grows. This may be attributed to the rich texture information present in this data set. Too few superpixels can cause samples from different classes to be merged together, while an excessive number of superpixels may leave too few samples in a superpixel, consequently limiting the learning capability of the AEs within the ColAE framework. Conversely, for the University of Pavia and Salinas data sets, the classification accuracy declines if J is set too large. This trend could be attributed to the samples being clustered together in these data sets, where a small number of superpixels is sufficient for effective segmentation. Furthermore, ColAE is robust to the number of nearest neighbors K and the balance weight η, making it adaptable to other data sets.

Execution Time
In this work, all the experiments are conducted on a desktop computer, and the implemented codes use the CPU for execution. The running times of nine DR methods on the three data sets are presented in Table 7. Compared with the training time, projecting the samples onto the low-dimensional space demands minimal computational time once the model has been trained, so we only list the training time in this section. The implementations of CAE and ContrastNet use the GPU to accelerate the training process; to ensure fairness in comparing the computational times across different methods, the running times of CAE and ContrastNet are not included. The number of samples to be processed is 145 × 145 = 21,025 for Indian Pines, 610 × 340 = 207,400 for the University of Pavia, and 512 × 217 = 111,104 for Salinas.

As indicated in Table 7, PCA exhibits the lowest computational time due to its parameter-free nature. On the other hand, KPCA and SuperKPCA consume the most time, since the parameter τ needs to be tuned and both methods construct a dense kernel matrix of size N × N. The grid search strategy employed for parameter tuning further increases their computational burden. LPP and SuperLPP also use the grid search for parameter tuning, but they only construct a sparse matrix with K × K entries, significantly reducing the computational burden. The proposed ColAE requires a computational time similar to that of SuperAE, although ColAE involves an additional step of constructing a manifold graph matrix; the size of this graph matrix is relatively small, being J × J. It should be noted that, while all samples within a superpixel are fed into the optimizer at once in SuperAE and ColAE, the batch size for AE is set to 256. Hence, the computational time of AE is longer than that of SuperAE and ColAE. Furthermore, the computational time of AE, SuperAE, and ColAE can be greatly reduced when a GPU is employed for parallel computation.
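One way to read the remark about batch sizes is in terms of optimizer steps per pass over the data. The sketch below counts these steps for the plain AE with a batch size of 256 and contrasts them with one full-batch step per superpixel. It is a rough illustration only: the function names are hypothetical, and the assumptions that each superpixel forms exactly one batch and that the step count dominates the running time are not stated in the paper.

```python
import math

BATCH_SIZE = 256  # mini-batch size reported for the plain AE

# Pixel counts follow from the image sizes reported in the paper.
NUM_SAMPLES = {
    "Indian Pines": 145 * 145,
    "University of Pavia": 610 * 340,
    "Salinas": 512 * 217,
}


def minibatch_steps(n_samples: int, batch_size: int = BATCH_SIZE) -> int:
    """Optimizer steps per epoch when the data are split into mini-batches."""
    return math.ceil(n_samples / batch_size)


def superpixel_steps(num_superpixels: int) -> int:
    """Steps per epoch if every superpixel is one full batch (assumption)."""
    return num_superpixels


for name, n in NUM_SAMPLES.items():
    print(f"{name}: {minibatch_steps(n)} mini-batch steps per epoch")
```

With J on the order of tens to a hundred or so superpixels, the superpixelwise models would take far fewer (but larger) steps per pass than the mini-batch AE under this reading.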

Conclusions
In this paper, we have found that existing superpixel-based DR methods may disrupt the intra-class structure of the data. To solve this problem, an unsupervised spectral-spatial DR method called ColAE is proposed. In ColAE, the HSI is first segmented into superpixels, then an LLE graph is constructed to model the similarities between the mean vectors of the superpixels. A set of AEs is applied to the samples within each superpixel, with the LLE graph employed to reduce the intra-class disparity of the representations in the code space. Experimental results on three HSI data sets validate the effectiveness of the proposed ColAE in addressing the challenges of superpixel-based DR methods.
It should be noted that ColAE can be extended to a multiscale superpixel version, which is expected to yield higher classification accuracy. Additionally, exploring other manifold learning-based graphs to model the relationship between superpixels will be a focal point of future research.
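The two-term objective summarized above can be sketched in a few lines. This is not the formulation of Equation (11), which is not reproduced in this section; it is a minimal illustration under loudly stated assumptions: squared-error reconstruction, an LLE-style penalty applied to the mean code of each superpixel, LLE weights supplied as a precomputed J × J matrix, and η multiplying the second term. All function names are hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]
Matrix = Sequence[Vector]


def squared_norm(v: Vector) -> float:
    """Squared Euclidean norm of a vector."""
    return sum(x * x for x in v)


def reconstruction_loss(inputs: Matrix, outputs: Matrix) -> float:
    """Sum of squared reconstruction errors over the samples of one superpixel."""
    return sum(
        squared_norm([a - b for a, b in zip(x, y)])
        for x, y in zip(inputs, outputs)
    )


def manifold_loss(mean_codes: Matrix, weights: Matrix) -> float:
    """LLE-style term: each mean code should equal the weighted sum of the others."""
    total = 0.0
    dim = len(mean_codes[0])
    for j, code in enumerate(mean_codes):
        recon = [
            sum(weights[j][k] * mean_codes[k][d] for k in range(len(mean_codes)))
            for d in range(dim)
        ]
        total += squared_norm([c - r for c, r in zip(code, recon)])
    return total


def colae_style_loss(pairs, mean_codes: Matrix, weights: Matrix, eta: float = 0.75) -> float:
    """Loss within AEs plus eta times loss between AEs (placement of eta is assumed)."""
    within = sum(reconstruction_loss(x, y) for x, y in pairs)
    between = manifold_loss(mean_codes, weights)
    return within + eta * between


# Toy example: one superpixel with one sample, and two superpixel mean codes.
pairs = [([[1.0, 2.0]], [[1.0, 1.0]])]
mean_codes = [[0.0], [2.0]]
weights = [[0.0, 1.0], [1.0, 0.0]]
loss = colae_style_loss(pairs, mean_codes, weights)
print(loss)
```

In an actual implementation the codes would come from the encoders and the loss would be minimized with an automatic differentiation framework; the pure-Python version here only makes the structure of the two terms explicit.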

Figure 1. Visualization of representations in Woods of the Indian Pines data set. (a) shows the samples in the original space; (b) shows the representations obtained by SuperPCA; (c) shows the representations obtained by our proposed ColAE.

Figure 2. Illustration of AE. (a) shows a shallow AE, and (b) presents a deep AE.

Figure 3. The stages in ColAE. "Loss within AEs" is the first term in Equation (11), which sums up the reconstruction loss within each individual superpixel. "Loss between AEs" denotes the second term in Equation (11), which maintains the manifold structure between superpixels.

Figure 7. The OAs vs. L in the Indian Pines, University of Pavia, and Salinas data sets.

Figure 8.

The Effects of the Number of Superpixels, Number of Nearest Neighbors, and Balance Weight
We set the dimensionality of the code L to 30, and varied the balance weight η within the range of [0.5, 0.75, 1, 1.25], as well as the number of superpixels J from the set [20, 50, 70, 100, 120, 150]. Additionally, we examined the ratio R between J and K, considering values from the set [0.1, 0.2, 0.3, 0.4], to evaluate the performance of the ColAE on three data sets. Across

(1) Indian Pines. The Indian Pines data set was collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over an agricultural area in Indiana, USA. It consists of 145 × 145 pixels and 224 spectral bands, covering a wide range of wavelengths from 400 to 2500 nm. In this paper, 24 bands covering the region of water absorption are removed, and a total of 200 bands are used. The data set contains 16 different classes, including various crops, bare soil, and human-made structures. Approximately 10,249 samples with labels are from the ground-truth map.

(2) University of Pavia. The University of Pavia data set was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor over an agricultural area in Pavia, Italy. It consists of 610 × 340 pixels and 115 spectral bands, covering wavelengths from 430 to 860 nm. A total of 12 noisy and water bands are removed, and a total of 103 bands are preserved. The data set contains nine different classes, including various crops, bare soil, and meadows. Approximately 42,776 samples with labels are from the ground-truth map.

(3) Salinas. The Salinas data set was collected by the AVIRIS sensor over an agricultural area in Salinas Valley, California, USA. It consists of 512 × 217 pixels and 224 spectral bands, covering wavelengths from 400 to 2500 nm. A total of 20 noisy and water bands are removed, and 204 bands are used in our experiments. The data set contains 16 different classes, including various crops, bare soil, and human-made structures. A total of 53,129 labeled samples are used in our experiments.

Table 1 lists the number of samples per class for the three data sets. All these data sets are available from the Internet (https://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes, accessed on 12 July 2021).

Table 1. Number of samples in the Indian Pines, University of Pavia, and Salinas images.

Table 2. The architecture of AE in the experiments. The shape is defined in PyTorch style, where −1 means batch size in the shape array.

Table 3. Classification performance of the 13 methods on Indian Pines, University of Pavia, and Salinas images. T.N.s/C denotes the number of training samples from each class.

Table 4. Classification results for each class in Indian Pines when 15 training samples are used.

Table 5. Classification results for each class in the University of Pavia when 20 training samples are used.

Table 6. Classification results for each class in Salinas when 20 training samples are used.

Table 7. Training time (in seconds) of nine DR methods on three HSI data sets.