Article

Scalable Context-Preserving Model-Aware Deep Clustering for Hyperspectral Images

1. Department of Telecommunications and Information Processing, Ghent University, Sint-Pietersnieuwstraat 41, 9000 Ghent, Belgium
2. Royal Institute for Cultural Heritage (KIK-IRPA), Jubelpark 1, 1000 Brussels, Belgium
3. School of Computer Science, China University of Geosciences, Wuhan 430074, China
4. ETRO Department, Vrije Universiteit Brussel (VUB), Pleinlaan 2, 1050 Brussels, Belgium
5. IMEC, Kapeldreef 75, 3001 Leuven, Belgium
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(24), 4030; https://doi.org/10.3390/rs17244030
Submission received: 9 October 2025 / Revised: 5 December 2025 / Accepted: 12 December 2025 / Published: 14 December 2025

Highlights

What are the main findings?
  • A mini-cluster-based optimization scheme is proposed to preserve the non-local structure of hyperspectral image (HSI) data.
  • A one-stage, end-to-end deep clustering network is designed to learn subspace bases under the joint guidance of local and non-local structures.
What are the implications of the main findings?
  • The mini-cluster optimization scheme adaptively models non-local similarity with higher efficiency than manifold-based methods relying on fixed neighbor settings.
  • The end-to-end framework enables local and non-local structures to jointly supervise and optimize the entire clustering process, overcoming the limitations of previous two-stage deep subspace methods.

Abstract

Subspace clustering has become widely adopted for the unsupervised analysis of hyperspectral images (HSIs). Recent model-aware deep subspace clustering methods often use a two-stage framework, first computing a self-representation matrix with $O(n^2)$ complexity and then applying spectral clustering. However, these methods are computationally intensive, generally incorporate only local or non-local structure constraints, and their structural constraints fall short of effectively supervising the entire clustering process. We propose a scalable, context-preserving deep clustering method based on basis representation, which jointly captures local and non-local structures for efficient HSI clustering. To preserve local structure (i.e., spatial continuity within subspaces), we introduce a spatial smoothness constraint that aligns clustering predictions with their spatially filtered versions. For non-local structure (i.e., spectral similarity), we employ a mini-cluster-based scheme that refines predictions at the group level, encouraging spectrally similar pixels to belong to the same subspace. These two constraints are jointly optimized to reinforce each other. Moreover, our model is designed as a one-stage approach, in which the structural constraints supervise the entire clustering process. The time and space complexity of our method are $O(n)$, making it applicable to large-scale HSI data. Experiments on real-world datasets show that our method outperforms state-of-the-art techniques.

1. Introduction

Hyperspectral images (HSIs) record the precise electromagnetic spectrum of the objects in a scene in hundreds of spectral bands, thereby enabling discrimination between objects that are indistinguishable in conventional Red–Green–Blue (RGB) images. As a result, HSIs have been widely applied in fields such as agriculture [1], environmental monitoring [2], and defense and security [3]. Clustering, which categorizes image pixels into different classes without labeled data, plays a crucial role in interpreting HSI data. However, HSI clustering remains challenging due to noise and high spectral variability [4]. Subspace clustering [5,6], which models high-dimensional data as lying in a union of low-dimensional subspaces and treats data points within the same subspace as one class, has shown strong performance in this context and has gained significant attention in recent years. Representative subspace clustering methods can be categorized into model-based subspace clustering [5,7,8,9] and model-aware deep subspace clustering [10,11,12]. The model-based subspace clustering methods include sparse subspace clustering (SSC) [5], low-rank representation (LRR) [7], and the joint-sparsity-based sparse subspace clustering (JSSC) method [13]. These approaches build on the self-representation property, which assumes that each data point in a subspace $\mathcal{S}_i$ can be expressed as a linear combination of other points within the same subspace, subject to sparsity or low-rank constraints. The resulting representation matrix effectively reveals the affinities between data points and is thus used to construct a similarity matrix, which is further fed to spectral clustering to obtain the clustering result. However, these methods are limited by their matrix-decomposition-based shallow representations, making it difficult to cluster HSIs, which are often nonlinearly separable in practice.
To solve this problem, model-aware deep subspace clustering methods leverage the feature extraction capacity of deep neural networks to extract discriminative features and account for nonlinear interactions. Representative methods include the deep subspace clustering network (DSCNet) [10], a generic deep subspace clustering model (DSC) [11], a self-supervised variant with adaptive initialization (SDSC-AI) [14], a Laplacian-regularized model for hyperspectral images (LRDSC) [15], and a pseudo-supervised extension (PSSC) [12]. Typically, autoencoders are used to project the input data onto a latent feature space, and a fully connected layer is then inserted between the encoder and decoder to approximate the self-representation model. Li et al. [14] use clustering results as pseudo labels to train the feature extraction network, enhancing feature discriminability for clustering tasks. They also initialize the self-representation layer with a k-nearest neighbor (KNN) graph to reduce dictionary redundancy, leading to significant performance improvements. In [16], features from undercomplete and overcomplete autoencoders are fused for subspace clustering, achieving outstanding performance without pre-training. Chen et al. [17] propose leveraging self-attention within the autoencoder to capture long-range dependencies, yielding better results than DSCNet. Benefiting from the improved feature representation in the latent space, model-aware deep subspace clustering methods handle data with complex structures more effectively than the aforementioned model-based subspace clustering methods. By optimizing the parameters of the fully connected layer, the self-representation matrix can be obtained for the construction of the similarity matrix. However, these methods still suffer from the following issues. First, they are computationally expensive: the self-representation matrix is of size $n \times n$ (where n is the number of HSI pixels), leading to a training complexity of $O(n^2)$ that makes large-scale clustering impractical, and the spatial constraints employed further increase the computational burden. Second, the features extracted may not be optimal for clustering, because feature extraction and spectral clustering are performed separately, which risks degrading overall performance. Third, these methods struggle to capture the intrinsic cluster structure of HSIs, as they focus on either local or non-local dependencies rather than fully exploiting the spatial relationships present in the data.
In this paper, we propose a scalable context-preserving deep subspace clustering (SCDSC) method, which performs feature extraction and clustering within a unified framework. In contrast to conventional self-representation-based clustering methods that require optimizing a large self-representation matrix, we follow the approach proposed in [18] and instead learn compact subspace bases, i.e., class-specific subspace dictionaries with far fewer parameters in the latent feature space. We then obtain the clustering soft assignment directly by projecting the latent representation onto the subspace bases. The resulting model has low computational complexity, supporting scalability and efficient processing of large HSIs.
To capture both local and non-local dependencies in hyperspectral data, we introduce two structural constraints. The local structure constraint enhances spatial homogeneity in the clustering results and improves robustness to noise and spectral variability. We achieve this using a spatial-wise mean filter to smooth the clustering results. The non-local structure constraint promotes consistency among spectrally similar data points, regardless of their spatial distance, by grouping them into mini-clusters and encouraging shared cluster assignments within each group. In contrast to existing local spatial constraints like total variation, which exhibit high computational complexity, our method is computationally efficient. The improved local homogeneity can be propagated to non-local data points through the non-local constraint, and conversely, the non-local constraint can also enhance local homogeneity, creating a mutually reinforcing relationship. To the best of our knowledge, this is the first attempt to develop an end-to-end, scalable deep subspace clustering method for HSIs. Experimental results on four benchmark datasets show that our method consistently outperforms several state-of-the-art methods, both model-based and deep learning-based. A preliminary version of this work was presented in [19], where we applied spatial filtering to embed spatial continuity into the soft assignment optimization and used contrastive learning in the feature space to preserve the non-local structure. Compared with that preliminary work, we develop a novel approach to modeling non-local similarities by means of a mini-cluster grouping. Moreover, we provide a more detailed presentation, a deeper analysis of the overall approach, and critical discussions. We also present a more extensive experimental study.
The remainder of this paper is organized as follows: Section 2 provides a comprehensive analysis of model-based subspace clustering and deep clustering methods, including purely data-driven and model-aware approaches. Section 3 describes our main contribution, a context-preserving deep subspace clustering method. Section 4 evaluates it on four real-world hyperspectral datasets. Finally, Section 5 concludes the paper.

2. Related Work

In this section, we introduce the key concepts and models that form the foundation of our proposed method. Then, we briefly review the existing approaches for HSI clustering.

2.1. Agglomerative Hierarchical Clustering

In our work, we use agglomerative hierarchical clustering to generate mini-clusters, as detailed in Section 3. Agglomerative hierarchical clustering groups data points by gradually merging clusters: it starts with each data point as its own cluster and, at each iteration, merges clusters according to a defined rule called a linkage criterion. As a variant of this approach, the first integer neighbor clustering hierarchy (FINCH) algorithm [20] uses the first neighbor of each sample to identify neighbor chains and uncover groups within the data. This method achieves high clustering performance on complex data with a complexity of $O(n \log n)$, where n is the number of data points. In every iteration of FINCH, the first nearest neighbor adjacency matrix is created as follows:
$$
A(i, j) = \begin{cases} 1, & \text{if } j = \kappa_i^1 \ \text{or} \ \kappa_j^1 = i \ \text{or} \ \kappa_i^1 = \kappa_j^1, \\ 0, & \text{otherwise,} \end{cases}
$$
where $\kappa_i^1$ denotes the first nearest neighbor of point i. During the merging process, data points connected by the same neighbor chain are grouped, and a new data point is generated as the mean of these points. The generated data points are merged in the same way in subsequent iterations. In the initial iterations, the relationships between data points are simpler, allowing the algorithm to merge data points accurately. Compared to methods that require a predefined number of neighbors, FINCH is parameter-free and allows clusters of different sizes, resulting in a more flexible and adaptive capture of non-local data structures.
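For illustration, one round of this first-neighbor grouping can be sketched in plain Python. This is a naive sketch that uses a brute-force $O(n^2)$ neighbor search for clarity (FINCH itself reaches $O(n \log n)$ with approximate nearest-neighbor search); the function name and data layout are our own assumptions, not the reference implementation:

```python
import math

def first_neighbor_partition(points):
    """One FINCH-style round: link each point to its first nearest neighbor
    and return mini-cluster labels as connected components of that graph."""
    n = len(points)
    # first nearest neighbor kappa_i of each point (brute force)
    kappa = [min((j for j in range(n) if j != i),
                 key=lambda j, i=i: math.dist(points[i], points[j]))
             for i in range(n)]
    # union-find over the links i ~ kappa_i; merging these edges covers all
    # three conditions of the adjacency rule (j = kappa_i, kappa_j = i,
    # and the shared-neighbor case kappa_i = kappa_j)
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(n):
        parent[find(i)] = find(kappa[i])
    roots = [find(i) for i in range(n)]
    relabel = {r: c for c, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]
```

In a full FINCH run, each resulting mini-cluster would then be replaced by its mean and the procedure repeated on the reduced set of points.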

2.2. Model-Based Subspace Clustering

Subspace clustering has become a major approach in analyzing high-dimensional data because it efficiently identifies meaningful low-dimensional structures within high-dimensional data. Classical model-based subspace clustering methods such as SSC [5] and LRR [7] optimize self-representation-based models to reveal the affinities between different data points, as shown in Figure 1.
However, they rely solely on spectral representations to identify relationships between HSI pixels, making the models sensitive to noise. Therefore, many extensions attempt to incorporate local and non-local information to improve robustness against noise.
Zhang et al. [21] propose a spectral–spatial sparse subspace clustering method (S4C) that incorporates spatial information by applying a 2D mean filter to the representation matrix. Huang et al. [22] impose a joint constraint on local neighborhoods obtained via super-pixel segmentation to reduce feature variability within clusters. Other methods, including a locally constrained collaborative representation-based Fisher’s LDA (LCR-FLDA) [23], incorporate non-local structure by imposing Laplacian constraints. An alternative to traditional self-representation-based models is dictionary-based approaches, which are generally more computationally efficient. Representative methods include a sketch-based subspace clustering method with total variation (Sketch-TV) [24], a dictionary learning method with adaptive regularization (IDLSC) [25], a total variation regularized collaborative representation clustering method with a locally adaptive dictionary (TV-CRC-LAD) [26], and a structural prior-guided subspace clustering method (SPGSC) [27]. Although these methods achieve notable improvements over traditional subspace clustering approaches for HSI, their underlying assumptions still pose significant limitations. Specifically, they rely on the premise that each data point can be represented as a linear combination of points from a single linear subspace or a given dictionary. This linearity assumption does not hold when the data lie on nonlinear manifolds whose intrinsic geometry cannot be approximated using relatively few global linear subspaces, leading to degraded performance.

2.3. Purely Data-Driven Deep Clustering Methods

Data-driven deep clustering approaches focus on learning from the inherent structure and distribution of data and can be broadly categorized into two types. The first type involves feature learning followed by traditional clustering, where neural networks are used for feature extraction, after which conventional clustering algorithms are applied. For example, the deep embedding network for clustering (DEN) [28] employs an autoencoder with local similarity and sparsity constraints, followed by K-means clustering on the extracted features. Similarly, deep subspace clustering with sparsity prior (PARTY) [29] trains an autoencoder with structure-prior regularization and then applies subspace clustering. The deep spectral clustering method SpectralNet [30] first trains a neural network to learn an embedding that approximates the eigenvectors of the graph Laplacian and then applies K-means clustering in the learned embedding space. In a similar manner, the manifold-based deep clustering method N2D [31] begins with autoencoder-based feature learning, then performs manifold embedding, and finally applies a traditional clustering algorithm to obtain cluster labels. The second type is the joint optimization of feature learning and clustering, where clustering objectives are integrated into the neural network's loss function, allowing for simultaneous clustering and feature learning during network training. For instance, the unsupervised deep embedding method for clustering (DEC) [32] learns a latent representation with an autoencoder and refines cluster assignments using a soft assignment based on the Student's t-distribution together with a Kullback–Leibler (KL) divergence clustering objective. Building on this, Nalepa et al. [33] employ a three-dimensional (3D) convolutional network to better capture spectral structure, further improving performance.
Another method, a deep semantic clustering method based on partition confidence maximization (PICA) [34], maximizes the confidence level of each data point being assigned to a cluster, thereby enhancing clustering results. While purely data-driven deep clustering methods are flexible and can adapt to data with complex structures or noise, they often require large datasets, lack interpretability, and are prone to overfitting.

2.4. Model-Aware Deep Subspace Clustering

Model-aware deep learning, which integrates mathematical modeling with deep neural networks to harness the advantages of both domains, has been widely adopted in various image inverse problems, including image denoising [35], image reconstruction [36], and compressed sensing [37]. In remote sensing, this paradigm has been extensively explored through deep unfolding and plug-and-play (PnP) frameworks. For example, deep unfolding has been applied to tasks such as satellite image super-resolution [38] and pan-sharpening [39], where iterative optimization algorithms are unrolled into deep networks, offering both interpretability and improved performance. In contrast, PnP strategies embed learned priors into traditional optimization pipelines, as demonstrated in hyperspectral unmixing [40]. Similarly, model-aware deep learning techniques have shown success in subspace clustering. A pioneering work is DSCNet [10], which employs a deep autoencoder to map data nonlinearly into a latent space and then applies a fully connected layer on the latent representation to mimic the self-representation model.
To improve robustness to noise, many extensions have been proposed that incorporate non-local structure during model optimization. For example, Zeng et al. [15] employ a Laplacian regularizer on the self-representation matrix to directly impose non-local structure preservation. In [41], the Laplacian regularizer is applied to the self-representation matrix within a residual network. In [42], the hypergraph-structured autoencoder (HyperAE) imposes a hypergraph regularizer on the latent representation, thereby maintaining the non-local structure in the self-representation matrix. In summary, although these methods perform well, they have several limitations. They require significant computational resources due to the large matrices involved; their two-stage design prevents integrated structure preservation, and they do not fully capture the spatial dependencies present in the data.

2.5. Summary and Discussion

In Table 1, we summarize the above-described categorization of HSI clustering approaches, with representatives of model-based and deep learning-based methods, along with their main advantages and limitations. While these methods have achieved remarkable progress in hyperspectral image clustering, they typically incur high computational costs, rely on either linear assumptions or large amounts of data, and often lack the structural priors required to capture the nonlinear spectral–spatial patterns of hyperspectral images.
The emerging basis-representation-based subspace clustering method [18] attempts to learn the basis of subspaces to obtain accurate assignments of data points. This method builds on the property that all vectors within a subspace can be expressed as linear combinations of that subspace’s basis vectors, as illustrated in Figure 2. It achieves comparable performance to self-representation-based methods in image and text-level tasks with linear complexity. However, this method shows limited performance in HSI clustering, as image- and text-level clustering tasks treat each data point as an entire image or document, without considering spatial relationships, rendering spatial context irrelevant. In contrast, HSI clustering involves data points corresponding to individual pixels or image patches, where spatial continuity and non-local structure are critical for accurate clustering.
Our contribution is motivated by the extension of this basis-representation model to hyperspectral data through the integration of structural constraints that explicitly preserve both spatial continuity and non-local structure.

3. Proposed Method

In this section, we present our main contribution, that is, a scalable and context-preserving deep subspace clustering method. To address the high computational complexity of traditional self-representation-based clustering approaches, we propose a novel strategy that avoids learning a self-representation matrix. Instead, our method directly learns a compact subspace basis, effectively integrating both local and non-local structural information inherent to HSI data. In this section, we first formally define the problem tackled, then we detail the structure constraints that are central to our approach, and finally we introduce our end-to-end training strategy. The overall framework of our proposed method is depicted in Figure 3.

3.1. Problem Formulation

Let $\mathcal{X} = \{X_1, X_2, \ldots, X_n\}$ denote a hyperspectral image under investigation, divided into n overlapping patches, where each patch $X_i \in \mathbb{R}^{a \times a \times b}$ is centered on a pixel and represents a 3D region of the hyperspectral image, extending over a small $a \times a$ spatial neighborhood and all b spectral bands. The goal of HSI clustering is to group these patches into k distinct classes in an unsupervised manner. In this study, k is assumed to be known in advance, which enables consistent comparison with the ground-truth labels. Note that in the subsequent discussion, the term “pixel” refers to the center pixel of each patch, and both share the same class assignment. This overlapping-patch strategy is widely adopted in hyperspectral clustering [42,44], as it enables spatial context modeling for each target pixel. Despite the induced redundancy, it provides consistent context cues that improve learning robustness.
Traditional methods often rely on a two-stage framework involving an $n \times n$ self-representation matrix, which results in $O(n^2)$ computational complexity. Moreover, the self-representation matrix calculation and spectral clustering are performed separately, with structure preservation applied only during the first stage, thereby limiting its impact on the overall clustering process.
In this work, we propose a scalable context-preserving deep subspace clustering method for HSIs. Specifically, we aim to learn the bases of different subspaces to cluster the HSI patches. Based on the learned subspace bases, we compute the affinity matrix $S \in \mathbb{R}^{n \times k}$, where each entry $s_{i,j}$ denotes the soft assignment of the ith pixel to the jth subspace. Let $H = [h_1, \ldots, h_n] \in \mathbb{R}^{n \times d}$ be the embedded representation of all pixels with dimension d produced by the encoder of a deep autoencoder, and let $D = \{D_1, \ldots, D_k\}$ be the set of learned subspace bases, with each $D_j \in \mathbb{R}^{d \times r}$ representing the basis of the jth subspace (with r basis vectors). The soft assignment between pixel i and subspace j is then computed as
$$
s_{i,j} = \frac{\|h_i D_j\|_2^2 + \theta r}{\sum_{j'=1}^{k} \left( \|h_i D_{j'}\|_2^2 + \theta r \right)},
$$
where $\theta$ is a smoothing constant, set to 5 following [18]. Collecting all assignments yields the vector $s_i = [s_{i,1}, \ldots, s_{i,k}]$, the ith row of $S$. Each pixel is finally assigned to the subspace with the largest soft assignment, i.e., $\arg\max_j s_{i,j}$.
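As a minimal sketch, the soft assignment in (2) can be computed in plain Python as follows; we assume the squared-norm form of the projection score, and all names here are illustrative rather than taken from the authors' code:

```python
def soft_assignment(h, bases, theta=5.0):
    """Soft assignment of one latent vector h (length d) to k subspaces.

    `bases` is a list of k matrices, each given as d rows of r entries
    (the r basis vectors of one subspace as columns). Sketch of Eq. (2),
    assuming the squared-norm projection score."""
    d, r = len(h), len(bases[0][0])
    scores = []
    for D in bases:
        # project h onto the subspace basis: h D is a length-r vector
        proj = [sum(h[a] * D[a][c] for a in range(d)) for c in range(r)]
        # squared projection norm plus the smoothing term theta * r
        scores.append(sum(v * v for v in proj) + theta * r)
    total = sum(scores)
    return [s / total for s in scores]
```

For example, a latent vector aligned with the basis of one subspace receives the largest (but, due to smoothing, not degenerate) assignment to that subspace.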
To achieve optimal clustering performance on HSI data, the learned bases must satisfy the following criteria: (1) each subspace should have a distinct basis; (2) the bases should exhibit strong discriminative power to distinguish data points from different subspaces; (3) the bases should capture the intrinsic geometry (that is, spectral properties) of the data for the non-local structure to be preserved during clustering; and (4) the bases should leverage the local structure (that is, spatial continuity) to enhance robustness during feature-basis alignment. Similar to model-aware deep subspace clustering methods, basis learning is formulated within the latent space of an autoencoder.
Based on these requirements, we formulate the optimization problem. To make the optimization targets explicit, let $\Theta$ denote the trainable parameters of the deep autoencoder. Since the latent representation matrix $H$ and the reconstructions $\hat{\mathcal{X}} = \{\hat{X}_i\}_{i=1}^{n}$ are outputs of the network, they are computed from the input patches $\mathcal{X} = \{X_i\}_{i=1}^{n}$ by the autoencoder parameterized by $\Theta$. Consequently, we formulate the joint optimization problem with respect to the subspace bases $D$ and the network parameters $\Theta$ as follows:
$$
\min_{D, \Theta} \; \frac{1}{2n} \sum_{i=1}^{n} \|X_i - \hat{X}_i\|_F^2 + \beta\, \phi(D) + \beta_1\, \eta(HD) + \beta_2\, \Psi(HD),
$$
which consists of a data fidelity term and three regularization terms: $\phi(D)$, promoting the discriminative power of the learned bases; $\eta(HD)$, preserving non-local (spectral) similarities; and $\Psi(HD)$, enforcing local (spatial) consistency within the clusters. $\hat{X}_i$ is the reconstructed version of $X_i$, obtained from the corresponding latent code $h_i$ by the decoder of the same autoencoder; this constrains the latent representation $H$ to retain essential spatial–spectral information, and the improved latent representation in turn leads to better estimation of the subspace bases $D$ during training. The basis dissimilarity term $\phi(D)$ ensures that the learned subspace bases are distinct, enabling the model to distinguish data points from different subspaces. The non-local structure preservation term $\eta(HD)$ encourages data points to have predictions similar to those of their nearest neighbors in the spectral space. The local structure preservation term $\Psi(HD)$ maintains spatial dependencies, ensuring that data points share similar affinities to the subspace bases as their spatial neighbors. These constraints are discussed in detail in the following sections. The positive constants $\beta$, $\beta_1$, and $\beta_2$ balance the contributions of the different terms of the objective function, and their values are discussed in the experimental section (see Section 4).
In contrast to self-representation-based methods, our model does not require maintaining a self-representation matrix of size $n \times n$. Instead, it uses a basis-representation matrix of size $rk \times n$, reducing the computational complexity from $O(n^2)$ to $O(krn)$. Here, the number of clusters k and the number of basis vectors per cluster r are typically small constants independent of the sample size n. Since $kr \ll n$ for large-scale HSI data, the effective complexity simplifies to $O(n)$. Moreover, our model follows a one-stage approach, where both local and non-local structure constraints jointly optimize the entire clustering process end to end, thereby offering stronger guidance for model optimization.

3.2. Basis Dissimilarity Constraint

In subspace clustering, the bases of each subspace must be distinct, and orthogonality is often employed to reinforce this distinction, which has been shown to be beneficial for clustering performance in prior work [18]. Additionally, the bases are kept on the same scale to ensure more consistent and effective evaluation. This reduces overlap between subspaces, enhances the discriminative power of the bases, and ultimately leads to more accurate and robust clustering results. To ensure these properties are maintained during basis learning, we adopt a basis dissimilarity constraint ϕ ( D ) , similar to that in [18], as described below:
$$
\phi(D) = \|(D^\top D) \odot O\|_F^2 + \|(D^\top D) \odot I - I\|_F^2,
$$
where $\odot$ represents the Hadamard product, $O \in \mathbb{R}^{kr \times kr}$ is a mask whose $r \times r$ diagonal blocks are 0 and whose other entries are 1, and $I$ is the identity matrix of appropriate dimensions. The first term drives bases of different subspaces toward mutual orthogonality, while the second term normalizes each basis vector to unit norm. In other words, $\phi(D)$ helps $D$ behave like a block-wise orthogonal basis set, enhancing subspace separation and stability.
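A minimal sketch of this constraint in plain Python, assuming the bases are stacked into a single $d \times kr$ matrix (stored as a list of rows) with the r columns of each subspace contiguous; function and variable names are our own:

```python
def basis_dissimilarity(D, k, r):
    """Sketch of the phi(D) constraint: G = D^T D; the off-diagonal-block
    entries of G (mask O) are penalized toward 0, and the diagonal of G
    is pulled toward 1 (unit-norm basis vectors)."""
    d = len(D)       # latent dimension (rows of D)
    kr = k * r       # total number of basis vectors (columns of D)
    # Gram matrix G = D^T D, of size kr x kr
    G = [[sum(D[a][i] * D[a][j] for a in range(d)) for j in range(kr)]
         for i in range(kr)]
    # ||G ⊙ O||_F^2: entries whose row and column fall in different r-blocks
    off_block = sum(G[i][j] ** 2 for i in range(kr) for j in range(kr)
                    if i // r != j // r)
    # ||G ⊙ I − I||_F^2: deviation of the diagonal from 1
    diag = sum((G[i][i] - 1.0) ** 2 for i in range(kr))
    return off_block + diag
```

For a basis matrix whose subspace blocks are unit-norm and mutually orthogonal, both terms vanish and the penalty is zero.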

3.3. Non-Local Structure Preservation

Image pixels with similar spectral responses are likely to belong to the same land cover class, regardless of their spatial location. Making use of these non-local spectral similarities helps to preserve the correct non-local structure in a clustering map. Existing works often apply a Laplacian matrix to maintain this structure in the self-representation matrix, which is computationally expensive and limits scalability for large datasets.
To address this issue, we propose a mini-cluster updating scheme that aligns the subspace bases with the non-local structure of the data while efficiently preserving it in the final clustering map. Specifically, the original data points are first grouped into mini-clusters using the FINCH algorithm (see Section 2.1 for details). Let $C = \{C_1, C_2, C_3, \ldots, C_l\}$ represent the set of l mini-clusters generated by FINCH, where each mini-cluster comprises a group of neighboring data points. Formally, each mini-cluster $C_p$ contains the indices of the data points belonging to it.
We expect data points in each mini-cluster $C_p$ to align more strongly with a common basis $D_j$ than with any other basis $D_e$, as shown below:
$$
\|h_q D_j\|_2^2 \geq \max_{e \neq j} \|h_q D_e\|_2^2 \quad \text{for all } q \in C_p,
$$
where $h_q \in \mathbb{R}^d$ represents the latent representation of the qth data point. According to the definition of soft assignment in (2), the subspace basis affinity in (5) can be mapped to soft assignments, resulting in similar assignments for all data points within the same mini-cluster; that is, for a given mini-cluster $C_p$,
$$
s_q \approx s_o \quad \text{for all } q, o \in C_p.
$$
To preserve the underlying structure, we optimize the soft assignment at the mini-cluster level to encourage data points within the same mini-cluster to share the same assignment. During this processing, the soft assignments of data points within each mini-cluster are extracted using their mini-cluster index through fancy indexing, a vectorized technique that enables efficient, loop-free index-based value extraction on GPUs. The soft assignments of the mini-clusters are represented as $M = [m_1, m_2, \ldots, m_l] \in \mathbb{R}^{l \times k}$. The soft assignment of the pth mini-cluster, $m_p \in \mathbb{R}^k$, is obtained by averaging the assignments of the data points it contains as follows:
$$
m_p = \frac{1}{|C_p|} \sum_{q \in C_p} s_q.
$$
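The index-based averaging described above can be sketched with NumPy using a length-n mini-cluster index vector, where `np.add.at` performs the loop-free scatter-add; names and data layout are our own assumptions:

```python
import numpy as np

def minicluster_means(S, idx, l):
    """Average soft assignments per mini-cluster.

    S   : (n, k) array of per-pixel soft assignments.
    idx : length-n integer vector, idx[i] = mini-cluster of pixel i.
    l   : number of mini-clusters.
    Returns the (l, k) matrix M of mini-cluster soft assignments."""
    k = S.shape[1]
    M = np.zeros((l, k))
    np.add.at(M, idx, S)                 # scatter-add each row into its mini-cluster
    counts = np.bincount(idx, minlength=l).astype(float)
    return M / counts[:, None]           # divide by |C_p| to obtain the mean
```

The same index vector can be reused to broadcast the mini-cluster targets back to individual pixels, keeping the whole step linear in n.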
Since the soft assignment matrix $S$ depends on the feature embedding $H$ and the subspace bases $D$, this target matrix $M$ is implicitly a function of $(H, D)$. To further sharpen the distribution of these assignments, we adopt a refined soft assignment $\tilde{M} \in \mathbb{R}^{l \times k}$ whose entries are defined as
$$
\tilde{m}_{p,j} = \frac{m_{p,j}^2 / \sum_{p'} m_{p',j}}{\sum_{j'} \left( m_{p,j'}^2 / \sum_{p'} m_{p',j'} \right)},
$$
where $m_{p,j}$ represents the soft assignment of the pth mini-cluster to class j and $\tilde{m}_{p,j}$ is its refined soft assignment. This refinement improves cluster purity by emphasizing high-confidence predictions and mitigating distortions caused by large clusters [32]. By aligning the initial mini-cluster predictions with their refined versions, the quality of the soft assignments is enhanced, which in turn strengthens the discriminative power of the subspace bases. Based on the above analysis, we define the non-local structure preservation constraint $\eta(HD)$ as follows:
η ( H D )   =   KL ( M ˜ M )   =   p j m ˜ p , j   log m ˜ p , j m p , j ,
where $\tilde{M}$ represents the refined soft assignment matrix of the mini-clusters. During training, the refinement increasingly emphasizes class distinctions, causing the refined mini-cluster soft assignment $\tilde{m}_p$ to converge towards a state where one class dominates, with its value approaching 1 while the values for the other classes approach 0. As $\eta(H, D)$ decreases over time, the mini-cluster is assigned more confidently to a single class.
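For illustration, the refinement and the resulting KL objective can be sketched in NumPy as follows. This is a minimal sketch with our own function names; in practice the same quantities would be computed inside the deep learning framework so that gradients flow back to $H$ and $D$.

```python
import numpy as np

def refine(M):
    """Refined targets: square each assignment, normalize by the soft
    cluster frequency (sum over mini-clusters), then renormalize rows."""
    W = M ** 2 / M.sum(axis=0, keepdims=True)
    return W / W.sum(axis=1, keepdims=True)

def nonlocal_loss(M):
    """eta(H, D) = KL(M_tilde || M), summed over mini-clusters and classes."""
    M_tilde = refine(M)
    return float(np.sum(M_tilde * np.log(M_tilde / M)))
```

Squaring sharpens confident entries, while dividing by the column sums counteracts the pull of large clusters, so minimizing the KL term drives each mini-cluster toward a single dominant class.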
In parallel, the soft assignment of each data point within a mini-cluster follows the same optimization trajectory as the mini-cluster itself, since
$\dfrac{\partial \eta(H, D)}{\partial s_q} = \dfrac{\partial \eta(H, D)}{\partial m_p} \cdot \dfrac{\partial m_p}{\partial s_q} = \dfrac{1}{|C_p|} \dfrac{\partial \eta(H, D)}{\partial m_p}, \quad \text{for all } q \in C_p.$
This means that as the mini-cluster moves toward a particular class, all the data points within it also shift toward the same class assignment. As training progresses, this shared optimization direction causes all points in a mini-cluster to converge to the same class assignment. The mini-cluster updating scheme enhances the model's performance in two ways. First, it optimizes the mini-cluster soft assignments, which are more representative and robust than instance-level assignments because of the averaging. Second, it promotes consistency among data points within the same mini-cluster, ensuring they share the same prediction and preserving the non-local structure during optimization. Unlike previous methods that rely on a Laplacian matrix with complexity $O(n^2)$ to preserve the non-local structure, our approach is significantly more efficient: by leveraging a mini-cluster index vector of size $n$ and fancy indexing with complexity $O(n)$, an operation that maps well onto Graphics Processing Unit (GPU) execution, we preserve the non-local structure efficiently. Moreover, our method integrates non-local structure preservation throughout the entire clustering process, rather than limiting it to a single stage. This provides stronger guidance and improves the overall clustering performance.

3.4. Local Structure Preservation

In real-world scenarios, neighboring areas often belong to the same land cover class, a property known as spatial continuity. This relationship is a common phenomenon in many types of data, including HSI, where adjacent pixels are likely to belong to the same class. To leverage this property, our model integrates spatial neighborhood information into the basis learning process. This not only improves noise robustness when measuring the alignment between feature vectors and basis vectors but also ensures that the clustering results maintain the intrinsic spatial continuity of HSI. Specifically, as with the non-local structure preservation constraint, we directly apply spatial filtering to the soft assignments of pixels, as illustrated in Figure 4.
As defined in the figure, $S \in \mathbb{R}^{w \times h \times k}$ represents a 3D tensor derived from the 2D soft assignment matrix, where $w$ and $h$ are the width and height of the HSI. Each $S_{x,y} \in \mathbb{R}^{k}$ denotes the soft assignment vector for the pixel at spatial location $(x, y)$. By incorporating spatial information, this tensor facilitates a spatially aware prediction of cluster membership. A spatial filtering operation is subsequently applied to each layer of $S$, leveraging relationships among neighboring pixels to produce a smoothed 3D tensor $\tilde{S}$, expressed as
$\tilde{S}_{x,y} = \dfrac{\sum_{(u,v) \in W_{x,y}} t_{u,v} \cdot S_{u,v}}{\sum_{(u,v) \in W_{x,y}} t_{u,v}},$
where $\tilde{S}_{x,y}$ is the smoothed soft assignment vector at spatial location $(x, y)$, with $x$ and $y$ denoting the row and column indices, respectively, and $W_{x,y}$ is a fixed smoothing window centered at $(x, y)$. The mask $T \in \mathbb{R}^{w \times h}$ assigns a value of 1 to pixels within the cluster region and 0 otherwise, where $t_{x,y}$ denotes the value at the corresponding location in $T$. We then extract the final smoothed soft assignment matrix $F \in \mathbb{R}^{n \times k}$ by flattening $\tilde{S}$ according to pixel ordering.
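A reference implementation of this masked window averaging is sketched below with explicit loops for clarity; on a GPU the same operation can be realized as a depthwise convolution over the $k$ assignment channels. Function and variable names are illustrative, not from the paper's code.

```python
import numpy as np

def smooth_soft_assignments(S, T, win=3):
    """Masked box filtering of the soft-assignment tensor.

    S : (w, h, k) per-pixel soft assignments.
    T : (w, h) binary mask (1 = valid pixel, 0 = excluded).
    Each output vector averages the assignments of valid neighbors
    inside a win x win window centered on the pixel.
    """
    w, h, _ = S.shape
    r = win // 2
    S_sm = S.copy()
    for x in range(w):
        for y in range(h):
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            y0, y1 = max(0, y - r), min(h, y + r + 1)
            tw = T[x0:x1, y0:y1]                 # window weights t_{u,v}
            den = tw.sum()
            if den > 0:                          # skip fully masked windows
                num = (tw[..., None] * S[x0:x1, y0:y1]).sum(axis=(0, 1))
                S_sm[x, y] = num / den
    return S_sm
```

Flattening the result row by row then yields the smoothed matrix $F$ used in the local-structure loss.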
To incorporate this refined local structure into the original predictions, we minimize the KL divergence between the original and the smoothed predictions. The local structure preservation function $\Psi(H, D)$ is defined as follows:
$\Psi(H, D) = \mathrm{KL}(F \,\|\, S) = \sum_{i} \sum_{j} f_{i,j} \log \dfrac{f_{i,j}}{s_{i,j}},$
where $F \in \mathbb{R}^{n \times k}$ is the smoothed assignment matrix, $f_{i,j}$ represents the element at position $(i, j)$, and $S \in \mathbb{R}^{n \times k}$ is the original soft assignment matrix computed from the embedded features $H$ and the subspace bases $D$ via Equation (2).
Existing methods typically apply spatial constraints, such as spatial smoothing, to a self-representation 3D tensor $M \in \mathbb{R}^{w \times h \times n}$, where $w$ and $h$ denote the width and height of the image, resulting in a computational complexity of $O(n^2)$ [21]. Other approaches apply total variation regularization to a dictionary representation tensor $T \in \mathbb{R}^{w \times h \times n'}$, where $n' \geq k$ is the dictionary dimension, leading to a complexity of $O(n n' \log n)$ [24].
In contrast, our method directly applies spatial filtering to the clustering soft assignment $S \in \mathbb{R}^{w \times h \times k}$. Denoting the size of the smoothing window by $|W|$, the computational complexity is $O(|W| \times n \times k)$, which simplifies to $O(n)$, making it much more efficient. More importantly, our spatial constraint optimizes the entire clustering process, providing stronger guidance for network optimization.

3.5. Objective Function and Training Strategy

The overall objective function consists of multiple loss terms that jointly optimize the reconstruction of the autoencoder, basis dissimilarity, and both local and non-local structure preservation. Specifically, the reconstruction loss is defined as
$L_R = \dfrac{1}{2n} \sum_{i=1}^{n} \| X_i - \hat{X}_i \|_F^2,$
which measures the reconstruction error between the input X i and its reconstruction X ^ i . This term encourages the network to preserve the essential spectral information of each pixel during feature learning, ensuring that the latent representation retains sufficient fidelity for downstream clustering. Based on the definition of ϕ ( D ) in Equation (4), the basis dissimilarity loss is formulated as
L D   =   ϕ ( D ) ,
which enforces diversity among the learned basis vectors to avoid redundancy in representation. To preserve both spatial and spectral consistency, we further introduce two structure-preserving constraints. The non-local preservation loss is defined as
$L_{NL} = \eta(H, D),$
where $\eta(H, D)$ (introduced in Equation (9)) enforces non-local spectral consistency between spectrally similar but spatially distant pixels. Similarly, the local preservation loss is given by
$L_L = \Psi(H, D),$
where $\Psi(H, D)$ (defined in Equation (12)) preserves spatial smoothness by encouraging neighboring pixels to exhibit consistent affinities. Finally, the total objective function is given by
$L_{\text{total}} = L_R + \beta L_D + \beta_1 L_{NL} + \beta_2 L_L,$
where $\beta$, $\beta_1$, and $\beta_2$ are trade-off parameters controlling the relative importance of each loss term. By minimizing this objective, both the autoencoder parameters $\Theta$ and the learnable subspace bases $D$ are jointly updated through back-propagation. This optimization forms a coordinated dual mechanism: the gradients arising from the soft assignments $S$ update the subspace bases $D$ and further propagate through $H$ to the encoder, guiding the latent features to align with the underlying subspace structures. Meanwhile, the reconstruction loss between the input $X$ and the output $\hat{X}$ anchors this process, ensuring that the evolving representation $H$ remains grounded in the intrinsic spectral–spatial information needed for faithful data reconstruction.
As shown in Algorithm 1, the network training involves four main steps (Steps 1–4), followed by a final assignment step (Step 5). First, the mini-clusters are generated by FINCH. Next, the autoencoder is pre-trained to obtain initial latent representations of the input data. Then, the K-means algorithm is applied to generate the initial clustering results, which are subsequently processed with SVD to obtain the initial subspace basis for each cluster. Following [18], we select five main basis vectors, i.e., the five singular vectors with the largest singular values, to capture the subspace structure within each cluster. In the joint optimization step (Step 4), the autoencoder and the subspace bases are jointly optimized, with their parameters updated under the supervision of all constraints.
Algorithm 1 Training Process for Hyperspectral Image Clustering.
1: Input: Hyperspectral image patches $X = \{X_1, X_2, \ldots, X_n\}$
2: Output: Cluster labels
3: Step 1: Mini-cluster generation (FINCH)
4:     Generate mini-clusters: $C = \{C_1, C_2, C_3, \ldots, C_l\}$
5: Step 2: Pre-training
6:     Perform data preprocessing and pre-train the deep autoencoder; obtain initial $H$.
7:     Minimize the reconstruction loss: $L = L_R$
8: Step 3: Initial subspace basis construction
9:     Apply K-means clustering on $H$ to generate initial clusters.
10:    Initialize the subspace basis $D$ for the initial clusters using Singular Value Decomposition (SVD).
11: Step 4: Joint Optimization (End-to-End)
12: for epoch = 1 to MaxEpochs do
13:     Forward Pass (Compute Variables):
14:         Compute latent features: $H \leftarrow \text{Encoder}(X)$.
15:         Compute reconstruction: $\hat{X} \leftarrow \text{Decoder}(H)$.
16:         Compute soft assignments $S$ via Equation (2) with $H$ and $D$.
17:         Update target distributions $M$, $\tilde{M}$, and $F$ based on the current $S$.
18:         Compute total loss $L_{\text{total}}$ via Equation (17).
19:     Backward Pass (Update Model Parameters):
20:         Update subspace bases $D$ using gradients of $L_{\text{total}}$.
21:         Update autoencoder parameters $\Theta$ via back-propagation of $L_{\text{total}}$.
22: end for
23: Step 5: Final Assignment
24: Assign each data point to a cluster based on the highest value in its soft assignment vector: $\text{label}_i = \arg\max_j s_{i,j}$.
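The SVD-based basis initialization in Step 3 can be sketched as follows, keeping the top five singular vectors per cluster as described above. The function name and interface are our own; this is a minimal NumPy sketch, not the released implementation.

```python
import numpy as np

def init_subspace_bases(H, labels, n_clusters, n_basis=5):
    """Initialize a (d x n_basis) orthonormal basis for each cluster.

    H      : (n, d) latent features from the pre-trained autoencoder.
    labels : (n,) initial K-means cluster labels.
    Each basis consists of the left singular vectors with the largest
    singular values of that cluster's (d x n_c) feature matrix.
    """
    bases = []
    for c in range(n_clusters):
        Hc = H[labels == c].T                  # d x n_c cluster matrix
        U, _, _ = np.linalg.svd(Hc, full_matrices=False)
        bases.append(U[:, :n_basis])           # top singular vectors
    return bases
```

After this initialization, the bases in $D$ are treated as learnable parameters and updated jointly with the autoencoder in Step 4.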
Our model jointly optimizes these components, enabling local homogeneity to propagate to non-local data points through the non-local constraint and vice versa. In contrast, self-representation-based subspace clustering methods have a complexity of O ( n 2 ) for computing the self-representation matrix, maintaining the non-local structure with Laplacian regularization, and preserving spatial dependency on the self-representation matrix. Our method achieves a complexity of O ( n ) for all key operations, including calculating the soft assignment, maintaining the non-local structure, and preserving the local structure, making it scalable for large datasets. Furthermore, the proposed structural constraints supervise the entire clustering process, providing stronger guidance for optimization.

4. Experiments and Results

In this section, we evaluate the proposed method on four real-world hyperspectral images. We compare it against several popular clustering algorithms. Then, we perform an ablation study to understand the impact of the local and non-local structure preservation constraints.

4.1. Datasets

We conducted experiments on four real-world hyperspectral image datasets. The details of these datasets are as follows:
  • Trento dataset: This dataset was acquired using the Compact Airborne Spectrographic Imager (CASI) sensor and contains 63 spectral bands. It is divided into 6 classes. The image size is 600 × 166 pixels, with 30,214 labeled samples.
  • Houston dataset: The Houston dataset was collected using the ITRES Compact Airborne Spectrographic Imager (ITRES-CASI) sensor, which captures high-resolution hyperspectral imagery across 144 spectral bands. It is categorized into 7 classes. The image size is 130   ×   130 pixels, containing 6104 labeled samples.
  • PaviaU dataset: The PaviaU dataset was acquired with the Reflective Optics System Imaging Spectrometer (ROSIS-3) sensor, providing 103 spectral bands. It is classified into 9 classes. The image size is 610   ×   340 pixels, with 42,776 labeled samples.
  • HYPSO-1 dataset: The HYPSO-1 dataset originates from the Hyperspectral Small Satellite for Ocean Observation (HYPSO-1) CubeSat mission, which provides hyperspectral imagery covering sea, land, and cloud regions with approximately 120 spectral bands. For our experiments, a 150   ×   150 spatial region was selected from one labeled scene, containing three major classes (sea, land, and cloud) with 22,500 labeled samples.
These datasets feature a variety of sensor types, numbers of spectral bands, class categories, and image dimensions, providing a diverse experimental platform to validate the effectiveness of our proposed method.

4.2. Experimental Setting

During training, the model is trained for 400 epochs without early stopping. To balance memory consumption and enrich the feature representation of each HSI pixel, we use a patch size of $7 \times 7$ for all compared methods. When generating mini-clusters, we employ a larger patch size of $17 \times 17$ to incorporate more detailed neighborhood information. Parameter $\beta$ is fixed at $10^{3}$. To ensure statistical robustness, all experiments are repeated 10 times with different random seeds. We report the mean ± standard deviation of all performance metrics. Statistical significance is evaluated using the Wilcoxon signed-rank test to compare our method against each baseline. Detailed statistical results are provided in Appendix A. All implementation details, including hyperparameter settings and the training schedule, are available in our released code. The code and data are publicly available; see the "Data Availability Statement" section for details.
We compare our method with several clustering approaches, including centroid-based methods such as K-means [45] and Fuzzy C-means (FCM) [46]; the graph-based method spectral clustering (SC) [47]; data-driven deep clustering methods such as improved deep embedded clustering (IDEC) [48], SpectralNet (SN) [30], N2D [31], and deep embedded K-means (DEKM) clustering [49]; self-representation-based deep subspace clustering method HyperAE [42] and the nearest neighbor-based method FINCH [20]. Additionally, we compare our method with our preliminary model-aware deep learning (MADL) framework [19], which preserves non-local structures through contrastive learning.
To assess the performance of our model, we employ three widely used metrics: Overall Accuracy (OA), Normalized Mutual Information (NMI), and Cohen’s Kappa ( K ). The OA quantifies the proportion of correctly classified samples relative to the total number of samples, computed by
$\mathrm{OA} = \dfrac{1}{n} \sum_{i=1}^{n} I(g_i = \hat{g}_i),$
where n is the total number of samples, g i is the true label of the ith sample, g ^ i is the predicted label of the ith sample, and I ( · ) is the indicator function that equals 1 if the condition inside is true and 0 otherwise. The NMI measures the mutual information between the clustering results and the true labels, normalized by the average of their entropies, defined by
$\mathrm{NMI} = \dfrac{2 \times I(G; \hat{G})}{H(G) + H(\hat{G})},$
where I ( G ; G ^ ) is the mutual information between the true labels G and the clustering labels G ^ and H ( G ) and H ( G ^ ) are the entropies of G and G ^ , respectively. K assesses the agreement between the clustering results and the true labels while accounting for the possibility of agreement occurring by chance, calculated by
$K = \dfrac{P_o - P_e}{1 - P_e},$
where P o is the observed agreement among raters and P e is the expected agreement by chance. For the three evaluation metrics, a higher value indicates better performance. Additionally, we report the computational time in seconds to compare the efficiency of different methods.
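For reference, OA and Cohen's Kappa can be computed directly from the label vectors as sketched below (NMI is available in standard libraries, e.g., scikit-learn's `normalized_mutual_info_score`). The sketch assumes the predicted cluster labels have already been aligned to the ground-truth classes, e.g., via the Hungarian algorithm, as is standard in clustering evaluation; the function names are ours.

```python
import numpy as np

def overall_accuracy(g, g_hat):
    """OA: fraction of samples whose (aligned) predicted label is correct."""
    g, g_hat = np.asarray(g), np.asarray(g_hat)
    return float(np.mean(g == g_hat))

def cohens_kappa(g, g_hat):
    """Kappa = (P_o - P_e) / (1 - P_e), computed from the confusion matrix."""
    g, g_hat = np.asarray(g), np.asarray(g_hat)
    classes = np.unique(np.concatenate([g, g_hat]))
    n = g.size
    cm = np.array([[np.sum((g == a) & (g_hat == b)) for b in classes]
                   for a in classes], dtype=float)
    p_o = np.trace(cm) / n                                   # observed agreement
    p_e = np.sum(cm.sum(axis=1) * cm.sum(axis=0)) / n ** 2   # chance agreement
    return float((p_o - p_e) / (1 - p_e))
```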

4.3. Performance Analysis

4.3.1. Houston Dataset

The clustering results of different methods on the Houston dataset are presented in Table 2 and Figure 5. We set β 1   =   3 , β 2   =   8 , and use a filter window size of 3   ×   3 . We observe that only the deterministic method FINCH [20] produces fully consistent results across runs and metrics. To provide a fair comparison, we further conduct a Wilcoxon signed-rank analysis among different methods, as summarized in Table A1. The results show that most pairwise differences with our method are statistically significant ( p w   <   0.05 , many at p w   <   0.01 ), confirming the reliability of the observed improvements. For the few cases where the differences are not significant, such as HyperAE and MADL, our model still achieves higher mean accuracies and smaller deviations, showing more stable and consistent performance. IDEC also attains a comparable mean in NMI but with larger variance, while FINCH, as a deterministic algorithm, produces fixed outputs without statistical variance.
Overall, our method achieves the best performance across all three metrics. Compared to our preliminary work [19], our method gains an additional 2.08 % in OA, demonstrating the effectiveness of the mini-cluster updating scheme in enhancing representation stability and clustering accuracy. Compared with HyperAE, which relies on global self-representation reconstruction, our basis-representation model achieves better accuracy with far less computational cost. Meanwhile, data-driven deep clustering methods such as N2D, DEKM, and SpectralNet exhibit inferior and less stable results due to their dependence on feature distribution learning without explicit structural constraints. FINCH also performs strongly among non-deep methods, reflecting its ability to model complex local relationships through a nearest neighbor chain mechanism.
In terms of runtime, FINCH remains the fastest, and traditional clustering methods are generally more efficient than deep learning–based approaches. Nevertheless, our method completes the clustering task in approximately 33 s, whereas HyperAE requires more than 400 s due to its O ( n 2 ) self-representation complexity. This clearly highlights the computational advantage of adopting a compact basis representation instead of full self-representation modeling.
From the cluster map, we observe that our method best aligns with the ground truth. Specifically, our approach accurately distinguishes between parking lot and asphalt areas, while other methods often merge them. Only SpectralNet and N2D partially recognize these distinctions.

4.3.2. Trento Dataset

The clustering results of various methods on the Trento dataset are presented in Table 3 and Figure 6. We set β 1   =   3 , β 2   =   1 , and use a filter window size of 7   ×   7 . We observe that most methods achieve accuracies above 70%, which may be attributed to class imbalance, as several dominant classes occupy the majority of samples. Due to this imbalance, most approaches, including ours, fail to recognize the third class. According to the Wilcoxon signed-rank analysis in Table A1, our method achieves statistically significant improvements ( p w   <   0.05 ) over nearly all baselines on OA and K , confirming its reliability and robustness. Compared with our preliminary work [19], the proposed model achieves an additional 2.31% improvement in OA (90.61% vs. 88.30%) while maintaining almost identical NMI (0.9101 vs. 0.9144). This indicates that the new non-local structure preservation scheme is more adaptive than using a fixed number of contrastive neighbors. Among the compared methods, HyperAE fails on this dataset because its self-representation step involves O ( n 2 ) memory usage, which leads to out-of-memory errors. Meanwhile, our basis-representation formulation maintains linear scalability and achieves better results than MADL.
Overall, our method achieves the best performance in OA and $K$ while maintaining a competitive NMI close to the top-performing MADL. Although N2D conceptually bridges deep and shallow clustering, it performs relatively poorly on this dataset because its clustering is conducted on fixed autoencoded features without feedback to the encoder. The autoencoder and clustering stages are not jointly optimized, so the learned representations are driven purely by reconstruction quality rather than clustering separability. DEKM and SpectralNet exhibit moderate performance but higher variance across trials. Interestingly, K-means performs remarkably well, significantly better than FCM. This can be attributed to the clear boundaries in the Trento data, where hard assignments are more suitable than fuzzy partitioning. Additionally, K-means outperforms FINCH and spectral clustering, suggesting that the Trento dataset has a structure closer to spherical clusters, where centroid-based models are advantageous.
In terms of efficiency, our method requires around 149 s on Trento, which is substantially faster than MADL, DEKM, and IDEC. More importantly, the runtime increases almost proportionally to data volume compared with Houston, where the dataset size is about five times smaller. This consistency demonstrates the linear scalability of the proposed basis-representation formulation with respect to data size while maintaining superior clustering performance.
From the cluster maps, our method best aligns with the ground truth, accurately recognizing the Vineyard and Wood classes, while other methods tend to mix them with surrounding areas. Moreover, our predictions are much smoother, reflecting better spatial consistency. The inclusion of non-local structure preservation also improves the distinction between buildings and roads, further enhancing spatial detail fidelity.

4.3.3. PaviaU Dataset

The clustering results of various methods on the PaviaU dataset are presented in Table 4 and Figure 7. We set β 1   =   3 , β 2   =   7 , and use a filter window size of 7   ×   7 . According to the Wilcoxon signed-rank analysis in Table A1, our method achieves statistically significant improvements ( p w   <   0.05 ) over most baselines in terms of OA and K , with higher mean values across both metrics. More importantly, it achieves significant improvements in NMI over all compared methods, showing a clear margin in both mean and stability, which highlights the consistency and reliability of the learned representation even under severe class imbalance.
Overall, our method achieves the best mean performance across OA, NMI, and K , with a 3.79% improvement in OA compared with our preliminary work [19]. This demonstrates the effectiveness of the model-aware basis representation and the structure-preserving design in improving both discriminative power and clustering robustness. Due to the dataset’s complex and highly mixed structure, all methods achieve OA values below 70%, since accuracy-based metrics are easily affected by dominant-class bias. In contrast, our superior NMI result indicates stronger global consistency and better inter-cluster separation, making it a more reliable indicator on this dataset.
Spectral clustering achieves the second best performance, outperforming several deep methods despite the large data volume. However, due to class imbalance, it tends to detect only a few dominant clusters while ignoring minority ones. Our method also fails to recognize two small classes for the same reason. Among deep models, data-driven approaches such as DEKM, SpectralNet, and N2D show moderate results but less stability. N2D performs particularly poorly since its clustering is conducted on fixed autoencoded features without feedback to the encoder, relying solely on reconstruction rather than clustering optimization. HyperAE again fails due to its O ( n 2 ) self-representation complexity, whereas our linear basis-representation formulation remains efficient and scalable.
From the cluster maps, our method best aligns with the ground truth, accurately distinguishing between bare soil and tree classes, while other methods often confuse these classes. Benefiting from both local and non-local structure preservation, our results exhibit smoother and more spatially consistent cluster boundaries. In terms of runtime, our method runs faster than other deep clustering baselines such as MADL and DEKM while maintaining a nearly linear growth trend relative to data size, consistent with the observations on the previous datasets.

4.3.4. HYPSO-1 Dataset

The clustering results of various methods on the HYPSO-1 dataset are presented in Table 5 and Figure 8. We set β 1 = 3 , β 2   =   0.001 , and use a filter window of size 3   ×   3 . We choose a very small weight for the local structure term because the cloud and water regions exhibit relatively simple spatial patterns. Moreover, cloud areas often contain small spots or holes that can be easily removed by spatial filtering, so assigning a large spatial weight may distort these regions rather than preserve meaningful structure. According to the Wilcoxon signed-rank analysis in Table A1, our method achieves statistically significant improvements ( p w   <   0.05 ) over all baselines in terms of OA and K , with a noticeably large margin of over 3 % . For NMI, although the difference is not statistically significant compared with IDEC and MADL, our method still attains a higher mean value with a low standard deviation of around 0.02 . Although this standard deviation is larger than that of MADL, it remains small overall, indicating stable clustering performance.
Overall, our method achieves the best average performance across OA, NMI, and K . We also observe that spectral clustering performs relatively well, while SpectralNet obtains the worst results. This is perhaps because SpectralNet constructs its KNN graph within mini-batches and approximates the spectral embedding using a neural network; such a procedure makes its neighborhood structure unstable on high-dimensional HSI data, resulting in substantially poorer performance than the global, closed-form spectral clustering. The K-means and DEKM methods also fail to achieve good performance, likely because the data distribution deviates from a spherical assumption and the thin cloud and ground regions are highly mixed, leading to ambiguous class boundaries. Consequently, FCM yields noticeably better performance in this scenario. IDEC, MADL, and our method all perform relatively well, suggesting that refined probability estimation can effectively capture the underlying data distribution. Our method further improves performance owing to the more effective modeling of HSI structural characteristics.
From the clustering maps, we can see that, due to the spatially homogeneous nature of regions such as water and cloud, most methods already produce relatively smooth results. Even so, our method aligns with the ground truth most closely, particularly in the ground and cloud regions in the upper-right area. In terms of runtime, our method maintains linear computational complexity and runs faster than IDEC, DEKM, and MADL, demonstrating both its effectiveness and efficiency.

4.4. Ablation Study

In the ablation study, we analyze the effect of different parts of our model. Specifically, we apply the basis-representation clustering model proposed in [18] as the baseline. We then enhance this model by introducing a mini-cluster-based update scheme to preserve the spectral structure. Finally, we integrate a local preservation module to further smooth out noise and improve robustness. The parameter settings of our model are shown in Table 6; the results are presented in Table 7.
The results from the ablation study indicate that applying only local structure preservation improves NMI on the Houston, Trento, and PaviaU datasets. However, it slightly reduces accuracy on the Houston, Trento, and HYPSO-1 datasets, possibly because the original predictions in these cases are already locally smooth, so additional filtering over-smooths them. In contrast, applying only non-local structure preservation enhances performance across all datasets and metrics. Combining local and non-local structure preservation further boosts performance, as local structure preservation helps propagate the benefits of non-local preservation. Considering both structures allows for more effective learning of the HSI structure compared to using either one alone.

4.4.1. Impact of the Number of FINCH Iterations

In our mini-cluster updating scheme, the FINCH algorithm [20] is applied for mini-cluster generation. The number of mini-clusters decreases as the number of FINCH iterations increases, as shown in Figure 9a. Meanwhile, the variance within each mini-cluster increases with the number of FINCH iterations; the mean within-cluster variance is shown in Figure 9b. We observe that from the second iteration onward, the mean within-cluster variance starts to increase, indicating that more outliers fall within the mini-clusters. The clustering performance of mini-clusters generated at different iterations is shown in Figure 10: the accuracy first increases and then decreases. Specifically, for the Houston and Trento datasets, the best performance is obtained after two iterations, after which it declines, corroborating the variance observations. In conclusion, to ensure the quality of the mini-clusters and enhance clustering performance, we use the mini-clusters generated after the second iteration.

4.4.2. Impact of Mini-Cluster Updating

The parameter β 1 controls the updating of the mini-cluster soft assignment, which is important in the optimization. Here we compare the performance with different β 1 values on various datasets. The clustering results are shown in Figure 11.
From the graph, it is evident that either too large or too small values of β 1 adversely affect clustering performance. For example, a small β 1 decreases the performance on the Houston dataset. Meanwhile, a large β 1 leads to a lower accuracy on the Trento dataset. Since different datasets exhibit varying sensitivities to β 1 , it is advisable to choose β 1 values between 3 and 5. In our experiments, we set β 1 to three across all datasets.

4.4.3. Impact of Local Structure Preservation

To accommodate the diverse inner local structures of different datasets, distinct settings are required for the local structure preservation module. To evaluate the influence of this module, we integrate it into a mini-cluster optimization model with varying weights. The clustering results with different local structure weights are shown in Figure 12.
The results show that the local structure preservation module effectively enhances clustering performance. Additionally, the optimal weight parameters vary across different datasets. For example, with simpler datasets like Trento, a small weight is adequate to achieve smooth clustering. On the other hand, for more complex datasets like Houston or PaviaU, a larger weight may be needed, typically within the range of [6, 10].

4.4.4. Impact of Patch Size

We conducted an additional ablation study (Figure 13) to analyze the influence of different patch sizes [3, 5, 7, 9, 11]. The results indicate that moderate patch sizes yield better performance, while excessively large patches lead to a decline in performance. Notably, all datasets achieved their best performance when the patch size was set to 7   ×   7 . This validates the effectiveness of the chosen patch configuration and highlights its role in achieving optimal clustering performance.

4.4.5. Impact of the Number of Basis Vectors

We also conducted an ablation study (Figure 14) to evaluate the impact of different numbers of basis vectors (3, 5, 7, 9, 11, 13, and 15). The results indicate that although the optimal choice varies slightly across datasets, using five basis vectors consistently yields performance close to the best in all cases. This validates the effectiveness of the adopted setting and provides insight into how the basis dimension influences clustering quality.

4.4.6. Visualization of Embedded Representation and Soft Assignment

The t-distributed Stochastic Neighbor Embedding (t-SNE) [50] visualization of the latent representation for the Houston and PaviaU datasets is shown in Figure 15. We compare the original latent representation produced by the autoencoder with the representation obtained by our proposed method. For the Houston dataset, the t-SNE visualization of the autoencoder’s latent representation reveals significant overlap between some classes, resulting in unclear decision boundaries. Additionally, the intra-class points are loosely distributed, lacking tight clustering, which decreases the overall discriminative power. Our proposed method enhances the representation quality, reducing class overlap and improving intra-class compactness. For example, Class 2 is well separated as highlighted in the red circle, leading to a more distinct and organized representation. Meanwhile, the latent representation for the PaviaU dataset from the autoencoder shows unclear decision boundaries, with many classes being mixed. In contrast, our method generates a more discriminative representation, reducing the mixing of different classes, as illustrated by the red circle, ultimately leading to better class separation.
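Projections like those in Figure 15 can be produced with scikit-learn's t-SNE implementation (assuming it is available); the helper below is a minimal sketch, and the parameter choices (PCA initialization, default perplexity) are ours, not necessarily those used for the paper's figures.

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_embed(H, seed=0, perplexity=30.0):
    """Embed latent features H of shape (n, d) into 2D for visualization.

    Note: perplexity must be smaller than the number of samples n.
    """
    H = np.asarray(H, dtype=np.float64)
    return TSNE(n_components=2, init="pca", random_state=seed,
                perplexity=perplexity).fit_transform(H)
```

The returned (n, 2) array can be scattered with per-pixel cluster labels as colors to inspect class overlap and intra-class compactness.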
We also visualize the confusion matrix of the soft assignment matrix S , as shown in Figure 16. Distinct block-diagonal structures can be observed, indicating that the proposed model produces compact clusters with clear inter-cluster boundaries. A small amount of mixing is observed between several clusters, which is reasonable and can be attributed to class imbalance, high spectral similarity, or similar material compositions among adjacent classes.
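Because cluster indices are arbitrary in unsupervised learning, such a confusion matrix is computed after matching clusters to ground-truth classes, typically with the Hungarian algorithm. The sketch below shows this standard evaluation recipe; it is not necessarily the paper's exact implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def aligned_confusion(y_true, y_pred, n_classes):
    """Confusion matrix after optimally matching cluster ids to class ids.

    Returns the column-permuted confusion matrix and a dict mapping each
    predicted cluster id to its matched class id.
    """
    # contingency table: rows = true class, columns = predicted cluster
    C = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    # Hungarian matching maximizes the total on-diagonal count
    row, col = linear_sum_assignment(-C)
    return C[:, col], dict(zip(col, row))

def overall_accuracy(conf):
    """Overall accuracy read off the aligned confusion matrix."""
    return np.trace(conf) / conf.sum()
```

A strong block-diagonal structure in the aligned matrix corresponds to compact clusters with clear inter-cluster boundaries, as observed in Figure 16.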

4.4.7. Convergence Analysis

To validate the convergence of our model, we plot the training curves for the datasets in Figure 17. As training proceeds, the loss decreases and the accuracy increases until both plateau, indicating that the model converges well.

5. Conclusions

In this paper, we presented a concise review of model-based and deep clustering methods, covering both purely data-driven and model-aware approaches. Our primary contribution is a scalable, context-preserving, model-aware deep clustering approach for hyperspectral images. The proposed method learns the subspace basis under the joint supervision of the local and non-local structures inherent to hyperspectral image data, allowing the two structures to mutually reinforce each other during training. The approach achieves clustering with a computational complexity of O(n), making it scalable to large-scale data. Unlike previous state-of-the-art methods, both the local and non-local structure preservation constraints in our method optimize the entire clustering process in an end-to-end manner, providing stronger guidance for model optimization. Experimental results on four benchmark hyperspectral datasets demonstrate that our method outperforms state-of-the-art approaches in clustering performance.

Author Contributions

Conceptualization, X.L., S.H. and A.P.; methodology, X.L., S.H. and A.P.; software, X.L.; validation, X.L.; formal analysis, X.L., A.P., N.N. and N.D.; investigation, X.L.; data curation, X.L.; writing—original draft preparation, X.L.; writing—review and editing, S.H., N.N., N.D. and A.P.; supervision, A.P. and N.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research has been partially supported by the Flanders AI Research Programme grant no. 174B09119. This work was also supported by the Bijzonder Onderzoeksfonds (BOF) under Grant BOF.24Y.2021.0049.01, the Research Foundation—Flanders under Grant G094122N (project SPYDER), and the China Scholarship Council under Grant 202106150007. Additionally, this work was supported in part by the National Natural Science Foundation of China under Grant 42301425 and the China Postdoctoral Science Foundation under Grant 2023M743299.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The implementation code and experimental scripts are publicly available at https://github.com/lxlscut/SCDSC (accessed on 10 December 2025). The hyperspectral datasets analyzed in this study are openly available from public repositories as follows: the Trento dataset (https://github.com/A-Piece-Of-Maple/TrentoDateset, accessed on 10 December 2025), the Pavia University dataset on the Hyperspectral Remote Sensing Scenes website (https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes, accessed on 10 December 2025, “Pavia University scene” section), the Houston dataset from the IEEE GRSS Data Fusion Contest (https://www.grss-ieee.org/community/technical-committees/2013-ieee-grss-data-fusion-contest/, accessed on 10 December 2025), and the HYPSO-1 dataset (https://ntnu-smallsat-lab.github.io/hypso1_sea_land_clouds_dataset/, accessed on 10 December 2025).

Acknowledgments

The authors thank the providers of the hyperspectral datasets used in this work. N. Deligiannis is supported by the “Onderzoeksprogramma Artificiele Intelligentie (AI) Vlaanderen” programme and by the ERC Consolidator Grant IONIAN (No. 101171240, DOI: 10.3030/101171240). Funded by the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Overall Wilcoxon Signed-Rank Results Across Datasets

We report two-sided Wilcoxon signed-rank p-values (p_w) from paired comparisons of each method against our proposed SCDSC method over repeated trials. Smaller p_w indicates stronger evidence of a difference from SCDSC.
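These p-values can be reproduced with `scipy.stats.wilcoxon` applied to the paired per-trial scores; the helper below is a minimal sketch with illustrative names.

```python
import numpy as np
from scipy.stats import wilcoxon

def compare_to_baseline(baseline_scores, method_scores):
    """Two-sided Wilcoxon signed-rank test on paired per-trial scores.

    baseline_scores, method_scores: arrays of the same length, one score
    per repeated trial. Returns the p-value; small values indicate the
    method's scores differ systematically from the baseline's.
    """
    stat, p = wilcoxon(baseline_scores, method_scores,
                       alternative="two-sided")
    return p
```

Note that with 10 repeated trials the smallest attainable two-sided p-value of the exact test is 2/2¹⁰ ≈ 1.95 × 10⁻³, which is why that value appears as a floor throughout Table A1.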
Table A1. Wilcoxon signed-rank p-values ( p w ) versus the proposed SCDSC method. Values are formatted in scientific notation ( p   <   0.05 is marked with *, p   <   0.01 with **).
| Dataset | Metric | K-Means | FCM | SC | IDEC | FINCH | DEKM | SN | HyperAE | N2D | MADL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Houston | OA | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | 5.86 × 10⁻³ ** | 5.86 × 10⁻³ ** | 1.05 × 10⁻¹ | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | 2.75 × 10⁻¹ | 1.95 × 10⁻³ ** | 1.93 × 10⁻¹ |
| Houston | NMI | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | 6.95 × 10⁻¹ | 6.45 × 10⁻² | 3.91 × 10⁻³ ** | 1.95 × 10⁻³ ** | 3.75 × 10⁻¹ | 3.91 × 10⁻³ ** | 4.88 × 10⁻² * |
| Houston | K | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | 5.86 × 10⁻³ ** | 9.77 × 10⁻³ ** | 1.05 × 10⁻¹ | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | 2.75 × 10⁻¹ | 3.91 × 10⁻³ ** | 1.31 × 10⁻¹ |
| Trento | OA | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | 3.91 × 10⁻³ ** | 1.95 × 10⁻³ ** | 3.91 × 10⁻³ ** | 3.91 × 10⁻³ ** | – | 1.95 × 10⁻³ ** | 5.86 × 10⁻³ ** |
| Trento | NMI | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | – | 1.95 × 10⁻³ ** | 5.86 × 10⁻³ ** |
| Trento | K | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | 3.91 × 10⁻³ ** | 1.95 × 10⁻³ ** | 9.77 × 10⁻³ ** | 3.91 × 10⁻³ ** | – | 1.95 × 10⁻³ ** | 5.86 × 10⁻³ ** |
| PaviaU | OA | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | 3.22 × 10⁻¹ | 3.91 × 10⁻³ ** | 1.95 × 10⁻³ ** | 2.32 × 10⁻¹ | 1.95 × 10⁻² * | – | 1.95 × 10⁻³ ** | 2.32 × 10⁻¹ |
| PaviaU | NMI | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | – | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** |
| PaviaU | K | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | 2.73 × 10⁻² * | 3.91 × 10⁻³ ** | 1.95 × 10⁻³ ** | 1.05 × 10⁻¹ | 9.77 × 10⁻³ ** | – | 1.95 × 10⁻³ ** | 4.88 × 10⁻² * |
| HYPSO-1 | OA | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | 3.91 × 10⁻³ ** | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | – | 1.95 × 10⁻³ ** | 1.37 × 10⁻² * |
| HYPSO-1 | NMI | 1.95 × 10⁻³ ** | 5.86 × 10⁻³ ** | 1.95 × 10⁻³ ** | 1.93 × 10⁻¹ | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | – | 1.95 × 10⁻³ ** | 4.92 × 10⁻¹ |
| HYPSO-1 | K | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | 3.91 × 10⁻³ ** | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | 1.95 × 10⁻³ ** | – | 1.95 × 10⁻³ ** | 1.37 × 10⁻² * |

Significance markers: * p < 0.05, ** p < 0.01. “–” indicates out-of-memory during execution (cf. Tables 3–5).

References

  1. Liu, Y.; Pu, H.; Sun, D.W. Hyperspectral imaging technique for evaluating food quality and safety during various processes: A review of recent applications. Trends Food Sci. Technol. 2017, 69, 25–35. [Google Scholar] [CrossRef]
  2. Stuart, M.B.; McGonigle, A.J.; Willmott, J.R. Hyperspectral imaging in environmental monitoring: A review of recent developments and technological advances in compact field deployable systems. Sensors 2019, 19, 3071. [Google Scholar] [CrossRef] [PubMed]
  3. Briottet, X.; Boucher, Y.; Dimmeler, A.; Malaplate, A.; Cini, A.; Diani, M.; Bekman, H.; Schwering, P.; Skauli, T.; Kasen, I.; et al. Military applications of hyperspectral imagery. In Proceedings of the Targets and Backgrounds XII: Characterization and Representation; SPIE: Bellingham, WA, USA, 2006; Volume 6239, pp. 82–89. [Google Scholar]
  4. Huang, S.; Zhang, H.; Zeng, H.; Pižurica, A. From Model-Based Optimization Algorithms to Deep Learning Models for Clustering Hyperspectral Images. Remote Sens. 2023, 15, 2832. [Google Scholar] [CrossRef]
  5. Elhamifar, E.; Vidal, R. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2765–2781. [Google Scholar] [CrossRef]
  6. Vidal, R. Subspace clustering. IEEE Signal Process. Mag. 2011, 28, 52–68. [Google Scholar] [CrossRef]
  7. Liu, G.; Lin, Z.; Yan, S.; Sun, J.; Yu, Y.; Ma, Y. Robust recovery of subspace structures by low-rank representation. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 171–184. [Google Scholar] [CrossRef] [PubMed]
  8. Wang, Y.X.; Xu, H.; Leng, C. Provable subspace clustering: When LRR meets SSC. Adv. Neural Inf. Process. Syst. 2013, 26, 64–72. [Google Scholar] [CrossRef]
  9. Tian, L.; Du, Q.; Kopriva, I. L0-motivated low rank sparse subspace clustering for hyperspectral imagery. In Proceedings of the IGARSS 2020-2020 IEEE International Geoscience and Remote Sensing Symposium, Waikoloa, HI, USA, 26 September–2 October 2020; IEEE: New York, NY, USA, 2020; pp. 1038–1041. [Google Scholar]
  10. Ji, P.; Zhang, T.; Li, H.; Salzmann, M.; Reid, I. Deep subspace clustering networks. Adv. Neural Inf. Process. Syst. 2017, 30, 24–33. [Google Scholar]
  11. Peng, X.; Feng, J.; Zhou, J.T.; Lei, Y.; Yan, S. Deep subspace clustering. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 5509–5521. [Google Scholar] [CrossRef] [PubMed]
  12. Lv, J.; Kang, Z.; Lu, X.; Xu, Z. Pseudo-Supervised Deep Subspace Clustering. IEEE Trans. Image Process. 2021, 30, 5252–5263. [Google Scholar] [CrossRef]
  13. Huang, S.; Zhang, H.; Pižurica, A. Joint sparsity based sparse subspace clustering for hyperspectral images. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; IEEE: New York, NY, USA, 2018; pp. 3878–3882. [Google Scholar]
  14. Li, K.; Qin, Y.; Ling, Q.; Wang, Y.; Lin, Z.; An, W. Self-supervised deep subspace clustering for hyperspectral images with adaptive self-expressive coefficient matrix initialization. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 3215–3227. [Google Scholar] [CrossRef]
  15. Zeng, M.; Cai, Y.; Liu, X.; Cai, Z.; Li, X. Spectral-spatial clustering of hyperspectral image based on Laplacian regularized deep subspace clustering. In Proceedings of the IGARSS 2019-2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; IEEE: New York, NY, USA, 2019; pp. 2694–2697. [Google Scholar]
  16. Valanarasu, J.M.J.; Patel, V.M. Overcomplete deep subspace clustering networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; IEEE: New York, NY, USA, 2021; pp. 746–755. [Google Scholar]
  17. Chen, Z.; Ding, S.; Hou, H. A novel self-attention deep subspace clustering. Int. J. Mach. Learn. Cybern. 2021, 12, 2377–2387. [Google Scholar] [CrossRef]
  18. Cai, J.; Fan, J.; Guo, W.; Wang, S.; Zhang, Y.; Zhang, Z. Efficient Deep Embedded Subspace Clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 21–30. [Google Scholar]
  19. Li, X.; Nadisic, N.; Huang, S.; Deligiannis, N.; Pižurica, A. Model-Aware Deep Learning for the Clustering of Hyperspectral Images with Context Preservation. In Proceedings of the 2023 31st European Signal Processing Conference (EUSIPCO), Helsinki, Finland, 4–8 September 2023; IEEE: New York, NY, USA, 2023; pp. 885–889. [Google Scholar]
  20. Sarfraz, S.; Sharma, V.; Stiefelhagen, R. Efficient parameter-free clustering using first neighbor relations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; IEEE: New York, NY, USA, 2019; pp. 8934–8943. [Google Scholar]
  21. Zhang, H.; Zhai, H.; Zhang, L.; Li, P. Spectral–Spatial Sparse Subspace Clustering for Hyperspectral Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2016, 54, 3672–3684. [Google Scholar] [CrossRef]
  22. Huang, S.; Zhang, H.; Pižurica, A. Semisupervised Sparse Subspace Clustering Method with a Joint Sparsity Constraint for Hyperspectral Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 989–999. [Google Scholar] [CrossRef]
  23. Xu, J.; Fowler, J.E.; Xiao, L. Hypergraph-Regularized Low-Rank Subspace Clustering Using Superpixels for Unsupervised Spatial–Spectral Hyperspectral Classification. IEEE Geosci. Remote Sens. Lett. 2021, 18, 871–875. [Google Scholar] [CrossRef]
  24. Huang, S.; Zhang, H.; Du, Q.; Pižurica, A. Sketch-based subspace clustering of hyperspectral images. Remote Sens. 2020, 12, 775. [Google Scholar] [CrossRef]
  25. Huang, S.; Zhang, H.; Pižurica, A. Subspace Clustering for Hyperspectral Images via Dictionary Learning With Adaptive Regularization. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–17. [Google Scholar] [CrossRef]
  26. Zhai, H.; Zhang, H.; Zhang, L.; Li, P. Total variation regularized collaborative representation clustering with a locally adaptive dictionary for hyperspectral imagery. IEEE Trans. Geosci. Remote Sens. 2019, 57, 166–180. [Google Scholar] [CrossRef]
  27. Huang, S.; Zeng, H.; Chen, H.; Zhang, H. Spatial and Cluster Structural Prior-Guided Subspace Clustering for Hyperspectral Image. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
  28. Huang, P.; Huang, Y.; Wang, W.; Wang, L. Deep embedding network for clustering. In Proceedings of the 2014 22nd International Conference on Pattern Recognition, Stockholm, Sweden, 24–28 August 2014; IEEE: New York, NY, USA, 2014; pp. 1532–1537. [Google Scholar]
  29. Peng, X.; Xiao, S.; Feng, J.; Yau, W.Y.; Yi, Z. Deep subspace clustering with sparsity prior. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16), New York, NY, USA, 9–15 July 2016; AAAI Press: Washington, DC, USA, 2016; pp. 1925–1931. [Google Scholar]
  30. Shaham, U.; Stanton, K.; Li, H.; Nadler, B.; Basri, R.; Kluger, Y. SpectralNet: Spectral Clustering Using Deep Neural Networks. arXiv 2018, arXiv:1801.01587. [Google Scholar] [CrossRef]
  31. McConville, R.; Santos-Rodríguez, R.; Piechocki, R.J.; Craddock, I. N2D: (Not Too) Deep Clustering via Clustering the Local Manifold of an Autoencoded Embedding. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: New York, NY, USA, 2021; pp. 5145–5152. [Google Scholar] [CrossRef]
  32. Xie, J.; Girshick, R.; Farhadi, A. Unsupervised deep embedding for clustering analysis. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; PMLR: New York, NY, USA, 2016; pp. 478–487. [Google Scholar]
  33. Nalepa, J.; Myller, M.; Imai, Y.; Honda, K.I.; Takeda, T.; Antoniak, M. Unsupervised Segmentation of Hyperspectral Images Using 3-D Convolutional Autoencoders. IEEE Geosci. Remote Sens. Lett. 2020, 17, 1948–1952. [Google Scholar] [CrossRef]
  34. Huang, J.; Gong, S.; Zhu, X. Deep semantic clustering by partition confidence maximisation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; IEEE: New York, NY, USA, 2020; pp. 8849–8858. [Google Scholar]
  35. Zeng, H.; Cao, J.; Feng, K.; Huang, S.; Zhang, H.; Luong, H.; Philips, W. Degradation-Noise-Aware Deep Unfolding Transformer for Hyperspectral Image Denoising. arXiv 2023, arXiv:2305.04047. [Google Scholar] [CrossRef]
  36. Chen, X.; Xia, W.; Yang, Z.; Chen, H.; Liu, Y.; Zhou, J.; Wang, Z.; Chen, Y.; Wen, B.; Zhang, Y. SOUL-net: A sparse and low-rank unrolling network for spectral CT image reconstruction. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 18620–18634. [Google Scholar] [CrossRef] [PubMed]
  37. Kouni, V.; Paraskevopoulos, G.; Rauhut, H.; Alexandropoulos, G.C. ADMM-DAD net: A deep unfolding network for analysis compressed sensing. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 7–13 May 2022; IEEE: New York, NY, USA, 2022; pp. 1506–1510. [Google Scholar]
  38. Wang, J.; Shao, Z.; Huang, X.; Lu, T.; Zhang, R. A deep unfolding method for satellite super resolution. IEEE Trans. Comput. Imaging 2022, 8, 933–944. [Google Scholar] [CrossRef]
  39. Tao, H.; Li, J.; Hua, Z.; Zhang, F. DUDB: Deep unfolding-based dual-branch feature fusion network for pan-sharpening remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–17. [Google Scholar] [CrossRef]
  40. Zhao, M.; Wang, X.; Chen, J.; Chen, W. A plug-and-play priors framework for hyperspectral unmixing. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  41. Cai, Y.; Zeng, M.; Cai, Z.; Liu, X.; Zhang, Z. Graph regularized residual subspace clustering network for hyperspectral image clustering. Inf. Sci. 2021, 578, 85–101. [Google Scholar] [CrossRef]
  42. Cai, Y.; Zhang, Z.; Cai, Z.; Liu, X.; Jiang, X. Hypergraph-structured autoencoder for unsupervised and semisupervised classification of hyperspectral image. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  43. Liu, S.; Huang, N.; Xiao, L. Locally Constrained Collaborative Representation Based Fisher’s LDA for Clustering of Hyperspectral Images. In Proceedings of the IGARSS 2020-2020 IEEE International Geoscience and Remote Sensing Symposium, Waikoloa, HI, USA, 26 September–2 October 2020; IEEE: New York, NY, USA, 2020; pp. 1046–1049. [Google Scholar]
  44. Cai, Y.; Zhang, Z.; Ghamisi, P.; Ding, Y.; Liu, X.; Cai, Z.; Gloaguen, R. Superpixel contracted neighborhood contrastive subspace clustering network for hyperspectral images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  45. Hartigan, J.A.; Wong, M.A. Algorithm AS 136: A k-means clustering algorithm. J. R. Stat. Soc. Ser. C (Appl. Stat.) 1979, 28, 100–108. [Google Scholar] [CrossRef]
  46. Bezdek, J.C. Pattern Recognition with Fuzzy Objective Function Algorithms; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013. [Google Scholar]
  47. Von Luxburg, U. A tutorial on spectral clustering. Stat. Comput. 2007, 17, 395–416. [Google Scholar] [CrossRef]
  48. Guo, X.; Gao, L.; Liu, X.; Yin, J. Improved deep embedded clustering with local structure preservation. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17), Melbourne, Australia, 19–25 August 2017; AAAI Press: Washington, DC, USA, 2017; Volume 17, pp. 1753–1759. [Google Scholar]
  49. Guo, W.; Lin, K.; Ye, W. Deep Embedded K-Means Clustering. In Proceedings of the 2021 International Conference on Data Mining Workshops (ICDMW), Auckland, New Zealand, 7–10 December 2021; IEEE: New York, NY, USA, 2021; pp. 686–694. [Google Scholar] [CrossRef]
  50. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
Figure 1. An illustration of the self-representation model.
Figure 2. An illustration of the basis-representation model.
Figure 3. Structure of the proposed method. The autoencoder maps the input data nonlinearly to a latent representation H . This representation is then projected onto various subspace bases to form the subspace affinity matrix S . To enhance the quality of the subspace basis, the following optimization modules are applied: (1) the mini-cluster updating module, which generates mini-cluster assignment M and updates it by minimizing the KL divergence loss to a refined version M ˜ ; (2) the local-structure-preserving module, which encourages the subspace affinity matrix S to be similar to its smooth version F .
Figure 4. Calculation process of the smoothed soft assignment matrix F . First, the soft assignments of image pixels are arranged into a 3D tensor based on their spatial locations. Next, a local window is applied to perform mean filtering for each point. Specifically, the filtering is calculated within the task area defined by a mask T , where labeled data are available, to facilitate effective evaluation of the results.
Figure 5. Houston dataset. (a) False-color image. (b) Ground truth and the clustering results obtained by (c) K-means, (d) FCM, (e) SC, (f) IDEC, (g) FINCH, (h) DEKM, (i) SpectralNet, (j) HyperAE, (k) N2D, (l) MADL, and (m) SCDSC.
Figure 6. Trento dataset. (a) False-color image. (b) Ground truth and the clustering results obtained by (c) K-means, (d) FCM, (e) SC, (f) IDEC, (g) FINCH, (h) DEKM, (i) SpectralNet, (j) N2D, (k) MADL, and (l) SCDSC.
Figure 7. PaviaU dataset. (a) False-color image. (b) Ground truth and the clustering results obtained by (c) K-means, (d) FCM, (e) SC, (f) IDEC, (g) FINCH, (h) DEKM, (i) SpectralNet, (j) N2D, (k) MADL, and (l) SCDSC.
Figure 8. HYPSO-1 dataset. (a) False-color image. (b) Ground truth and the clustering results obtained by (c) K-means, (d) FCM, (e) SC, (f) IDEC, (g) FINCH, (h) DEKM, (i) SpectralNet, (j) N2D, (k) MADL, and (l) SCDSC.
Figure 9. Mini-cluster generation across FINCH iterations. (a) Mini-cluster number by number of FINCH iterations, (b) Mini-cluster variance by number of FINCH iterations.
Figure 10. Performance with mini-clusters generated by the number of FINCH iterations. (a) Houston, (b) Trento, (c) PaviaU.
Figure 11. The impact of mini-cluster updating in clustering results. (a) Houston, (b) Trento, (c) PaviaU.
Figure 12. The impact of local structure preservation in clustering results. (a) Houston, (b) Trento, (c) PaviaU.
Figure 13. Impact of patch size on clustering performance across different datasets. (a) Houston, (b) Trento, (c) PaviaU.
Figure 14. Impact of the number of basis components on clustering performance across datasets. (a) Houston, (b) Trento, (c) PaviaU.
Figure 15. Visualization of the latent representation with t-SNE on the Houston and PaviaU datasets.
Figure 16. Visualization of S matrix on three datasets. (a) Houston, (b) Trento, (c) PaviaU.
Figure 17. Training loss and accuracy curves on three datasets. (a) Houston, (b) Trento, (c) PaviaU.
Table 1. Summary of representative clustering methods for hyperspectral images.
| Category | Sub-Category | Algorithms | Remarks |
|---|---|---|---|
| Model-based | Self-representation | SSC [5], LRR [7], JSSC [13], S4C [21], LCR-FLDA [43] | Learn global or structured self-representation coefficients; effective for capturing subspace structure; performance degrades on nonlinear manifolds; do not scale well. |
| Model-based | Dictionary-based | Sketch-TV [24], IDLSC [25], SPGSC [27], TV-CRC-LAD [26] | Use compact or structured dictionaries to improve scalability; suitable for large HSIs but still rely on linear reconstruction assumptions. |
| Deep learning-based | Data-driven | DEC [32], DEN [28], PICA [34], PARTY [29], SpectralNet [30], N2D [31] | Learn latent representations in an end-to-end manner; flexible and scalable, but rely heavily on network design and sufficient data. |
| Deep learning-based | Model-aware | DSCNet [10], SDSC-AI [14], HyperAE [42], LRDSC [15], PSSC [12], DSC [11] | Incorporate model priors into deep networks to improve accuracy; usually involve complex architectures and higher computational cost. |
Table 2. Quantitative evaluation of different clustering methods on the dataset Houston.
ClassK-Means [45]FCM [46]SC [47]IDEC [48]FINCH [20]DEKM [49]SN [30]HyperAE [42]N2D [31]MADL [19]SCDSC
146.5046.5062.4647.2053.5045.5341.4047.1651.8447.2048.60
210010010010010010010099.6299.98100100
358.050.035.6688.9470.6124.3032.1170.0367.7674.7079.66
410010010096.3710010076.4499.3663.1099.9699.65
594.770.7090.0094.5410085.6980.00100.0038.9269.3860.00
6027.070070.9401.54010.8523.0530.70
700.4609.9704.0059.1636.6073.8439.5647.97
OA(%) Mean64.5063.2365.8667.5872.1061.0160.7770.3663.6172.3374.41
OA Std0.080.015.241.200.003.003.795.502.225.854.21
NMI Mean0.69730.59350.61810.78510.77020.70110.64390.76970.74650.76560.7902
NMI Std0.00060.00020.06920.02690.000.04840.04750.04000.02130.02660.0329
K Mean0.53540.52500.54410.58590.64240.49220.50570.62250.56790.64920.6759
K Std0.00110.00010.07150.02040.000.04120.05680.07000.02760.07930.0590
Time (sec) Mean2.217.322.6842.960.7367.8816.75419.4026.4060.5233.43
Time Std0.173.060.300.480.169.310.3742.100.801.071.11
The best results are highlighted in bold, and the second-best results are underlined.
Table 3. Quantitative evaluation of different clustering methods on the dataset Trento. “–” indicates out-of-memory during execution.
ClassK-Means [45]FCM [46]SC [47]IDEC [48]FINCH [20]DEKM [49]SN [30]HyperAE [42]N2D [31]MADL [19]SCDSC
171.370089.5093.9599.3973.6999.6299.98100
214.331.27.8235.38031.8539.8426.321087.98
301.1053.900008.7900
499.2995.9699.5299.1699.5499.9199.4182.08100100
592.1999.4895.5476.9291.9676.4492.5554.78100100
684.7214.0092.8270.4767.1380.0469.9973.8886.0536.80
OA(%) Mean81.8365.1673.7680.2876.2381.4783.2067.5588.3090.61
OA Std3.153.001.325.210.006.577.166.050.192.87
NMI Mean0.77170.56850.76120.82340.82000.82570.78350.75680.91440.9101
NMI Std0.01550.05290.01480.04110.000.03880.08310.02460.00440.0024
K Mean0.75660.48850.63530.74300.69630.76070.77200.59780.84340.8746
K Std0.04230.04990.01640.06940.000.08110.10460.06900.00250.0392
Time (sec) Mean12.815.34154.15211.6916.99291.1281.56267.49230.38148.87
Time Std0.850.499.401.360.3783.831.0252.237.582.99
The best results are highlighted in bold, and the second-best results are underlined.
Table 4. Quantitative evaluation of different clustering methods on the dataset PaviaU. “–” indicates out-of-memory during execution.
ClassK-Means [45]FCM [46]SC [47]IDEC [48]FINCH [20]DEKM [49]SN [30]HyperAE [42]N2D [31]MADL [19]SCDSC
192.4499.9810096.5399.9877.0495.1788.6699.5593.64
247.7588.9910038.4858.8875.3363.9732.1570.8357.67
30000010.57095.2800
472.630.1673.7695.5989.4965.4377.4480.4097.2498.98
566.40039.1876.6810093.4210080.0099.96100
63.920.02053.1644.0422.1314.9455.3018.6585.46
70000017.320.2815.0400
893.690099.33082.3154.6697.1179.8499.98
90076.2414.20068.0094.4189.509.1042.20
OA(%) Mean50.9754.3167.3056.1155.9064.6659.8952.8865.6969.48
OA Std0.010.020.003.390.004.944.875.663.078.02
NMI Mean0.59260.34590.69050.67220.65100.64800.64080.62620.65110.7490
NMI Std0.00010.00020.000.01930.000.01960.03130.02230.02600.0358
K Mean0.39180.34910.52900.48110.47620.54080.48050.45510.55210.6250
K Std0.00040.00030.00010.03330.000.05090.06250.02850.03850.0892
Time (sec) Mean22.095.34297.12301.5338.01839.60115.77376.29391.85320.33
Time Std0.770.499.203.480.6475.040.8454.0917.965.31
The best results are highlighted in bold, and the second-best results are underlined.
Table 5. Quantitative evaluation of different clustering methods on the HYPSO-1 dataset. “–” indicates out-of-memory during execution.
| Class | K-Means [45] | FCM [46] | SC [47] | IDEC [48] | FINCH [20] | DEKM [49] | SN [30] | HyperAE [42] | N2D [31] | MADL [19] | SCDSC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 99.87 | 99.01 | 66.45 | 88.73 | 96.98 | 97.75 | 42.32 | – | 74.77 | 99.13 | 93.91 |
| 2 | 0.00 | 0.12 | 52.92 | 83.23 | 66.26 | 53.21 | 19.98 | – | 72.50 | 92.99 | 95.75 |
| 3 | 69.04 | 93.65 | 89.72 | 64.19 | 94.04 | 37.06 | 76.25 | – | 61.52 | 60.59 | 71.83 |
| OA(%) Mean | 67.61 | 77.66 | 73.64 | 77.34 | 75.79 | 63.59 | 52.30 | – | 68.75 | 81.81 | 84.96 |
| OA Std | 0.00 | 0.02 | 0.01 | 6.30 | 0.00 | 7.91 | 6.82 | – | 10.95 | 0.41 | 2.43 |
| NMI Mean | 0.5892 | 0.6102 | 0.4546 | 0.6064 | 0.4662 | 0.4250 | 0.2725 | – | 0.4221 | 0.6353 | 0.6380 |
| NMI Std | 0.00 | 0.0017 | 0.00 | 0.0566 | 0.00 | 0.1246 | 0.0753 | – | 0.0882 | 0.0048 | 0.0195 |
| K Mean | 0.4762 | 0.6237 | 0.5931 | 0.6592 | 0.5971 | 0.4430 | 0.2458 | – | 0.5308 | 0.7269 | 0.7739 |
| K Std | 0.00 | 0.0003 | 0.00 | 0.1049 | 0.00 | 0.1373 | 0.1067 | – | 0.1647 | 0.0057 | 0.0340 |
| Time (sec) Mean | 6.88 | 30.77 | 21.35 | 141.43 | 16.87 | 262.37 | 63.10 | – | 110.59 | 145.00 | 131.10 |
| Time Std | 0.37 | 16.55 | 1.33 | 1.11 | 0.35 | 48.72 | 1.93 | – | 0.55 | 2.09 | 3.66 |

The best results are highlighted in bold, and the second-best results are underlined.
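For reference, the OA and kappa (K) figures reported in these tables follow their standard definitions. The sketch below is illustrative only, not the authors' evaluation code: the function names are hypothetical, and the brute-force permutation search for cluster-to-class matching is a simple stand-in for the Hungarian algorithm that is usually preferred when the number of classes is larger.

```python
import numpy as np
from itertools import permutations

def overall_accuracy(y_true, y_pred, n_classes):
    """OA for clustering: cluster IDs are arbitrary, so accuracy is taken
    under the best one-to-one map from predicted clusters to classes.
    Brute-force over permutations (fine for small class counts)."""
    y_true = np.asarray(y_true)
    best = 0.0
    for perm in permutations(range(n_classes)):
        mapped = np.array([perm[p] for p in y_pred])
        best = max(best, float(np.mean(mapped == y_true)))
    return best

def cohens_kappa(y_true, y_pred, n_classes):
    """Kappa corrects observed agreement p_o for chance agreement p_e."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    cm /= cm.sum()
    p_o = float(np.trace(cm))
    p_e = float(cm.sum(axis=1) @ cm.sum(axis=0))
    return (p_o - p_e) / (1.0 - p_e)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 2, 2, 0, 0]   # same partition, permuted cluster IDs
print(overall_accuracy(y_true, y_pred, 3))  # → 1.0 after relabeling
```

Kappa is computed here on the raw confusion matrix; in practice it would be evaluated after the same cluster-to-class relabeling used for OA.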
Table 6. Hyperparameter settings for each dataset.
| Dataset | Learning Rate | β1 | β2 | Smooth Window |
|---|---|---|---|---|
| Houston | 0.0001 | 3 | 8 | 3 × 3 |
| Trento | 0.005 | 3 | 1 | 7 × 7 |
| PaviaU | 0.005 | 3 | 7 | 7 × 7 |
| HYPSO-1 | 0.005 | 3 | 0.001 | 3 × 3 |
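The "Smooth Window" column is the size of the spatial filter used by the local-structure (smoothness) constraint, which aligns clustering predictions with their spatially filtered versions. As an illustration only — the function name and the choice of a simple box (mean) filter are assumptions, not necessarily the paper's exact operator — smoothing soft predictions over such a window could look like:

```python
import numpy as np

def box_smooth_predictions(probs, window=3):
    """Mean-filter each soft cluster-assignment map over a window x window
    neighborhood (edge padding). A spatial-smoothness loss can then penalize
    the gap between `probs` and this filtered version, encouraging spatially
    contiguous cluster predictions.

    probs: (H, W, K) array of per-pixel soft assignments over K clusters.
    """
    pad = window // 2
    padded = np.pad(probs, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.empty_like(probs)
    H, W, _ = probs.shape
    for i in range(H):
        for j in range(W):
            out[i, j] = padded[i:i + window, j:j + window].mean(axis=(0, 1))
    return out
```

Per Table 6, a 3 × 3 window would be used for Houston and HYPSO-1 and a 7 × 7 window for Trento and PaviaU. Because each output pixel is an average of valid probability vectors, the result still sums to one over the cluster axis.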
Table 7. Results of the ablation study.
| Model | Houston OA(%) | Houston NMI | Houston K | Trento OA(%) | Trento NMI | Trento K | PaviaU OA(%) | PaviaU NMI | PaviaU K | HYPSO-1 OA(%) | HYPSO-1 NMI | HYPSO-1 K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 72.53 | 0.7538 | 0.6555 | 81.98 | 0.8607 | 0.7701 | 54.19 | 0.6340 | 0.4638 | 83.72 | 0.6305 | 0.7563 |
| L | 72.47 | 0.7619 | 0.6545 | 81.85 | 0.8611 | 0.7651 | 60.13 | 0.6744 | 0.5282 | 83.68 | 0.6302 | 0.7558 |
| MC | 73.92 | 0.7843 | 0.6698 | 90.25 | 0.8898 | 0.8704 | 61.78 | 0.6743 | 0.5286 | 84.89 | 0.6369 | 0.7729 |
| MC&L | 74.41 | 0.7902 | 0.6759 | 90.61 | 0.9101 | 0.8746 | 69.48 | 0.7490 | 0.6250 | 84.96 | 0.6380 | 0.7739 |

The best results are highlighted in bold. Baseline is the original basis representation model proposed in [18]; L is the baseline model with local structure preservation; MC is the baseline model with non-local structure preservation; MC&L is the baseline model with non-local and local structure preservation constraints.
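The ablation variants amount to switching the two structural terms on and off in the training objective. A purely schematic sketch follows — the actual loss terms are defined in the paper, the function and argument names are hypothetical, and the assignment of β1 to the mini-cluster term and β2 to the local term is an assumption:

```python
def total_loss(l_base, l_local=0.0, l_mini_cluster=0.0,
               beta1=3.0, beta2=1.0, use_mc=True, use_local=True):
    """Schematic composition of the objective for the ablation variants:
    Baseline (both flags off), L (local only), MC (mini-cluster only),
    MC&L (both). beta1/beta2 correspond to the weights in Table 6."""
    loss = l_base
    if use_mc:
        loss += beta1 * l_mini_cluster   # non-local (mini-cluster) term
    if use_local:
        loss += beta2 * l_local          # local (spatial smoothness) term
    return loss

# MC&L variant with toy values: 1.0 + 3.0 * 0.2 + 1.0 * 0.5
print(total_loss(1.0, l_local=0.5, l_mini_cluster=0.2))  # → 2.1
```

Under this reading, the jump from Baseline to MC&L (e.g., 54.19% to 69.48% OA on PaviaU) comes from adding both weighted terms rather than from any change to the basis-representation backbone.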

Share and Cite

Li, X.; Nadisic, N.; Huang, S.; Deligiannis, N.; Pižurica, A. Scalable Context-Preserving Model-Aware Deep Clustering for Hyperspectral Images. Remote Sens. 2025, 17, 4030. https://doi.org/10.3390/rs17244030

