Article

Spectral–Spatial Superpixel Bi-Stochastic Graph Learning for Large-Scale and High-Dimensional Hyperspectral Image Clustering

Rocket Force University of Engineering, Xi’an 710025, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Remote Sens. 2025, 17(23), 3799; https://doi.org/10.3390/rs17233799
Submission received: 30 September 2025 / Revised: 17 November 2025 / Accepted: 21 November 2025 / Published: 23 November 2025
(This article belongs to the Section Remote Sensing Image Processing)

Highlights

What are the main findings?
  • We identify the dual challenges of linear complexity and the “curse of dimensionality” inherent in prevailing anchor-based methods, and accordingly propose a novel paradigm centered on superpixel encoding and data projection to resolve them.
  • We introduce a framework that learns a small-scale, sparse bi-stochastic graph from the data, supported by an optimization strategy that guarantees convergence to the optimal solution.
What are the implications of the main finding?
  • Extensive experiments demonstrate that our method achieves state-of-the-art clustering performance, showing significant superiority over existing competitors.
  • The proposed solution exhibits remarkable scalability and efficiency, establishing a new benchmark for clustering large-scale and high-dimensional hyperspectral images.

Abstract

Despite the substantial body of work that has scaled to large data via anchor-based strategies, these methods incur complexity linear in the sample size during iterative processes, making them time-consuming. Moreover, because feature dimensionality reduction is often overlooked in this procedure, most of them suffer from the “curse of dimensionality”. To address these issues simultaneously, we introduce a novel paradigm with a superpixel encoding and data projection strategy, which learns a small-scale bi-stochastic graph from a data matrix with large-scale pixels and high-dimensional spectral features to achieve effective clustering. Moreover, a symmetric neighbor search strategy is integrated into our framework to ensure the sparsity of the graph and further improve computational efficiency. For optimization, a simple yet effective strategy is designed, which simultaneously satisfies all bi-stochastic constraints while ensuring convergence to the optimal solution. To validate our model’s effectiveness and scalability, we conduct extensive experiments on hyperspectral images (HSIs) of various scales. The results demonstrate that our method achieves state-of-the-art clustering performance and scales well to large-scale and high-dimensional HSIs.

1. Introduction

Hyperspectral images (HSIs) capture abundant information from hundreds of spectral bands, providing a strong capability to discriminate land cover types—particularly those with highly similar signatures in color space [1]. This advantage has led to the widespread application of HSIs in high-level earth observation tasks, including mineral exploration, precision agriculture, and military monitoring [2,3,4,5]. However, pixel-level annotation for HSIs is often laborious and challenging in practice, driving the need for unsupervised classification methods, i.e., clustering. A central challenge in this task is to effectively learn the inherent similarity relationships within the data.
Graph-based methods play a pivotal role in representing the inherent relationships in complex data with a graph structure. This paradigm has driven significant advancements, particularly in tasks such as graph-based clustering [6], manifold learning-based dimensionality reduction [7], and graph embedding [8]. However, since graph construction generally exhibits a quadratic memory burden with respect to both sample and feature dimensions, graph-based methods demonstrate poor scalability—particularly when modeling HSIs, which are characterized by large-scale pixels and high-dimensional spectral features, leading to significant challenges for learning a graph representation [9].
To address scalability challenges, the anchor graph method is widely used. In this approach, a small set of m representative anchors is selected via random sampling or K-means [10], and an $n \times m$ graph is constructed for data representation. Although this approach reduces the storage and computational complexity to a linear function of n, it still faces scalability issues due to the $O(nmd + d^2)$ computational cost per iteration incurred by performing eigen-decomposition on the $n \times m$ graph. Moreover, optimal anchor selection remains an open challenge. Compounding this issue, the inherent randomness of these selection strategies and their failure to capture local pixel relationships often yield poorly discriminative anchors, leading to suboptimal clustering results despite the dimensionality reduction.
Another challenge of current works lies in precisely quantifying the affinities between nodes to faithfully capture the intrinsic geometric structure of the data [11]. Traditional graph construction methods rely on affinities measured by Euclidean distance or Gaussian functions [12,13], causing intractable parameter tuning (e.g., the bandwidth coefficient). Recent advances in graph learning have introduced adaptive neighbor theory [6,14,15]. This theory fundamentally reformulates graph construction as a constrained optimization problem. This paradigm simultaneously performs probabilistic affinity estimation and neighborhood control during the iterations, thereby obtaining a sparse graph that not only enhances robustness against outlier-induced noise but also reduces computational overhead [14]. Nevertheless, existing methods often ignore the necessary symmetry condition due to challenges in optimization design and commonly employ post-processing symmetry operations. Such a multi-stage paradigm leads to affinity shifts and inaccurate data relationships [16].
To address these issues, this paper proposes Spectral–Spatial Superpixel Bi-stochastic Graph Learning (S³BGL) for large-scale and high-dimensional hyperspectral image clustering. As shown in Figure 1, the effective reduction of both the pixel dimension (from n to m) and the feature dimension (from d to r), where $m \ll n$ and $r \ll d$, transforms the representation of large-scale and high-dimensional data into a compact graph learning process. A bi-stochastic constraint is imposed on this graph, ensuring non-negativity, symmetry, and double normalization to enhance representation quality. The main contributions of this paper are as follows:
  • We propose S³BGL, a novel approach that learns a bi-stochastic graph directly from the data matrix, enabling fully data-driven graph learning. Our method is a unified framework that integrates graph learning, superpixel encoding, and data projection, effectively reducing the storage and computational complexity to $O(m^2)$ and $O(m^2 r)$, respectively, independent of both the number of large-scale pixels n and the number of high-dimensional spectral bands d.
  • We provide in-depth theoretical insights that enable a comprehensive analysis of the connections between the proposed S³BGL and previous approaches.
  • We introduce a symmetric neighbor search mechanism to ensure the sparsity of the graph, which further accelerates the model.
  • We present a novel optimization strategy to address bi-stochastic constraints. Specifically, by deriving some key equivalent transformations, our method enables the simultaneous optimization of all bi-stochastic constraints, ensuring globally optimal solutions. The proposed approach exhibits strong extensibility, making it applicable to a broad range of objectives, thereby significantly advancing progress in related research fields.
  • We show that our proposed method consistently learns high-quality graphs, achieving state-of-the-art clustering performance. Additionally, compared to previous works, it can be more effectively scaled to large-scale and high-dimensional HSIs.
To enhance clarity, matrices are denoted by uppercase bold italic letters and vectors by lowercase letters (all vectors are column vectors). General notations are provided in Table 1.
Organization: Section 2 reviews related work. Section 3 describes the proposed S³BGL. Section 4 presents the design of an effective optimization scheme. Section 5 reports experiments that verify the merits of the proposed model and discusses the influence of the hyperparameters. Section 7 concludes the paper and provides an outlook.

2. Related Work

2.1. Bi-Stochastic Graph Learning

A bi-stochastic graph is characterized by an affinity matrix $W$ satisfying $w_i = w^i$, $\mathbf{1}^T w_i = 1$, and $w_{ij} \ge 0$ for all i, where $w_i = w^i$ is the symmetry condition, meaning that the i-th row $w_i$ and the i-th column $w^i$ of $W$ are equal. Since recent works have tended to learn a probabilistic graph under the constraints $\mathbf{1}^T w_i = 1$, $w_{ij} \ge 0$, bi-stochastic graph learning in essence combines graph symmetry and affinity estimation into one step. As shown in Figure 2, this graph learning paradigm has been proven to effectively enhance affinity relationships, yielding an ideal graph that accurately reveals the underlying data structure. As such, it has been extensively studied in the literature.
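To make the bi-stochastic conditions concrete, the following sketch applies the classical Sinkhorn–Knopp normalization (alternating row and column rescaling) to a symmetric positive affinity matrix. This is a standard way of producing a bi-stochastic matrix and is given here only as an illustration of the target property, not as any of the learning approaches discussed below.

```python
import numpy as np

def sinkhorn(A, iters=1000):
    """Sinkhorn-Knopp normalization: alternately rescale the rows and
    columns of a strictly positive matrix until it is (nearly) bi-stochastic.
    For a symmetric input, the limit is also symmetric."""
    W = A.astype(float).copy()
    for _ in range(iters):
        W /= W.sum(axis=1, keepdims=True)  # make every row sum to 1
        W /= W.sum(axis=0, keepdims=True)  # make every column sum to 1
    return W

rng = np.random.default_rng(0)
A = rng.random((5, 5)) + 0.1   # strictly positive affinities
A = (A + A.T) / 2              # symmetric input
W = sinkhorn(A)
```

The resulting W satisfies all three conditions (non-negativity, unit row/column sums, symmetry) up to numerical tolerance.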
A pioneering work termed DSN [16] first learned a bi-stochastic graph by minimizing Problem (1) as
$$\min_{W} \|W - A\|_F^2, \quad \text{s.t.}\ w_i = w^i,\ \mathbf{1}^T w_i = 1,\ w_{ij} \ge 0,$$
where $A$ and $W$ denote the affinity matrix of the input graph and the bi-stochastic result, respectively. $\|\cdot\|_F$ denotes the Frobenius norm, so that $W$ approximates the affinities in $A$ while obtaining the bi-stochastic property.
The optimization of Problem (1) relies on the VNSP lemma [17], which separates the bi-stochastic conditions into two sub-optimization problems as follows:
$$\min_{W} \|W - A\|_F^2 \quad \text{s.t.}\ w_i = w^i,\ \mathbf{1}^T w_i = 1,$$
$$\min_{W} \|W - A\|_F^2 \quad \text{s.t.}\ w_{ij} \ge 0.$$
According to [18], the closed-form solutions of Problems (2) and (3) are
$$W_1 = A + \left(\frac{1}{n}I + \frac{\mathbf{1}^T A \mathbf{1}}{n^2}I - \frac{1}{n}A\right)\mathbf{1}\mathbf{1}^T - \frac{1}{n}\mathbf{1}\mathbf{1}^T A,$$
$$W_2 = A_{+},$$
where $A_{+}$ means $a_{ij} = \max(a_{ij}, 0)$.
An alternating optimization strategy is then employed, which mutually projects the solutions until they converge to a common solution for both Problems (2) and (3).
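The alternating scheme can be sketched with the two closed-form projections above; this is a minimal illustration of the procedure, not the authors' implementation, and it assumes a symmetric non-negative input affinity matrix.

```python
import numpy as np

def proj_affine(A):
    """Closed-form projection onto {W : W = W^T, W 1 = 1}; assumes a
    symmetric input A. Note 1^T A 1 is simply the sum of all entries."""
    n = A.shape[0]
    I, J = np.eye(n), np.ones((n, n))
    return A + (I / n + A.sum() / n**2 * I - A / n) @ J - J @ A / n

def proj_nonneg(A):
    """Projection onto {W : W >= 0}: zero out negative entries."""
    return np.maximum(A, 0)

def dsn(A, iters=2000):
    """DSN-style alternating projections between the two constraint sets,
    run until the iterate settles in (near) their intersection."""
    W = (A + A.T) / 2
    for _ in range(iters):
        W = proj_nonneg(proj_affine(W))
    return W

rng = np.random.default_rng(0)
A = rng.random((6, 6))
W = dsn((A + A.T) / 2)
```

Both projections preserve symmetry, so the iterate stays symmetric throughout; the non-negativity projection is applied last, and the row sums converge toward 1 as the iterations proceed.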
Some recent works have learned a bi-stochastic graph to improve graph-based clustering performance [18,19,20,21,22,23,24]. However, they mainly follow the paradigm of DSN, which involves a bi-stochastic graph approximation problem with VNSP-based optimization. Such a paradigm is influenced by the accuracy of the precalculated affinities of the input graph. Moreover, applying the VNSP lemma requires separating the bi-stochastic constraints, forming two subproblems for alternating optimization. This approach is time-consuming and difficult to extend to other objectives, limiting the diversity of potential graph learning formulations.

2.2. Recent Large-Scale HSI Clustering Models

The anchor-based approach is the most widely used method for large-scale HSI clustering. It selects m representative anchors ( m n ) to construct a pixel–anchor graph, thereby reducing both memory usage and computational complexity to a linear function of n. Some studies [25,26] have employed random sampling or K-means to generate anchors for pixel–anchor graph learning, followed by singular value decomposition (SVD) and additional discretization steps. However, such multi-stage pipelines and the instability of post-processing often limit clustering performance. To address this, increasing efforts have been devoted to developing end-to-end frameworks. For instance, SGCNR [27] introduced a non-negative orthogonal relaxation strategy to directly derive cluster labels from low-dimensional embeddings. GNMF [28] proposed a unified model based on non-negative matrix factorization to produce soft cluster assignments. SSAFC [29] presented a joint learning framework combining self-supervised spectral clustering and fuzzy clustering to directly output soft clustering indicators. Furthermore, BGPC [30] introduced a structured bipartite graph model that enforces low-rank constraints on the bipartite graph, enabling direct clustering via connectivity analysis. SGLSC [31] proposed a graph-learning method from both global and local aspects. The global graph is constructed in a self-expressive manner, whereas the local graph is built by leveraging Manhattan distance. SAGC [26] proposed a fast fuzzy clustering method for large-scale HSI processing. MCDLT [32] developed a graph selection mechanism with dynamic low-rank tensor approximation to fuse multi-order graphs and enhance HSI clustering. Recently, a number of subspace-based HSI clustering methods have been proposed. For example, ref. [33] introduced a structure prior-guided subspace clustering approach that simultaneously integrates local and non-local spatial information along with clustering priors. 
The homogeneity of the clustering results is enhanced through an $\ell_{2,1}$-norm-based constraint to improve clustering performance. Ref. [34] proposed a subspace graph construction method combining pixel-wise and spectral-wise graphs and presented three graph fusion strategies. Ref. [35] developed a superpixel-based neighborhood shrinkage and contrastive subspace clustering method for hyperspectral images. Benefiting from the rapid development of deep neural networks (DNNs), several deep HSI clustering models have also emerged. For instance, refs. [36,37] introduced a self-supervised, efficient low-pass contrastive graph clustering method for HSIs, which preserves local spatial–spectral consistency while significantly reducing graph complexity. Ref. [38] proposed an adaptive affinity structure graph clustering approach that first generates homogeneous regions to process hyperspectral images and construct an initial graph, and then designs an adaptive filtering graph encoder to capture both high- and low-frequency features. Ref. [39] presented a masked superpixel contrastive subspace clustering method for large-scale HSI classification. By adopting a hyperspectral masked autoencoder instead of a conventional autoencoder as the backbone network, this method mitigates interference from non-target region samples during feature extraction, as well as the over-smoothing of superpixel-level samples. Despite these advances, current methods often face two major challenges. First, performing SVD on an $n \times m$ pixel–anchor graph incurs a complexity of $O(m^2 n + m^3)$, which remains relatively high. Second, existing anchor learning strategies fail to effectively leverage the spatial contextual information among pixels, resulting in less representative anchors and suboptimal clustering performance.

3. Materials and Methods

Our S³BGL first adopts superpixel encoding to capture spatial contextual information and generate representative anchors, while a data projection strategy is introduced to reduce the feature dimensionality. We then learn a small-scale anchor–anchor graph for clustering, which confines the computational complexity of the graph learning process to the scale of the dimensionality-reduced data. A bi-stochastic constraint is imposed during graph learning to achieve an accurate graph representation. Moreover, a symmetric neighbor search strategy is integrated into our framework to ensure graph sparsity and further improve computational efficiency.

3.1. Superpixel Encoding and Anchor Generation

Anchor generation methods such as random sampling or K-means fail to capture local pixel relationships, consequently yielding poorly discriminative anchors and suboptimal clustering results. To solve this problem, we propose a superpixel encoding method that learns discriminative anchors by leveraging the spatial contextual relationships of the pixels. Concretely, as shown in Figure 3, we first employ the Entropy Rate Superpixel (ERS) [40] segmentation method to partition the image into local regions. ERS groups spatially correlated pixels into coherent superpixels in an unsupervised manner based on their local correlations, yielding a specified number of superpixels. Denoting the HSI data matrix as $X \in \mathbb{R}^{d \times n}$, where n is the total number of pixels and d is the feature dimension, the anchors are then computed as the mean feature vector within each superpixel, formulated as
$$a_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j,$$
where $S_i$ represents the i-th superpixel segment and $x_j \in \mathbb{R}^{d \times 1}$ denotes the spectral feature vector of the j-th pixel within $S_i$. As can be seen in Figure 3, the proposed superpixel encoding and anchor generation method offers three distinct advantages:
  • Discriminative Anchors: Since a superpixel encodes a local homogeneous region, the majority of pixels within it typically belong to the same land-cover category. This phenomenon becomes more pronounced as the number of superpixels (i.e., anchors) increases. Consequently, the learned anchors, derived from the spectral features of pixels within the same category, exhibit significantly enhanced discriminative power.
  • Flexibility: The partitioning of superpixels is entirely based on the local similarity relationships among pixels. This allows for a variable number of pixels within each superpixel, resulting in an accurate and highly flexible image partition that adapts to the intrinsic image structure.
  • Uniform Spatial Distribution: Anchors are generated by applying average weighting to all pixels within a superpixel, leading to their relatively uniform spatial distribution across the whole HSI. This ensures comprehensive coverage of information from different land-cover categories while the anchors are produced from local pixel contexts.
Therefore, this anchor generation strategy has theoretical interpretability, which ensures effective encoding of the spatial–spectral characteristics of local regions, thereby benefiting the generation of discriminative anchors and accurate clustering.
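Assuming a superpixel label map is already available (the ERS segmentation itself is not reproduced here), Equation (6) reduces to a per-segment mean over spectral vectors, which can be sketched as:

```python
import numpy as np

def superpixel_anchors(X, labels, m):
    """Compute anchors as the mean spectral vector of each superpixel
    (Equation (6)). X is d x n with spectra in columns; `labels` assigns
    each of the n pixels to one of m superpixels. The ERS segmentation
    producing `labels` is assumed to be precomputed."""
    d, n = X.shape
    A = np.zeros((d, m))
    for i in range(m):
        A[:, i] = X[:, labels == i].mean(axis=1)  # mean over pixels in S_i
    return A

# Toy example: 4 pixels with 3 spectral bands, split into 2 superpixels.
X = np.array([[1., 3., 10., 12.],
              [0., 2., 20., 22.],
              [5., 7., 30., 32.]])
labels = np.array([0, 0, 1, 1])
A = superpixel_anchors(X, labels, 2)
```

Each column of A is the spectral centroid of one superpixel, so the anchors inherit the spatial coherence of the segmentation.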

3.2. Bi-Stochastic Graph Learning

Current works generally use non-negativity and sum-to-one constraints to learn probabilistic affinity matrices. This paradigm keeps the iterative optimization simple by ignoring the symmetry constraint, but relies on post-processing to obtain artificially symmetrized graphs. This separation of affinity measurement and graph symmetrization limits the accuracy of the graph representation [41]. Although some works [20,21,22] have considered bi-stochastic graph learning, they learn a bi-stochastic approximation of a pre-calculated graph based on the DSN model, and are therefore affected by the accuracy of that pre-calculated graph. Moreover, directly using doubly stochastic graphs to estimate pixel correlations incurs an $O(n^2)$ storage burden and an $O(n^3)$ computational burden. In this paper, we propose a model that directly learns bi-stochastic graphs from small-scale anchor matrices, achieving effective clustering for large-scale and high-dimensional HSIs. To this end, we first introduce an adaptive neighbor graph construction approach:
$$\min_{W} \sum_{i,j=1}^{m} \left( \|a_i - a_j\|_2^2 w_{ij} + \gamma w_{ij}^2 \right) \quad \text{s.t.}\ \mathbf{1}^T w_i = 1,\ w_{ij} \ge 0,$$
where $W \in \mathbb{R}^{m \times m}$ denotes the affinity matrix of the graph and m denotes the number of anchors. The term $\|a_i - a_j\|_2^2 w_{ij}$ means that, for any anchor $a_i$, every other node (e.g., $a_j$) can be its neighbor with probabilistic affinity $w_{ij}$, under the key assumption that two nodes with a smaller Euclidean distance $\|a_i - a_j\|_2^2$ have a larger probability $w_{ij}$ of being neighbors. The second term $w_{ij}^2$, weighted by the hyperparameter $\gamma$, is a regularization term that avoids the trivial solution in which only the nearest node becomes a neighbor with probability 1. $w_i$ denotes the i-th row of $W$ and $\mathbf{1}$ is a column vector of all ones. Therefore, $\mathbf{1}^T w_i = 1$ is the row-sum constraint and $w_{ij} \ge 0$ is the non-negativity constraint, which together bound the probabilistic affinities within [0, 1]. Problem (7) thus simultaneously estimates the affinities and controls the neighborhood in an adaptive way.
If we denote a vector $d_i \in \mathbb{R}^{m \times 1}$ whose j-th element is $d_{ij} = \|a_i - a_j\|_2^2$, we obtain the matrix form
$$\min_{W}\ \mathrm{Tr}(D^T W) + \gamma \|W\|_F^2 \quad \text{s.t.}\ W\mathbf{1} = \mathbf{1},\ W \ge 0,$$
where $D \in \mathbb{R}^{m \times m}$ is the matrix stacking the distance vectors $d_i$, $i \in \{1, 2, \ldots, m\}$. Problem (8) learns a probabilistic affinity matrix $W$ under the constraints $W\mathbf{1} = \mathbf{1}$, $W \ge 0$. Since the necessary symmetry condition is ignored during affinity estimation, the post-processing step $W = (W + W^T)/2$ must be applied. Such a method fails to generate a strictly probabilistic graph and can lead to poor connectivity, as the node degrees may vary significantly.
To address this issue, we propose the following model:
$$\min_{W}\ \mathrm{Tr}(D^T W) + \gamma \|W\|_F^2 \quad \text{s.t.}\ W = W^T,\ W\mathbf{1} = \mathbf{1},\ W \ge 0,$$
where $W = W^T$, $W\mathbf{1} = \mathbf{1}$, and $W \ge 0$ are the bi-stochastic constraints on the matrix $W$, ensuring a symmetric probabilistic graph. $\gamma$ is a hyperparameter, which we set adaptively to avoid parameter tuning. The proposed S³BGL combines affinity estimation and graph symmetrization into one step, streamlining the procedure and generating the optimal graph with all node degrees equal to 1.
Since the symmetry condition $W = W^T$ is hard to handle directly, we next present an equivalent transformation via Theorem 1.
Theorem 1.
If $\mu$ is a sufficiently large value, Problem (9) is equivalent to Problem (10):
$$\min_{W}\ \mathrm{Tr}(D^T W) + \gamma \|W\|_F^2 + \mu \|W - W^T\|_F^2 \quad \text{s.t.}\ W\mathbf{1} = \mathbf{1},\ W \ge 0.$$
Theorem 1 is essentially a simplified form obtained by applying the Augmented Lagrangian Method (ALM) to Problem (9); a detailed explanation is provided in Appendix A. By Theorem 1, we in essence embed the symmetry condition into the objective function. During the iterations, if $\mu$ is magnified to a sufficiently large value, the second term enforces $W - W^T = 0$, and thus the symmetry condition is spontaneously satisfied. Therefore, combined with the constraints $W\mathbf{1} = \mathbf{1}$, $W \ge 0$, Problem (10) simultaneously optimizes all the bi-stochastic constraints to the globally optimal solution owing to its convexity. The core of Problem (10) is deciding the value of $\mu$. We set this parameter, which governs the bi-stochasticity of the learned graph, heuristically, thereby avoiding manual tuning.

3.3. Data Projection

In some cases, the data matrix may have an extremely high feature dimension. To address the “curse of dimensionality”, we incorporate an effective feature projection method and rewrite Problem (10) as follows:
$$\min_{W, P}\ \mathrm{Tr}(Q^T W) + \gamma \|W\|_F^2 + \mu \|W - W^T\|_F^2 \quad \text{s.t.}\ W\mathbf{1} = \mathbf{1},\ W \ge 0,\ P^T S P = I,\ S = XX^T,$$
where $P \in \mathbb{R}^{d \times r}$ is the projection matrix and $Q \in \mathbb{R}^{m \times m}$ is the anchor distance matrix in the projected space, with elements $q_{ij} = \|P^T a_i - P^T a_j\|_2^2$. The constraint $P^T S P = I$ reduces the original high feature dimension d to r orthogonal dimensions. Although the dimensionality reduction in Problem (11) may entail some performance degradation due to information loss, it enables our model to handle high-dimensional data effectively. More importantly, for the highly redundant spectral features of HSIs, dimensionality reduction actually improves clustering accuracy in most cases. Therefore, our model constitutes a unified graph learning framework that simultaneously reduces both the sample and feature dimensions, ensuring efficient scalability to large-scale and high-dimensional data. After iterative optimization, the bi-stochastic graph learned from Problem (11) undergoes spectral decomposition to generate the final clustering results.

3.4. Symmetry Neighbor Search

In practice, learning a sparse graph that focuses on data locality often yields better performance by suppressing the influence of outliers. Moreover, it reduces computational complexity by eliminating unnecessary connections. Therefore, it is advantageous to restrict connections to the c nearest neighbors of each node. Below, we introduce a symmetric neighbor search method to achieve local connectivity, which is tailored for our proposed model.
Since our model determines affinities based on Euclidean distance, it is natural to preserve local connectivity by updating affinities only for the top-c smallest distances. Without loss of generality, assume the distances $q_{i1}, q_{i2}, \ldots, q_{im}$ for node i are sorted in ascending order. We then update affinities only for the c indices corresponding to the smallest distances, $q_{i1}$ to $q_{ic}$. This results in a c-neighbor connectivity pattern and yields an index matrix $G$, illustrated in Figure 4a. However, this row-wise search for the c nearest neighbors does not guarantee a symmetric neighborhood. As our model learns a symmetric probabilistic graph through bi-stochastic constraints, we propose a symmetric neighbor search strategy. Specifically, we utilize the setdiff function in MATLAB to compute the set difference between the indices in $g_i$ and $g^i$, where $g_i$ and $g^i$ denote the i-th row and i-th column of $G$, respectively. This set difference is denoted $g_d$. The neighbors of each node are then updated by concatenating the original and differential indices: $k_i = [g_i, g_d]$. Consequently, the missing indices at symmetric positions are compensated for, producing a symmetric neighbor matrix, as shown in Figure 4b.
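The same index bookkeeping can be sketched in NumPy (a toy illustration, not the paper's code): the concatenation $k_i = [g_i, g_d]$ amounts to taking the union of each node's row-wise neighbors with the nodes that selected it, which is what set insertion implements below in place of MATLAB's setdiff.

```python
import numpy as np

def symmetric_neighbors(Q, c):
    """Symmetric c-neighbor search. For each node i, take the indices of
    the c smallest distances in row i of Q, then add every j that chose i,
    so that j is a neighbor of i whenever i is a neighbor of j."""
    m = Q.shape[0]
    G = np.argsort(Q, axis=1)[:, :c]   # row-wise c nearest neighbors
    K = [set(G[i]) for i in range(m)]
    for i in range(m):
        for j in G[i]:
            K[j].add(i)                # compensate the missing symmetric index
    return [np.array(sorted(s)) for s in K]

# Points on a line: the plain row-wise c-NN search is asymmetric here.
pts = np.array([0.0, 1.0, 2.0, 10.0])
Q = (pts[:, None] - pts[None, :]) ** 2
K = symmetric_neighbors(Q, 2)
```

After compensation, membership is symmetric: node 2 appears in the neighbor list of node 1 because node 2 selected node 1, even though node 1 did not select node 2.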

3.5. Connections to Previous Works

In this subsection, we provide in-depth insights into the relation between the proposed S³BGL and classic works.

3.5.1. S³BGL vs. Ratio Cut [42]

When $\mu = 0$ and $\gamma \to \infty$, the proposed S³BGL is formally equivalent to ratio cut.
Proof. 
First, since the learned matrix $W$ can be divided into k submatrices $\{W^{[1]}, W^{[2]}, \ldots, W^{[k]}\}$ along the main diagonal, the objective of Problem (10) can be decomposed into k parallel and independent problems as
$$\min_{W^{[l]}} \sum_{l=1}^{k} \mathrm{Tr}\big(D^{[l]T} W^{[l]}\big) + \gamma \|W^{[l]}\|_F^2 + \mu \|W^{[l]} - W^{[l]T}\|_F^2 \quad \text{s.t.}\ W^{[l]}\mathbf{1} = \mathbf{1},\ W^{[l]} \ge 0.$$
When $\gamma \to \infty$, the regularization term $\gamma \|W^{[l]}\|_F^2$ forces uniform cluster distributions, i.e., $W^{[l]} = \frac{1}{m_l}\mathbf{1}\mathbf{1}^T$, where $m_l$ denotes the number of nodes in submatrix $W^{[l]}$ and thus $m = \sum_{l=1}^{k} m_l$. If $\mu = 0$, we arrive at
$$\min_{W} \sum_{l=1}^{k} \mathrm{Tr}\big(D^{[l]T} W^{[l]}\big) \quad \text{s.t.}\ W^{[l]} = \frac{1}{m_l}\mathbf{1}\mathbf{1}^T.$$
Introduce a binary cluster indicator matrix $Y \in \{0, 1\}^{m \times k}$, where each row contains a single 1 at the position indicating the node’s assigned cluster, with all other entries being 0. Noting that $y_l^T y_l = m_l$ for the l-th column $y_l$, we obtain
$$\min_{Y \in \mathrm{Ind}} \sum_{l=1}^{k} \frac{y_l^T D\, y_l}{y_l^T y_l},$$
which is formally equivalent to ratio cut, except that $D$ is a distance matrix rather than a probabilistic affinity matrix.    □

3.5.2. S³BGL vs. K-Means [43]

When $\mu = 0$ and $\gamma \to \infty$, the proposed S³BGL is equivalent to applying centroid-based K-means clustering to the centralized data matrix $XH$, where $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ is the centering matrix.
Proof. 
By defining the centering matrix $H$, we can transform Problem (13) into
$$\min_{Y \in \mathrm{Ind}}\ \mathrm{Tr}\left(Y (Y^T Y)^{-1} Y^T H D H\right).$$
Since $D$ is the matrix of squared Euclidean distances $d_{ij} = \|x_i - x_j\|_2^2$, we have $HDH = -2HX^TXH$, and Problem (15) is equivalent to
$$\max_{Y \in \mathrm{Ind}}\ \mathrm{Tr}\left(Y (Y^T Y)^{-1} Y^T H X^T X H\right),$$
which is exactly centroid-based K-means clustering on the centralized data matrix $XH$.    □
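The identity $HDH = -2HX^TXH$ used in this proof can be checked numerically on random data:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 7
X = rng.standard_normal((d, n))

# Pairwise squared Euclidean distances d_ij = ||x_i - x_j||_2^2,
# via d_ij = ||x_i||^2 + ||x_j||^2 - 2 x_i^T x_j.
sq = (X ** 2).sum(axis=0)
D = sq[:, None] + sq[None, :] - 2 * X.T @ X

H = np.eye(n) - np.ones((n, n)) / n   # centering matrix H = I - (1/n) 11^T
lhs = H @ D @ H
rhs = -2 * H @ X.T @ X @ H
```

The norm terms of D lie in the span of rank-one matrices annihilated by H (since H1 = 0), which is why only the cross-term −2XᵀX survives the double centering.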

3.5.3. S³BGL vs. LPP [44]

Our proposed S³BGL implicitly employs the Locality Preserving Projections (LPP) framework for feature projection. In standard LPP, the projection matrix is learned under the constraint $P^T X D X^T P = I$, where $D$ is the degree matrix of the affinity graph. In S³BGL, however, the learned graph $W$ is bi-stochastic, which implies that its degree matrix is the identity ($D = I$). Substituting $D = I$ into the LPP constraint simplifies it to $P^T X X^T P = I$. Consequently, in terms of the projection model, S³BGL is equivalent to a specific and efficient instantiation of LPP. This formulation offers a significant computational advantage: by circumventing the explicit construction and manipulation of the dense degree matrix $D$, our method achieves higher efficiency, which is particularly beneficial for large sample sizes.

4. Optimization and Analysis

4.1. Optimization Scheme

Since the objective function involves two variables—the bi-stochastic anchor–anchor graph W and the projection matrix P —we employ an alternating iterative method to optimize the model.

4.2. Update P with Others Fixed

Problem (11) becomes
$$\min_{P} \sum_{i,j=1}^{m} \|P^T a_i - P^T a_j\|_2^2 w_{ij} \quad \text{s.t.}\ P^T S P = I,\ S = XX^T.$$
From the property of the Laplacian matrix, we know $2\,\mathrm{Tr}(F^T L_W F) = \sum_{i,j=1}^{m} \|f_i - f_j\|_2^2 w_{ij}$. Therefore, stacking the anchors as $A = [a_1, \ldots, a_m] \in \mathbb{R}^{d \times m}$, Problem (17) is equivalent to
$$\min_{P}\ \mathrm{Tr}\left(\frac{P^T A L_W A^T P}{P^T S P}\right) \quad \text{s.t.}\ P^T S P = I,\ S = XX^T.$$
Let $\hat{P} = S^{1/2} P$; we further obtain
$$\min_{\hat{P}}\ \mathrm{Tr}\left(\frac{\hat{P}^T S^{-1/2} A L_W A^T S^{-1/2} \hat{P}}{\hat{P}^T \hat{P}}\right) \quad \text{s.t.}\ \hat{P}^T \hat{P} = I,$$
which is an eigen-decomposition problem on $S^{-1/2} A L_W A^T S^{-1/2}$ (equivalently, a generalized eigenproblem on $S^{-1} A L_W A^T$, where $S^{-1}$ is the matrix inverse of $S$). According to the generalized Rayleigh quotient, the solution is formed by the eigenvectors corresponding to the r smallest eigenvalues, where r is the dimensionality of the reduced feature space.
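A sketch of this P-update as a generalized symmetric eigenproblem follows; the small ridge added to S for numerical invertibility is an implementation convenience of ours, not part of the derivation, and `update_P` is a hypothetical helper name.

```python
import numpy as np
from scipy.linalg import eigh

def update_P(X, A, W, r, reg=1e-8):
    """P-update: minimize Tr(P^T A L_W A^T P) s.t. P^T S P = I, S = X X^T.
    A (d x m) stacks the anchors; the eigenvectors of the r smallest
    generalized eigenvalues form the d x r projection matrix P."""
    L = np.diag(W.sum(axis=1)) - W          # graph Laplacian L_W
    M = A @ L @ A.T
    S = X @ X.T + reg * np.eye(X.shape[0])  # ridge keeps S positive definite
    vals, vecs = eigh(M, S)                 # generalized symmetric eigenproblem
    return vecs[:, :r]                      # columns satisfy P^T S P = I

rng = np.random.default_rng(0)
d, n, m, r = 5, 40, 6, 2
X = rng.standard_normal((d, n))
A = rng.standard_normal((d, m))
W0 = rng.random((m, m)); W0 = (W0 + W0.T) / 2
P = update_P(X, A, W0, r)
```

`scipy.linalg.eigh(M, S)` normalizes its eigenvectors so that $v^T S v = I$, which directly yields the constraint $P^T S P = I$ without forming $S^{-1/2}$ explicitly.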

4.3. Update W with Others Fixed

Problem (11) becomes
$$\min_{W}\ \mathrm{Tr}(Q^T W) + \gamma \|W\|_F^2 + \mu \|W - W^T\|_F^2 \quad \text{s.t.}\ W\mathbf{1} = \mathbf{1},\ W \ge 0.$$
For Problem (20), the key challenge arises because the objective function simultaneously depends on both W and its transpose W T . To address this, we introduce Theorem 2:
Theorem 2.
Consider a matrix variable M and an optimization problem of the form:
$$\min_{M}\ \|M - M^T + \Theta\|_F^2 \quad \text{s.t.}\ M\mathbf{1} = \mathbf{1},\ M \ge 0,$$
whose objective function depends on both the variable $M$ and its transpose $M^T$. Here, $\Theta$ is a given matrix independent of $M$. Furthermore, the constraints on $M$ are separable with respect to its rows, such as $M\mathbf{1} = \mathbf{1}$ and $M \ge 0$. Such a problem can be solved row by row and is equivalent to a set of independent vector-form subproblems:
$$\min_{m_i}\ \|m_i - m^i + \theta_i\|_2^2 + \|m^i - m_i + \theta^i\|_2^2 \quad \text{s.t.}\ \forall i,\ \mathbf{1}^T m_i = 1,\ m_i \ge 0,$$
where $m_i - m^i + \theta_i$ and $m^i - m_i + \theta^i$ denote the i-th row and i-th column of the matrix $M - M^T + \Theta$, respectively.
Proof. 
A graphical illustration is provided in Figure 5. When i = 1, the row constraints $\mathbf{1}^T m_i = 1$ and $m_i \ge 0$ apply to all elements $m_{11}, m_{12}, \ldots, m_{1n}$ in the first row of $M$. However, because the objective involves both $M$ and $M^T$, if we only update the first row of $M - M^T + \Theta$ (shown in red in Figure 5a), certain elements in the first row of $M$, namely $m_{12}, m_{13}, \ldots, m_{1n}$ (blue in Figure 5a), are not directly constrained in the corresponding columns of $M^T$. Therefore, the terms $\|m_i - m^i + \theta_i\|_2^2$ and $\|m^i - m_i + \theta^i\|_2^2$ jointly constrain both the i-th row and the i-th column of $M - M^T + \Theta$. This ensures that, for any i, all elements subject to the constraints $\mathbf{1}^T m_i = 1$ and $m_{ij} \ge 0$ are actively involved in the optimization (as illustrated by the yellow highlighting in Figure 5b).
Applying Theorem 2, we reformulate the final term in the objective of Problem (20) in vector form, leading to
$$\min_{w_i}\ q_i^T w_i + \gamma \|w_i\|_2^2 + \mu \|w_i - w^i\|_2^2 + \mu \|w^i - w_i\|_2^2 \quad \text{s.t.}\ \forall i,\ \mathbf{1}^T w_i = 1,\ w_{ij} \ge 0.$$
Here, $w_i - w^i$ and $w^i - w_i$ represent the i-th row and i-th column of the matrix $W - W^T$, respectively. Through simple algebraic manipulation, we obtain
$$\min_{\mathbf{1}^T w_i = 1,\ w_{ij} \ge 0}\ \|w_i - \vartheta_i\|_2^2,$$
where $\vartheta_i = (4\mu w^i - q_i)/(4\mu + 2\gamma)$. Problem (24) can be efficiently solved using Newton’s method, as detailed in the next subsection.    □
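Problem (24) is the Euclidean projection onto the probability simplex. Besides the Newton root-finding used in the paper, the same projection admits a well-known sort-based solution, sketched below for reference (the helper name `proj_simplex` is ours):

```python
import numpy as np

def proj_simplex(v):
    """Euclidean projection of v onto {w : 1^T w = 1, w >= 0}, i.e. the
    minimizer of ||w - v||_2^2 over the probability simplex (a sort-based
    alternative to the Newton iteration described in the text)."""
    u = np.sort(v)[::-1]                       # sort in descending order
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1 - css) / idx > 0)[0][-1]
    theta = (1 - css[rho]) / (rho + 1)         # optimal constant shift
    return np.maximum(v + theta, 0)

# The row update of Problem (24) is then w_i = proj_simplex(vartheta_i).
```

The projection shifts all entries by a common constant and clips at zero, so the output always sums to one and is non-negative.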
Model Acceleration: We alternately update $W$ and $P$ until the solution to Problem (11) converges to the optimal bi-stochastic matrix $W^*$. In practice, we update $\mu$ using a heuristic strategy to avoid manual parameter tuning and to accelerate convergence. Specifically, we initialize $\mu = 1$, thereby assigning equal importance to the affinity estimation and graph symmetry terms in the objective function. After each iteration, we compute an error matrix $E = D_W - I$ to quantify the deviation of $W$ from the bi-stochastic condition. For a perfectly bi-stochastic matrix $W$, the degree matrix $D_W$ must equal the identity matrix $I$, meaning all diagonal entries are 1 and all off-diagonal entries are 0. Thus, if $|E_{ii}| < 0.001$ for all $i \in \{1, 2, \ldots, m\}$, we consider the current $W$ to satisfy the bi-stochastic property sufficiently and terminate the optimization early. Otherwise, we double the value of $\mu$ to strengthen the enforcement of the bi-stochastic constraint in the next iteration. For clarity, the complete optimization procedure for Problem (11) is summarized in Algorithm 1.
Algorithm 1 Algorithm to Solve Problem (11)
  • Input: data matrix $X \in \mathbb{R}^{d \times n}$;
  • Output: bi-stochastic matrix $W \in \mathbb{R}^{m \times m}$;
  • Preprocessing: partition the HSI into $m$ superpixels using ERS, then derive the anchors $a_i$ (where $i \in \{1, 2, \ldots, m\}$) by applying Equation (6);
  • Initialize $\mu$, $\gamma$, and $W$;
  •      While not converged do
  •        (1) Update $P \in \mathbb{R}^{r \times d}$ by solving Problem (18);
  •        (2) Update $W$ by solving Problem (24);
  •        (3) If $|E_{ii}| < 0.001$ for all $i \in \{1, 2, \ldots, m\}$: output $W$ in advance;
  •           else: update $\mu$ by $\mu = 2\mu$;
  •      end While
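Step (3) of Algorithm 1 can be sketched in a few lines. The fragment below is a minimal Python illustration (our own naming, not the authors' MATLAB implementation) of the convergence test $E = D_W - I$ and the heuristic $\mu$-doubling rule described above:

```python
import numpy as np

def bistochastic_check(W, mu, tol=1e-3):
    """Convergence test of Algorithm 1: accept W as bi-stochastic when
    every diagonal entry of E = D_W - I is within tol of zero, where
    D_W is the degree matrix (diagonal of row sums of W); otherwise
    double mu to strengthen the bi-stochastic constraint."""
    degrees = W.sum(axis=1)        # diagonal of D_W
    e_diag = degrees - 1.0         # diagonal of E = D_W - I
    if np.max(np.abs(e_diag)) < tol:
        return True, mu            # output W in advance
    return False, 2.0 * mu         # enlarge mu for the next iteration
```

A perfectly bi-stochastic $W$, e.g., the constant matrix with entries $1/m$, passes the test immediately, while any $W$ whose rows do not sum to one triggers the $\mu$ update.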

4.4. Computational Complexity Analysis

Our $S^3$BGL method alternately updates $W$ and $P$. Updating $P$ has a complexity of $O(r^3)$, where $r$ is the reduced spectral dimensionality after projection. Updating $W$ costs $O(m^2 r)$ operations, where $m$ is the number of anchors. Updating $\mu$ incurs only $O(1)$ cost, which is negligible. Consequently, the per-iteration computational complexity of our model is independent of the original data dimensions $n$ and $d$. In contrast, recent large-scale HSI clustering methods typically require eigendecomposition of an $n \times m$ graph, incurring a per-iteration complexity of at least $O(nmd + d^2)$. This comparison highlights the clear efficiency advantage of our approach, which efficiently handles both large-scale and high-dimensional data.

4.5. Convergence Analysis of Problem (23)

Through equivalent derivations, we have transformed the problem with bi-stochastic constraints into Problem (23). Using Theorem 3, we prove that the proposed model is theoretically guaranteed to converge.
Theorem 3.
The objective value of Problem (23) decreases monotonically during the iterations and converges rapidly to the root via Newton’s method.
Proof. 
We begin by defining the Lagrangian function for Problem (23):
$$\mathcal{L}(w_i, \alpha, \beta) = \frac{1}{2}\|w_i - \vartheta\|_2^2 - \alpha\left(\mathbf{1}^T w_i - 1\right) - \beta^T w_i,$$
where $\alpha$ and $\beta$ are the Lagrange multipliers. Let $w_i^*$ denote the optimal solution to Problem (23), with corresponding Lagrange multipliers $\alpha^*$ and $\beta^*$. According to the Karush–Kuhn–Tucker (KKT) conditions, the following holds:
$$\begin{cases} \forall j,\; w_{ij}^* - \vartheta_j - \alpha^* - \beta_j^* = 0 & (a) \\ \forall j,\; w_{ij}^* \geq 0 & (b) \\ \mathbf{1}^T w_i^* = 1 & (c) \\ \forall j,\; \beta_j^* \geq 0 & (d) \\ \forall j,\; w_{ij}^*\beta_j^* = 0 & (e) \end{cases}$$
The vector form of Equation (26)(a) is $w_i^* - \vartheta - \alpha^*\mathbf{1} - \beta^* = 0$. Due to the constraint $\mathbf{1}^T w_i^* = 1$, we obtain $\alpha^* = \frac{1 - \mathbf{1}^T\vartheta - \mathbf{1}^T\beta^*}{n}$. Substituting $\alpha^*$, we obtain the optimal $w_i^*$ as
$$w_i^* = \vartheta + \frac{1 - \mathbf{1}^T\vartheta}{n}\mathbf{1} - \frac{\mathbf{1}^T\beta^*}{n}\mathbf{1} + \beta^*.$$
Let $p = \vartheta + \frac{1 - \mathbf{1}^T\vartheta}{n}\mathbf{1}$ and $\hat{\beta}^* = \frac{\mathbf{1}^T\beta^*}{n}$; then, Equation (26) can be simplified as $w_i^* = p - \hat{\beta}^*\mathbf{1} + \beta^*$. For all $j$, this gives
$$w_{ij}^* = p_j - \hat{\beta}^* + \beta_j^*.$$
Based on Equations (26)(b)(d)(e) and (27), we know $p_j - \hat{\beta}^* + \beta_j^* = (p_j - \hat{\beta}^*)_+$, where $(s)_+ = \max(s, 0)$ for a scalar $s$. As a result, once the value of $\hat{\beta}^*$ is obtained, the optimal $w_{ij}^* = (p_j - \hat{\beta}^*)_+$ can be calculated.
Rearranging Equation (27), we easily obtain $\beta_j^* = w_{ij}^* + \hat{\beta}^* - p_j$. Based on Equations (26)(b)(d)(e), we arrive at $\beta_j^* = (\hat{\beta}^* - p_j)_+$. Due to $\hat{\beta}^* = \frac{\mathbf{1}^T\beta^*}{n}$, we further obtain $\hat{\beta}^* = \frac{1}{n}\sum_{j=1}^{n}(\hat{\beta}^* - p_j)_+$. So far, we can define a function of $\hat{\beta}$ as
$$F(\hat{\beta}) = \frac{1}{n}\sum_{j=1}^{n}(\hat{\beta} - p_j)_+ - \hat{\beta}.$$
As can be seen, $\hat{\beta}^*$ can be determined by solving for the root of Problem (28), i.e., $F(\hat{\beta}^*) = 0$. Since $\hat{\beta} \geq 0$, $F(\hat{\beta}^t) \geq 0$, and $F'(\hat{\beta}^t) \leq 0$, $F$ is a piecewise-linear convex function whose root can be effectively found using Newton's method as
$$\hat{\beta}^{t+1} = \hat{\beta}^t - \frac{F(\hat{\beta}^t)}{F'(\hat{\beta}^t)}.$$
Thus, the optimal β ^ * is obtained through iterative updates, ensuring the convergence of Problem (23).    □
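In practice, this root search amounts to a fast Euclidean projection onto the probability simplex. The following Python sketch (our own function name; the paper's experiments use MATLAB) implements the derivation above: it forms $p$, finds the root of $F(\hat{\beta})$ with Newton's method, and recovers $w_{ij}^* = (p_j - \hat{\beta}^*)_+$:

```python
import numpy as np

def simplex_newton(theta, tol=1e-10, max_iter=100):
    """Project theta onto {w : w >= 0, 1^T w = 1} by finding the root
    of F(b) = mean((b - p)_+) - b with Newton's method, where
    p = theta + (1 - sum(theta)) / n."""
    n = theta.size
    p = theta + (1.0 - theta.sum()) / n
    b = 0.0
    for _ in range(max_iter):
        g = np.maximum(b - p, 0.0)        # (b - p_j)_+
        F = g.mean() - b                  # F(b)
        dF = (b > p).mean() - 1.0         # derivative of F (in [-1, 0])
        if abs(F) < tol or dF == 0.0:
            break
        b = b - F / dF                    # Newton update
    return np.maximum(p - b, 0.0)         # optimal w: (p_j - b)_+
```

For $\vartheta = (2, 0, 0)^T$, the routine returns $(1, 0, 0)^T$, the projection of $\vartheta$ onto the simplex; since $F$ is piecewise linear, Newton's iteration typically terminates after only a few steps.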

5. Experimental Results

5.1. Data Description

We evaluate the proposed method on three HSIs of varying scales to validate its clustering performance and scalability. Here, we briefly introduce the benchmark datasets used in our experiments.
The Xuzhou dataset is a hyperspectral image acquired in 2014, covering a peri-urban area of Xuzhou, Jiangsu, China. The scene consists of 130,000 pixels ($500 \times 260$ spatial size) across 436 spectral bands, with ground-truth annotations for nine distinct land-cover classes; a total of 68,877 labeled pixels are available across these classes. Detailed information for this dataset is provided in Table 2.
The Longkou dataset, publicly released in 2018, features a UAV-captured hyperspectral scene of Longkou Town, Hubei Province, China. Acquired by a UAV-borne imaging system, the data consist of 220,000 pixels ($550 \times 400$ spatial size) with 270 spectral bands, covering nine distinct land-cover categories. Detailed class-specific information is provided in Table 3.
The Pavia Center (PaviaC for short) dataset was made available in 2003 and captures an urban hyperspectral scene of Pavia, Italy. Acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor, the image comprises 783,640 pixels ($1096 \times 715$ spatial size) with 102 spectral bands. The image contains nine distinct land-cover categories, with detailed class distributions provided in Table 4.

5.2. Comparative Methods and Experimental Setting

We compare the proposed method against several state-of-the-art approaches, including DvD (2017) [23], SWCAN (2020) [15], HESSC (2020) [45], SGLSC (2021) [31], SAGC (2023) [26], MCDLT (2025) [32], EGFSC (2025) [34], and S 2 GCL (2024) [37]. Among them, DvD is a closely related model that learns a bi-stochastic graph for clustering, while SWCAN is a representative method that constructs a graph using adaptive neighbor theory. SGLSC, SAGC, and MCDLT are state-of-the-art approaches for HSI clustering. HESSC and EGFSC are subspace clustering models, and S 2 GCL is a recent graph-based deep clustering model. For all methods, the number of neighbors was set to 5 to construct a sparse graph, and other parameters were tuned within the ranges recommended by the respective authors to achieve optimal performance. All experiments were conducted on a computer running Windows 10, equipped with a 2.3 GHz Intel Xeon Gold 5218 CPU, 128 GB of RAM, and MATLAB 2020b.

5.3. Metric

We use user’s accuracy (UA) to evaluate the classification performance for each individual land-cover class. To clearly define the overall evaluation metrics, we first introduce the concept of a confusion matrix. For a classification problem with k classes, the confusion matrix Q is a k × k matrix defined as
$$Q = \begin{bmatrix} q_{11} & q_{12} & q_{13} & \cdots & q_{1k} \\ q_{21} & q_{22} & q_{23} & \cdots & q_{2k} \\ q_{31} & q_{32} & q_{33} & \cdots & q_{3k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ q_{k1} & q_{k2} & q_{k3} & \cdots & q_{kk} \end{bmatrix},$$
where the rows and columns of $Q$ correspond to the ground-truth and predicted labels, respectively. Therefore, $q_{ij}$ denotes the number of pixels truly from class $i$ that were assigned to class $j$, and the diagonal element $q_{ii}$, for $i \in \{1, 2, \ldots, k\}$, represents the number of pixels correctly assigned to the $i$-th land-cover class. Consequently, the sum of all elements $\sum_{i=1}^{k}\sum_{j=1}^{k} q_{ij}$ must equal the total number of pixels $n$.
Overall Accuracy (OA) is the proportion of correctly classified pixels, calculated as
$$\mathrm{OA} = \frac{\sum_{i=1}^{k} q_{ii}}{n}.$$
Average Accuracy (AA) is the mean of the per-class accuracy rates, reflecting the model’s balanced performance across all categories. It is computed as
$$\mathrm{AA} = \frac{1}{k}\sum_{i=1}^{k}\frac{q_{ii}}{q_{:,i}},$$
where $q_{:,i}$ denotes the sum of the $i$-th column of $Q$.
Kappa is another commonly used metric for HSI clustering, which considers the influence of random factors when evaluating the accuracy of clustering results. The mathematical definition is
$$\mathrm{Kappa} = \frac{n\sum_{i=1}^{k} q_{ii} - \sum_{i=1}^{k}\left(q_{i,:} \times q_{:,i}\right)}{n^2 - \sum_{i=1}^{k}\left(q_{i,:} \times q_{:,i}\right)},$$
where $q_{i,:}$ denotes the sum of the $i$-th row of $Q$.
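All three overall metrics follow mechanically from the confusion matrix. The short Python sketch below (the function name is ours; the per-class ratio uses the column sums $q_{:,i}$, following the AA formula above) computes them together:

```python
import numpy as np

def clustering_metrics(Q):
    """Compute OA, AA, and Kappa from a k x k confusion matrix Q
    (rows: ground-truth labels, columns: predicted labels)."""
    Q = np.asarray(Q, dtype=float)
    n = Q.sum()                      # total number of pixels
    diag = np.diag(Q)                # correctly assigned pixels per class
    row = Q.sum(axis=1)              # q_{i,:}
    col = Q.sum(axis=0)              # q_{:,i}
    oa = diag.sum() / n
    aa = np.mean(diag / col)         # mean per-class accuracy
    pe = (row * col).sum() / n**2    # chance agreement
    kappa = (oa - pe) / (1.0 - pe)   # equivalent to the ratio form above
    return oa, aa, kappa
```

Note that dividing the numerator and denominator of the Kappa ratio by $n^2$ yields the familiar form $(\mathrm{OA} - p_e)/(1 - p_e)$ used in the code.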

5.4. Experiments on Benchmarks

To comprehensively evaluate clustering performance and computational efficiency, we use three datasets of varying scales—Xuzhou, Longkou, and PaviaC—representing small-, medium-, and large-scale scenes based on total pixel count. We compare both quantitative metrics and visual clustering maps and report the average time cost per iteration for efficiency comparison.
(1) Results of Xuzhou Dataset: We first evaluate performance on the small-scale Xuzhou dataset (130,000 pixels). Quantitative results are summarized in Table 5. As shown, none of the models achieve high accuracy across all land-cover categories. Notably, $S^3$BGL fails to correctly identify "Red-tile", a limitation we attribute to the challenges of fully unsupervised learning: in the absence of label information, suboptimal anchor selection or affinity estimation can lead to large-scale misclassification, drastically reducing accuracy for certain land-cover types. Despite this, $S^3$BGL achieves high OA, AA, and Kappa scores, demonstrating robust overall clustering. Furthermore, owing to its lower computational complexity, our method requires less time than all competing approaches.
Visual clustering maps are provided in Figure 6. As shown, methods such as HESSC, SAGC, and $S^2$GCL produce fragmented clustering results, indicating a limited ability to maintain spatial consistency. In contrast, DvD, SGLSC, SWCAN, MCDLT, and EGFSC yield smoother maps and exhibit strong discrimination for certain land-cover types such as "Bareland1". However, for other categories (e.g., "Bareland2"), our model performs more effectively, accurately identifying relevant regions. These results suggest that the bi-stochastic property of the anchor–anchor graph contributes to improved clustering.
(2) Results of Longkou Dataset: We next evaluate performance on the medium-scale Longkou dataset (220,000 pixels). As shown in Table 6, our method achieves significantly higher OA, AA, and Kappa values than all competitors. In terms of per-class accuracy, our method performs well across several categories and achieves the highest accuracy in two of the nine classes. Several recent methods (e.g., SWCAN, SAGC, MCDLT, EGFSC) also show strong classification ability for specific land-cover types, but their accuracy falls below 60% in at least four categories, lowering overall metrics. In terms of computational efficiency, our method achieves the lowest time cost (3.52 s), owing to the joint use of data reduction and feature projection.
Visual clustering maps are presented in Figure 7. Methods such as HESSC, SGLSC, SWCAN, and SAGC exhibit noticeable misclassification in water areas and scattered noise elsewhere. In contrast, DvD, MCDLT, and EGFSC yield smoother clustering results. Although $S^2$GCL achieves good results in the water category, it produces a large number of errors in the broad-leaf soybean and corn categories. Our model performs particularly well, correctly classifying most challenging regions and exhibiting clearer spatial consistency.
These results demonstrate that, despite using a compact anchor–anchor graph, our method effectively captures inter-anchor relationships via bi-stochastic graph learning, leading to state-of-the-art clustering performance.
(3) Results of PaviaC Dataset: We finally evaluate performance on the large-scale PaviaC dataset (783,640 pixels). Quantitative results and visual clustering maps are presented in Table 7 and Figure 8, respectively. Our method achieves the highest accuracy in two categories: Water and Tiles. Notably, it reaches near-perfect accuracy for Water (100%) and Tiles (99.97%).
In terms of overall performance, our method outperforms all competitors in OA, AA, and Kappa, showing substantial improvements over recent methods such as HESSC, SGLSC and EGFSC. The visual results further confirm that our approach produces smooth clustering maps, performing particularly well in challenging categories such as Water and Meadows.
Regarding efficiency, despite the dataset containing nearly 800,000 pixels, our method requires an average iteration time of only 13.48 s, significantly lower than all competing methods.
These findings collectively confirm the strong advantages of our method in both clustering performance and computational efficiency.

6. Discussion

6.1. Motivation Verification

In this subsection, we verify that our bi-stochastic graph learning consistently produces valid bi-stochastic graphs across diverse datasets, and that this property underpins the state-of-the-art clustering performance of $S^3$BGL, offering a substantial improvement over conventional graph-construction techniques. For quantitative comparison, we create Variant A, which replaces the bi-stochastic graph learning in $S^3$BGL with a conventional approach. Specifically, this variant learns probabilistic similarities under the constraints $\mathbf{1}^T w_i = 1$ and $w_{ij} \geq 0$, followed by post hoc symmetrization using $W = (W + W^T)/2$.
We first verify that the proposed method successfully yields bi-stochastic graphs across all three HSI datasets. Figure 9 shows the node degrees of the learned graphs. As shown, S 3 BGL produces a degree matrix that is effectively the identity matrix, confirming that all node degrees are strictly 1 and the graph is bi-stochastic. In contrast, the node degrees of Variant A fluctuate significantly and are generally less than 1, due to its separate treatment of affinity estimation and graph symmetrization. These results confirm that S 3 BGL enforces strict probabilistic affinities, ensuring the affinity sum for each node equals 1.
Given its theoretical advantage of integrating affinity estimation, neighborhood selection, and graph symmetry into a unified process, we further examine whether our bi-stochastic graph learning indeed enhances graph representation and consequently improves clustering. Figure 10 visualizes the affinity matrices learned by Variant A and our method on the three HSIs. The Overall Accuracy (OA) achieved by each graph is displayed below the corresponding matrix. As illustrated, since all three HSI datasets contain nine land-cover classes, the ideal anchor–anchor affinity matrix should exhibit nine distinct block-diagonal structures, where each block corresponds to one cluster and its internal values reflect within-class similarity. However, the affinity matrix from Variant A suffers from degree fluctuations due to artificial symmetrization, which introduces erroneous similarities and disrupts the block-diagonal structure. In contrast, our approach learns a cleaner, more structured affinity matrix via bi-stochastic graph learning, leading to superior clustering results. This demonstrates the critical role of learning a proper graph structure on a small scale to enable accurate clustering after dimensionality reduction.
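The degree behavior in Figure 9 is easy to reproduce numerically. The sketch below is illustrative only: the classical Sinkhorn–Knopp scaling used for the bi-stochastic counterpart is not the learning scheme of $S^3$BGL, but it demonstrates the same degree property that our method enforces, in contrast to post hoc symmetrization:

```python
import numpy as np

rng = np.random.default_rng(42)

# A random nonnegative affinity matrix with unit row sums
# (the probabilistic constraint 1^T w_i = 1 used by Variant A).
W = rng.random((6, 6))
W /= W.sum(axis=1, keepdims=True)

# Variant A: post hoc symmetrization. Node degrees (row sums)
# now fluctuate around 1 instead of being exactly 1.
W_sym = (W + W.T) / 2.0
deg_sym = W_sym.sum(axis=1)

# Illustrative bi-stochastic counterpart via Sinkhorn-Knopp scaling
# (NOT the S3BGL learning scheme; shown only for the degree property).
B = W.copy()
for _ in range(500):
    B /= B.sum(axis=1, keepdims=True)   # normalize rows
    B /= B.sum(axis=0, keepdims=True)   # normalize columns
deg_bi = B.sum(axis=1)
```

Inspecting `deg_sym` versus `deg_bi` reproduces the qualitative picture of Figure 9: symmetrized degrees scatter around 1, while the bi-stochastic degrees equal 1 up to numerical tolerance.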

6.2. Parameter Study

Following [14], the hyperparameter γ is set automatically based on the number of neighbors, eliminating manual tuning. The hyperparameter μ requires no manual tuning, as it is updated heuristically during the optimization process.
Our method involves two key parameters: the number of anchors $m$ and the reduced feature dimensionality $r$. We investigate their impact via a grid search over $m \in \{200,\ 1000,\ 2000,\ 5000,\ 10{,}000\}$ and $r \in \{d/5,\ 2d/5,\ 3d/5,\ 4d/5,\ d\}$, where $d$ is the original spectral dimensionality. As shown in Figure 11, increasing $m$ generally improves performance, though gains diminish for larger values. The method is more sensitive to $r$, with optimal performance consistently achieved at moderate dimensions such as $r = 2d/5$ or $r = 3d/5$.
This behavior aligns with the nature of HSIs, which often contain redundant and noisy spectral bands. Moderate dimensionality reduction can mitigate noise and improve similarity measurements, thereby enhancing clustering. However, over-reduction discards discriminative information, leading to performance degradation. In practice, to balance efficiency and performance, we recommend choosing the smallest m and r that maintain acceptable clustering accuracy.

6.3. Convergence Study

To evaluate the convergence of the model, we plot the convergence curves of $S^3$BGL on the three HSI datasets. As shown in Figure 12, our $S^3$BGL model converges rapidly: the objective function value decreases sharply within the first few iterations and stabilizes within 30 iterations. More importantly, even for the PaviaC dataset, which contains nearly 800,000 pixels, $S^3$BGL still converges within 30 iterations. These results demonstrate that, since $S^3$BGL performs clustering via an anchor-to-anchor graph, the iterative process depends only on the number of anchors after dimensionality reduction. This leads to highly efficient convergence, with a rate that remains largely insensitive to the total number of pixels.

7. Conclusions

This paper has presented the Spectral–Spatial Superpixel-based Bi-stochastic Graph Learning ( S 3 BGL) framework for large-scale, high-dimensional HSI clustering. By jointly learning anchors and projecting features, S 3 BGL reduces the data dimensionality in both the pixel and spectral domains, enabling efficient clustering on a compact anchor–anchor graph. A bi-stochastic constraint applied to this graph yields a discriminative similarity representation, leading to accurate cluster assignments that are propagated back to all pixels. To solve the resulting problem efficiently, we introduce a novel optimization scheme that reformulates bi-stochastic matrix learning as a set of parallel vector problems. This approach simultaneously enforces all constraints and guarantees convergence to the global optimum. Extensive experiments on three HSI datasets demonstrate that S 3 BGL scales effectively to large-scale, high-dimensional data, achieving state-of-the-art clustering performance with high computational efficiency. The underlying optimization theory, based primarily on algebraic transformations, is general and can be adapted to other bi-stochastic graph learning problems, including multi-view clustering. Future work will extend this framework to multi-modal HSI clustering, leveraging the complementary information from different data sources.

Author Contributions

Conceptualization, C.C. and N.W.; methodology, C.C., N.W. and S.W.; validation, C.C., N.W. and S.W.; formal analysis, C.C. and N.W.; writing—original draft, C.C., N.W. and S.W.; writing—review and editing, T.W., Z.C. and Y.S.; supervision, J.C., Z.C. and Y.S.; project administration, Z.C. and Y.S.; funding acquisition, J.C., T.W. and Z.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Natural Science Foundation of Shaanxi Province under grants 2020JQ-298 and 2023-JC-YB-501.

Data Availability Statement

The original data presented in the study are openly available and we also provide them with our code at https://github.com/NianWang-HJJGCDX//HPCDL.git (accessed on 1 January 2025).

Conflicts of Interest

The authors declare no conflicts of interest. The founding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; nor in the decision to publish the results.

Appendix A. Transformation from Equation (9) to Equation (10)

This appendix provides a detailed explanation of the transformation from Equation (9) to Equation (10). Equation (10) is essentially a simplified form of Equation (9) after applying the Augmented Lagrangian Method (ALM). The ALM is widely used in multi-variable optimization models for variable substitution and decoupling. To clarify this process, we briefly introduce the ALM method using a general type of function. First, define a general function as
$$\min_{h(X) = 0} f(X).$$
Applying the ALM to Equation (A1), we embed the constraint $h(X) = 0$ into the objective function and obtain
$$\min_{X} f(X) + \frac{\mu}{2}\left\| h(X) + \frac{1}{\mu}\Lambda \right\|_F^2,$$
where $\mu$ is the penalty coefficient and $\Lambda$ is the Lagrange multiplier. Algorithm A1 summarizes the solution process of Equation (A2); as can be seen, using the ALM essentially involves iteratively seeking a sufficiently large $\mu$ such that the condition $h(X) = 0$ is satisfied. The transition from Equation (9) to Equation (10) essentially employs a simplified ALM approach, embedding the symmetry condition of $W$ into the objective function. The difference lies in the fact that we omit the Lagrange multipliers and set $\rho = 2$. During the iterative process, $\mu$ is continuously magnified by $\rho$ to find a value that satisfies the constraint $W - W^T = 0$. Recent work still cannot directly provide a definite limit value for $\mu$ in the ALM method. In practical applications, we treat the fulfillment of the bi-stochastic conditions (including the symmetry condition) as the convergence criterion for $W$, which implies that a bi-stochastic matrix $W$ is automatically output once a sufficiently large $\mu$ is reached.
Algorithm A1 ALM to Solve Problem (A1)
  • Set $\rho > 1$, $\mu > 0$;
  • Initialize $\Lambda$;
  •      While not converged do
  •        (1) Update $X$ by $\min_{X} f(X) + \frac{\mu}{2}\left\|h(X) + \frac{1}{\mu}\Lambda\right\|_F^2$;
  •        (2) Update $\Lambda$ by $\Lambda = \Lambda + \mu h(X)$;
  •        (3) Update $\mu$ by $\mu = \rho\mu$;
  •      end While
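Algorithm A1 can be checked on a toy problem whose inner minimization admits a closed form. The sketch below (our own example, not taken from the paper) applies the ALM to $\min_x \|x - a\|^2$ subject to $\mathbf{1}^T x = 1$, whose analytic solution $x^* = a + \frac{1 - \mathbf{1}^T a}{n}\mathbf{1}$ lets the iterates be verified:

```python
import numpy as np

def alm_projection(a, rho=2.0, mu=1.0, tol=1e-8, max_iter=100):
    """Solve min_x ||x - a||^2  s.t.  1^T x = 1 with the ALM scheme of
    Algorithm A1, where h(x) = 1^T x - 1 is scalar and step (1) has a
    closed form obtained from 2(x - a) + (mu * h(x) + lam) * 1 = 0."""
    n = a.size
    lam = 0.0                       # scalar Lagrange multiplier
    x = a.copy()
    for _ in range(max_iter):
        # (1) inner minimization: solve for the sum s = 1^T x first
        s = (a.sum() + n * mu / 2 - n * lam / 2) / (1 + n * mu / 2)
        x = a - (mu * (s - 1) + lam) / 2
        h = x.sum() - 1.0           # constraint residual
        if abs(h) < tol:
            break
        lam = lam + mu * h          # (2) multiplier update
        mu = rho * mu               # (3) penalty magnification
    return x
```

As in the main text, $\mu$ is magnified by $\rho$ at every pass; the multiplier update keeps the iterates accurate without requiring $\mu \to \infty$, which is exactly the mechanism that Equation (10) simplifies by dropping $\Lambda$.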

References

  1. Wang, N.; Yang, A.; Cui, Z.; Ding, Y. Capsule attention network for hyperspectral image classification. Remote Sens. 2024, 16, 4001. [Google Scholar] [CrossRef]
  2. Pour, A.B.; Zoheir, B.; Pradhan, B.; Hashim, M. Editorial for the special issue: Multispectral and hyperspectral remote sensing data for mineral exploration and environmental monitoring of mined areas. Remote Sens. 2021, 13, 519. [Google Scholar] [CrossRef]
  3. Avola, G.; Matese, A.; Riggi, E. An overview of the special issue on “precision agriculture using hyperspectral images”. Remote Sens. 2023, 15, 1917. [Google Scholar] [CrossRef]
  4. Xue, Y.; Jin, G.; Shen, T.; Tan, L.; Wang, N.; Gao, J.; Yu, Y.; Tian, S. Target-Distractor Aware UAV Tracking via Global Agent. IEEE Trans. Intell. Transp. Syst. 2025, 26, 16116–16127. [Google Scholar] [CrossRef]
  5. Chen, C.; Cao, J.; Wang, T.; Su, Y.; Wang, N.; Zhang, C.; Zhu, L.; Zhang, L. GLFFEN: A Global–Local Feature Fusion Enhancement Network for Hyperspectral Image Classification. Remote Sens. 2025, 17, 3705. [Google Scholar] [CrossRef]
  6. Wang, N.; Cui, Z.; Li, A.; Lu, Y.; Wang, R.; Nie, F. Structured Doubly Stochastic Graph-Based Clustering. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 11064–11077. [Google Scholar] [CrossRef]
  7. Nie, F.; Zhao, X.; Wang, R.; Li, X. Fast Locality Discriminant Analysis with Adaptive Manifold Embedding. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 9315–9330. [Google Scholar] [CrossRef] [PubMed]
  8. Yuan, Z.; Huang, W.; Tang, C.; Yang, A.; Luo, X. Graph-based embedding smoothing network for few-shot scene classification of remote sensing images. Remote Sens. 2022, 14, 1161. [Google Scholar] [CrossRef]
  9. Jiang, G.; Zhang, Y.; Wang, X.; Jiang, X.; Zhang, L. Structured Anchor Learning for Large-Scale Hyperspectral Image Projected Clustering. IEEE Trans. Circuits Syst. Video Technol. 2024, 35, 2328–2340. [Google Scholar] [CrossRef]
  10. Ranjan, S.; Nayak, D.R.; Kumar, K.S.; Dash, R.; Majhi, B. Hyperspectral image classification: A k-means clustering based approach. In Proceedings of the 2017 4th International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, India, 6–7 January 2017; pp. 1–7. [Google Scholar] [CrossRef]
  11. Wang, N.; Cui, Z.; Zhang, C.; Lan, Y.; Xue, Y.; Su, Y.; Li, A. Discrete multi-view graph-based clustering via hierarchical initialization and supercluster similarity minimization. Neurocomputing 2025, 132099. [Google Scholar] [CrossRef]
  12. Zelnik-Manor, L.; Perona, P. Self-tuning spectral clustering. In Proceedings of the Advances in Neural Information Processing Systems 17, Vancouver, BC, Canada, 13–18 December 2004; pp. 1601–1608. [Google Scholar]
  13. Polk, S.L.; Cui, K.; Chan, A.H.; Coomes, D.A.; Plemmons, R.J.; Murphy, J.M. Unsupervised diffusion and volume maximization-based clustering of hyperspectral images. Remote Sens. 2023, 15, 1053. [Google Scholar] [CrossRef]
  14. Nie, F.; Wang, X.; Huang, H. Clustering and Projected Clustering with Adaptive Neighbors. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014; pp. 977–986. [Google Scholar]
  15. Nie, F.; Wu, D.; Wang, R.; Li, X. Self-Weighted Clustering With Adaptive Neighbors. IEEE Trans. Neural Netw. Learning Syst. 2020, 31, 3428–3441. [Google Scholar] [CrossRef] [PubMed]
  16. Zass, R.; Shashua, A. Doubly Stochastic Normalization for Spectral Clustering. In Proceedings of the Advances in Neural Information Processing Systems 19, Vancouver, BC, Canada, 4–7 December 2007; pp. 1569–1576. [Google Scholar]
  17. Von Neumann, J. Functional Operators; Princeton University Press: Princeton, NJ, USA, 1950; Volume 2. [Google Scholar]
  18. Wang, X.; Nie, F.; Huang, H. Structured Doubly Stochastic Matrix for Graph Based Clustering. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1245–1254. [Google Scholar]
  19. Wang, F.; Li, P.; Konig, A.C. Learning a Bi-Stochastic Data Similarity Matrix. In Proceedings of the IEEE International Conference on Data Mining, Sydney, Australia, 13–17 December 2010; pp. 551–560. [Google Scholar]
  20. Chen, M.; Gong, M.; Li, X. Robust Doubly Stochastic Graph Clustering. Neurocomputing 2022, 475, 15–25. [Google Scholar] [CrossRef]
  21. Lim, D.; Vidal, R.; Haeffele, B.D. Doubly Stochastic Subspace Clustering. arXiv 2021, arXiv:2011.14859. [Google Scholar] [CrossRef]
  22. Yang, Z.; Oja, E. Clustering by Low-Rank Doubly Stochastic Matrix Decomposition. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, UK, 26 June–1 July 2012; pp. 707–714. [Google Scholar]
  23. Park, J.; Kim, T. Learning Doubly Stochastic Affinity Matrix via Davis-Kahan Theorem. In Proceedings of the IEEE International Conference on Data Mining (ICDM), New Orleans, LA, USA, 18–21 November 2017; pp. 377–384. [Google Scholar]
  24. Wang, Q.; He, X.; Jiang, X.; Li, X. Robust Bi-stochastic Graph Regularized Matrix Factorization for Data Clustering. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 390–403. [Google Scholar] [CrossRef]
  25. Wang, R.; Nie, F.; Yu, W. Fast spectral clustering with anchor graph for large hyperspectral images. IEEE Geosci. Remote Sens. Lett. 2017, 14, 2003–2007. [Google Scholar] [CrossRef]
  26. Chen, X.; Zhang, Y.; Feng, X.; Jiang, X.; Cai, Z. Spectral-spatial superpixel anchor graph-based clustering for hyperspectral imagery. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
  27. Wang, R.; Nie, F.; Wang, Z.; He, F.; Li, X. Scalable graph-based clustering with nonnegative relaxation for large hyperspectral image. IEEE Trans. Geosci. Remote Sens. 2019, 57, 7352–7364. [Google Scholar] [CrossRef]
  28. Huang, N.; Xiao, L.; Xu, Y.; Chanussot, J. A bipartite graph partition-based coclustering approach with graph nonnegative matrix factorization for large hyperspectral images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5506918. [Google Scholar] [CrossRef]
  29. Wu, C.; Zhang, J. One-Step Joint Learning of Self-Supervised Spectral Clustering with Anchor Graph and Fuzzy Clustering for Land Cover Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 11178–11193. [Google Scholar] [CrossRef]
  30. Zhang, Y.; Jiang, G.; Cai, Z.; Zhou, Y. Bipartite graph-based projected clustering with local region guidance for hyperspectral imagery. IEEE Trans. Multimed. 2024, 26, 9551–9563. [Google Scholar] [CrossRef]
  31. Zhao, H.; Zhou, F.; Bruzzone, L.; Guan, R.; Yang, C. Superpixel-level global and local similarity graph-based clustering for large hyperspectral images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5519316. [Google Scholar] [CrossRef]
  32. Wang, N.; Cui, Z.; Li, A.; Xue, Y.; Wang, R.; Nie, F. Multi-order graph based clustering via dynamical low rank tensor approximation. Neurocomputing 2025, 647, 130571. [Google Scholar] [CrossRef]
  33. Huang, S.; Zeng, H.; Chen, H.; Zhang, H. Spatial and cluster structural prior-guided subspace clustering for hyperspectral image. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5511115. [Google Scholar] [CrossRef]
  34. Zhang, Y.; Wang, X.; Jiang, X.; Zhang, L.; Du, B. Elastic Graph Fusion Subspace Clustering for Large Hyperspectral Image. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 6300–6312. [Google Scholar] [CrossRef]
  35. Cai, Y.; Zhang, Z.; Ghamisi, P.; Ding, Y.; Liu, X.; Cai, Z.; Gloaguen, R. Superpixel contracted neighborhood contrastive subspace clustering network for hyperspectral images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5530113. [Google Scholar] [CrossRef]
  36. Ding, Y.; Zhang, Z.; Yang, A.; Cai, Y.; Xiao, X.; Hong, D.; Yuan, J. SLCGC: A lightweight self-supervised low-pass contrastive graph clustering network for hyperspectral images. arXiv 2025, arXiv:2502.03497. [Google Scholar] [CrossRef]
  37. Yang, A.; Li, M.; Ding, Y.; Xiao, X.; He, Y. An efficient and lightweight spectral-spatial feature graph contrastive learning framework for hyperspectral image clustering. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5537714. [Google Scholar] [CrossRef]
  38. Ding, Y.; Zhang, Z.; Kang, W.; Yang, A.; Zhao, J.; Feng, J.; Hong, D.; Zheng, Q. Adaptive homophily clustering: Structure homophily graph learning with adaptive filter for hyperspectral image. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5510113. [Google Scholar] [CrossRef]
  39. Zhang, M.; Han, T.; Qu, X.; Gao, X.; Liu, X.; Niu, S. Masked superpixel contrastive subspace clustering network for unsupervised large-scale hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5520616. [Google Scholar] [CrossRef]
  40. Liu, M.Y.; Tuzel, O.; Ramalingam, S.; Chellappa, R. Entropy rate superpixel segmentation. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; pp. 2097–2104. [Google Scholar]
  41. Ding, T.; Lim, D.; Vidal, R.; Haeffele, B.D. Understanding Doubly Stochastic Clustering. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 5153–5165. [Google Scholar]
  42. Hagen, L.; Kahng, A. New spectral methods for ratio cut partitioning and clustering. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 1992, 11, 1074–1085. [Google Scholar] [CrossRef]
  43. MacQueen, J. Some methods for classification and analysis of multi-variate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 21 June–18 July 1967; pp. 281–297. [Google Scholar]
  44. He, X.; Niyogi, P. Locality preserving projections. In Proceedings of the Advances in Neural Information Processing Systems 16, Vancouver, BC, Canada, 8–13 December 2003. [Google Scholar]
  45. Rafiezadeh Shahi, K.; Khodadadzadeh, M.; Tusa, L.; Ghamisi, P.; Tolosana-Delgado, R.; Gloaguen, R. Hierarchical sparse subspace clustering (HESSC): An automatic approach for hyperspectral image analysis. Remote Sens. 2020, 12, 2421. [Google Scholar] [CrossRef]
Figure 1. Our motivation. We employ joint superpixel encoding and feature projection to simultaneously reduce both the pixel-wise and spectral dimensions of HSIs, generating a compact bi-stochastic graph representation. This approach ensures scalable and efficient processing of large-scale and high-dimensional hyperspectral data.
Figure 2. An example for bi-stochastic graph learning. (a) Ideal affinity matrix. The blocks in main-diagonal unveil data distribution in five clusters. The intra-block connections are “must-links” that provide necessary connectivity while the inter-block connections are “cannot-links”, reflecting spurious affinities between different clusters that should ideally be zero. (b) The perturbed affinity matrix by adding noise to (a). (c) The bi-stochastic approximation of (b).
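The transition from the perturbed affinity in Figure 2b to its bi-stochastic approximation in Figure 2c can be sketched with a generic Sinkhorn–Knopp normalization, which alternately rescales rows and columns until both sum to one. This is an illustrative sketch only: the function name and the toy two-cluster affinity are assumptions, and the paper's own constrained optimization strategy differs.

```python
import numpy as np

def bistochastic_normalize(A, n_iters=500, tol=1e-9):
    """Sinkhorn-Knopp iteration: alternately normalize the rows and columns
    of a nonnegative affinity matrix until it is (nearly) bi-stochastic,
    i.e. every row sum and every column sum equals 1."""
    W = np.asarray(A, dtype=float).copy()
    for _ in range(n_iters):
        W /= W.sum(axis=1, keepdims=True)  # make rows sum to 1
        W /= W.sum(axis=0, keepdims=True)  # make columns sum to 1
        if (np.abs(W.sum(axis=1) - 1).max() < tol
                and np.abs(W.sum(axis=0) - 1).max() < tol):
            break
    return W

# A noisy block-diagonal affinity (two clusters), analogous to the figure
rng = np.random.default_rng(0)
A = np.kron(np.eye(2), np.ones((3, 3))) + 0.05 * rng.random((6, 6))
A = (A + A.T) / 2  # keep the perturbed affinity symmetric
W = bistochastic_normalize(A)
```

Under a bi-stochastic W every node carries unit degree, which suppresses the spurious inter-cluster ("cannot-link") affinities introduced by the noise, mirroring the effect illustrated in Figure 2c.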
Figure 3. A graphic illustration of the superpixel encoding and anchor generation method.
Figure 4. A graphic illustration of the symmetric neighbor search strategy. The results are generated with n = 10 and c = 3 .
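A symmetric c-nearest-neighbor graph in the setting of Figure 4 (n = 10 points, c = 3 neighbors) can be sketched as follows. The mutual-link symmetrization rule (an edge survives only when i and j select each other) and the exp(−d) weighting are illustrative assumptions, not necessarily the exact construction used in the paper.

```python
import numpy as np

def symmetric_knn_graph(X, c=3):
    """Keep, for each point, its c nearest neighbors, then symmetrize the
    affinity by retaining only mutual links (i selects j AND j selects i)."""
    n = X.shape[0]
    # pairwise Euclidean distances between all points
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)  # exclude self-links from the search
    A = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[:c]       # c nearest neighbors of point i
        A[i, nbrs] = np.exp(-D[i, nbrs])  # similarity weight for each link
    return np.minimum(A, A.T)  # a link survives only if it is mutual

rng = np.random.default_rng(1)
X = rng.random((10, 2))  # n = 10 points, as in the figure
W = symmetric_knn_graph(X, c=3)
```

Because only mutual links are kept, every row of W has at most c nonzero entries, so the resulting graph is both symmetric and sparse, which is what enables the efficiency gain described in the abstract.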
Figure 5. A graphic illustration of the equivalence between Problems (21) and (22).
Figure 6. The visual maps for the clustering outputs on the Xuzhou dataset.
Figure 7. The visual maps for the clustering outputs on the Longkou dataset.
Figure 8. The visual maps for the clustering outputs on the PaviaC dataset.
Figure 9. Visualization of the node degrees of the proposed S³BGL.
Figure 10. Visualization of the affinity matrices on three HSI datasets. The first and second rows show the results of Variant A and our method, respectively.
Figure 11. Parameter sensitivity experiment of our method.
Figure 12. Convergence experiment of S³BGL.
Table 1. Common notations for this paper.

| Notation | Definition |
| --- | --- |
| n | Number of samples (nodes) |
| d | Number of spectral bands |
| m | Number of superpixels |
| r | Number of projected features |
| c | Number of neighbors for graph building |
| k | Number of clusters |
| X ∈ R^(d×n) | The data matrix |
| x_i ∈ R^(d×1) | The i-th data sample |
| W ∈ R^(m×m) | The affinity matrix of the bi-stochastic graph |
| W^T | The transpose of matrix W |
| ‖W‖_F | The Frobenius norm of matrix W |
| w_i ∈ R^(m×1) | The i-th row of matrix W |
| w^i ∈ R^(m×1) | The i-th column of matrix W |
| w_ij | The (i,j)-th entry of matrix W |
Table 2. Description of the Xuzhou dataset.

| No. | Class Name | Samples |
| --- | --- | --- |
| 1 | Bareland1 | 26,396 |
| 2 | Lakes | 4027 |
| 3 | Coals | 2783 |
| 4 | Crops-1 | 5214 |
| 5 | Cement | 13,184 |
| 6 | Trees | 2436 |
| 7 | Bareland2 | 6990 |
| 8 | Crops | 4777 |
| 9 | Red-title | 3070 |
| | Labeled | 68,877 |
| | Unlabeled | 61,123 |
| | Total | 130,000 |
Table 3. Description of the Longkou dataset.

| No. | Class Name | Samples |
| --- | --- | --- |
| 1 | Corn | 34,511 |
| 2 | Cotton | 8374 |
| 3 | Sesame | 3031 |
| 4 | Broad-leaf soybean | 63,212 |
| 5 | Narrow-leaf soybean | 4151 |
| 6 | Rice | 11,854 |
| 7 | Water | 67,056 |
| 8 | Roads and houses | 7124 |
| 9 | Mixed weed | 5229 |
| | Labeled | 204,542 |
| | Unlabeled | 15,458 |
| | Total | 220,000 |
Table 4. Description of the PaviaC dataset.

| No. | Class Name | Samples |
| --- | --- | --- |
| 1 | Water | 65,971 |
| 2 | Trees | 7598 |
| 3 | Asphalt | 3090 |
| 4 | Self-blocking bricks | 2685 |
| 5 | Bitumen | 6584 |
| 6 | Tiles | 9248 |
| 7 | Shadows | 7287 |
| 8 | Meadows | 42,826 |
| 9 | Bare Soil | 2863 |
| | Labeled | 148,152 |
| | Unlabeled | 635,488 |
| | Total | 783,640 |
Table 5. Quantitative comparison on the small-scale dataset Xuzhou. Numbers in bold denote the best value in comparison to others.

| Class | DvD | HESSC | SGLSC | SWCAN | SAGC | MCDL | TEGFSC | S²GCL | S³BGL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bareland1 | 88.71 | 82.43 | 89.75 | 80.09 | 87.67 | 88.25 | 75.50 | **99.84** | 88.51 |
| Lakes | 2.61 | **99.39** | 96.37 | 94.34 | 96.40 | 2.61 | 99.33 | 0.00 | 96.60 |
| Coals | **96.98** | 0.00 | 0.00 | 79.23 | 65.65 | 96.48 | 0.00 | 29.78 | 67.70 |
| Crops-1 | 29.40 | 62.48 | 42.08 | 72.36 | 39.09 | 46.91 | 19.91 | **93.53** | 45.80 |
| Cement | 52.03 | 0.22 | **99.80** | 31.08 | 50.05 | 52.33 | 80.84 | 3.89 | 54.98 |
| Trees | 46.79 | 87.61 | 0.00 | 0.00 | 0.00 | 47.33 | 48.23 | **99.77** | 48.40 |
| Bareland2 | 98.04 | 71.94 | 0.00 | 81.69 | 69.26 | 81.73 | **100.0** | 99.41 | 98.56 |
| Crops | 90.48 | 52.78 | **99.79** | 98.60 | 69.94 | 91.34 | 72.78 | 59.16 | 98.85 |
| Red-title | 38.86 | 27.77 | 53.84 | 61.11 | 4.85 | 0.00 | **100.0** | 78.89 | 0.00 |
| OA | 70.35 | 64.72 | 71.64 | 68.69 | 66.92 | 68.37 | 73.08 | 63.40 | **74.85** |
| AA | 59.57 | 58.26 | 65.87 | 65.34 | 53.79 | 56.74 | 66.28 | 57.04 | **66.54** |
| Kappa | 63.29 | 57.02 | 62.82 | 61.97 | 58.78 | 60.76 | 67.18 | 68.15 | **68.91** |
| CPU Time | 55.48 | 106.32 | 43.32 | 5.31 | 9.47 | 18.61 | 2.67 | 12.32 | **2.43** |
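The OA, AA and Kappa scores reported in Tables 5–7 follow the standard clustering-evaluation recipe: match cluster labels to ground-truth classes, then score the matched confusion matrix. A minimal sketch, assuming the common Hungarian-matching protocol (the paper's exact evaluation code may differ, and the function name is illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_metrics(y_true, y_pred):
    """Compute OA, AA, and Cohen's kappa after matching cluster labels
    to classes with the Hungarian algorithm."""
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    # contingency table: C[i, j] = |{samples of class i in cluster j}|
    C = np.zeros((len(classes), len(clusters)), dtype=int)
    for i, t in enumerate(classes):
        for j, p in enumerate(clusters):
            C[i, j] = np.sum((y_true == t) & (y_pred == p))
    # Hungarian matching maximizes the count on the matched diagonal
    row, col = linear_sum_assignment(-C)
    M = C[np.ix_(row, col)]  # confusion matrix under the best matching
    n = M.sum()
    oa = np.trace(M) / n                           # overall accuracy
    aa = np.mean(np.diag(M) / M.sum(axis=1))       # average (per-class) accuracy
    pe = np.sum(M.sum(axis=0) * M.sum(axis=1)) / n**2  # chance agreement
    kappa = (oa - pe) / (1 - pe)                   # Cohen's kappa
    return oa, aa, kappa

# toy example: three classes, clusters permuted relative to the classes
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([2, 2, 2, 0, 0, 1, 1, 1, 1])
oa, aa, kappa = clustering_metrics(y_true, y_pred)
```

Note that OA is dominated by large classes (e.g. Bareland1 in Xuzhou), whereas AA weights every class equally, which is why the two can diverge sharply in the tables.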
Table 6. Quantitative comparison on the medium-scale dataset Longkou. Numbers in bold denote the best value in comparison to others.

| Class | DvD | HESSC | SGLSC | SWCAN | SAGC | MCDL | TEGFSC | S²GCL | S³BGL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Corn | 54.84 | 91.19 | 99.83 | 48.65 | 54.84 | 54.84 | 99.96 | 71.37 | **99.97** |
| Cotton | 93.32 | 12.22 | 99.39 | 99.51 | 52.76 | 90.35 | 96.94 | **99.71** | 96.94 |
| Sesame | 0.00 | 0.00 | 0.00 | **99.87** | 0.00 | 0.00 | 99.47 | 0.00 | 0.00 |
| Broad-leaf soybean | 59.28 | 72.26 | 79.48 | 33.24 | **92.29** | 59.08 | 35.15 | 32.69 | 84.85 |
| Narrow-leaf soybean | 55.72 | 0.00 | 0.22 | 0.00 | **99.83** | 63.94 | 0.02 | 0.16 | 0.02 |
| Rice | 99.70 | 8.62 | 87.61 | 99.56 | 99.72 | 99.66 | 60.22 | **100.0** | 78.22 |
| Water | 99.76 | 76.82 | 71.94 | 70.53 | 60.45 | 99.42 | **99.95** | 99.56 | **99.95** |
| Roads | 41.72 | 49.10 | 52.78 | 58.83 | 0.91 | **60.06** | 0.67 | 53.93 | 41.64 |
| Mixed weed | **78.71** | 5.22 | 27.77 | 0.00 | 47.20 | 47.18 | 0.00 | 0.34 | 43.20 |
| OA | 74.47 | 65.74 | 76.69 | 64.98 | 68.80 | 74.28 | 69.45 | 72.46 | **80.97** |
| AA | 64.79 | 58.51 | 65.46 | 66.70 | 56.45 | 63.87 | 58.71 | 63.19 | **70.33** |
| Kappa | 68.25 | 55.45 | 70.79 | 65.73 | 61.70 | 68.04 | 62.41 | 67.21 | **81.53** |
| CPU Time | 89.43 | 206.42 | 122.54 | 6.87 | 74.46 | 31.42 | 4.12 | 22.04 | **3.52** |
Table 7. Quantitative comparison on the large-scale dataset PaviaC. Numbers in bold denote the best value in comparison to others.

| Class | DvD | HESSC | SGLSC | SWCAN | SAGC | MCDL | TEGFSC | S²GCL | S³BGL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Water | 97.32 | 97.05 | 94.49 | 98.27 | 54.84 | 98.22 | 98.94 | **100.0** | **100.00** |
| Trees | 65.83 | **100.00** | 44.35 | 76.68 | 52.76 | 74.93 | 4.69 | 32.39 | 45.12 |
| Asphalt | 2.75 | 0.00 | 0.00 | 7.38 | 0.00 | 18.16 | 0.00 | **81.08** | 42.82 |
| Self-blocking bricks | **98.62** | 0.00 | 11.92 | 20.30 | 92.29 | 37.43 | 60.74 | 51.91 | 14.86 |
| Bitumen | 64.96 | 55.56 | 0.02 | 43.03 | **99.83** | 43.38 | 94.27 | 6.70 | 0.00 |
| Tiles | 44.49 | 18.51 | 99.76 | 27.54 | 99.72 | 49.44 | 8.03 | 0.02 | **99.97** |
| Shadows | 38.80 | 75.96 | **78.30** | 55.52 | 60.45 | 69.71 | 38.78 | 0.00 | 74.82 |
| Meadows | 28.84 | 95.30 | **98.41** | 49.17 | 0.91 | 59.17 | 87.85 | 97.35 | 93.66 |
| Bare Soil | 22.07 | 40.13 | 0.00 | 19.25 | 47.20 | 23.33 | **52.91** | 3.46 | 0.00 |
| OA | 64.89 | 84.03 | 83.09 | 69.16 | 68.80 | 76.31 | 78.42 | 77.43 | **85.00** |
| AA | 51.52 | 60.42 | 59.04 | 44.13 | 56.45 | 58.20 | 55.58 | 49.23 | **63.48** |
| Kappa | 54.05 | 76.30 | 76.32 | 58.49 | 61.70 | 66.10 | 69.52 | 66.63 | **78.22** |
| CPU Time | 216.75 | 489.64 | 256.54 | 25.79 | 34.47 | 102.39 | 17.58 | 64.96 | **13.48** |

Share and Cite

MDPI and ACS Style

Chen, C.; Wang, N.; Wang, S.; Cao, J.; Wang, T.; Cui, Z.; Su, Y. Spectral–Spatial Superpixel Bi-Stochastic Graph Learning for Large-Scale and High-Dimensional Hyperspectral Image Clustering. Remote Sens. 2025, 17, 3799. https://doi.org/10.3390/rs17233799