Article

Subspace Learning for Dual High-Order Graph Learning Based on Boolean Weight

by Yilong Wei 1, Jinlin Ma 2,*, Ziping Ma 1 and Yulei Huang 3
1 School of Mathematics and Information Science, North Minzu University, Yinchuan 750021, China
2 School of Computer Science and Engineering, North Minzu University, Yinchuan 750021, China
3 School of Mathematics and Statistics, Ningxia University, Yinchuan 750021, China
* Author to whom correspondence should be addressed.
Entropy 2025, 27(2), 107; https://doi.org/10.3390/e27020107
Submission received: 14 November 2024 / Revised: 18 January 2025 / Accepted: 21 January 2025 / Published: 22 January 2025

Abstract: Subspace learning has achieved promising performance as a key technique for unsupervised feature selection. The strength of subspace learning lies in its ability to identify a representative subspace encompassing a cluster of features that can effectively approximate the space of the original features. Nonetheless, most existing unsupervised feature selection methods based on subspace learning are constrained by two primary challenges. (1) Many methods predominantly focus on the relationships between samples in the data space but ignore the correlated information between features in the feature space, which is unreliable for exploiting the intrinsic spatial structure. (2) Graph-based methods typically take account only of one-order neighborhood structures, neglecting the high-order neighborhood structures inherent in the original data, thereby failing to accurately preserve local geometric characteristics of the data. To fill this gap, taking dual high-order graph learning into account, we propose a framework called subspace learning for dual high-order graph learning based on Boolean weight (DHBWSL). Firstly, a framework for unsupervised feature selection based on subspace learning is proposed, which is extended by dual-graph regularization to fully investigate the geometric structure information of the dual spaces. Secondly, the dual high-order graph is designed by embedding Boolean weights to learn more extensive node information from the original space, such that the appropriate high-order adjacency matrices can be selected adaptively and flexibly. Experimental results on 12 public datasets demonstrate that the proposed DHBWSL outperforms nine recent state-of-the-art algorithms.

1. Introduction

With the rapid proliferation of high-dimensional data characterized by increasing dimensionality and complexity, dimensionality reduction has emerged as an essential yet challenging aspect of data analysis and interpretation [1]. This necessity is particularly evident in applied fields such as text processing, facial recognition, image recognition, and natural language processing. The primary objective of dimensionality reduction is to eliminate noise and redundant information from high-dimensional datasets while preserving critical information, thereby enhancing the efficiency of these methodologies [2,3]. As two key techniques of dimension reduction, subspace learning and feature selection manage to establish a limited number of discriminative features to effectively minimize noise and redundancy [4,5]. Specifically, subspace learning focuses on mapping high-dimensional data from its original space to a lower-dimensional subspace [6], whereas feature selection concentrates on identifying and retaining the most informative subset of features [7].
Generally, subspace learning methods can be categorized into two main categories, supervised and unsupervised, depending on whether labeled information is available during the learning process. As one of the classical supervised subspace learning methods, Linear Discriminant Analysis (LDA) [8] employs label information to learn a discriminative projection, thereby enlarging inter-class distances and reducing intra-class distances. Classical unsupervised subspace learning involves Principal Component Analysis (PCA) [9], Locality Preserving Projection (LPP) [10], and Locally Linear Embedding (LLE) [11]. PCA identifies a projection that retains the principal variance of the data to capture its overall structure. However, PCA is based on the assumption of linearity, which might result in overlooking significant local structures within the data. In contrast, LPP addresses this limitation by deriving projections based on the local geometric structure of the original data. LPP focuses on preserving local information, but it may struggle with global data structures and requires careful tuning of parameters to achieve optimal performance. Similarly, LLE also aims to preserve the local geometry within each neighborhood, effectively maintaining local relationships among data points. Although LLE excels at retaining local features, it is computationally expensive, which makes it arduous to apply to large datasets. In conclusion, classical subspace learning methods typically fail to fully exploit the local geometric features of data, especially when faced with complex distributions or highly nonlinear datasets. Inevitably, this limitation can lead to information loss and a deterioration in classification performance.
Feature selection is another dimension reduction technique, primarily aimed at identifying the most relevant features while preserving their inherent physical significance. Similarly, depending on the availability of labeled information, feature selection methods can be divided into supervised and unsupervised approaches. As conventional supervised methods, the Fisher Score [12] and Subset-Level Score (SFS) [13] leverage statistical measures to assess the relationship between attributes and classes, effectively identifying important features. The former assesses the discriminative ability of each feature by measuring the ratio of inter-class variance to intra-class variance, whereas the latter uses a trace-ratio form to calculate a subset-level score and establish a globally optimal feature subset. In contrast, in the realm of unsupervised learning, prevalent methods include the Laplacian Score (LS) [14] and Multi-Cluster Feature Selection (MCFS) [15]. The LS constructs a graph-based representation by embedding manifold regularization constraints to measure the inherent relevance of features, whereas MCFS retains the multi-cluster structure of the data by fully taking account of the potential correlations among diverse features. It is worth noting that the limitations of the above feature selection methods stem from their reliance on predefined statistical metrics, which might fail to capture the complex relationships inherent in the data. From another view, this dependence tends to lead to the omission or suboptimal selection of discriminative features, or to high sensitivity to the noise that arises within complex data structures [16,17].
To tackle the challenges of subspace learning and feature selection, a fused framework has recently been established by integrating the strengths and advantages of these two kinds of approaches [18,19]. In detail, the essence of these methods is to convert unsupervised feature selection into a matrix factorization problem from the perspective of subspace learning, thereby effectively identifying the most critical feature subspaces consistent with the original space [20,21,22]. Inevitably, unsupervised feature selection methods based on subspace learning focus more on capturing the global structure than the local structure of the data, which hinders sample generalization [23]. To alleviate this, recent research has conducted subspace learning with discriminative information by generating pseudo-label information, simultaneously embedding various orthogonal constraint strategies to guide feature selection, thereby highlighting more local information [24,25]. Alternatively, one of the fundamental concepts that excels in uncovering local structures is “manifold learning”, which underscores the similarity between two subspaces [26]. Guided by the geometric structure information of the feature manifold, the process of learning the feature selection matrix and coefficient matrix is accelerated, and thus a more accurate manifold structure can be extracted [27]. Similarly, incorporating both local manifold structure information and local discriminative information can yield a more effective feature selection process [28]. Notably, the above-mentioned methods construct a fixed graph structure throughout the optimization process but lack the flexibility to accommodate the potential variations and complexities of diverse datasets. For this reason, an adaptive similarity matrix has been constructed to learn higher-quality graph structures, thereby preserving more robust global reconstruction information and local geometric structures and offering a flexible learning mechanism [29]. Building on this foundation, the concept of maximizing the inter-class scatter matrix has been accomplished by incorporating the trace-ratio approach and adaptive graph learning, guiding the feature selection process to capture richer and more detailed manifold information [30].
In summary, the aforementioned methods build a bridge between feature selection and other essential concepts such as subspace learning and manifold learning, but they suffer from two notable limitations: (1) The correlated information between both features and samples is not taken into account simultaneously. (2) Compared with the basic facts in real data, simple one-order adjacency matrices always lack certain critical connections to fully exploit the complex structure inherent in the data.
To solve these problems, an unsupervised feature selection method named DHBWSL is proposed in this paper. The main contributions of this work are summarized as follows:
  • A novel unsupervised feature selection method is proposed that captures structural information more comprehensively by considering hidden high-order similarities in both the data space and the feature space. A key novelty of DHBWSL is that it adaptively learns dual high-order adjacency matrices, dynamically adjusts the graph structure, and selects discriminative features within a unified framework.
  • We design an adaptive dual high-order graph learning mechanism by associating Laplacian rank constraints with Boolean variables to adaptively learn adjacency matrices with consensus structures from suitable high-order adjacency matrices, thereby enhancing the quality of the graph structure.
  • Extensive experiments conducted on 12 public datasets demonstrate that DHBWSL outperforms the performance of various leading unsupervised feature selection models.
The rest of the paper is organized as follows: Section 2 briefly describes the related work. In Section 3, our proposed method is presented, including the objective function, the alternating iteration scheme for solving the optimization problem, and the computational complexity and convergence analysis. Extensive experiments are conducted to demonstrate the effectiveness and superiority of DHBWSL compared to the state-of-the-art methods in Section 4. Finally, Section 5 concludes the paper.

2. Related Work

In this section, we provide an overview of the mathematical symbols used in this paper and primarily review some of the work closely related to our algorithm.

2.1. Related Notations

Let $X^T$ and $\mathrm{Tr}(X)$ denote the transpose and trace of the matrix $X$, and let $l$ represent the number of chosen features ($l \le d$). The norm of any matrix $X$ is defined as $\|X\|_{r,s} = \left( \sum_{i=1}^{d} \left( \sum_{j=1}^{n} |X_{ij}|^r \right)^{s/r} \right)^{1/s}$. Based on this definition, the norm is known as the Frobenius norm (or 2-norm) when $r = s = 2$. When $r = 2$ and $s = 1$, the norm is known as the 2,1-norm. For clarity, we summarize the basic notations employed in this research in Table 1.
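To make the norm definitions concrete, the following short Python sketch (an illustration added here, not part of the original paper) evaluates the $(r,s)$-norm and checks the Frobenius and 2,1-norm special cases with NumPy.

```python
# Illustrative sketch of the (r, s)-norm defined above; assumes X is a d x n
# matrix with rows indexed by i and columns by j.
import numpy as np

def rs_norm(X, r=2, s=1):
    """||X||_{r,s} = ( sum_i ( sum_j |X_ij|^r )^(s/r) )^(1/s)."""
    row_terms = np.sum(np.abs(X) ** r, axis=1) ** (s / r)
    return np.sum(row_terms) ** (1.0 / s)

X = np.random.rand(5, 4)
print(rs_norm(X, 2, 2), np.linalg.norm(X, 'fro'))           # Frobenius norm
print(rs_norm(X, 2, 1), np.sum(np.linalg.norm(X, axis=1)))  # 2,1-norm
```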

2.2. MFFS

Motivated by the principles of subspace learning, MFFS is formulated as a feature selection method using a projection matrix [20]. It is assumed that all features reside within a linear manifold within the true feature space, allowing feature selection to be achieved by approximating the high-dimensional original space through a minimal set of low-dimensional subspaces.
The goal of subspace learning is to minimize the distance between the raw data matrix $X$ and the feature subset $X_I$:
$$\min_{I, V} \|X - X_I V\|_F^2, \quad \text{s.t. } |I| = l, \tag{1}$$
where $I$ is the index set of the selected features, $|I|$ is the number of its elements, $V$ is the coefficient matrix of the initial feature space, and $l$ represents the number of selected features.
From the viewpoint of matrix factorization, the feature selection problem is expressed as follows:
$$\min_{Z, V} \|X - X Z V\|_F^2, \quad \text{s.t. } Z \ge 0, \; V \ge 0, \; Z^T Z = I_l, \tag{2}$$
where $Z$ is the feature selection matrix and $I_l$ is the $l \times l$ identity matrix.
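As a hedged illustration of the MFFS formulation (not the authors' implementation), the sketch below evaluates the reconstruction error $\|X - XZV\|_F^2$ for an indicator-style selection matrix $Z$; it assumes $X$ is arranged as samples × features so that $XZ$ picks feature columns, and the chosen index set and dimensions are arbitrary examples.

```python
# Minimal sketch of the MFFS-style reconstruction error; Z has one-hot columns
# (so Z^T Z = I_l) and V is the best least-squares coefficient matrix for this Z.
import numpy as np

rng = np.random.default_rng(0)
n, d, l = 10, 6, 3
X = rng.random((n, d))                       # assumption: samples x features

idx = np.array([0, 2, 5])                    # hypothetical indices of selected features
Z = np.zeros((d, l))
Z[idx, np.arange(l)] = 1.0                   # indicator columns

V = np.linalg.lstsq(X @ Z, X, rcond=None)[0]     # l x d coefficients for this fixed Z
err = np.linalg.norm(X - X @ Z @ V, 'fro') ** 2
print(err)
```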

2.3. CLR

To accomplish this concept of structured adjacency matrix learning derived from graph theory, the Constrained Laplacian Rank (CLR) method is intended to dynamically refine the graph structure, learning a new data graph $S$ from a given data graph $A$ so that the resulting graph $S$ contains exactly $k$ connected components, where $k$ corresponds to the number of clusters.
For a non-negative affinity matrix $A$, the Laplacian matrix $L_A = D_A - \frac{A^T + A}{2}$, where $D_A$ is the diagonal matrix whose $i$-th diagonal element is $\sum_j \frac{a_{ij} + a_{ji}}{2}$, has the following crucial property [31]:
Theorem 1.
The multiplicity $k$ of the eigenvalue zero of the Laplacian matrix $L_A$ equals the number of connected components in the graph associated with $A$.
According to Theorem 1, the graph is ideal if $\mathrm{rank}(L_S) = n - k$. To avoid the situation in which some rows of $S$ are entirely zero, we further constrain $S$ such that the sum of each row is one. CLR optimizes the following problem:
$$\min_S \|S - A\|_F^2, \quad \text{s.t. } \sum_j s_{ij} = 1, \; s_{ij} \ge 0, \; \mathrm{rank}(L_S) = n - k. \tag{3}$$
Addressing this optimization problem is complex because $L_S = D_S - \frac{S^T + S}{2}$, where $D_S$ depends on $S$, and the constraint $\mathrm{rank}(L_S) = n - k$ is a complicated nonlinear constraint. The solution steps are detailed in the next section.

3. Methodology

In this section, we propose a novel unsupervised feature selection model called DHBWSL. The concept of DHBWSL is illustrated by the following dual high-order graph learning as well as its most important properties, such as the optimization process and convergence analysis.

3.1. Dual High-Order Graph Learning

In general, similarity graphs are limited to capturing pairwise relationships between data points and can only reflect one-order neighbor relationships. However, these pairwise relationships are highly sensitive to the nearest-neighbor parameter; in other words, even small adjustments can significantly alter the matrix structure, thereby diminishing clustering performance. To address these challenges, we propose the new concept that “neighbors of neighbors are also neighbors”, which makes it possible to exploit more node information to construct a high-order adjacency matrix, improving robustness to the nearest-neighbor parameter and better capturing the structural characteristics of the data.
Specifically, based on the one-order adjacency matrix $S^1 = S$, we introduce the high-order adjacency matrices $S^n = S^{n-1} \times S$ $(n = 2, \ldots, O)$, where $O$ denotes the maximum order of $S$. Considering that there may be multiple path interactions between data points, a high-order adjacency matrix can enrich the diversity of local structural information and highlight more representative and distinguishing features. To complement intra-class and inter-class information, we are inspired to learn a uniform adjacency matrix from the multi-order adjacency matrices obtained along different paths:
$$\min_W \sum_{i=1}^{O} \|W - S^i\|_F^2, \quad \text{s.t. } W\mathbf{1} = \mathbf{1}, \; W \in \mathbb{R}_+, \; \mathrm{rank}(L_W) = n - K, \tag{4}$$
where $L_W$ is the Laplacian matrix.
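The sketch below (an illustration, not the authors' code) builds the one-order kNN graph and its higher powers $S^n = S^{n-1} \times S$; the symmetrization and row normalization steps are assumptions added here to keep the entries of different orders on a comparable scale.

```python
# Build high-order adjacency matrices S^1, ..., S^O from a kNN graph.
import numpy as np
from sklearn.neighbors import kneighbors_graph

def high_order_adjacency(X, k=5, O=4):
    S1 = kneighbors_graph(X, n_neighbors=k, mode='connectivity').toarray()
    S1 = (S1 + S1.T) / 2                                         # symmetrise
    S1 = S1 / np.maximum(S1.sum(axis=1, keepdims=True), 1e-12)   # row-normalise
    S_list = [S1]
    for _ in range(2, O + 1):
        S_list.append(S_list[-1] @ S1)                           # S^n = S^{n-1} x S
    return S_list

S_orders = high_order_adjacency(np.random.rand(50, 20), k=5, O=4)
print([S.shape for S in S_orders])
```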
Although high-order adjacency matrices are more robust than one-order adjacency matrices with respect to the sensitivity of the nearest-neighbor parameter k, it is difficult to determine the most suitable number of high-order adjacency matrices. The main reason is that too low an order readily results in an adjacency matrix that fails to capture the intricate relationships in the data, potentially losing vital information, whereas too high an order might cause negative effects such as superfluous duplicate information, increased model complexity, and overtraining.
To address the above issues, we set the order $O$ to a considerably elevated value to cover a wider range of alternatives. Then, we choose $M$ relevant orders from them ($M \le O$). To find the proper orders, we use an adaptive order adjacency-learning approach that minimizes the $M$ selected residuals of $\{\|W - S^i\|_F^2 \mid i \in \{1, 2, \ldots, O\}\}$:
$$\min_{W, p} \sum_{i=1}^{O} p_i \|W - S^i\|_F^2, \quad \text{s.t. } W\mathbf{1} = \mathbf{1}, \; W \in \mathbb{R}_+, \; \mathrm{rank}(L_W) = n - K, \; p^T\mathbf{1} = M, \; p \in \{0,1\}^O, \tag{5}$$
where $p_i$ is a Boolean variable that specifies whether the $i$-th order adjacency matrix $S^i$ is used.
Given that the rank constraint on the Laplacian matrix is a strong constraint, solving Equation (5) directly is quite difficult. To address this issue, we relax the rank constraint on the Laplacian matrix. In the data space, let $\sigma_i(L_V)$ denote the $i$-th smallest eigenvalue of the Laplacian matrix $L_V$; for any $i$, $\sigma_i(L_V) \ge 0$. When $\beta$ is large enough, the optimal solution $W_V$ of problem (5) makes the second term $\sum_{i=1}^{K} \sigma_i(L_V) = 0$, and Equation (5) transforms into:
$$\min_{W_V, p} \sum_{i=1}^{O} p_i \|W_V - S_V^i\|_F^2 + \beta \sum_{i=1}^{K} \sigma_i(L_V), \quad \text{s.t. } W_V\mathbf{1} = \mathbf{1}, \; W_V \in \mathbb{R}_+, \; p^T\mathbf{1} = M, \; p \in \{0,1\}^O, \tag{6}$$
where $\beta$ is the regularization parameter and $W_V$ is the structure matrix of the data space.
According to the Ky Fan theorem (Fan 1949) [32], for every Laplacian matrix $L$, the sum of its first $K$ smallest eigenvalues can be obtained by solving the following minimization problem:
$$\sum_{i=1}^{K} \sigma_i(L_V) = \min_{V} \mathrm{Tr}(V L_V V^T). \tag{7}$$
Therefore, problem (6) is equivalent to:
$$\min_{W_V, V, p} \sum_{i=1}^{O} p_i \|W_V - S_V^i\|_F^2 + \beta \mathrm{Tr}(V L_V V^T), \quad \text{s.t. } W_V\mathbf{1} = \mathbf{1}, \; W_V \in \mathbb{R}_+, \; V \ge 0, \; p^T\mathbf{1} = M, \; p \in \{0,1\}^O, \tag{8}$$
where $\beta$ is the equilibrium parameter and $V$ is the coefficient matrix.

3.2. The Proposed Feature Selection Method

This section describes the objective function of DHBWSL as follows:
$$\begin{aligned} \mathcal{O} = \min \; & \|X^T - X^T H V\|_F^2 + \sum_{i=1}^{O} p_i^V \|W_V - S_V^i\|_F^2 + \beta \mathrm{Tr}(V L_V V^T) + \sum_{i=1}^{O} p_i^U \|W_U - S_U^i\|_F^2 \\ & + \gamma \mathrm{Tr}(H^T X L_U X^T H) + \lambda \|H\|_{2,1}, \\ \text{s.t. } \; & H \ge 0, \; V \ge 0, \; H^T H = I_l, \; W_V\mathbf{1} = \mathbf{1}, \; W_V \in \mathbb{R}_+, \; W_U\mathbf{1} = \mathbf{1}, \; W_U \in \mathbb{R}_+, \\ & (p^U)^T\mathbf{1} = M, \; p^U \in \{0,1\}^O, \; (p^V)^T\mathbf{1} = M, \; p^V \in \{0,1\}^O. \end{aligned} \tag{9}$$
In Equation (9) above, the first term integrates subspace learning to effectively reduce dimensionality and filter out noise. It projects the data into a lower-dimensional space, thereby capturing fundamental feature relationships and enhancing the model’s robustness and efficiency. The second and third terms employ adaptive mechanisms to account for high-order feature interactions, transcending simple pairwise relationships to consider complex synergies among multiple features. The fourth and fifth terms related to higher-order data point distributions consider the high-order relationships between data points, thereby more accurately preserving the original data’s local structure. Lastly, the sixth term enforces sparsity, promoting the selection of only the most relevant features. This results in a compact and interpretable model, enabling a more precise identification of feature subsets.
The regularization parameters $\beta$ and $\gamma$ balance the smoothness of the data and feature space, whereas $\lambda$ is a sparsity constraint parameter imposed on $H \in \mathbb{R}^{d \times l}$. To simplify the formula, define $U \in \mathbb{R}^{n \times l}$ as the product of $X^T$ and $H$, i.e., $U = X^T H$. The feature transformation matrix $H$ provides a score $\|h^i\|_2$ for each feature that reflects its importance; higher scores indicate more significant features. To generate the new data matrix $X_{new}$, we sort the scores in decreasing order and choose the top $l$ features from the original set of $d$ features. The overall framework of DHBWSL is shown in Figure 1.
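The feature-scoring rule just described can be summarized with the following sketch (assumptions: $X \in \mathbb{R}^{d \times n}$ with features as rows and $H$ the learned $d \times l$ matrix); it is an illustration rather than the authors' code.

```python
# Score each feature by the 2-norm of its row in H, then keep the top-l rows of X.
import numpy as np

def select_features(X, H, l):
    scores = np.linalg.norm(H, axis=1)      # ||h^i||_2 for each of the d features
    idx = np.argsort(-scores)[:l]           # indices with the highest scores
    return idx, X[idx, :]                   # X_new in R^{l x n}

d, n, l = 30, 100, 10
idx, X_new = select_features(np.random.rand(d, n), np.random.rand(d, l), l)
print(idx.shape, X_new.shape)
```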

3.3. Comparison of Unsupervised Feature Selection Based on Subspace Learning

To evaluate the clustering performance of DHBWSL, the following representative subspace learning-based unsupervised feature selection methods, summarized in Table 2, are used for the comparison task.

3.4. Optimization

In this section, a new efficient approach is introduced for solving the optimization problem, with the primary goal of minimizing the objective function (9) [38,39]. To impose the constraints $H_{ij} \ge 0$ and $V_{ij} \ge 0$, two Lagrange multipliers, $\phi_{ij}$ and $\psi_{ij}$, are introduced. Hence, the Lagrange function of Equation (9) is as follows:
$$\begin{aligned} \mathcal{L} = \; & \mathrm{Tr}\big((X^T - X^T H V)(X^T - X^T H V)^T\big) + \sum_{i=1}^{O} p_i^V \|W_V - S_V^i\|_F^2 + \beta \mathrm{Tr}(V L_V V^T) + \sum_{i=1}^{O} p_i^U \|W_U - S_U^i\|_F^2 \\ & + \gamma \mathrm{Tr}(H^T X L_U X^T H) + \lambda \mathrm{Tr}(H^T Q H) + \frac{\delta}{2} \mathrm{Tr}\big((H^T H - I_l)(H^T H - I_l)^T\big) + \mathrm{Tr}(\phi H) + \mathrm{Tr}(\psi V), \end{aligned} \tag{10}$$
where $\delta$ is an orthogonal constraint parameter and $Q = [q_{ij}] \in \mathbb{R}^{d \times d}$ is a diagonal matrix whose $i$-th diagonal element $q_{ii}$ is given by:
$$q_{ii} = \frac{1}{2\|h^i\|_2}. \tag{11}$$
To solve Equation (11), we insert a minor constant $\varepsilon$ to prevent overflow:
$$q_{ii} = \frac{1}{2\max(\|h^i\|_2, \varepsilon)}. \tag{12}$$
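A minimal sketch of Equation (12) (not the authors' code): the diagonal reweighting matrix $Q$ built from the current $H$, with the small constant $\varepsilon$ guarding against division by zero.

```python
# Diagonal matrix Q with q_ii = 1 / (2 * max(||h^i||_2, eps)).
import numpy as np

def q_matrix(H, eps=1e-8):
    row_norms = np.linalg.norm(H, axis=1)
    return np.diag(1.0 / (2.0 * np.maximum(row_norms, eps)))

print(np.diag(q_matrix(np.random.rand(6, 3))))
```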

3.4.1. Update V and H

(1)
Fix V and update H:
Setting the partial derivative of $\mathcal{L}(H, V)$ with respect to $H$ to zero gives:
$$\frac{\partial \mathcal{L}}{\partial H} = -2XX^TV^T + 2XX^THVV^T + 2\gamma XL_UX^TH + 2\lambda QH + 2\delta HH^TH - 2\delta H + \phi. \tag{13}$$
Applying the KKT conditions [40], $\phi_{ij} H_{ij} = 0$, we obtain:
$$\big(-2XX^TV^T + 2XX^THVV^T + 2\gamma XL_UX^TH + 2\lambda QH + 2\delta HH^TH - 2\delta H\big)_{ij} H_{ij} = 0. \tag{14}$$
Using $L_U = D_U - W_U$, we obtain the iterative update rule for $H$:
$$H_{ij} \leftarrow H_{ij} \frac{\big(XX^TV^T + \gamma XW_UX^TH + \delta H\big)_{ij}}{\big(XX^THVV^T + \gamma XD_UX^TH + \lambda QH + \delta HH^TH\big)_{ij}}. \tag{15}$$
(2)
Fix H and update V:
Taking the partial derivative of Equation (10) with respect to $V$ yields:
$$\frac{\partial \mathcal{L}}{\partial V} = -2H^TXX^T + 2H^TXX^THV + 2\beta VL_V + \psi. \tag{16}$$
Using the KKT conditions [40], $\psi_{ij} V_{ij} = 0$, we obtain:
$$\big(-2H^TXX^T + 2H^TXX^THV + 2\beta VL_V\big)_{ij} V_{ij} = 0. \tag{17}$$
Using $L_V = D_V - W_V$, we obtain the iterative update rule for $V$:
$$V_{ij} \leftarrow V_{ij} \frac{\big(H^TXX^T + \beta VW_V\big)_{ij}}{\big(H^TXX^THV + \beta VD_V\big)_{ij}}. \tag{18}$$
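For concreteness, the sketch below restates the multiplicative updates (15) and (18) in NumPy. It is an illustration under stated assumptions (all factors non-negative, $W_U, D_U \in \mathbb{R}^{n \times n}$ for the data space, $W_V, D_V \in \mathbb{R}^{d \times d}$ for the feature space), not the authors' MATLAB implementation; a small constant is added to the denominators to avoid division by zero.

```python
# Multiplicative update rules for H (Eq. (15)) and V (Eq. (18)).
import numpy as np

EPS = 1e-12

def update_H(X, H, V, W_U, D_U, Q, gamma, lam, delta):
    XXt = X @ X.T
    num = XXt @ V.T + gamma * (X @ W_U @ X.T) @ H + delta * H
    den = XXt @ H @ (V @ V.T) + gamma * (X @ D_U @ X.T) @ H \
          + lam * Q @ H + delta * H @ (H.T @ H) + EPS
    return H * (num / den)

def update_V(X, H, V, W_V, D_V, beta):
    HtX = H.T @ X
    num = HtX @ X.T + beta * V @ W_V
    den = HtX @ X.T @ H @ V + beta * V @ D_V + EPS
    return V * (num / den)
```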

3.4.2. Update pV and WV

We only consider the variables $p^V$ and $W_V$ here, owing to the symmetry of the data and feature spaces; the same procedure applies to $p^U$ and $W_U$.
(1)
Fix WV and update pV:
$$\min_{p^V} \sum_{i=1}^{O} p_i^V \|W_V - S_V^i\|_F^2, \quad \text{s.t. } (p^V)^T\mathbf{1} = M, \; p^V \in \{0,1\}^O, \tag{19}$$
which has a closed-form solution as follows:
$$p_i^V = \begin{cases} 1, & \text{if } \|W_V - S_V^i\|_F^2 \le \|W_V - S_V^M\|_F^2, \\ 0, & \text{otherwise}, \end{cases} \tag{20}$$
where $\|W_V - S_V^M\|_F^2$ is the $M$-th smallest value in the set $\{\|W_V - S_V^i\|_F^2 \mid i \in \{1, 2, \ldots, O\}\}$.
After removing the terms that are not related to $p^U$, we may update $p^U$ by solving the following problem with the preceding method:
$$\min_{p^U} \sum_{i=1}^{O} p_i^U \|W_U - S_U^i\|_F^2, \quad \text{s.t. } (p^U)^T\mathbf{1} = M, \; p^U \in \{0,1\}^O. \tag{21}$$
(2)
Fix pV and update WV:
$$\min_{W_V} \sum_{i=1}^{O} p_i^V \|W_V - S_V^i\|_F^2, \quad \text{s.t. } W_V\mathbf{1} = \mathbf{1}, \; W_V \in \mathbb{R}_+, \; \mathrm{rank}(L_V) = d - K. \tag{22}$$
Expanding the objective, we have:
$$\sum_{i=1}^{O} p_i^V \|W_V\|_F^2 - 2\mathrm{Tr}\Big(W_V^T \sum_{i=1}^{O} p_i^V S_V^i\Big) + \sum_{i=1}^{O} p_i^V \|S_V^i\|_F^2 \;\Leftrightarrow\; \|W_V\|_F^2 - \mathrm{Tr}\Big(W_V^T \frac{2\sum_{i=1}^{O} p_i^V S_V^i}{\sum_{i=1}^{O} p_i^V}\Big). \tag{23}$$
The resulting problem can therefore be characterized as an instance of the CLR problem discussed in Section 2.3:
$$\min_{W_V} \Big\|W_V - \frac{\sum_{i=1}^{O} p_i^V S_V^i}{\sum_{i=1}^{O} p_i^V}\Big\|_F^2, \quad \text{s.t. } W_V\mathbf{1} = \mathbf{1}, \; W_V \in \mathbb{R}_+, \; \mathrm{rank}(L_V) = d - K. \tag{24}$$
Letting $A_V = \frac{\sum_{i=1}^{O} p_i^V S_V^i}{\sum_{i=1}^{O} p_i^V}$, we obtain:
$$\min_{W_V} \|W_V - A_V\|_F^2, \quad \text{s.t. } W_V\mathbf{1} = \mathbf{1}, \; W_V \in \mathbb{R}_+, \; \mathrm{rank}(L_V) = d - K. \tag{25}$$
Following the relaxation described in Section 3.1 [32], the constraint $\mathrm{rank}(L_V) = d - K$ is transformed into the regularization term $\mathrm{Tr}(V L_V V^T)$:
$$\min_{W_V, V} \|W_V - A_V\|_F^2 + \beta \mathrm{Tr}(V L_V V^T), \quad \text{s.t. } W_V\mathbf{1} = \mathbf{1}, \; W_V \in \mathbb{R}_+, \; V \ge 0, \tag{26}$$
where $\beta$ is the graph regularization parameter. When $V$ is fixed, this problem converts to the following form:
$$\min_{W_V} \sum_{i,j=1}^{d} (w_{ij}^V - a_{ij}^V)^2 + \beta \sum_{i,j=1}^{d} \|v_i - v_j\|_2^2 \, w_{ij}^V, \quad \text{s.t. } \sum_{j=1}^{d} w_{ij}^V = 1, \; 0 \le w_{ij}^V \le 1. \tag{27}$$
Problem (27) is independent for each $i$; hence, we can solve the following subproblems separately:
$$\min_{w_i^V} \sum_{j=1}^{d} (w_{ij}^V - a_{ij}^V)^2 + \beta \sum_{j=1}^{d} \|v_i - v_j\|_2^2 \, w_{ij}^V, \quad \text{s.t. } \sum_{j=1}^{d} w_{ij}^V = 1, \; 0 \le w_{ij}^V \le 1. \tag{28}$$
Using $m_{ij}^V = \|v_i - v_j\|_2^2$, with $m_i^V$ denoting the vector whose $j$-th element is $m_{ij}^V$ (and similarly for $a_i^V$), problem (28) may be written in vector form as follows:
$$\min_{w_i^V} \Big\|w_i^V - \Big(a_i^V - \frac{\beta}{2} m_i^V\Big)\Big\|_2^2, \quad \text{s.t. } (w_i^V)^T\mathbf{1} = 1, \; 0 \le w_{ij}^V \le 1. \tag{29}$$
This problem may be solved using either the closed-form solution of Equation (29) or an efficient iterative technique. Setting $h_i^V = -\big(a_i^V - \frac{\beta}{2} m_i^V\big)$, we have:
$$\min_{w_i^V} \|w_i^V + h_i^V\|_2^2, \quad \text{s.t. } (w_i^V)^T\mathbf{1} = 1, \; 0 \le w_{ij}^V \le 1, \tag{30}$$
where $\tau$ and $g_i \ge 0$ are the Lagrangian multipliers; without loss of generality, the quadratic term is scaled by $1/2$. Using the KKT criteria [40], the following equations are derived:
$$\forall j, \; (w_{ij}^V)^* + h_{ij} - \tau^* - g_{ij}^* = 0; \quad \forall j, \; (w_{ij}^V)^* g_{ij}^* = 0; \quad \forall j, \; (w_{ij}^V)^* \ge 0; \quad \forall j, \; g_{ij}^* \ge 0, \tag{31}$$
where $(w_{ij}^V)^*$ represents the $j$-th element of $(w_i^V)^*$. Using the constraint $(w_i^V)^T\mathbf{1} = 1$, we obtain:
$$\tau^* = \frac{1 + \mathbf{1}^T h_i - \mathbf{1}^T g_i^*}{d}. \tag{32}$$
Combining (32) with the first condition of (31), we obtain:
$$(w_i^V)^* = \frac{1}{d}\mathbf{1} - \frac{\mathbf{1}^T g_i^*}{d}\mathbf{1} - \Big(h_i - \frac{\mathbf{1}^T h_i}{d}\mathbf{1}\Big) + g_i^*, \tag{33}$$
and, elementwise:
$$(w_{ij}^V)^* = \frac{1}{d} - \frac{\mathbf{1}^T g_i^*}{d} - h_{ij} + \frac{\mathbf{1}^T h_i}{d} + g_{ij}^*. \tag{34}$$
To prevent misunderstanding, note that $\mathbf{1}^T h_i \mathbf{1} = (\mathbf{1}^T h_i)\mathbf{1}$, where $\mathbf{1}^T h_i$ is a constant. Furthermore, we denote $\bar{g}_i^* = \mathbf{1}^T g_i^* / d$ and $e_{ij} = 1/d - h_{ij} + \mathbf{1}^T h_i / d$, and rewrite $(w_{ij}^V)^*$ as follows:
$$(w_{ij}^V)^* = e_{ij} + g_{ij}^* - \bar{g}_i^*. \tag{35}$$
Given the third and fourth conditions in Equation (31), Equation (35) can be written as:
$$(w_{ij}^V)^* = (e_{ij} - \bar{g}_i^*)_+. \tag{36}$$
Identifying the ideal $\bar{g}_i^*$ yields the optimal solution $(w_i^V)^*$. Equations (35) and (36) imply that $g_{ij}^* = (w_{ij}^V)^* + \bar{g}_i^* - e_{ij}$ and $g_{ij}^* = (\bar{g}_i^* - e_{ij})_+$. Averaging the variable $g_{ij}^*$ over $j$ yields:
$$\bar{g}_i^* = \frac{1}{d}\sum_{j=1}^{d}(\bar{g}_i^* - e_{ij})_+. \tag{37}$$
The optimal value of $\bar{g}_i^*$ may be determined using the Newton method and a cost function, defined as follows:
$$\Phi(\bar{g}_i) = \frac{1}{d}\sum_{j=1}^{d}(\bar{g}_i - e_{ij})_+ - \bar{g}_i. \tag{38}$$
If the cost function satisfies $\Phi(\bar{g}_i) = 0$, we have the optimum $\bar{g}_i^*$. The update rule for the $(t+1)$-th iteration is as follows:
$$\bar{g}_i^{t+1} = \bar{g}_i^t - \Phi(\bar{g}_i^t)\Big[\frac{\partial \Phi(\bar{g}_i^t)}{\partial \bar{g}_i^t}\Big]^{-1}, \tag{39}$$
the following properties contribute to the effectiveness of the Newton method: $\bar{g}_i \ge 0$, the derivative $\partial \Phi(\bar{g}_i^t)/\partial \bar{g}_i^t$ is easy to compute, and $\Phi(\bar{g}_i)$ is a piecewise linear convex function.
After eliminating items unrelated to WU, we may apply the previous strategy to address the following problem:
$$\min_{W_U} \sum_{i=1}^{O} p_i^U \|W_U - S_U^i\|_F^2, \quad \text{s.t. } W_U\mathbf{1} = \mathbf{1}, \; W_U \in \mathbb{R}_+, \; \mathrm{rank}(L_U) = n - K. \tag{40}$$
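The following sketch (an illustration, not the authors' code) strings together the pieces of Section 3.4.2 for one graph: the closed-form Boolean weights of Equation (20), the consensus target $A$, and the row-wise solution $w_{ij} = (e_{ij} - \bar{g}_i)_+$ of Equation (36), with $\bar{g}_i$ found by the Newton iteration of Equations (38) and (39).

```python
# Boolean-weight selection and CLR-style row update for the structure matrix W.
import numpy as np

def update_p(W, S_list, M):
    res = np.array([np.linalg.norm(W - S, 'fro') ** 2 for S in S_list])
    p = np.zeros(len(S_list))
    p[np.argsort(res)[:M]] = 1.0                     # keep the M smallest residuals
    return p

def consensus_target(S_list, p):
    return sum(pi * S for pi, S in zip(p, S_list)) / p.sum()

def solve_g_bar(e, iters=50, tol=1e-10):
    g = 0.0
    for _ in range(iters):
        phi = np.maximum(g - e, 0.0).mean() - g      # cost function, Eq. (38)
        dphi = (g > e).mean() - 1.0                  # slope of the piecewise-linear Phi
        if abs(phi) < tol or dphi == 0.0:
            break
        g = g - phi / dphi                           # Newton step, Eq. (39)
    return g

def update_W_row(a_row, m_row, reg):
    h = -(a_row - 0.5 * reg * m_row)                 # h_i = -(a_i - (reg/2) m_i)
    d = a_row.size
    e = 1.0 / d - h + h.sum() / d                    # e_ij = 1/d - h_ij + (1^T h_i)/d
    g_bar = solve_g_bar(e)
    return np.maximum(e - g_bar, 0.0)                # w_ij = (e_ij - g_bar)_+
```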

3.5. Convergence Analysis

This section provides a convergence analysis of DHBWSL and theoretically proves that the objective function in Equation (9) exhibits monotonically decreasing properties under the iterative update rules (15).
Definition 1.
According to Lee and Seung [41], if the following criteria are met:
$$M(x, x') \ge G(x), \quad M(x, x) = G(x), \tag{41}$$
where $M(x, x')$ is an auxiliary function for $G(x)$, then $G(x)$ is monotonically non-increasing under the update formula:
$$x^{t+1} = \arg\min_x M(x, x^t). \tag{42}$$
Proof of Definition 1. 
$G(x^{t+1}) \le M(x^{t+1}, x^t) \le M(x^t, x^t) = G(x^t)$, so $G(x)$ is non-increasing and convergent. □
To construct the following function, we selectively keep the terms involving H from Equation (9).
$$\begin{aligned} G(H) &= \|X^T - X^T H V\|_F^2 + \gamma \mathrm{Tr}(H^T X L_U X^T H) + \lambda \|H\|_{2,1} + \frac{\delta}{2}\|H^T H - I_l\|_F^2 \\ &= \mathrm{Tr}\big((X^T - X^T H V)(X^T - X^T H V)^T\big) + \gamma \mathrm{Tr}(H^T X L_U X^T H) + \lambda \mathrm{Tr}(H^T Q H) + \frac{\delta}{2}\mathrm{Tr}\big((H^T H - I_l)(H^T H - I_l)^T\big). \end{aligned} \tag{43}$$
By calculating the first- and second-order partial derivatives of G(H) with respect to H, we obtain:
$$G'_{ij} = \Big[\frac{\partial G}{\partial H}\Big]_{ij} = \big[-2XX^TV^T + 2XX^THVV^T + 2\gamma XL_UX^TH + 2\lambda QH + 2\delta HH^TH - 2\delta H\big]_{ij}, \tag{44}$$
$$G''_{ij} = 2[XX^T]_{ii}[VV^T]_{jj} + \big[2\gamma XL_UX^T + 2\lambda Q + 2\delta HH^T - 2\delta I\big]_{ii}. \tag{45}$$
Lemma 1.
Define the auxiliary functions of Gij as:
$$M(H_{ij}, H_{ij}^t) = G_{ij}(H_{ij}^t) + G'_{ij}(H_{ij}^t)(H_{ij} - H_{ij}^t) + \frac{\big[XX^THVV^T + \gamma XD_UX^TH + \lambda QH + \delta HH^TH\big]_{ij}}{H_{ij}^t}(H_{ij} - H_{ij}^t)^2. \tag{46}$$
The Taylor expansion of $G_{ij}(H_{ij})$ is considered:
$$G_{ij}(H_{ij}) = G_{ij}(H_{ij}^t) + G'_{ij}(H_{ij}^t)(H_{ij} - H_{ij}^t) + \Big([XX^T]_{ii}[VV^T]_{jj} + \big[\gamma XL_UX^T + \lambda Q + \delta HH^T - \delta I\big]_{ii}\Big)(H_{ij} - H_{ij}^t)^2. \tag{47}$$
Comparing Equations (46) and (47), the inequality $M(H_{ij}, H_{ij}^t) \ge G_{ij}(H_{ij})$ is equivalent to:
$$\frac{\big[XX^THVV^T + \gamma XD_UX^TH + \lambda QH + \delta HH^TH\big]_{ij}}{H_{ij}^t} \ge [XX^T]_{ii}[VV^T]_{jj} + \big[\gamma XL_UX^T + \lambda Q + \delta HH^T - \delta I\big]_{ii}. \tag{48}$$
Obviously:
$$\big[XX^THVV^T + \lambda QH + \delta HH^TH\big]_{ij} = \sum_{k=1}^{d}\Big([XX^T]_{ik}H_{kj}^t[VV^T]_{jj} + [\lambda Q + \delta HH^T]_{ik}H_{kj}^t\Big) \ge [XX^T]_{ii}H_{ij}^t[VV^T]_{jj} + [\lambda Q + \delta HH^T]_{ii}H_{ij}^t, \tag{49}$$
and:
$$\gamma\big[XD_UX^TH\big]_{ij} = \gamma\sum_{k=1}^{d}[XD_UX^T]_{ik}H_{kj}^t \ge \gamma[XD_UX^T]_{ii}H_{ij}^t \ge \gamma[X(D_U - W_U)X^T]_{ii}H_{ij}^t = \gamma[XL_UX^T]_{ii}H_{ij}^t. \tag{50}$$
Since Equations (49) and (50) hold, the inequality $M(H_{ij}, H_{ij}^t) \ge G_{ij}(H_{ij})$ holds, and the equality $M(H_{ij}, H_{ij}^t) = G_{ij}(H_{ij})$ holds when $H_{ij} = H_{ij}^t$.
We then show that the update rule for the variable $H$ is monotonically decreasing:
Proof of Lemma 1. 
By substituting the auxiliary function M ( H i j ,   H i j t ) into Equation (42), we derive the following update rule:
$$H_{ij}^{t+1} = H_{ij}^t - H_{ij}^t \frac{G'_{ij}(H_{ij}^t)}{2\big[XX^THVV^T + \gamma XD_UX^TH + \lambda QH + \delta HH^TH\big]_{ij}} = H_{ij}^t \frac{\big[XX^TV^T + \gamma XW_UX^TH + \delta H\big]_{ij}}{\big[XX^THVV^T + \gamma XD_UX^TH + \lambda QH + \delta HH^TH\big]_{ij}}. \tag{51}$$
Based on the above derivation, we can conclude that under the update rule for H, the objective function is monotonically decreasing. Similarly, the update rules for other variables also ensure that the objective function value decreases monotonically, thereby guaranteeing their convergence. □
Based on the analysis presented above, Algorithm 1 explains the DHBWSL process.
Algorithm 1 The procedure of DHBWSL.
Input: Data matrix XRd×n; Parameter β, γ, λ, σ and k; The number of selected features l; The maximum number of iterations NIter.
Initialization: The iteration time t = 0; H = rand(d,l), V = rand(l,d), Il = eye(l), Construct the attribute score matrix Q;
  Repeat:
  1. Update the feature selection matrix H with Equation (15).
  2. Update coefficient matrix V with Equation (18).
  3. Update $p^V$ by solving subproblem (20).
  4. Update $p^U$ by solving subproblem (21).
  5. Update $W_V$ by solving subproblem (39).
  6. Update $W_U$ by solving subproblem (40).
  Until Convergence
Output: Index of selected features; New data matrix XnewRl×n.
Feature selection: The score of each of the d features is calculated according to ‖hi‖2, and the l features with the highest scores are selected.
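Putting the pieces together, the outline below sketches the Algorithm 1 loop in Python. It reuses the illustrative helpers defined in the earlier sketches (high_order_adjacency, q_matrix, update_H, update_V, update_p, consensus_target, update_W_row) and makes several simplifying assumptions (e.g., how the pairwise distances feeding the graph updates are computed); it is an outline of the procedure, not the authors' MATLAB implementation.

```python
# Assumption-laden outline of the DHBWSL loop (Algorithm 1).
import numpy as np

def pairwise_sq_dists(Y):
    # squared Euclidean distances between the rows of Y
    sq = np.sum(Y ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2 * Y @ Y.T, 0.0)

def dhbwsl(X, l, O=4, M=2, beta=1.0, gamma=1.0, lam=1.0, delta=1.0, k=5, n_iter=30):
    d, n = X.shape
    rng = np.random.default_rng(0)
    H, V = rng.random((d, l)), rng.random((l, d))
    S_U = high_order_adjacency(X.T, k=k, O=O)        # data-space graphs (n x n)
    S_V = high_order_adjacency(X, k=k, O=O)          # feature-space graphs (d x d)
    W_U, W_V = S_U[0].copy(), S_V[0].copy()
    for _ in range(n_iter):
        Q = q_matrix(H)
        D_U, D_V = np.diag(W_U.sum(axis=1)), np.diag(W_V.sum(axis=1))
        H = update_H(X, H, V, W_U, D_U, Q, gamma, lam, delta)     # Eq. (15)
        V = update_V(X, H, V, W_V, D_V, beta)                     # Eq. (18)
        p_V, p_U = update_p(W_V, S_V, M), update_p(W_U, S_U, M)   # Eqs. (20)-(21)
        A_V, A_U = consensus_target(S_V, p_V), consensus_target(S_U, p_U)
        M_V = pairwise_sq_dists(V.T)                 # distances between feature embeddings
        M_U = pairwise_sq_dists(X.T @ H)             # distances between sample embeddings
        W_V = np.vstack([update_W_row(A_V[i], M_V[i], beta) for i in range(d)])
        W_U = np.vstack([update_W_row(A_U[i], M_U[i], gamma) for i in range(n)])
    scores = np.linalg.norm(H, axis=1)               # feature scores ||h^i||_2
    idx = np.argsort(-scores)[:l]
    return idx, X[idx, :]
```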

4. Experiments

In this section, we evaluate the performance of DHBWSL by comparing it with related methods on public datasets and provide a parameter sensitivity analysis and convergence analysis. All experiments are implemented in MATLAB 2021b and run on a Windows machine with a 3.20 GHz i7-75800H CPU and 16 GB of main memory.

4.1. Datasets

To evaluate the effectiveness of DHBWSL in terms of clustering performance, 12 public datasets are used in the following experiments, including JAFFE, COIL20, ORL, lung, Isolet, EYale B, TOX_171, GLIOMA, RELATHE, ALLAML, orlraws10P, and COIL100 [33,34,35,36,37,42,43], downloaded at https://jundongl.github.io/scikit-feature/datasets.html, accessed on 3 August 2023, and https://www.face-rec.org/databases/, accessed on 3 August 2023, and Table 3 illustrates the details of these datasets.

4.2. Comparison Methods

Since DHBWSL is a UFS method, the comparison experiments are conducted under unsupervised conditions. Nine state-of-the-art unsupervised feature selection algorithms are employed to highlight the superiority of the proposed method. We selected the Baseline method as it offers a fundamental reference point, emphasizing the importance of feature selection in enhancing clustering performance. The choice of graph-based learning methods, such as EGCFS and SOGFS, demonstrates the advantages of our approach in preserving local structure. Subspace-based learning methods, such as VCSDFS and RSPCA, help validate our method's effectiveness in dimensionality reduction. Additionally, feature selection methods, including MCFS, UDFS, UFS2, and UDS2FS, allow us to verify that the feature subset we select is superior.
Baseline: The k-means clustering technique is directly applied to the original data clusters.
MCFS [15]: A two-step feature selection framework is accomplished by integrating spectral analysis and sparse learning.
UDFS [44]: A local discriminative UFS model emphasizes discriminative information and feature correlations to identify the most distinguishing features.
SOGFS [45]: This model can adaptively preserve local structure by constructing a more accurate similarity matrix to emphasize more distinguishing features.
EGCFS [46]: Adaptive graph learning is constructed to select distinguishing features.
VCSDFS [47]: A variance–covariance is established to redefine the subspace distances so as to eliminate the irrelevant features.
UFS2 [48]: This uses binary vectors in k-means to select more accurate numbers of features for clustering.
UDS2FS [49]: Training soft labels is designed to guide the feature selection process to identify more discriminative subspaces.
RSPCA [50]: The σ -norm was employed as the reconstruction error, while the ℓ2,0-norm constraint was applied to the subspace projection in the feature selection task.

4.3. Experimental Settings

Regarding the parameter settings, the maximum number of iterations NIter is set to 30. The neighborhood size k for DHBWSL and all comparative methods based on graph learning is set to 5. According to the literature requirements, for methods including MCFS, UDFS, SOGFS, EGCFS, VCSDFS, UFS2, UDS2FS and RSPCA, a grid search strategy is employed to select appropriate values from the set {10−4, 10−3, 10−2, 10−1, 100, 101, 102, 103, 104} for parameters that need adjustment. For DHBWSL, there are four regularization parameters: data graph regularization parameter β, feature graph regularization parameter γ, sparsity parameter λ, and orthogonal parameter δ, which is set to 1. To ensure fairness in comparative experiments, a parameter search is conducted for β, γ, and λ within the range {10−4, 10−3, 10−2, 10−1, 100, 101, 102, 103, 104}. The selected range of values for the number of features l is {20, 30, 40, 50, 60, 70, 80, 90, 100}. Due to the dependency of k-means clustering results on initialization, we computed the average of 20 runs to obtain the final results for ACC and NMI.
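As a hedged illustration of this protocol (not the authors' evaluation script), the sketch below runs k-means 20 times on a selected feature subset and reports the mean and standard deviation of ACC and NMI, with ACC computed through the usual Hungarian (linear-sum assignment) matching between predicted and true labels.

```python
# Average clustering ACC/NMI over 20 k-means runs on the selected features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score
from scipy.optimize import linear_sum_assignment

def clustering_acc(y_true, y_pred):
    # assumes integer labels 0..C-1 in both arrays
    n_cls = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((n_cls, n_cls), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    row, col = linear_sum_assignment(-cost)          # best cluster-to-class matching
    return cost[row, col].sum() / len(y_true)

def evaluate(X_selected, y_true, n_clusters, runs=20):
    accs, nmis = [], []
    for seed in range(runs):
        pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X_selected)
        accs.append(clustering_acc(y_true, pred))
        nmis.append(normalized_mutual_info_score(y_true, pred))
    return np.mean(accs), np.std(accs), np.mean(nmis), np.std(nmis)
```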

4.4. Clustering Results and Analysis

The comparative clustering experiments are conducted in terms of ACC and NMI on 12 distinct datasets, as illustrated in Table 4 and Table 5. Due to the sensitivity of k-means clustering to initialization, we compute the average and standard deviation of ACC and NMI of twenty independent results, as detailed in Table 4 and Table 5. For clarity in comparing results, we highlight the best ACC and NMI values in bold and underline the second-best values. Furthermore, Figure 2 and Figure 3 analyze the variation in ACC and NMI with changes in the number of features. In Figure 2 and Figure 3, the x-coordinate represents the number of selected features, while the y-coordinate represents the value of ACC or NMI. From the experimental results presented in Table 4 and Table 5 and Figure 2 and Figure 3, it is evident that in most cases, DHBWSL achieves higher ACC and NMI values compared to other compared algorithms. This significantly exhibits its effectiveness in selecting features and its exceptional capabilities in graph learning. Detailed conclusions are provided below:
(1) Overall, most UFS methods outperform Baseline on the majority of datasets. This performance difference highlights the ability of these UFS methods to enhance clustering performance by eliminating noise and redundant features.
(2) The results presented in Table 4 and Table 5 demonstrate that our proposed DHBWSL achieves significant performance improvements compared to other comparative methods, effectively validating the superiority of DHBWSL. Specifically, compared to Baseline, MCFS, UDFS, SOGFS, EGCFS, VCSDFS, UFS2, UDS2FS and RSPCA, DHBWSL shows substantial increases in ACC by 20.20%, 33.50%, 25.14%, 35.15%, 17.70%, 11.65%, 37.53%, 15.00% and 19.46%, respectively. This indicates that DHBWSL better preserves local structural information in both data and feature spaces while effectively eliminating noise and redundancy in unlabeled data.
(3) In various real datasets encompassing diverse scenarios such as images, text, and videos, DHBWSL consistently outperforms other comparative approaches, effectively validating its superiority. Particularly on the Isolet dataset, while other methods fail to reach Baseline levels in terms of ACC and NMI, DHBWSL surpasses the Baseline performance. This outcome convincingly demonstrates the effectiveness of DHBWSL. The success is attributed to learning a more stable dual high-order graph structure and a well-sparse feature selection matrix, and thus the selected features are of higher quality, leading to more stable results in k-means clustering.
(4) The findings presented in Table 6 indicate that DHBWSL is notably competitive in terms of computational efficiency when juxtaposed with the majority of the algorithms evaluated. In particular, there is a marked decrease in time cost when compared to earlier adaptive graph learning approaches like SOGFS and EGCFS. While DHBWSL might incur a marginally higher time cost than certain subspace learning techniques such as VCSDFS, UDS2FS, and RSPCA, it nonetheless maintains superior computational efficiency and attains the highest level of clustering accuracy. This advantage is particularly pronounced when DHBWSL is contrasted with conventional methods, including Baseline, MCFS, and UDFS.
(5) Specifically, even compared to the latest UFS methods like EGCFS, VCSDFS, UFS2, UDS2FS and RSPCA, DHBWSL demonstrates superior clustering performance. This excellent performance is attributed to the high-order adjacency learning that facilitates learning a more stable high-order graph structure for guiding feature selection for unlabeled data compared to other comparative subspace learning methods such as VCSDFS, UDS2FS and RSPCA.
(6) DHBWSL performs worse than VCSDFS on the COIL100 dataset, likely because VCSDFS leverages the intrinsic statistical information and feature correlation more effectively through the Variance–Covariance Subspace Distance framework. This allows for better identification of the representative feature subset with minimum norm properties during feature selection, leading to improved dimensionality reduction and subspace learning performance. By contrast, DHBWSL is not confined to relying exclusively on statistical information. Instead, it initiates from the geometric structure of the data and, through the construction of a double high-order graph, more comprehensively captures the higher-order local structure features.
(7) To elucidate the marked enhancement in the clustering outcomes of DHBWSL as depicted in Table 7 and Table 8, a statistical analysis was conducted comparing the results of DHBWSL against those of the comparative algorithms. Specifically, a paired t-test was employed. Each algorithm was required to independently perform clustering 20 times to obtain the average results presented in Table 4 and Table 5, with these 20 results serving as the basis for the paired t-test. Observing the h and p values obtained from the statistical experiments, an h value of 0 indicates that the null hypothesis cannot be rejected at a significance level of 5%. Conversely, an h value of 1 indicates that the null hypothesis can be rejected at the 5% level. The p-value represents the significance level. When h = 1 and the p-value is small, it is generally concluded that there is a difference between the two samples, suggesting a notable improvement in the results of DHBWSL. Table 7 and Table 8 present the paired t-tests for DHBWSL and the comparative algorithms across all datasets. Table 7 and Table 8 demonstrate that in the paired t-tests, h = 1 and the p-values are small for the majority of datasets. These results indicate that, compared to other algorithms, there is a significant difference between the ACC and NMI of DHBWSL and the comparative algorithms, illustrating a substantial improvement in DHBWSL and validating its superiority.
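A minimal sketch of the paired t-test described above (an illustration with placeholder data, not the authors' script): given the 20 per-run results of DHBWSL and of one comparison method on a dataset, scipy's ttest_rel yields the p-value, and h records whether the null hypothesis is rejected at the 5% level.

```python
# Paired t-test between two sets of 20 clustering results.
import numpy as np
from scipy.stats import ttest_rel

acc_dhbwsl = np.random.rand(20)     # placeholder for DHBWSL's 20 ACC values
acc_other = np.random.rand(20)      # placeholder for a comparison method's 20 ACC values

t_stat, p_value = ttest_rel(acc_dhbwsl, acc_other)
h = int(p_value < 0.05)             # h = 1: reject the null hypothesis at the 5% level
print(p_value, h)
```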
Table 4. The clustering performance of compared methods on ACC ± STD% on 12 datasets.
Datasets | Baseline | MCFS | UDFS | SOGFS | EGCFS | VCSDFS | UFS2 | UDS2FS | RSPCA | DHBWSL
JAFFE | 81.10 ± 4.85 (all) | 88.24 ± 5.40 (40) | 85.77 ± 4.31 (50) | 83.29 ± 6.15 (90) | 80.40 ± 5.29 (30) | 87.37 ± 5.97 (30) | 67.58 ± 4.06 (100) | 83.76 ± 5.40 (30) | 69.95 ± 4.23 (100) | 89.41 ± 4.54 (60)
COIL20 | 65.75 ± 4.16 (all) | 65.14 ± 2.53 (60) | 54.05 ± 2.63 (100) | 48.04 ± 1.40 (100) | 62.98 ± 2.87 (100) | 67.16 ± 2.81 (30) | 40.25 ± 1.18 (90) | 68.43 ± 3.36 (80) | 58.26 ± 1.85 (100) | 70.52 ± 2.71 (30)
ORL | 52.90 ± 3.08 (all) | 53.86 ± 1.65 (100) | 52.39 ± 2.48 (40) | 51.20 ± 2.21 (30) | 53.73 ± 2.18 (100) | 54.77 ± 3.03 (60) | 35.49 ± 1.35 (100) | 53.09 ± 2.93 (50) | 44.73 ± 2.05 (90) | 55.85 ± 2.53 (100)
lung | 70.10 ± 8.22 (all) | 83.42 ± 7.23 (30) | 70.15 ± 2.82 (80) | 48.67 ± 3.42 (100) | 80.02 ± 1.26 (30) | 72.17 ± 6.20 (70) | 48.65 ± 1.87 (100) | 78.47 ± 0.70 (40) | 70.62 ± 1.34 (30) | 83.82 ± 1.03 (100)
Isolet | 61.73 ± 2.77 (all) | 54.16 ± 2.53 (90) | 42.83 ± 1.89 (100) | 48.35 ± 1.51 (100) | 51.07 ± 2.93 (100) | 58.98 ± 2.11 (50) | 30.44 ± 1.27 (90) | 58.01 ± 2.45 (100) | 56.28 ± 1.89 (100) | 67.97 ± 2.29 (100)
EYale B | 9.64 ± 0.45 (all) | 15.10 ± 0.56 (20) | 10.26 ± 0.39 (100) | 10.42 ± 0.35 (80) | 13.98 ± 0.48 (60) | 11.59 ± 0.41 (20) | 12.20 ± 0.38 (40) | 10.54 ± 0.35 (20) | 9.76 ± 0.33 (40) | 17.14 ± 0.53 (20)
TOX_171 | 43.86 ± 2.17 (all) | 44.42 ± 1.27 (80) | 48.42 ± 1.59 (40) | 47.98 ± 3.80 (70) | 49.06 ± 0.26 (70) | 46.99 ± 0.46 (20) | 44.06 ± 0.58 (30) | 41.43 ± 2.57 (60) | 43.92 ± 0.73 (20) | 54.27 ± 0.52 (50)
GLIOMA | 59.20 ± 2.19 (all) | 45.90 ± 4.70 (100) | 57.30 ± 4.17 (30) | 66.50 ± 4.58 (30) | 61.70 ± 4.46 (50) | 71.70 ± 5.44 (60) | 67.20 ± 3.75 (50) | 64.40 ± 5.05 (60) | 73.10 ± 3.08 (70) | 79.40 ± 4.95 (20)
RELATHE | 54.51 ± 0.10 (all) | 53.72 ± 0.76 (90) | 56.27 ± 0.21 (50) | 56.53 ± 1.07 (60) | 59.26 ± 0.95 (20) | 56.90 ± 0.03 (40) | 55.41 ± 0.88 (30) | 56.11 ± 0.81 (100) | 56.85 ± 1.24 (30) | 59.85 ± 0.12 (70)
ALLAML | 72.08 ± 1.62 (all) | 76.39 ± 0.78 (20) | 86.11 ± 1.42 (80) | 81.94 ± 2.60 (60) | 70.90 ± 2.27 (30) | 84.65 ± 0.71 (30) | 70.83 ± 0.01 (20) | 76.25 ± 0.62 (20) | 76.32 ± 1.39 (90) | 89.24 ± 1.18 (20)
orlraws10P | 76.60 ± 6.17 (all) | 80.05 ± 5.49 (100) | 73.55 ± 6.60 (70) | 73.00 ± 3.93 (90) | 75.60 ± 5.08 (80) | 75.45 ± 4.68 (50) | 60.00 ± 2.87 (100) | 78.35 ± 5.32 (60) | 80.20 ± 4.81 (100) | 82.30 ± 4.23 (50)
COIL100 | 50.72 ± 1.30 (all) | 49.01 ± 1.21 (80) | 29.39 ± 0.61 (90) | 42.30 ± 1.09 (90) | 45.49 ± 1.12 (60) | 51.69 ± 1.27 (80) | 23.93 ± 0.74 (100) | 50.15 ± 1.32 (90) | 44.80 ± 0.93 (100) | 51.47 ± 1.18 (90)
Table 5. The clustering performance of compared methods on NMI ± STD% on 12 datasets.
Datasets | Baseline | MCFS | UDFS | SOGFS | EGCFS | VCSDFS | UFS2 | UDS2FS | RSPCA | DHBWSL
JAFFE | 85.43 ± 3.51 (all) | 88.88 ± 4.03 (100) | 86.06 ± 3.51 (80) | 83.58 ± 3.26 (90) | 81.48 ± 2.59 (30) | 88.44 ± 3.20 (80) | 72.25 ± 2.38 (100) | 85.79 ± 3.36 (30) | 71.38 ± 2.56 (100) | 90.48 ± 2.40 (70)
COIL20 | 76.69 ± 1.99 (all) | 74.42 ± 1.50 (60) | 65.33 ± 1.75 (100) | 62.12 ± 1.20 (100) | 73.36 ± 1.61 (100) | 75.84 ± 1.31 (80) | 51.27 ± 0.81 (90) | 76.60 ± 1.75 (80) | 69.56 ± 1.35 (100) | 77.67 ± 1.09 (90)
ORL | 72.83 ± 1.74 (all) | 72.61 ± 1.13 (100) | 72.09 ± 1.39 (40) | 70.88 ± 1.37 (30) | 73.13 ± 1.17 (100) | 73.34 ± 1.63 (60) | 57.60 ± 0.93 (100) | 72.06 ± 1.56 (50) | 66.07 ± 1.50 (90) | 74.77 ± 1.31 (40)
lung | 54.47 ± 2.84 (all) | 65.38 ± 6.55 (30) | 46.72 ± 1.36 (80) | 54.21 ± 2.54 (100) | 55.80 ± 2.69 (100) | 50.91 ± 1.58 (70) | 32.46 ± 1.26 (100) | 50.06 ± 1.47 (100) | 50.97 ± 2.24 (70) | 58.91 ± 0.89 (100)
Isolet | 76.06 ± 1.26 (all) | 67.43 ± 0.92 (90) | 56.84 ± 1.27 (100) | 64.37 ± 0.71 (100) | 64.43 ± 1.30 (100) | 67.74 ± 1.09 (50) | 43.56 ± 0.65 (90) | 68.11 ± 1.46 (100) | 70.81 ± 0.97 (100) | 77.75 ± 0.76 (100)
EYale B | 12.77 ± 0.52 (all) | 25.31 ± 0.48 (20) | 16.46 ± 0.47 (90) | 16.41 ± 0.56 (80) | 24.20 ± 0.60 (60) | 18.64 ± 0.41 (20) | 19.95 ± 0.95 (40) | 16.42 ± 0.50 (20) | 14.54 ± 0.24 (20) | 28.13 ± 0.48 (20)
TOX_171 | 14.98 ± 3.03 (all) | 13.03 ± 1.19 (20) | 23.61 ± 2.09 (50) | 22.20 ± 1.74 (60) | 22.85 ± 1.12 (80) | 19.40 ± 0.66 (20) | 17.66 ± 0.80 (30) | 16.54 ± 0.82 (60) | 11.00 ± 1.60 (100) | 27.16 ± 0.46 (50)
GLIOMA | 50.20 ± 1.60 (all) | 22.94 ± 5.08 (100) | 36.05 ± 4.99 (100) | 52.37 ± 3.62 (90) | 51.23 ± 1.53 (90) | 55.72 ± 3.37 (60) | 52.75 ± 1.91 (50) | 50.68 ± 2.39 (100) | 58.60 ± 4.82 (70) | 58.89 ± 2.40 (30)
RELATHE | 0.05 ± 0.02 (all) | 0.35 ± 0.01 (30) | 1.01 ± 0.09 (70) | 1.73 ± 0.95 (60) | 2.96 ± 0.75 (100) | 2.76 ± 0.49 (40) | 0.79 ± 0.41 (40) | 1.43 ± 0.09 (50) | 5.54 ± 0.57 (40) | 7.14 ± 1.12 (60)
ALLAML | 13.33 ± 1.84 (all) | 18.63 ± 6.11 (30) | 41.97 ± 4.63 (80) | 34.99 ± 1.24 (20) | 11.22 ± 2.39 (100) | 36.68 ± 2.05 (30) | 10.94 ± 2.20 (90) | 15.58 ± 0.62 (20) | 28.40 ± 5.96 (50) | 47.42 ± 3.72 (20)
orlraws10P | 81.76 ± 4.70 (all) | 85.17 ± 4.34 (100) | 79.05 ± 3.36 (100) | 77.19 ± 2.37 (90) | 81.69 ± 3.14 (80) | 76.69 ± 3.04 (50) | 63.29 ± 2.13 (100) | 81.59 ± 3.00 (60) | 84.77 ± 3.04 (100) | 85.34 ± 2.48 (50)
COIL100 | 75.73 ± 0.34 (all) | 72.88 ± 0.6 (100) | 53.15 ± 0.53 (80) | 65.91 ± 0.37 (100) | 69.29 ± 0.36 (100) | 74.51 ± 0.54 (80) | 45.97 ± 0.46 (100) | 74.42 ± 0.50 (90) | 67.78 ± 0.40 (100) | 74.95 ± 6.30 (100)
Table 6. Computation time (seconds) of different methods on real-world datasets.
Datasets | Baseline | MCFS | UDFS | SOGFS | EGCFS | VCSDFS | UFS2 | UDS2FS | RSPCA | DHBWSL
JAFFE | 0.06 | 0.86 | 1.48 | 4.40 | 1.12 | 0.38 | 1.59 | 1.52 | 0.73 | 1.32
COIL20 | 0.47 | 6.64 | 13.94 | 96.53 | 19.71 | 1.28 | 14.82 | 5.85 | 1.38 | 10.91
ORL | 0.18 | 2.03 | 2.79 | 7.96 | 3.51 | 1.24 | 3.37 | 3.13 | 1.19 | 4.18
lung | 2.75 | 6.04 | 68.34 | 2408.09 | 93.14 | 8.96 | 5.24 | 41.52 | 27.92 | 30.28
Isolet | 0.46 | 7.07 | 13.41 | 23.73 | 18.86 | 1.40 | 11.24 | 1.68 | 0.77 | 8.89
EYale B | 1.45 | 3.26 | 24.92 | 65.17 | 58.57 | 1.43 | 46.04 | 2.59 | 1.68 | 34.10
TOX_171 | 25.18 | 1.28 | 873.81 | 2834.97 | 1242.36 | 27.17 | 17.83 | 180.34 | 135.95 | 405.69
GLIOMA | 12.39 | 0.89 | 607.04 | 2314.70 | 597.92 | 16.81 | 3.72 | 87.42 | 36.44 | 245.35
RELATHE | 11.42 | 11.86 | 418.58 | 944.73 | 579.20 | 16.57 | 37.48 | 143.23 | 62.56 | 338.26
ALLAML | 47.20 | 6.75 | 1298.96 | 10,709.38 | 1145.44 | 132.98 | 7.40 | 453.89 | 149.60 | 243.76
orlraws10P | 163.67 | 15.68 | 2566.06 | 19,028.84 | 4159.70 | 87.63 | 22.62 | 805.71 | 451.26 | 1746.03
COIL100 | 45.42 | 31.35 | 320.69 | 962.12 | 2015.05 | 1.33 | 88.20 | 12.67 | 11.09 | 311.45
Table 7. The paired t-test result of ACC of DHBWSL and comparison algorithms on all datasets (each cell gives p-value / h).
Datasets | Baseline | MCFS | UDFS | SOGFS | EGCFS | VCSDFS | UFS2 | UDS2FS | RSPCA
JAFFE | 0.0033 / 1 | 4.9999 × 10^−9 / 1 | 1.1243 × 10^−11 / 1 | 7.6029 × 10^−16 / 1 | 0.0088 / 1 | 8.1090 × 10^−4 / 1 | 1.7584 × 10^−18 / 1 | 6.6927 × 10^−5 / 1 | 2.7192 × 10^−10 / 1
COIL20 | 0.0083 / 1 | 2.2206 × 10^−7 / 1 | 2.7762 × 10^−29 / 1 | 2.1609 × 10^−44 / 1 | 3.2358 × 10^−35 / 1 | 1.8528 × 10^−6 / 1 | 7.8998 × 10^−41 / 1 | 0.0109 / 1 | 2.0050 × 10^−17 / 1
ORL | 4.1188 × 10^−4 / 1 | 6.1222 × 10^−7 / 1 | 1.6053 × 10^−11 / 1 | 9.9364 × 10^−5 / 1 | 0.0046 / 1 | 0.2030 / 0 | 5.2170 × 10^−28 / 1 | 2.5978 × 10^−5 / 1 | 2.4171 × 10^−16 / 1
lung | 1.7868 × 10^−7 / 1 | 1.4484 × 10^−16 / 1 | 2.9665 × 10^−8 / 1 | 1.7777 × 10^−12 / 1 | 9.0574 × 10^−23 / 1 | 0.0356 / 1 | 3.4934 × 10^−19 / 1 | 1.8391 × 10^−6 / 1 | 5.7823 × 10^−10 / 1
Isolet | 5.9167 × 10^−28 / 1 | 2.1798 × 10^−14 / 1 | 2.7699 × 10^−25 / 1 | 1.9143 × 10^−36 / 1 | 1.4784 × 10^−35 / 1 | 1.4844 × 10^−11 / 1 | 5.9769 × 10^−40 / 1 | 2.9701 × 10^−7 / 1 | 7.7368 × 10^−19 / 1
EYale B | 1.6388 × 10^−29 / 1 | 1.0569 × 10^−8 / 1 | 1.4683 × 10^−28 / 1 | 2.1944 × 10^−41 / 1 | 2.6630 × 10^−32 / 1 | 9.4853 × 10^−31 / 1 | 1.6083 × 10^−26 / 1 | 4.4488 × 10^−24 / 1 | 4.5104 × 10^−35 / 1
TOX_171 | 3.8869 × 10^−6 / 1 | 0.0093 / 1 | 1.7762 × 10^−6 / 1 | 4.5149 × 10^−16 / 1 | 5.0969 × 10^−11 / 1 | 4.9939 × 10^−9 / 1 | 6.9265 × 10^−13 / 1 | 5.2356 × 10^−9 / 1 | 0.0219 / 1
GLIOMA | 4.3778 × 10^−13 / 1 | 7.9113 × 10^−26 / 1 | 7.3934 × 10^−22 / 1 | 7.5027 × 10^−14 / 1 | 2.8855 × 10^−23 / 1 | 9.1073 × 10^−17 / 1 | 2.4122 × 10^−16 / 1 | 6.6404 × 10^−8 / 1 | 7.1229 × 10^−18 / 1
RELATHE | 0.0371 / 1 | 5.2259 × 10^−45 / 1 | 1.9256 × 10^−81 / 1 | 4.8465 × 10^−65 / 1 | 1.2509 × 10^−74 / 1 | 6.0813 × 10^−7 / 1 | 3.2163 × 10^−65 / 1 | 5.9863 × 10^−10 / 1 | 0.0069 / 1
ALLAML | 2.0298 × 10^−37 / 1 | 6.0648 × 10^−20 / 1 | 2.9455 × 10^−51 / 1 | 1.3964 × 10^−49 / 1 | 1.1054 × 10^−9 / 1 | 5.1027 × 10^−16 / 1 | 5.5017 × 10^−34 / 1 | 5.9833 × 10^−64 / 1 | 2.2404 × 10^−30 / 1
orlraws10P | 5.5865 × 10^−18 / 1 | 1.5253 × 10^−21 / 1 | 5.3369 × 10^−13 / 1 | 1.6683 × 10^−17 / 1 | 1.3257 × 10^−15 / 1 | 1.1109 × 10^−11 / 1 | 2.3574 × 10^−15 / 1 | 1.2112 × 10^−19 / 1 | 9.2441 × 10^−18 / 1
COIL100 | 1.9101 × 10^−21 / 1 | 1.0699 × 10^−15 / 1 | 1.1474 × 10^−45 / 1 | 3.6026 × 10^−13 / 1 | 5.8528 × 10^−26 / 1 | 3.1366 × 10^−23 / 1 | 4.8302 × 10^−49 / 1 | 1.1606 × 10^−14 / 1 | 3.6670 × 10^−32 / 1
Table 8. The paired t-test result of NMI of DHBWSL and comparison algorithms on all datasets (each cell gives p-value / h).
Datasets | Baseline | MCFS | UDFS | SOGFS | EGCFS | VCSDFS | UFS2 | UDS2FS | RSPCA
JAFFE | 5.0954 × 10^−9 / 1 | 4.5287 × 10^−14 / 1 | 1.2985 × 10^−18 / 1 | 2.1976 × 10^−25 / 1 | 4.3009 × 10^−7 / 1 | 1.7813 × 10^−5 / 1 | 4.0952 × 10^−18 / 1 | 5.8906 × 10^−10 / 1 | 3.4081 × 10^−16 / 1
COIL20 | 0.9210 / 0 | 1.8565 × 10^−11 / 1 | 8.3184 × 10^−35 / 1 | 7.7725 × 10^−54 / 1 | 1.3827 × 10^−42 / 1 | 1.5602 × 10^−12 / 1 | 2.1203 × 10^−48 / 1 | 0.1891 / 0 | 1.5006 × 10^−26 / 1
ORL | 1.3472 × 10^−6 / 1 | 3.8923 × 10^−6 / 1 | 4.2445 × 10^−17 / 1 | 7.9821 × 10^−9 / 1 | 1.1772 × 10^−5 / 1 | 0.1182 / 0 | 1.1551 × 10^−37 / 1 | 6.1003 × 10^−6 / 1 | 9.6748 × 10^−23 / 1
lung | 2.4998 × 10^−18 / 1 | 1.4410 × 10^−20 / 1 | 2.0524 × 10^−8 / 1 | 2.4254 × 10^−14 / 1 | 2.1454 × 10^−24 / 1 | 3.2489 × 10^−7 / 1 | 1.5531 × 10^−20 / 1 | 3.3657 × 10^−6 / 1 | 6.0400 × 10^−16 / 1
Isolet | 6.9752 × 10^−38 / 1 | 5.3449 × 10^−22 / 1 | 3.7346 × 10^−36 / 1 | 5.3022 × 10^−43 / 1 | 8.5538 × 10^−47 / 1 | 7.4706 × 10^−8 / 1 | 2.4377 × 10^−53 / 1 | 0.0243 / 1 | 7.7473 × 10^−30 / 1
EYale B | 7.9018 × 10^−44 / 1 | 7.7974 × 10^−7 / 1 | 4.9514 × 10^−35 / 1 | 4.1198 × 10^−45 / 1 | 6.5993 × 10^−40 / 1 | 7.0041 × 10^−42 / 1 | 2.5865 × 10^−28 / 1 | 3.1037 × 10^−37 / 1 | 1.0691 × 10^−47 / 1
TOX_171 | 1.2300 × 10^−15 / 1 | 1.0453 × 10^−4 / 1 | 0.0275 / 1 | 1.8397 × 10^−22 / 1 | 1.1552 × 10^−26 / 1 | 9.4111 × 10^−18 / 1 | 6.4780 × 10^−21 / 1 | 4.0051 × 10^−12 / 1 | 0.0073 / 1
GLIOMA | 1.5787 × 10^−4 / 1 | 2.1801 × 10^−31 / 1 | 1.1535 × 10^−32 / 1 | 3.4468 × 10^−14 / 1 | 7.5197 × 10^−34 / 1 | 4.1431 × 10^−7 / 1 | 1.7450 × 10^−20 / 1 | 0.0239 / 1 | 5.2694 × 10^−6 / 1
RELATHE | 0.0153 / 1 | 1.1618 × 10^−74 / 1 | 1.0218 × 10^−93 / 1 | 1.2836 × 10^−55 / 1 | 5.8358 × 10^−90 / 1 | 2.2538 × 10^−7 / 1 | 3.5715 × 10^−53 / 1 | 6.3098 × 10^−8 / 1 | 5.7191 × 10^−13 / 1
ALLAML | 2.4430 × 10^−31 / 1 | 5.4551 × 10^−19 / 1 | 4.3531 × 10^−39 / 1 | 1.8018 × 10^−62 / 1 | 5.2290 × 10^−5 / 1 | 1.7163 × 10^−11 / 1 | 1.3913 × 10^−24 / 1 | 8.8682 × 10^−95 / 1 | 1.7695 × 10^−25 / 1
orlraws10P | 2.8720 × 10^−22 / 1 | 2.2141 × 10^−25 / 1 | 2.8931 × 10^−17 / 1 | 2.6862 × 10^−18 / 1 | 2.1888 × 10^−17 / 1 | 3.8995 × 10^−14 / 1 | 5.8639 × 10^−6 / 1 | 1.4837 × 10^−23 / 1 | 9.1751 × 10^−21 / 1
COIL100 | 2.1942 × 10^−40 / 1 | 7.4485 × 10^−31 / 1 | 1.7060 × 10^−56 / 1 | 3.3267 × 10^−28 / 1 | 2.7279 × 10^−38 / 1 | 5.0523 × 10^−33 / 1 | 1.9495 × 10^−60 / 1 | 2.0572 × 10^−33 / 1 | 9.8478 × 10^−42 / 1
Figure 2. The ACC of all the algorithms for selecting different numbers of features on the 12 datasets.
Figure 3. The NMI of all the algorithms for selecting different numbers of features on the 12 datasets.

4.5. Visualization on Fashion MNIST

The Fashion MNIST dataset encompasses 70,000 grayscale images distributed across 10 distinct categories. Specifically, it consists of a training set with 60,000 samples and a test set comprising 10,000 samples, where each sample is a 28 × 28 grayscale image. In this experiment, we use the images from the test set as the samples for testing. In the following experiments, we verify the interpretability of the feature selection task on the Fashion MNIST dataset by visualizing the feature subsets derived from the feature selection methods using t-SNE. We compare the Baseline, UDS2FS, and DHBWSL methods. For the Baseline method, the entire set of features is used as the feature subset to represent the original dataset, whereas for both UDS2FS and DHBWSL we select the top 100 features as the feature subsets.
As the experimental results in Figure 4a,b show, the inter-class distance is very small, indicating that neither the Baseline nor the UDS2FS method can effectively distinguish between different classes. In contrast, Figure 4c demonstrates that our DHBWSL successfully reduces the intra-class distance while enhancing the inter-class distance. Especially when the coordinate scale is the same as that in Figure 4a, the overall spatial structure of DHBWSL remains consistent compared to Baseline. This further confirms that DHBWSL effectively considers higher-order relationships among data points, thereby more accurately preserving the local structure of the original data. Concurrently, it minimizes within-class scatter, maximizes inter-class distances, and maintains the spatial geometry of the data while selecting discriminative features.
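The t-SNE visualization described above can be reproduced along the lines of the following sketch (an assumption-laden illustration, not the authors' code); `selected_idx` stands for the indices of the top-100 features returned by a feature selection method such as DHBWSL.

```python
# t-SNE embedding of samples restricted to a selected feature subset.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(X, y, selected_idx=None, title=""):
    X_sub = X if selected_idx is None else X[:, selected_idx]   # X: samples x features
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X_sub)
    plt.scatter(emb[:, 0], emb[:, 1], c=y, s=3, cmap="tab10")
    plt.title(title)
    plt.show()
```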

4.6. Two-Moon Dataset and Noise Test

In this segment, we assess the local learning ability of DHBWSL using a synthetic dataset, specifically the Two-moon dataset. To ensure a fair evaluation, the neighbor count is fixed at 20. As depicted in Figure 5a, the original Two-moon dataset features two distinct classes, represented by red and blue data points, each comprising 90 samples, with a noise level of 0.12% incorporated. Figure 5b presents the visualization of the similarity graph constructed via the K-nearest neighbor (KNN) approach. Notably, there are instances where multiple lines link data points from separate categories. This implies that certain data points have nearest neighbors belonging to different classes, underscoring the adverse impact of noise features on the sample’s similarity structure. Such noise compromises the chart’s reliability. Conversely, Figure 5c reveals a starkly clear demarcation between the two distinct categories, the red and blue data points. This clarity indicates that DHBWSL’s clustering performance is nearly on par with that of the pristine Two-moon dataset. The rationale behind this outcome is DHBWSL’s capability to dynamically learn a high-quality similarity graph. This ensures that the resultant graph is exclusively composed of the two categories, effectively eradicating any connecting lines between the red and blue data points.
We conduct noise tests to further confirm the effectiveness of DHBWSL, with four noisy datasets generated by adding Gaussian noise with variances of 15 and 25 to ORL and variances of 0.1 and 0.2 to the COIL20 dataset. These noises are randomly added to the original image as shown in Figure 6b,c,e,f. In Figure 6c,f, significant blurring effects on facial and image features can be observed. As the clustering results depicted in Table 9 show, it is evident that under various noise conditions, DHBWSL consistently exhibits superiority to other compared methods. This further validates the enhanced robustness of DHBWSL in its ability to learn a more stable dual high-order graph structure.
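The noisy datasets can be generated along the lines of the sketch below (an illustration; the paper's exact noise-generation code is not shown), adding zero-mean Gaussian noise of a chosen variance to the image data.

```python
# Add zero-mean Gaussian noise with a given variance to a data matrix.
import numpy as np

def add_gaussian_noise(X, variance, seed=0):
    rng = np.random.default_rng(seed)
    return X + rng.normal(0.0, np.sqrt(variance), size=X.shape)

# e.g., variance 15 or 25 for ORL and 0.1 or 0.2 for COIL20, as in the paper
```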

4.7. Ablation Study

In this section, we conducted ablation experiments to validate the effectiveness of dual high-order graph regularization and the 2,1-norm constraint. We considered two specific variants of DHBWSL, denoted as DHBWSL-1 and DHBWSL-2. For DHBWSL-1, we set the 2,1-norm parameter λ = 0 in the objective function (9). For DHBWSL-2, we set the dual high-order graph regularization parameters γ = 0 and β = 0 in the objective function (9). The results are shown in Figure 7 and Figure 8. It can be observed from the figures that DHBWSL achieves higher clustering accuracy across all datasets compared to DHBWSL-1 and DHBWSL-2. Regarding clustering performance, DHBWSL-1 performs similarly to DHBWSL, while DHBWSL-2 exhibits the lowest performance. Therefore, we conclude that in our approach, dual high-order graph regularization has more influence than the sparse 2,1-norm constraint on the clustering performance, which is attributed to the fact that dual high-order graph regularization preserves the local geometric structure in the data space and feature space.

4.8. Parameters Sensitivity Analysis

There are four parameters in DHBWSL, i.e., the orthogonality constraint parameter δ, set to 1, and three adjustable regularization parameters: the dual-graph regularization parameters γ and β, and the sparse constraint parameter λ. From the ablation analysis in Section 4.7, the dual-graph regularization term contributes significantly to clustering performance. Therefore, we primarily conducted sensitivity experiments on the parameters γ and β. For convenience, we fixed the number of selected features at 100 and employed a grid search to adjust the values of γ and β within the range {10−4, 10−3, 10−2, 10−1, 100, 101, 102, 103, 104}, while keeping λ = 1 and δ = 1 fixed.
From Figure 9 and Figure 10, it can be observed that the ACC and NMI values on the 12 datasets vary little as the parameters change. This implies that DHBWSL is not highly sensitive to γ and β and remains relatively stable over a wide range of parameter values. Suitable ranges for γ and β are [10⁻⁴, 10⁻²] and [10², 10⁴], respectively; within these ranges, good performance is easy to obtain, suggesting that DHBWSL can consistently deliver effective results under various parameter settings.

4.9. Convergence Analysis and Computational Performance

The convergence of DHBWSL is evaluated empirically in Figure 11, which shows the convergence curves on the 12 public datasets with the y-axis indicating the objective function value. As the number of iterations grows, the objective value drops rapidly and converges within five iterations, consistent with the theoretical guarantee in Lemma 1.
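In practice, such convergence can be monitored by the relative change of the objective value between consecutive iterations, as in the sketch below; update_variables and objective are hypothetical stand-ins for the alternating updates and the objective of Equation (14).

```python
def optimize(update_variables, objective, state, max_iter=100, tol=1e-6):
    """Iterate until the relative decrease of the objective falls below `tol`."""
    history = [objective(state)]
    for _ in range(max_iter):
        state = update_variables(state)        # one round of alternating updates
        history.append(objective(state))
        rel_change = abs(history[-2] - history[-1]) / max(abs(history[-2]), 1e-12)
        if rel_change < tol:                   # converged (typically < 5 iterations here)
            break
    return state, history                      # `history` traces the curves in Figure 11
```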
The computational complexity of the eight comparison algorithms and DHBWSL is summarized in Table 10, where c is the number of clusters, d is the dimensionality of the original features, l is the number of selected features, n is the total number of samples, k is the number of nearest neighbors, and t is the number of iterations. In our method, the cost of subspace dimensionality reduction is nd, the cost of constructing the similarity matrices WV and WU in the dual spaces is d² + n² + ld² + d²n, and the cost of sparse learning is dl. The overall computational complexity of DHBWSL is therefore O(t(nd + d² + n² + ld² + d²n + dl)).
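Collecting the per-iteration costs listed above, the total cost can be summarized as follows; the grouping is only a restatement of the terms already given in the text.

```latex
\mathcal{O}\!\left( t \left(
  \underbrace{nd}_{\text{subspace projection}}
  + \underbrace{d^{2} + n^{2} + l d^{2} + d^{2} n}_{\text{similarity matrices } W_V \text{ and } W_U}
  + \underbrace{dl}_{\text{sparse learning}}
\right) \right)
```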

5. Conclusions

This paper proposes an unsupervised feature selection method called DHBWSL. A new graph regularization term, dual high-order graph learning, is introduced to extract the geometric structure information inherent in both the data space and the feature space. DHBWSL integrates dual high-order graph learning with Boolean weights so as to strengthen its ability to explore local geometric structure through adaptive graph learning. Extensive experiments on 12 datasets validate the superior performance of DHBWSL compared with nine state-of-the-art feature selection methods.
Although DHBWSL performs well in our experiments, there is still room for improvement. Multi-view clustering, a well-known technique in data analysis, seeks to uncover consistent structural information in datasets by exploiting multiple data views. In future work, we plan to extend the strategies proposed in this paper to existing multi-view clustering methods to further enhance their performance. In addition, a limitation of DHBWSL is the need to tune three parameters, which can be time-consuming. We therefore aim to develop a DHBWSL variant that eliminates the need for parameter tuning, or to design a new optimization approach that optimizes all variables simultaneously.

Author Contributions

Y.W.: conceptualization, methodology, writing—original draft preparation, validation; J.M.: validation, supervision, writing—original draft preparation; Z.M.: visualization, supervision; Y.H.: resources, funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Basic Research Business of Central Universities of Northern University for Nationalities (No. 2023ZRLG02), the Special Fund for High School Scientific Research Project of Ningxia (No. NYG2024066), the National Natural Science Foundation of China (Nos. 62462001).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data and code that support the findings of this study are available from the corresponding author (J.M.) upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, C.; Wang, J.; Gu, Z.; Wei, J.; Liu, J. Unsupervised feature selection by learning exponential weights. Pattern Recognit. 2024, 148, 0031–3203. [Google Scholar] [CrossRef]
  2. Tang, C.; Wang, J.; Zheng, X.; Liu, X.; Xie, W.; Li, X. Spatial and Spectral Structure Preserved Self-Representation for Unsupervised Hyperspectral Band Selection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5531413. [Google Scholar] [CrossRef]
  3. Guo, Y.; Sun, Y.; Wang, Z.; Nie, F.; Wang, F. Double-Structured Sparsity Guided Flexible Embedding Learning for Unsupervised Feature Selection. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 13354–13367. [Google Scholar] [CrossRef]
  4. Wang, Z.; Li, Q.; Nie, F.; Wang, R.; Wang, F.; Li, X. Efficient Local Coherent Structure Learning via Self-Evolution Bipartite Graph. IEEE Trans. Cybern. 2024, 54, 4527–4538. [Google Scholar] [CrossRef]
  5. Lai, Z.; Chen, F.; Wen, J. Multi-view robust regression for feature extraction. Pattern Recognit. 2024, 149, 110219. [Google Scholar] [CrossRef]
  6. Niu, X.; Zhang, C.; Ma, Y.; Hu, L.; Zhang, J. A multi-view subspace representation learning approach powered by subspace transformation relationship. Knowl.-Based Syst. 2023, 277, 110816. [Google Scholar] [CrossRef]
  7. Wang, Z.; Wu, D.; Wang, R.; Nie, F.; Wang, F. Joint Anchor Graph Embedding and Discrete Feature Scoring for Unsupervised Feature Selection. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 7974–7987. [Google Scholar] [CrossRef] [PubMed]
  8. Wen, J.; Fang, X.; Cui, J.; Fei, L.; Yan, K.; Chen, Y.; Xu, Y. Robust Sparse Linear Discriminant Analysis. IEEE Trans. Circuits Syst. Video Technol. 2019, 29, 390–403. [Google Scholar] [CrossRef]
  9. Greenacre, M.; Groenen, P.; Hastie, T.; D’Enza, A.; Markos, A.; Tuzhilina, E. Principal component analysis. Nat. Rev. Methods Primers 2022, 2, 100. [Google Scholar] [CrossRef]
  10. Kokiopoulou, E.; Saad, Y. Orthogonal Neighborhood Preserving Projections: A Projection-Based Dimensionality Reduction Technique. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 2143–2156. [Google Scholar] [CrossRef]
  11. Wu, H.; Wu, N. When Locally Linear Embedding Hits Boundary. J. Mach. Learn. Res. 2023, 24, 1–80. Available online: https://jmlr.org/papers/v24/21-0697.html (accessed on 3 August 2023).
  12. Xu, S.; Muselet, D.; Trémeau, A. Sparse coding and normalization for deep Fisher score representation. Comput. Vis. Image Underst. 2022, 220, 103436. [Google Scholar] [CrossRef]
  13. Nie, F.; Xiang, S.; Jia, Y.; Zhang, C.; Yan, S. Trace Ratio Criterion for Feature Selection. In Proceedings of the AAAI Conference on Artificial Intelligence, Chicago, IL, USA, 13–17 July 2008; Available online: https://api.semanticscholar.org/CorpusID:11957383 (accessed on 3 August 2023).
  14. Chandra, B.; Sharma, R. Deep learning with adaptive learning rate using laplacian score. Expert Syst. Appl. 2016, 63, 1–7. [Google Scholar] [CrossRef]
  15. Cai, D.; Zhang, C.; He, X. Unsupervised feature selection for multi-cluster data. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 24–28 July 2010; Volume 10, pp. 333–342. [Google Scholar] [CrossRef]
  16. Wang, Z.; Nie, F.; Wang, H.; Huang, H.; Wang, F. Toward Robust Discriminative Projections Learning Against Adversarial Patch Attacks. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 18784–18798. [Google Scholar] [CrossRef] [PubMed]
  17. Wang, Z.; Nie, F.; Zhang, C.; Wang, R.; Li, X. Worst-Case Discriminative Feature Learning via Max-Min Ratio Analysis. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 641–658. [Google Scholar] [CrossRef]
  18. Yu, W.; Bian, J.; Nie, F.; Wang, R.; Li, X. Unsupervised Subspace Learning With Flexible Neighboring. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 2043–2056. [Google Scholar] [CrossRef]
  19. Wang, S.; Nie, F.; Wang, Z.; Wang, R.; Li, X. Outliers Robust Unsupervised Feature Selection for Structured Sparse Subspace. IEEE Trans. Knowl. Data Eng. 2024, 36, 1234–1248. [Google Scholar] [CrossRef]
  20. Wang, S.; Pedrycz, W.; Zhu, Q.; Zhu, W. Subspace learning for unsupervised feature selection via matrix factorization. Pattern Recognit. 2015, 48, 10–19. [Google Scholar] [CrossRef]
  21. Wang, S.; Pedrycz, W.; Zhu, Q.; Zhu, W. Unsupervised feature selection via maximum projection and minimum redundancy. Knowl.-Based Syst. 2015, 75, 19–29. [Google Scholar] [CrossRef]
  22. Zheng, W.; Yan, H.; Yang, J. Robust unsupervised feature selection by nonnegative sparse subspace learning. Neurocomputing 2019, 334, 156–171. [Google Scholar] [CrossRef]
  23. Wu, J.; Li, Y.; Gong, J.; Min, W. Collaborative and Discriminative Subspace Learning for unsupervised multi-view feature selection. Eng. Appl. Artif. Intell. 2024, 133, 108145. [Google Scholar] [CrossRef]
  24. Zhang, Y.; Wang, Q.; Gong, D.; Song, X. Nonnegative Laplacian embedding guided subspace learning for unsupervised feature selection. Pattern Recognit. 2019, 93, 337–352. [Google Scholar] [CrossRef]
  25. Mandanas, F.D.; Kotropoulos, C.L. Subspace Learning and Feature Selection via Orthogonal Mapping. IEEE Trans. Signal Process. 2020, 68, 1034–1047. [Google Scholar] [CrossRef]
  26. Wu, J.; Song, M.; Min, W.; Lai, J.; Zheng, W. Joint adaptive manifold and embedding learning for unsupervised feature selection. Pattern Recognit. 2021, 112, 107742. [Google Scholar] [CrossRef]
  27. Shang, R.; Wang, W.; Stolkin, R.; Jiao, L. Subspace learning-based graph regularized feature selection. Knowl.-Based Syst. 2016, 112, 152–165. [Google Scholar] [CrossRef]
  28. Shang, R.; Meng, Y.; Wang, W.; Shang, F.; Jiao, L. Local discriminative based sparse subspace learning for feature selection. Pattern Recognit. 2019, 92, 219–230. [Google Scholar] [CrossRef]
  29. Shang, R.; Xu, K.; Jiao, L. Subspace learning for unsupervised feature selection via adaptive structure learning and rank approximation. Neurocomputing 2020, 413, 72–84. [Google Scholar] [CrossRef]
  30. Wang, Z.; Yuan, Y.; Wang, R.; Nie, F.; Huang, Q.; Li, X. Pseudo-Label Guided Structural Discriminative Subspace Learning for Unsupervised Feature Selection. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 18605–18619. [Google Scholar] [CrossRef]
  31. Nie, F.; Wang, X.; Jordan, M.; Huang, H. The Constrained Laplacian Rank algorithm for graph-based clustering. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 8, pp. 1969–1976. [Google Scholar] [CrossRef]
  32. Fan, K. On a Theorem of Weyl Concerning Eigenvalues of Linear Transformations I. Proc. Natl. Acad. Sci. USA 1949, 35, 652–655. [Google Scholar] [CrossRef]
  33. Du, H.; Wang, Y.; Zhang, F.; Zhou, Y. Low-Rank Discriminative Adaptive Graph Preserving Subspace Learning. Neural Process. Lett. 2020, 52, 2127–2149. [Google Scholar] [CrossRef]
  34. Sheng, C.; Song, P.; Zhang, W.; Chen, D. Dual-graph regularized subspace learning based feature selection. Digit. Signal Process. 2021, 117, 103175. [Google Scholar] [CrossRef]
  35. Yin, W.; Ma, Z.; Liu, Q. Discriminative subspace learning via optimization on Riemannian manifold. Pattern Recognit. 2023, 139, 109450. [Google Scholar] [CrossRef]
  36. Liu, Z.; Ou, W.; Zhang, K.; Xiong, H. Robust manifold discriminative distribution adaptation for transfer subspace learning. Expert Syst. Appl. 2024, 238, 122117. [Google Scholar] [CrossRef]
  37. Feng, W.; Wang, Z.; Cao, X.; Cai, B.; Guo, W.; Ding, W. Discriminative sparse subspace learning with manifold regularization. Expert Syst. Appl. 2024, 249, 123831. [Google Scholar] [CrossRef]
  38. Tang, J.; Gao, Y.; Jia, S.; Feng, H. Robust clustering with adaptive order graph learning. Inf. Sci. 2023, 649, 119659. [Google Scholar] [CrossRef]
  39. Tang, C.; Zheng, X.; Zhang, W.; Liu, X.; Zhu, X.; Zhu, E. Unsupervised feature selection via multiple graph fusion and feature weight learning. Sci. China Inf. Sci. 2023, 66, 152101. [Google Scholar] [CrossRef]
  40. Xu, W.; Gong, Y. Document clustering by concept factorization. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Sheffield, UK, 25–29 July 2004; Volume 8, pp. 202–209. [Google Scholar] [CrossRef]
  41. Lee, D.; Seung, H. Algorithms for non-negative matrix factorization. In Proceedings of the 13th International Conference on Neural Information Processing Systems, Denver, CO, USA, 27 November–2 December 2000; Volume 7, pp. 535–541. Available online: https://dl.acm.org/doi/10.5555/3008751.3008829 (accessed on 3 August 2023).
  42. Cao, Z.; Xie, X. Structure learning with consensus label information for multi-view unsupervised feature selection. Expert Syst. Appl. 2024, 238, 121893. [Google Scholar] [CrossRef]
  43. Yang, B.; Xue, Z.; Wu, J.; Zhang, X.; Nie, F.; Chen, B. Anchor-graph regularized orthogonal concept factorization for document clustering. Neurocomputing 2024, 11, 127173. [Google Scholar] [CrossRef]
  44. Yang, Y.; Shen, H.T.; Ma, Z.; Huang, Z.; Zhou, X. ℓ2,1-norm regularized discriminative feature selection for unsupervised learning. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Barcelona, Spain, 16–22 July 2011; Volume 6, pp. 1589–1594. Available online: https://dl.acm.org/doi/10.5555/2283516.2283660 (accessed on 3 August 2023).
  45. Nie, F.; Zhu, W.; Li, X. Unsupervised feature selection with structured graph optimization. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 7, pp. 1302–1308. Available online: https://dl.acm.org/doi/10.5555/3015812.3016004 (accessed on 3 August 2023).
  46. Zhang, R.; Zhang, Y.; Li, X. Unsupervised Feature Selection via Adaptive Graph Learning and Constraint. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 1355–1362. [Google Scholar] [CrossRef]
  47. Karami, S.; Saberi-Movahed, F.; Tiwari, P.; Marttinen, P.; Vahdati, S. Unsupervised feature selection based on variance–covariance subspace distance. Neural Netw. 2023, 166, 188–203. [Google Scholar] [CrossRef]
  48. Chang, H.; Guo, J.; Zhu, W. Rethinking Embedded Unsupervised Feature Selection: A Simple Joint Approach. IEEE Trans. Big Data 2023, 9, 380–387. [Google Scholar] [CrossRef]
  49. Chen, K.; Peng, Y.; Nie, F.; Kong, W. Soft Label Guided Unsupervised Discriminative Sparse Subspace Feature Selection. J. Classif. 2024, 41, 129–157. [Google Scholar] [CrossRef]
  50. Bian, J.; Zhao, D.; Nie, F.; Wang, R.; Li, X. Robust and Sparse Principal Component Analysis with Adaptive Loss Minimization for Feature Selection. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 3601–3614. [Google Scholar] [CrossRef] [PubMed]
Figure 1. The framework of the DHBWSL method.
Figure 4. The 2D demonstration of the Fashion MNIST dataset.
Figure 5. Learned graphs on the Two-moon synthetic data.
Figure 6. Samples from ORL and COIL20 datasets with Gaussian noise with different variances.
Figure 7. Ablation experiment on DHBWSL and its variants in terms of ACC.
Figure 8. Ablation experiment on DHBWSL and its variants in terms of NMI.
Figure 9. The ACC of DHBWSL on 12 datasets under values of γ and β (λ = 1, δ = 1).
Figure 10. The NMI of DHBWSL on 12 datasets under values of γ and β (λ = 1, δ = 1).
Figure 11. Convergence curves of the DHBWSL algorithm under different iterations on 12 different datasets.
Table 1. The notations used in this paper.

Notation | Definition
X | The data matrix
H | The feature selection matrix
V | The coefficient matrix
n | The sample number
d | The feature quantity
l | The number of selected features
k | The number of nearest neighbors
c | The number of clusters
xi | The i-th row vector of matrix X
xj | The j-th column vector of matrix X
p_i^V | The feature Boolean variable
WV | The feature adjacency matrix
DV | The feature graph degree matrix
LV | The feature Laplacian matrix
p_i^U | The data Boolean variable
WU | The data adjacency matrix
DU | The data graph degree matrix
LU | The data Laplacian matrix
1 | A column vector with all 1s
Table 2. Differences between unsupervised feature selection methods based on subspace learning.

Methods | High-Order Graph Learning | Minimum Redundancy | Boolean Weight | Sparse Regularization | Orthogonal
MFFS [20] (2015) | × × × ×
MPMR [21] (2015) | × × ×
SGFS [27] (2016) | × × × ℓ2,1-norm
LDSSL [28] (2019) | × × × ℓ1-norm
LRDAGP [33] (2020) | × × × ℓ2,1-norm
DGSLFS [34] (2021) | × × × ℓ2,1-norm
USFN [18] (2023) | × × × ×
MODA [35] (2023) | × × × ×
RMDDA [36] (2024) | × × × F-norm
DSSL-MR [37] (2024) | × × × ℓ2,1-norm ×
DHBWSL | × ℓ2,1-norm
Table 3. Specific information for the 12 datasets.

Dataset | Instances | Features | Classes | Type
JAFFE | 213 | 676 | 10 | Face images
COIL20 | 1440 | 1024 | 20 | Object images
ORL | 400 | 1024 | 40 | Face images
lung | 203 | 3312 | 4 | Biological
Isolet | 1560 | 617 | 26 | Speech signal
EYale B | 2414 | 1024 | 38 | Face images
TOX_171 | 171 | 5748 | 4 | Biological
GLIOMA | 50 | 4434 | 4 | Biological
RELATHE | 1427 | 4322 | 2 | Text
ALLAML | 72 | 7129 | 2 | Biological
orlraws10P | 100 | 10,304 | 10 | Face images
COIL100 | 7200 | 1024 | 100 | Object images
Table 9. The ACC and NMI of DHBWSL on four datasets with different variance noises (ACC ± STD% and NMI ± STD%).

Accuracy (%)
Noise Datasets | MCFS | UDFS | SOGFS | EGCFS | VCSDFS | UFS2 | UDS2FS | RSPCA | DHBWSL
ORL (15 variance) | 51.94 ± 2.59 (100) | 45.84 ± 1.78 (40) | 49.89 ± 2.57 (100) | 53.31 ± 2.45 (80) | 50.99 ± 3.04 (90) | 33.85 ± 1.18 (100) | 50.66 ± 2.09 (90) | 43.84 ± 2.62 (100) | 53.36 ± 2.30 (100)
ORL (25 variance) | 44.74 ± 2.40 (100) | 45.01 ± 2.37 (100) | 44.79 ± 1.95 (100) | 49.45 ± 2.50 (90) | 46.90 ± 1.55 (90) | 31.54 ± 1.08 (100) | 43.87 ± 2.43 (90) | 41.30 ± 2.30 (100) | 49.14 ± 2.27 (100)
COIL20 (0.1 variance) | 64.91 ± 2.67 (60) | 62.22 ± 2.94 (100) | 61.76 ± 3.13 (100) | 62.10 ± 2.64 (80) | 67.45 ± 2.24 (80) | 40.18 ± 1.50 (100) | 67.81 ± 2.42 (50) | 59.37 ± 2.10 (100) | 69.19 ± 1.82 (80)
COIL20 (0.2 variance) | 64.32 ± 3.55 (100) | 62.82 ± 2.80 (90) | 62.99 ± 2.23 (100) | 62.41 ± 1.72 (80) | 65.94 ± 2.47 (50) | 37.56 ± 0.99 (100) | 68.33 ± 1.63 (90) | 58.76 ± 2.03 (100) | 69.51 ± 1.78 (90)

Normalized Mutual Information (%)
Noise Datasets | MCFS | UDFS | SOGFS | EGCFS | VCSDFS | UFS2 | UDS2FS | RSPCA | DHBWSL
ORL (15 variance) | 71.19 ± 1.39 (100) | 66.92 ± 0.99 (40) | 70.02 ± 1.09 (100) | 72.10 ± 0.84 (100) | 70.25 ± 1.77 (80) | 55.80 ± 1.15 (100) | 70.52 ± 1.37 (90) | 64.82 ± 1.32 (100) | 72.14 ± 1.06 (100)
ORL (25 variance) | 64.98 ± 1.61 (100) | 64.44 ± 0.97 (100) | 63.88 ± 1.17 (100) | 68.46 ± 0.93 (90) | 66.60 ± 1.15 (90) | 53.21 ± 1.03 (100) | 64.45 ± 1.42 (90) | 62.18 ± 1.25 (100) | 68.27 ± 1.66 (90)
COIL20 (0.1 variance) | 75.10 ± 0.86 (100) | 71.15 ± 1.28 (100) | 72.15 ± 1.07 (100) | 72.72 ± 1.16 (70) | 76.28 ± 0.95 (80) | 50.94 ± 0.90 (100) | 76.80 ± 1.40 (50) | 69.30 ± 1.36 (100) | 77.41 ± 1.17 (100)
COIL20 (0.2 variance) | 73.97 ± 1.52 (100) | 70.87 ± 1.27 (90) | 71.57 ± 1.02 (80) | 71.45 ± 1.11 (90) | 73.76 ± 1.30 (80) | 46.87 ± 0.77 (100) | 76.10 ± 0.96 (90) | 68.90 ± 1.18 (100) | 76.49 ± 1.26 (80)
Table 10. Computational complexity analysis.

Algorithm | Computational Complexity
MCFS | O(t(dn² + cl³ + cnl² + d log d))
UDFS | O(t(d³))
SOGFS | O(t(d³ + n³))
EGCFS | O(t(n³ + dn + nl))
VCSDFS | O(t(d²))
UFS2 | O(t(ncd + ld))
UDS2FS | O(t(d²lcn + dn² + ln²))
RSPCA | O(t(nd + max(dlk, d log d, l log l, l³)))
DHBWSL | O(t(nd + d² + n² + ld² + d²n + dl))