Article

Unsupervised Feature Selection with Latent Relationship Penalty Term

1 School of Mathematics and Information Science, North Minzu University, Yinchuan 750030, China
2 School of Mathematics and Computer Application, Shangluo University, Shangluo 726000, China
* Author to whom correspondence should be addressed.
Axioms 2024, 13(1), 6; https://doi.org/10.3390/axioms13010006
Submission received: 16 November 2023 / Revised: 14 December 2023 / Accepted: 18 December 2023 / Published: 21 December 2023

Abstract

With the exponential growth of high-dimensional unlabeled data, unsupervised feature selection (UFS) has attracted considerable attention due to its excellent performance in machine learning. Existing UFS methods implicitly assign the same attribute score to each sample, which disregards the distinctiveness of features and weakens the clustering performance of UFS methods to some extent. To alleviate these issues, a novel UFS method is proposed, named unsupervised feature selection with latent relationship penalty term (LRPFS). Firstly, latent learning is innovatively designed by explicitly assigning an attribute score to each sample according to its unique importance in the clustering results. With this strategy, inevitable noise interference can be removed effectively while retaining the intrinsic structure of the data samples. Secondly, an appropriate sparse model is incorporated into the penalty term to further optimize its roles as follows: (1) It imposes potential constraints on the feature matrix to guarantee the uniqueness of the solution. (2) The interconnection between data instances is established through pairwise relationships. Extensive experiments on benchmark datasets demonstrate that the proposed method is superior to relevant state-of-the-art algorithms, with an average improvement of 10.17% in terms of accuracy.

1. Introduction

With the explosive growth of data and information, dimensionality reduction techniques have become a crucial step in machine learning and data mining [1,2]. The primary dimensionality reduction techniques include nonnegative matrix factorization (NMF) [3], principal component analysis (PCA) [4], locally linear embedding (LLE) [5], and feature selection (FS) [6]. These techniques accelerate model learning and enhance clustering performance and prediction accuracy. Typically, feature selection is an effective strategy for dimension reduction due to its ability to remove unimportant or meaningless features, which enhances the interpretability of the resulting models. Consequently, these approaches have been applied in various fields such as gene expression analysis [7], image processing [8], natural language processing [9], and others.
According to the availability of data labels [10], feature selection methods can be categorized into supervised feature selection methods (SFS) [11,12], semi-supervised feature selection methods (SSFS) [13,14], and unsupervised feature selection methods (UFS) [15,16]. Supervised and semi-supervised feature selection methods can identify discriminative features by effectively mining the latent information in labeled data. In practice, however, acquiring labeled data is extremely time-consuming, and in some cases the labels themselves are unreliable. Unsupervised feature selection methods address this challenge by identifying discriminative features without labeled data. As a result, they have garnered increasing attention in recent years. Generally, the strength of unsupervised feature selection methods lies in their ability to achieve satisfactory results by assessing the importance of features based on appropriate criteria. However, there are several key factors that unsupervised feature selection methods need to emphasize.
Firstly, in unsupervised scenarios, it is crucial to design various strategies for extracting pseudo-label information to compensate for the absence of labeled data. Generally, pseudo-label information in UFS is obtained through spectral regression [17,18]. In a UFS algorithm, the essence of spectral regression is to construct a similarity measure that is employed to learn pseudo-label information such that each sample is more accurately assigned to its ground-truth class. For example, Zhao et al. [19] proposed a feature selection framework based on spectral regression that uses the spectrum of graphs to explore the intrinsic properties of features. Cai et al. [20] designed a two-step strategy that adopts spectral regression in the first step to retain the multi-cluster structure of the data and sparse regression in the second step to simplify calculations. With this two-step strategy, the potential correlation between different features can be well investigated, thereby efficiently selecting more discriminative features. In contrast, Hou et al. [21] utilized a one-step strategy by embedding sparse regression and spectral regression into a joint learning framework, whose clustering performance outperforms the two-step strategy in [20]. Unfortunately, the pseudo-label information generated by spectral analysis in UFS often suffers from negative values and an inaccurate similarity matrix due to noise interference, which remains a challenging issue in UFS.
Secondly, it is significant to explore the internal structure of the original data. The sparsity of high-dimensional data implies that such data contain manifold information; that is, the feature subset obtained by UFS should preserve this manifold information. To meet this requirement, a large number of relevant methods have been presented to investigate the internal structure of data in UFS. For example, He et al. [22] exploited local manifold information to evaluate the importance of features by calculating the Laplacian score (LapScor). Shang et al. [23] introduced a graph regularization term into the objective function of UFS and constructed a feature graph so that the feature vectors of the original data are consistent with the vectors in the coefficient matrix; thus, the feature subset preserves the manifold information learned through a similarity matrix. Liu et al. [24] constructed a loss term with an ℓ1-norm constraint to maintain the local geometric structure of the data through linear coefficients. To address the unreliability of constructing the similarity matrix in the above UFS methods, Nie et al. [25] proposed structured graph optimization to learn the similarity matrix adaptively, so that the manifold information is well preserved through a more satisfactory similarity matrix. Based on the method in [25], Li et al. [26] and Chen et al. [27] extended structured graph optimization. The former incorporated maximum entropy with the generalized uncorrelated regression model into the method of [25], in which local manifold information of the data is retained so that uncorrelated yet distinctive features are effectively selected. The latter derived a flexible optimal graph by leveraging a flexible low-dimensional manifold embedding mechanism to compensate for the unreliability of conventional UFS methods. However, the above methods ignore the dependency between data instances; that is, the interconnection information between data instances is not mined.
Thirdly, it is important to exploit the interconnection information between data instances, since such information is inherent in the data. Exploring it can effectively improve clustering performance and reduce the impact of inevitable noise. For this reason, Tang et al. [28] embedded latent representation learning into UFS, named LRLMR, in which latent representation learning learns an affinity matrix that carries correlation information between data instances. Based on LRLMR, Shang et al. [29] designed DSLRL, in which affinity matrices regarding interconnection information are constructed to fully investigate this information. The distinction between LRLMR and DSLRL lies in the fact that the former only considers the interconnection information among data instances, whereas the latter takes into account the interconnection information between both data instances and features.
The methods mentioned above can address the three issues respectively and effectively (i.e., designing strategies for extracting pseudo-label information to compensate for the absence of labeled data, exploring the internal structure of data instances, and exploiting interconnection information between data instances). However, the exploration of individual uniqueness in the existing literature remains an unresolved challenge. The reason may be that these methods generally assume that all samples are equally important, so that each sample is assigned the same attribute score. Such an assumption is unsuitable since it neglects diversity and particularity (i.e., the uniqueness of data instances). Specifically, three groups of face images illustrated in Figure 1a–c are respectively chosen from three classes in the ORL dataset [30]. From a human visual perspective, each individual in Figure 1 is unique, whether it belongs to the same class or a different class. Consequently, it is more reasonable to flexibly assign an attribute score to each individual according to its uniqueness. In other words, the scores of individuals within the same class tend to become more similar, while the scores of individuals from different classes tend to become more distinct. Therefore, emphasizing individual uniqueness can facilitate the effective distinction of individuals and eliminate the interference of redundant features to some extent. Although some researchers [31,32,33] have introduced the idea of LDA into UFS to minimize the within-class distance and maximize the between-class distance, they still ignore the learning of individual uniqueness.
To alleviate the issues outlined above, a novel embedded unsupervised feature selection method, called unsupervised feature selection with latent relationship penalty term (LRPFS), is proposed in this paper. The intention of LRPFS is to explore the data structure, pseudo-label information, interconnection information, and individual uniqueness. Specifically, (1) a novel method is developed to preserve the spatial data structure by quantifying sample distances in space via inner product relations; (2) the uniqueness of samples is evaluated based on their unique contributions to the whole and is determined through pairwise relationships based on the principles of latent representation learning and symmetric nonnegative matrix factorization. Meanwhile, these pairwise relationships also create connections between samples, enabling the subspace matrix to provide pseudo-label information.
The main contributions of this paper are summarized as follows:
  • A novel unsupervised feature selection with latent relationship penalty term is presented, which simultaneously performs an improved latent representation learning on the attribute scores of the subspace matrix and imposes a sparse term on the feature transformation matrix.
  • The latent relationship penalty term, an improved latent representation learning, is proposed. By constructing a novel affinity measurement strategy based on pairwise relationships and attribute scores of samples, this penalty term can exploit the uniqueness of samples to reduce interference from noise and ensure that the spatial structure of both the original data and the subspace data remains consistent, which is different from other existing models.
  • An optimization algorithm with guaranteed convergence is designed and extensive experiments are conducted: (1) LRPFS shows superior performance on publicly available datasets compared to other existing models in terms of clustering performance and a more remarkable capability of discriminative feature selection. (2) Experiments verify that LRPFS has fast convergence, short computation time, and significant performance by explicitly evaluating an attribute score for each individual to eliminate redundant features.
The remaining sections of this article are structured as follows. Section 2 reviews the concepts of latent representation learning and inner product space about feature selection. In Section 3, the latent relationship penalty term is introduced, and the LRPFS method is presented. Section 4 demonstrates the superiority of the proposed method over other advanced algorithms through experimental design and analyses the properties of the algorithm itself. Finally, Section 5 provides the conclusion and discusses future work.

2. Related Works

In this section, latent representation learning in feature selection is briefly reviewed. In addition, the inner product space and some notations are introduced.

2.1. Notations

The following notations are used in this paper. An arbitrary matrix is denoted as $A = [a_1, a_2, \ldots, a_n]^T \in \mathbb{R}^{n \times d}$, where $a_i$ denotes the $i$-th row of $A$ (a vector) and $a_{ij}$ denotes the element in the $i$-th row and $j$-th column of $A$. The F-norm of $A$ is defined as $\|A\|_F = (\sum_{i=1}^n \sum_{j=1}^d a_{ij}^2)^{1/2}$, and the ℓ2,1-norm of $A$ is defined as $\|A\|_{2,1} = \sum_{i=1}^n (\sum_{j=1}^d a_{ij}^2)^{1/2}$.
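To make the notation concrete, here is a minimal NumPy sketch (our own illustration, not part of the original implementation) that evaluates the two norms defined above on a random matrix:

```python
import numpy as np

# Minimal illustration of the F-norm and l2,1-norm defined above.
A = np.random.rand(5, 3)

fro_norm = np.sqrt(np.sum(A ** 2))                    # ||A||_F
l21_norm = np.sum(np.sqrt(np.sum(A ** 2, axis=1)))    # ||A||_{2,1}: sum of row l2-norms

assert np.isclose(fro_norm, np.linalg.norm(A, 'fro'))  # sanity check against NumPy
```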

2.2. Review of Feature Selection and Latent Representation Learning

Generally, feature selection methods are classified into three categories according to the strategy of feature evaluation: filter [22,34], wrapper [35,36], and embedded [37,38]. Filter feature selection evaluates each feature directly with a specific ranking criterion, e.g., variance, Laplacian score, feature similarity, or trace ratio, to select a feature subspace [39]. However, filter feature selection only considers each feature itself and ignores the interdependence between features [40]. In contrast, wrapper feature selection constructs a quasi-optimal subset of features by emphasizing feature combinations and correlations between features [41]. In most wrapper-based approaches, the algorithmic complexity tends to increase with the dimension of the data space, which may be computationally expensive. In addition, to achieve an effective feature subset, embedded feature selection methods integrate feature selection with model learning by adjusting feature priorities during the learning iterations. Compared to filter feature selection, embedded feature selection has gained more attention due to its superior performance, in that it reduces feature redundancy more effectively and is more robust; compared to wrapper feature selection, it reduces training time and cost and speeds up computation. Nevertheless, approaches of all three kinds have drawbacks such as premature convergence and convergence to local optima.
Traditional UFS methods assume that data are independently and identically distributed, whereas, in the real world, samples are correlated with each other. With the development of embedded feature selection, embedding latent representation learning into UFS has been widely applied to investigate the interconnection information between samples such that it can reveal the latent structure in the data [28,42,43]. Latent representation learning employs a model of symmetric nonnegative matrix factorization [44] to learn the interconnection information between samples. The specific representation is as follows:
$R = \|T - VV^T\|_F^2 = \sum_{i,j=1}^{n} \left( t_{ij} - v_i v_j^T \right)^2,$
where $T \in \mathbb{R}^{n \times n}$ is the affinity matrix containing interconnection information between samples [28], $V \in \mathbb{R}^{n \times f}$ represents the latent representation matrix, $n$ is the number of samples, and $f$ is the number of latent variables. The element $t_{ij}$ is defined as follows [29]:
$t_{ij} = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma_1^2} \right),$
where $i, j = 1, 2, \ldots, n$ and $\sigma_1$ is a bandwidth parameter. Latent representation learning is integrated into UFS, and the objective function is expressed as follows:
$O = \|XW - V\|_F^2 + \alpha \|T - VV^T\|_F^2, \quad \text{s.t. } V \geq 0,$
where α is the balance parameter, X R n × d is the data matrix, and W R d × f is the feature transformation matrix. The first term of the objective function is the loss function, which enables matrix X to approximate matrix V under the influence of matrix W . Consequently, matrix V can be considered as a subspace of X . Moreover, under the influence of latent representation learning, matrix V can be used as a pseudo-label matrix to guide feature selection. Through the application of latent representation learning, the performance and efficiency of these models can be enhanced in that there exists a degree of reduction in noise and redundant information.
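As a concrete reference, the following sketch (our own, with illustrative variable names) evaluates the affinity matrix of Equation (2) and the objective of Equation (3) for given factors $W$ and $V$:

```python
import numpy as np
from scipy.spatial.distance import cdist

def latent_representation_objective(X, W, V, alpha=1.0, sigma1=1.0):
    """Evaluate Equation (3): ||XW - V||_F^2 + alpha * ||T - VV^T||_F^2,
    where T is the Gaussian affinity matrix of Equation (2)."""
    T = np.exp(-cdist(X, X, 'sqeuclidean') / (2.0 * sigma1 ** 2))
    loss = np.linalg.norm(X @ W - V, 'fro') ** 2
    latent = np.linalg.norm(T - V @ V.T, 'fro') ** 2
    return loss + alpha * latent

# Toy usage: n samples, d features, f latent variables, nonnegative random factors.
n, d, f = 20, 10, 3
X, W, V = np.random.rand(n, d), np.random.rand(d, f), np.random.rand(n, f)
print(latent_representation_objective(X, W, V, alpha=0.5))
```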

2.3. Inner Product Space

The essence of feature selection is to find a suitable feature subset to represent the original data, which amounts to mapping data from a high-dimensional space to a low-dimensional space. Therefore, exploring the spatial structure has become an important issue in feature selection [1,45,46]. A space is constructed by aggregating several elements into a cohesive set and establishing a "relationship" or "structure" among the elements of this set. There are various spaces in mathematics, such as metric spaces, vector spaces, normed linear spaces, and inner product spaces. Among these, the inner product space adds a "structure" [47], i.e., the inner product, with which the angles and lengths of vectors can be discussed, and which possesses properties such as non-negativity, non-degeneracy, conjugate symmetry, and linearity in the first variable with conjugate linearity in the second. Therefore, in this paper, our objective is to explore samples in the inner product space and a framework that can potentially preserve the structure of the data space by transferring the structure of elements in the higher-dimensional space into the lower-dimensional space via inner product operations.

3. Methodology

In this section, a novel regularization term, known as the latent relationship penalty term, is presented and incorporated into the derivation of the LRPFS mechanism. Additionally, an optimization algorithm, convergence analysis, and computational complexity analysis for LRPFS are provided.

3.1. Latent Relationship Penalty Term

In practical applications, each individual exhibits distinct characteristics (i.e., uniqueness) that contribute to the whole model and that play specific roles in representing the data and establishing interconnections with the entire dataset. It is important to note that this individual uniqueness does not stem from irrelevant and redundant features, but rather from significant and informative features. The characteristics of these features play a crucial role in capturing the essence of the data and enabling meaningful representations. Therefore, our objective is to acquire a spatial subset that can accurately identify the significant features while preserving individual uniqueness. Based on this premise, a novel latent relationship penalty term built on Equation (1) is developed to exploit individual uniqueness while maintaining the inherent structural relationships within the data through pairwise associations. The construction of the latent relationship penalty term involves the following two main steps.

3.1.1. Preservation of Data Structures

The dataset $X \in \mathbb{R}^{n \times d}$ can be considered as a distribution of $n$ individuals $x_i$ ($i = 1, 2, \ldots, n$) within a $d$-dimensional linear space, and $\langle x_i, x_j \rangle$ reflects the inherent (inner product) relationship between two vectors. The subspace matrix $V \in \mathbb{R}^{n \times f}$ is designated to represent the underlying structure of the dataset $X$. Based on this, our objective is to preserve the approximate data structure between individuals in both the original space and the subspace by leveraging the inner product space metric. This is accomplished through the pairwise relationship of vector multiplication by ensuring that $\langle v_i, v_j \rangle \approx \langle x_i, x_j \rangle$. As a result, this pairing establishes a direct correspondence between the high-dimensional inner product space and the low-dimensional inner product space, as depicted in the following Equation (4):
$R_1 = \sum_{i,j=1}^{n} \left( \langle v_i, v_j \rangle - \lambda \langle x_i, x_j \rangle \right)^2 = \sum_{i,j=1}^{n} \left( v_i v_j^T - \lambda x_i x_j^T \right)^2 = \|VV^T - \lambda XX^T\|_F^2,$
where $\lambda$ is a scale parameter that regulates the scale relationship between the original data matrix $X$ and the subspace matrix $V$. Accounting for the presence of inherent noise in the original dataset, which frequently undermines the integrity of the data structure, our approach focuses on mitigating the influence of irrelevant and redundant features within each sample. To achieve this, the uniqueness of each sample should be exploited, thereby enhancing the preservation of the underlying data structure.

3.1.2. Exploring the Uniqueness of Individuals

To assess the uniqueness of individuals, the concept of an attribute score, denoted as $q$, is introduced. This score refers to the contribution of each individual to the overall dataset, taking into account its interrelationship with the entire sample set. Specifically, the attribute score $q_{ii}$ is defined as $\|s_i\|_1$, where $q_{ii}$ represents the score of the $i$-th sample $x_i$ and $s_i$ is the $i$-th row vector of the similarity matrix $S$. To construct the similarity matrix $S \in \mathbb{R}^{n \times n}$, a $k$-neighbourhood graph, denoted as $N_k$, is employed. The value of $k$ is set to 0 or 5, where a value of 0 corresponds to a complete graph. The similarity matrix is defined as follows:
$S_{ij} = \begin{cases} \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right), & \text{if } x_j \in N_k(x_i) \\ 0, & \text{otherwise} \end{cases} \quad i, j = 1, 2, 3, \ldots, n,$
where $\sigma$ is the width parameter [23]. The obtained score $q$ is introduced into Equation (4) such that $v_i \approx q_{ii} x_i$, resulting in the final expression for the latent relationship penalty term as follows.
$R_2 = \sum_{i,j=1}^{n} \left( v_i v_j^T - \lambda q_{ii} q_{jj} x_i x_j^T \right)^2.$
Classical latent representation learning in UFS typically employs Gaussian functions to measure interconnection information between samples, as demonstrated in Equations (1) and (2). In contrast, Equation (6) introduces a novel measurement approach by defining $t_{ij}$ as $\lambda q_{ii} q_{jj} x_i x_j^T$. The affinity matrix constructed in this way not only leverages the uniqueness of individuals as a prior condition to mitigate noise interference but also regulates the structural approximation consistency between the original space and the subspace by capturing pairwise relationships among samples.
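The construction above can be sketched as follows (our own code; it assumes the reading that $q_{ii} = \|s_i\|_1$ and uses illustrative parameter values):

```python
import numpy as np
from scipy.spatial.distance import cdist

def attribute_score_matrix(X, k=5, sigma=10.0):
    """Sketch of Equation (5) and the attribute scores: a k-neighbourhood Gaussian
    similarity matrix S (k = 0 means a complete graph) and Q = diag(||s_i||_1)."""
    D = cdist(X, X, 'sqeuclidean')
    S = np.exp(-D / (2.0 * sigma ** 2))
    if k > 0:
        keep = np.zeros_like(S, dtype=bool)
        nn = np.argsort(D, axis=1)[:, 1:k + 1]      # k nearest neighbours of each sample
        for i in range(X.shape[0]):
            keep[i, nn[i]] = True
        S = np.where(keep, S, 0.0)
    return np.diag(S.sum(axis=1))                   # q_ii = ||s_i||_1 (S is nonnegative)

def latent_relationship_penalty(X, V, Q, lam=1.0):
    """Matrix form of Equation (6): ||VV^T - lambda * Q X X^T Q^T||_F^2."""
    return np.linalg.norm(V @ V.T - lam * Q @ X @ X.T @ Q.T, 'fro') ** 2
```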

3.2. Objective Function

Aiming to incorporate Equation (6) into UFS such that the latent relationship penalty term imposes a potential constraint on the feature transformation matrix $W \in \mathbb{R}^{d \times f}$, the objective function is summarized as follows.
$(W^*, V^*) = \arg\min \sum_{i=1}^{n} \|x_i W - v_i\|_2^2 + \sum_{i,j=1}^{n} \left( v_i v_j^T - \lambda q_{ii} q_{jj} x_i x_j^T \right)^2, \quad \text{s.t. } W > 0,\ V > 0.$
To prevent the occurrence of trivial solutions, a potential constraint is imposed by the latent relationship penalty term; the detailed derivation is presented in Theorem 1. Furthermore, by imposing the latent relationship penalty term, the subspace matrix $V$ can act as a pseudo-label matrix within the UFS framework to guide the feature selection process. Our goal is to acquire a sparse feature transformation matrix $W$ to improve the efficiency of feature selection. To this end, the ℓ2,1-norm regularization term is introduced, which results in the final expression of the objective function as follows:
$(W^*, V^*) = \arg\min \sum_{i=1}^{n} \|x_i W - v_i\|_2^2 + \sum_{i,j=1}^{n} \left( v_i v_j^T - \lambda q_{ii} q_{jj} x_i x_j^T \right)^2 + \alpha \|W\|_{2,1}, \quad \text{s.t. } W > 0,\ V > 0,$
where $\alpha$ is a sparsity constraint parameter that adjusts the sparsity of $W$. The feature transformation matrix $W$ is obtained by optimizing the objective function, and the score of each feature is computed as $\|w_i\|_2$. The higher the score, the more important the feature. The top $l$ features, ranked by feature score in descending order, are selected to generate a new data matrix $X_{new}$.
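The final scoring and selection step can be expressed, under our notation, as a few lines of NumPy (an illustrative sketch, not the authors' code):

```python
import numpy as np

def select_features(X, W, l):
    """Rank the d features by the l2-norm of the corresponding row of W
    and keep the top-l columns of X, as described above."""
    scores = np.linalg.norm(W, axis=1)   # ||w_i||_2 for each feature
    idx = np.argsort(-scores)[:l]        # indices of the l highest-scoring features
    return X[:, idx], idx
```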

3.3. Optimization

To simplify the operation, all variables in the objective function of LRPFS are represented as matrices, and Equation (8) can be rewritten as follows.
$(W^*, V^*) = \arg\min \|XW - V\|_F^2 + \|VV^T - \lambda QXX^TQ^T\|_F^2 + \alpha \|W\|_{2,1}, \quad \text{s.t. } W > 0,\ V > 0,$
where $Q \in \mathbb{R}^{n \times n}$ is a diagonal matrix and $q_{ii}$ ($i = 1, 2, \ldots, n$) is the $i$-th diagonal element of $Q$. The model (9) is a nonconvex problem with respect to $W$ and $V$, so it is not practical to find the global optimal solution for both simultaneously. Nevertheless, the model is convex with respect to one variable when the other is fixed; therefore, it can be solved by alternately optimizing $W$ and $V$. The Lagrange function is constructed as follows.
$\mathcal{L} = \|XW - V\|_F^2 + \|VV^T - \lambda QXX^TQ^T\|_F^2 + \alpha \mathrm{Tr}(W^T U W) + \mathrm{Tr}(\varphi W) + \mathrm{Tr}(\psi V),$
where $\varphi$ and $\psi$ are the Lagrange multipliers for the nonnegativity constraints on $W$ and $V$, respectively, and $U \in \mathbb{R}^{d \times d}$ is a diagonal matrix. The $i$-th diagonal element $u_{ii}$ of $U$ is calculated as follows:
$u_{ii} = \frac{1}{2\|w_i\|_2}.$
To avoid overflow, a sufficiently small constant $\varepsilon$ is introduced, leading to the following rewriting of Equation (11):
$u_{ii} = \frac{1}{2\max(\|w_i\|_2, \varepsilon)}.$
(1) Fix $V$ and update $W$:
The partial derivative of the Lagrange function (10) with respect to $W$ is computed, resulting in the following expression:
$\frac{\partial \mathcal{L}}{\partial W} = 2X^TXW - 2X^TV + 2\alpha UW + \varphi.$
According to the Karush-Kuhn-Tucker (KKT) condition [48], the following iterative update formula for W can be derived.
$w_{ij} \leftarrow w_{ij} \frac{\left[ X^TV \right]_{ij}}{\left[ X^TXW + \alpha UW \right]_{ij}}.$
(2) Fix $W$ and update $V$:
Similarly to the optimization of $W$, the partial derivative of the Lagrange function (10) with respect to $V$ is taken, yielding the following result:
$\frac{\partial \mathcal{L}}{\partial V} = -2XW + 2V + 4VV^TV - 4\lambda QXX^TQ^TV + \psi.$
Following the Karush-Kuhn-Tucker (KKT) condition, the following iterative update formula for V is obtained.
$v_{ij} \leftarrow v_{ij} \frac{\left[ XW + 2\lambda QXX^TQ^TV \right]_{ij}}{\left[ V + 2VV^TV \right]_{ij}}.$
With the above analysis, Algorithm 1 summarizes the procedure of LRPFS. The pipeline of LRPFS is visualized in Figure 2.
Algorithm 1: LRPFS algorithm steps
1: Input: Data matrix $X \in \mathbb{R}^{n \times d}$; parameters $\alpha$, $\lambda$, $f$, and $k$; the maximum number of iterations $maxIter$;
2: Initialization: iteration counter $t = 0$; $W = \mathrm{rand}(d, f)$; $V = \mathrm{rand}(n, f)$; $U = \mathrm{eye}(d)$; construct the attribute score matrix $Q$;
3: while not converged do
4:    Update $W$ using $w_{ij} \leftarrow w_{ij} \frac{[X^TV]_{ij}}{[X^TXW + \alpha UW]_{ij}}$;
5:    Update $V$ using $v_{ij} \leftarrow v_{ij} \frac{[XW + 2\lambda QXX^TQ^TV]_{ij}}{[V + 2VV^TV]_{ij}}$;
6:    Update $U$ using $u_{ii} = \frac{1}{2\max(\|w_i\|_2, \varepsilon)}$;
7:    Update $t$ by $t = t + 1$, $t \leq maxIter$;
8: end while
9: Output: The feature transformation matrix $W$ and the subspace matrix $V$.
10: Feature selection: Calculate the scores of the $d$ features according to $\|w_i\|_2$ and select the first $l$ features with the highest scores.
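For reference, the following Python sketch mirrors Algorithm 1 under our reading of the update rules (14), (16), and (12); it is an illustration with our own variable names and small-value guards, not the authors' MATLAB implementation. The matrix $Q$ can come from the `attribute_score_matrix` sketch above, and the returned $W$ is then ranked by its row norms.

```python
import numpy as np

def lrpfs(X, Q, f, alpha=1.0, lam=1.0, max_iter=30, tol=1e-6, eps=1e-12):
    """Minimal sketch of Algorithm 1: multiplicative updates for W and V.
    X: n x d data matrix, Q: n x n diagonal attribute-score matrix, f: latent variables."""
    n, d = X.shape
    rng = np.random.default_rng(0)
    W, V, U = rng.random((d, f)), rng.random((n, f)), np.eye(d)
    QXXQ = Q @ X @ X.T @ Q.T                       # fixed during the iterations
    prev_obj = None
    for _ in range(max_iter):
        # Update W by rule (14); eps guards against division by zero.
        W *= (X.T @ V) / np.maximum(X.T @ X @ W + alpha * U @ W, eps)
        # Update V by rule (16).
        V *= (X @ W + 2.0 * lam * QXXQ @ V) / np.maximum(V + 2.0 * V @ V.T @ V, eps)
        # Update U by rule (12).
        U = np.diag(1.0 / (2.0 * np.maximum(np.linalg.norm(W, axis=1), eps)))
        # Objective (9) with early stopping on its relative change.
        obj = (np.linalg.norm(X @ W - V, 'fro') ** 2
               + np.linalg.norm(V @ V.T - lam * QXXQ, 'fro') ** 2
               + alpha * np.sum(np.sqrt(np.sum(W ** 2, axis=1))))
        if prev_obj is not None and abs(prev_obj - obj) / prev_obj < tol:
            break
        prev_obj = obj
    return W, V
```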
The potential constraint that the latent relationship penalty term embeds into the error function of LRPFS can be explained by the following Theorem 1:
Theorem 1.
Let $X \in \mathbb{R}^{n \times d}$, $V \in \mathbb{R}^{n \times c}$, and $W \in \mathbb{R}^{d \times c}$. Assume that $X$ has a left inverse, $XW \approx V$, and $VV^T \approx \lambda QXX^TQ^T$; then $WW^T \approx \lambda X_L QXX^TQ^T X_R$, where $\lambda$ is a scalar, $X_L$ is the left inverse of $X$, and $X_R$ is the right inverse of $X^T$.
Proof. 
Properties of one-sided inverse matrices [49]: If a matrix $M \in \mathbb{R}^{n \times d}$ has rank $\rho(M) = d$, then there exists a left inverse matrix $M_L \in \mathbb{R}^{d \times n}$ such that $M_L M = I_d$. Similarly, the matrix $M^T \in \mathbb{R}^{d \times n}$ has a right inverse matrix $M_R \in \mathbb{R}^{n \times d}$ such that $M^T M_R = I_d$, where $I_d \in \mathbb{R}^{d \times d}$ is the identity matrix.
According to the properties above, if $X$ has full column rank $\rho(X) = d$, then $X$ has a left inverse $X_L$ and $X^T$ has a right inverse $X_R$, and the proof proceeds as follows:
$VV^T \approx \lambda QXX^TQ^T$, (since $XW \approx V$)
$XWW^TX^T \approx \lambda QXX^TQ^T$,
$X_L XWW^TX^T \approx \lambda X_L QXX^TQ^T$,
$WW^TX^T \approx \lambda X_L QXX^TQ^T$,
$WW^TX^TX_R \approx \lambda X_L QXX^TQ^TX_R$,
$WW^T \approx \lambda X_L QXX^TQ^TX_R$.
The constraint on $WW^T$ can be derived as $WW^T \approx \lambda X_L QXX^TQ^TX_R$, indicating a latent constraint embedded on $W$. □
Theorem 1 addresses the scenario where $X$ has a left inverse. In practical applications, however, the matrix $X$ may not possess a left inverse, in which case only the approximation $XWW^TX^T \approx \lambda QXX^TQ^T$ is available. The final expressions in both situations relate the matrix $W$ to the known matrices $Q$ and $X$. Through this potential constraint, the latent relationship penalty term makes the generated $W$ more accurate and avoids trivial solutions.
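The key recovery step of the proof can be checked numerically; the following snippet (ours, under the full-column-rank assumption) verifies that $X_L (XWW^TX^T) X_R = WW^T$ exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c = 30, 8, 4
X = rng.standard_normal((n, d))     # full column rank with probability one
W = rng.random((d, c))

X_L = np.linalg.pinv(X)             # left inverse: X_L @ X = I_d
X_R = np.linalg.pinv(X.T)           # right inverse of X^T: X.T @ X_R = I_d
assert np.allclose(X_L @ X, np.eye(d))
assert np.allclose(X.T @ X_R, np.eye(d))

M = X @ W @ W.T @ X.T               # stands in for lambda * Q X X^T Q^T in the proof
assert np.allclose(X_L @ M @ X_R, W @ W.T)   # WW^T is recovered exactly
```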

3.4. Convergence Analysis

This subsection proves the convergence of LRPFS by demonstrating that under the update rules (14) and (16), the objective Function (8) is monotonically decreasing.
First, the following definition [45] is introduced, which provides theoretical assurance for the convergence of LRPFS.
Definition 1.
If there is a function $G(x, x')$ such that $F(x)$ satisfies:
$G(x, x') \geq F(x), \quad G(x, x) = F(x),$
then $F(x)$ is a nonincreasing function under the following update rule:
$x^{(t+1)} = \arg\min_x G(x, x^{(t)}),$
where $G(x, x')$ is an auxiliary function of $F(x)$.
Proof. 
$F(x^{(t+1)}) \leq G(x^{(t+1)}, x^{(t)}) \leq G(x^{(t)}, x^{(t)}) = F(x^{(t)}).$
To show that the objective function is monotonically decreasing with respect to $W$, the terms of the objective function containing $W$ are retained, giving:
$F(W) = \|XW - V\|_F^2 + \alpha \|W\|_{2,1}.$
Through the computation of the first-order and second-order partial derivatives of $F(W)$ with respect to $W$, the following expressions can be derived:
$F'_{ij} = \left[ 2X^TXW - 2X^TV + 2\alpha UW \right]_{ij},$
$F''_{ij} = \left[ 2X^TX + 2\alpha U \right]_{ii}.$
Lemma 1.
$G(W_{ij}, W_{ij}^{(t)}) = F_{ij}(W_{ij}^{(t)}) + F'_{ij}(W_{ij}^{(t)})(W_{ij} - W_{ij}^{(t)}) + \frac{\left[ X^TXW + \alpha UW \right]_{ij}}{W_{ij}^{(t)}}(W_{ij} - W_{ij}^{(t)})^2,$
where $G(W_{ij}, W_{ij}^{(t)})$ is the auxiliary function of $F_{ij}$. When $W_{ij} = W_{ij}^{(t)}$, $G(W_{ij}^{(t)}, W_{ij}^{(t)}) = F_{ij}(W_{ij}^{(t)})$.
Proof. 
The Taylor series expansion of $F_{ij}(W_{ij})$ is:
$F_{ij}(W_{ij}) = F_{ij}(W_{ij}^{(t)}) + F'_{ij}(W_{ij}^{(t)})(W_{ij} - W_{ij}^{(t)}) + \left[ X^TX + \alpha U \right]_{ii}(W_{ij} - W_{ij}^{(t)})^2.$
$G(W_{ij}, W_{ij}^{(t)}) \geq F_{ij}(W_{ij})$ is equivalent to:
$\frac{\left[ X^TXW + \alpha UW \right]_{ij}}{W_{ij}^{(t)}} \geq \left[ X^TX + \alpha U \right]_{ii}.$
Since:
$\left[ X^TXW + \alpha UW \right]_{ij} = \sum_{k} \left[ X^TX + \alpha U \right]_{ik} W_{kj}^{(t)} \geq \left[ X^TX + \alpha U \right]_{ii} W_{ij}^{(t)}.$
Thus the inequality $G(W_{ij}, W_{ij}^{(t)}) \geq F_{ij}(W_{ij})$ holds. □
Next, it is demonstrated that, in accordance with the iterative update rule (14), $F_{ij}$ is monotonically decreasing.
Proof. 
Substituting the auxiliary function (24) into $x^{(t+1)} = \arg\min_x G(x, x^{(t)})$ yields:
$W_{ij}^{(t+1)} = W_{ij}^{(t)} - W_{ij}^{(t)} \frac{F'_{ij}(W_{ij}^{(t)})}{2\left[ X^TXW + \alpha UW \right]_{ij}} = W_{ij}^{(t)} \frac{\left[ X^TV \right]_{ij}}{\left[ X^TXW + \alpha UW \right]_{ij}}.$
From the update rule of $W$, it can be seen that $F_{ij}$ monotonically decreases under update (14). The proof for the update rule of $V$ is similar to that of $W$, yielding update rule (16). Therefore, it can be concluded that $F_{ij}$ exhibits a monotonically decreasing trend, and the objective function of LRPFS converges. □

3.5. Computational Time Complexity

The time complexity of LRPFS consists of two main parts. The first part involves constructing the feature score matrix Q , which has a time complexity of O ( d n 2 ) . The second part involves iteratively optimizing the feature transformation matrix W and the subspace matrix V , with a calculation complexity of O ( n d 2 ) per iteration. Therefore, the total time complexity is O ( d n 2 + t n d 2 ) , where t is the number of iterations.

4. Experiments

In this section, the superiority of LRPFS is demonstrated through a series of experiments conducted on benchmark datasets. These experiments consist of two main parts: comparative experiments (Section 4.5) and LRPFS analysis experiments (Section 4.6). All experiments are implemented in MATLAB R2018b on a Windows machine with a 3.10-GHz i5-11300H CPU and 16 GB of main memory. The code of our proposed LRPFS is available at https://github.com/huangyulei1/LRPFS, accessed on 5 December 2023.

4.1. Datasets

The benchmark datasets include COIL20, Colon, Isolet, JAFFE, Yale64, PIE [29], nci9, PCMAC, Lung_dis, and TOX_171, downloaded from https://jundongl.github.io/scikit-feature/datasets.html (accessed on 3 August 2022) and https://www.face-rec.org/databases/ (accessed on 3 August 2022). Table 1 details these datasets.

4.2. Comparison Methods

Since LRPFS belongs to UFS, the comparison experiments are performed under unsupervised conditions, and the selected 10 state-of-the-art UFS methods are briefly described as follows.
Baseline: The method utilizes the original dataset as a feature subset for clustering.
LapScor [22]: A classical filter FS method to evaluate features by local preservation ability.
SPEC [19]: It selects feature subsets by utilizing spectral regression.
MCFS [20]: A two-step framework for selecting features is constructed by combining spectral regression and sparse regression.
UDFS [40]: Based on ℓ2,1-norm minimization and discriminant analysis ideas, a joint framework is imposed to guide the UFS.
SOGFS [25]: The method adaptively learns the local manifold structure and constructs a more accurate similarity matrix to select more discriminative features.
SRCFS [46]: The idea of collaboration and randomization of multiple subspaces in a high-dimensional space is introduced to select more discriminative features by exploring the capabilities of various subspaces.
RNE [24]: A method to preserve local geometric structure is constructed through a novel robust objective function.
inf-FSU [39]: It assigns a score for each feature through graph theory to obtain a feature subset.
S2DFS [38]: This method constructs a parameter-free UFS method based on the trace ratio criterion with ℓ2,0-norm constraint to maintain more feature discrimination power.

4.3. Evaluation Metrics

In our experiments, the performance of the proposed LRPFS is evaluated by using two evaluation metrics: clustering accuracy (ACC) and normalized mutual information (NMI) [50]. The values of these evaluation metrics range from 0 to 1, and the higher the value is, the better the performance of the algorithm is.
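For reproducibility, ACC with the usual Hungarian label matching and NMI can be computed as in the following sketch (our own harness; the paper's MATLAB evaluation code is not shown here):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Match predicted clusters to ground-truth classes with the Hungarian
    algorithm, then report the fraction of correctly assigned samples."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    contingency = np.array([[np.sum((y_pred == c) & (y_true == k)) for k in classes]
                            for c in clusters])
    rows, cols = linear_sum_assignment(-contingency)   # maximize the matched counts
    return contingency[rows, cols].sum() / y_true.size

# NMI is available directly from scikit-learn:
# nmi = normalized_mutual_info_score(y_true, y_pred)
```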

4.4. Experimental Settings

Concerning the parameter settings, the number of $k$-neighbours is set to 0 or 5 for the methods that require the construction of a similarity matrix, and the parameter $\sigma$ is fixed at 10 [23]. For the inf-FSU method, the parameter $\alpha$ is tuned in the range $\{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}\}$. For the RNE method, the parameter range is set according to ref. [24]. For our LRPFS method, both the scaling parameter $\lambda$ and the sparsity parameter $\alpha$ are searched in $\{10^{-4}, 10^{-3}, \ldots, 10^{3}, 10^{4}\}$, and the number of latent variables $f$ defaults to the number of classes of the dataset. For the remaining methods, the parameters are tuned in $\{10^{-4}, 10^{-3}, \ldots, 10^{3}, 10^{4}\}$. In the experimental process, the maximum number of iterations $maxIter$ is set to 30, and the iteration is terminated early when the objective function value satisfies $|Obj(t) - Obj(t-1)| / Obj(t-1) < 10^{-6}$, where $Obj(t)$ denotes the objective function value at the $t$-th iteration. The number of selected features $l$ varies in {20, 30, 40, 50, 60, 70, 80, 90, 100}. Since the result of k-means depends on the initialization, we repeat the clustering 20 times independently and report the means and standard deviations as the final results.
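A sketch of this tuning-and-evaluation protocol is given below (our own harness, not the authors' scripts, combining the earlier sketches `lrpfs`, `select_features`, and `clustering_accuracy`; it reports the mean ACC over 20 k-means runs for each (λ, α) pair):

```python
import itertools
import numpy as np
from sklearn.cluster import KMeans

GRID = [10.0 ** p for p in range(-4, 5)]   # {1e-4, ..., 1e4}

def tune_lrpfs(X, y, Q, f, l=100, n_repeats=20):
    """Grid-search lambda and alpha, cluster the selected features with k-means
    repeatedly, and keep the setting with the best mean ACC."""
    best_setting, best_acc = None, -np.inf
    for lam, alpha in itertools.product(GRID, GRID):
        W, _ = lrpfs(X, Q, f, alpha=alpha, lam=lam)
        X_new, _ = select_features(X, W, l)
        accs = [clustering_accuracy(y, KMeans(n_clusters=f, n_init=10).fit_predict(X_new))
                for _ in range(n_repeats)]
        if np.mean(accs) > best_acc:
            best_setting, best_acc = (lam, alpha), np.mean(accs)
    return best_setting, best_acc
```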

4.5. Comparison Experiment

The performance of LRPFS is compared with 10 state-of-the-art FS methods on 9 datasets, i.e., COIL20, Colon, Isolet, JAFFE, Lung_dis, nci9, PCMAC, PIE and TOX_171. Firstly, feature selection is performed to achieve feature subsets from the datasets. Secondly, k-means is applied to the feature subsets to derive clustering results. Finally, the results of these algorithms are evaluated by two metrics (i.e., ACC and NMI).
According to the above experimental settings, the clustering results (i.e., ACC, NMI) of LRPFS and the comparison methods on nine datasets are shown in Table 2 and Table 3, where the best results for each dataset are bolded, the second-best results are underlined and labelled with the number of selected features. Table 4 illustrates the running time of all algorithms on various datasets. According to these Tables, it can be seen that LRPFS outperforms all other comparison methods in terms of ACC, while in most cases, the NMI values of LRPFS are higher than other algorithms, which fully demonstrates the effectiveness of selecting discriminative features. The specific summary is as follows.
(1)
Overall, most of the UFS methods outperform baseline across a majority of datasets. This performance differential highlights the substantial superiority of these UFS methods in effectively eliminating irrelevant and redundant features.
(2)
The results presented in Table 2 and Table 3 indicate that our proposed method, LRPFS, achieves significant performance gains compared to other state-of-the-art techniques. Specifically, LRPFS shows a substantial increase in accuracy (ACC) of 32.26%, 30.65%, 30.57%, 33.87%, 24.84%, 25.81%, 29.04%, 24.2%, 14.76%, and 1.54%, respectively, compared with baseline, LapScor, SPEC, MCFS, UDFS, SOGFS, SRCFS, RNE, inf-FSU, and S2DFS. The main reason for this is that LRPFS excels at extracting the inherent information within the data structure and assigning unique attribute scores to individual samples. These distinct attributes contribute significantly to its outstanding performance, especially on the Colon dataset.
(3)
The results in Table 4 emphasize LRPFS's remarkable competitiveness in terms of computation time against a significant proportion of the compared algorithms. While the running time of LRPFS is slightly inferior to baseline, LapScor, and MCFS, it still exhibits excellent computational efficiency coupled with the highest clustering accuracy. This is particularly evident when comparing LRPFS with UDFS, SOGFS, RNE, inf-FSU, and S2DFS. Compared with baseline, the running time of LRPFS on some datasets with few samples, such as Colon, nci9, and TOX_171, is slower because the process of selecting discriminative features takes a certain amount of time. However, on datasets with many samples, such as PIE and Isolet, the running time of LRPFS is superior to baseline and the clustering accuracy is also significantly improved, which verifies the dimension reduction capability of LRPFS and supports its application to practical problems.
(4)
MCFS performs better than SPEC on Isolet, JAFFE, Lung_dis, PCMAC, and PIE since MCFS takes sparse regression into account in the FS model, which improves the learning ability of the model. Notably, S2DFS is slightly better than UDFS even though both share the idea of discriminant analysis. The reason may be that the trace ratio criterion framework with ℓ2,0-norm constraint in S2DFS plays a positive role.
(5)
RNE, SOGFS, and LRPFS exhibit commendable performance, affirming the importance of capturing the underlying manifold structure inherent in the data. Notably, SOGFS outperforms RNE across some datasets, especially on JAFFE, Lung_dis, nci9, and PIE. This distinction can be attributed to SOGFS’s incorporation of an adaptive graph mechanism, thereby engendering more precise similarity matrices. Unlike the aforementioned two techniques, LRPFS introduces the refinement of attribute scores to mitigate the harmful impact of noise while preserving the inherent data structure. This distinctive attribute underscores the superiority of LRPFS to a certain degree.
To further investigate the effect of the number of selected features on LRPFS, the clustering performance of various methods is illustrated upon different numbers of features as shown in Figure 3 and Figure 4, where the horizontal coordinate indicates the number of features selected according to the FS methods, the vertical coordinate denotes the clustering performance and the shaded section represents the error range of ACC and NMI. It can be explicitly observed that the curves of LRPFS are mostly uppermost, especially on Colon, PCMAC, and PIE, which achieves a satisfactory performance and demonstrates the superiority of LRPFS over other compared methods.
To verify the noise reduction ability of LRPFS, noise tests are conducted on the COIL20 dataset with random noise blocks of size 8 × 8, 12 × 12, and 16 × 16 added to each 32 × 32 sample, generating three synthetic datasets as shown in Figure 5b–d; the clustering results are shown in Table 5. It can be seen that LRPFS is superior to the other comparison methods under the influence of various noise levels and still achieves excellent performance. Especially in Figure 5b, it is extremely difficult to select significant features since most features of the images are blocked due to the excessive size of the noise; nevertheless, LRPFS with the latent relationship penalty term still achieves satisfactory results. For example, the ACC of LRPFS is 9.06% higher than that of RNE on the 16 × 16 noised COIL20 dataset. Consequently, LRPFS has a strong capability of identifying discriminative features and diminishing noise.
The feature subset obtained by each feature selection method on the COIL20 dataset is visualized using t-SNE. In our experiments, a comparative experiment is conducted with the baseline, S2DFS, and LRPFS methods. For the baseline method, all features are selected as the feature subset to represent the original dataset. For both S2DFS and LRPFS, the top 100 features are selected as feature subsets. The experimental results correspond to Figure 6a–c, respectively. It is obvious that in Figure 6a,b the inter-class distance in regions A and B is very small, which means that baseline and S2DFS fail to distinguish different classes clearly, whereas in Figure 6c our LRPFS succeeds in enlarging the distance between different classes. In particular, when the coordinate scales are the same as in Figure 6c, the overall spatial structure of LRPFS remains consistent compared to S2DFS, which further verifies that the latent relationship penalty term can exploit the uniqueness of the samples to maximize the inter-class distance while preserving the spatial structure of the data and selecting more discriminative features.
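The visualization step can be reproduced roughly as follows (an illustrative sketch using scikit-learn's t-SNE; `X_new` is assumed to hold the 100 selected features and `y` the class labels):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(X_new, y):
    """Embed the selected feature subset in 2-D with t-SNE and colour by class."""
    emb = TSNE(n_components=2, init='pca', random_state=0).fit_transform(X_new)
    plt.scatter(emb[:, 0], emb[:, 1], c=y, s=8, cmap='tab20')
    plt.title('t-SNE of the selected feature subset')
    plt.show()
```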

4.6. LRPFS Experimental Performance

In this subsection, to assess the efficiency of LRPFS, convergence, and parameter sensitivity experiments are conducted on nine benchmark datasets (i.e., COIL20, Colon, Isolet, JAFFE, Lung_dis, nci9, PCMAC, PIE, and TOX_171). Additionally, the feature selection performance of LRPFS is evaluated on the Yale64 dataset.

4.6.1. Convergence Analysis

To empirically demonstrate the convergence of LRPFS, convergence curves for nine datasets are depicted in Figure 7, where the horizontal axis represents the number of iterations and the vertical axis denotes the objective function values. From these plots, it is observed that the curves of the objective function exhibit significant and rapid variations, particularly in the Colon, Isolet, Lung_dis, and nci9 datasets, and convergence can be achieved within 15 iterations on all datasets. This observation serves as evidence that LRPFS achieves effective and stable convergence across all datasets, which further validates the correctness of the theoretical convergence proof.

4.6.2. Parameter Sensitivity Experiment

LRPFS involves the parameters $k$, $\sigma$, $\lambda$, and $\alpha$. Among them, $k$ and $\sigma$ are associated with constructing the sample attribute scores and have only an indirect and slight influence on the algorithm. Therefore, parameter sensitivity experiments are mainly conducted on $\lambda$ and $\alpha$. We fix $k = 0$, $\sigma = 10$, and vary $\lambda$ and $\alpha$ in $\{10^{-4}, 10^{-3}, 10^{-2}, \ldots, 10^{2}, 10^{3}, 10^{4}\}$. The 3-D grids of ACC and NMI under different parameter values on the test datasets are shown in Figure 8 and Figure 9. On most of the datasets, the values of ACC and NMI are positively correlated, and the clustering results are comparatively stable as the parameters vary. In particular, on datasets where the number of samples is larger than the number of features, such as COIL20 and PIE, $\lambda$ has a smaller impact on the clustering performance of LRPFS than $\alpha$. However, on Colon, nci9, and PCMAC, where the number of features is larger than the number of samples, LRPFS is more sensitive to $\lambda$. The reason may be that latent relationship penalty mining plays a more significant role in guiding feature selection when the number of sample features is larger. In a word, the parameters $\lambda$ and $\alpha$ both play an indispensable role in LRPFS, in that the feature selection mechanism of LRPFS performs efficiently under the combined influence of the latent relationship penalty and the sparsity constraint. Meanwhile, the experiment verifies that values of $\lambda$ and $\alpha$ in $\{10^{-4}, 10^{-3}, 10^{-2}, \ldots, 10^{2}, 10^{3}, 10^{4}\}$ are suitable for the 12 benchmark datasets, which provides a suitable parameter reference range for LRPFS in practical applications.

4.6.3. The Effectiveness Evaluation of Feature Selection

On the Yale64 dataset, the features selected by LRPFS are visualized. In our experiments, two samples are randomly selected from the Yale64 dataset, and {0, 50, 100, 200, 500, 800, 1000, 2000} features are selected from them by LRPFS; the selected features are displayed as white pixels, corresponding to the images from left to right in Figure 10, sequentially. It can be observed that when 50 features are selected, the selected features are mainly concentrated in the hair, eyes, and nose, whereas features of the mouth appear when the number of selected features increases to 100, indicating that the features of hair, eyes, and nose are more discriminative than the mouth in the Yale64 dataset. As the number of selected features gradually increases, the selected features mainly cover the hair, eyes, glasses, nose, mouth, beard, etc., which is consistent with the perception of features relevant for face recognition. Thus, it is demonstrated that LRPFS can effectively identify discriminative features and reasonably evaluate feature scores.

5. Conclusions

In this paper, a novel unsupervised feature selection method with a latent relationship penalty term, named LRPFS, is proposed, which takes into account the uniqueness of the samples and sufficiently exploits their attributes while preserving the data structure. LRPFS incorporates the latent relationship penalty term into UFS, which provides a latent constraint on the feature transformation matrix and generates a pseudo-label matrix for feature selection. Additionally, the ℓ2,1-norm sparsity constraint is applied to the feature transformation matrix to significantly enhance the computational efficiency of the algorithm.
Comparative experiments are conducted between LRPFS and 10 UFS methods. The comparison covers various aspects, including clustering tasks, running speed, and noise experiments, using datasets from different domains such as images, text, speech signals, and biological data. The experimental results demonstrate that, compared to the comparison methods, LRPFS can effectively select discriminative features and reduce the interference of noise; notably, on the Colon dataset the ACC value of LRPFS is increased by 32.26% over the baseline, which further confirms the effectiveness of the LRPFS mechanism. The reason is that LRPFS preserves pairwise relationships weighted by the uniqueness scores of the samples to explore the interconnections between individuals. At the same time, our proposed learning framework provides a theoretical basis for addressing practical problems.
On the other hand, one limitation of LRPFS is that it requires the tuning of two parameters, which can be time-consuming. In Section 4, extensive experiments demonstrate the importance of a well-tuned set of parameters. Hence, our future work aims to develop a new mechanism that eliminates the need for parameter tuning or to design a novel optimization mechanism capable of simultaneously optimizing all variables. We also plan to apply this method to other fields such as remote sensing images and gene expression analysis in the future.

Author Contributions

Z.M., conceptualization, methodology, writing—review and editing, and validation; Y.H., methodology, software, data curation, and writing—original draft preparation; H.L., visualization and investigation; J.W., supervision and writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Natural Science Foundation of Ningxia (Nos. 2020AAC03215 and 2022AAC03268), National Natural Science Foundation of China (No. 61462002), and Basic Scientific Research in Central Universities of North Minzu University (Nos. 2021KJCX09 and FWNX21).

Data Availability Statement

The data and code that support the findings of this study are available from the corresponding author (Z.M.) upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Jain, A.; Zongker, D. Feature selection: Evaluation, application, and small sample performance. IEEE Trans. Pattern Anal. Mach. Intell. 1997, 19, 153–158. [Google Scholar] [CrossRef]
  2. Nie, F.; Wang, Z.; Wang, R.; Li, X. Submanifold-preserving discriminant analysis with an auto-optimized graph. IEEE Trans. Cybern. 2020, 50, 3682–3695. [Google Scholar] [CrossRef] [PubMed]
  3. Lee, D.; Seung, H. Learning the parts of objects by non-negative matrix factorization. Nature 1999, 401, 788–791. [Google Scholar] [CrossRef] [PubMed]
  4. Lipovetsky, S. PCA and SVD with nonnegative loadings. Pattern Recognit. 2009, 42, 68–76. [Google Scholar] [CrossRef]
  5. Roweis, S.; Saul, L. Nonlinear dimensionality reduction by locally linear embedding. Science 2000, 290, 2323–2326. [Google Scholar] [CrossRef] [PubMed]
  6. Li, J.; Cheng, K.; Wang, S.; Morstatter, F.; Trevino, R.P.; Tang, J.; Liu, H. Feature selection: A data perspective. ACM Comput. Surv. (CSUR) 2017, 50, 1–45. [Google Scholar] [CrossRef]
  7. Saberi-Movahed, F.; Rostami, M.; Berahmand, K.; Karami, S.; Tiwari, P.; Oussalah, M.; Band, S. Dual regularized unsupervised feature selection based on matrix factorization and minimum redundancy with application in gene selection. Knowl. Based Syst. 2022, 256, 109884. [Google Scholar] [CrossRef]
  8. Wang, Y.; Wang, J.; Tao, D. Neurodynamics-driven supervised feature selection. Pattern Recogn. 2023, 136, 109254. [Google Scholar] [CrossRef]
  9. Plaza-del-Arco, F.; Molina-González, M.; Ureña-López, L.; Martín-Valdivia, M. Integrating implicit and explicit linguistic phenomena via multi-task learning for offensive language detection. Knowl. Based Syst. 2022, 258, 109965. [Google Scholar] [CrossRef]
  10. Ang, J.; Mirzal, A.; Haron, H.; Hamed, H. Supervised, unsupervised, and semi-supervised feature selection: A review on gene selection. IEEE/ACM Trans. Comput. Biol. Bioinf. 2016, 13, 971–989. [Google Scholar] [CrossRef]
  11. Bhadra, T.; Bandyopadhyay, S. Supervised feature selection using integration of densest subgraph finding with floating forward-backward search. Inf. Sci. 2021, 566, 1–18. [Google Scholar] [CrossRef]
  12. Wang, Y.; Wang, J.; Pal, N. Supervised feature selection via collaborative neurodynamic optimization. IEEE Trans. Neural Netw. Learn. Syst. 2022. [Google Scholar] [CrossRef] [PubMed]
  13. Han, Y.; Yang, Y.; Yan, Y.; Ma, Z.; Sebe, N.; Zhou, X. Semisupervised feature selection via spline regression for video semantic recognition. IEEE Trans. Neural Netw. Learn. Syst. 2015, 26, 252–264. [Google Scholar] [CrossRef] [PubMed]
  14. Chen, X.; Chen, R.; Wu, Q.; Nie, F.; Yang, M.; Mao, R. Semisupervised feature selection via structured manifold learning. IEEE Trans. Cybern. 2022, 52, 5756–5766. [Google Scholar] [CrossRef] [PubMed]
  15. Li, Z.; Tang, J. Unsupervised feature selection via nonnegative spectral analysis and redundancy control. IEEE Trans. Image Process. 2015, 24, 5343–5355. [Google Scholar] [CrossRef] [PubMed]
  16. Zhu, P.; Hou, X.; Tang, K.; Liu, Y.; Zhao, Y.; Wang, Z. Unsupervised feature selection through combining graph learning and ℓ2,0-norm constraint. Inf. Sci. 2023, 622, 68–82. [Google Scholar] [CrossRef]
  17. Shang, R.; Kong, J.; Zhang, W.; Feng, J.; Jiao, L.; Stolkin, R. Uncorrelated feature selection via sparse latent representation and extended OLSDA. Pattern Recognit. 2022, 132, 108966. [Google Scholar] [CrossRef]
  18. Zhang, R.; Zhang, H.; Li, X.; Yang, S. Unsupervised feature selection with extended OLSDA via embedding nonnegative manifold structure. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 2274–2280. [Google Scholar] [CrossRef]
  19. Zhao, Z.; Liu, H. Spectral feature selection for supervised and unsupervised learning. In Proceedings of the 24th Annual International Conference on Machine Learning, Corvalis, OR, USA, 20–24 June 2007; pp. 1151–1157. [Google Scholar] [CrossRef]
  20. Cai, D.; Zhang, C.; He, X. Unsupervised feature selection for multi-cluster data. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 25–28 July 2010; pp. 333–342. [Google Scholar] [CrossRef]
  21. Hou, C.; Nie, F.; Li, X.; Yi, D.; Wu, Y. Joint embedding learning and sparse regression: A framework for unsupervised feature selection. IEEE Trans. Cybern. 2014, 44, 2168–2267. [Google Scholar] [CrossRef]
  22. He, X.; Cai, D.; Niyogi, P. Laplacian score for feature selection. In Advances in Neural Information Processing Systems 18; The MIT Press: Cambridge, MA, USA, 2005; pp. 507–514. [Google Scholar]
  23. Shang, R.; Wang, W.; Stolkin, R.; Jiao, L. Subspace learning-based graph regularized feature selection. Knowl. Based Syst. 2016, 112, 152–165. [Google Scholar] [CrossRef]
  24. Liu, Y.; Ye, D.; Li, W.; Wang, H.; Gao, Y. Robust neighborhood embedding for unsupervised feature selection. Knowl. Based Syst. 2020, 193, 105462. [Google Scholar] [CrossRef]
  25. Nie, F.; Zhu, W.; Li, X. Unsupervised feature selection with structured graph optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 1302–1308. [Google Scholar]
  26. Li, X.; Zhang, H.; Zhang, R.; Liu, Y.; Nie, F. Generalized uncorrelated regression with adaptive graph for unsupervised feature selection. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 1587–1595. [Google Scholar] [CrossRef]
  27. Chen, H.; Nie, F.; Wang, R.; Li, X. Unsupervised feature selection with flexible optimal graph. IEEE Trans. Neural Netw. Learn. Syst. 2022. [Google Scholar] [CrossRef] [PubMed]
  28. Tang, C.; Bian, M.; Liu, X.; Li, M.; Zhou, H.; Wang, P.; Yin, H. Unsupervised feature selection via latent representation learning and manifold regularization. Neural Netw. 2019, 117, 163–178. [Google Scholar] [CrossRef] [PubMed]
  29. Shang, R.; Wang, L.; Shang, F.; Jiao, L.; Li, Y. Dual space latent representation learning for unsupervised feature selection. Pattern Recognit. 2021, 114, 107873. [Google Scholar] [CrossRef]
  30. Samaria, F.; Harter, A. Parameterisation of a stochastic model for human face identification. In Proceedings of the 2nd IEEE Workshop on Applications of Computer Vision, Princeton, NJ, USA, 19–21 October 1994; pp. 138–142. [Google Scholar]
  31. Yang, F.; Mao, K.; Lee, G.; Tang, W. Emphasizing minority class in LDA for feature subset selection on high-dimensional small-sized problems. IEEE Trans. Knowl. Data Eng. 2015, 27, 88–101. [Google Scholar] [CrossRef]
  32. Tao, H.; Hou, C.; Nie, F.; Jiao, Y.; Yi, D. Effective discriminative feature selection with nontrivial solution. IEEE Trans. Neural Netw. Learn. Syst. 2016, 27, 796–808. [Google Scholar] [CrossRef]
  33. Pang, T.; Nie, F.; Han, J.; Li, X. Efficient feature selection via ℓ2,0-norm constrained sparse regression. IEEE Trans. Knowl. Data Eng. 2019, 31, 880–893. [Google Scholar] [CrossRef]
  34. Zhao, S.; Wang, M.; Ma, S.; Cui, Q. A feature selection method via relevant-redundant weight. Expert Syst. Appl. 2022, 207, 117923. [Google Scholar] [CrossRef]
  35. Nouri-Moghaddam, B.; Ghazanfari, M.; Fathian, M. A novel multi-objective forest optimization algorithm for wrapper feature selection. Expert Syst. Appl. 2021, 175, 114737. [Google Scholar] [CrossRef]
  36. Maldonado, S.; Weber, R. A wrapper method for feature selection using support vector machines. Inf. Sci. 2009, 179, 2208–2217. [Google Scholar] [CrossRef]
  37. Shi, D.; Zhu, L.; Li, J.; Zhang, Z.; Chang, X. Unsupervised adaptive feature selection with binary hashing. IEEE Trans. Image Process. 2023, 32, 838–853. [Google Scholar] [CrossRef] [PubMed]
  38. Nie, F.; Wang, Z.; Tian, L.; Wang, R.; Li, X. Subspace Sparse Discriminative Feature Selection. IEEE Trans. Cybern. 2022, 52, 4221–4233. [Google Scholar] [CrossRef] [PubMed]
  39. Roffo, G.; Melzi, S.; Castellani, U.; Vinciarelli, A.; Cristani, M. Infinite feature selection: A graph-based feature filtering approach. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 4396–4410. [Google Scholar] [CrossRef] [PubMed]
  40. Yang, Y.; Shen, H.; Ma, Z.; Huang, Z.; Zhou, X. ℓ2,1-norm regularized discriminative feature selection for unsupervised learning. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Barcelona, Spain, 16–22 July 2011; pp. 1589–1594. Available online: https://dl.acm.org/doi/10.5555/2283516.2283660 (accessed on 8 August 2022).
  41. Xue, B.; Zhang, M.; Browne, W. Particle Swarm Optimization for Feature Selection in Classification: A Multi-Objective Approach. IEEE Trans. Cybern. 2013, 43, 1656–1671. [Google Scholar] [CrossRef] [PubMed]
  42. Ding, D.; Yang, X.; Xia, F.; Ma, T.; Liu, H.; Tang, C. Unsupervised feature selection via adaptive hypergraph regularized latent representation learning. Neurocomputing 2020, 378, 79–97. [Google Scholar] [CrossRef]
  43. Shang, R.; Kong, J.; Feng, J.; Jiao, L. Feature selection via non-convex constraint and latent representation learning with Laplacian embedding. Expert Syst. Appl. 2022, 208, 118179. [Google Scholar] [CrossRef]
  44. He, Z.; Xie, S.; Zdunek, R.; Zhou, G.; Cichocki, A. Symmetric nonnegative matrix factorization: Algorithms and applications to probabilistic clustering. IEEE Trans. Neural Netw. 2011, 22, 2117–2131. [Google Scholar] [CrossRef]
  45. Shang, R.; Wang, W.; Stolkin, R.; Jiao, L. Non-negative spectral learning and sparse regression-based dual-graph regularized feature selection. IEEE Trans. Cybern. 2018, 48, 793–806. [Google Scholar] [CrossRef]
  46. Huang, D.; Cai, X.; Wang, C. Unsupervised feature selection with multi-subspace randomization and collaboration. Knowl. Based Syst. 2019, 182, 104856. [Google Scholar] [CrossRef]
  47. Xiao, J.; Zhu, X. Some properties and applications of Menger probabilistic inner product spaces. Fuzzy Sets Syst. 2022, 451, 398–416. [Google Scholar] [CrossRef]
  48. Cai, D.; He, X.; Han, J. Locally consistent concept factorization for document clustering. IEEE Trans. Knowl. Data Eng. 2011, 23, 902–913. [Google Scholar] [CrossRef]
  49. Pan, V.; Soleymani, F.; Zhao, L. An efficient computation of generalized inverse of a matrix. Appl. Math. Comput. 2018, 316, 89–101. [Google Scholar] [CrossRef]
  50. Luo, C.; Zheng, J.; Li, T.; Chen, H.; Huang, Y.; Peng, X. Orthogonally constrained matrix factorization for robust unsupervised feature selection with local preserving. Inf. Sci. 2022, 586, 662–675. [Google Scholar] [CrossRef]
Figure 1. Samples from the ORL dataset. (a–c) are from different categories of the ORL dataset, respectively.
Figure 2. The framework of the LRPFS method.
Figure 3. The ACC of all algorithms with different numbers of selected features on the nine datasets.
Figure 4. The NMI of all algorithms with different numbers of selected features on the nine datasets.
Figure 5. Samples from the COIL20 dataset with noise of different sizes.
Figure 6. The 2-D visualization of the COIL20 benchmark dataset.
Figure 7. Convergence curves of LRPFS on nine datasets.
Figure 8. The ACC results of LRPFS on the nine datasets under different parameter values (varied on a log10 scale).
Figure 9. The NMI results of LRPFS on the nine datasets under different parameter values (varied on a log10 scale).
Figure 10. Results of two Yale64 samples with different numbers of selected features.
Table 1. Details of the ten datasets.
No. | Datasets | Samples | Features | Classes | Type
1 | COIL20 | 1440 | 1024 | 20 | Object image
2 | Colon | 62 | 2000 | 2 | Biological
3 | Isolet | 1560 | 617 | 26 | Speech Signal
4 | JAFFE | 213 | 676 | 10 | Face image
5 | Lung_dis | 73 | 325 | 7 | Biological
6 | nci9 | 60 | 9712 | 9 | Biological
7 | PCMAC | 1943 | 3289 | 2 | Text
8 | PIE | 2856 | 1024 | 68 | Face image
9 | TOX_171 | 171 | 5748 | 4 | Biological
10 | Yale64 | 165 | 4096 | 11 | Face image
Table 2. ACC (mean ± std %, with the number of selected features in parentheses) of different algorithms on the real-world datasets; the best result for each dataset is shown in bold and the second-best is underlined.
Methods | COIL20 | Colon | Isolet | JAFFE | Lung_dis | nci9 | PCMAC | PIE | TOX_171
Baseline | 65.75 ± 4.16 (all) | 54.84 ± 0.00 (all) | 61.73 ± 2.77 (all) | 82.04 ± 5.59 (all) | 73.63 ± 5.26 (all) | 40.75 ± 5.26 (all) | 50.49 ± 0.00 (all) | 24.68 ± 1.09 (all) | 44.77 ± 3.93 (all)
LapScor | 60.41 ± 2.11 (100) | 56.45 ± 0.00 (40) | 55.83 ± 2.14 (100) | 76.74 ± 4.58 (50) | 70.41 ± 7.34 (90) | 37.58 ± 3.08 (80) | 50.23 ± 0.00 (50) | 39.00 ± 1.05 (70) | 52.81 ± 0.27 (20)
SPEC | 64.74 ± 3.47 (90) | 56.53 ± 0.36 (30) | 46.84 ± 1.89 (100) | 80.94 ± 5.35 (100) | 71.03 ± 5.38 (100) | 46.33 ± 4.17 (50) | 50.08 ± 0.00 (20) | 17.88 ± 0.89 (100) | 50.32 ± 1.31 (90)
MCFS | 64.22 ± 3.37 (50) | 53.23 ± 0.00 (60) | 56.81 ± 2.20 (90) | 85.40 ± 4.41 (100) | 81.92 ± 4.80 (80) | 45.58 ± 3.26 (40) | 50.13 ± 0.00 (20) | 27.87 ± 1.48 (70) | 43.63 ± 1.87 (20)
UDFS | 58.71 ± 2.14 (100) | 62.26 ± 1.22 (30) | 42.65 ± 1.72 (100) | 84.55 ± 4.10 (100) | 77.60 ± 6.66 (90) | 35.25 ± 2.25 (60) | 51.02 ± 0.39 (30) | 20.74 ± 0.69 (100) | 45.06 ± 4.18 (70)
SOGFS | 57.83 ± 2.78 (100) | 61.29 ± 0.00 (30) | 45.44 ± 1.88 (100) | 86.03 ± 4.78 (70) | 76.71 ± 5.50 (90) | 42.75 ± 4.43 (50) | 52.50 ± 0.41 (80) | 36.60 ± 1.01 (30) | 49.80 ± 2.21 (80)
SRCFS | 57.14 ± 2.68 (100) | 58.06 ± 0.00 (100) | 55.91 ± 2.04 (100) | 76.17 ± 4.65 (100) | 70.68 ± 5.36 (80) | 39.33 ± 2.98 (90) | 50.49 ± 0.00 (100) | 39.77 ± 1.05 (90) | 47.46 ± 0.29 (100)
RNE | 61.52 ± 1.91 (70) | 62.90 ± 0.00 (70) | 49.44 ± 1.51 (70) | 73.66 ± 4.52 (90) | 73.29 ± 5.32 (90) | 36.08 ± 3.56 (30) | 53.86 ± 4.82 (20) | 27.82 ± 0.83 (40) | 54.35 ± 3.64 (80)
inf-FSU | 58.32 ± 2.69 (100) | 72.34 ± 0.79 (50) | 56.75 ± 1.35 (100) | 66.01 ± 3.83 (100) | 78.90 ± 6.21 (80) | 32.92 ± 2.85 (100) | 50.75 ± 0.00 (40) | 40.32 ± 1.11 (100) | 39.94 ± 1.02 (30)
S2DFS | 67.10 ± 3.18 (60) | 85.56 ± 0.36 (30) | 63.64 ± 2.09 (100) | 81.46 ± 7.67 (100) | 78.70 ± 4.76 (80) | 46.50 ± 4.25 (100) | 50.08 ± 0.00 (30) | 27.79 ± 0.81 (60) | 46.55 ± 2.43 (50)
LRPFS | 69.31 ± 3.17 (90) | 87.10 ± 1.17 (70) | 66.41 ± 1.74 (100) | 86.31 ± 3.63 (60) | 82.23 ± 4.53 (60) | 47.25 ± 4.69 (30) | 57.84 ± 0.97 (60) | 41.62 ± 1.43 (70) | 54.44 ± 0.87 (50)
Table 3. NMI (mean ± std %, with the number of selected features in parentheses) of different algorithms on the real-world datasets; the best result for each dataset is shown in bold and the second-best is underlined.
Methods | COIL20 | Colon | Isolet | JAFFE | Lung_dis | nci9 | PCMAC | PIE | TOX_171
Baseline | 76.69 ± 1.99 (all) | 0.60 ± 0.00 (all) | 76.06 ± 1.26 (all) | 83.61 ± 3.37 (all) | 69.27 ± 4.21 (all) | 37.96 ± 5.92 (all) | 0.01 ± 0.00 (all) | 48.84 ± 0.62 (all) | 24.17 ± 3.73 (all)
LapScor | 69.67 ± 1.18 (100) | 0.97 ± 0.00 (40) | 69.45 ± 0.91 (100) | 83.45 ± 2.28 (50) | 64.86 ± 5.71 (90) | 36.49 ± 1.66 (40) | 0.58 ± 0.00 (50) | 64.51 ± 0.65 (70) | 34.93 ± 0.36 (20)
SPEC | 73.52 ± 1.49 (100) | 1.75 ± 0.02 (30) | 59.89 ± 1.38 (100) | 83.92 ± 3.73 (80) | 67.09 ± 3.55 (100) | 45.41 ± 3.64 (40) | 0.45 ± 0.00 (20) | 42.57 ± 0.43 (20) | 25.64 ± 1.49 (90)
MCFS | 74.14 ± 1.90 (60) | 0.10 ± 0.00 (20) | 69.80 ± 0.66 (100) | 85.82 ± 2.20 (100) | 74.01 ± 4.11 (80) | 45.76 ± 3.27 (40) | 0.41 ± 0.00 (20) | 50.89 ± 0.69 (70) | 19.00 ± 3.78 (60)
UDFS | 69.43 ± 0.99 (100) | 3.12 ± 0.63 (30) | 57.41 ± 0.87 (100) | 84.95 ± 2.69 (90) | 69.80 ± 4.65 (90) | 34.99 ± 2.95 (20) | 0.11 ± 0.05 (30) | 44.13 ± 0.38 (100) | 15.78 ± 5.16 (100)
SOGFS | 70.79 ± 1.72 (100) | 7.79 ± 0.00 (30) | 61.35 ± 0.36 (100) | 88.08 ± 2.97 (70) | 70.60 ± 2.57 (100) | 42.42 ± 3.77 (70) | 2.16 ± 0.52 (80) | 58.55 ± 0.53 (30) | 28.18 ± 4.16 (100)
SRCFS | 69.20 ± 0.98 (100) | 1.83 ± 0.00 (100) | 68.18 ± 0.98 (100) | 81.18 ± 3.89 (100) | 65.45 ± 4.25 (100) | 37.65 ± 3.32 (80) | 0.63 ± 0.00 (20) | 64.95 ± 0.86 (90) | 31.40 ± 0.30 (40)
RNE | 72.01 ± 1.56 (100) | 3.95 ± 0.00 (70) | 62.85 ± 1.01 (100) | 78.76 ± 3.70 (90) | 67.31 ± 4.40 (90) | 33.66 ± 3.73 (30) | 1.26 ± 1.24 (20) | 54.42 ± 0.55 (40) | 28.26 ± 4.07 (80)
inf-FSU | 68.65 ± 1.16 (100) | 16.80 ± 0.77 (50) | 71.52 ± 0.58 (100) | 69.26 ± 2.28 (100) | 74.60 ± 5.81 (80) | 27.53 ± 3.30 (100) | 0.21 ± 0.00 (40) | 65.66 ± 0.73 (100) | 13.11 ± 1.17 (60)
S2DFS | 76.33 ± 1.43 (80) | 40.70 ± 0.78 (30) | 74.75 ± 0.83 (100) | 83.85 ± 4.07 (100) | 74.43 ± 3.25 (80) | 46.55 ± 4.06 (100) | 0.29 ± 0.00 (20) | 52.64 ± 0.64 (60) | 30.18 ± 2.36 (100)
LRPFS | 76.55 ± 1.03 (100) | 41.80 ± 3.10 (70) | 75.96 ± 0.61 (100) | 86.71 ± 2.12 (90) | 76.63 ± 4.38 (60) | 47.12 ± 4.56 (70) | 2.00 ± 0.30 (90) | 64.97 ± 0.48 (80) | 35.80 ± 1.00 (80)
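The ACC and NMI scores in Tables 2 and 3 are clustering-quality measures. The snippet below is a minimal sketch of the evaluation protocol commonly used to produce such scores, assuming k-means is run repeatedly on the selected features, ACC is obtained by Hungarian best matching between predicted clusters and ground-truth labels, and NMI is computed with scikit-learn; the function names, repetition count, and k-means settings are illustrative assumptions rather than the exact configuration used in the paper.

```python
# Sketch of a typical ACC/NMI evaluation for unsupervised feature selection.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score


def clustering_acc(y_true, y_pred):
    """Best-map clustering accuracy via Hungarian matching."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # Contingency matrix: rows are predicted clusters, columns are true classes.
    count = np.zeros((clusters.size, classes.size), dtype=int)
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            count[i, j] = np.sum((y_pred == c) & (y_true == k))
    # Maximize the number of correctly matched samples.
    row, col = linear_sum_assignment(-count)
    return count[row, col].sum() / y_true.size


def evaluate_selected_features(X_selected, y_true, n_repeats=20, seed=0):
    """Mean/std of ACC and NMI over repeated k-means runs on the selected features."""
    n_clusters = np.unique(y_true).size
    accs, nmis = [], []
    for r in range(n_repeats):
        pred = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=seed + r).fit_predict(X_selected)
        accs.append(clustering_acc(y_true, pred))
        nmis.append(normalized_mutual_info_score(y_true, pred))
    return np.mean(accs), np.std(accs), np.mean(nmis), np.std(nmis)
```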
Table 4. Computation time (seconds) of different methods on real-world datasets.
Methods | COIL20 | Colon | Isolet | JAFFE | Lung_dis | nci9 | PCMAC | PIE | TOX_171
Baseline | 24.91 | 0.62 | 19.13 | 1.64 | 0.43 | 2.56 | 3.04 | 99.57 | 7.84
LapScor | 5.83 | 0.38 | 7.96 | 0.95 | 0.35 | 0.47 | 1.75 | 33.24 | 0.94
SPEC | 10.58 | 0.31 | 12.94 | 0.96 | 0.41 | 0.54 | 16.52 | 53.50 | 1.08
MCFS | 5.97 | 1.01 | 8.01 | 1.24 | 0.65 | 2.14 | 2.83 | 30.88 | 1.39
UDFS | 13.94 | 12.43 | 13.41 | 1.48 | 0.52 | 1198.87 | 67.97 | 55.85 | 314.23
SOGFS | 96.53 | 3.09 | 23.73 | 1.98 | 0.86 | 12,137.22 | 608.57 | 58.83 | 929.49
SRCFS | 10.53 | 0.53 | 12.91 | 1.22 | 0.55 | 0.62 | 12.59 | 51.55 | 1.42
RNE | 12.71 | 29.53 | 12.67 | 5.42 | 1.36 | 476.33 | 50.15 | 33.79 | 179.91
inf-FSU | 10.75 | 2.88 | 9.31 | 1.90 | 0.61 | 55.31 | 50.91 | 47.07 | 28.03
S2DFS | 7.21 | 12.80 | 7.86 | 1.77 | 0.65 | 1379.36 | 60.78 | 31.01 | 291.46
LRPFS | 9.27 | 1.24 | 12.83 | 1.19 | 0.41 | 23.24 | 29.56 | 52.40 | 8.66
Table 5. ACC and NMI (mean ± std %) on the noisy COIL20 dataset; the best result for each setting is shown in bold and the second-best is underlined.
Methods | ACC, 8 × 8 Noise | ACC, 12 × 12 Noise | ACC, 16 × 16 Noise | NMI, 8 × 8 Noise | NMI, 12 × 12 Noise | NMI, 16 × 16 Noise
LapScor | 56.03 ± 3.43 | 58.72 ± 2.96 | 58.32 ± 3.56 | 67.53 ± 1.06 | 69.83 ± 0.84 | 68.82 ± 1.28
SPEC | 63.90 ± 2.78 | 63.78 ± 2.34 | 56.46 ± 2.29 | 72.18 ± 1.23 | 73.35 ± 1.34 | 67.75 ± 0.92
MCFS | 64.01 ± 3.58 | 63.94 ± 2.02 | 61.96 ± 2.45 | 73.69 ± 1.84 | 73.37 ± 1.08 | 70.77 ± 1.06
UDFS | 59.28 ± 2.68 | 63.45 ± 2.49 | 60.85 ± 2.31 | 69.15 ± 1.33 | 73.12 ± 1.17 | 68.27 ± 1.48
SOGFS | 57.19 ± 2.39 | 57.60 ± 2.59 | 57.43 ± 2.88 | 70.41 ± 1.10 | 70.47 ± 0.77 | 69.85 ± 1.60
SRCFS | 57.70 ± 3.19 | 58.20 ± 2.00 | 57.86 ± 2.42 | 69.81 ± 1.50 | 69.87 ± 1.28 | 68.13 ± 1.00
RNE | 62.05 ± 4.12 | 61.63 ± 2.98 | 53.52 ± 2.35 | 72.44 ± 1.62 | 71.74 ± 1.44 | 65.76 ± 0.88
inf-FSU | 57.33 ± 2.52 | 58.01 ± 2.27 | 59.23 ± 2.19 | 69.06 ± 1.12 | 68.07 ± 1.20 | 68.21 ± 1.42
S2DFS | 65.39 ± 3.19 | 65.75 ± 3.94 | 62.39 ± 3.27 | 73.27 ± 1.81 | 73.40 ± 1.88 | 70.76 ± 1.55
LRPFS | 65.95 ± 2.79 | 66.85 ± 2.50 | 62.58 ± 2.21 | 73.71 ± 1.33 | 74.14 ± 1.15 | 70.94 ± 1.28
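Figure 5 and Table 5 concern COIL20 images corrupted by noise of size 8 × 8, 12 × 12, and 16 × 16. The snippet below is a minimal sketch of one way such block noise could be superimposed on a 32 × 32 COIL20 image (1024 features, as listed in Table 1); the single randomly placed block filled with uniform noise is an illustrative assumption, not the exact corruption procedure of the paper.

```python
# Sketch: superimpose one random noise block of a given size on a grayscale image.
import numpy as np


def add_block_noise(image, block_size=8, rng=None):
    """Overwrite one randomly placed block_size x block_size patch with uniform noise."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = image.astype(float).copy()
    h, w = noisy.shape
    top = rng.integers(0, h - block_size + 1)
    left = rng.integers(0, w - block_size + 1)
    noisy[top:top + block_size, left:left + block_size] = rng.uniform(
        noisy.min(), noisy.max(), size=(block_size, block_size))
    return noisy


# Example usage (hypothetical data matrix X with flattened 32x32 images):
# x_noisy = add_block_noise(X[0].reshape(32, 32), block_size=12).reshape(-1)
```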