SFS-AGGL: Semi-Supervised Feature Selection Integrating Adaptive Graph with Global and Local Information

Abstract: As the feature dimension of data continues to expand, selecting an optimal subset of features from a pool of limited labeled data and extensive unlabeled data becomes more and more challenging. In recent years, several semi-supervised feature selection (SSFS) methods have been proposed to select such subsets, but they still have drawbacks that limit their performance; e.g., many SSFS methods underutilize the structural distribution information available within labeled and unlabeled data. To address this issue, we propose a semi-supervised feature selection method based on an adaptive graph with global and local constraints (SFS-AGGL) in this paper. Specifically, we first design an adaptive graph learning mechanism that considers both the global and local information of samples to effectively learn and retain the geometric structural information of the original dataset. Secondly, we construct a label propagation technique integrated with adaptive graph learning in SFS-AGGL to fully utilize the structural distribution information of both labeled and unlabeled data. The proposed SFS-AGGL method is validated through classification and clustering tasks across various datasets. The experimental results demonstrate its superiority over existing benchmark methods, particularly in terms of clustering performance.


Introduction
High-dimensional data can describe real-world things more realistically and effectively. However, these data might include vast redundant and irrelevant information. If we process these data directly, it not only consumes a large amount of storage space and computational resources but also leads to the performance degradation of existing models [1]. Therefore, it is necessary to mine the potential relationships between the data to select and learn useful feature information.
Feature representation learning (FRL) is one of the most effective methods of learning useful feature information. Among the existing FRL methods, feature extraction (FE) [2] and feature selection (FS) [3] are two representative methods. FE aims to map the original high-dimensional feature space to a low-dimensional subspace according to some predefined criteria [4]. FS selects an optimal feature subset from the original feature set based on evaluation metrics [5]. In comparison, FS is more interpretable than FE since it can remove irrelevant and redundant features from the original features and retain a small number of relevant features. Therefore, FS is widely used in image classification, bioinformatics, face recognition, medical image analysis, natural language processing, and other fields [6].
FS methods can be divided into unsupervised feature selection (UFS), supervised feature selection (SFS), and semi-supervised feature selection (SSFS). UFS methods can achieve feature selection by only using unlabeled data; they have received widespread attention since they do not require any labeled data. However, the lack of label-guided learning in UFS methods leads to poor performance on practical application tasks [7]. Thus, SFS methods have been devised to leverage the label information of samples to guide the process of FS, enhancing the distinctiveness of the selected features and consequently improving classification and clustering performance [8]. However, obtaining ample labeled data is very challenging and time consuming in practical situations. For this reason, many SSFS methods have been proposed in the past decades. SSFS methods employ semi-supervised learning (SSL) to leverage the information of limited labeled data and a substantial volume of unlabeled data, enhancing the feature selection ability of the model [9]. The existing SSFS methods can be classified as filtered, wrapped, and embedded methods [10]. Filtered methods first evaluate each feature based on the principles of statistical or information theory and then perform the process of FS in terms of the calculated weights. A major benefit of filtered methods is that they are more applicable to large-scale datasets since they have high speed and computational efficiency. However, filtered methods may ignore the amount of redundant information generated by the combination of multiple features [11]. Thus, some wrapped approaches have been proposed to exploit the interrelationship of features to mine the best combination of features. However, these approaches have high computational complexity, which makes them unsuitable for processing large-scale data [12]. In contrast to the above-mentioned methods, embedded methods combine FS and model training together. That is, FS is automatically executed during the process of model training, which improves the efficiency of FS by reducing runtime [13]. Therefore, embedded methods have become mainstream and are widely used in various scenarios.
In recent years, several semi-supervised embedded feature selection (SSEFS) methods have emerged. For example, Zhao et al. [14] introduced an SSFS method using both labeled and unlabeled data. Recognizing that sparse regularization is an effective strategy for selecting useful features and reducing feature representation dimensions [15], Chen et al. [16] introduced an efficient semi-supervised feature selection (ESFS) method. ESFS first combines SSL and sparse regularization to obtain feature subsets. Then, it uses probability matrices of unlabeled data to measure feature relevance to the class, aiming to identify the globally optimal feature subset. Least squares regression (LSR), with its complete statistical theory, can handle noisy data effectively and thus improve computational efficiency [17]. Therefore, Chang et al. [18] proposed a convex sparse feature selection (CSFS) method based on LSR, which employs convex optimization theory to fit samples and predict labels, selecting the most critical features using constraint terms. Chen et al. [19] contended that LSR-based feature selection lacks interpretability and struggles to identify a global sparse solution. Hence, they proposed an embedded SSFS method based on rescaled linear regression, which exploits the L21 norm to obtain both global and sparse solutions. Moreover, they also introduced a sparse regularization with an implicit L2p norm to obtain sparser and more interpretable solutions [20]. Therefore, this approach effectively constrains regression coefficients, achieving feature ranking. Besides, Liu et al. [21] combined sparse features and considered the correlation of samples in the original high- and low-dimensional spaces to improve the performance of feature learning. Despite the good results achieved by the sparse-model-based methods, there are still some problems.
The first problem is that most of the methods do not consider constructing graphs to better preserve the geometric structural information of the data during the FS process. Initially, KNN was adopted by some FS methods to construct graphs based on Euclidean distances [22][23][24][25]. To minimize the influence of the redundant features and noise in the original high-dimensional data on the graph construction process, Chen et al. [26] employed local discriminant analysis (LDA) to map the data from high-dimensional space to low-dimensional space. Subsequently, numerous graph construction methods based on data correlation have been presented, including the L1 graph [27], low-rank representation (LRR) [28], local structure learning [29], and sparse subspace clustering (SSC) [30], to construct high-quality graphs. These graph construction methods have been integrated into FS models, yielding a large number of improved feature selection methods [31][32][33][34][35][36][37]. However, the processes of adaptive graph construction and FS in the above-mentioned methods are independent of each other, so the influence of graph construction on the FS process is limited. To this end, some methods have been constructed to unify adaptive graph learning (AGL) and FS into a single framework [38][39][40][41].
The second problem is that the spatial distribution of the sample label information is not sufficiently considered, resulting in the weak discriminative ability of the selected features, which further leads to poor classification or clustering performance. To alleviate this issue, label propagation (LP) has been incorporated into FS methods [42][43][44]. However, since LP is also a graph-learning-based algorithm, the quality of the learned graph affects the performance to some extent. Therefore, numerous methods have emerged to merge AGL and LP [45][46][47][48]. However, these methods still have the following limitations: (1) the process of AGL is based on the original data; (2) the process of adaptive graph construction only considers the local structure or the global structure. Therefore, these methods are inevitably affected by high-dimensional features or noisy data.
To address the above-mentioned issues, this study develops a novel SSFS framework, SFS-AGGL, which integrates FS, AGL, and LP to capture both global and local data structural information for selecting an optimal feature subset with maximum discrimination and minimum redundancy. In AGL, global and local constraints are imposed on the construction coefficients obtained by the self-representation of the selected low-dimensional features. Meanwhile, the similarity matrix obtained by AGL is integrated into LP, enhancing label prediction performance. To improve the discriminative ability of the selected features, the predicted label matrix is introduced into the sparse feature selection (SFS) process. SFS is performed through the mutual promotion of the three models. The framework of the proposed SFS-AGGL is shown in Figure 1. The primary contributions of this paper are as follows: (1) An efficient SSFS framework is proposed by combining the advantages of FS, adaptive graph learning, and LP.
(2) An adaptive learning strategy based on low-dimensional features is designed to counteract the influence of high-dimensional features or noisy data. Moreover, global and local constraints are introduced.
(3) An LP based on an adaptive similarity matrix is introduced to enhance label prediction accuracy.
(4) Comprehensive experiments conducted on multiple real datasets demonstrate that the proposed SFS-AGGL method surpasses existing representative methods in classification and clustering tasks.
The rest of this paper is organized as follows: Section 2 describes some related work; Section 3 outlines the details of the proposed method and the iterative minimization strategy employed to optimize the objective function; Section 4 introduces the experimental setup and provides a comprehensive analysis of the obtained results, including comparisons with eight state-of-the-art methods on five real datasets; and Section 5 provides a summary of our work in this paper.

Related Work
In this section, we first provide some commonly used notations. Then, sparse representation and graph construction methods are introduced. Finally, some semi-supervised feature selection methods are briefly reviewed.

Notations
Let $X = [X_l, X_u] \in \mathbb{R}^{d \times n}$ denote the training samples, where $Y = [Y_l; Y_u] \in \mathbb{R}^{n \times c}$ is the label matrix and $Y_l$ denotes the true labels of the labeled samples. If the sample $x_i$ belongs to the class $j$, then its corresponding entry $Y_{ij} = 1$; otherwise, $Y_{ij} = 0$. $Y_u$ denotes the true labels of the unlabeled samples. Since $Y_u$ is unknown during the training process, it is set as a 0 matrix during training [49]. The main symbols in this paper are presented in Table 1.

Norms and Traces of Matrices
Common matrix norms include L1, L2, F, and L21 norms.Their detailed definitions are as follows:

$\|B\|_1 = \sum_{i,j} |B_{ij}|, \quad \|B\|_F = \Big(\sum_{i,j} B_{ij}^2\Big)^{1/2}, \quad \|B\|_{2,1} = \sum_i \|B^i\|_2, \qquad (1)$

where $B^i$ is the i-th row vector of the matrix $B$. According to matrix computation theory, the Frobenius and L21 norms can also be written in trace form as $\|B\|_F^2 = \mathrm{tr}(B^\top B)$ and $\|B\|_{2,1} = \mathrm{tr}(B^\top U B)$, where $U$ is a diagonal matrix with $U_{ii} = 1/(2\|B^i\|_2)$.
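As a quick numerical check of these definitions (an illustrative sketch; the matrix B is an arbitrary example):

```python
import numpy as np

B = np.array([[3.0, -4.0],
              [0.0,  0.0],
              [1.0,  2.0]])

l1_norm = np.abs(B).sum()                        # L1: sum of absolute entries
fro_norm = np.sqrt((B ** 2).sum())               # Frobenius norm
spec_norm = np.linalg.norm(B, 2)                 # L2 (spectral): largest singular value
l21_norm = np.sqrt((B ** 2).sum(axis=1)).sum()   # L21: sum of row-wise L2 norms

# The all-zero second row contributes nothing to the L21 norm, which is
# exactly why an L21 penalty encourages row sparsity.
```

Here the row norms are 5, 0, and sqrt(5), so the L21 norm is their sum, while the L1 norm sums all absolute entries.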

Sparse Representation
Sparse representation is a method that was first developed in signal processing. Its core idea is to find a target dictionary to describe the signal. To be specific, the original signal can be decomposed into linear combinations of elements in the dictionary. Only a few non-zero elements are used to represent the signal information, while the rest can be ignored. Given a sample $X \in \mathbb{R}^n$ and a target dictionary $D$, it is desired to find a coefficient vector $\alpha$ such that the signal $X$ can be represented as a linear combination of the basic elements of the target dictionary $D$:
where   d R is a one-dimensional vector [50] and  0 || || is the L0 norm of  .Due to the non-convexity and discontinuity of the L0 norm, the L1 norm is usually used to replace the L0 norm to obtain an approximate solution, as shown in the following formula: Compared with the L1 norm, the continuous derivability property of the L2 norm can make the optimization algorithm more intuitive.Hence, the L2 norm is commonly used to control overfitting, which can make the weight parameters of the model smoother and avoid overly complex models, as shown in the following formula: . XD (7) However, the disadvantage of the L2 norm is that the model parameters will be close to 0, but most of them cannot be 0. Therefore, the L21 norm, which is between the L1 and L2 norms, is proposed as an effective scheme, as shown in Equation ( 8): . XD (8) The advantage of L21 norm is that it can make the elements of the whole row 0, thus achieving a similar sparse effect as L1 and more robustness.

Constructing Graph Methods
The KNN graph is a widely used method for constructing a similarity matrix. $S_{ij}$, the similarity of the samples $x_i$ and $x_j$, is defined as:

$S_{ij} = \exp\big(-\|x_i - x_j\|_2^2 / (2\sigma^2)\big)$ if $x_i \in N_k(x_j)$ or $x_j \in N_k(x_i)$, and $S_{ij} = 0$ otherwise, $\qquad (9)$

where $N_k(x_j)$ is a set that contains the k nearest neighbor samples of the sample $x_j$ and $\sigma$ is a bandwidth parameter. From Equation (9), it can be seen that as the samples get closer, their similarity increases. In addition, there are some similar methods, such as the ϵ-neighborhood method [51] and the fully connected method [52], which can also be utilized to construct graphs.
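A minimal NumPy sketch of this KNN-graph construction, assuming the Gaussian similarity of Equation (9) (k and sigma are free parameters; the toy points are illustrative):

```python
import numpy as np

def knn_gaussian_graph(X, k=1, sigma=1.0):
    """S_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) if j is among the k nearest
    neighbors of i (or vice versa, after symmetrization), else 0."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # pairwise squared dists
    S = np.zeros((n, n))
    for i in range(n):
        nbrs = [j for j in np.argsort(d2[i]) if j != i][:k]
        S[i, nbrs] = np.exp(-d2[i, nbrs] / (2 * sigma ** 2))
    return np.maximum(S, S.T)   # symmetrize the graph

# Two well-separated pairs of points: neighbors within a pair are similar,
# while points across pairs get zero similarity.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
S = knn_gaussian_graph(X, k=1)
```

As the text notes, closer samples receive higher similarity, and non-neighbors receive exactly zero.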
Unlike the KNN graph, the L1 graph is an adaptive graph learning method that aims to reconstruct each sample by finding the best sparse linear combination of the other samples. The objective function of the L1 graph can be described as follows:

$\min_{s_i} \|s_i\|_1 \quad \mathrm{s.t.} \quad x_i = X s_i, \ s_{ii} = 0, \qquad (10)$

where $s_i$ is the sparse reconstruction coefficient vector of the sample $x_i$. The weight matrix formed by the L1 graph is then expressed as $S = [s_1, s_2, \ldots, s_n]$. Compared with KNN graphs, L1 graphs can adaptively select the nearest samples for each sample.

Label Propagation Algorithm
The label propagation (LP) algorithm is a graph-based semi-supervised classification method that can effectively classify unknown samples using a small number of labeled samples. In the LP algorithm, similar samples should have similar labels. Therefore, the objective function of LP can be expressed as:

$\min_{F} \sum_{i,j} s_{ij} \|f_i - f_j\|_2^2 + \mathrm{tr}\big((F - Y)^\top U (F - Y)\big), \qquad (11)$

where $s_{ij}$ can be computed by Equation (9) or Equation (10), and $U$ is a diagonal matrix that effectively utilizes category information from all samples in SSL. The diagonal elements of this matrix are defined as $u_{ii} = \infty$ if the sample $x_i$ is labeled and $u_{ii} = 1$ otherwise, where the symbol $\infty$ represents a relatively large constant. The first term in Equation (11) is based on the similarity of the data, which assigns similar labels to neighboring samples to keep the graph as smooth as possible. The second term aims to minimize the difference between the matrix F and the label matrix Y, i.e., the sample labels predicted by the trained model should be as consistent as possible with the true labels.
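To make the propagation mechanics concrete, the following is a minimal iterative sketch of graph-based label propagation (a common clamped-iteration variant, not the paper's exact formulation): labels spread over the row-normalized similarity graph while labeled samples are pulled back toward their known labels. The graph and labels below are toy assumptions:

```python
import numpy as np

def label_propagation(S, Y, alpha=0.9, n_iter=200):
    """Clamped iterative label propagation over a similarity graph S."""
    d = S.sum(axis=1)
    d[d == 0] = 1.0
    P = S / d[:, None]                         # row-normalize the similarities
    F = Y.astype(float).copy()
    for _ in range(n_iter):
        F = alpha * (P @ F) + (1 - alpha) * Y  # propagate, then clamp toward Y
    return F

# Two disconnected triangles; only one labeled sample per cluster.
S = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
Y = np.zeros((6, 2))
Y[0, 0] = 1.0      # sample 0 labeled as class 0
Y[3, 1] = 1.0      # sample 3 labeled as class 1
F = label_propagation(S, Y)
pred = F.argmax(axis=1)
```

The two labeled samples are enough to classify all unlabeled samples in their respective connected components, which is exactly the behavior the smoothness term in Equation (11) encodes.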

The Graph-Based Semi-Supervised Sparse Feature Selection
Sparse learning is widely used in machine learning due to its superior feature extraction capabilities. In this context, sparse regularization terms are used to penalize the projection matrix with the aim of selecting features with high sparsity and high discriminative power. The following equation is commonly used for sparse feature selection:

$\min_{W} \ Loss(X, W, Y) + \lambda R(W), \qquad (12)$

where $Loss(X, W, Y)$ is defined as a regression term, $R(W)$ is a sparse regularization term, and $\lambda \geq 0$ is a regularization weight to balance the two terms.
As we know, selecting features only using the information of the labeled samples is inaccurate and unreliable since the labeled samples are insufficient in SSL. Therefore, it is also necessary to make full use of the information of unlabeled samples to improve the performance. Feature selection can be achieved by introducing the LP algorithm of Equation (11) into the semi-supervised sparse model.
It can be seen that when constructing the model above, the quality of the similarity matrix construction directly determines the performance. To alleviate this issue, the following graph-based semi-supervised sparse feature selection model has been developed.

The Proposed Method
In this section, a detailed introduction of the SFS model is first presented. Second, a new AGL mechanism is introduced to make full use of the global and local information between the samples, which can capture the geometric structural information of the original data well. Next, the similarity matrix learned by the AGL mechanism is integrated into the LP algorithm, which enhances label prediction performance and allows the model to classify and cluster unlabeled samples more accurately. Finally, the SFS, AGL, and LP models are fused in a unified framework to propose a novel SFS-AGGL method. Moreover, a new iterative update algorithm is introduced to optimize the proposed model, and its convergence is confirmed through both theoretical analysis and experimental testing.

SFS Model
The L21 sparsity constraint is applied to achieve the process of FS. In combination with LSR, a basic SFS model can be obtained as follows:

$\min_{W} \|X^\top W - Y\|_F^2 + \theta \|W\|_{2,1}, \qquad (17)$

where $W \in \mathbb{R}^{d \times c}$ denotes the feature projection matrix and $\theta$ is a regularization parameter.
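A small sketch of this model: the L21-regularized least-squares problem can be solved with the standard iteratively reweighted scheme, using the diagonal matrix U with $U_{ii} = 1/(2\|w^i\|_2)$ from the norm definitions above; the row norms of the learned W then rank the features. The toy data and solver details below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sfs_l21(X, Y, theta=100.0, n_iter=30, eps=1e-8):
    """Iteratively reweighted solver for min_W ||X^T W - Y||_F^2 + theta*||W||_2,1.
    X is d x n, Y is n x c, W is d x c (matching the paper's dimensions)."""
    d = X.shape[0]
    W = np.linalg.solve(X @ X.T + theta * np.eye(d), X @ Y)    # ridge warm start
    for _ in range(n_iter):
        u = 1.0 / (2.0 * np.sqrt((W ** 2).sum(axis=1)) + eps)  # U_ii = 1/(2||w^i||)
        W = np.linalg.solve(X @ X.T + theta * np.diag(u), X @ Y)
    return W

# Toy data: of five features, only the first two determine the labels.
rng = np.random.default_rng(1)
n = 200
X = rng.standard_normal((5, n))
y = X[0] - X[1]
Y = np.c_[y > 0, y <= 0].astype(float)     # one-hot label matrix
W = sfs_l21(X, Y)
score = np.sqrt((W ** 2).sum(axis=1))      # row norms of W rank the features
```

The rows of W corresponding to the informative features keep large norms, while the rows of the irrelevant features are driven toward zero, which is the row-sparsity effect of the L21 penalty.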

Global and Local Adaptive Graph Learning (AGGL) Model
Although the sparse-model-based approach has achieved good results in FS, there are still some problems; e.g., the above-mentioned sparse model only focuses on the sample-label relationship and ignores the geometric structural information among the samples. To better preserve the original data's geometric structural information, the method of adaptively constructing the nearest neighbor graph is usually adopted. However, the nearest neighbor information in the original feature space may be disturbed by redundant and noisy features. Previous research has shown that feature projection can effectively mitigate the negative impact of redundant and noisy features [54]. Therefore, when learning the nearest neighbor graph, the similarity matrix should be constructed through adaptive updates of sample similarities and their neighboring samples in the projected feature space. Hence, in this paper, the similarity of samples in the original high-dimensional space and the low-dimensional space is utilized to describe the local distribution structure more accurately, thus enhancing the effectiveness of the graph learning task. Specifically, we have used a coefficient reconstruction method to construct the graph, leading to the following model:

$\min_{S} \|W^\top X - W^\top X S\|_F^2 \quad \mathrm{s.t.} \quad s_{ii} = 0, \ s_{ij} \geq 0, \qquad (18)$

where $S = [s_1, s_2, \ldots, s_n]$ and $s_i$ denotes the reconstruction coefficient vector of the sample $x_i$.
The maintenance of global and local sample information is crucial for sample reconstruction. That is, the similarity between the sample that needs to be reconstructed and its surrounding samples should be maintained in the process of sample reconstruction. To achieve this goal, we have incorporated global and local constraints into the sample reconstruction process. This ensures that the sample points are better reconstructed by the most adjacent sample points, thereby improving the quality of the constructed graph. Specifically, we have combined global and local constraints with sparse learning to reconstruct samples, as shown in the following formula:

$\min_{S} \|E \odot S\|_1 \quad \mathrm{s.t.} \quad s_{ii} = 0, \ s_{ij} \geq 0, \qquad (19)$

where $\odot$ denotes the element-wise product and each element $e_{ij}$ in $E$ is defined as:

$e_{ij} = \|W^\top x_i - W^\top x_j\|_2^2.$

By combining Equations (18) and (19), the following adaptive graph construction model with global and local constraints is obtained:

$\min_{S} \|W^\top X - W^\top X S\|_F^2 + \beta \|E \odot S\|_1 \quad \mathrm{s.t.} \quad s_{ii} = 0, \ s_{ij} \geq 0, \qquad (20)$

where $\beta \geq 0$ is the balance coefficient, which aims to balance the effects of the coefficient reconstruction term and the global and local constraint term. By constructing the above model, we can effectively maintain the global and local information of the samples, thereby enhancing the similarity matrix of the graph.

Objective Function
As can be seen in Equation (17), the SFS model only utilizes the labeling information of the data. It ignores the spatial distribution of the labels, making it difficult to select the ideal subset of features. It has been shown that the structural distribution information embedded in unlabeled data is very important for FS when there is little labeling information [55]. For this reason, we have introduced the LP algorithm. Meanwhile, to make the LP process more efficient, we have introduced the adaptive graph coefficient matrix obtained by Equation (20) into LP. Therefore, a new SFS-AGGL algorithm is proposed by integrating SFS, AGGL, and LP into a unified learning framework. SFS-AGGL can account for both global and local sample information, and it is robust for FS. The objective function of SFS-AGGL is:

$\min_{W,F,S} \|X^\top W - F\|_F^2 + \theta \|W\|_{2,1} + \alpha \|W^\top X - W^\top X S\|_F^2 + \beta \|E \odot S\|_1 + \gamma \Big( \sum_{i,j} s_{ij} \|f_i - f_j\|_2^2 + \mathrm{tr}\big((F - Y)^\top U (F - Y)\big) \Big) \quad \mathrm{s.t.} \quad s_{ii} = 0, \ s_{ij} \geq 0, \qquad (21)$

where $\alpha, \beta, \gamma, \theta \geq 0$ are the balance control parameters to be adjusted in the experiments, and $\odot$ denotes the product of matrix elements at their corresponding positions. As shown in Equation (21), we first efficiently obtain the construction coefficients by imposing global and local constraints while self-representing the low-dimensional features. Therefore, redundant information cannot degrade the learning performance, since no predefined matrices are introduced. Second, we introduce the similarity matrix obtained by AGL into the LP process to improve the accuracy of label prediction. In addition, to enhance the discriminative performance of the selected features, we introduce the predicted label matrix into the SFS process and complete the FS through the mutual reinforcement of the three models: SFS, AGGL, and LP.

Model Optimization
The objective function of the SFS-AGGL method involves three variables, i.e., the feature projection matrix W, the prediction label matrix F, and the similarity matrix S. Since the objective function is non-convex in the three variables jointly, it cannot be optimized directly. However, the objective function is convex with respect to each single variable. Therefore, we can solve it step-by-step by performing convex optimization on each variable separately. The specific process of solving the objective function is as follows:

(1) Fix the variables F and S and update the variable W. Simplifying Equation (21) by removing the terms unrelated to the variable W, the optimization function in Equation (22) is obtained. From the definition of the matrix trace, Equation (23) can be derived from Equation (22) by a simple algebraic transformation. To solve Equation (23), a Lagrange multiplier and the corresponding Lagrange function are introduced (Equation (24)). Next, the partial derivative with respect to the variable W is computed and set to 0 (Equation (25)). Meanwhile, by combining the Karush-Kuhn-Tucker (KKT) conditions, we can obtain Equation (26), and therefore an update rule for the variable W (Equation (27)).

(2) Fix the variables W and S and update the variable F. We first remove the terms unrelated to the variable F from Equation (21), and the optimization function for the variable F is acquired (Equation (28)). According to the definition of the matrix trace, a simple algebraic transformation yields Equation (29). Next, we introduce a Lagrange multiplier to optimize Equation (29), and the corresponding Lagrange function is defined in Equation (30). Then, we calculate the partial derivative with respect to the variable F and set it to 0 (Equation (31)). Following the KKT conditions, we can derive Equation (32), and finally an iterative update rule for the variable F (Equation (33)).

(3) Fix the variables W
and F and update the variable S. Likewise, by removing the terms unrelated to the variable S, the optimization function takes the form of Equation (34), which can be reduced to Equation (35). Here, a Lagrange multiplier is utilized to determine the optimal solution of Equation (35), and the related Lagrange function is formulated in Equation (36). The partial derivative with respect to the variable S is then set to 0 (Equation (37)). Since the KKT conditions hold, we can obtain Equation (38), and therefore an update expression for the variable S (Equation (39)).
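The exact update expressions are those of Equations (27), (33), and (39); the multiplicative, KKT-derived pattern they follow is the same one used in classical NMF, where each nonnegative entry is multiplied by the ratio of the negative part to the positive part of the gradient, keeping iterates nonnegative and the objective non-increasing. A generic, self-contained illustration of this pattern on the standard NMF objective (an analogy, not the paper's exact updates):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.abs(rng.standard_normal((30, 20)))     # nonnegative data matrix
r = 4
W = np.abs(rng.standard_normal((30, r))) + 0.1
H = np.abs(rng.standard_normal((r, 20))) + 0.1

def objective(X, W, H):
    return np.linalg.norm(X - W @ H) ** 2

losses = [objective(X, W, H)]
for _ in range(100):
    # Multiplicative rule: entry *= (negative gradient part) / (positive part).
    H *= (W.T @ X) / (W.T @ W @ H + 1e-12)
    W *= (X @ H.T) / (W @ H @ H.T + 1e-12)
    losses.append(objective(X, W, H))
```

Because the ratio equals 1 exactly at a stationary point, such updates leave KKT points fixed, which is the same argument used for the W, F, and S updates above.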

Algorithm Description
Algorithm 1 describes the SFS-AGGL method in detail, while Figure 2 depicts its flowchart. Moreover, the SFS-AGGL algorithm stops iterating when the change of the objective function value between consecutive iterations is below a threshold or the maximum number of iterations is reached.
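The stopping rule described above can be sketched as a small generic driver loop (illustrative only; `update` and `obj` are hypothetical stand-ins for Algorithm 1's update steps and objective function):

```python
def run_until_converged(update, obj, state, tol=1e-6, max_iter=300):
    """Generic stopping rule: iterate until the change in the objective value
    between consecutive iterations drops below tol, or max_iter is reached."""
    prev = obj(state)
    for it in range(1, max_iter + 1):
        state = update(state)
        cur = obj(state)
        if abs(prev - cur) < tol:
            return state, it
        prev = cur
    return state, max_iter

# Toy example: halving x minimizes x**2, so the loop should stop well
# before max_iter once successive objective values are nearly equal.
x, iters = run_until_converged(lambda v: v / 2.0, lambda v: v * v, 10.0)
```

In SFS-AGGL, `update` would apply the three multiplicative updates for W, F, and S in sequence, and `obj` would evaluate Equation (21).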

Computational Complexity Analysis
Based on Algorithm 1, the SFS-AGGL algorithm's computational complexity comprises two parts. The first part is the computation of the diagonal auxiliary matrix U in step 2, and the second part is the updating of the three matrices (W, F, and S) during each iteration and the computation of the local matrix E in step 7. The computational or updating cost of each matrix is given in Table 2. Therefore, the total complexity of the SFS-AGGL algorithm is the sum of these per-iteration costs over all iterations, where iter is the iteration count. Furthermore, the computational complexities of other related FS methods are also presented in Table 3.

Convergence Analysis

Lemma 1. If $g(q, q')$ satisfies the two conditions $g(q, q') \geq F(q)$ and $g(q, q) = F(q)$, as shown in Equation (40), then $g(q, q')$ is an auxiliary function of $F(q)$, and $F(q)$ is non-increasing under the update

$q^{iter+1} = \arg\min_{q} g(q, q^{iter}). \qquad (42)$

It is only necessary to show that the objective is non-increasing in the variables W, F, and S under the update rule shown in Equation (42). For this purpose, we have computed and presented the first- and second-order derivatives of each formula in Table 4. □ Table 4. First- and second-order derivatives of each formula.

Equation (45) gives the corresponding auxiliary function for the variable S; Equations (43)-(45) are the auxiliary functions for the elements of W, F, and S, respectively.

□
Similarly, it is possible to prove Equations (44) and (45). Finally, based on Lemma 1, the update schemes for the variables W, F, and S are derived in this paper, as shown in Equations (51)-(53).
Next, the focus will be on demonstrating the convergence of the iteration-based Algorithm 1.
For any non-zero vectors $w^{iter}, w^{iter+1} \in \mathbb{R}^c$, the following inequality holds:

$\|w^{iter+1}\|_2 - \frac{\|w^{iter+1}\|_2^2}{2\|w^{iter}\|_2} \leq \|w^{iter}\|_2 - \frac{\|w^{iter}\|_2^2}{2\|w^{iter}\|_2}. \qquad (54)$

The proof of Equation (54) can be found in the literature [55]. □ According to Equation (55) and the definition of the matrix $H^{iter}$, Equation (56) can be rewritten accordingly, which leads to the inequality in Equation (58); Equation (59) then follows from Equation (58). Considering Equations (55)-(59) together, the inequality in Equation (60) can be obtained. The inequality in Equation (60) shows that the value of the objective function decreases per iteration, indicating the optimization algorithm's progress toward a more optimal solution at each step. In addition, since there is a lower bound on the objective function, our proposed optimization algorithm will converge. We also adopted numerical experiments to further verify the effectiveness of the optimization algorithm, and the experimental results demonstrate that the objective function value consistently decreases as the number of iterations increases.

Experiment and Analysis
In this section, the effectiveness of the proposed method is validated on classification and clustering tasks. We first used five image classification datasets to test the classification performance of the proposed method and then employed two image datasets and two subsets of UCI data to verify its clustering performance. In the experiments, we compared our proposed method with several contemporary UFS and SSFS methods, including two UFS methods (SPNFSR [56] and NNSAFS [57]) and six SSFS methods (RLSR [19], FDEFS [50], GS3FS [43], S2LFS [44], AGLRM [47], and ASLCGLFS [48]).

Description of the Comparison Methods
In order to verify the effectiveness of our method and comprehensively evaluate its strengths and weaknesses, we compared it with several classical and novel benchmark methods for unsupervised and semi-supervised FS that are similar to our method. Our method builds on and improves these existing methods, following their general trajectory of continuous improvement.
(1) SPNFSR is a UFS algorithm that uses a low-rank representation graph to maintain feature structures, and it achieves FS by imposing the L21 norm and non-negative constraints on the reconstruction coefficient matrix. Its objective function combines an L21-norm-regularized regression term with a structure-preserving term based on the matrix M, which is obtained by solving the low-rank representation. In the SPNFSR method, the processes of graph construction and feature selection are performed independently, so the quality of the matrix M directly affects the performance of feature selection.
(2) NNSAFS is a UFS algorithm that employs adaptive rank constraints and non-negative spectral feature learning. It employs sparse regression and feature mapping to mine the local structural information of the feature space to improve the adaptability of manifold learning, and its objective function includes an entropy regularization term that estimates the uniformity of the matrix S. Compared with the SPNFSR method, NNSAFS integrates graph learning and feature selection into one framework to overcome the shortcomings of the SPNFSR method. Moreover, the local structural information of the learned features is also considered. However, since NNSAFS and SPNFSR are unsupervised methods and do not consider the label information of the data, they cannot select features with good discriminability.
(3) RLSR is an SSFS method that identifies key features by learning the global and sparse solutions of the feature projection matrix. It also redefines the regression coefficients with a rescaling factor. Different from the SPNFSR and NNSAFS methods, RLSR is a semi-supervised selection method that can use both labeled and unlabeled samples to improve the discriminability of features. Moreover, it uses the L21 norm instead of the L1 norm to reduce the redundancy of the selected features.
(4) FDEFS is a supervised or semi-supervised FS method that combines margin discriminant embedding, manifold embedding, and sparse regression to achieve feature selection. Its objective involves a square matrix M_l, whose detailed calculation procedure is provided in [50].
FDEFS can be regarded as an extension of RLSR by combining discriminant embedding terms and manifold embedding terms to enhance the discriminability of selected features.
(5) GS3FS is a robust graph-based SSFS method that selects relevant and sparse features through manifold learning and the L2p norm imposed on the regularization and loss functions. Compared with the FDEFS method, GS3FS integrates LP into FDEFS. Moreover, GS3FS uses the L2p norm instead of the L21 norm to enhance the robustness of the selected features.
(6) S2LFS is a novel SSFS method that can select different feature subsets for different categories rather than one subset for all categories.
Here, $z_k$ is an indicator vector representing whether each feature is chosen for the k-th class, and $w_k$ is the prediction function for the k-th class based on the selected features.
(7) AGLRM uses AGL techniques to enhance similarity matrix construction and mitigate the adverse impact of redundant features by minimizing redundancy terms; feature correlations are evaluated via a matrix A of correlation coefficients.
Although the performance of the AGLRM method is superior to other methods, it still has shortcomings.First, the weight matrix of the graph is constrained by the L2 norm, which results in the graph lacking a sparse structure.Second, global constraints are not considered in the graph learning process, which leads to neglect of the distribution of the data and failure to explore more effective feature similarity metrics, thus affecting the performance of the method.
(8) ASLCGLFS improves similarity matrix quality by integrating label information into AGL. Additionally, it considers both local and global structures of the samples, thereby reducing redundancy in the selected features.
As an improvement over AGLRM, ASLCGLFS takes global information into account. However, introducing a predefined similarity matrix may bring in redundant information, which degrades learning performance. Therefore, instead of introducing predefined matrices, we consider imposing new constraints to learn global and local information and to reduce redundancy, thereby improving feature selection performance.
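The core idea behind adaptive graph learning, learning a sparse similarity matrix from the data rather than predefining one, can be sketched with closed-form adaptive-neighbor weights in the style of Nie et al.'s adaptive neighbors. The neighborhood size k and the toy samples below are illustrative assumptions, not the settings of any method above.

```python
def adaptive_graph(X, k=2):
    """Closed-form adaptive-neighbor similarity weights (a sketch).

    Each sample assigns nonzero weight only to its k nearest neighbors;
    the weights decrease linearly with squared distance and sum to 1,
    so the learned graph is sparse by construction. Requires at least
    k + 2 samples.
    """
    n = len(X)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # squared Euclidean distance from sample i to every other sample
        d = sorted((sum((a - b) ** 2 for a, b in zip(X[i], X[j])), j)
                   for j in range(n) if j != i)
        d_k1 = d[k][0]                      # distance to the (k+1)-th neighbor
        denom = k * d_k1 - sum(dij for dij, _ in d[:k])
        for dij, j in d[:k]:
            S[i][j] = (d_k1 - dij) / denom if denom > 0 else 1.0 / k
    return S

# Four 1-D samples: the two nearest neighbors of sample 0 get all the weight.
S = adaptive_graph([[0.0], [1.0], [2.0], [10.0]], k=2)
print([round(w, 3) for w in S[0]])  # [0.0, 0.508, 0.492, 0.0]
```

Each row of S is a probability distribution over neighbors, and distant samples receive exactly zero weight, which is the sparsity property that a dense L2-constrained graph (as in AGLRM) lacks.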

Classification Datasets
Five publicly available image datasets were used in the classification experiments: four face classification datasets (AR [58], CMU PIE [59], Extended YaleB [60], ORL [61]) and one object classification dataset (COIL20 [62]). Table 5 presents the detailed information of these datasets, in which P1 and P2 indicate the numbers of training and test samples per category, respectively. The AR dataset is a widely used standard database consisting of more than 4000 color facial images of 126 subjects (56 women and 70 men) with varying expressions, lighting changes, and external occlusions. Figure 3a shows some images from this database.
The CMU PIE dataset consists of 41,368 grayscale facial images of 68 individuals, covering subjects of different ages, genders, and skin tones under varying poses, lighting environments, and expressions. Figure 3b shows some examples from this dataset.
The Extended YaleB dataset contains 2414 face images of 38 subjects; for each subject, 64 photos were selected under different poses, different lighting environments, and 5 different shooting angles. Figure 3c shows some images from the Extended YaleB dataset.
The ORL dataset contains 400 face images from 40 volunteers. Each volunteer provided 10 images with different facial poses, expressions, and ornament occlusions, such as serious or smiling, eyes open or squinting, and wearing or not wearing accessories. Some examples from this dataset can be observed in Figure 3d.
The COIL20 dataset comprises 1440 images of 20 different objects; 72 images were taken of each object at 5-degree intervals. Some of the images from COIL20 are shown in Figure 3e.
It should be mentioned that these face databases (AR, CMU PIE, Extended YaleB, and ORL) are commonly used in existing work to evaluate method performance for the following reasons: (1) each database has a different number of samples and categories; (2) each database contains different types of face variations; (3) each database was acquired under different conditions and environments. Using these classical facial datasets to evaluate our proposed method ensures that our experimental results are adequately comparable to previous findings, allowing a better assessment of the novelty and effectiveness of our method in the field of face recognition.

Evaluation Metric
The accuracy rate [63] is employed to measure the performance of SFS-AGGL on the classification task, which is defined as: ACC = (TP + TN) / (TP + FP + FN + TN) × 100%, where TP and TN represent the numbers of correctly identified positive and negative samples, and false positives (FP) and false negatives (FN) denote negative samples misclassified as positive and positive samples misclassified as negative, respectively. A higher accuracy value indicates better classification performance.
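The accuracy formula can be computed directly from the confusion-matrix counts; the counts below are made-up numbers for illustration only.

```python
def accuracy(tp, tn, fp, fn):
    """ACC = (TP + TN) / (TP + FP + FN + TN) * 100%."""
    return 100.0 * (tp + tn) / (tp + fp + fn + tn)

# Illustrative counts: 90 of 100 samples classified correctly.
print(accuracy(tp=40, tn=50, fp=5, fn=5))  # 90.0
```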

Experimental Setup for Classification Task
In this experiment, P1 samples are randomly selected from each class for training, and the remaining P2 samples are used for testing. An FS model is then used to select a limited number of relevant features from the training data, and its effectiveness is assessed by running KNN on the test samples with only the selected subset of features. For fairness and reliability, each experiment is conducted 10 times with different training data, and the final results are reported as the average classification accuracy and standard deviation. In addition, to select the optimal parameters, we used grid search to find the optimal values of the parameters α, β, θ, and λ in the range {0.001, 0.01, 0.1, 1, 10, 100, 1000} and the optimal number of iterations in {100, 200, 300, 400, 500, 600}. The number of selected features varies from 50 to 500 in increments of 50.
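The grid search over the four balancing parameters can be sketched with itertools.product. Here, evaluate is a hypothetical stand-in for one train-and-evaluate run of the FS model plus KNN; its quadratic form is purely illustrative and simply peaks at one grid point.

```python
import itertools

# Hypothetical stand-in for training the FS model and scoring it with KNN;
# in the real experiment this would be an expensive cross-validation run.
def evaluate(alpha, beta, theta, lam):
    return -((alpha - 1) ** 2 + (beta - 0.1) ** 2 +
             (theta - 10) ** 2 + (lam - 0.01) ** 2)

grid = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
# Exhaustively score all 7^4 = 2401 parameter combinations.
best = max(itertools.product(grid, repeat=4),
           key=lambda p: evaluate(*p))
print(best)  # (1, 0.1, 10, 0.01)
```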

Analysis of Classification Results
(1) Parameter sensitivity analysis of classification The effects of the feature dimension (d), the number of iterations (iter), and the four balance parameters (α, β, θ, λ) on the performance of SFS-AGGL in the classification task are investigated. To assess SFS-AGGL's performance across varied experimental scenarios, the feature dimension, the number of iterations, and the values of the four balancing parameters were adjusted.
First, we examined the influence of the number of iterations on the performance of SFS-AGGL, with the remaining parameters set to their optimal values. As shown in Figure 4, the classification accuracy initially increases with the number of iterations, but after reaching its peak it decreases or remains stable. This demonstrates that SFS-AGGL can reduce the impact of noisy and redundant features and effectively avoid overfitting.

Second, the performance of the different methods under different feature dimensions is shown in Figure 5. From Figure 5, we can see that the accuracy of all methods is relatively low when the feature dimension is small, and gradually improves as the number of selected features increases. In most cases, the proposed SFS-AGGL outperforms the comparison methods, indicating the stronger discriminative ability of the features it selects. However, the performance of some methods decreases as the number of selected features increases, possibly because redundant or noisy features appear in higher dimensions. Nevertheless, SFS-AGGL still surpasses the comparison methods in classification accuracy, which further validates the robustness of the features chosen by SFS-AGGL.

Third, the performance of SFS-AGGL with different values of the four balancing parameters α, β, θ, and λ on the different datasets is tested. The classification results for each balance parameter are depicted in Figure 6, from which the following conclusions can be drawn: (1) The parameter α controls LP. The performance of SFS-AGGL is very sensitive to α on all datasets.
(2) The parameter β affects the performance of AGL. SFS-AGGL achieves the best performance when β is set to 0.01 on the AR dataset and to 0.1 on the other datasets. The classification accuracy of SFS-AGGL on the ORL dataset is insensitive to the value of β, whereas it is very sensitive to β on the other datasets. Therefore, β should be set to a small value to obtain better classification results.
(3) The parameter θ determines the significance of the sparse feature projection term. The performance of SFS-AGGL is insensitive to θ on the ORL, COIL20, and AR datasets, but very sensitive on the Extended YaleB and CMU PIE datasets.
(4) The parameter λ determines the importance of the global and local constraint terms. SFS-AGGL achieves high accuracy on each dataset when λ is small. However, its performance decreases with increasing λ on the CMU PIE, Extended YaleB, and AR datasets, indicating significant variation among intraclass samples in these datasets. Therefore, λ should be set to a small value when the differences between intraclass samples are large.
In summary, different values of the balancing parameters will have different effects on different datasets.The optimal parameter combinations for each dataset are listed in Table 6.
(2) Comparative analysis of classification performance
First, this section validates the classification performance of SFS-AGGL against the other methods on the five image datasets. Table 7 presents the optimal average classification accuracies and the corresponding standard deviations of the different methods. The results in Table 7 show that: (1) the SSFS methods outperform the UFS method, which indicates that the guidance of even a small number of labels is crucial to improving performance; (2) the joint FS algorithms achieve better performance than the RLSR method, which indicates that the correlation information among features is important for improving FS performance; (3) the semi-supervised methods RLSR and FDEFS are inferior to the other semi-supervised methods, which demonstrates that introducing the LP algorithm into semi-supervised methods is favorable for selecting discriminative features; (4) the proposed SFS-AGGL method outperforms the ASLCGLFS method, notably because it integrates global and local constraints into AGL. Therefore, fully considering LP and AGL in an SSFS approach is beneficial for improving performance.
Then, to demonstrate the superiority of SFS-AGGL, we employed one-tailed t-tests to determine whether SFS-AGGL significantly outperformed the comparison methods. The null hypothesis assumed that the results achieved by SFS-AGGL were equal to those of the comparison method, and the alternative hypothesis assumed they were greater. For instance, in comparing SFS-AGGL with RLSR (SFS-AGGL vs. RLSR), the hypotheses are defined as H0: SFS-AGGL = RLSR and H1: SFS-AGGL > RLSR, where SFS-AGGL and RLSR represent the average classification results obtained by the two methods on the different datasets. The experiments use a statistical significance level of 0.05, and Table 8 presents the p values of the pairwise one-tailed t-tests on the different datasets. From Table 8, it can be seen that the performance of all methods is comparable on the ORL and COIL20 datasets, since these two datasets are relatively simple, but the accuracy of our method is still slightly higher than that of the other methods. Moreover, on the AR, CMU PIE, and Extended YaleB databases, our method significantly outperforms the other methods, indicating that it is more advantageous for complex datasets.
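The pairwise test above can be sketched as a paired one-tailed t statistic over the repeated runs. The accuracy lists below are made-up illustrative numbers, not results from Table 7, and 1.833 is the standard one-tailed 0.05 critical value for 9 degrees of freedom.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """t statistic for a paired test of H1: mean(a) > mean(b)."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# Illustrative accuracies of two methods over 10 repeated runs
# (fabricated for the sketch, not the paper's actual results).
sfs_aggl = [91.2, 90.8, 92.1, 91.5, 90.9, 91.8, 92.0, 91.1, 91.6, 91.3]
rlsr     = [89.5, 89.9, 90.2, 89.7, 90.0, 89.8, 90.1, 89.6, 90.3, 89.4]

t = paired_t_statistic(sfs_aggl, rlsr)
# One-tailed critical value for df = 9 at the 0.05 level is about 1.833;
# t beyond that threshold rejects H0 (equal means) in favour of H1.
print(t > 1.833)  # True
```

In practice a library routine such as SciPy's paired t-test with a one-sided alternative would also report the exact p value; the sketch above only checks the rejection threshold.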

Clustering Experiments
This section validates the effectiveness of the SFS-AGGL method for clustering tasks. For this purpose, we used the face dataset ORL and the object dataset COIL20, as well as two UCI datasets (Libras Movement and Landsat [64]), in the experiment.

Clustering Datasets
The Libras Movement dataset contains 15 gestures with a total of 360 samples and 89 attributes, while the Landsat dataset contains multispectral images of six different geographic regions with a total of 296 samples and 36 attributes. The details of all clustering datasets used are shown in Table 9. Clustering accuracy (ACC) is defined analogously to classification accuracy, where map(·) is a function that maps the learned clustering labels to align with the ground-truth labels.
NMI is the normalized mutual information, defined in terms of the mutual information MI between the two label sets U and V, normalized to ensure fair comparisons between sets of different sizes. ARI is the adjusted Rand index, defined as ARI = (RI − Expected_RI) / (max(RI) − Expected_RI), where RI (the Rand index) denotes the fraction of sample pairs that are clustered consistently with the ground truth, Expected_RI denotes the Rand index expected under random clustering, and max(RI) is the maximum possible Rand index. The adjustment accounts for chance, with values ranging between −1 and 1, where a value closer to 1 indicates better clustering performance. Purity measures the proportion of samples assigned to the dominant true category of their cluster: Purity = (1/N) Σ_k max_j |C_k ∩ G_j|,
where C_k denotes the k-th cluster, G_j denotes the j-th true category, and N denotes the total number of samples.

Precision reflects the ratio of correctly clustered positive samples to all samples identified as positive: Precision = TP / (TP + FP).
Recall indicates the proportion of correctly clustered positive samples among all actual positive samples: Recall = TP / (TP + FN).
F-score is the harmonic mean of precision and recall, F-score = 2 × Precision × Recall / (Precision + Recall), providing a comprehensive assessment of both metrics.
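These clustering metrics can be sketched directly from their definitions; the cluster and label assignments below are illustrative toy data.

```python
from collections import Counter

def precision_recall_f(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN), F = their harmonic mean."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

def purity(clusters, truths):
    """Purity = (1/N) * sum over clusters of their dominant true-label count."""
    best = Counter()
    for (cluster, _), cnt in Counter(zip(clusters, truths)).items():
        best[cluster] = max(best[cluster], cnt)
    return sum(best.values()) / len(truths)

p, r, f = precision_recall_f(tp=8, fp=2, fn=2)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.8 0.8 0.8
# Cluster 0 holds two samples of class 0; cluster 1 splits classes 0 and 1.
print(purity([0, 0, 1, 1], [0, 0, 0, 1]))     # 0.75
```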

Analysis of Clustering Results
(1) Parameter sensitivity analysis of clustering
Figure 7 illustrates the clustering results of SFS-AGGL on the four datasets under varying parameters. With the selected feature dimension fixed, the clustering performance first increases, then decreases, and finally rises again as the parameter α grows. The performance of SFS-AGGL is sensitive to different parameter values on different datasets, which underscores the importance of tuning these values to achieve optimal clustering performance. Smaller values of the regularization parameters β and λ yield better overall performance on diverse datasets. This demonstrates that SFS-AGGL can not only acquire neighborhood information in the projected feature space but also capture the global and local sparse structures of the original feature space, ultimately leading to good performance. The performance of SFS-AGGL first improves and then decreases as λ increases on the COIL20 and Landsat datasets, indicating that SFS-AGGL is relatively sensitive to the sparse learning term. In summary, setting all balance parameters to small values enhances the clustering results of SFS-AGGL, and it is advisable to tailor the parameter values to each dataset to achieve optimal outcomes. Figure 8 shows the clustering results obtained by sequentially setting each balancing parameter to different values while keeping all other conditions optimal. The performance of SFS-AGGL is insensitive to all parameters in most cases. Notably, the clustering accuracy on the ORL dataset is relatively sensitive to an increase in the parameter β; it is therefore recommended to set β to a larger value there to optimize clustering performance.
(2) Comparative analysis of clustering performance
In this experiment, the k-means method is adopted to cluster the low-dimensional features selected by each FS method. To minimize the impact of initialization on k-means, we performed 10 clustering experiments with varied random initializations. Tables 10-13 display the average values and standard deviations of ACC, NMI, purity, ARI, F-score, precision, and recall for the RLSR, FDEFS, GS3FS, S2LFS, AGLRM, ASLCGLFS, and SFS-AGGL methods on the ORL, COIL20, Libras Movement, and Landsat datasets. These results further illustrate the superiority of the proposed SFS-AGGL over the comparison methods. The numbers in parentheses denote the feature dimensions that yield the optimal results.

Convergence and Runtime Analysis
In this section, experiments were performed on seven datasets to assess the convergence and runtime of the proposed SFS-AGGL method. Figure 9 shows the convergence curves of SFS-AGGL; the objective function values require fewer than 100 iterations to converge, which validates the efficiency of the proposed iterative optimization method. Table 14 reports the runtime of SFS-AGGL with the number of iterations set to 100 and the feature dimension set to 500. The results in Table 14 indicate that the runtime of our method is slightly higher than that of AGLRM but lower than that of the other methods. Notably, after GPU optimization, the runtime of SFS-AGGL is lower than that of all comparison methods.
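The stopping rule behind such convergence curves can be sketched generically: iterate until the decrease of the objective value falls below a tolerance. This is only a sketch of the stopping criterion, not the actual SFS-AGGL update equations; the toy objective below is an illustrative assumption.

```python
def iterate_until_converged(objective_step, tol=1e-6, max_iter=100):
    """Run an optimization step until the objective's decrease drops
    below tol, or until max_iter iterations have elapsed."""
    prev = float("inf")
    for it in range(1, max_iter + 1):
        obj = objective_step()
        if prev - obj < tol:
            return obj, it
        prev = obj
    return prev, max_iter

# Toy monotonically decreasing objective: each call halves the gap to 1.0.
state = {"obj": 2.0}
def step():
    state["obj"] = 1.0 + (state["obj"] - 1.0) / 2
    return state["obj"]

obj, iters = iterate_until_converged(step)
print(iters)  # converges in ~20 iterations, well under the cap
```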

Conclusions and Discussion
This paper proposes a semi-supervised feature selection algorithm based on an adaptive graph with global and local constraints (SFS-AGGL). The algorithm considers the sample neighborhood structure within the projected feature space, dynamically learns the optimal nearest-neighbor graph among samples, and maintains global and local sparse structures within the selected feature subset, thereby preserving the geometric structural information of the original data. Moreover, it can effectively leverage the structural distribution information of labeled data to infer label information for unlabeled samples. The incorporation of the L21 norm enhances the model's resilience to noisy features. The iterative optimization approach employed to solve for the optimal parameters is validated, confirming the convergence of the SFS-AGGL algorithm. Extensive experiments on real datasets validate the classification and clustering performance of the proposed method. Although our method achieves good performance, several issues remain: 1. Since the proposed method exploits the correlation and geometric structure of the data, it is best suited to data whose features are significantly correlated and whose distribution has a certain local structure.
2. Since our proposed method considers only the local and global structural information of the data, its applicability may be limited on datasets lacking such structure.
3. The proposed method cannot effectively select discriminative features from data with complex nonlinear structures, because it is a linear feature selection method.

Figure 3 .
Figure 3. Sample images of five datasets.

Figure 5 .
Figure 5. Classification accuracy of different methods under different feature dimensions.

Figure 6 .
Figure 6.Classification results of SFS-AGGL under different parameter values.

Figure 7 .
Figure 7. Clustering results of SFS-AGGL under different parameter values and different feature dimensions, where different colors represent different feature dimensions.

Figure 8 .
Figure 8. Clustering results of SFS-AGGL under different parameter values.

Table 1 .
Definition of the main symbols in this paper.

Figure 2. Flow chart of SFS-AGGL algorithm.

Table 2 .
The time complexity of each matrix in our proposed algorithm.

Table 3 .
Computational complexity of each iteration for FS methods.

Table 5 .
Details of the five image datasets.

Table 6 .
Optimal parameter combination for SFS-AGGL on the five datasets.

Table 7 .
Best results of each method on five image datasets (ACC).
Numbers in parentheses denote the feature dimensions yielding the optimal results.

Table 9 .
Details of four clustering datasets.

Table 10 .
The best clustering results of different methods on ORL dataset.

Table 14 .
Runtime(s) of different methods on different datasets.