Article

Supervised Feature Selection Method Using Stackable Attention Networks

1 HUANENG Power International Inc., Beijing 100031, China
2 School of Computer Science and Engineering, Central South University, Changsha 410083, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Mathematics 2025, 13(22), 3703; https://doi.org/10.3390/math13223703
Submission received: 10 October 2025 / Revised: 1 November 2025 / Accepted: 5 November 2025 / Published: 18 November 2025

Abstract

Mainstream DNN-based feature selection methods share a similar design strategy: employing one specially designed feature selection module to learn feature importance along with the model-training process. While these works achieve great success in feature selection, their shallow structures, which evaluate feature importance from a single perspective, are easily disturbed by noisy samples, especially in datasets with high-dimensional features and complex structures. To alleviate this limitation, this paper introduces a Stackable Attention architecture for Feature Selection (SAFS), which calculates stable and accurate feature weights through a set of Stackable Attention Blocks (SABlocks) rather than a single module. To avoid information loss from stacking, a feature jump concatenation structure is designed. Furthermore, an inertia-based weight update method is proposed to generate a more robust feature weight distribution. Experiments on twelve real-world datasets spanning multiple domains demonstrate that SAFS produces the best results, with significant performance margins over thirteen baselines.
MSC:
54C30; 54E35; 54H30

1. Introduction

With the rapid advancement of the Internet of Things (IoT) and industrial automation systems, both the number and dimensionality of data samples are increasing at an unprecedented rate [1]. Feature selection (FS) aims to identify a subset of relevant features that contribute most to the supervisory objective, thereby improving model interpretability and robustness [2]. Although numerous methods have been proposed to address this classical problem [3,4], many traditional approaches rely on the computation of global metrics or functions, making them increasingly challenging to apply in the era of high-dimensional and large-volume data [5].
In recent years, Deep Neural Networks (DNNs) have emerged as a prominent research direction in feature selection, owing to their ability to capture complex feature interactions and alleviate the “curse of dimensionality and volume.” Significant attention has been directed toward designing effective feature weighting mechanisms. For instance, ref. [6] introduced Deep Feature Selection (DFS), which employs a sparse one-to-one linear layer whose weights are directly used as feature weights—though this design remains susceptible to noise. To mitigate this issue, ref. [7] utilized activation potentials from individual input dimensions as a selection criterion. Ref. [8] pioneered the integration of attention mechanisms into feature selection (referred to as AFS), where feature weights are generated via a soft attention network. Subsequently, ref. [9] enhanced the attention network by incorporating multi-head self-attention for more stable performance. In a different vein, ref. [10] adopted stochastic gates—continuous relaxations of the Bernoulli distribution optimized via gradient descent—to identify relevant features. These methods generally rely on specific modules, such as neural weights [6,11], activation potentials [7], attention weights [8], or stochastic gates [10], to determine salient features. However, as our experimental results indicate, such voting-based architectures often struggle to capture the intricate feature–label relationships necessary for effective feature selection.
Recent FS research has primarily focused on two directions: (1) developing robust feature weighting mechanisms—e.g., ref. [12] applied batch-normalized attention to improve weight stability; (2) leveraging self-supervised learning to model data distribution—e.g., refs. [13,14] employed multi-task autoencoders to denoise input data and facilitate feature selection, while ref. [15] adopted similarity-based graph neural networks to extract structural information from unlabeled data. Notably, however, these approaches still fundamentally depend on conventional feature-wise weight generation layers. We argue that such a constrained architectural design significantly limits feature selection potential, particularly when dealing with high-dimensional data and diverse label spaces.
  • Motivation. The design of SAFS is grounded in the principles of ensemble learning and multi-perspective learning. Ensemble methods demonstrate that combining multiple weak learners yields a stronger, more accurate model. In feature selection, however, a single feature selection module (a “weak selector”) can be misled by noisy samples or complex feature interactions. This raises a key question: instead of relying on a single module, can multiple selectors work collaboratively to produce more robust and discriminative feature weights?
To address this, SAFS employs a stacking architecture where multiple selection modules are cascaded, forming a strong selector. Each successive layer refines the feature weights from its predecessor, effectively performing multi-round, collaborative voting on feature importance. The approach is theoretically supported by the bias-variance trade-off; stacking reduces variance in feature weight estimation, leading to more stable and reliable feature subsets. Furthermore, inspired by deep residual networks [16], which show that stacking non-linear transformations with skip connections helps model complex relationships while easing optimization, we stack SABlocks with feature jump connections. Additional mechanisms are incorporated to mitigate information loss during stacking and to counteract the impact of noisy data, thereby supporting more stable weight updates.
The main contributions of this paper are summarized as follows:
  • Beyond one-layer structures for feature selection: Rather than using a one-layer structure for feature weight generation, SAFS generates feature weights collaboratively from a set of stacked SABlocks. This structure enlarges the differentiation between essential and less important features and further reduces the weights of features that are only accidentally relevant.
  • Key Designs of SAFS: This research proposes a set of crucial designs to support stackable feature weight generation: (D1) The SABlock is a new type of feature weight generation module that can be easily stacked for collaborative feature weight generation; this design captures complex and intricate feature-label interactions while generating robust feature weights. (D2) "Feature jump concatenation" allows a stacked SABlock to access the original feature values rather than only processed data, avoiding error accumulation. (D3) "Inertia-based weight updating" is proposed: similar to update strategies in reinforcement learning, feature weights are updated by leveraging both current and historical weights to achieve stable weight generation.
  • Excellent experimental results: Our extensive experiments have been performed on twelve real-world datasets to validate our design. The results show that SAFS identifies the most relevant features that provide the best performance compared to thirteen state-of-the-art baselines.

2. Related Work

Feature selection, as a fundamental problem in machine learning, aims to identify the most relevant feature subset for a given learning task. Existing methods can be broadly categorized into three paradigms: filter methods, wrapper methods, and embedded methods [17]. Among these, embedded methods have gained increasing attention due to their ability to integrate feature selection within model training, particularly with the advancement of deep learning techniques. In this section, we systematically review the evolution of feature selection methods and highlight how our proposed SAFS framework addresses key limitations in existing approaches.

2.1. Traditional Feature Selection Methods

Traditional feature selection methods typically rely on handcrafted criteria to evaluate feature relevance. These include
  • Similarity-based methods, such as Fisher Score [18], which select features that maximize inter-class separation and minimize intra-class variance.
  • Information-theoretic approaches, like mRMR [19], which consider both feature relevance and redundancy using mutual information.
  • Regularization-based techniques, exemplified by LASSO [20], that induce sparsity through L1-norm regularization.
While effective in low-dimensional settings, these methods face significant challenges with high-dimensional and large-volume datasets due to their reliance on global metric computations [5]. Recent traditional methods have attempted to address these limitations by incorporating data structural information, such as neighborhood rough sets [21], multi-center and local structure learning [22], and feature dependency analysis [23]. However, their ability to capture complex, non-linear feature interactions remains limited, which motivates the need for more expressive deep learning-based approaches.

2.2. DNN-Based Feature Selection Methods

The emergence of deep neural networks has revolutionized feature selection by enabling the learning of complex feature interactions directly from data. DNN-based methods typically incorporate a custom-designed feature selection module within the network architecture:
  • Weight-based methods: DFS [6] introduced a sparse one-to-one linear layer where network weights directly represent feature importance. This approach, however, is susceptible to noise interference.
  • Activation-based methods: Roy et al. [7] used activation potentials contributed by individual input dimensions as selection metrics.
  • Attention mechanisms: AFS [8] pioneered the use of soft attention networks for feature weighting, while SANs [9] extended this with multi-head self-attention for improved stability.
  • Stochastic approaches: STG [10] employed continuous relaxations of Bernoulli distributions (stochastic gates) trained via gradient descent.
  • Architectural innovations: FIR [4] proposed a dual-network architecture with separate selector and operator networks, while FM [12] incorporated batch-wise attention to attenuate noise.
More recent developments include external attention-based feature rankers [24], sequential attention networks [25], and neurodynamics-based approaches [26]. FsNet [27] introduced a specialized selection layer for high-dimensional biological data.
Despite these advancements, a fundamental limitation persists across most DNN-based methods: their reliance on a single feature selection module. This monolithic architecture restricts their capacity to capture the complex, multi-faceted nature of feature relevance in high-dimensional spaces. When faced with noisy samples or accidental feature correlations, these single-perspective approaches are prone to generating suboptimal feature weights. It is precisely this limitation that motivates our stackable architecture in SAFS, which enables collaborative feature evaluation from multiple perspectives.

2.3. Self-Supervised and Semi-Supervised Methods

Recent research has recognized the importance of leveraging unlabeled data to improve feature selection performance. These methods employ various techniques to extract structural information from tabular data:
  • Autoencoder-based approaches: SEFS [13] and A-SFS [14] utilize multi-task autoencoders to learn latent relationships in unlabeled data, reducing noise in the original features.
  • Graph-based methods: Tan et al. [15] employed similarity-based graph neural networks to capture structural information in unlabeled data.
  • Other techniques: These include clustering assumptions [28], fuzzy relevance and redundancy analysis [29,30], heuristic functions [31], and adaptive structure learning [32,33].
While these methods demonstrate the value of leveraging unlabeled data and have shown promise in capturing data structures, they largely inherit the fundamental architectural limitations of their supervised counterparts. Most self-supervised approaches still rely on single-module feature weighting mechanisms and focus primarily on data preprocessing or auxiliary tasks, without addressing the core challenge of robust feature weight generation in the presence of noise and complex feature interactions.
This gap in the literature highlights the need for a more fundamental rethinking of feature selection architectures. Our SAFS framework addresses this by proposing a novel stackable architecture that can work in conjunction with both supervised and self-supervised paradigms. The feature jump concatenation and inertia-based weight updating mechanisms in SAFS provide a more robust foundation for feature weighting that can potentially enhance existing self-supervised methods when combined with our architecture.
In summary, while existing methods have made significant progress in feature selection, they largely overlook the potential of multi-perspective, collaborative feature evaluation. SAFS represents a paradigm shift from single-module architectures to a stackable framework that can generate more robust and discriminative feature weights through layered processing and historical weight integration.

3. Our Proposed Model

3.1. Notation and Problem Formulation

Generally, this paper denotes matrices by uppercase bold characters (e.g., $\mathbf{A}$), vectors by lowercase bold (e.g., $\mathbf{a}$), and scalars by lowercase (e.g., $a$). Some important notations are described in Table 1. The tabular dataset is defined as $\mathbf{X} \in \mathbb{R}^{n \times m}$ with $n$ samples and $m$ dimensions (features); the $i$-th sample is denoted $\mathbf{x}_i$, the $i$-th feature is $\mathbf{X}_i$, and the $j$-th feature of the $i$-th sample is $x_{ij}$. All samples in $\mathbf{X}$ are labeled, corresponding to $\mathbf{y} = (y_1, y_2, \ldots, y_n)$.
Given realizations of an unknown data distribution $P(\mathbf{X}, \mathbf{y})$, the goal of feature selection is to select a subset of $k$ indices $v \subset [m]$, $|v| = k$, such that $\mathbf{X}_v$ best predicts the target $\mathbf{y}$. Thus, the objective of embedded feature selection can be stated as the following optimization problem:
$$\min_{v \subset [m]} \; \mathbb{E}_{P(\mathbf{X}, \mathbf{y})} \big[ L\big(f(\mathbf{X}_v), \mathbf{y}\big) \big] \quad (1)$$
where $L(\cdot)$ is the loss function (e.g., cross-entropy for classification) and $f(\cdot)$ is a predictive model that maps the selected features $\mathbf{X}_v$ to the target $\mathbf{y}$. The expectation $\mathbb{E}_{P(\mathbf{X}, \mathbf{y})}$ is taken over the underlying data distribution, emphasizing that the goal is to find a feature subset that generalizes well. Solving this optimization directly is intractable due to the combinatorial nature of the subset selection over $v \subset [m]$. Embedded methods such as SAFS circumvent this by jointly learning the feature weights (which implicitly define $v$) and the model parameters through continuous optimization, making the process computationally feasible.
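To make the intractability concrete, the following Python sketch (all sizes hypothetical) contrasts exhaustive subset search with the continuous-weight shortcut that embedded methods rely on:

```python
import math
import random

# Hypothetical sizes for illustration only.
m, k = 20, 5

# Exact subset search must score every candidate v ⊂ [m] with |v| = k:
# already C(20, 5) = 15,504 subsets for a tiny 20-feature problem.
n_subsets = math.comb(m, k)
print(n_subsets)

# Embedded methods instead learn one continuous weight per feature and
# rank features by weight, reducing selection to a single sort.
random.seed(0)
weights = [random.random() for _ in range(m)]  # stand-in for learned weights
v = sorted(range(m), key=lambda j: weights[j], reverse=True)[:k]
assert len(v) == k and len(set(v)) == k
```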

3.2. Design Principle

To accurately calculate feature weights through the collaboration of multiple FS modules, an appropriate architecture for collaborative weight generation is needed. One natural solution sequentially stacks several feature selection modules and uses their multiplied weights as the final result; another ensembles several modules in parallel and uses their averaged weights. Both designs involve trade-offs. In the sequential structure, each feature selection module acts as a filter that assigns a weight to every feature. Thanks to its multiplicative nature, this structure filters out irrelevant features hierarchically and widens the gaps between feature weights. Because feature selection usually targets only a small fraction of the most relevant features (often just 1–3% of the total), such widened gaps clearly benefit selection. However, the sequential structure is vulnerable to even small errors: if any module accidentally generates wrong feature weights, those errors propagate to its downstream modules. In the parallel structure, by contrast, each module accesses the inputs directly and is not influenced by the others, but without inter-module refinement its feature selection capability can be impaired.

3.3. SAFS Architecture

Thus, rather than choosing one of the two, SAFS adopts a hybrid structure combining both. Figure 1 illustrates the global structure of SAFS. The basic feature selection block is the SABlock, which evaluates feature importance. From a weight-generation perspective, the SABlocks are placed sequentially; however, in addition to the output of the previous SABlock, each block can directly access the original inputs to avoid possible information loss. For collaborative weight generation, an "inertia-based collaborative weight updating" mechanism is also designed, adapted from deep reinforcement learning.
  • The Stackable Attention Block. This research introduces the basic building block of SAFS, named SABlock, which assigns an individual weight to each feature. Its structure is shown in the left part of Figure 2. The SABlock consists of two fully connected (FC) layers and one batch normalization layer. In each iteration, a SABlock receives a batch-wise input $\mathbf{X}_B = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_B]^T$ sampled from the training dataset $\mathbf{X}$, where $B$ is the batch size. FC layer 1 is defined as Equation (2):
$$\mathbf{H} = \boldsymbol{\Theta}_1 \cdot \mathbf{X}_B + \mathbf{c}_1 \quad (2)$$
where $\boldsymbol{\Theta}_1$ is a trainable weight matrix and $\mathbf{c}_1$ is a bias vector. This layer condenses the original $m$-dimensional data into embeddings $\mathbf{H}$ in a smaller space with $|h| < m$ ($|h|$ denotes the number of hidden units), reducing possible noise and duplication. The activation function is removed to avoid overly complex non-linear transformations. To mitigate the bias caused by internal covariate shift and to prevent overfitting, a batch normalization layer re-centers and re-scales the input $\mathbf{H}$:
$$\hat{\mathbf{H}} = \gamma \cdot N(\mathbf{H}) + \epsilon \quad (3)$$
where $N(\cdot)$ is the normalization function, and the bias $\epsilon$ and gain $\gamma$ are learnable parameters. FC layer 2 then maps $\hat{\mathbf{H}}$ back to a batch-wise weight matrix of the same size as the input, with a $\tanh(\cdot)$ activation to capture non-linear relationships among features. Finally, the feature weight vector $\mathbf{w}$ is calculated by column-wise averaging of the weight matrix, mitigating the impact of noise introduced by particular noisy samples and/or accidental correlations. This process is given by Equation (4):
$$\mathbf{w} = \frac{1}{B} \sum_{k=1}^{B} \tanh\big(\boldsymbol{\Theta}_2 \cdot \hat{\mathbf{H}} + \mathbf{c}_2\big)_{k} \quad (4)$$
where $\boldsymbol{\Theta}_2$ is a trainable weight matrix, $\mathbf{c}_2$ is a bias vector, and the subscript $k$ selects the $k$-th row (sample) of the batch-wise weight matrix.
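Equations (2)–(4) can be sketched end-to-end in numpy; all parameter names and sizes below are hypothetical stand-ins for the trained quantities:

```python
import numpy as np

rng = np.random.default_rng(0)

def sablock_forward(XB, Theta1, c1, Theta2, c2, gamma=1.0, eps_bias=0.0):
    """One SABlock pass (Equations (2)-(4)); a numpy sketch with made-up names.

    XB: (B, m) input batch; returns a length-m feature weight vector w.
    """
    # FC layer 1: linear projection to |h| < m hidden units, no activation.
    H = XB @ Theta1.T + c1                               # (B, h)
    # Batch normalization: re-center/re-scale, then learnable gain and bias.
    Hn = (H - H.mean(axis=0)) / (H.std(axis=0) + 1e-5)
    H_hat = gamma * Hn + eps_bias                        # (B, h)
    # FC layer 2 + tanh maps back to a (B, m) weight matrix.
    W = np.tanh(H_hat @ Theta2.T + c2)                   # (B, m)
    # Column-wise averaging over the batch suppresses per-sample noise.
    return W.mean(axis=0)                                # (m,)

B, m, h = 32, 10, 4
XB = rng.normal(size=(B, m))
w = sablock_forward(XB,
                    Theta1=rng.normal(size=(h, m)), c1=np.zeros(h),
                    Theta2=rng.normal(size=(m, h)), c2=np.zeros(m))
assert w.shape == (m,) and np.all(np.abs(w) <= 1.0)  # tanh bounds each weight
```

Because every entry passes through $\tanh$ before averaging, each feature weight is bounded in $(-1, 1)$, which keeps the later multiplicative re-weighting numerically tame.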
  • Design for Collaborative Weight Generation. This section elaborates on the key designs that enable multiple SABlocks to work in concert, thereby generating more robust and discriminative feature weights. To precisely delineate the data flow and weight generation across layers, superscripts denote the layer of origin: for example, $\mathbf{X}_B^{(0)}$ signifies the original input batch, whereas $\mathbf{w}^{(s)}$ represents the feature weight vector generated by the $s$-th SABlock.
  • Feature Jump Concatenation: Mitigating Information Attenuation. A critical challenge in deep sequential feature weighting is the potential loss of information through layers, where features attenuated early might be irrecoverable later even if they are relevant. To address this, we design the feature jump concatenation mechanism, allowing each SABlock to receive inputs from two parallel paths, as illustrated in the right panel of Figure 2.
1. Path A: Refined Input from Predecessor. This path carries the output of the preceding SABlock, given by the element-wise product $\mathbf{w}^{(s)} \odot \mathbf{X}_B^{(s)}$. It represents a refined view of the features, where the influence of each feature has been selectively amplified or dampened based on the consensus of the previous $s$ layers.
2. Path B: Persistent Original Input. This path provides direct, unaltered access to the original input batch $\mathbf{X}_B^{(0)}$. It serves as an information-rich anchor, ensuring that no feature value is entirely lost due to potentially premature weighting in earlier blocks.
The input to the next SABlock is formed by concatenating these two paths:
$$\mathbf{X}_B^{(s+1)} = \mathrm{Concat}\big(\mathbf{w}^{(s)} \odot \mathbf{X}_B^{(s)},\; \mathbf{X}_B^{(0)}\big) \quad (5)$$
This design ensures that every SABlock can perform its importance assessment with full context—it can either reinforce the weighting decisions of its predecessors by further amplifying features that are consistently important (via Path A), or it can re-evaluate and potentially rescue features that were previously under-weighted by accessing their original values (via Path B). This effectively combats progressive information loss and error accumulation in the stack.
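The concatenation step above reduces to a single numpy call; a minimal sketch with hypothetical shapes:

```python
import numpy as np

def jump_concat(XB_s, w_s, XB_0):
    """Feature jump concatenation: join the re-weighted predecessor output
    (Path A) with the untouched original batch (Path B) along features."""
    return np.concatenate([w_s * XB_s, XB_0], axis=1)

rng = np.random.default_rng(1)
B, m = 8, 5
XB0 = rng.normal(size=(B, m))
w = rng.uniform(size=m)           # stand-in for an SABlock's weight vector

XB1 = jump_concat(XB0, w, XB0)
assert XB1.shape == (B, 2 * m)
# Path B preserves the original values exactly, so a feature down-weighted
# in Path A can still be re-evaluated later from its raw values.
assert np.array_equal(XB1[:, m:], XB0)
```

Note that the concatenation doubles the input width seen by the next block, so in any real implementation the first FC layer of each subsequent SABlock must be sized to the widened input.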
  • Inertia-based Collaborative Weight Updating: Ensuring Stability. To foster coherent collaboration across the stack and prevent erratic weight fluctuations, we introduce an inertia-based weight update mechanism. This strategy incorporates a form of memory into the weight evolution process, ensuring that the feature weights change smoothly across layers.
The update rule, inspired by momentum optimization in gradient descent but applied directly to the feature weights, is defined as follows:
$$\mathbf{w}^{(s+1)} \leftarrow (1 - g) \cdot \mathbf{w}^{(s+1)} + g \cdot \mathbf{w}^{(s)} \quad (6)$$
Here, $\mathbf{w}^{(s+1)}$ on the right-hand side is the new weight vector freshly computed by the $(s+1)$-th SABlock from its input, and $\mathbf{w}^{(s)}$ is the weight vector from the previous layer. The inertia coefficient $g$ (with $0 \le g \le 1$) is a crucial hyperparameter that controls the blend.
1. A high value of $g$ (close to 1) implies strong inertia: the system relies heavily on the historical trajectory of weights, leading to very stable but potentially slower-to-adapt updates.
2. A low value of $g$ (close to 0) gives more weight to the current layer's instant assessment, making the system more responsive but also more susceptible to noise.
This mechanism provides two key benefits: (1) Stability: it smooths the weight updates, making training less sensitive to noisy batches or outliers that could cause large swings in a single layer's estimation. (2) Consensus-driven weighting: a feature must be consistently deemed important across successive layers to achieve a high final weight, as a single layer's strong opinion is tempered by the historical consensus. The process is initialized with $\mathbf{w}^{(0)} = \mathbf{0}$.
Finally, the output weight vector from the last SABlock (layer $S$) is normalized via the softmax function to produce the final feature importance scores $\boldsymbol{\alpha} = \mathrm{softmax}(\mathbf{w}^{(S)})$, which sum to 1 and are used for feature selection.
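Taken together, the inertia blend and the final softmax can be sketched in a few lines; the random vectors below merely stand in for per-layer SABlock outputs:

```python
import numpy as np

def inertia_update(w_new, w_prev, g):
    """Blend the freshly computed weights with the previous layer's weights."""
    return (1.0 - g) * w_new + g * w_prev

rng = np.random.default_rng(2)
m, S, g = 6, 4, 0.5

w = np.zeros(m)                    # w(0) = 0, as in the paper
for s in range(S):
    w_fresh = rng.normal(size=m)   # stand-in for the (s+1)-th SABlock output
    w = inertia_update(w_fresh, w, g)

# Final importance scores: softmax over the last layer's weights.
alpha = np.exp(w) / np.exp(w).sum()
assert np.isclose(alpha.sum(), 1.0)
assert np.all(alpha > 0)
```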
  • Learning module. The learning module models the mapping between the weighted inputs produced by the last SABlock and the outputs. It can be any form of DNN; here, a multilayer perceptron with two FC layers is adopted. All SABlocks and weights are adjusted through backpropagation until convergence. The loss function $L$ is as follows:
$$L = \mathrm{CrossEntropy}(\hat{\mathbf{y}}, \mathbf{y}) \quad (7)$$
For multi-class classification, the softmax cross-entropy is used; for binary classification, the binary cross-entropy is used.

3.4. Theoretical Analysis of Stacking for Stability

The sequential stacking of SABlocks can be viewed as an ensemble method applied to feature weighting. We can provide a simplified theoretical intuition for why stacking enhances stability.
Consider the final feature weight vector $\boldsymbol{\alpha}$ as a function of the input batch $\mathbf{X}_B$. A single-layer feature selector (e.g., a lone SABlock or attention layer) computes $\boldsymbol{\alpha} = f(\mathbf{X}_B)$. The variance of this estimator can be high if $f$ is susceptible to noise in $\mathbf{X}_B$.
In SAFS, with $S$ stacked blocks and inertia $g$, the weight update can be unfolded as follows:
$$\mathbf{w}^{(S)} = (1 - g)\, f_S(\mathbf{X}_B) + g(1 - g)\, f_{S-1}(\mathbf{X}_B) + \cdots + g^S f_0(\mathbf{X}_B)$$
where $f_s$ represents the transformation of the $s$-th SABlock (with $f_0(\mathbf{X}_B) = \mathbf{w}^{(0)}$). This is a form of exponential smoothing across layers, which has two effects:
1. Variance Reduction: The final weight $\mathbf{w}^{(S)}$ is a weighted average of the opinions of all $S$ layers. Assuming the noise in each layer's estimation is somewhat independent, the variance of the average is lower than that of a single layer, making the feature ranking less volatile across training batches.
2. Error Robustness: An erroneous weight assignment in one specific layer ($f_s$) is dampened by the contributions of the other layers. The inertia parameter $g$ controls the strength of this smoothing; a higher $g$ places more trust in the historical consensus, further increasing stability at the cost of slower adaptation.
This analysis aligns with the empirical observations in our sensitivity analysis (Section 4.6), where increasing the inertia g generally led to more stable performance. Furthermore, the ablation study (Section 4.5) confirms that removing the stacking structure (SAFS-s) or the inertia mechanism (SAFS-i) consistently leads to a performance drop or increased standard deviation, validating the importance of this design for robust feature selection.
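The variance-reduction argument can be checked numerically. The toy simulation below (noise scale and layer count hypothetical) treats each layer as an independently noisy estimate of a fixed weight vector and compares the last layer alone against the inertia-blended stack:

```python
import numpy as np

# Each "layer" observes the same true weight vector corrupted by independent
# noise; the inertia recursion w <- (1-g)*f_s + g*w is exponential smoothing.
rng = np.random.default_rng(3)
true_w = np.array([1.0, 0.5, 0.0, -0.5])
S, g, trials = 5, 0.5, 2000

single, smoothed = [], []
for _ in range(trials):
    w = np.zeros_like(true_w)            # w(0) = 0
    for s in range(S):
        f_s = true_w + rng.normal(scale=0.5, size=true_w.shape)
        w = (1 - g) * f_s + g * w
    single.append(f_s)                   # last layer's estimate alone
    smoothed.append(w)                   # inertia-blended stack

var_single = np.var(np.array(single), axis=0).mean()
var_smoothed = np.var(np.array(smoothed), axis=0).mean()
assert var_smoothed < var_single  # stacking + inertia damps estimation noise
```

The smoothing also shrinks the estimate slightly toward the zero initialization (by a factor of $1 - g^S$), which is the bias side of the bias–variance trade-off discussed above.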

3.5. Pseudocode of Proposed SAFS

The pseudocode of SAFS is shown in Algorithm 1. In each iteration, a batch of $B$ samples is randomly selected from the dataset and processed by the stacked SABlocks to obtain a weight vector $\mathbf{w}$. After the last SABlock, a softmax function transforms the weights into feature importance scores $\boldsymbol{\alpha}$; finally, $\boldsymbol{\alpha}$ is evaluated by a multi-layer perceptron (MLP).
Algorithm 1 Stackable Attention Network for Feature Selection
Input: Dataset $\mathbf{X}$, batch size $B$, bias $\epsilon$, gain $\gamma$, inertia parameter $g$, number of stacked layers $S$, number of iterations $K$
Output: feature importance scores $\boldsymbol{\alpha}$
     $k, s, \mathbf{w}^{(0)} \leftarrow 0$
    while  $k < K$  do
         $\mathbf{X}_B^{(0)} \leftarrow \mathrm{Sampling}(\mathbf{X})$ # Sample a batch as input
         $s \leftarrow 0$
        while  $s < S$  do
              $\mathbf{H} \leftarrow \mathrm{FCLayer1}(\mathbf{X}_B^{(s)})$
              $\hat{\mathbf{H}} \leftarrow \gamma \cdot N(\mathbf{H}) + \epsilon$
              $\mathbf{w}^{(s)} \leftarrow \frac{1}{B} \sum_{b=1}^{B} \mathrm{FCLayer2}(\hat{\mathbf{H}})_b$
             if  $s > 0$  then
                  $\mathbf{w}^{(s)} \leftarrow (1 - g) \cdot \mathbf{w}^{(s)} + g \cdot \mathbf{w}^{(s-1)}$ # Inertia-based weight updating (Equation (6))
             end if
              $\mathbf{X}_B^{(s+1)} \leftarrow \mathrm{Concat}\big(\mathbf{w}^{(s)} \odot \mathbf{X}_B^{(s)},\; \mathbf{X}_B^{(0)}\big)$ # Feature jump concatenation
              $s \leftarrow s + 1$
        end while
         $\boldsymbol{\alpha} \leftarrow \mathrm{softmax}(\mathbf{w}^{(S-1)})$
         $\hat{\mathbf{y}} \leftarrow \mathrm{LearningModule}(\mathbf{X}_B^{(0)} \odot \boldsymbol{\alpha})$ # MLP
         $L = \mathrm{CrossEntropy}(\hat{\mathbf{y}}, \mathbf{y})$
         $k \leftarrow k + 1$
    end while
The trainable parameters of SAFS, including the weight matrices $\boldsymbol{\Theta}_1$, $\boldsymbol{\Theta}_2$ and bias vectors $\mathbf{c}_1$, $\mathbf{c}_2$ in each SABlock, are initialized using the Xavier uniform initializer, which maintains stable gradient flow through the network at the start of training. The initial feature weight vector $\mathbf{w}^{(0)}$ is explicitly set to $\mathbf{0}$, as indicated in Algorithm 1. This zero-initialization ensures that the first SABlock initially passes the input features forward uniformly, allowing the learning process to differentiate feature importance gradually and stably from an unbiased starting point. All parameters, including the feature weights generated dynamically per batch, are updated via backpropagation to minimize the loss function $L$ (Equation (7)).
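A forward-only numpy sketch may help tie Algorithm 1 together. Parameters are random (untrained), biases are zeroed for brevity, and for shape consistency Path A here re-weights the original batch rather than the widened one; treat every name and size as a hypothetical stand-in:

```python
import numpy as np

rng = np.random.default_rng(4)

def safs_forward(XB0, S=3, h=16, g=0.5):
    """Forward pass of the stacked architecture (Algorithm 1, forward part).

    XB0: (B, m) original batch. Each loop iteration is one SABlock:
    FC1 -> batch norm -> FC2/tanh -> batch average, then inertia blending
    and feature jump concatenation before the next block.
    """
    B, m = XB0.shape
    XB, w = XB0, np.zeros(m)
    for s in range(S):
        d = XB.shape[1]  # widens to 2m after the first concatenation
        Theta1 = rng.normal(size=(h, d)) / np.sqrt(d)
        Theta2 = rng.normal(size=(m, h)) / np.sqrt(h)
        H = XB @ Theta1.T
        H_hat = (H - H.mean(0)) / (H.std(0) + 1e-5)       # batch normalization
        w_fresh = np.tanh(H_hat @ Theta2.T).mean(axis=0)  # column-wise average
        w = w_fresh if s == 0 else (1 - g) * w_fresh + g * w  # inertia blend
        XB = np.concatenate([w * XB0, XB0], axis=1)  # feature jump concat
    return np.exp(w) / np.exp(w).sum()               # alpha = softmax(w)

alpha = safs_forward(rng.normal(size=(32, 12)))
assert alpha.shape == (12,) and np.isclose(alpha.sum(), 1.0)
```

In a trainable version the per-block parameters would be learned jointly with the downstream MLP via backpropagation, rather than drawn fresh each call.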

3.6. Parallel Stacking Model (SAFS-Pa)

In theory, such a stacking structure admits infinitely many combinations. This research proposes a parallel stacking model for analysis, named SAFS-Pa. It is worth pointing out that the parallel model still belongs to the one-layer feature selection architecture, albeit one that enhances the stability of the feature selection process. As shown in the left part of Figure 3, several SABlocks are used, and each SABlock generates an attention weight vector $\mathbf{w}$. The weight vectors from the multiple SABlocks are then averaged into overall weights:
$$\mathbf{w} = \mathrm{Mean}(\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_n) \quad (8)$$
Finally, a softmax function regulates the weights into the (0, 1) range, with most of them close to 0:
$$\alpha_i = \frac{e^{w_i}}{\sum_{j=1}^{m} e^{w_j}} \quad (9)$$
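The averaging and softmax steps of SAFS-Pa amount to a few lines; the block outputs below are random stand-ins for the $n$ parallel SABlocks:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n_blocks = 8, 4

# Each parallel SABlock emits one weight vector over the m features.
block_weights = [rng.uniform(-1, 1, size=m) for _ in range(n_blocks)]

w = np.mean(block_weights, axis=0)        # average the parallel votes
alpha = np.exp(w) / np.exp(w).sum()       # softmax normalization

assert alpha.shape == (m,)
assert np.isclose(alpha.sum(), 1.0)
```

Because the blocks never see each other's outputs, a single block's error is diluted by the average, but there is no sequential refinement; this is the trade-off discussed in Section 3.2.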

4. Experiment

This section compares SAFS with several state-of-the-art baselines on twelve real-world datasets and six synthetic datasets. The source code is publicly available at https://github.com/Icannotnamemyselff/SAFS (accessed on 4 November 2025).

4.1. Experiment Settings

  • Real-world datasets. Twelve real-world tabular datasets from public repositories (OpenML and Datamicroarray) are selected for evaluation to ensure diversity and real-world applicability. Our selection criteria are designed to cover a wide spectrum of challenges in feature selection:
  • Dimensionality: We include low-dimensional (e.g., SEGMENT, 19 features), medium-dimensional (e.g., HAR, 561 features), and high-dimensional datasets (e.g., Chiaretti, 12,625 features) to test scalability.
  • Sample Size. The number of samples ranges from a few (e.g., Alon, 62 samples) to many (e.g., SVHN/CIFAR-10, 10,000 samples), assessing performance in both data-scarce and data-rich regimes.
  • Number of Classes: We include binary classification (e.g., Gravier), multi-class classification with a moderate number of classes (e.g., DNA, 3 classes), and many-class problems (e.g., ISOLET, 26 classes).
  • Domain Diversity: Datasets are drawn from various domains, including medicine, image processing (represented as tabular data), speech recognition, physics, and biology, to ensure generalizability.
This comprehensive selection mitigates the risk of overfitting to a specific data characteristic and provides a robust evaluation of the proposed method. Detailed characteristics of these datasets are summarized in Table 2.
  • Case Study Dataset. To further validate SAFS in a real-world industrial scenario, we conducted a case study on short-term wind power forecasting using a publicly available dataset from Kaggle. This time-series dataset originates from a real wind farm in Germany, containing sensor readings recorded every 15 min. It presents a challenging regression task with high feature dimensionality (76 sensors over time) and strong temporal dependencies, moving beyond curated academic benchmarks to a practical application.
  • Synthetic datasets. We follow the widely used data models $E_1$–$E_6$, which were also used for evaluation in [34,35,36,37,38]. The input features are generated independently from a 20-dimensional Gaussian distribution with no correlations across the features ($\mathbf{X} \sim N(\mathbf{0}, \mathbf{I})$), where $\mathbf{I}$ is the $20 \times 20$ identity matrix. The target is $y = \mathbb{1}\big[\frac{1}{1 + \mathrm{Logit}(\mathbf{X})} > 0.5\big]$, where $\mathbb{1}[\cdot]$ is an indicator function. The $\mathrm{Logit}(\mathbf{X})$ for each sample is calculated from different features, depending on the sign of the 11th feature $X_{11}$, and is defined as follows:
$E_1$: $\mathrm{Logit} = \exp(X_1 \times X_2)$;
$E_2$: $\mathrm{Logit} = \exp\big(\sum_{i=3}^{6} (X_i^2 - 4)\big)$;
$E_3$: $\mathrm{Logit} = \exp(-10 \times \sin(2X_7)) + 2|X_8| + X_9 + \exp(-X_{10})$;
$E_4$: $\mathrm{Logit}$ follows $E_1$ if $X_{11} < 0$, else $E_2$;
$E_5$: $\mathrm{Logit}$ follows $E_1$ if $X_{11} < 0$, else $E_3$;
$E_6$: $\mathrm{Logit}$ follows $E_2$ if $X_{11} < 0$, else $E_3$;
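A generator for these data models might look as follows; the minus signs in $E_3$ are assumed from the benchmark family cited above, so treat this as a sketch rather than the exact released code:

```python
import numpy as np

def make_synthetic(name, n=1000, seed=0):
    """Generate one of the E1-E6 datasets: X ~ N(0, I_20), binary target.

    A sketch of the stated data model; the E3 signs are an assumption.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 20))

    def e1(X):  # depends on features 1-2
        return np.exp(X[:, 0] * X[:, 1])
    def e2(X):  # depends on features 3-6
        return np.exp(np.sum(X[:, 2:6] ** 2 - 4, axis=1))
    def e3(X):  # depends on features 7-10
        return (np.exp(-10 * np.sin(2 * X[:, 6])) + 2 * np.abs(X[:, 7])
                + X[:, 8] + np.exp(-X[:, 9]))

    branch = X[:, 10] < 0  # the sign of X_11 switches the regime in E4-E6
    logits = {"E1": e1(X), "E2": e2(X), "E3": e3(X),
              "E4": np.where(branch, e1(X), e2(X)),
              "E5": np.where(branch, e1(X), e3(X)),
              "E6": np.where(branch, e2(X), e3(X))}[name]
    y = (1.0 / (1.0 + logits) > 0.5).astype(int)
    return X, y

X, y = make_synthetic("E4")
assert X.shape == (1000, 20)
assert set(np.unique(y)) <= {0, 1}
```

A useful property of $E_4$–$E_6$ is that the relevant feature set changes per sample, which is precisely what makes these benchmarks hard for global, single-perspective selectors.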
  • Baselines. The experiments compare SAFS with 13 classical and/or novel feature selection methods from three main streams.
  • ML-Based:
LASSO [20]: a classical regularized linear/logistic model; RF [39]: a classical tree-based model; XGB [40]: gradient-boosted decision trees; CCM [41]: a kernel-based model that uses measures of independence to find a feature subset.
  • DNN-Based:
AFS [8] and SANs [9]: attention-based models; FIR [4]: a dual-net architecture in which the selector generates a feature subset while the operator makes predictions; FM [12]: a batch-wise attenuation model with almost no hyperparameters; STG [10]: a continuous-relaxation-based model; NeuroFS [11]: gradually prunes uninformative features from the input layer of a sparse neural network.
  • Self-supervised:
A-SFS [14] and SEFS [13]: use multi-task autoencoders to learn latent relationships in unlabeled data.
Evaluation Protocols. In practice, ground truth about actual feature relevance is generally unavailable. Hence, following common practice in feature selection, the performance of each baseline is evaluated by the prediction accuracy achieved with its TopK selected features. This is an indirect way to assess the relevance of the discovered features on the target tasks.
Considering the need for both effectiveness and overfitting prevention in deep neural networks, two powerful and popular classifiers are employed: LightGBM [3] and CatBoost [42]. For the classification task, the Micro-F1 score (in %) is used for evaluation. All datasets are split into training and test sets at an 8:2 ratio.
In the evaluation, the number of selected TopK features is fixed at 3% of the total feature dimension, with a minimum value of K set to 5. All experiments are repeated 10 times using different random seeds, and the averaged Micro-F1 score on the test sets is reported.
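The two quantities fixed by this protocol are simple to state in code. The sketch below is ours: the exact rounding of "3% of the total feature dimension" is an assumption (we use the ceiling), and for single-label multi-class data the Micro-F1 score reduces to plain accuracy.

```python
import math
import numpy as np

def topk_count(n_features, ratio=0.03, k_min=5):
    """TopK used in the evaluation: 3% of the feature dimension, at least 5."""
    return max(k_min, math.ceil(ratio * n_features))

def micro_f1(y_true, y_pred):
    """Micro-F1 for single-label multi-class predictions equals accuracy."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float((y_true == y_pred).mean())
```

For example, a 617-dimensional dataset such as ISOLET would yield K = 19 selected features, while a 60-dimensional dataset would fall back to the minimum of K = 5.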
Parameter Details. All baselines selected for comparison use the default/recommended settings specified in their respective papers. For SAFS, the number of stacked layers is set to 3, and the number of hidden units is set to 64. This configuration was determined through preliminary experiments on a held-out validation set, balancing model capacity against computational cost. Deeper stacks (e.g., 5 or 10) showed diminishing returns or potential overfitting on smaller datasets, while shallower stacks (1 or 2) underperformed on complex datasets. A hidden size of 64 provided a good compromise between representational power and efficiency for the datasets in our study.
The gain γ and bias ϵ of the BN-layer are set to 0.9 and 10⁻⁵, respectively. The value for γ is slightly lower than the default of 1.0 to initially allow milder re-scaling of the normalized activations, which we found to contribute to training stability. The value for ϵ is a common default used to ensure numerical stability during normalization.
The inertia parameter g is set to 0.8, favoring a strong reliance on historical weights for stable updates. The batch size constitutes 10% of the total samples, a common practice that provides a reasonable estimate of the gradient. The Adam optimization method [43] is utilized, with a learning rate of 0.002.
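The inertia update with g = 0.8 corresponds to an exponential-smoothing step over the per-feature weights. The sketch below shows only this blending step; the full update rule inside SAFS may carry additional terms.

```python
import numpy as np

def inertia_update(w_hist, w_batch, g=0.8):
    """Blend historical feature weights with the current batch estimate.
    g = 0 ignores history entirely; g -> 1 freezes the weights."""
    w_hist = np.asarray(w_hist, dtype=float)
    w_batch = np.asarray(w_batch, dtype=float)
    return g * w_hist + (1.0 - g) * w_batch
```

With g = 0.8, a single noisy batch can move each feature weight by at most 20% of the gap between its current and batch-estimated values, which is the source of the stability discussed above.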

4.2. Main Results: Real-World Data

Table 3 and Table 4 present the experimental results against 13 baselines across two robust classifiers. Notably, SAFS achieves strong results on all datasets and shows significant improvements over the compared baselines, including several state-of-the-art methods. When the CatBoost classifier is used, most baseline results improve, particularly for RF and XGB, but SAFS maintains its lead, particularly on large-scale datasets such as ISOLET, SVHN, and CIFAR-10.
For the main results in Table 3, SAFS improves on the average performance of the other baselines by 3.08% to 26.01%. The improvements on high-dimensional datasets, e.g., HAR, ISOLET, SVHN, and CIFAR-10, are especially significant, highlighting the effectiveness of SAFS on high-dimensional data.
Turning to the other baselines: LASSO, due to its linear nature, achieves the worst performance on almost all datasets. RFE, as a wrapper algorithm, demands extensive computation, often exceeding the time budget (over 24 h on an EPYC 7552 with 2 × 48 cores) on high-dimensional datasets. XGB and RF, known for their strong performance and robustness in feature selection, generally rank second or third, outperforming most DNN-based solutions. However, as the dimensionality increases, their performance generally lags behind the recent baselines. The DNN-based methods AFS, SANs, and FIR are susceptible to inter-sample noise and perform worse than the tree-based methods. Two recent baselines, STG and NeuroFS, also show unstable performance on these datasets because they are susceptible to accidental correlations. The self-supervised solutions, A-SFS and SEFS, show no significant performance improvement, since they typically depend on additional unlabeled data for self-supervision and offer limited enhancements to the feature weight generation module for the high-dimensional nature of these datasets.
These results suggest that the stacked architecture can significantly improve the ability to capture complex feature interactions compared to existing one-layer approaches. SAFS also has a relatively low standard deviation (see the detailed results in Section 4.8), which clearly shows the robustness of the SAFS weight generation across datasets.
Subsequently, the feature selection performance is compared across varying numbers of TopK features. Figure 4a–d illustrates the classification accuracy on four different datasets. Notably, SAFS demonstrates significant performance advantages over the other baselines across nearly all evaluated TopK ranges, particularly for the first few selected features. This is because, as the stacking deepens, features that are more relevant to the label are given greater weights, while similar features are given more discriminative weights. This merit is rather important in real-world FS applications, as it shows that SAFS generates high-quality feature weights for different features: both highly related features and less related but still valuable ones.
While SAFS achieves the best or second-best performance on most datasets, we observe on the SEGMENT dataset (Table 3) that STG achieves a highly competitive accuracy of 96.33%, close behind SAFS at 96.75%. This minor difference does not undermine the value of SAFS but rather highlights a key characteristic: on some well-structured datasets with less complex feature interactions, a single strong baseline like STG can perform exceptionally well. The advantage of SAFS's stacked architecture becomes more pronounced in more challenging scenarios, as evidenced by its substantial performance gains on high-dimensional (e.g., SVHN and ISOLET) and noisy datasets (as shown in Section 4.5). The stackable design aims for robust superiority across a wide range of data conditions rather than outperforming every baseline on every single dataset by a large margin.
  • Why does stacking work? This part aims to answer an important question: why does stacking help the FS process? As previously analyzed, later SABlocks can provide a more precise view of the pertinent features, as accidentally associated features are gradually filtered out by the layered structure.
To verify this experimentally, the MNIST (Tabular) dataset is used to visualize the weights generated by SABlocks at different depths. The weights collected from different layers undergo a min–max normalization for better presentation. The kernel density estimation (KDE) of the feature weights is illustrated in Figure 5. Please note that the Gaussian kernel used in the visualization also applies smoothing at the boundaries, which can cause some density estimates to spill outside the range [0, 1].
In fact, the FS task usually focuses on a few features that are most relevant to the target; therefore, enhancing the identifiability of the TopK features is logically beneficial. Compared to Layer-1 and Layer-2, the weights generated by Layer-3 exhibit superior discriminative capabilities. When the weight exceeds 0.8, Layer-3 shows a lower feature density. At the same time, the peak of weight density in Layer-3 shifts to the left, which indicates that more features are considered unimportant.
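The visualization step described above (min–max normalization followed by a Gaussian KDE) can be reproduced in a few lines. This hand-rolled KDE is a sketch (the bandwidth value is our choice, not taken from the paper) and makes the boundary-spill effect explicit: a kernel centered near 0 or 1 leaks density outside [0, 1].

```python
import numpy as np

def minmax(w):
    """Min-max normalize a weight vector to [0, 1]."""
    w = np.asarray(w, dtype=float)
    return (w - w.min()) / (w.max() - w.min())

def gaussian_kde_1d(samples, grid, bandwidth=0.05):
    """Gaussian kernel density estimate evaluated on `grid`.  The kernel
    smooths across the boundaries, so density can spill outside [0, 1]."""
    s = np.asarray(samples, dtype=float)[:, None]   # shape (n, 1)
    g = np.asarray(grid, dtype=float)[None, :]      # shape (1, m)
    k = np.exp(-0.5 * ((g - s) / bandwidth) ** 2)
    return k.sum(axis=0) / (len(samples) * bandwidth * np.sqrt(2.0 * np.pi))
```

Evaluating the estimate on a grid wider than [0, 1] and integrating confirms that the total density is 1, with a small fraction lying outside the unit interval, exactly as the note above describes.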

4.3. Main Results: Synthetic Data

This section uses synthetic datasets in which the target value depends on only a subset of features, and this subset varies across samples. All datasets in this study consist of 2000 generated samples, split into training and testing sets at an 8:2 ratio. This experimental regime presents a more challenging scenario than that explored in [34]. The experiments evaluate all baselines by measuring the TPR of the informative features and the accuracy score of the LightGBM classifier (TPR = TP/(TP + FN), where TP is the number of informative features selected by the FS method, and FN is the number of informative features the method fails to recover).
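On synthetic data the informative feature indices are known, so the TPR is plain set arithmetic over indices. A minimal sketch:

```python
def true_positive_rate(selected, informative):
    """TPR = TP / (TP + FN): the fraction of informative features that the
    FS method actually recovered."""
    selected, informative = set(selected), set(informative)
    tp = len(selected & informative)   # informative features selected
    fn = len(informative - selected)   # informative features missed
    return tp / (tp + fn)
```

For example, on E1 the informative set is {X1, X2}; a method that selects X1 but not X2 scores a TPR of 0.5.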
The results are shown in Table 5, where SAFS performs best. On the relatively easy E1–E3 datasets, all methods perform well; on the complex E4–E6 datasets, DNN-based methods are weaker than tree-based methods, while SAFS still performs well, showing that it retains good feature discrimination capabilities for capturing complex non-linear feature–label relations.

4.4. Feature Selection Under Different Regimes

This section evaluates the performance of SAFS under different regimes, including scenarios where (1) the number of features exceeds the number of samples; (2) the complexity of feature selection tasks evolves; and (3) feature selection is conducted under various types of noise.
  • Feature selection in the m > n regime. The performance of SAFS is next evaluated under a more challenging regime where the number of features exceeds the number of samples ( m > n ), using only 10% of the randomly sampled data. The results are reported across varying numbers of TopK features to examine whether the proposed method can consistently identify the optimal feature subset under different TopK levels.
Figure 6a–c show the performance of the different baselines at different TopK values. Across almost all TopK ranges, SAFS achieves superior performance with few samples per class, and its performance increases steadily with K. This experiment shows that SAFS can effectively weight high-dimensional features with limited labels.
  • Feature selection under noise disturbance. Building robust models in noisy environments (e.g., industrial data, genetic engineering) is essential. To analyze model robustness under noise conditions, this study designs two noisy scenarios for evaluating feature selection methods.
  • Feature Perturbation: Three types of noise are injected into all datasets to assess baseline robustness on contaminated data:
    (a) 
    Gaussian noise with mean 0 and variance 0.3;
    (b) 
    Salt-and-pepper (S&P) noise with a noise ratio of 0.3;
    (c) 
    Mask noise where 30% of features are randomly set to zero.
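The three perturbations can be sketched as follows. Assumptions in this sketch: features are min–max normalized so salt-and-pepper noise flips entries to the extremes 0 or 1, "variance 0.3" means additive N(0, 0.3) noise, and masking is applied per entry; the paper's wording "30% of features" could also mean zeroing whole feature columns.

```python
import numpy as np

def add_noise(X, kind, ratio=0.3, seed=0):
    """Inject one of the three noise types used in the robustness study."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float).copy()
    if kind == "gaussian":
        # Additive Gaussian noise with mean 0 and variance `ratio`
        X += rng.normal(0.0, np.sqrt(ratio), X.shape)
    elif kind == "salt_pepper":
        # A `ratio` fraction of entries forced to the extremes 0 or 1
        mask = rng.random(X.shape) < ratio
        X[mask] = rng.integers(0, 2, X.shape)[mask].astype(float)
    elif kind == "mask":
        # A `ratio` fraction of entries set to zero
        X[rng.random(X.shape) < ratio] = 0.0
    else:
        raise ValueError(f"unknown noise kind: {kind}")
    return X
```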
The results in Figure 7a show that SAFS provides clear advantages over all baselines. The linear embedding method LASSO performs poorly in high-dimensional noisy settings. DNN-based methods AFS and STG exhibit sensitivity to noise, while XGB, as a strong tree-ensemble baseline, demonstrates robustness comparable to SAFS. The hybrid architecture of SAFS functions similarly to a multi-round feature voting mechanism, effectively mitigating noise impact.
Further experiments evaluate noise resistance with 30% feature masking on three real-world datasets of different scales. As shown in Figure 6d–f, stacking provides significant performance gains on the large-scale SVHN dataset. For medium- and small-scale datasets (ISOLET and DNA), the stacking depth should be limited to 5–10 layers to prevent overfitting.
  • Label Perturbation: Random label noise is introduced by replacing original labels with random ones at varying ratios. To examine the impact on critical features, the Top-1% features selected by each method are used for comparison.
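The label-perturbation procedure amounts to overwriting a fixed fraction of labels. A sketch, where we read "random ones" as labels drawn uniformly over all classes (so a replaced label may occasionally coincide with the original):

```python
import numpy as np

def perturb_labels(y, ratio, n_classes, seed=0):
    """Replace a `ratio` fraction of labels with uniformly random classes."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y).copy()
    idx = rng.choice(len(y), size=int(ratio * len(y)), replace=False)
    y[idx] = rng.integers(0, n_classes, size=len(idx))
    return y
```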
Figure 7b demonstrates that SAFS consistently identifies robust feature subsets under label perturbation. While all methods degrade as noise increases, SAFS maintains relatively stable performance across noise levels, confirming the effectiveness of its hybrid architecture.
An interesting observation is that most FS methods remain relatively stable under 5% label perturbation. As indicated in recent work [44], this may be because limited out-of-distribution (OOD) data prevents over-reliance on category-specific features.
  • Feature selection with class variation. Here, two special tasks are prepared to evaluate SAFS's generalization ability. The ISOLET dataset, which has 26 categories, is used; classes are selected in label order.
  • Changing feature selection complexity: In this setting, the number of classes gradually increases based on the index order from 0 to 25. As the number of classes grows, the dataset and the feature selection task become progressively more complex and challenging.
Results in Figure 8a indicate that as the number of categories increases, the task becomes more challenging, leading to a gradual decline in accuracy on the feature subsets. Compared to other baselines, SAFS achieves the best overall performance and demonstrates greater effectiveness with increasing category counts.
  • Feature Selection for Transfer Learning: This setting evaluates feature selection using only a subset of classes (e.g., only classes 0–3), while testing classification performance across all 26 categories. This setup assesses whether the selected features generalize to unseen classes, with fewer available classes making the task more challenging.
Figure 8b shows that with very few classes (e.g., 3 classes), most feature selection methods tend to learn features specific to the minority classes rather than generalizable representations. As the number of classes increases, the performance advantage of stronger methods becomes more pronounced. The proposed method achieves the best performance when the number of classes exceeds 3. It should be noted that while more classes provide more information, they also increase the difficulty of feature selection. Some methods (e.g., XGB and CCM) exhibit performance degradation as the number of classes grows. In contrast, SAFS maintains consistently improving performance across all evaluated ranges, clearly validating its effectiveness in this transfer learning scenario.
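The transfer protocol above, selecting features from a class subset while evaluating on all 26 classes, reduces to one boolean mask over the labels. A minimal sketch (the function name is ours):

```python
import numpy as np

def class_subset_split(X, y, fs_classes):
    """Return (data restricted to `fs_classes` for feature selection,
    full data for downstream evaluation over all classes)."""
    mask = np.isin(y, fs_classes)
    return (X[mask], y[mask]), (X, y)
```

Feature selection runs on the first returned pair (e.g., classes 0–3 only), while the classifier built on the selected features is trained and tested on the second.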

4.5. Ablation and Component Analysis

To better understand the impact of individual components in SAFS, ablation studies were conducted by sequentially removing each key component to observe corresponding performance changes.
  • Overall results. Table 6 clearly shows that all designs are important for improving performance. Our key design, the stacked architecture, plays a crucial role, as removing it results in a significant performance drop on high-dimensional datasets. Additionally, the BN-layer in each SABlock is essential for reducing the standard deviation. From the results, the feature skip connection design and the inertia-based update strategy also increase the stability of SAFS.
  • SABlock vs. MLP. The multi-layer perceptron (MLP), as a basic feature selection DNN baseline, can be easily stacked into multiple layers. If all the designed components are removed, SABlock degenerates into an MLP. In this comparison, the SABlock architecture is evaluated against a standard MLP baseline without additional design modifications. All parameters for SABlock and MLP are set identically: the hidden layer units are set to 64, the learning rate is 0.002, the batch size is 10% of the total sample size, and the number of training epochs is 10,000.
The results in Figure 9 show that the SABlock architecture achieves more stable performance on both datasets, particularly with 2∼10 layers. Although MLP exhibits some performance improvement with increasing layers, it shows unstable performance on the two datasets (low accuracy with high standard deviation), while SABlock shows stable changes.

4.6. Hyperparameter Analysis

This section presents a sensitivity analysis of key hyper-parameters: stacking depth, batch size, and inertia rate.
  • Varying stacking layers. Here, performance variations are evaluated across different numbers of stacking blocks in the proposed architecture. Table 7 presents results for six architectural variants using 1, 2, 3, 5, 10, and 20 blocks, respectively. It can be observed that SAFS generally achieves better and more stable performance by stacking more SABlocks, particularly in the range 2∼10. When there are too many stacks (e.g., layer = 20), the generated weights may focus only on the most important features and ignore the less important features, resulting in certain performance degradation.
  • Batch size and inertia parameter. Figure 10a demonstrates that when the batch size B is too small, accuracy suffers because the model is more likely to be influenced by noisy samples. An appropriate batch size contributes to the accuracy of the gradient estimate and the stability of the optimization process, so a comparably large batch size should be used to update the feature weights. This is especially evident for the SVHN and Gravier datasets. Generally, a batch size greater than 10% of the training set is preferable.
Figure 10b shows the impact of different inertia values. The results on these datasets indicate a similar upward trend in accuracy as g increases. When the inertia g = 0, the SABlocks do not collaborate and the performance is worst. Increasing g enhances the overall stability of the training process, but too high a value might impair the learning of an individual SABlock; g = 0.8 generally achieves good performance.

4.7. Discussion of Computational Complexity

During the feature selection process, the main computational cost lies in the stacking phase (in particular, the SABlocks). The computational complexity per batch is O(LD²), where L is the number of SABlock layers and D is the dimension of the input data. Although this architecture introduces additional complexity compared to one-layer solutions such as AFS, STG, and SANs, it can be executed efficiently on GPUs.
Table 8 illustrates the computational overheads of different DNN-based feature selection methods. It can be observed that when the dataset is large (e.g., SVHN), the computational overhead of DNN-based methods increases significantly. The results show that although SAFS is deeper, its computational time is lower than those of STG, SANs, and FIR, and only slightly higher than that of AFS.

4.8. More Detailed Experiment Results

4.8.1. Overall Performance with Standard Deviation

Table 9 and Table 10 present the experimental results against 13 baselines across two robust classifiers. Notably, SAFS achieves strong results on all datasets and shows significant improvements over the compared baselines, including several state-of-the-art methods. When the CatBoost classifier is used, most baseline results improve, particularly for RF and XGB, but SAFS maintains its lead, particularly on large-scale datasets such as ISOLET, SVHN, and CIFAR-10.
For the main results in Table 9, SAFS improves on the average performance of the other baselines by 3.08% to 26.01%. The improvements on high-dimensional datasets, e.g., HAR, ISOLET, SVHN, and CIFAR-10, are especially significant, highlighting the effectiveness of SAFS on high-dimensional data.
The self-supervised solutions, A-SFS and SEFS, show no significant performance improvement, since they typically depend on additional unlabeled data for self-supervision and offer limited enhancements to the feature weight generation module for the high-dimensional nature of these datasets.
These results suggest that the stacked architecture can significantly improve the ability to capture complex feature interactions compared to existing one-layer-based approaches.

4.8.2. Performances Under Different TopK Features

This section compares the performance of feature selection across varying numbers of features. Figure 11a–i shows the classification accuracy for nine different datasets.
It is evident that SAFS achieves consistent performance advantages over other baselines across nearly all evaluated TopK ranges. This merit is rather important in real-world FS applications, as it shows that SAFS generates high-quality feature weights for different features: both for highly related features and those less related but still valuable features.

4.8.3. Feature Selection in m > n Regime

The performance of SAFS is next evaluated under a more challenging regime where the number of features exceeds the number of samples (m > n). In some real-world scenarios, only limited samples are labeled. In this regime, only 10% of the samples are randomly selected from the raw datasets, with 70% allocated for training and the remaining 30% for testing. To better demonstrate overall performance, Figure 12 shows the performance of the different baselines at different TopK values. Across almost all ranges, SAFS achieves the best performance in this challenging regime with few samples per class, and its performance increases steadily with K. This experiment shows that SAFS can effectively weight high-dimensional features with limited labels.

4.8.4. Detailed Ablation Results

To better understand the impact of components in SAFS, ablation studies are conducted by systematically removing each key component and observing performance changes across multiple datasets. Table 11 clearly shows that all designs are important for improving performance. When the stacked architecture is removed, performance drops significantly on high-dimensional datasets, and the BN-layer reduces the standard deviation. From the results, the feature skip connection design and the inertia-based update strategy also increase the stability of SAFS.

5. Case Study: SAFS in Short-Term Wind Power Forecasting

  • Background. With increasing energy demands and environmental concerns over fossil fuels, renewable energy sources are essential [45]. Wind energy, being pollution-free and abundant, has gained significant global attention, and accurate wind power prediction is crucial for integrating wind energy into power grids. A high number of input features, however, negatively affects both the forecasting accuracy and the computational time of such systems. To address this issue, some studies apply feature selection methods in wind power systems [46,47].
  • Dataset. This dataset is publicly available on the Kaggle platform (https://www.kaggle.com/datasets/aymanlafaz/wind-energy-germany) (accessed on 4 November 2025). It was collected continuously from 1 January 2011 00:00:00 to 30 December 2021 07:45:00, with data recorded every 15 min. The unit includes 76 sensor monitoring points, such as the temperature of the heat exchanger converter, blade torque, and line-to-line voltage (phase voltage), resulting in 7392 feature value entries per day, approximately 192 MB.
To evaluate the proposed feature selection method on time-series data, experiments were conducted using data generated between 1 January 2020 and 30 June 2020. The dataset was chronologically split into training and testing sets with an 8:2 ratio, comprising 20,889 samples for training and 5223 for testing. The statistical information of the collected wind power data are presented in Table 12.
  • Data cleaning. Features with a missing-value ratio exceeding 50%, as well as outliers, were processed with previous-value imputation to maximally preserve working-condition invariance. For features with missing-value ratios below 50%, local sample relationships were exploited by applying K-NN imputation (K = 5) to estimate the missing values. Finally, min–max normalization was applied to each feature to facilitate faster convergence of SAFS.
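The first and last cleaning steps can be sketched in NumPy as below (previous-value imputation on one sensor series and column-wise min–max scaling). The K-NN imputation step would use, e.g., scikit-learn's KNNImputer with `n_neighbors=5` and is omitted here to keep the sketch dependency-free.

```python
import numpy as np

def forward_fill(col):
    """Previous-value imputation for one sensor series (NaN -> last value)."""
    col = np.asarray(col, dtype=float).copy()
    last = np.nan
    for i, v in enumerate(col):
        if np.isnan(v):
            col[i] = last
        else:
            last = v
    return col

def minmax_normalize(X):
    """Column-wise min-max scaling; constant columns map to 0."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)
```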
  • Experiment. The experiment aims to predict wind power over different short-term periods; a LightGBM regressor is used to evaluate the feature subsets found by the different FS methods.
Figure 13 shows the wind power prediction results using different feature selection methods. The results on the left show that, compared with using all features (Baseline), the RMSE and MAE improve when only the Top-10 features are used for prediction, with STG and SAFS giving the best results. This indicates that feature selection reduces the impact of redundant/irrelevant features on the prediction model. From the curve in Figure 13 (right), compared with STG and the Baseline, the gap between SAFS's Top-10 prediction curve and the true curve is smaller, suggesting that SAFS is beneficial for improving the accuracy of wind power prediction.

6. Limitation and Future Work

While the SAFS method proposed in this paper demonstrates promising results in feature selection, it is imperative to acknowledge certain limitations and outline potential directions for future research.
Firstly, the application of SAFS has not yet been extensively tested in industrial settings where domain-specific knowledge plays a critical role. This highlights the need for more empirical studies to validate and refine SAFS in various real-world scenarios, ensuring its adaptability and effectiveness across different industrial domains.
Secondly, the theoretical underpinnings of SAFS are still in nascent stages. Our future work will develop a comprehensive theoretical analysis that can provide insight into the convergence properties, optimality conditions, and scalability aspects of SAFS.
Lastly, the architectural choices within SAFS present another avenue for exploration. It remains an open question whether alternative structural configurations or enhancements could further improve its performance and efficiency. Investigating different construction strategies and integrating advanced techniques could potentially lead to more robust versions of SAFS.
Addressing these limitations will not only strengthen the credibility and applicability of SAFS but also contribute to the broader field of feature selection in machine learning.

7. Conclusions

This paper proposes SAFS, a novel feature selection framework that can evaluate the importance of features. Specifically, a new FS module, SABlock, is designed to capture complex feature interactions. This module is designed to be stackable, and SAFS is built from this fundamental block. Then, the feature skip connection and the inertia-based weight update method are designed to enhance the performance of SAFS. Experiments on twelve real-world datasets from diverse domains validate that the proposed model discovers features that deliver superior prediction performance for classification tasks. Analysis of the weight distribution across layers provides explanatory insights into the effectiveness of the stacked architecture. Further experiments prove that SAFS can effectively select features in multiple regimes, such as few samples and noise disturbance, which is important for practical applications. The ablation and sensitivity analyses illustrate the effectiveness of our design. One direction of future work is to verify its applicability in real-world applications together with domain experts.

Author Contributions

Conceptualization, N.G.; methodology, Z.L. and J.T.; software, Z.L. and J.T.; validation, Z.C., W.J., Z.L. and J.T.; formal analysis, Z.C. and W.J.; investigation, Z.L. and J.T.; resources, N.G.; data curation, Z.C. and W.J.; writing—original draft preparation, Z.C., Z.L. and J.T.; writing—review and editing, Z.C., Z.L. and J.T.; visualization, Z.L. and J.T.; supervision, N.G.; project administration, N.G.; funding acquisition, N.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research is partially funded by the Huaneng Group Headquarters Technology Projects with No. HNKJ23-HF97.

Data Availability Statement

The data presented in this study are openly available in [OpenML] at [https://www.openml.org/] (accessed on 4 November 2025).

Conflicts of Interest

Author Zhu Chen was employed by the company HUANENG Power International Inc. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Yin, S.; Ding, S.; Xie, X.; Luo, H. A review on basic data-driven approaches for industrial process monitoring. IEEE Trans. Ind. Electron. 2014, 61, 6418–6428. [Google Scholar] [CrossRef]
  2. Li, J.; Cheng, K.; Wang, S.; Morstatter, F.; Trevino, R.; Tang, J.; Liu, H. Feature selection: A data perspective. ACM Comput. Surv. 2017, 50, 1–45. [Google Scholar] [CrossRef]
  3. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 983–991. [Google Scholar]
  4. Wojtas, M.; Chen, K. Feature importance ranking for deep learning. Adv. Neural Inf. Process. Syst. 2020, 33, 5105–5114. [Google Scholar]
  5. Škrlj, B.; Džeroski, S.; Lavrač, N.; Petković, M. Feature Importance Estimation with Self-Attention Networks. In Proceedings of the ECAI 2020, Santiago de Compostela, Spain, 29 August–8 September 2020; pp. 1491–1498. [Google Scholar]
  6. Li, Y.; Chen, C.Y.; Wasserman, W. Deep feature selection: Theory and application to identify enhancers and promoters. J. Comput. Biol. 2016, 23, 322–336. [Google Scholar] [CrossRef] [PubMed]
7. Roy, D.; Murty, K.; Mohan, C. Feature selection using deep neural networks. In Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN), Killarney, Ireland, 12–17 July 2015; pp. 1–6.
8. Gui, N.; Ge, D.; Hu, Z. AFS: An attention-based mechanism for supervised feature selection. Proc. AAAI Conf. Artif. Intell. 2019, 33, 3705–3713.
9. Yang, B.; Wang, L.; Wong, D.; Chao, L.; Tu, Z. Convolutional self-attention networks. arXiv 2019, arXiv:1904.03107.
10. Yamada, Y.; Lindenbaum, O.; Negahban, S.; Kluger, Y. Feature selection using stochastic gates. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 10648–10659.
11. Atashgahi, Z.; Zhang, X.; Kichler, N.; Liu, S.; Yin, L.; Pechenizkiy, M.; Veldhuis, R.; Mocanu, D. Supervised feature selection with neuron evolution in sparse neural networks. arXiv 2023, arXiv:2303.07200.
12. Liao, Y.; Latty, R.; Yang, B. Feature selection using batch-wise attenuation and feature mask normalization. In Proceedings of the International Joint Conference on Neural Networks, Piscataway, NJ, USA, 18–22 July 2021; pp. 1–9.
13. Lee, C.; Imrie, F.; van der Schaar, M. Self-Supervision Enhanced Feature Selection with Correlated Gates. In Proceedings of the International Conference on Learning Representations, Virtual Event, 3–7 May 2021.
14. Qiu, Z.; Zeng, W.; Liao, D.; Gui, N. A-SFS: Semi-supervised feature selection based on multi-task self-supervision. Knowl.-Based Syst. 2022, 252, 109449.
15. Tan, J.; Gui, N.; Qiu, Z. GAEFS: Self-supervised Graph Auto-encoder enhanced Feature Selection. Knowl.-Based Syst. 2024, 290, 111523.
16. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
17. Li, G.; Yu, Z.; Yang, K.; Lin, M.; Chen, C. Exploring Feature Selection With Limited Labels: A Comprehensive Survey of Semi-Supervised and Unsupervised Approaches. IEEE Trans. Knowl. Data Eng. 2024, 36, 6124–6144.
18. Duda, R.; Hart, P.; Stork, D. Pattern Classification, 2nd ed.; Wiley Interscience: New York, NY, USA, 2001.
19. Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238.
20. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267–288.
21. Yuan, K.; Miao, D.; Pedrycz, W.; Ding, W.; Zhang, H. Ze-HFS: Zentropy-based uncertainty measure for heterogeneous feature selection and knowledge discovery. IEEE Trans. Knowl. Data Eng. 2024, 36, 7326–7339.
22. Zhang, C.; Nie, F.; Wang, R.; Li, X. Supervised Feature Selection via Multi-Center and Local Structure Learning. IEEE Trans. Knowl. Data Eng. 2024, 36, 4930–4942.
23. Qian, W.; Li, Y.; Ye, Q.; Xia, S.; Huang, J.; Ding, W. Confidence-Induced Granular Partial Label Feature Selection via Dependency and Similarity. IEEE Trans. Knowl. Data Eng. 2024, 36, 5797–5810.
24. Xue, Y.; Zhang, C.; Neri, F.; Gabbouj, M.; Zhang, Y. An external attention-based feature ranker for large-scale feature selection. Knowl.-Based Syst. 2023, 281, 111084.
25. Bateni, M.; Chen, L.; Fahrbach, M.; Fu, G.; Mirrokni, V.; Yasuda, T. Sequential Attention for Feature Selection. arXiv 2022, arXiv:2209.14881.
26. Wang, Y.; Li, X.; Wang, J. A neurodynamic optimization approach to supervised feature selection via fractional programming. Neural Netw. 2021, 136, 194–206.
27. Singh, D.; Climente-González, H.; Petrovich, M.; Kawakami, E.; Yamada, M. FsNet: Feature selection network on high-dimensional biological data. In Proceedings of the International Joint Conference on Neural Networks, Gold Coast, Australia, 18–23 June 2023; pp. 1–9.
28. Zhao, Z.; Liu, H. Semi-supervised feature selection via spectral analysis. In Proceedings of the SIAM International Conference on Data Mining, Minneapolis, MN, USA, 26–28 April 2007; pp. 641–646.
29. Liu, K.; Li, T.; Yang, X.; Chen, H.; Wang, J.; Deng, Z. SemiFREE: Semi-supervised feature selection with fuzzy relevance and redundancy. IEEE Trans. Fuzzy Syst. 2023, 31, 3384–3396.
30. Guo, Z.; Shen, Y.; Yang, T.; Li, Y.J.; Deng, Y.; Qian, Y. Semi-supervised feature selection based on fuzzy related family. Inf. Sci. 2024, 652, 119660.
31. Karimi, F.; Dowlatshahi, M.; Hashemi, A. SemiACO: A semi-supervised feature selection based on ant colony optimization. Expert Syst. Appl. 2023, 214, 119130.
32. Roffo, G.; Melzi, S.; Castellani, U.; Vinciarelli, A.; Cristani, M. Infinite feature selection: A graph-based feature filtering approach. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 4396–4410.
33. Zhang, Z.; Yao, J.; Liu, L.; Li, J.; Li, L.; Wu, X. Partial Label Feature Selection: An Adaptive Approach. IEEE Trans. Knowl. Data Eng. 2024, 36, 4178–4191.
34. Yoon, J.; Jordon, J.; Van der Schaar, M. INVASE: Instance-wise variable selection using neural networks. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
35. Jethani, N.; Sudarshan, M.; Aphinyanaphongs, Y.; Ranganath, R. Have We Learned to Explain?: How Interpretability Methods Can Learn to Encode Predictions in their Interpretations. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Virtual, 13–15 April 2021; pp. 1459–1467.
36. Arik, S.; Pfister, T. TabNet: Attentive interpretable tabular learning. Proc. AAAI Conf. Artif. Intell. 2021, 35, 6679–6687.
37. Yang, J.; Lindenbaum, O.; Kluger, Y. Locally sparse neural networks for tabular biomedical data. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 25123–25153.
38. Cohen, D.; Shnitzer, T.; Kluger, Y.; Talmon, R. Few-sample feature selection via feature manifold learning. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 6296–6319.
39. Pal, M. Random forest classifier for remote sensing classification. Int. J. Remote Sens. 2005, 26, 217–222.
40. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
41. Chen, J.; Stern, M.; Wainwright, M.; Jordan, M. Kernel feature selection via conditional covariance minimization. Adv. Neural Inf. Process. Syst. 2017, 30, 2591–2598.
42. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.; Gulin, A. CatBoost: Unbiased boosting with categorical features. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; Volume 31.
43. Kingma, D.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
44. Lee, S.; Park, C.; Lee, H.; Yi, J.; Lee, J.; Yoon, S. Removing undesirable feature contributions using out-of-distribution data. arXiv 2021, arXiv:2101.06639.
45. Wang, Y.; Zou, R.; Liu, F.; Zhang, L.; Liu, Q. A review of wind speed and wind power forecasting with deep neural networks. Appl. Energy 2021, 304, 117766.
46. Khazaei, S.; Ehsan, M.; Soleymani, S.; Mohammadnezhad-Shourkaei, H. A high-accuracy hybrid method for short-term wind power forecasting. Energy 2022, 238, 122020.
47. El-Kenawy, E.S.; Mirjalili, S.; Khodadadi, N.; Abdelhamid, A.; Eid, M.; El-Said, M.; Ibrahim, A. Feature selection in wind speed forecasting systems based on meta-heuristic optimization. PLoS ONE 2023, 18, e0278491.
Figure 1. The overall architecture of SAFS. Different SABlocks are stacked in a sequential architecture, while 'feature jump concatenation' is employed to enrich feature interactions and avoid possible information loss.
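As a rough illustration of the sequential design in Figure 1, the sketch below stacks several simplified attention blocks and feeds the raw features back in after each block. The SABlock internals here (a dense layer plus a feature-axis softmax) and the additive skip used in place of the paper's jump concatenation are illustrative assumptions, not the published architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def sablock(x, params):
    """Simplified attention block: dense layer + softmax over the feature axis.
    This stands in for the SABlock; the real block design differs."""
    W, b = params
    h = np.tanh(x @ W + b)                        # hidden representation, shape (n, m)
    logits = h @ W.T                              # project back to feature space
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return h, e / e.sum(axis=1, keepdims=True)    # per-sample feature weights

def safs_weights(x, blocks):
    """Sequentially stack blocks; the additive skip back to the raw input x
    approximates the paper's 'feature jump concatenation'."""
    h, votes = x, []
    for params in blocks:
        h, w = sablock(h, params)
        votes.append(w.mean(axis=0))              # batch-averaged weight vector
        h = 0.5 * (h + x)                         # jump connection to raw features
    return np.mean(votes, axis=0)                 # aggregate across stacked blocks
```

Because each block's vote is a softmax-normalized distribution, the aggregated output remains a valid feature-weight distribution regardless of how many blocks are stacked.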
Figure 2. The two key designs in SAFS. The left part shows the basic building block (SABlock), and the right part shows how two SABlocks are stacked.
Figure 3. The overall architecture of SAFS-Pa. Different SABlocks are stacked in a parallel architecture; each SABlock votes on the feature weights, and the average of the votes is taken as the final result.
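The parallel variant in Figure 3 can be sketched in the same spirit: independent blocks each score the raw input, and their votes are averaged. The scoring function below is a placeholder for the actual SABlock, assumed only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def block_vote(x, W):
    """One standalone block's vote (placeholder scoring): softmax-normalized
    per-feature scores computed over the batch."""
    scores = np.tanh(x @ W).mean(axis=0)          # one score per feature
    e = np.exp(scores - scores.max())
    return e / e.sum()

def safs_pa(x, num_blocks=3):
    """Parallel architecture: every block sees the raw input independently;
    the final weights are the average of the blocks' votes."""
    m = x.shape[1]
    votes = [block_vote(x, 0.1 * rng.normal(size=(m, m))) for _ in range(num_blocks)]
    return np.mean(votes, axis=0)
```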
Figure 4. Performance at different TopK levels: (a) SVHN; (b) CIFAR10; (c) GAS; (d) ISOLET. TopK denotes the number of important features selected by the respective baselines.
Figure 5. Weight distributions on MNIST (Tabular) from different layers; the larger the feature weight, the more important the feature.
Figure 6. Performance under different scenarios: (a–c) Few-shot learning with high-dimensional features (m > n), showing performance at different TopK levels; (d–f) Robustness evaluation with 30% randomly masked data on datasets of varying scales: large (SVHN), medium (ISOLET), and small (DNA).
Figure 7. Robustness analysis under different perturbation scenarios: (a) Performance with varying feature perturbations using Top-3% features; (b) Performance with 0%, 5%, and 10% label noise using Top-1% features. Average performance across all datasets is reported.
Figure 8. Feature selection performance in two specialized tasks: (a) impact of changing task complexity on feature selection effectiveness; (b) feature selection applied to transfer learning scenarios. Both plots use the ISOLET dataset, with number of classes on the x-axis and Micro-F1 score on the y-axis.
Figure 9. Performance comparison of different stacked layer architectures on datasets of varying scales: (a) smaller-scale ISOLET dataset (617 dimensions); (b) larger-scale SVHN dataset (3072 dimensions). For clearer visualization, a slight horizontal offset was applied to the curves.
Figure 10. Parameter sensitivity analysis: (a) model performance with varying batch sizes (B); (b) model performance with varying inertia parameters (g). Results are shown for different parameter ratios to demonstrate sensitivity.
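The inertia parameter g swept in Figure 10b controls how strongly previously accumulated feature weights persist across batches. The exact update rule is defined in the paper's method section; the exponential-moving-average form below is a plausible sketch, stated here as an assumption.

```python
import numpy as np

def inertia_update(w_prev, w_batch, g=0.9):
    """Inertia-style update (assumed form): a convex blend of the running
    feature weights and the current batch estimate. Larger g damps
    batch-to-batch noise in the weight distribution."""
    return g * w_prev + (1.0 - g) * w_batch
```

Because the update is a convex combination, a weight vector that starts on the probability simplex stays on it, which keeps the running weights interpretable as a distribution over features.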
Figure 11. Performance comparison across multiple datasets at different TopK levels: (1st row) SVHN, CIFAR10, MNIST; (2nd row) ISOLET, HAR, GAS; (3rd row) DNA, SATIMAGE, SEGMENT. The plots demonstrate the feature selection effectiveness across different data scales and domains.
Figure 12. Feature selection performance in the high-dimensional regime (m > n), where the number of features exceeds the number of samples: (1st row) ISOLET, CIFAR10, SVHN; (2nd row) HAR, MNIST, DNA. Results show comparative performance at different TopK levels (D = dimensions).
Figure 13. Wind power forecasting using regression models with different FS methods. The x-axis is the time step and y-axis is the wind power (kW). For SAFS, the TOP-1 feature is Torque, while TOP-2 and -3 are Power Factor and Pitch Demand Baseline Degree, respectively.
Table 1. Notation description.

| Notation | Description |
|---|---|
| n | The number of samples |
| m | The number of features |
| g | Inertia parameter |
| s | Stack parameter |
| x_i | The i-th sample |
| y_i | The i-th label |
| c_i | The bias vector |
| x_ij | The j-th feature of the i-th sample |
| K | The number of selected features |
| P(·) | Distribution of the dataset |
| X | Original data matrix |
| y | The label set corresponding to X |
| h | Hidden units in a neural network |
| w | The weight vector of features |
| X_v | The selected features |
| X_i | The i-th feature |
| X_B | The batch-wise inputs |
| H | Low-dimension embeddings |
| Θ | Trainable weight matrix |
| L(·) | Loss function |
Table 2. Datasets description.

| Datasets | Features | TopK | Classes | Samples | Domains |
|---|---|---|---|---|---|
| Chiar. | 12,625 | 3% (390) | 4 | 127 | Medical |
| SVHN (Tabular) | 3072 | 3% (92) | 10 | 10,000 | Picture |
| CIFAR10 (Tabular) | 3072 | 3% (92) | 10 | 10,000 | Picture |
| Gravier | 2905 | 3% (90) | 2 | 168 | Medical |
| Alon | 2000 | 3% (60) | 2 | 62 | Medical |
| MNIST (Tabular) | 784 | 3% (24) | 10 | 8000 | Picture |
| ISOLET | 618 | 3% (18) | 26 | 2600 | Speech |
| HAR | 561 | 3% (16) | 6 | 3000 | Physics |
| DNA | 180 | 3% (5) | 3 | 450 | Biology |
| GAS | 128 | 3% (5) | 6 | 6000 | Chemistry |
| SATIMAGE (SAT.) | 37 | 5 | 6 | 600 | Physics |
| SEGMENT (SEG.) | 19 | 5 | 7 | 1400 | Picture |
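The TopK column above corresponds to roughly 3% of each dataset's features (e.g., 3072 features on SVHN give K = 92). A minimal selection routine might look as follows; the round-to-nearest rule is an assumption, since a few table entries (e.g., ISOLET's 18) suggest the original experiments rounded slightly differently.

```python
import numpy as np

def select_topk(weights, ratio=0.03):
    """Return the indices of roughly the top `ratio` fraction of features,
    ranked by learned weight (largest first). Rounding rule is assumed."""
    k = max(1, round(len(weights) * ratio))
    return np.argsort(weights)[::-1][:k]
```

With 3072 features this yields K = 92, matching the SVHN and CIFAR10 rows of the table.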
Table 3. Average performance (Micro-F1↑) with the LightGBM classifier over ten runs. 'OT' means overtime (more than 24 h of computation on a dual EPYC 7552 system, 192 cores); '−' means no result because an internal error occurred. The best and second-best results are highlighted in bold and with underline, respectively.

| Algorithm | SEG. | SAT. | GAS | DNA | HAR | ISOLET | MNIST | Alon | Gravier | SVHN | CIFAR10 | Chiar. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LASSO | 86.95 | 74.77 | 90.21 | 68.89 | 80.57 | 66.82 | 54.31 | 72.45 | 73.72 | 22.55 | 27.23 | 82.56 |
| RFE | 93.09 | 72.88 | 94.85 | 31.25 | 83.24 | 55.86 | 40.75 | 66.84 | 68.43 | OT | OT | OT |
| RF | 96.50 | 83.44 | 89.37 | 87.48 | 93.78 | 76.73 | 61.34 | 81.05 | 82.15 | 51.96 | 36.50 | 82.82 |
| XGB | 90.52 | 83.88 | 95.37 | 82.74 | 93.35 | 68.38 | 61.78 | 84.73 | 87.25 | 57.24 | 41.43 | 85.64 |
| CCM | 95.12 | 82.56 | 93.63 | 64.59 | 82.77 | 59.12 | 42.71 | 80.05 | 75.68 | 45.52 | 41.34 | − |
| FIR | 91.73 | 73.88 | 93.61 | 43.33 | 76.72 | 63.28 | 41.18 | 80.00 | 68.23 | 47.40 | 41.24 | 73.84 |
| AFS | 95.98 | 82.09 | 96.21 | 72.67 | 91.56 | 75.20 | 62.80 | 80.00 | 80.88 | 55.89 | 38.16 | 86.92 |
| SANs | 94.04 | 79.11 | 94.37 | 61.03 | 88.77 | 68.79 | 36.86 | 84.31 | 71.37 | 45.22 | 39.84 | 71.28 |
| FM | 95.24 | 80.45 | 97.01 | 83.91 | 89.94 | 75.88 | 63.67 | 78.02 | 77.57 | 57.92 | 42.87 | 78.85 |
| STG | 96.33 | 81.77 | 95.57 | 79.55 | 94.06 | 77.94 | 62.44 | 83.15 | 80.78 | 56.30 | 40.61 | 83.68 |
| NeuroFS | 94.34 | 82.54 | 86.11 | 70.37 | 94.95 | 60.59 | 55.61 | 78.59 | 80.26 | 45.90 | 42.42 | 87.91 |
| A-SFS | 96.17 | 80.04 | 94.35 | 80.59 | 92.44 | 69.25 | 64.16 | 75.78 | 72.55 | 42.14 | 38.94 | 74.36 |
| SEFS | 95.59 | 83.42 | 96.94 | − | 76.13 | 68.07 | 61.77 | 78.64 | 75.61 | 53.76 | 39.99 | − |
| SAFS-Pa | 96.42 | 84.91 | 97.14 | 87.64 | 93.86 | 74.00 | 64.71 | 85.38 | 85.58 | 54.40 | 41.66 | 88.23 |
| SAFS | 96.75 | 84.33 | 97.99 | 89.33 | 96.17 | 80.84 | 62.64 | 86.92 | 87.55 | 60.61 | 43.86 | 88.46 |
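Tables 3 and 4 report Micro-F1, which pools true positives, false positives, and false negatives over all classes before computing F1. A small reference implementation:

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN across classes, then compute F1."""
    classes = np.unique(np.concatenate([y_true, y_pred]))
    tp = sum(np.sum((y_pred == c) & (y_true == c)) for c in classes)
    fp = sum(np.sum((y_pred == c) & (y_true != c)) for c in classes)
    fn = sum(np.sum((y_pred != c) & (y_true == c)) for c in classes)
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For single-label multiclass predictions every error is simultaneously one false positive and one false negative, so micro-F1 collapses to plain accuracy, which is a useful sanity check when reproducing these tables.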
Table 4. Average performance (Micro-F1↑) with the CatBoost classifier over ten runs. 'OT' means overtime (more than 24 h of computation on a dual EPYC 7552 system, 192 cores); '−' means no result because an internal error occurred. The best and second-best results are highlighted in bold and with underline, respectively.

| Algorithm | SEG. | SAT. | GAS | DNA | HAR | ISOLET | MNIST | Alon | Gravier | SVHN | CIFAR10 | Chiar. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LASSO | 87.88 | 75.50 | 90.68 | 69.85 | 80.51 | 68.20 | 56.85 | 78.42 | 73.92 | 25.68 | 29.29 | 78.68 |
| RFE | 93.83 | 77.89 | 95.87 | 46.67 | 85.58 | 63.72 | 43.68 | OT | OT | OT | OT | OT |
| RF | 97.42 | 84.38 | 89.50 | 87.11 | 93.81 | 77.29 | 63.22 | 83.15 | 82.54 | 58.01 | 39.27 | 84.47 |
| XGB | 90.69 | 83.72 | 95.40 | 82.44 | 92.67 | 69.23 | 64.25 | 85.79 | 89.21 | 63.38 | 43.88 | 89.73 |
| CCM | 96.12 | 82.67 | 94.05 | 68.51 | 89.83 | 67.01 | 56.10 | 84.21 | 76.07 | 52.04 | 43.97 | − |
| FIR | 94.04 | 82.22 | 92.24 | 40.74 | 76.83 | 65.33 | 38.23 | 83.15 | 74.51 | 55.82 | 43.76 | 69.74 |
| AFS | 95.02 | 78.67 | 96.38 | 77.89 | 91.80 | 75.96 | 65.10 | 83.07 | 80.51 | 58.39 | 42.56 | 85.38 |
| SANs | 93.66 | 79.55 | 94.30 | 59.40 | 89.11 | 70.71 | 30.32 | 85.36 | 76.47 | 54.26 | 41.27 | 72.30 |
| FM | 95.33 | 80.76 | 97.12 | 83.47 | 90.45 | 76.33 | 64.02 | 81.14 | 77.73 | 59.26 | 44.56 | 76.30 |
| STG | 97.09 | 83.77 | 96.44 | 86.22 | 91.20 | 75.35 | 65.11 | 86.88 | 80.78 | 59.75 | 43.22 | 81.57 |
| NeuroFS | 95.28 | 83.03 | 88.49 | 82.05 | 95.22 | 58.47 | 57.64 | 82.87 | 84.26 | 46.62 | 43.58 | 86.67 |
| A-SFS | 95.57 | 83.55 | 94.35 | 80.51 | 87.73 | 76.64 | 65.99 | 76.38 | 74.07 | 56.29 | 42.23 | 76.92 |
| SEFS | 96.08 | 82.79 | 96.67 | − | 83.33 | 77.02 | 65.42 | 80.29 | 77.89 | 54.78 | 43.01 | − |
| SAFS-Pa | 96.96 | 84.50 | 97.12 | 86.22 | 94.43 | 75.44 | 66.63 | 86.15 | 84.61 | 60.74 | 45.29 | 85.76 |
| SAFS | 97.17 | 82.62 | 98.18 | 89.11 | 95.95 | 81.82 | 65.33 | 87.38 | 88.11 | 66.86 | 46.89 | 89.69 |
Table 5. Relevant feature discovery results for synthetic datasets with 20 features. The best and second-best results are highlighted in bold and underline, respectively.

| Dataset | E1 | E2 | E3 | E4 | E5 | E6 |
|---|---|---|---|---|---|---|
| Metrics (%) | TPR/F1 | TPR/F1 | TPR/F1 | TPR/F1 | TPR/F1 | TPR/F1 |
| XGB | 100/66.7 | 100/66.7 | 96.7/64.4 | 85.7/63.2 | 85.7/63.2 | 88.9/64.0 |
| RF | 100/66.7 | 100/66.7 | 100/66.7 | 71.4/58.8 | 74.3/59.6 | 76.5/60.4 |
| AFS | 100/66.7 | 100/66.7 | 96.7/64.4 | 46.7/48.0 | 57.2/53.3 | 46.7/48.0 |
| STG | 100/66.7 | 100/66.7 | 100/66.7 | 71.4/58.8 | 61.9/55.2 | 85.2/62.9 |
| NeuroFS | 100/66.7 | 100/66.7 | 96.7/64.4 | 45.6/47.6 | 65.6/57.2 | 52.5/51.6 |
| SAFS | 100/66.7 | 100/66.7 | 100/66.7 | 98.6/65.6 | 86.7/63.5 | 85.2/62.9 |
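TPR and F1 in Table 5 measure how well the selected feature set recovers the known relevant features of each synthetic dataset. The helper below shows the computation; the example sets (5 relevant features, 10 selected) are hypothetical, chosen only to reproduce the 100/66.7 pattern that arises when every relevant feature is found but twice as many features are selected.

```python
def discovery_scores(selected, relevant):
    """TPR and F1 (in percent) of a selected feature set against the
    ground-truth relevant features of a synthetic dataset."""
    selected, relevant = set(selected), set(relevant)
    tp = len(selected & relevant)
    tpr = tp / len(relevant)                      # recall over the true features
    precision = tp / len(selected)
    f1 = 2 * precision * tpr / (precision + tpr) if tp else 0.0
    return 100 * tpr, 100 * f1
```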
Table 6. Ablation studies. SAFS-s: the stack architecture removed, with FS performed by a single SABlock; SAFS-bn: the BN layer of the SABlock removed; SAFS-i: the inertia-based weight update strategy removed; SAFS-c: the feature skip connection removed. The best and second-best results are highlighted in bold and underline, respectively.

| Datasets | SAFS | SAFS-s | SAFS-bn | SAFS-i | SAFS-c |
|---|---|---|---|---|---|
| Chiar. | 88.46 ± 6.70 | 86.92 ± 7.33 | 70.38 ± 14.49 | 87.27 ± 7.52 | 86.00 ± 7.13 |
| SVHN | 60.61 ± 0.74 | 57.03 ± 3.18 | 57.82 ± 3.23 | 59.40 ± 1.06 | 59.47 ± 1.43 |
| ISOLET | 80.84 ± 2.16 | 73.01 ± 3.68 | 65.73 ± 0.66 | 78.20 ± 2.68 | 79.09 ± 2.29 |
| GAS | 97.99 ± 0.39 | 96.83 ± 0.72 | 93.95 ± 2.07 | 97.08 ± 0.81 | 97.65 ± 0.59 |
Table 7. Performance comparison with different variants. The best and second-best results are highlighted in bold and underline, respectively.

| # of Layers | HAR | ISOLET | SVHN | Chiar. |
|---|---|---|---|---|
| 1 | 94.85 ± 0.81 | 73.01 ± 3.68 | 54.68 ± 1.61 | 86.92 ± 7.33 |
| 2 | 95.92 ± 0.99 | 81.94 ± 1.71 | 55.62 ± 1.78 | 88.23 ± 8.03 |
| 3 (Ours) | 96.17 ± 0.83 | 80.84 ± 2.16 | 60.61 ± 0.74 | 88.46 ± 6.70 |
| 5 | 96.08 ± 0.83 | 80.34 ± 1.13 | 59.64 ± 0.63 | 90.00 ± 7.73 |
| 10 | 95.71 ± 0.95 | 80.63 ± 2.05 | 59.54 ± 0.79 | 88.84 ± 7.38 |
| 20 | 95.63 ± 0.64 | 77.36 ± 3.03 | 59.69 ± 0.86 | 87.30 ± 8.25 |
Table 8. Computational complexity per iteration (in seconds).

| Dataset (Sam./Dim.) | FIR | AFS | SANs | STG | SAFS |
|---|---|---|---|---|---|
| SVHN (10,000, 3072) | 0.2128 | 0.0474 | 3.2687 | 0.1534 | 0.0748 |
| MNIST (8000, 784) | 0.0710 | 0.0233 | 0.8126 | 0.0474 | 0.0316 |
| GAS (6000, 128) | 0.0121 | 0.0073 | 0.1778 | 0.0111 | 0.0118 |
Table 9. Average performance (Micro-F1↑) with the LightGBM classifier over ten runs. 'OT' means overtime (more than 24 h of computation on a dual EPYC 7552 system, 192 cores); '−' means no result because an internal error occurred. The best and second-best results are highlighted in bold and with underline, respectively.

| Algorithm | SEGMENT | SATIMAGE | GAS | DNA | HAR | ISOLET |
|---|---|---|---|---|---|---|
| LASSO | 86.95 ± 0.78 | 74.77 ± 1.22 | 90.21 ± 0.62 | 68.89 ± 3.11 | 80.57 ± 1.02 | 66.82 ± 1.60 |
| RFE | 93.09 ± 1.17 | 72.88 ± 1.97 | 94.85 ± 0.42 | 31.25 ± 2.99 | 83.24 ± 1.36 | 55.86 ± 0.51 |
| RF | 96.50 ± 0.84 | 83.44 ± 1.83 | 89.37 ± 3.13 | 87.48 ± 2.48 | 93.78 ± 0.67 | 76.73 ± 2.25 |
| XGB | 90.52 ± 0.88 | 83.88 ± 2.23 | 95.37 ± 0.26 | 82.74 ± 2.77 | 93.35 ± 0.53 | 68.38 ± 1.45 |
| CCM | 95.12 ± 0.58 | 82.56 ± 2.47 | 93.63 ± 0.82 | 64.59 ± 5.76 | 82.77 ± 7.83 | 59.12 ± 2.66 |
| FIR | 91.73 ± 4.45 | 73.88 ± 5.49 | 93.61 ± 2.04 | 43.33 ± 7.02 | 76.72 ± 6.23 | 63.28 ± 5.77 |
| AFS | 95.98 ± 1.05 | 82.09 ± 2.56 | 96.21 ± 0.94 | 72.67 ± 8.47 | 91.56 ± 2.96 | 75.20 ± 2.21 |
| SANs | 94.04 ± 1.84 | 79.11 ± 2.69 | 94.37 ± 2.29 | 61.03 ± 5.74 | 88.77 ± 1.96 | 68.79 ± 5.11 |
| FM | 95.24 ± 0.59 | 80.45 ± 1.87 | 97.01 ± 0.98 | 83.91 ± 3.36 | 89.94 ± 3.51 | 75.88 ± 3.44 |
| STG | 96.33 ± 0.43 | 81.77 ± 3.46 | 95.57 ± 1.79 | 79.55 ± 5.88 | 94.06 ± 0.72 | 77.94 ± 1.72 |
| NeuroFS | 94.34 ± 1.55 | 82.54 ± 1.91 | 86.11 ± 5.98 | 70.37 ± 6.08 | 94.95 ± 1.51 | 60.59 ± 4.45 |
| A-SFS | 96.17 ± 0.48 | 80.04 ± 1.63 | 94.35 ± 1.33 | 80.59 ± 2.56 | 92.44 ± 1.76 | 69.25 ± 3.33 |
| SEFS | 95.59 ± 0.47 | 83.42 ± 0.97 | 96.94 ± 0.83 | − | 76.13 ± 4.29 | 68.07 ± 4.24 |
| SAFS-Pa | 96.42 ± 0.33 | 84.91 ± 3.03 | 97.14 ± 0.42 | 87.64 ± 3.14 | 93.86 ± 0.80 | 74.00 ± 4.04 |
| SAFS | 96.75 ± 1.40 | 84.33 ± 3.06 | 97.99 ± 0.39 | 89.33 ± 3.41 | 96.17 ± 0.83 | 80.84 ± 2.16 |

| Algorithm | MNIST | Alon | Gravier | SVHN | CIFAR10 | Chiaretti |
|---|---|---|---|---|---|---|
| LASSO | 54.31 ± 0.88 | 72.45 ± 10.05 | 73.72 ± 6.45 | 22.55 ± 0.66 | 27.23 ± 0.91 | 82.56 ± 5.10 |
| RFE | 40.75 ± 0.69 | 66.84 ± 10.53 | 68.43 ± 7.86 | OT | OT | OT |
| RF | 61.34 ± 1.18 | 81.05 ± 8.91 | 82.15 ± 3.53 | 51.96 ± 1.39 | 36.50 ± 1.96 | 82.82 ± 6.98 |
| XGB | 61.78 ± 0.81 | 84.73 ± 8.28 | 87.25 ± 3.24 | 57.24 ± 0.80 | 41.43 ± 0.87 | 85.64 ± 7.21 |
| CCM | 42.71 ± 4.79 | 80.05 ± 7.14 | 75.68 ± 5.34 | 45.52 ± 2.54 | 41.34 ± 1.11 | − |
| FIR | 41.18 ± 3.85 | 80.00 ± 3.93 | 68.23 ± 4.36 | 47.40 ± 1.09 | 41.24 ± 0.76 | 73.84 ± 6.76 |
| AFS | 62.80 ± 2.01 | 80.00 ± 9.85 | 80.88 ± 6.20 | 55.89 ± 1.27 | 38.16 ± 3.28 | 86.92 ± 7.33 |
| SANs | 36.86 ± 2.39 | 84.31 ± 4.21 | 71.37 ± 6.63 | 45.22 ± 4.84 | 39.84 ± 1.16 | 71.28 ± 7.50 |
| FM | 63.67 ± 2.35 | 78.02 ± 8.41 | 77.57 ± 7.19 | 57.92 ± 0.89 | 42.87 ± 2.25 | 78.85 ± 9.42 |
| STG | 62.44 ± 2.67 | 83.15 ± 6.13 | 80.78 ± 4.80 | 56.30 ± 1.04 | 40.61 ± 0.82 | 83.68 ± 8.71 |
| NeuroFS | 55.61 ± 3.97 | 78.59 ± 9.41 | 80.26 ± 4.55 | 45.90 ± 1.46 | 42.42 ± 0.73 | 87.91 ± 6.67 |
| A-SFS | 64.16 ± 2.02 | 75.78 ± 7.13 | 72.55 ± 8.36 | 42.14 ± 1.70 | 38.94 ± 1.95 | 74.36 ± 7.72 |
| SEFS | 61.77 ± 2.87 | 78.64 ± 5.71 | 75.61 ± 6.13 | 53.76 ± 5.68 | 39.99 ± 1.50 | − |
| SAFS-Pa | 64.71 ± 2.08 | 85.38 ± 7.25 | 85.58 ± 3.83 | 54.40 ± 2.08 | 41.66 ± 1.22 | 88.23 ± 7.84 |
| SAFS | 62.64 ± 2.37 | 86.92 ± 6.92 | 87.55 ± 4.23 | 60.61 ± 0.74 | 43.86 ± 1.31 | 88.46 ± 6.70 |
Table 10. Average performance (Micro-F1↑) with the CatBoost classifier over ten runs. 'OT' means overtime (more than 24 h of computation on a dual EPYC 7552 system, 192 cores); '−' means no result because an internal error occurred. The best and second-best results are highlighted in bold and with underline, respectively.

| Algorithm | SEGMENT | SATIMAGE | GAS | DNA | HAR | ISOLET |
|---|---|---|---|---|---|---|
| LASSO | 87.88 ± 0.99 | 75.50 ± 2.02 | 90.68 ± 0.57 | 69.85 ± 3.44 | 80.51 ± 1.39 | 68.20 ± 1.49 |
| RFE | 93.83 ± 0.90 | 77.89 ± 5.94 | 95.87 ± 1.52 | 46.67 ± 10.72 | 85.58 ± 1.48 | 63.72 ± 2.69 |
| RF | 97.42 ± 0.62 | 84.38 ± 1.81 | 89.50 ± 3.14 | 87.11 ± 2.63 | 93.81 ± 0.86 | 77.29 ± 1.97 |
| XGB | 90.69 ± 0.84 | 83.72 ± 1.53 | 95.40 ± 0.35 | 82.44 ± 2.86 | 92.67 ± 0.41 | 69.23 ± 0.86 |
| CCM | 96.12 ± 0.76 | 82.67 ± 2.31 | 94.05 ± 1.92 | 68.51 ± 3.07 | 89.83 ± 3.11 | 67.01 ± 6.09 |
| FIR | 94.04 ± 5.73 | 82.22 ± 3.89 | 92.24 ± 2.81 | 40.74 ± 3.12 | 76.83 ± 5.20 | 65.33 ± 3.34 |
| AFS | 95.02 ± 2.16 | 78.67 ± 3.01 | 96.38 ± 1.01 | 77.89 ± 4.11 | 91.80 ± 2.21 | 75.96 ± 2.15 |
| SANs | 93.66 ± 2.87 | 79.55 ± 3.34 | 94.30 ± 2.60 | 59.40 ± 7.53 | 89.11 ± 2.05 | 70.71 ± 5.03 |
| STG | 97.09 ± 0.45 | 83.77 ± 1.70 | 96.44 ± 1.10 | 86.22 ± 5.49 | 91.20 ± 3.92 | 75.35 ± 4.50 |
| FM | 95.33 ± 0.57 | 80.76 ± 1.79 | 97.12 ± 0.96 | 83.47 ± 3.56 | 90.45 ± 3.17 | 76.33 ± 3.03 |
| NeuroFS | 95.28 ± 1.56 | 83.03 ± 1.24 | 88.49 ± 4.82 | 82.05 ± 5.87 | 95.22 ± 2.38 | 58.47 ± 5.27 |
| A-SFS | 95.57 ± 3.55 | 83.55 ± 1.08 | 94.35 ± 1.96 | 80.51 ± 2.77 | 87.73 ± 4.20 | 76.64 ± 4.79 |
| SEFS | 96.08 ± 0.57 | 82.79 ± 2.46 | 96.67 ± 0.74 | − | 83.33 ± 2.49 | 77.02 ± 2.91 |
| SAFS-Pa | 96.96 ± 0.97 | 84.50 ± 2.53 | 97.12 ± 0.57 | 86.22 ± 6.72 | 94.43 ± 0.80 | 75.44 ± 3.40 |
| SAFS | 97.17 ± 1.15 | 82.62 ± 1.95 | 98.18 ± 0.25 | 89.11 ± 3.36 | 95.95 ± 0.72 | 81.82 ± 4.20 |

| Algorithm | MNIST | Alon | Gravier | SVHN | CIFAR10 | Chiaretti |
|---|---|---|---|---|---|---|
| LASSO | 56.85 ± 0.74 | 78.42 ± 7.96 | 73.92 ± 6.38 | 25.68 ± 0.55 | 29.29 ± 0.71 | 78.68 ± 7.20 |
| RFE | 43.68 ± 0.52 | OT | OT | OT | OT | OT |
| RF | 63.22 ± 1.28 | 83.15 ± 7.36 | 82.54 ± 5.21 | 58.01 ± 1.82 | 39.27 ± 1.39 | 84.47 ± 9.07 |
| XGB | 64.25 ± 0.61 | 85.79 ± 8.17 | 89.21 ± 5.13 | 63.38 ± 0.60 | 43.88 ± 1.02 | 89.73 ± 7.57 |
| CCM | 56.10 ± 4.25 | 84.21 ± 8.15 | 76.07 ± 7.27 | 52.04 ± 1.54 | 43.97 ± 0.62 | − |
| FIR | 38.23 ± 2.77 | 83.15 ± 5.15 | 74.51 ± 5.26 | 55.82 ± 1.21 | 43.76 ± 0.40 | 69.74 ± 6.95 |
| AFS | 65.10 ± 1.82 | 83.07 ± 5.75 | 80.51 ± 6.19 | 58.39 ± 2.10 | 42.56 ± 0.98 | 85.38 ± 11.76 |
| SANs | 30.32 ± 6.67 | 85.36 ± 5.36 | 76.47 ± 7.12 | 54.26 ± 2.47 | 41.27 ± 0.78 | 72.30 ± 7.84 |
| STG | 65.11 ± 2.65 | 86.88 ± 5.36 | 80.78 ± 5.13 | 59.75 ± 1.46 | 43.22 ± 0.99 | 81.57 ± 8.88 |
| FM | 64.02 ± 2.26 | 81.14 ± 7.56 | 77.73 ± 6.97 | 59.26 ± 0.47 | 44.56 ± 0.63 | 76.30 ± 8.24 |
| NeuroFS | 57.64 ± 2.31 | 82.87 ± 8.52 | 84.26 ± 4.74 | 46.62 ± 1.57 | 43.58 ± 0.69 | 86.67 ± 7.10 |
| A-SFS | 65.99 ± 2.40 | 76.38 ± 4.70 | 74.07 ± 8.54 | 56.29 ± 4.14 | 42.23 ± 2.67 | 76.92 ± 6.45 |
| SEFS | 65.42 ± 2.14 | 80.29 ± 6.67 | 77.89 ± 7.52 | 54.78 ± 3.62 | 43.01 ± 2.78 | − |
| SAFS-Pa | 66.63 ± 2.77 | 86.15 ± 4.61 | 84.61 ± 4.56 | 60.74 ± 2.21 | 45.29 ± 1.38 | 85.76 ± 10.88 |
| SAFS | 65.33 ± 1.79 | 87.38 ± 3.52 | 88.11 ± 5.12 | 66.86 ± 0.88 | 46.89 ± 1.27 | 89.69 ± 8.13 |
Table 11. Ablation studies. SAFS-s: the stack architecture removed, with FS performed by a single SABlock; SAFS-bn: the BN layer of the SABlock removed; SAFS-i: the inertia-based weight update strategy removed; SAFS-c: the feature skip connection removed. The best and second-best results are highlighted in bold and underline, respectively.

| Datasets | SAFS | SAFS-s | SAFS-bn | SAFS-i | SAFS-c |
|---|---|---|---|---|---|
| SVHN | 60.61 ± 0.74 | 57.03 ± 3.18 | 57.82 ± 3.23 | 59.40 ± 1.06 | 59.47 ± 1.43 |
| CIFAR10 | 43.86 ± 1.31 | 41.48 ± 1.29 | 41.39 ± 1.76 | 43.61 ± 0.90 | 43.57 ± 1.33 |
| MNIST | 62.64 ± 2.37 | 64.04 ± 2.77 | 59.98 ± 4.12 | 63.65 ± 2.05 | 62.38 ± 1.74 |
| ISOLET | 80.84 ± 2.16 | 73.01 ± 3.68 | 65.73 ± 0.66 | 78.20 ± 2.68 | 79.09 ± 2.29 |
| HAR | 96.17 ± 0.83 | 94.85 ± 0.81 | 82.25 ± 5.84 | 94.25 ± 1.80 | 95.95 ± 0.92 |
| DNA | 89.33 ± 3.41 | 85.67 ± 6.67 | 87.22 ± 3.10 | 88.88 ± 3.44 | 88.11 ± 4.95 |
| GAS | 97.99 ± 0.39 | 96.83 ± 0.72 | 93.95 ± 2.07 | 97.08 ± 0.81 | 97.65 ± 0.59 |
| SATIMAGE | 84.33 ± 3.06 | 84.67 ± 2.42 | 81.33 ± 7.09 | 83.08 ± 3.07 | 82.37 ± 2.93 |
| SEGMENT | 96.75 ± 1.37 | 96.43 ± 0.71 | 96.64 ± 1.10 | 96.04 ± 1.86 | 96.17 ± 1.31 |
Table 12. Description of the wind power dataset.

| Statistic | Mean | Std | Min | 25% | 50% | 75% | Max |
|---|---|---|---|---|---|---|---|
| Power (kW) | 1052.86 | 1083.11 | −42.19 | 33.75 | 588.81 | 2218.21 | 2777.19 |

Share and Cite

Chen, Z.; Jiang, W.; Tan, J.; Li, Z.; Gui, N. Supervised Feature Selection Method Using Stackable Attention Networks. Mathematics 2025, 13, 3703. https://doi.org/10.3390/math13223703
