1. Introduction
The information technology industry uses the buzzword Big Data for high-dimensional data with a large number of features. Big Data has three characteristics, which are simply called the 3Vs—volume, velocity, and variety: Big Data has a large volume, comes in a wide variety, and changes rapidly [1]. Data are divided into numerous categories based on size, and datasets of 10 terabytes or more can be considered Big Data. Most industries, such as the internet, biomedicine, and astronomy, have massive data with a great number of features [2]. Scaling large databases is a major issue in Big Data systems, as they usually contain a lot of redundant and irrelevant data, which consumes computing resources and also degrades performance. Thus, it is important to remove unnecessary features and extract the necessary and valuable ones in order to build good models from this Big Data. Dimensionality reduction lowers the consumption of computing resources and also improves model performance [3].
There are algorithms that reduce the dimensionality of the data that can make the learning model more generalized and denser [
4]. Dimensionality reduction is divided into two types—feature extraction and feature selection [
5]. The feature extraction technique aims to convert high-dimensional data into a low-dimensional space; the features of the low-dimensional data are linear or nonlinear combinations of the original features. The feature selection technique selects the best feature subsets from the original features using a certain process. Feature extraction is usually said to improve model performance, but it tends to compress and transform the original features, leading to data distortion and affecting the efficiency of data processing [6]. On the other hand, feature selection retains the semantic meaning of the original features and thus has better interpretability. In the feature selection technique, the most relevant features are chosen from the original dataset, whereas feature extraction creates new features by transforming the existing ones. Feature selection also reduces the cost of feature collection [7].
Feature selection is divided into three categories—filter, wrapper, and embedded. The filter feature selection technique assumes that data is completely independent of the classifier algorithm and forms the subset of features according to their measurement of contribution to class attributes [
8]. For the wrapper technique, domain knowledge is needed; a performance metric of the classification algorithm is employed to evaluate candidate feature subsets, and based on the results, the method searches for an optimal feature subset [
9]. The embedded feature selection technique incorporates feature selection into the learning process of the classifier and then searches for a feature subset through a functional optimization that is designed in advance. The embedded technique thus deletes features that have only a minor influence on the model's output and retains only the features that are essential to it [
10].
The rest of the paper is organized as follows:
Section 2 discusses related work and highlights the shortcomings.
Section 3 introduces and describes our proposed feature selection model as well as the comparison models, and also describes different classification models used in this research.
Section 4 explains the experiment setup, statistical analysis, and results. Finally,
Section 5 concludes the paper.
2. Related Work
Many researchers in the past have carried out experiments to use feature selection techniques and applied classification methods on reduced-dimensional data to improve the performance of the models. The authors in [
11] proposed the integration of Gradient Boosting (GB), Random Forest (RF), Logistic Regression with Lasso Regularization, Logistic Regression with Ridge Regression, and SVM with the K-Means-Clustering-based feature selection method. They applied the proposed model to the Coimbra breast cancer dataset. The authors developed a gene expression-based cancer classification network in [
12]. In this network, they used AlexNet-based transfer learning to extract the features and then used a hybrid fuzzy ranking network to rank and select the features and finally used a multi kernel Support Vector Machine for multiclass classification on colon, ovarian, and lymphography cancer data.
Traditional sequential search methods such as Forward Selection, Backward Elimination, and Stepwise Regression represent classical approaches to feature selection [
13,
14]. However, these methods often struggle with high-dimensional datasets due to computational complexity and local optima issues [
15]. Derivative-free optimization methods, including Random Search [
16] and Bayesian Optimization [
17], have gained attention for their ability to handle discrete and non-convex optimization landscapes typical in feature selection problems.
The Boruta method [
18] represents a notable wrapper-based approach that utilizes Random Forest as a base classifier to identify relevant features. Boruta addresses the feature selection problem by comparing the importance of original features with randomly permuted shadow features, providing statistical significance testing for feature relevance. While Boruta has demonstrated effectiveness in identifying truly relevant features and handling feature interactions [
19,
20], the method is computationally heavy, particularly when applied to high-dimensional datasets with tens of thousands of features. The computational burden stems from the need to repeatedly train Random Forest models with augmented feature sets, including shadow features, making it less practical for large-scale genomic datasets [
21].
Other ensemble-based methods include Recursive Feature Elimination (RFE) with various base classifiers [
22], stability selection [
23], and bootstrap-based feature ranking [
24]. These methods often provide robust feature selection but at increased computational cost.
In [
25], the authors integrated Binary Particle Swarm Optimization (BPSO) and Grey Wolf Optimizer (GWO) algorithm for feature selection on the Breast Cancer Wisconsin dataset. Another approach, introducing a guided PSO approach, was presented in [
26]. In [
27], the authors used the Krill Herd (KH) optimization algorithm to address problems in feature selection methods. They incorporated adaptive genetic operators to enhance the KH algorithm.
A Genetic Algorithm-based feature selection model (GA-FS) is proposed in [
28] and was applied on a breast cancer dataset. The authors combined GA-FS with different classification models and compared the Accuracy before and after GA-FS. Authors in [
29] proposed a two-stage feature selection method to classify colon cancer. In the filtering phase, they used ReliefF for feature ranking and selected the best gene expression subset from 2000 features. Finally, they applied a Support Vector Machine classifier to classify colon cancer.
The authors applied Recursive Feature Elimination (RFE) on different classification models to compare the performance of the models with regard to Accuracy, Precision, and F-measures [
30]. The authors used five types of feature selection methods to classify gene expression datasets for ovarian, leukemia, and central nervous system (CNS) cancer in [
31], and after discovering the minimal feature sets, applied five classifiers for classifying the data. In [
32], authors proposed the Gradient Boosting Deep Feature Selection (GBDFS) algorithm to reduce the feature dimension of omics data, and thus, improved the classifier Accuracy of gastric cancer subtype classification.
James Spall introduced the Simultaneous Perturbation Stochastic Approximation (SPSA), which is a pseudo-gradient descent stochastic optimization algorithm in [
33]. Initially, Spall introduced the SPSA method into the control area to tune a large number of neurons of a neural network controller with applications in a water treatment plant. In the beginning, SPSA was used in many successful applications in control problems, such as traffic signal control [
34], robot arm control [
35], etc.
In [
36], the authors adopted Spall’s SPSA approach for the first time to perform feature selection for a Nearest Neighbor classifier with the Minkowski distance metric for Artificial Nose and Golub Gene datasets. Later in [
37], the authors introduced the concept of Binary SPSA (BSPSA). The feature selection problem is treated as a stochastic optimization problem where the features are represented as binary variables. BSPSA is used for feature selection on both small and large datasets. In [
38], the authors proposed a Simultaneous Perturbation Stochastic Approximation (SPSA) algorithm that mitigates the slow convergence issue of BSPSA in feature selection and feature ranking. The authors compared SPSA with four wrapper methods on eight datasets (the largest dataset contains 2400 features) and further applied classification on the datasets using the mean classification rates of four classifiers.
In [
39], the authors also used SPSA with Barzilai and Borwein (BB) non-monotone gains on various public datasets, with Nearest Neighbors and Naïve Bayes classifiers as wrappers. They compared the proposed method with full features against seven popular meta-heuristics-based FS algorithms. SPSA-BB converges to a good feature set in about 50 iterations on average, regardless of the number of features (the largest dataset contains 1000 features). The authors in [
40] generated subsets using Simultaneous Perturbation Stochastic Approximation (SPSA), migrating birds optimization, and Simulated Annealing algorithms. The subsets generated by the algorithms are evaluated by using correlation-based FS, and the performance of the algorithms is measured using a Decision Tree (C4.5) as the classifier. The computational experiments are conducted on the 15 datasets taken from the UCI machine learning repository. The authors concluded that the SPSA algorithm outperforms other algorithms in terms of Accuracy values, and all algorithms reduce the number of features by more than 50%.
The authors in [
41] present SPFSR, a novel stochastic approximation approach for performing simultaneous k-best feature ranking (FR) and feature selection (FS) based on Simultaneous Perturbation Stochastic Approximation (SPSA) with Barzilai and Borwein (BB) non-monotone gains. The proposed method is performed on 47 public datasets, which contain both classification and regression problems, with the mean Accuracy reported from four different classifiers and four different regressors, respectively. The authors concluded that for over 80% of classification experiments and over 85% of regression experiments, SPFSR provided a statistically significant improvement or equivalent performance compared to existing, well-known FR techniques.
As seen by the related work in the paragraphs above, the SPSA method for feature selection has traditionally been applied to smaller datasets. In this research, we investigate its effectiveness on large-scale datasets used for cancer classification. Our approach builds on prior work, particularly [
41], which employed feature ranking; however, we extended this by evaluating the impact of using varying proportions of the top-ranked features (5%, 10%, and 15%). Specifically, we apply feature selection and ranking via the SPSA method to datasets containing over 35,000 features (ranging from 35,924 to 44,894), with the goal of identifying features most relevant to cancer detection. To the best of our knowledge, this is the first study to apply the SPSA-based feature selection technique to such large cancer datasets. We conducted a comprehensive experimental evaluation and analysis, including comparisons with state-of-the-art feature selection and classification methods. Additionally, we assessed whether SPSA yields statistically significant improvements over ten benchmark methods.
3. Proposed Approach and Comparison Methods
In this section, we discuss our proposed methodology of feature selection based on the SPSA algorithm. Then, we discuss the other popular feature selection models—RelChaNet, ReliefF, Genetic Algorithm, Mutual Information, Simulated Annealing, and Minimum Redundancy Maximum Relevance—that we use to compare against our SPSA feature selection method. Further, we explain all the classification models we used in this research—Decision Tree, K-Nearest Neighbors, Light Gradient Boosting Machine, Logistic Regression, Support Vector Machine, and Extreme Gradient Boosting.
As illustrated in
Figure 1, all ten cancer datasets were first divided into training (80%) and testing (20%) subsets. Feature selection was performed only on the training data using all seven feature selection methods. From each training set, the top 5%, 10%, and 15% of features were selected, resulting in 30 reduced feature subsets across all datasets. These selected feature subsets were then applied to the corresponding test sets. Next, classification models were trained on the reduced training sets and evaluated on the held-out test sets. Performance was assessed using Accuracy, Precision, Recall, F1 Score, and Balanced Accuracy.
3.1. Proposed Methodology
3.1.1. Simultaneous Perturbation Stochastic Approximation (SPSA) Algorithm as Feature Selection (FS) Method
Spall introduced the Simultaneous Perturbation Stochastic Approximation (SPSA) [
33], which is a pseudo-gradient descent stochastic optimization algorithm. The algorithm first starts with a random solution of a vector, and it gradually moves towards the optimal solution during iterations, where the current solution is perturbed simultaneously by offsets that are random and generated from a specific probability distribution.
Let us say $L(\mathbf{w})$ is a real-valued objective function. Gradient descent search starts from an arbitrary initial solution $\hat{\mathbf{w}}_0$ and iteratively moves toward a local minimum of the objective function $L$. At each step, the gradient $\nabla L$ of the objective function is evaluated, and the algorithm updates the solution in the direction of the negative gradient $-\nabla L(\hat{\mathbf{w}}_k)$. The process continues until converging to a local minimum, where the gradient is zero. In the language of machine learning, $L$ can be called the loss function for the minimization problem. This gradient descent method cannot be applied where the loss function and the loss function's gradient are unknown. Therefore, stochastic pseudo-gradient descent algorithms such as SPSA are used, so that the gradient is approximated from noisy loss function measurements without requiring the loss function in analytical form.
At each iteration $k$, SPSA evaluates three noisy measurements of the loss function: $L(\hat{\mathbf{w}}_k + c_k \boldsymbol{\Delta}_k)$, $L(\hat{\mathbf{w}}_k - c_k \boldsymbol{\Delta}_k)$, and $L(\hat{\mathbf{w}}_{k+1})$. The first two are used for the gradient approximation, and the third is used to measure the performance of the next iterate $\hat{\mathbf{w}}_{k+1}$.
As per [
33], the functions for tuning the parameters are shown in Equations (
1) and (
2):
$$a_k = \frac{a}{(A + k + 1)^{\alpha}} \quad (1)$$
$$c_k = \frac{c}{(k + 1)^{\gamma}} \quad (2)$$
where $a$, $A$, $\alpha$, $c$, and $\gamma$ are algorithmic hyperparameters of SPSA. Here, $a$ is the initial scaling constant for the step size, $A$ is the stability constant that shifts the denominator to reduce large updates in early iterations, $\alpha$ is the decay rate of the step size, typically chosen in $(0.5, 1]$ for convergence guarantees, $c$ is the initial perturbation constant controlling the magnitude of the random offsets, and $\gamma$ is the decay factor controlling how quickly the perturbations decrease across iterations.
These parameters are dimensionless and require tuning for the problem at hand. Following the SPSA literature [
33,
42], we set the initial values via preliminary experiments, and then performed a sensitivity analysis by varying one parameter at a time within a reasonable range while keeping the others fixed. For each configuration, we ran 10 independent trials and reported the mean results.
SPSA does not have an automatic stopping rule; thus, we specify a maximum number of iterations as the stopping criterion. The gain sequences used across iterations must be monotonically decreasing and satisfy the standard SPSA convergence conditions.
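To make the update rule concrete, the following minimal sketch shows how the gain sequences of Equations (1) and (2) and the simultaneous perturbation could be combined into one SPSA iteration; the hyperparameter values, the weight clipping, and the toy loss function are illustrative assumptions rather than the exact settings of our implementation.

```python
import numpy as np

def spsa_minimize(loss, p, n_iter=100, a=0.75, A=10, alpha=0.602, c=0.05, gamma=0.101, seed=0):
    """Minimal SPSA sketch: loss is a (possibly noisy) function of a weight vector of length p."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(0.0, 1.0, size=p)            # random initial solution
    for k in range(n_iter):
        a_k = a / (k + 1 + A) ** alpha           # step-size gain sequence (Eq. (1)-style)
        c_k = c / (k + 1) ** gamma               # perturbation gain sequence (Eq. (2)-style)
        delta = rng.choice([-1.0, 1.0], size=p)  # simultaneous random perturbation of all weights
        y_plus = loss(w + c_k * delta)           # first noisy loss measurement
        y_minus = loss(w - c_k * delta)          # second noisy loss measurement
        g_hat = (y_plus - y_minus) / (2.0 * c_k * delta)  # pseudo-gradient estimate
        w = np.clip(w - a_k * g_hat, 0.0, 1.0)   # move against the gradient, keep weights in [0, 1]
    return w

# Toy usage: minimize a noisy quadratic in 5 dimensions.
target = np.array([0.2, 0.8, 0.5, 0.1, 0.9])
noisy_quadratic = lambda w: np.sum((w - target) ** 2) + np.random.normal(0, 1e-3)
print(spsa_minimize(noisy_quadratic, p=5, n_iter=500).round(2))
```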
Let us illustrate how the SPSA algorithm is used as a feature selection technique. Assume $X$ is a data matrix with dimensions $n \times p$, where $n$ represents observations and $p$ represents features. Assume $Y$ is a response vector with dimensions $n \times 1$. The pair $\{X, Y\}$ constitutes a dataset. Let $F = \{f_1, f_2, \dots, f_p\}$ denote the feature set, where the $j$-th feature in $X$ is represented by $f_j$. For a non-empty subset $F' \subseteq F$, we define $y_{true}(F')$ as the true value of the performance criterion of a wrapper classifier, denoted by $C$, on the dataset. We train the classifier $C$, albeit with $y_{true}(F')$ unknown, and compute the error rate denoted by $y(F')$.
In this study, the wrapper classifier $C$ is implemented as a linear Support Vector Machine (SVM) with class-weighted loss to handle imbalance across datasets. This choice is motivated by the small-n, large-p nature of our datasets (roughly 36,000–45,000 features versus far fewer samples), where linear models with regularization provide stable estimates and reduce the risk of overfitting. For robustness, we also verified the results using logistic regression with an elastic net penalty, which promotes sparsity while maintaining stability under correlated features.
Since the true error $y_{true}(F')$ is unknown, we instead compute the empirical error rate, denoted by $y(F')$, which can be expressed as $y(F') = y_{true}(F') + \varepsilon$, where $\varepsilon$ represents the noise arising from finite-sample estimation, variability in cross-validation splits, and stochastic elements of training. Thus, $y(F')$ serves as a noisy but unbiased estimate of $y_{true}(F')$, and SPSA leverages this noisy feedback in approximating the gradient. The feature selection problem can therefore be defined as finding the non-empty feature set that minimizes the true error, and it can be determined by Equation (3).
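As an illustration of how the noisy measurement $y(F')$ could be obtained in practice, the sketch below estimates the wrapper error of a candidate feature subset with a class-weighted linear SVM and cross-validation; the fold count and the scikit-learn settings are assumptions for illustration, not the exact configuration used in our experiments.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def empirical_error_rate(X, y, feature_subset, cv=5):
    """Noisy estimate y(F') of the wrapper error for a feature subset F' (list of column indices)."""
    clf = LinearSVC(class_weight="balanced", max_iter=5000)   # class-weighted linear SVM wrapper
    acc = cross_val_score(clf, X[:, feature_subset], y, cv=cv, scoring="accuracy")
    return 1.0 - acc.mean()                                   # error rate = 1 - mean CV accuracy

# Toy usage with synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 200))
y = (X[:, 3] + X[:, 7] > 0).astype(int)    # only features 3 and 7 are informative
print(empirical_error_rate(X, y, [3, 7]))
print(empirical_error_rate(X, y, [0, 1]))
```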
3.1.2. Barzilai–Borwein (BB) Method
In non-monotone methods, information provided by previous iterations is retained rather than requiring improvement at every step. One of the first non-monotone search methods, the Barzilai–Borwein (BB) method, is described as a gradient method with a two-point step size [43]. Motivated by Newton's method, the BB method approximates the Hessian matrix instead of computing it directly. It does not force the sequence of objective values to decrease monotonically, and hence, the BB method outperforms classical steepest-descent methods in terms of both performance and computational cost.
Considerable research has been carried out on steepest-descent-type methods such as the BB method and the Cauchy method [44], and convergence analyses have shown that the BB method converges linearly for convex quadratic objectives [45,46]. Well-studied BB variants include Cauchy BB and cyclic BB. Cauchy BB combines the BB and Cauchy methods; it performs better than the original BB and roughly halves the computational complexity, but it still includes a steepest-descent step, whereas cyclic BB requires an extra procedure to determine an appropriate cycle length. Due to these shortcomings, we use the original BB method with a smoothing effect for our SPSA feature selection (SPSA-FS) algorithm.
3.1.3. Using the BB Method in SPSA-FS
In the SPSA feature selection algorithm, we improve the speed of convergence by adopting a non-monotone BB step size strategy. Let $\hat{\mathbf{w}}_k$ denote the estimated parameter vector, that is, the feature weights at iteration $k$, and let $\hat{\mathbf{g}}_k(\hat{\mathbf{w}}_k)$ represent the estimated gradient of the objective function with respect to $\hat{\mathbf{w}}_k$. The gradient estimates here are noisy because of the stochastic nature of the optimization; thus, we apply smoothing to stabilize the updates.
The BB step size at iteration $k$ can be computed as shown in Equation (4). This approximates the inverse Hessian using differences between consecutive gradients and consecutive parameter vectors, without computing second derivatives. To reduce step-size fluctuations, we smooth the step size by averaging it over a window of the most recent iterations, as shown in Equation (5), which yields the smoothed step size at iteration $k$.
Likewise, to stabilize the gradient estimates, we average the current gradient with the previous $m$ gradients, as shown in Equation (6); the resulting smoothed gradient is used to update $\hat{\mathbf{w}}_k$.
By using the smoothed step size and gradient estimates, the SPSA algorithm achieves more stable estimates and converges faster, especially in the cases of optimizing complex or noisy functions.
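The fragment below sketches how the BB step size of Equation (4) and the window-averaged smoothing of Equations (5) and (6) might be computed from stored iterates and gradient estimates; the window length, the small stabilizing constant, and the toy histories are illustrative assumptions.

```python
import numpy as np

def bb_step_size(w_prev, w_curr, g_prev, g_curr, eps=1e-12):
    """Barzilai-Borwein step size from consecutive parameter and gradient differences (Eq. (4))."""
    s = w_curr - w_prev            # parameter difference
    d = g_curr - g_prev            # gradient difference
    return float(np.dot(s, s) / (np.dot(s, d) + eps))

def smoothed(values, window=10):
    """Average the most recent `window` values (step sizes in Eq. (5), gradients in Eq. (6))."""
    recent = values[-window:]
    return sum(recent) / len(recent)

# Toy usage: smooth a history of BB step sizes and gradient estimates.
step_history = [0.8, 1.2, 0.9, 1.1]
grad_history = [np.array([0.3, -0.2]), np.array([0.25, -0.1]), np.array([0.28, -0.15])]
print(smoothed(step_history, window=3))
print(smoothed(grad_history, window=3))
```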
As explained above, the SPSA algorithm is an iterative stochastic optimization algorithm that, regardless of the number of features, approximates the gradients of the objective function with only a few function evaluations per iteration, which gives SPSA scalability, noise tolerance, a global search tendency, and computational efficiency. SPSA works well for high-dimensional feature spaces without exponential cost and handles noisy evaluation metrics better than deterministic methods. Also, SPSA perturbs multiple features simultaneously, which helps it avoid getting trapped in local optima. Finally, SPSA requires fewer evaluations than methods that compute exact gradients or that evaluate fitness scores feature by feature. The hyperparameters used for SPSA are described in
Table 1.
3.2. Feature Selection Algorithms for Comparison
3.2.1. Neural Network Feature Selection Using Relative Change Scores (RelChaNet)
Network Pruning is a technique that identifies the less relevant features or neurons and removes them. This technique has been extensively studied since 1988 [
47]. Among the recent advances in the pruning technique, the Neural Network Feature Selection Using Relative Change Scores (RelChaNet) method builds upon these foundational concepts by measuring the induced change in network parameters to guide the feature selection [
48]. The authors introduced a lightweight feature selection algorithm that uses neuron pruning and input-layer regrowth in a dense neural network. Neurons are pruned based on a gradient-sum metric that measures the relative change induced in the network once a feature enters, while pruned input neurons are regrown randomly.
Figure 2 illustrates the relative change score calculation that is embedded in RelChaNet.
Consider a neural network whose input layer size equals the total number of features to be selected (K) plus some candidate features. The number of mini-batches each candidate is given is determined by a hyperparameter. The first-layer gradients are accumulated in a matrix S. In the next step, this sum of gradients is normalized by its norm with regard to each input neuron. This is followed by z-standardizing the resulting vector, which produces the score vector s. The candidate scores are then used to update the high scores h.
Ultimately, the K features with the highest scores remain in the network while the other input neurons are redrawn randomly. Before training, the first-layer weights are reinitialized, and the two hyperparameters used in RelChaNet adapt to the characteristics of the dataset fed to the network. RelChaNet overcomes common drawbacks by giving candidate features multiple mini-batches to demonstrate their relevance in the network and by comparing that relevance as an induced change rather than as absolute weights. The algorithm uses a feed-forward multi-layer perceptron trained with back-propagation and the Adam optimizer.
Let us consider a dataset with $N$ features, $K$ selected features, and a given number of hidden-layer neurons. The hyperparameters of the algorithm are the ratio of candidate features considered at each iteration and the total number of mini-batches. The number of candidate features is initialized with Equation (7).
The input layer size is calculated as the number of selected features $K$ plus the number of candidate features. First, we choose features randomly to populate the input layer; then, we train the neural network for the specified number of mini-batches. The first-layer gradients are aggregated by addition. These gradient sums are later normalized, resulting in a relative change score that is calculated by Equation (8).
These scores are used for the candidate features to update their high scores $h$. Features with high scores remain, and new candidate features are then drawn randomly. This cycle is repeated so that features with a high score $h$ accumulate and are compared against new features added in subsequent iterations. The hyperparameters used for RelChaNet are described in
Table 2.
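As a rough sketch of the relative change score of Equation (8), the snippet below normalizes summed first-layer gradients per input neuron and z-standardizes the result; the use of the L2 norm and the array shapes are illustrative assumptions.

```python
import numpy as np

def relative_change_scores(grad_sum):
    """grad_sum: (n_inputs, n_hidden) matrix of first-layer gradients summed over mini-batches."""
    per_input = np.linalg.norm(grad_sum, ord=2, axis=1)                  # norm per input neuron
    return (per_input - per_input.mean()) / (per_input.std() + 1e-12)    # z-standardized scores s

# Toy usage: 6 input features, 4 hidden units; higher scores suggest more induced change.
rng = np.random.default_rng(1)
S = rng.normal(size=(6, 4))
S[2] *= 5.0                       # pretend feature 2 induced much larger gradient changes
print(relative_change_scores(S).round(2))
```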
3.2.2. ReliefF
ReliefF is a popular feature selection algorithm that is widely used in many industry data applications. It is a filter-based feature selection method that selects the best features by feature weight calculations [
49]. Relief was proposed by Kira in their 1992 paper [
50], and the proposed algorithm is limited to two-class classification problems. The Relief algorithm assigns different weights to the data features based on the correlation between classes and the features. The feature whose weight is greater than the set threshold will be selected as an important feature.
The main limitation of the Relief algorithm is that it only handles two-class classification problems. In 1994, Kononenko proposed the ReliefF algorithm, which is an extension of the Relief algorithm [
51]. ReliefF can handle multiclass classification problems.
Let us assume that the class labels of a certain training dataset are $C = \{c_1, c_2, \dots, c_l\}$. A sample $R_i$ is selected randomly from this training dataset; then, ReliefF searches for its $k$ nearest neighbors from the same class, called hits $H_j$ ($j = 1, 2, \dots, k$), and its $k$ nearest neighbors from each different class $c$, called misses $M_j(c)$ ($j = 1, 2, \dots, k$). This procedure is repeated $m$ times. The weight of feature $A$ is then updated using Equation (9):
where $m$ is the total number of iterations and $\mathrm{diff}(A, R_i, R_j)$ denotes the difference between samples $R_i$ and $R_j$ in feature $A$. This difference is calculated using Equation (10).
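A simplified sketch of the weight update in Equations (9) and (10) for continuous features is given below; the range-normalized difference is assumed as the diff function, and the class-prior weighting of the misses is omitted for brevity.

```python
import numpy as np

def diff(a, x1, x2, fmin, fmax):
    """Range-normalized difference between two samples in feature a (Eq. (10) style)."""
    return abs(x1[a] - x2[a]) / (fmax[a] - fmin[a] + 1e-12)

def update_weight(w_a, a, r, hits, misses, m, k, fmin, fmax):
    """One simplified ReliefF update of the weight of feature a for a sampled instance r (Eq. (9) style)."""
    hit_term = sum(diff(a, r, h, fmin, fmax) for h in hits) / (m * k)          # same-class neighbors
    miss_term = sum(diff(a, r, q, fmin, fmax) for q in misses) / (m * k)       # other-class neighbors
    return w_a - hit_term + miss_term

# Toy usage: one sampled instance, two hits and two misses, k = 2, m = 1 iteration.
fmin, fmax = np.array([0.0, 0.0]), np.array([1.0, 1.0])
r = np.array([0.2, 0.9])
hits = [np.array([0.25, 0.85]), np.array([0.15, 0.95])]
misses = [np.array([0.8, 0.1]), np.array([0.9, 0.2])]
print(update_weight(0.0, 0, r, hits, misses, m=1, k=2, fmin=fmin, fmax=fmax))
```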
The hyperparameters used for ReliefF are described in
Table 3.
3.2.3. Genetic Algorithm Based Feature Selection Method (GA)
The Genetic Algorithm is an evolutionary algorithm that is inspired by the process of natural selection and genetics for finding optimal solutions in a vast solution space. According to the natural selection theory, the fittest individuals are selected and then used to produce offspring. The fittest parents' characteristics are passed on to the offspring through crossover and mutation for a better chance of survival. The GA contains two types of components. The first defines the meta-parameters: the fitness function, selection strategy, crossover and mutation rates, and population size. The second component is an iterative evolutionary loop that applies the first component repeatedly to improve the population [52]. In this loop, the algorithm performs the following steps: (1) evaluate the fitness of each individual in the current population; (2) select parents based on fitness values; (3) generate offspring through crossover and mutation; and (4) produce the next generation. This evolutionary loop continues until a stopping criterion, such as the maximum number of generations, is met.
Initial Population Generation
This is the first step in the GA implementation. The initial population consists of 50 chromosomes, each representing a randomly generated feature subset. For each chromosome, the genes are assigned randomly as 0 or 1, indicating exclusion or inclusion of the corresponding features. In order to avoid redundancy, duplicates in the initial population are minimized. This results in diverse candidate feature subsets for the evolutionary algorithm to explore. The length of each chromosome equals the total number of features in the dataset.
Fitness Function
The fitness function evaluates the quality of each chromosome's feature subset based on classification performance. In this step, we use the accuracy of a KNN (K-Nearest Neighbors) classifier trained and tested on the selected features. We set the number of neighbors for KNN and used Euclidean distance as the metric. Before classification, the features are normalized using z-score scaling for comparability across dimensions. The fitness score of each chromosome is calculated as the average classification accuracy from 5-fold cross-validation on the training data.
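A sketch of this fitness evaluation, assuming z-scored features and mean 5-fold cross-validated KNN accuracy; the number of neighbors is a placeholder, since the exact value is not restated here.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def fitness(chromosome, X, y, n_neighbors=5):   # n_neighbors is a placeholder value
    """Fitness of a binary chromosome = mean 5-fold CV accuracy of KNN on the selected features."""
    selected = np.flatnonzero(chromosome)
    if selected.size == 0:
        return 0.0                               # empty subsets get the worst possible fitness
    model = make_pipeline(StandardScaler(),      # z-score scaling before classification
                          KNeighborsClassifier(n_neighbors=n_neighbors, metric="euclidean"))
    return cross_val_score(model, X[:, selected], y, cv=5).mean()

# Toy usage: random data, one random chromosome over 30 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))
y = (X[:, 0] - X[:, 5] > 0).astype(int)
chromosome = rng.integers(0, 2, size=30)
print(round(fitness(chromosome, X, y), 3))
```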
Selection
This is an important step in the process in which parent chromosomes are selected from the current population based on their fitness scores using tournament selection. In the selection method, a subset of individuals is selected randomly, and the fittest individual among them is selected as a parent. We selected tournament selection as a selection criterion, which ensures that the fittest individuals have more chances to get selected while also maintaining diversity. This selected parent then proceeds to the next step.
Crossover and Mutation
In this step, offspring chromosomes that form the next generation are generated by the crossover and mutation process. Here, we implement two-point crossover, where two crossover points are randomly selected along the parent chromosomes. The segments between points are swapped to produce offspring. Here, we set the crossover rate as 0.5, which means fifty percent of parents that are selected go through the crossover process, and the rest of the parents are unchanged. During the mutation process, random alterations are introduced to offspring for genetic diversity and to explore the search space. A mutation flips the individual genes with a mutation rate of 0.01 per gene. For example, if there are 40,000 features, then due to this mutation rate, approximately 400 mutations per chromosome per generation happen, which results in balancing the exploration.
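The two-point crossover and per-gene bit-flip mutation described above can be sketched as follows; the random cut points are drawn uniformly, and the rates shown in the toy usage are exaggerated only to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(42)

def two_point_crossover(p1, p2):
    """Swap the segment between two random cut points of the parent chromosomes."""
    i, j = sorted(rng.choice(len(p1), size=2, replace=False))
    c1, c2 = p1.copy(), p2.copy()
    c1[i:j], c2[i:j] = p2[i:j], p1[i:j]
    return c1, c2

def mutate(chromosome, rate=0.01):
    """Flip each gene independently with the given mutation rate."""
    flips = rng.random(len(chromosome)) < rate
    return np.where(flips, 1 - chromosome, chromosome)

# Toy usage on short chromosomes.
p1 = np.zeros(10, dtype=int)
p2 = np.ones(10, dtype=int)
c1, c2 = two_point_crossover(p1, p2)
print(c1, mutate(c2, rate=0.3))
```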
Creating Next Generation
The next generation is created by the replacement of the whole current population with newly created offspring. This replacement will make sure that only new chromosomes will survive for the next iteration. Among all the individuals in the final generation, the most fitted chromosomes (the feature subset that yields the best classification accuracy) are selected as the optimal feature set.
Stopping Criterion
There should be a general stopping criterion for terminating the process of GA. Here, we used a fixed number of iterations as the stopping criterion, which we set to 20. Once the limit is reached, the GA execution is terminated, and the best-performing chromosome from the last generation returns as the selected feature subset.
The hyperparameters used for GA are described in
Table 4.
3.2.4. Mutual Information-Based Feature Selection (MI)
The filter-based feature selection methods rank features based on their association with the target class. In simple filter approaches, the features are individually scored, and the features with high scores are selected. In greedy methods, however, dependencies between features are considered by selecting features iteratively: at each step, the feature that provides the highest incremental contribution, given the features already selected, is added. This process continues until the desired number of features is selected. In this research, the MI feature selection uses a simple ranking approach in which features are ranked by their MI scores with the class label, without considering dependencies among features.
MI of two random variables is a quantitative measurement of the dependency between the variables [53]. It is defined via the probability density functions (PDFs) of the variables, say $X$ and $Y$, and their joint PDF, denoted $p(x)$, $p(y)$, and $p(x, y)$, respectively [54]. If the variables $X$ and $Y$ are completely independent, then the joint PDF is equal to the product of the marginal PDFs, that is, $p(x, y) = p(x)\,p(y)$, and the MI becomes zero.
Entropy is a measure of the uncertainty or randomness in a random variable. For a variable $X$, entropy is defined as $H(X) = -\sum_{x} p(x) \log p(x)$. MI is expressed in terms of entropy as $I(X; Y) = H(Y) - H(Y \mid X)$, where $H(Y \mid X)$ is the uncertainty of $Y$ when $X$ is known. If $Y$ and $X$ are independent, then $H(Y \mid X) = H(Y)$ and $I(X; Y) = 0$.
In feature selection problems, the features $X$ are usually continuous and the class label $Y$ is discrete, as is the case in our datasets. We therefore estimate the MI between continuous features and discrete labels, which requires computing the conditional PDF of a feature given each class. This can be carried out using techniques such as kernel density estimation or histogram binning [55]. The MI between $X$ and $Y$ with possible class values $y$ is calculated by Equation (15):
where $P(y)$ is the prior probability of class $y$, $p(x \mid y)$ is the conditional PDF of $X$ given $Y = y$, and $p(x)$ is the marginal PDF of $X$. Using this estimation, features are ranked based on their MI scores with the class label. The hyperparameters used for MI are described in
Table 5.
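As a sketch of this simple MI ranking, one can score each feature against the class label and keep the top fraction; scikit-learn's mutual_info_classif (a nearest-neighbor-based MI estimator) is used here as an illustrative stand-in for the density-estimation step described above.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def top_fraction_by_mi(X, y, fraction=0.05, random_state=0):
    """Rank features by MI with the class label and return indices of the top fraction."""
    scores = mutual_info_classif(X, y, random_state=random_state)
    n_keep = max(1, int(fraction * X.shape[1]))
    return np.argsort(scores)[::-1][:n_keep]          # indices of the highest-scoring features

# Toy usage: only features 0 and 3 carry class information.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
y = (X[:, 0] + X[:, 3] > 0).astype(int)
print(top_fraction_by_mi(X, y, fraction=0.1))
```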
3.2.5. Simulated Annealing (SA)
Simulated Annealing (SA) is a stochastic technique that is inspired by statistical mechanics. SA is used for finding globally optimal solutions to large optimization problems. The SA algorithm works with the assumption that some parts of the current solution belong to a potentially better one, and thus, these parts should be retained by exploring the current solution's neighbors. With the assumption of minimizing the objective function, SA jumps from hill to hill, and thus escapes or avoids sub-optimal solutions. When a system, say $S$, contains a set of possible states in thermal equilibrium at a temperature $T$, the probability that it is in a certain state $s$ is denoted $P(s)$; $P(s)$ depends on $T$ and on the energy $E(s)$ of state $s$. That probability follows the Boltzmann distribution, $P(s) = e^{-E(s)/(kT)}/Z$, where $k$ is the Boltzmann constant and $Z$ acts as a normalization factor obtained by summing $e^{-E(s')/(kT)}$ over all states $s'$.
Consider $s$ the current state as described above, and $s'$ a neighboring state. The probability of the transition from $s$ to $s'$ depends on the energy difference $\Delta E = E(s') - E(s)$ and the current temperature $T$.
If $\Delta E \le 0$, then the move is accepted, and if $\Delta E > 0$, then the move is accepted with probability $e^{-\Delta E/(kT)}$. This acceptance probability depends on the current temperature $T$ and decreases as $T$ does. Toward the end of the process, $T$ becomes low enough that transitions are unlikely; this state is called the freezing point, and the system is considered frozen. To maximize the probability of finding the minimal-energy state, thermal equilibrium should be reached before freezing. In order to reach equilibrium and escape becoming stuck at a local minimum, the annealing is scheduled. Hence, in the SA algorithm, $T$ is initially set to a high value to approximate thermal equilibrium; then, small decrements of $T$ are performed, and the process is iterated until the system is considered frozen. Reaching a near-optimal solution depends on how well the cooling schedule is designed, but the process is inherently slow because of the thermal equilibrium requirements at every temperature $T$.
To perform SA, four components are needed: the configuration, the move set, the objective function, and the cooling schedule. In the configuration step, the model represents all the possible solutions that the system can take, which are then used to find a near-optimal solution. Move sets are the computations needed to move from one state to another as part of the annealing process. The objective function measures how good and optimal a given current state is. The cooling schedule anneals the problem from a random solution to a good solution; this component schedules when to reduce the current temperature and when to stop the annealing process.
At the beginning of the SA process, an initial solution is selected randomly and is assumed to be an optimal solution. If
T does not satisfy the termination condition, then the neighboring solution is selected and the cost is calculated for that solution. If the cost of the newly selected neighbor solution is less than or equal to the current optimal solution, then the current optimal solution is replaced by the newly selected neighbor solution [
56].
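A minimal sketch of the SA loop with the Metropolis acceptance rule and a geometric cooling schedule follows; the initial temperature, cooling factor, and toy energy function are illustrative assumptions, and in feature selection the energy would typically be the wrapper error of the current subset.

```python
import math
import random

def accept(delta_e, temperature):
    """Metropolis acceptance: always accept improvements, otherwise accept with exp(-dE/T)."""
    return delta_e <= 0 or random.random() < math.exp(-delta_e / temperature)

def simulated_annealing(energy, initial_state, neighbor, t0=1.0, cooling=0.95, n_iter=200):
    state, best = initial_state, initial_state
    t = t0
    for _ in range(n_iter):
        candidate = neighbor(state)                        # move set: perturb the current state
        if accept(energy(candidate) - energy(state), t):
            state = candidate
        if energy(state) < energy(best):
            best = state
        t *= cooling                                       # cooling schedule: lower T each step
    return best

# Toy usage: minimize a 1-D "energy" over integers.
energy = lambda s: (s - 7) ** 2
neighbor = lambda s: s + random.choice([-1, 1])
random.seed(0)
print(simulated_annealing(energy, initial_state=0, neighbor=neighbor))
```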
The hyperparameters used for SA are described in
Table 6.
3.2.6. Minimum Redundancy Maximum Relevance (MRMR)
The Minimum Redundancy Maximum Relevance (MRMR) method was first introduced by Ding and Peng in [
57] to address redundancy problems with high dimensional and high-throughput datasets related to cancer. The MRMR method helps in identifying the features that are most relevant to the class labels and less redundant with respect to each other, and thus, results in improving the classification performance.
The MRMR algorithm works in a filter-based framework using Mutual Information (MI) to evaluate two criteria. The first criterion is relevance, quantified as the Mutual Information $I(f; c)$ between a candidate feature $f$ and the target class label $c$; features with high MI with the class label are preferred. The second criterion is redundancy, defined as the average MI $I(f; f_s)$ between the candidate feature and each feature $f_s$ that has already been selected; features with the minimum average MI are preferred in order to avoid highly correlated (redundant) features [55].
At each step, a new feature is selected from the unselected features by maximizing the following condition:
where $S$ is the set of currently selected features, of size $|S|$. The MRMR optimization is formulated as:
The greedy incremental process continues until a predefined number of features are selected. The greedy incremental algorithm follows the steps below:
Initialize S as empty;
At each step, evaluate the remaining candidate features using the above score condition;
Add the feature that maximizes the score;
Repeat the process until the desired number of features is selected.
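A sketch of this greedy loop, using the difference form of the relevance-minus-redundancy score and scikit-learn MI estimators for both terms; the number of selected features and the estimator settings are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X, y, n_features=10, random_state=0):
    """Greedy MRMR: at each step add the feature with the best relevance-minus-redundancy score."""
    relevance = mutual_info_classif(X, y, random_state=random_state)   # I(f; c) for every feature
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features and remaining:
        best_f, best_score = None, -np.inf
        for f in remaining:
            if selected:
                redundancy = np.mean([
                    mutual_info_regression(X[:, [s]], X[:, f], random_state=random_state)[0]
                    for s in selected])                                 # average I(f; f_s)
            else:
                redundancy = 0.0
            score = relevance[f] - redundancy
            if score > best_score:
                best_f, best_score = f, score
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

# Toy usage on a small synthetic matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 15))
y = (X[:, 2] + X[:, 9] > 0).astype(int)
print(mrmr_select(X, y, n_features=3))
```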
The hyperparameters used for MRMR are described in
Table 7.
3.3. Classification Models
After generating the feature subsets from seven feature selection methods defined in the above section, we pass those feature subsets to different classification models in order to see how the features generated from different feature selection algorithms affect the classification model performance. The classifier’s objective is to learn how to classify the objects by analyzing the dataset, where we know which classes the instance belongs to. We used six different classification models in this research, and we briefly describe each of the models in the subsections below.
3.3.1. Decision Tree (DT)
Instances are represented as attribute-value vectors, and the input data fed to the classifier consists of such vectors, each belonging to a class. The output typically consists of a mapping from attribute values to classes, and with learning, the model is able to classify both known and unseen instances [
58].
The Decision Tree model represents such mappings as a tree containing decision nodes, each associated with an attribute and linked to multiple sub-trees, and leaf nodes labeled with classes. A decision node computes an outcome based on an attribute value, and each outcome is associated with one sub-tree. In a DT, classification of an instance starts at the root node; the outcome at that node selects a sub-tree, and this process continues until the class of the instance is determined at a leaf. The depth of the Decision Tree depends on how many levels of sub-trees it contains, which determines the number of conditions used in the decision rules and is not a fixed number [
59]. The hyperparameters used for DT are described in
Table 8.
3.3.2. K-Nearest Neighbors (KNN)
KNN is one of the popular machine learning classification methods, which works on the principle of classifying unlabeled data based on its Nearest Neighbors. The concept of the KNN method was first proposed by Fix and Hodges in 1951 [
60], and later developed by Cover and Hart in 1967 [
61]. KNN is also used for prediction problems in which the label of the sample is predicted as the one with the majority label among its Nearest Neighbors.
KNN classifies the objects according to the distance between two samples. In general, the Euclidean distance formula is used to calculate the distance between two training or testing objects [
62]. The formula is given in Equation (
21).
The hyperparameters used for KNN are described in
Table 9.
3.3.3. Extreme Gradient Boosting (XGB)
The XGBoost algorithm, an ensemble learning method introduced by Chen and Guestrin [
63] in 2016, improved Gradient Boosting by optimizing computational efficiency and scalability. XGB has been implemented in several programming languages and software libraries since its introduction, which makes it accessible for both regression and classification tasks.
This method is a hybrid model of multiple base learners. It explores different base learners and picks a learning function that reduces the loss. The idea of ’ensembling’ the additive models is to train the predictors sequentially, and correct the predecessor by fitting the new predictor to the residual errors made by the previous predictor. In each step, the model optimizes the parameters. The inference and training of this learning method can be expressed by Equations (
22) and (
23):
where $F$ and $f$ denote the model set and its parameter set, $L$ is the model training loss function, and $P(y \mid x, \hat{\theta})$ is the predicted conditional probability of the output $y$ given the input $x$ and the optimized parameters $\hat{\theta}$. The hyperparameters used for XGB are described in
Table 10.
3.3.4. Logistic Regression (LR)
LR is used frequently for binary and linear classification tasks [
64]. LR is used for estimating the probabilities of classes because it models the associations with the logistic data distribution. LR performs well with linearly separable classes, and this method is best used to identify the class decision boundaries [
This method focuses on the relationship between independent variables ($X_1$, $X_2$, …, $X_n$) and a dependent variable $Y$. The logistic function, which is a sigmoid function, is used for the logistic model calculation; it maps any input value between negative infinity and positive infinity to an output value between 0 and 1 [66].
Logistic regression is able to interpret the vector of variables in the data and to evaluate a coefficient or weight for each of those input variables, and in turn it is able to predict the class that the vector belongs to. LR is suited to datasets in which the independent variables are known and used to explain the results. The hyperparameters used for LR are described in
Table 11.
3.3.5. Support Vector Machine (SVM)
SVM determines an optimal hyperplane that maximizes the margin between the hyperplane and the data points closest to it. The data points closest to the optimal hyperplane are called support vectors. Identifying an optimal hyperplane using SVM is illustrated in
Figure 3.
The SVM hyperplane is the set of points $\mathbf{x}$ satisfying Equation (24):
where $\mathbf{w}$ is the weight vector orthogonal to the hyperplane and $b$ is an offset from the origin. In the case of linear SVMs, the plane $\mathbf{w} \cdot \mathbf{x} + b = +1$ passes through the closest points of one class, and the plane $\mathbf{w} \cdot \mathbf{x} + b = -1$ passes through the closest points of the other class [68].
The distance between the two support-vector planes is defined by Equation (25):
where $\mathbf{w}$ is the optimal weight vector orthogonal to the separating hyperplane, which is obtained through SVM optimization. The margin $d$ should be increased for better separation, and proportionally $\|\mathbf{w}\|$ should be reduced, using the Lagrange function in Equation (26):
where $i = 1, 2, \dots, n$, $y_i \in \{+1, -1\}$ represents the class labels, and $\alpha_i$ is the Lagrange multiplier. The Lagrange function should be reduced for optimal $\mathbf{w}$ and $b$ computation. The hyperparameters used for SVM are described in
Table 12.
3.3.6. Light Gradient Boosting Machine (LGBM)
LGBM is a variant method based on the Gradient Boosting Decision Tree (GBDT) and has better optimization than the XGB method. GBDT is an ensemble algorithm that combines multiple Decision Trees as base learners [
69]. Each newly added tree pays increased attention to the samples that are misclassified by the previous trees. Through repetitive training of new Decision Trees and increasing their weights, GBDT gradually reduces model error and improves classification accuracy [
70].
LGBM uses the GBDT concept in its method; its core idea involves sorting and bucketing attribute values into histograms so that candidate split points can be evaluated efficiently. In training, LGBM grows trees leaf-wise, selecting the leaf whose split most reduces the loss function. LGBM also introduces Gradient-Based One-Side Sampling (GOSS) to improve the effectiveness of model training. GOSS concentrates on the samples that have larger gradients, downsamples the samples with small gradients, and amplifies the weight of the retained small-gradient data. This process allows for effective utilization of large-gradient samples while retaining some information from the small-gradient samples that would otherwise be disregarded [
71].
The hyperparameters used for LGBM are described in
Table 13.
3.4. Description of Datasets
The datasets used in this study were obtained from the Cancer Genome Atlas (TCGA) repository. TCGA is a cancer genomics program that characterizes 20,000 primary cancer and matched normal samples of 33 cancer types in total. This genomics program started in 2006 as a joint effort of both the National Cancer Institute and the National Human Genome Research Institute in the United States of America, and brought researchers from several institutions together. Due to the efforts of this program, TCGA was able to create 2.5 petabytes of data of transcriptomics, epigenomics, genomics, and proteomics data [
72].
From this repository, we collected a total of 10 types of cancer genomic datasets to use in this paper: the Colon and Rectal Adenocarcinomas (COAD), Head and Neck Squamous Cell Carcinoma (HNSC), Kidney Chromophobe (KICH), Kidney Renal Papillary (KIRP), Liver Hepatocellular Carcinoma (LIHC), Lung Squamous Cell (LUSC), Prostate Adenocarcinoma (PRAD), Stomach Adenocarcinoma (STAD), Thyroid Cancer (THCA), and Uterine Corpus Endometrioid Carcinoma (UCEC). All datasets are high-dimensional, with between 35,924 and 44,878 features. We used the datasets without applying additional preprocessing or normalization steps. This decision was made to ensure that all feature selection methods were evaluated on identical input data, thereby isolating the effect of feature selection from any influence of preprocessing techniques. While data normalization and transformation are often applied in research studies, our focus was on the comparative evaluation of feature selection algorithms under a consistent setting.
The summary of each dataset is shown in
Table 14.
3.5. Experiment Setup
We fed all ten cancer datasets as input to the seven feature selection methods considered in this research—SPSA (our proposed model), RelChaNet, ReliefF, GA, MI, SA, and MRMR. We selected the top 5%, 10%, and 15% of features from each of the ten datasets. We then passed all 30 dimensionally reduced datasets (the top 5%, 10%, and 15% feature subsets of each of the 10 cancer datasets) as input to six classification models and calculated the performance metrics. We divided each dataset into training and testing sets with a split of 80% and 20%, respectively.
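A condensed sketch of this pipeline is shown below: an 80/20 split, selection of a top percentage of features on the training portion only, and evaluation of a classifier on the held-out test set; the MI ranking and logistic regression classifier are placeholders standing in for any of the feature selection methods and classifiers compared in this study.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score

def run_pipeline(X, y, top_fraction=0.05, seed=0):
    # 80/20 train-test split; feature selection uses the training data only.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed, stratify=y)
    scores = mutual_info_classif(X_tr, y_tr, random_state=seed)        # placeholder ranking method
    keep = np.argsort(scores)[::-1][:max(1, int(top_fraction * X.shape[1]))]
    clf = LogisticRegression(max_iter=1000).fit(X_tr[:, keep], y_tr)   # placeholder classifier
    pred = clf.predict(X_te[:, keep])
    return accuracy_score(y_te, pred), balanced_accuracy_score(y_te, pred)

# Toy usage with synthetic data standing in for a gene-expression matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 500))
y = (X[:, 10] - X[:, 20] > 0).astype(int)
print(run_pipeline(X, y, top_fraction=0.05))
```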
3.6. Evaluation Metrics
To compare the feature subsets produced by the different feature selection algorithms and how they perform with the classification models, we considered performance metrics such as Accuracy, Precision, Balanced Accuracy, Recall, and F1 Scores [
73].
Accuracy is the ratio of all classifications that are correct, whether they are negative or positive. It is calculated using Equation (
27).
Recall is also known as the true positive rate, which calculates all positives that are classified as positives. It is calculated using Equation (
28).
Precision is the proportion of all positive classifications that are actually positive. It is calculated using Equation (
29).
F1 Score is a harmonic mean of both Precision and Recall, which balances both. This metric is preferred over Accuracy for class-imbalanced datasets. It is calculated using Equation (
30).
Sensitivity is the same as Recall explained above, and Specificity is used to measure the proportion of True Negatives over the Total Negatives. It is calculated using Equation (
31).
Balanced Accuracy is the arithmetic mean of sensitivity and specificity. It is used in cases of imbalanced data. It is calculated using Equation (
32).
In all the above evaluation metric equations, $TP$ is True Positive, $TN$ is True Negative, $FP$ is False Positive, and $FN$ is False Negative.
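For completeness, the metrics above can be derived from the confusion counts of a test set; the short example below computes them for a binary toy case using scikit-learn's confusion matrix.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)            # Eq. (27)
recall = tp / (tp + fn)                               # Eq. (28), a.k.a. sensitivity
precision = tp / (tp + fp)                            # Eq. (29)
f1 = 2 * precision * recall / (precision + recall)    # Eq. (30)
specificity = tn / (tn + fp)                          # Eq. (31)
balanced_accuracy = (recall + specificity) / 2        # Eq. (32)
print(accuracy, recall, precision, round(f1, 3), specificity, balanced_accuracy)
```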
3.7. Computational Resource Consumption Measurement
We used the Python (version 3.12.11) programming language to re-implement all the feature selection methods and classification models. The experiments were run on the Center for Computationally Assisted Science and Technology (CCAST), an advanced computing infrastructure at North Dakota State University, using a JupyterLab setup with a 16-core CPU, 128 GB of memory, and a GPU allocation.
Table 15 provides the average execution runtime (in seconds) of the feature selection algorithms across all ten datasets, as well as the average execution runtime of the classifiers combined with the feature selection algorithms across the same datasets.
4. Results
We applied the seven feature selection methods on the cancer datasets and extracted the top attributes that help to improve the performance of the classification models. According to the tables in
Appendix A, our model SPSA achieved generally higher Balanced Accuracy than ReliefF, RelChaNet, GA, and MI across the top 5%, 10%, and 15% feature subsets.
For the DT classification, SPSA achieved a 100% Accuracy for COAD’s and KICH’s top 5, 10, 15 percent feature sets as shown in
Table A1,
Table A2, and
Table A3, respectively. SPSA often achieves near-perfect or perfect results across the datasets and different feature selection percentages, suggesting that SPSA is effective and robust in selecting features across different datasets. As for RelChaNet, its performance is generally good but varies more significantly between datasets. For example, Accuracy and F1 Score drop notably for certain datasets (e.g., THCA and UCEC). Regarding ReliefF, it shows competitive results, generally close to SPSA’s performance, although there is some variation, such as slightly lower F1 Scores and Recall for certain datasets. GA scored average to very well in performance metrics with the small feature set (5%) with all the datasets, but it did not do better with the 10% and 15% feature sets. With the MI, the feature selection performance is good with the 10% feature set among most of the datasets, but with smaller and larger feature sets, it did not show good performance compared to other feature selection methods, and in some cases, it performed worse on most of the datasets. SA shows gradual improvement with higher feature subsets (10% and 15%). This feature selection method is slightly less stable in performance compared to SPSA or MI feature selection methods. MRMR is more consistent and has better values for Accuracy, Precision, F1 Score, and Balanced Accuracy, but lower and less consistent Recall scores for the 10% feature subset, and average values of Recall for the 15% feature subset. In particular, SPSA consistently ranks among the top methods for most datasets and subsets, with its strongest performance on COAD, KICH, LIHC, STAD, and THCA, while in a few cases (e.g., PRAD and UCEC), MRMR or SA slightly surpass its results. This shows that SPSA provides stable and reliable feature selection across subsets, often matching or outperforming traditional methods.
For the KNN classification, SPSA achieved 100% Accuracy for COAD’s and KICH’s top 5 and 10 percent feature sets as shown in
Table A4,
Table A5, and
Table A6, respectively. ReliefF also achieved 100% Accuracy for the same feature sets on COAD. SPSA shows consistently good values across different datasets, maintaining high classification performance even with a reduced number of features, whereas ReliefF shows slightly less consistent results, but still offers competitive performance and high Accuracy for many datasets. RelChaNet, on the other hand, tends to underperform, particularly with fewer features, and appears to be less robust across datasets compared to the other two methods. The GA feature selection method did not work well with KNN and scored low on all the datasets except for COAD, where it performed best. MI worked well with most of the small feature sets on all the datasets, but it performed occasionally average and, most of the time, worse than the 15% feature set. The results for SA indicate that the feature subsets that are generated have lower Accuracy and Precision compared to other methods in most cases, except for very few datasets where the Precision is high. The Balanced Accuracy is lowest across all feature subset sizes and the datasets. The MRMR feature subset performance is more variable. It shows high Accuracy and Recall for the KIRP and PRAD datasets, and for other datasets, it performed poorly for Balanced Accuracy and Recall. The Balanced Accuracy is frequently lower than SPSA and ReliefF. Overall, SPSA shows strong and consistent performance across datasets, especially at 5% features, where it often achieves near-perfect Accuracy and Recall compared to other methods. At 10% and 15%, it remains competitive, though in some datasets, MRMR or GA slightly surpass it, indicating SPSA’s advantage is most pronounced with smaller feature subsets.
For the LGBM classification, SPSA, RelChaNet, and ReliefF feature selection methods achieved 100% Accuracy for all COAD’s feature sets as shown in
Table A7,
Table A8, and
Table A9, respectively. SPSA tends to be the most stable and highest-performing method across the different datasets and feature levels. RelChaNet occasionally showed variability and lower performance, suggesting potential dataset-specific challenges. ReliefF was comparable to SPSA in many cases, but occasionally showed variability, particularly for smaller feature sets. GA did not perform well on all datasets except for the COAD and KICH datasets, where it performed on par with SPSA on the 10% feature set. MI also worked well with COAD, but it performed poorly with the other datasets compared to the other feature selection methods. The SA feature subsets performed poorly for some datasets, like COAD and LUSC, at 15%, but were excellent in a few cases. The Balanced Accuracy is lowest among all feature selection methods, especially for the 15% feature subsets across the ten datasets. The MRMR feature subsets generated mixed results, where the Accuracy and Balanced Accuracy were lowest, especially for the 15% subset. Some datasets show decent Recall but poor F1 and Balanced Accuracy. Overall, SPSA demonstrates consistently strong performance across 5%, 10%, and 15% feature subsets, often matching or surpassing other methods in Accuracy and F1 while maintaining higher Balanced Accuracy. Unlike methods such as SA or MRMR that show variability, SPSA remains stable and reliable across all datasets.
For the LR classification, SPSA achieved 100% Accuracy for all KICH’s top 5 and 10 percent feature sets, as shown in
Table A10,
Table A11, and
Table A12, respectively. SPSA emerges as the most robust method, maintaining high performance with fewer features. RelChaNet offers a good balance, with strengths at mid-level feature percentages, but some sensitivity to feature set size. ReliefF shows potential in specific datasets but lacks the broad consistency demonstrated by the other two methods. GA is effective on the LIHC dataset and showed good results with 10% and 15% feature sets for the majority of the datasets. MI also performed and achieved good scores on the KICH dataset; however, the performance was worse with most of the 10% and 15% feature sets. The SA feature subsets produced high Accuracy, Precision, Recall, and F1 Score results across most datasets. It frequently achieved perfect scores for the UCEC, KIRP, and HNSC datasets and has a strong Balanced Accuracy across all datasets. The MRMR feature subsets have very strong performance with perfect or near-perfect Precision, Recall, and F1 Scores. This feature selection method generated excellent results for the THCA, UCEC, HNSC, and PRAD datasets for all feature subsets. Overall, SPSA shows stable and competitive performance across all subsets. At 5%, it secures strong results—often surpassing RelChaNet, ReliefF, and MI—while remaining close to GA. At 10%, it maintains reliable Accuracy and Recall across datasets, outperforming weaker methods though occasionally behind SA and MRMR. By 15%, SPSA continues to deliver high scores, particularly in THCA and UCEC, confirming its robustness and consistent competitiveness with other leading methods.
For SVM classification, SPSA achieved 100% Accuracy only for COAD's top 5 percent feature set but outperformed ReliefF and RelChaNet on all the other datasets' feature sets, as shown in
Table A13,
Table A14, and
Table A15, respectively. SPSA often leads in Precision and Recall across most datasets, showcasing its ability to identify high-importance features that strongly influence classification. RelChaNet provides more stable but generally moderate performance, sometimes closing in on SPSA's results but also varying as the feature set changes. ReliefF's performance suggests it is less effective at identifying critical features, leading to consistently lower results. GA did not perform well on most of the datasets, but it scored well on the PRAD dataset and on the 15% feature set of the STAD dataset. MI performed well with the 15% feature set on most datasets but produced low Accuracy for the 5% and 10% feature sets on all datasets. The SA feature subsets are among the best performers, especially for the 15% subset, with an Accuracy above 0.99 on the KIRP, PRAD, STAD, and UCEC datasets; the Precision, F1 Score, and Recall are high on many datasets, but the Balanced Accuracy is slightly behind the SPSA and MRMR feature selection methods. The MRMR feature subsets have variable Accuracy and F1 Scores and are less consistent across the datasets; the Recall drops significantly, and the Balanced Accuracy is mixed, good on some datasets but poor on others. Overall, SPSA shows strong and consistent performance across the 5%, 10%, and 15% feature subsets. At 5%, it achieves near-perfect results on several datasets, clearly outperforming ReliefF, RelChaNet, and GA. With 10% features, SPSA maintains robust Accuracy and balanced results, especially on LIHC, THCA, and UCEC. At 15%, it remains competitive, particularly on THCA and UCEC, though SA and MRMR occasionally surpass it.
For the XGB classification, SPSA achieved 100% Accuracy for COAD's and KICH's top 5, 10, and 15 percent feature sets, and for THCA's top 5 percent feature set, as shown in
Table A16,
Table A17, and
Table A18, respectively. SPSA generally achieves very high Accuracy and consistent metrics across the different datasets. RelChaNet displays lower Accuracy and performance metrics than SPSA and ReliefF in most cases. ReliefF generally performs on par with SPSA, often showing similarly high Accuracy and metrics across most datasets and feature percentages. GA did not perform well across most of the datasets, but performed well in Accuracy and Precision for the UCEC and COAD datasets. MI performed subpar across all datasets except KIRP, where it achieved a score as good as SPSA's. The SA feature subsets occasionally performed decently and achieved perfect values for the 15% feature subset on the LUSC dataset, but they are highly inconsistent, often with poor Balanced Accuracy and Recall, and struggle with most of the datasets, which we interpret as a lack of robustness across the datasets. The MRMR feature subsets performed well for the PRAD dataset at 5% and for STAD at 10%, but they have poor Balanced Accuracy and Precision, especially for the KICH and KIRP datasets. Overall, SPSA consistently delivered strong and stable results across the 5%, 10%, and 15% subsets, often matching or outperforming ReliefF, GA, and MI, with perfect scores on datasets such as COAD, KICH, and THCA. While MRMR occasionally surpassed SPSA at larger subset sizes (e.g., PRAD, STAD), SPSA generally proved more reliable and robust than SA and RelChaNet, maintaining high Accuracy and balanced performance across datasets.
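To make the evaluation procedure concrete, the following sketch illustrates how a classifier can be scored on a reduced feature subset using the five metrics reported in this study (Accuracy, Balanced Accuracy, Precision, Recall, and F1 Score). It is a minimal illustration only: the synthetic data, the selected feature indices, and the GradientBoostingClassifier stand-in for XGB are assumptions, not the study's actual pipeline.

```python
# Minimal sketch (assumptions: synthetic data, arbitrary feature indices, sklearn stand-in for XGB).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for the XGB classifier
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             precision_score, recall_score, f1_score)

# Synthetic high-dimensional data as a placeholder for a gene expression matrix.
X, y = make_classification(n_samples=300, n_features=2000, n_informative=50, random_state=0)

# Hypothetical indices of the top 5% features returned by a feature selection method.
selected = np.arange(100)
X_reduced = X[:, selected]

X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, test_size=0.3, stratify=y, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Balanced Accuracy:", balanced_accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
```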
A colored heat map representation of the Balanced Accuracy scores of each classification model used in this research across all ten cancer datasets, for the 5% feature set, is shown in
Figure 4. Note that we opted to display only the 5% feature set results, as these produced the best values.
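As an illustration of how such a heat map can be produced, the sketch below uses pandas and seaborn with placeholder Balanced Accuracy values; the tooling and the layout (datasets as rows, classification models as columns) are assumptions rather than the authors' implementation.

```python
# Minimal sketch (assumed tooling and layout; placeholder values, not the reported results).
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

datasets = ["COAD", "HNSC", "KICH", "KIRP", "LIHC", "LUSC", "PRAD", "STAD", "THCA", "UCEC"]
models = ["DT", "KNN", "LGBM", "LR", "SVM", "XGB"]

rng = np.random.default_rng(2)
# Placeholder Balanced Accuracy values for the 5% feature set.
scores = pd.DataFrame(rng.uniform(0.85, 1.0, size=(10, 6)), index=datasets, columns=models)

sns.heatmap(scores, annot=True, fmt=".3f", cmap="viridis", vmin=0.8, vmax=1.0)
plt.title("Balanced Accuracy, 5% feature set (placeholder values)")
plt.tight_layout()
plt.show()
```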
4.1. Statistical Analysis
Next, we study the effect of the features selected by the seven feature selection methods on Accuracy across the ten cancer datasets. We used the Friedman test, a non-parametric statistical test, to determine whether there is a statistically significant difference between paired treatments arranged in a randomized repeated-measures design.
We perform this statistical test only on the top 5% feature selection results. The Friedman test for this research uses the following null and alternative hypotheses:
The null hypothesis (H0): The seven feature selection methods used in this research have an equal effect on Accuracy among the ten cancer datasets.
The alternative hypothesis (H1): At least one feature selection method used in this research has a different effect from the others based on Accuracy among the ten cancer datasets.
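As a concrete illustration of the testing procedure used in the following subsections, the sketch below runs a Friedman test and, when it is significant, a Nemenyi post hoc comparison in Python using scipy and the scikit-posthocs package; the libraries, method list, and placeholder score matrix are assumptions, not the study's actual code.

```python
# Minimal sketch of the Friedman + Nemenyi workflow (assumed tooling; placeholder scores).
import numpy as np
import pandas as pd
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

methods = ["SPSA", "RelChaNet", "ReliefF", "GA", "MI", "SA", "MRMR"]

# Placeholder 10 x 7 matrix: rows are the cancer datasets (blocks), columns are the
# feature selection methods (treatments); in the study these are top-5% Accuracy scores.
rng = np.random.default_rng(0)
scores = pd.DataFrame(rng.uniform(0.85, 1.0, size=(10, 7)), columns=methods)

# Friedman test across the paired treatments.
stat, p_value = friedmanchisquare(*[scores[m] for m in methods])
print(f"Friedman statistic = {stat:.2f}, p-value = {p_value:.6f}")

if p_value < 0.05:
    # Nemenyi post hoc test returns a matrix of pairwise p-values.
    nemenyi = sp.posthoc_nemenyi_friedman(scores.values)
    nemenyi.index = nemenyi.columns = methods
    # Report the pairs that differ significantly at alpha = 0.05.
    for i, a in enumerate(methods):
        for b in methods[i + 1:]:
            if nemenyi.loc[a, b] < 0.05:
                print(f"{a} vs. {b}: p = {nemenyi.loc[a, b]:.4f}")
```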
4.1.1. DT
First, we calculated the summary statistics of the Balanced Accuracy scores of the DT classifier on all ten datasets after applying the feature selection algorithms. We visualized this summary with a violin plot in
Figure 5.
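A minimal sketch of how such summary statistics and a violin plot can be produced is shown below, assuming the per-dataset Balanced Accuracy scores are collected in a pandas DataFrame; the tooling and placeholder values are assumptions, not the study's actual code.

```python
# Minimal sketch (assumed tooling; placeholder values, not the reported scores).
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

methods = ["SPSA", "RelChaNet", "ReliefF", "GA", "MI", "SA", "MRMR"]
rng = np.random.default_rng(1)
# Placeholder 10 datasets x 7 methods matrix of Balanced Accuracy scores for one classifier.
scores = pd.DataFrame(rng.uniform(0.8, 1.0, size=(10, 7)), columns=methods)

print(scores.describe())  # summary statistics per feature selection method

long_form = scores.melt(var_name="Feature selection method", value_name="Balanced Accuracy")
sns.violinplot(data=long_form, x="Feature selection method", y="Balanced Accuracy")
plt.tight_layout()
plt.show()
```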
Later, we applied the Friedman test and obtained a test statistic of 32.61 and a p-value of 0.00001. Since the p-value is less than 0.05, the result is statistically significant and the null hypothesis should be rejected. Therefore, we have sufficient evidence to conclude that the type of feature selection method used leads to statistically significant differences in Accuracy scores among the ten cancer datasets.
Next, we performed the Nemenyi post hoc test to identify which feature selection methods have different effects on Accuracy. The Nemenyi post hoc test returns the following
p-values for each pairwise comparison of means as shown in
Table 16.
At α = 0.05, the pairs below have statistically significant differences in the Accuracy scores among the ten cancer datasets.
SPSA vs. GA;
SPSA vs. MI;
SPSA vs. SA;
ReliefF vs. SA.
4.1.2. KNN
The summary statistics of the Balanced Accuracy scores of the KNN classifier on all ten datasets after applying the feature selection algorithms are visualized as a violin plot in
Figure 6.
For the Friedman test, we obtained a test statistic of 34.37 and a p-value of 0.000006. Since the p-value is less than 0.05, the result is statistically significant and the null hypothesis should be rejected. Therefore, we have sufficient evidence to conclude that the type of feature selection method used leads to statistically significant differences in Accuracy scores among the ten cancer datasets.
Next, the Nemenyi post hoc test returns the following
p-values for each pairwise comparison of means as shown in
Table 17.
At α = 0.05, the pairs below have statistically significant differences in the Accuracy scores among the ten cancer datasets.
SPSA vs. RelChaNet;
SPSA vs. GA;
SPSA vs. MI;
SPSA vs. SA;
SPSA vs. MRMR;
ReliefF vs. MI.
4.1.3. LGBM
The summary statistics of the Balanced Accuracy scores of the LGBM classifier on all ten datasets after applying the feature selection algorithms are visualized as a violin plot in
Figure 7.
For the Friedman test, we obtained a test statistic of 47.06 and a p-value of 0.00000001. Since the p-value is less than 0.05, the result is statistically significant and the null hypothesis should be rejected. Therefore, we have sufficient evidence to conclude that the type of feature selection method used leads to statistically significant differences in Accuracy scores among the ten cancer datasets.
Next, the Nemenyi post hoc test returns the following
p-values for each pairwise comparison of means as shown in
Table 18.
At α = 0.05, the pairs below have statistically significant differences in Accuracy scores among the ten cancer datasets.
SPSA vs. GA;
SPSA vs. MI;
SPSA vs. SA;
SPSA vs. MRMR;
RelChaNet vs. SA;
RelChaNet vs. MRMR;
ReliefF vs. GA;
ReliefF vs. SA;
ReliefF vs. MRMR.
4.1.4. LR
The summary statistics of the Balanced Accuracy scores of the LR classifier on all ten datasets after applying the feature selection algorithms are visualized with a violin plot in
Figure 8.
For the Friedman test, we obtained a test statistic of 30.59 and a p-value of 0.00003. Since the p-value is less than 0.05, the result is statistically significant and the null hypothesis should be rejected. Therefore, we have sufficient evidence to conclude that the type of feature selection method used leads to statistically significant differences in Accuracy scores among the ten cancer datasets.
Next, the Nemenyi post hoc test returns the following
p-values for each pairwise comparison of means as shown in
Table 19.
At α = 0.05, the pairs below have statistically significant differences in Accuracy scores among the ten cancer datasets.
SPSA vs. GA;
SPSA vs. MI;
SPSA vs. SA;
SPSA vs. MRMR.
4.1.5. SVM
The summary statistics of the Balanced Accuracy scores of the SVM classifier on all ten datasets after applying the feature selection algorithms are visualized with a violin plot in
Figure 9.
For the Friedman test, we obtained a test statistic of 28.51 and a p-value of 0.000075. Since the p-value is less than 0.05, the result is statistically significant and the null hypothesis should be rejected. Therefore, we have sufficient evidence to conclude that the type of feature selection method used leads to statistically significant differences in Accuracy scores among the ten cancer datasets.
Next, the Nemenyi post hoc test returns the following
p-values for each pairwise comparison of means as shown in
Table 20.
At α = 0.05, the pairs below have statistically significant differences in Accuracy scores among the ten cancer datasets.
SPSA vs. GA;
SPSA vs. MI;
SPSA vs. SA;
SPSA vs. MRMR.
4.1.6. XGB
The summary statistics of the Balanced Accuracy scores of the XGB classifier on all ten datasets after applying the feature selection algorithms are visualized as a violin plot in
Figure 10.
For the Friedman test, we obtained a test statistic of 36.31 and a p-value of 0.000002. Since the p-value is less than 0.05, the result is statistically significant and the null hypothesis should be rejected. Therefore, we have sufficient evidence to conclude that the type of feature selection method used leads to statistically significant differences in Accuracy scores among the ten cancer datasets.
Next, the Nemenyi post hoc test returns the following
p-values for each pairwise comparison of means as shown in
Table 21.
At α = 0.05, the pairs below have statistically significant differences in Accuracy scores among the ten cancer datasets.
SPSA vs. GA;
SPSA vs. MI;
SPSA vs. SA;
SPSA vs. MRMR;
ReliefF vs. GA;
ReliefF vs. SA.
5. Conclusions
This research successfully demonstrated the effectiveness of the Simultaneous Perturbation Stochastic Approximation (SPSA) method for feature selection in large-scale cancer classification tasks, advancing the application of the SPSA technique to high-dimensional genomic datasets. Our comprehensive experimental evaluation across datasets containing over 35,000 features establishes SPSA as a viable and superior alternative to existing feature selection methodologies for cancer detection applications. The experimental results provide compelling evidence for the efficacy of the SPSA-based approach. Through systematic evaluation using six diverse classification algorithms (Decision Trees, K-Nearest Neighbors, LightGBM, Logistic Regression, XGBoost, and Support Vector Machines), we demonstrated that SPSA-generated feature subsets consistently achieve superior classification performance compared to six state-of-the-art feature selection methods. Our approach yielded mostly higher, and often perfect, classification Accuracy across nearly all ten reduced-dimensional datasets, while maintaining competitive computational efficiency, with computation times that were on average comparable and frequently lower than those of the other methods.
The robustness of our findings is underscored by the comprehensive evaluation framework employing multiple performance metrics, including Accuracy, Balanced Accuracy, Precision, Recall, and F1 Score. The consistent advantage of SPSA-based feature selection across these diverse metrics and multiple classifier architectures validates the method’s reliability and generalizability for high-dimensional cancer classification tasks.
Our investigation revealed that while SPSA maintains consistently high performance across most classifier combinations, there are isolated instances of reduced performance when it is paired with certain classifiers. However, these cases are minimal exceptions rather than systematic limitations, and the overall performance profile strongly favors the SPSA approach.
The successful application of SPSA to datasets exceeding 35,000 features establishes a new benchmark for feature selection in high-dimensional biomedical data analysis. We believe that researchers working with high-dimensional genomic, proteomic, or other biomedical datasets can leverage the SPSA-based feature selection method to significantly improve the Accuracy and reliability of their machine learning models.
This work opens several avenues for future research, including the exploration of hybrid approaches combining SPSA with other optimization techniques, the investigation of adaptive parameter tuning for different dataset characteristics, and the extension to multiclass cancer classification problems.