Article

Novel Hybrid Feature Selection Using Binary Portia Spider Optimization Algorithm and Fast mRMR

1 Department of Information Technology, Vardhaman College of Engineering (Autonomous), Hyderabad 501218, Telangana, India
2 Department of Computer Science and Engineering, Siksha ‘O’ Anusandhan (Deemed to be University), Bhubaneswar 751030, Odisha, India
3 Department of Artificial Intelligence & Data Science, Vallurupalli Nageswara Rao Vignana Jyothi Institute of Engineering and Technology (VNRVJIET), Hyderabad 500090, Telangana, India
4 Department of Mechatronics Engineering, Parul Institute of Technology, Parul University, Vadodara 391760, Gujarat, India
5 Department of Computer Science and Engineering, GITA Autonomous College, Bhubaneswar 752054, Odisha, India
6 Centre for Intelligent Healthcare, Coventry University, Coventry CV1 5RW, UK
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Bioengineering 2025, 12(3), 291; https://doi.org/10.3390/bioengineering12030291
Submission received: 2 February 2025 / Revised: 6 March 2025 / Accepted: 12 March 2025 / Published: 14 March 2025

Abstract: Objective: The cancer death rate has accelerated at an alarming rate, making accurate diagnosis at the primary stages crucial to improving prognosis. It has been observed that applying machine learning algorithms to datasets drawn from primary sources can deliver the accuracy expected for cancer diagnosis. Methods: This research presents an innovative cancer classification technique that combines fast minimum redundancy-maximum relevance (fast mRMR) feature selection with the Binary Portia Spider Optimization Algorithm (BPSOA) to optimize the feature subset. The features selected with the aid of fast mRMR are evaluated with a range of classifiers (Support Vector Machine, Weighted Support Vector Machine, Extreme Gradient Boosting, Adaptive Boosting, and Random Forest) to comprehensively validate performance. Results: The classification efficiency of the advanced model is tested on six different cancer datasets that exhibit classification challenges. The empirical analysis confirms that the proposed methodology, FmRMR-BPSOA, is effective, reaching a highest accuracy of 99.79%. This result is significant because the proposed model addresses the need for alternative, highly efficient, and more precise cancer diagnosis. The classification accuracy indicates that the model holds great promise for real-life medical implementations.

1. Introduction

Microarray technology is rapidly evolving, opening new avenues for molecular-level cancer research by generating massive gene expression databases. However, knowledge differs from raw data: meaningful information must be extracted from these vast datasets and the biological data analyzed. Microarray gene expression datasets contain significant redundancy, are noisy, are high-dimensional, and have small sample sizes. These qualities make feature selection more complex [1]. Many redundant genes may lead to low classification accuracy in general. Conventional clustering-based feature selection methods can assign a gene to only one category [2], yet biologically a gene can participate in several pathways, so this operation does not suit that requirement. In addition, using a threshold value to determine the number and size of the modules is not appropriate. Methods such as independent and principal component analysis assume that the original signals are statistically independent, and they fail to respect the internal interactions between transcription factors and biological genes. Because intelligent swarm optimization methods demonstrate strong performance, researchers have started employing them for feature selection [3]. However, because microarray gene expression datasets contain thousands of genes, it is difficult to apply an evolutionary algorithm directly to choose the genes relevant to a disease. As a result, combining two complementary algorithms for feature selection has become a key area of research.
Authors in [4] presented a hybrid two-stage model with a filter-based approach in the first stage to remove unnecessary and unrelated features, the output of which is then fed into a wrapper method based on a recent swarm-based algorithm, the salp swarm algorithm (SSA). In [5], a hybrid approach to feature selection is presented by combining an improved mRMR method with differential evolution. The proposed approach modifies the quantization functions of mRMR to be compatible with the continuous features of microarray data, serving as an initial filter in the feature selection process; improved differential evolution then refines the selected features. Alomari et al. [6] introduced the mRMR filtering technique combined with a wrapper approach based on the Bat algorithm (BA) for gene selection in microarray data. The most significant genes were determined by applying mRMR directly to the whole gene expression dataset; BA was then used to obtain the most relevant subset of genes from the reduced set created by mRMR for cancer identification. The work in [7] combines an ensemble of filters that enhance the robustness and stability of mRMR. The filtering-based strategy generates discriminative genes, while the wrapper-based strategy uses the findings of the filtering stage to define the search space of gene selection. The wrapper approach employs the bat algorithm for gene selection, conjugated with beta hill climbing, a potent local search method; it highlights the deep learning side when defining the search space and finding robust, stable discriminative genes. In [8], the authors developed a feature selection approach in which mRMR is integrated with an artificial bee colony (ABC) algorithm, resulting in mRMR-ABC, which identifies essential genes from microarray profiles; an SVM algorithm is used to determine the classification accuracy of the selected genes. Authors in [9] introduced a hybrid feature selection technique, mRMR-ICA, which combines minimum redundancy-maximum relevance (mRMR) with the imperialist competition algorithm (ICA) for cancer classification; mRMR-ICA uses mRMR to remove redundant genes as a pre-feature-selection step, thus providing ICA with reduced feature sets. The paper in [10] presents a metaheuristic approach for optimizing the most relevant n genes from drug microarray data to improve the minimum redundancy-maximum relevance filter methodology. The study uses three metaheuristic algorithms, particle swarm optimization, cuckoo search, and artificial bee colony; finally, k-nearest neighbors and Support Vector Machines are used as classifiers to reveal how effectively classification performance is achieved. The approach adopted in [11] covers both filter and wrapper methodologies for selecting biomarker genes: the set of worthy genes is obtained in the filtering step by applying mRMR (minimal redundancy and maximum relevance), while the wrapper step combines two principal metaheuristic approaches, BAT-HS and Support Vector Machine. In [12], a new feature selection approach based on a combination of mRMR with an Aquila optimizer is presented; the mRMR methodology in the initialization stage improves the initial population and thereby the overall quality of the population.
A random opposition-based learning strategy is superimposed on this base to enhance the diversification of the Aquila population with a faster convergence rate, and a positional update equation for subsequent iterations of the Aquila optimizer is developed with inertia weight integration to avoid becoming trapped in local optima and to improve its ability to detect the global optimum. The approach in [13] is divided into two primary stages: a filter stage, using the mRMR attribute evaluation method, and a wrapper stage, combining the strengths of the Northern Goshawk Algorithm (NGHA) and the SVM classifier for feature selection. Authors in [14] discussed a new methodology called GradWise, proposed to improve feature selection performance; GradWise introduces a rank-based weighted hybrid of the filter-based mRMR and built-in feature selection techniques to identify the most appropriate features from biomedical datasets. The author in [15] introduces a new feature selection model that ranks all features using the mRMR feature selection technique; features with higher ranking values are selected, and in the following stage a wrapper approach identifies the best features with the help of the Improved Equilibrium Optimization algorithm. The study in [16] proposes a hybrid approach that integrates Salp Swarm Optimization (SSO) and Support Vector Machine (SVM) for breast tumor classification and gene selection. This methodology is applied in two steps: first, mRMR finds the informative and discriminative genes, and then SSO is applied to a weighted SVM (WSVM) for classification. In the SSO-assisted WSVM, redundant genes are eliminated and important ones are weighted in the classification; SSO uses the weights of the selected genes to optimize the kernel parameters. The authors in [17] described and implemented a machine learning model for liver cancer diagnosis from gene expression data; by unifying feature selection with a stacking ensemble learning model, the method addresses the high dimensionality and complexity of genomic data and improves diagnostic accuracy and interpretability. Similarly, in [18], the authors designed a framework in which the random parameters of an extreme learning machine (ELM) classifier are tuned using the self-adaptive multi-population-based elite strategy Jaya (SAMPEJ) algorithm, resulting in a robust classifier called SAMPEJ-ELM. In [19], the authors propose and develop a two-stage ensemble deep learning methodology that utilizes metaheuristic approaches for detecting lung and colon cancers; this method consists of specialized deep CNN architectures for feature extraction and the electric eel foraging optimization algorithm to accurately diagnose a set of unique biomarkers. In [20], a novel hybrid CSSMO metaheuristic learning-based approach was developed that utilizes a deep learning classifier for gene selection and correctly classifies cancer on genes selected by cosmo-enabled spider monkey optimization and cuckoo search algorithms, even for early-stage patients; the dimensionality of the gene expression data is further decreased by mRMR filtering, which improves the CSSMO outcome. Deep learning (DL) techniques have also been adopted by various researchers to optimize the feature subset. In [21], a hybrid approach with a deep learning technique, called ES-DL, is presented.
In that study, elephant search is used to select the optimal feature subset, followed by a stochastic gradient-based DL approach to further reduce the subset. The study in [22] proposed a hybrid multifilter-ensemble model using the Grey Wolf Optimizer with two DL techniques, Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) classifiers, for cancer prognosis. In [23], to address the limitations that arise in machine learning algorithms, a hybrid approach combining Enhanced Chimp Optimization (ECO) with different DL approaches is presented to deal with cancer datasets.

1.1. Motivation and Objectives

Feature identification from gene expression data is the basis for improved diagnosis and therapeutic efficacy in identifying diseases. However, feature selection in gene expression data is prone to problems such as poor removal of redundant features and suboptimal classification performance. This paper presents a novel feature selection method that aims to enhance the effectiveness of feature selection in high-dimensional gene expression datasets and to overcome these adverse problems.
This work aims to present an enhanced feature selection method that utilizes fast mRMR and the PSOA to improve the discovery of critical features in gene expression data. Fast mRMR is a filter feature selection method grounded in information theory, adept at identifying the features most pertinent to the target while minimizing redundancy across features. This study presents a strategy that effectively decreases the dimensionality of the feature space and improves classification performance. The primary contributions of this paper are summarized as follows:
  • Use fast mRMR in the PSOA initialization step to provide an initial population guided by feature significance, thereby forming a strong basis for subsequent optimization searches and enhancing the optimization effectiveness of the algorithm.
  • Apply the PSOA optimization methodology to select the most relevant features in a high-dimensional setting.
  • Evaluate the performance of the proposed model using six distinct cancer microarray datasets.

1.2. Paper Structure

The rest of the paper is organized as follows. Section 2 describes some basic principles involved with fast mRMR and the Portia spider optimizer algorithm. Section 3 describes the implementation of the methodology presented. Section 4 carries out different studies to check the effectiveness of the presented strategy. This paper is concluded in Section 5.

2. Methods

2.1. Portia Spider Optimization Algorithm

PSOA is a recently developed swarm-based technique designed according to the hunting behavior of Portia spiders. The design of the PSOA follows these assumptions about how the spiders locate and attack prey [24].
  • Portia spiders use the vibrations released by ensnared insects.
  • A silk dragline is drawn by the Portia spider to reach the position of the prey.
Portia spiders use two different approaches for prey selection, as described below. In the first approach, to avoid the threat of neighbors, Portia spiders rely on chemical signals during prey selection. The position of the prey is identified from the signal generated by a neighboring spider, and the Portia spider then adjusts its movement to keep away from the neighbors. After waiting patiently for the prey to lose focus, it gradually moves closer and attacks. In the second approach, it identifies the prey using natural web vibrations initiated by ambient breezes, or it can generate similar vibrations independently. If these steps fail to attract the prey, the Portia spider drags a silk dragline above the web and captures the prey. The position adjustment is performed dynamically according to the observed responses.
Mathematical Design: The framework formalizes the behaviors and strategies of Portia spiders as a computational model. In this context, each Portia spider is an individual solution to the problem, and variables denote the parameters or characteristics of that solution.
$$SP = \begin{bmatrix} sp_1^1 & sp_2^1 & \cdots & sp_d^1 \\ sp_1^2 & sp_2^2 & \cdots & sp_d^2 \\ \vdots & \vdots & \ddots & \vdots \\ sp_1^N & sp_2^N & \cdots & sp_d^N \end{bmatrix} \tag{1}$$
To visualize the algorithm in detail, the candidate solutions are presented in matrix form, as given in Equation (1). Each row of the representation is a Portia spider (a candidate solution), while each column represents a solution variable or parameter. In Equation (1), N is the total population of Portia spiders and d denotes the number of variables or factors relevant to the optimization problem. This matrix provides an all-inclusive view of the possible solutions and their characteristics. The positions of the Portia spiders representing the solutions may change throughout the run of the algorithm, reflecting the optimization process as the solutions move closer to the objective. The ranked fitness of the Portia spiders is represented using Equation (2). The movement of the Portia spiders depends on the position of the prey, which is presented using Equation (3).
$$F(SP) = \begin{bmatrix} F(SP_1) \\ F(SP_2) \\ \vdots \\ F(SP_N) \end{bmatrix} \tag{2}$$
$$T = \begin{bmatrix} t_1 & t_2 & \cdots & t_d \end{bmatrix} \tag{3}$$
For finding the optimal solution, the movement of the Portia spiders, guided by their prey's position, becomes important. As if employing the adaptive hunting strategies of Portia spiders, the PSOA navigates the solution space adaptively. This allows the algorithm to employ trial-and-error techniques and adjust its strategies based on feedback, similar to how Portia spiders refine their hunting tactics in response to the behavior of their prey. Like other metaheuristic algorithms, the PSOA requires a balance between exploration and exploitation. The details of the PSOA are presented in Algorithm 1.
  • Mechanism of stalking and striking (exploration phase):
    This phase describes the position-change mechanism of the Portia spider based on the chemical signal received from a neighboring spider. The spider alters its gait and takes an instinctive jump to prevent the prey from concealing itself; through this movement, the Portia spider confirms the position of the prey. The methodology behind this approach is given in Equations (4) and (5).
    $$SP_{ij}^{t+1} = \begin{cases} t_j & \text{if } \alpha_1 < SPF(S_i), \\ SP_{ij}^{t} & \text{if } \alpha_1 \ge SPF(S_i), \end{cases} \tag{4}$$
    Subject to
    $$SPF(S_i) = \frac{F(SP_i)}{\sum_{i=1}^{N} F(SP_i)^2} \tag{5}$$
    where $SP_{ij}^{t+1}$ denotes the updated value of $SP_{ij}$ at iteration $t+1$, and $t_j$ denotes the $j$th parameter of the targeted prey, i.e., the best solution. The current value of $SP_{ij}$ at iteration $t$ is written $SP_{ij}^{t}$. The random variable $\alpha_1$ takes values between 0 and 1. The standardized fitness score and the actual fitness score of the solution $SP_i$ are denoted $NF(SP_i)$ and $F(SP_i)$, respectively. The update rule can be interpreted as follows: if $\alpha_1$ is less than $SPF(S_i)$, the value of $SP_{ij}$ is updated to the new candidate $t_j$; otherwise, $SP_{ij}$ retains its previous value.
  • Mechanism of invading and imitating (exploitation phase):
    This phase presents the natural technique adopted by the Portia spider on the webs of other spiders for easy hunting. In this stage, the Portia spider generates its own vibration and tries to sneak into the web of the prey. Rather than generating its own signal, it often takes advantage of vibrations produced by other insects or by a gentle breeze. This intriguing skill of the Portia spider is presented in Equation (6).
    $$sp_{ij}^{t+1} = r_j^{t} + \left| \alpha_2 \, r_j^{t} - sp_{ij}^{t} \right| \times e^{\alpha_3} \times \cos(2\pi\alpha_4) \tag{6}$$
    Most Portia spiders adopt the above-mentioned trick to capture prey. If it fails, a silk dragline is dragged just above the web so that the spider can reach the updated position and attack the prey with a single jump (new position change). The new updated position can be determined mathematically using Equation (7), derived from Equation (6).
    $$sp_{ij}^{t+1} = r_j^{t} + \alpha_2 \times \left( r_j^{t} - sp_{ij}^{t} \right) \tag{7}$$
The final exploitation phase can be derived using Equation (8) to clearly understand the mechanism behind this study.
$$sp_{ij}^{t+1} = \begin{cases} r_j^{t} + \left| \alpha_2 \, r_j^{t} - sp_{ij}^{t} \right| \times e^{\alpha_3} \times \cos(2\pi\alpha_4), & \alpha_5 < 0.5 \\ r_j^{t} + \alpha_2 \times \left( r_j^{t} - sp_{ij}^{t} \right), & \alpha_5 \ge 0.5 \end{cases} \tag{8}$$
Subject to
$$\alpha_3 = \left( 1 - \frac{I_{cur}}{I_{max}} \right)^2 \tag{9}$$
Here, $\alpha_1$, $\alpha_2$, $\alpha_4$, and $\alpha_5$ are random variables whose values lie within the range [0, 1]. To balance the two phases, the value of $\alpha_3$ is calculated using Equation (9), where $I_{cur}$ and $I_{max}$ denote the current iteration and the maximum number of iterations, respectively.
The following observations about PSOA encourage us to design the new binary PSOA presented in this study and to propose a new hybrid model called mRMR-BPSOA. The choice of BPSOA is based on its unique problem-solving strategies, inspired by the predatory behavior of Portia spiders, which provide significant advantages over conventional metaheuristics. PSOA addresses the premature convergence from which PSO-, GA-, and ACO-style heuristics suffer by incorporating tactical hunting and memory-based tracking, maintaining a balance between exploration and exploitation. By employing Portia spider-inspired probabilistic movement patterns, BPSOA mitigates stagnation in suboptimal regions. Unlike PSO's reliance on velocity updates and GA's dependency on crossover and mutation, PSOA enhances global optimality detection with intelligent searching strategies. Microarray gene expression datasets present complex problems for optimization algorithms, especially in feature selection with many dimensions. PSOA sustains robustness in multidimensional feature spaces owing to its dynamic prey tracking and adaptive step-size mechanisms. Ant Colony Optimization, the Whale Optimization Algorithm, and other bio-inspired algorithms face high computational costs due to the extensive function evaluations they require. Through selective strategic engagement and memory retention, PSOA limits unnecessary evaluations, enhancing execution speed without compromising solution quality. Algorithm 1 shows the working of PSOA, and a Python sketch of one iteration follows it.
Algorithm 1 Portia spider algorithm
Require: Population size (N); number of iterations ($I_{max}$)
1:  Begin
2:  Generate random Portia spiders
3:  while $I_{cur} < I_{max}$ do
4:      Calculate and sort the fitness values
5:      Determine the prey position
6:      Exploration phase:
7:      Update the value of $\alpha_1$
8:      Calculate the standardized fitness score
9:      Update the Portia spider positions using Equation (4)
10:     Exploitation phase:
11:     Update the values of $\alpha_2$, $\alpha_3$, $\alpha_4$, and $\alpha_5$
12:     Update the Portia spider positions using Equation (8)
13:     Update the Portia spider set
14:     $I_{cur} = I_{cur} + 1$
15: end while
16: Determine the best solution
Ensure: Optimal solution and its fitness score.
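As a companion to Algorithm 1, the following is a minimal NumPy sketch of one PSOA iteration over a real-valued population, assuming a generic objective to be minimized. The names (psoa_step, fitness_fn, prey) are illustrative, and the standardized score follows our reconstruction of Equation (5); treat this as a sketch rather than the reference implementation.

```python
import numpy as np

def psoa_step(SP, fitness_fn, prey, I_cur, I_max):
    """One PSOA iteration over a population SP of shape (N, d).

    fitness_fn maps a position vector to a scalar to be minimized;
    prey is the best position found so far (t in Eq. (4), r^t in Eqs. (6)-(8)).
    """
    N, d = SP.shape
    F = np.array([fitness_fn(sp) for sp in SP])
    SPF = F / np.sum(F ** 2)                 # standardized score, Eq. (5) (reconstructed)
    alpha3 = (1.0 - I_cur / I_max) ** 2      # phase-balancing factor, Eq. (9)

    for i in range(N):
        # Exploration: stalk-and-strike update, Eq. (4)
        if np.random.rand() < SPF[i]:
            SP[i] = prey.copy()

        # Exploitation: invade-and-imitate update, Eq. (8)
        a2, a4, a5 = np.random.rand(3)
        if a5 < 0.5:   # web-vibration mimicry branch, Eq. (6)
            SP[i] = prey + np.abs(a2 * prey - SP[i]) * np.exp(alpha3) * np.cos(2 * np.pi * a4)
        else:          # silk-dragline jump branch, Eq. (7)
            SP[i] = prey + a2 * (prey - SP[i])

    # Re-evaluate and return the population together with the new best (prey) position
    F = np.array([fitness_fn(sp) for sp in SP])
    return SP, SP[np.argmin(F)].copy()
```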

2.2. Fast mRMR

Among filter-based feature selection methods, one widely applied approach is the mRMR algorithm, first introduced by Peng et al. [25]. It uses mutual information to measure both the relevance of each feature to the class label and the redundancy among features. The technique was originally developed for the evaluation of microarray gene expression data, where it performed excellently; it is now applied in many fields, such as medicine and anomaly detection in networks and telecommunications. Although the explanation below emphasizes discrete data distributions, the method can also be applied to continuous data, as formulated in its initial description.
mRMR is a sophisticated algorithm designed with the explicit objective of ranking all candidate features based on their relevance and importance to the target variable [26]. This ranking is achieved by evaluating how strongly each individual feature relates to the target variable of interest, combined with a mechanism that penalizes features for redundancy, i.e., overlap with other features. The central concept in this paper is mutual information (I), and the stated primary objective is to maximize the dependency between the set of features X and the corresponding class c. Mutual information is a measurable quantity describing the strength of dependence between two discrete random variables (features); it is expressed in Equation (10) through the marginal probabilities p(a) and p(b) and the joint probability p(a, b).
$$I(A; B) = \sum_{b \in B} \sum_{a \in A} p(a, b) \log \frac{p(a, b)}{p(a)\, p(b)} \tag{10}$$
If the marginal probabilities are not known, they can be approximated from the available data samples. Equation (11) follows this principle, where $n_a$ represents the number of samples that take the value a for the feature in question and n is the total number of samples.
$$p(a) \approx \frac{n_a}{n} \tag{11}$$
Similarly, using Equation (12), we can estimate the joint probability, where the values of the desired features are denoted by a and b, and the number of samples containing both a and b in those features is denoted by $n_{a,b}$.
$$p(a, b) \approx \frac{n_{a,b}}{n} \tag{12}$$
In view of the difficulties involved in applying the maximum-dependency criterion in high-dimensional settings, maximal relevance (max-relevance), as defined in Equation (13), was used instead in the original proposal. Here, the set of selected features is denoted by X.
$$\max D(X, c), \quad D = \frac{1}{|X|} \sum_{x_i \in X} I(x_i; c) \tag{13}$$
The chosen features may be highly correlated with each other; therefore, a redundancy penalty is included to encourage mutually exclusive selection. The minimum redundancy criterion is defined in Equation (14), where X again represents the set of selected features.
$$\min R(X), \quad R = \frac{1}{|X|^2} \sum_{x_i, x_j \in X} I(x_i; x_j) \tag{14}$$
The computational complexity of fast mRMR is O(N), as it uses a matrix-based model, whereas traditional mRMR uses pairwise MI computations with a complexity of O(N²). In many studies, traditional mRMR has proven inadequate for high-dimensional datasets, but the updating scheme and matrix-transformation architecture of fast mRMR can handle such datasets with low overhead. Fast mRMR applies advanced optimization techniques, whereas classical mRMR uses an iterative approach to deal with redundancy that affects classifier performance. Fast mRMR therefore handles high-dimensional datasets effectively, prioritizing relevant features without excessive computational burden compared to traditional mRMR.
High computational cost, dependence on data quality, inability to handle complex feature interactions, scalability issues, and a lack of adaptation to continuous data are limitations of the basic mRMR, so this study uses fast mRMR instead. The algorithm first calculates the relevance value of each input feature and stores it in a relevance vector. The feature with the highest relevance is declared the first (and most recently) selected feature, and the remaining features are then selected one after another under the fast mRMR criterion; this process is repeated until the required number of features is obtained. At each iteration, the algorithm computes the mutual information between the most recently selected feature and the features not yet selected. Algorithm 2 shows the working of fast mRMR feature selection, and a Python sketch follows it. A key improvement in this methodological framework is the accumulated redundancy vector: classical mRMR is expensive because it computes the mutual information between every possible pair of features in detail, whereas here the redundancy is accumulated at each step, so only the mutual information between the unselected features and the most recently selected feature needs to be computed [27].
Algorithm 2 Fast mRMR feature selection
Require:
1:  Dataset with n features: $F = \{f_1, f_2, \ldots, f_n\}$
2:  Target variable T (class label)
3:  Number of features to select: k
Ensure: Selected subset of k features: $S \subset F$
4:  Initialize: $S = \{\}$                    ▹ Empty feature subset
5:  Compute Relevance:
6:  for each feature $f_i \in F$ do
7:      Calculate mutual information $I(f_i; T)$
8:  end for
9:  Iterative Feature Selection:
10: while $|S| < k$ do
11:     Select Feature with Maximum Relevance:
12:     Select $f_i \in F \setminus S$ with the highest $I(f_i; T)$
13:     Compute Redundancy:
14:     for each feature $f_j \in F \setminus S$ do
15:         Calculate $R(f_j; S) = \frac{1}{|S|} \sum_{f_s \in S} I(f_j; f_s)$
16:     end for
17:     Evaluate mRMR Criterion:
18:     for each feature $f_j \in F \setminus S$ do
19:         Calculate $\mathrm{mRMR}(f_j) = I(f_j; T) - R(f_j; S)$
20:     end for
21:     Add Feature to Subset:
22:     Select the feature with the maximum mRMR score
23:     Add the selected feature to S
24: end while
25: return Selected feature subset S
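To make the accumulated-redundancy idea concrete, here is a minimal Python sketch of the selection loop, assuming discretized features and using scikit-learn's mutual_info_score for the MI estimates; fast_mrmr and its internals are illustrative names, not the reference implementation.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def fast_mrmr(X, y, k):
    """Select k features with an accumulated redundancy vector.

    X: (n_samples, n_features) array of discretized features; y: class labels.
    Per step, only the MI against the most recently selected feature is added,
    instead of recomputing all pairwise MI values as in classical mRMR.
    """
    n_features = X.shape[1]
    relevance = np.array([mutual_info_score(X[:, j], y) for j in range(n_features)])
    acc_red = np.zeros(n_features)               # accumulated redundancy vector
    selected = [int(np.argmax(relevance))]       # most relevant feature goes first

    while len(selected) < k:
        last = selected[-1]
        candidates = [j for j in range(n_features) if j not in selected]
        for j in candidates:
            acc_red[j] += mutual_info_score(X[:, j], X[:, last])
        # mRMR criterion: relevance minus mean accumulated redundancy
        scores = {j: relevance[j] - acc_red[j] / len(selected) for j in candidates}
        selected.append(max(scores, key=scores.get))
    return selected
```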

2.3. Binary Portia Spider Optimizer Algorithm

The traditional PSOA can be used only for continuous-value optimization problems and cannot be applied directly to discrete values. The feature selection problem can be represented as a binary vector consisting of only 0s and 1s, where 0 marks the exclusion of a feature and 1 its inclusion. This paper introduces a sigmoid transfer function to map values from the continuous search space into the discrete one, represented by Equation (15):
$$S(X) = \frac{1}{1 + e^{-10(X - 0.5)}} \tag{15}$$
Here, X is the original continuous value and S(X) is its probability of being converted to the binary value 1. The new position of the Portia spider can then be expressed using Equation (16).
$$SP_{ij}(t) = \begin{cases} 1, & \text{if } S(SP_{ij}(t)) > \mathrm{rand} \\ 0, & \text{otherwise} \end{cases} \tag{16}$$
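A compact sketch of this conversion, assuming the continuous positions are stored in a NumPy array; binarize is an illustrative helper name.

```python
import numpy as np

def binarize(SP_continuous):
    """Map continuous spider positions to binary feature masks.

    Applies the sigmoid transfer of Eq. (15), then the stochastic
    thresholding of Eq. (16): 1 keeps a feature, 0 drops it.
    """
    S = 1.0 / (1.0 + np.exp(-10.0 * (SP_continuous - 0.5)))   # Eq. (15)
    return (S > np.random.rand(*S.shape)).astype(int)         # Eq. (16)
```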
BPSOA exploits the efficient hunting behavior of Portia spiders to escape local optima and broaden the search space. The algorithm captures the essence of adaptive predation by allowing candidate solutions to change their movements based on feedback from the environment. By enabling BPSOA to remember previously found solutions, it becomes possible to improve them over successive cycles, thus avoiding premature convergence. This approach improves the balance between exploitation and exploration by focusing on good regions while preventing stagnation. Moreover, movement based on the Portia spider's erratic hunting style ensures diverse, less constrained search behavior, which mitigates the chances of being stuck in local optima, while stochastic perturbations and variable step sizes help on deceptive fitness landscapes. Collectively, these mechanisms increase the stability of convergence and the likelihood of finding global optima in high-dimensional spaces.

2.4. Weighted Support Vector Machine

SVM is a strong supervised machine learning algorithm used for classification and regression problems. It constructs an optimal hyperplane that maximizes the margin between two classes in a given dataset. The margin is the distance from the hyperplane to the closest points of each class, which are called support vectors. At a higher level of abstraction, the SVM algorithm creates a boundary between classes that maximizes separation, with an eye toward generalizing to new, unobserved instances [28].
SVM can solve both linear and non-linear classification problems. For linearly separable data, it constructs a linear hyperplane. For non-linear data, it uses kernel functions, including the RBF, polynomial, and sigmoid kernels. These kernels map the data into a high-dimensional space in which a linear hyperplane becomes possible. This is called the kernel trick, through which SVM can efficiently address complex classification problems without explicitly executing the transformation.
The regularization parameter and the gamma parameter are the significant parameters of SVM, regulating the trade-off between maximizing the margin and minimizing the classification error. However, one significant limitation of conventional SVM is the assumption that all genes contribute homogeneously to classification accuracy. The present paper therefore employs a weighted Support Vector Machine, which integrates PSOA.
The weighted Support Vector Machine (weighted SVM) is an extension of the basic SVM algorithm, specially customized to handle imbalanced datasets. If one class in a dataset is significantly larger than the others, the standard SVM is biased toward the larger class and performs poorly on the smaller ones. Weighted SVM addresses this issue by assigning different weights to the classes, penalizing errors more heavily for the minority class than for the majority class. In weighted SVM, different penalty weights ($C^{+}$ and $C^{-}$) are assigned to the minority and majority classes, respectively:
$$\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + \sum_{i=1}^{n} C_i \xi_i \tag{17}$$
where C i is a class-specific weight defined as:
$$C_i = \begin{cases} C^{+}, & \text{if } y_i = +1 \ (\text{minority class}), \\ C^{-}, & \text{if } y_i = -1 \ (\text{majority class}). \end{cases} \tag{18}$$
The constraints remain the same:
$$y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, \ldots, n. \tag{19}$$
The dual form of the weighted SVM becomes:
$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \tag{20}$$
subject to:
$$0 \le \alpha_i \le C_i, \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \tag{21}$$
where K ( x i , x j ) is the kernel function used to handle non-linear decision boundaries.
The selection of WSVM in this study is motivated by its superior handling of imbalanced data, its weighting of feature importance, and its classification robustness.
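As an illustration of class-specific penalties, the following minimal sketch uses scikit-learn's SVC with its class_weight option, which scales the penalty C per class in the spirit of Equation (18); the ±1 labels and the weight of 5.0 for the minority class are assumptions chosen for the example.

```python
from sklearn.svm import SVC

# Weighted SVM via class-specific penalties: class_weight multiplies C per class,
# mirroring C+ / C- in Eq. (18). In practice the weights can be tuned (here, by
# BPSOA) or set to 'balanced' to scale inversely with class frequencies.
wsvm = SVC(kernel='rbf', C=1.0, gamma='scale',
           class_weight={+1: 5.0, -1: 1.0})  # heavier penalty on the minority class
# Usage: wsvm.fit(X_train, y_train); y_pred = wsvm.predict(X_test)
```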

3. FmRMR-PSOA: The Proposed Model

This section proposes a new feature selection method known as FmRMR-PSOA. In the first phase, the fast mRMR filter algorithm is applied to select appropriate features, with population diversity enhanced through the initialization of two distinct populations. Then, three different methods are employed to improve the search efficiency of the binary PSOA technique, and in the final stage an optimal feature subset with high accuracy and a small number of features is obtained. Finally, the best subset of features is assessed on the test set using 5-fold cross-validation to illustrate its performance. Figure 1 illustrates the flow chart of FmRMR-PSOA.
The initialization of the population vectors in BPSOA is very important. The conventional BPSOA approach generates the members of the initial population by picking random features. Even though this gives a broad spectrum of possibilities and generally converges toward feasible solutions, the initial search phase of BPSOA consumes tremendous computational time, as classifiers must be repeatedly evaluated to find the best solutions. This paper utilizes fast mRMR for feature extraction, which accelerates the convergence of BPSOA. We propose a dual method for the initial population that makes use of the fast mRMR result: the initial members are created from the set of relevant features obtained by fast mRMR. The fast mRMR-based initial population is presented in Figure 2.
We propose a novel variant of the binary PSOA method tailored specifically to the feature selection task in microarray data analysis. PSOA is a global adaptive optimization algorithm that is commonly used because it is simple to implement, converges quickly, and is quite robust; it has been successfully applied to various fields, including data mining, pattern recognition, and artificial neural networks. We enhance the PSOA algorithm by introducing a new binary quantization approach and scaling factor. This update strengthens the exploratory capability of the algorithm in the exploration phase and enhances population diversity, while also guaranteeing that the exploitation capability of the subsequent phase can refine local optima. We present an adaptive crossover operator designed to improve the speed of convergence in the initial phase while maintaining the algorithm's ability to exploit solutions in later stages. In short, this paper introduces an advanced binary PSOA technique that effectively balances exploration and exploitation to solve the feature selection problem for microarray data, providing more efficient and accurate feature relevance detection and improving the overall effectiveness of the feature selection procedure. The mechanism behind the execution of BPSOA is presented in Section 2.
Feature selection aims at achieving a subset of features from dataset X that yields maximum classification accuracy using limited feature information. However, most of the datasets used here are imbalanced; hence, accuracy is replaced with MCC in this study. Each of these factors contributes differently to the effectiveness of the KNN classifier in a given classification, so we combine them into a single weighted metric used as the fitness function. The fitness function used in this study is presented in Equation (22).
$$F = \alpha \times \mathrm{MCC(KNN)} + (1 - \alpha) \times \left( 1 - \frac{N_f}{N} \right) \tag{22}$$
where N and $N_f$ denote the total number of features and the number of selected features, respectively. The constant α represents the weight of the associated objective, and its value lies in [0, 1].
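A minimal sketch of this fitness evaluation, assuming a binary feature mask stored as a NumPy array and a held-out validation split; the helper name fitness and the default KNN settings are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import matthews_corrcoef

def fitness(mask, X_train, y_train, X_val, y_val, alpha=0.9):
    """Fitness of a binary feature mask per Eq. (22):
    alpha * MCC(KNN) + (1 - alpha) * (1 - N_f / N)."""
    idx = np.flatnonzero(mask)
    if idx.size == 0:                 # guard: an empty subset gets the worst score
        return -np.inf
    knn = KNeighborsClassifier().fit(X_train[:, idx], y_train)
    mcc = matthews_corrcoef(y_val, knn.predict(X_val[:, idx]))
    return alpha * mcc + (1 - alpha) * (1 - idx.size / mask.size)
```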

3.1. Implementation Steps of fastmRMR-BPSOA

In the context of the proposed FmRMR model, the process of feature selection using fast mRMR and PSOA is as follows: fast mRMR is used to rank the features, and PSOA selects the optimal set of features from the dataset. A sketch of this pipeline follows the steps below.
Input:
  • Dataset D with features $F = \{f_1, f_2, \ldots, f_n\}$ and labels Y [training data (80%), test data (20%)].
  • Number of top features M to consider after fast mRMR.
  • PSOA parameters: population size N, maximum iterations T, and attraction–repulsion parameters.
Output:
  • Optimal subset of features $F_{opt}$.
Steps:
1. Apply fast mRMR
   (a) Compute the mutual information (MI) of all features with respect to the target variable Y.
   (b) Apply fast mRMR to rank the features based on their relevance to Y and minimal redundancy among features.
   (c) Select the top M features: $F_{top} = \{f_1, f_2, \ldots, f_M\}$.
2. Initialize PSOA
   (a) Representation: Each spider is a binary vector $S_i = \{b_1, b_2, \ldots, b_M\}$, where $b_j = 1$ indicates inclusion of feature $f_j$ and $b_j = 0$ indicates exclusion.
   (b) Randomly initialize a population of N spiders $\{S_1, S_2, \ldots, S_N\}$. Each vector has a size of M, corresponding to the features in $F_{top}$.
3. Define Fitness Function
   (a) For each spider $S_i$:
       i.   Extract the feature subset $F_{sub}$ corresponding to $S_i$.
       ii.  Train a machine learning model ML using $F_{sub}$ on the training data.
       iii. Evaluate ML using a validation metric (e.g., accuracy, F1 score).
       iv.  Fitness score: $\mathrm{Fitness}(S_i) = \mathrm{EvaluationMetric}(ML)$.
4. Run PSOA
   (a) Iteration: for t = 1 to T:
       i.  For each spider $S_i$:
           A. Update its position (binary vector $S_i$) based on the PSOA attraction–repulsion rules and the fitness scores of other spiders.
           B. Apply boundary constraints to ensure binary values {0, 1}.
       ii. Evaluate the fitness of the updated positions of all spiders.
   (b) Stopping criteria: stop if the algorithm converges (no significant improvement in fitness) or reaches T.
5. Return the Optimal Subset
   (a) Identify the spider $S_{best}$ with the highest fitness.
   (b) Return $F_{opt}$, the feature subset represented by $S_{best}$.
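Tying the steps together, the following is an illustrative end-to-end driver under the helper sketches given earlier (fast_mrmr, binarize, fitness). The population move here is a simplified drift toward the best mask rather than the full phase updates of Section 2, and M, N, and T are placeholder values, not the paper's tuned settings.

```python
import numpy as np

def fmrmr_bpsoa(X_train, y_train, X_val, y_val, M=100, N=30, T=50):
    """Illustrative FmRMR-BPSOA driver built on the earlier sketches."""
    top = fast_mrmr(X_train, y_train, k=M)        # Step 1: filter stage
    Xtr, Xva = X_train[:, top], X_val[:, top]

    pop = binarize(np.random.rand(N, M))          # Step 2: random binary spiders
    scores = np.array([fitness(s, Xtr, y_train, Xva, y_val) for s in pop])
    best = pop[np.argmax(scores)].copy()          # prey = best mask found so far

    for _ in range(T):                            # Step 4: BPSOA iterations
        # Simplified drift toward the prey in continuous space, then re-binarize
        cont = pop + np.random.rand(N, M) * (best - pop)
        pop = binarize(cont)
        scores = np.array([fitness(s, Xtr, y_train, Xva, y_val) for s in pop])
        if scores.max() > fitness(best, Xtr, y_train, Xva, y_val):
            best = pop[np.argmax(scores)].copy()

    # Step 5: map the best mask back to original feature indices (F_opt)
    return [top[j] for j in np.flatnonzero(best)]
```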

3.2. Outline of Classifiers

Several classifiers are used to achieve the research goals: basic algorithms, namely SVM and DT, and ensemble learning-based methods, including two boosting algorithms, XGBoost and AdaBoost, as well as a bagging method, RF [29,30]. These algorithms are briefly discussed in this section.

3.2.1. Support Vector Machine (SVM)

The objective of the Support Vector Machine (SVM) is to construct a hyperplane that maximizes the margin between classes in the dataset. This hyperplane is given by Equation (23).
$$H: \omega^{T} x + d = 0 \tag{23}$$
Here, ω is the normal vector of the hyperplane and d represents its offset. The dual problem can be obtained through the application of Lagrange multipliers, as given in Equation (24).
$$L = \frac{1}{2}\|\omega\|^2 - \sum_{i=1}^{n} \kappa_i y_i (\omega \cdot x_i + d) + \sum_{i=1}^{n} \kappa_i \tag{24}$$
This is a minimization problem, so $\partial L / \partial \omega$ is set to 0, and ω can then be expressed as in Equation (25).
$$\omega = \sum_{i=1}^{n} \kappa_i y_i x_i \tag{25}$$

3.2.2. Decision Tree

The decision tree is a widely used machine learning technique trained on a dataset for classification and regression analysis. The model classifies samples by answering a series of questions, and the tree structure mirrors this classification process: the first level of the tree includes all samples. The most difficult part of creating a decision tree is the choice of optimal partition attributes, which can be estimated with the help of entropy, the gain ratio, or the Gini index.

3.2.3. XGBoost

Extreme Gradient Boosting (XGBoost) offers notable advantages for knowledge-base problems. It builds on boosting algorithms that are highly accurate and emphasize economical data handling, which also makes it applicable to datasets with many features. XGBoost is one of the best implementations of gradient-boosted decision trees (GBDT), a technique that has become a central topic because of its speed and ease of model execution. The trees estimate weights at every leaf node in order to approximate the target variable. The method employs a boosting process that combines the predictive power of weak or base component classifiers, typically decision trees, into one strong classifier. The model is popular in areas such as regression, ranking, and classification because of its unmatched flexibility and computational agility. In every iteration, the incorrectly classified points of the most recent tree are identified and assigned higher weights, and the next cycle builds a new tree that focuses on classifying those points correctly. In this manner, a new tree is acquired in each cycle, and the final model uses the ensemble of all trees generated in previous cycles to minimize the misclassification error; the entire process is additive and self-reinforcing. As a starting point, the model was built using the default parameters.

3.2.4. AdaBoost

AdaBoost was originally conceived by Freund and Schapire in 1997 with the aim of enhancing the classification accuracy of a base algorithm. It accomplishes this by combining a set of weak classification functions into a more powerful classifier. This boosting strategy is not only effective for achieving high classification speed but also serves as an alternative means of feature selection.
AdaBoost is considered the most widely used boosting strategy because it captures the intuition of boosting well. It has wide applicability and is one of the most effective sequential-learning boosting approaches. To form the final model, it picks the most important weak classifiers from the base classifier pool and combines them into a strong classifier. The method not only addresses the problem of weak learners but also mitigates overfitting while remaining a fast and effective classifier; an added advantage is that features are selected implicitly through the chosen weak classifiers.
The weak classifiers are typically decision stumps, i.e., classifiers containing a single decision split: a root node attached to a pair of leaf nodes, built for every feature of the given data. AdaBoost trains iteratively: at each round it is provided with a reweighted distribution over the training instances, emphasizing the samples misclassified so far. In the end, a single robust classifier is obtained as the weighted combination of the many weak classifiers. Although AdaBoost does not guarantee a reduction in training error at every step, the weighted ensemble of diverse weak classifiers is what makes it succeed, yielding an effective, if not perfectly precise, classifier.

3.2.5. Random Forest

The ensemble technique Random Forest (RF) consists of many decision trees that are trained independently and whose outputs are combined for better prediction accuracy and less overfitting. Each tree in the forest is trained on a different bootstrap subset of the data using the bagging technique, which promotes diversity among the trees. Each tree outputs a classification, and the outputs are aggregated through majority voting. For classification and regression tasks, Random Forest is very efficient owing to its accuracy, its considerably better handling of overfitting than individual decision trees, and its ability to tackle high-dimensional data.
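For reference, a sketch of how the classifiers outlined above could be instantiated with scikit-learn and the xgboost package; all hyperparameters shown are illustrative starting points, not the tuned configurations used in the experiments.

```python
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier  # requires the xgboost package

# Baseline instantiations of the classifiers discussed in this section.
classifiers = {
    "SVM": SVC(kernel="rbf"),
    "WSVM": SVC(kernel="rbf", class_weight="balanced"),  # weighted variant
    "DT": DecisionTreeClassifier(),
    "XGBoost": XGBClassifier(n_estimators=100, eval_metric="logloss"),
    "AdaBoost": AdaBoostClassifier(n_estimators=100),
    "RF": RandomForestClassifier(n_estimators=100),
}
# Usage: for name, clf in classifiers.items(): clf.fit(X_train, y_train)
```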

3.3. Summary of Biomedical Datasets

This paper utilizes six publicly available biomedical datasets to evaluate the performance of the fastmRMR-BPSOA algorithm relative to other methods. The ALL-AML (A1), colon tumor (A2), central nervous system (A3), and ovarian cancer (A4) datasets are available in a single zip file from http://csse.szu.edu.cn/staff/zhuzx/Datasets.html (accessed on 14 December 2024). The other two datasets, GSE4115 (A5) and GSE10245 (A6), can be obtained from the GEO database at https://www.ncbi.nlm.nih.gov/geo (accessed on 14 December 2024). GEO, maintained by the NCBI, is a central biological database containing large volumes of gene expression data contributed by researchers around the world.
The datasets used in the experiments differ in size: their instance counts range from 60 to 235, and all are of high dimensionality, with feature counts between 2000 and 54,676. These datasets provide valuable information on gene expression, protein profiling, and genomic sequences, which can be used for classification and disease diagnosis. Table 1 summarizes the details of these biomedical datasets.

4. Experimental Analysis

This part describes a set of experiments carried out to test the performance and effectiveness of the proposed method. First, the raw datasets were examined to establish a classification performance measure in the absence of any feature selection, which served as a baseline for these analyses. Then, the BPSOA algorithm was tested on these raw datasets to evaluate its feature selection capacity. Finally, the outcomes achieved with BPSOA were compared with the non-BPSOA outcomes on the primary datasets.
The new hybrid system was then benchmarked against the other hybrid systems examined in this work. It starts with the fast mRMR approach, a first-pass filter designed to increase the performance of a given learning algorithm by eliminating non-informative features from the datasets. The resulting reduced datasets were then input to the wrapper phase, where BPSOA was used to search the feature space. To evaluate the approach, five classifiers were applied, including simple models (SVM, weighted SVM, and DT) as well as more sophisticated ensemble strategies, namely XGBoost and AdaBoost (boosting) and Random Forest (bagging). Classifier performance was foremost evaluated in terms of accuracy, the fraction of correctly classified test instances, which is appropriate for the research question and is used in the fitness function.
The datasets were split such that 80% was used for training and 20% for testing. The hybrid framework was developed on a system running Windows 10, equipped with an Intel i7 processor with a 4.6 GHz clock speed, 16 GB of RAM, a 1 TB SSD, and a 1 TB HDD. The proposed model is evaluated in two phases: in the first phase, the model is built without the BPSOA algorithm, and in the second phase of the evaluation, the model is evaluated with the BPSOA algorithm. To evaluate the performance of the developed models, the following parameters are considered: Accuracy (Acc), Misclassification Ratio (MCR), Precision (Pre), Recall (Rcl), Specificity (Spe), F1 Score (F1Sc), False Negative Ratio (FaNeg), False Positive Ratio (FaPo), and Matthews Correlation Coefficient (MCC). Equations (26)–(33) show the calculation of these parameters, with tPo and fPo denoting true and false positives, and tNe and fNe denoting true and false negatives, respectively.
$$Acc = \frac{tPo + tNe}{tPo + fPo + tNe + fNe} \tag{26}$$
$$Pre = \frac{tPo}{tPo + fPo} \tag{27}$$
$$Rcl = \frac{tPo}{tPo + fNe} \tag{28}$$
$$Spe = \frac{tNe}{tNe + fPo} \tag{29}$$
$$F1Sc = \frac{2 \times Pre \times Rcl}{Pre + Rcl} \tag{30}$$
$$FaNeg = \frac{fNe}{tPo + fNe} \tag{31}$$
$$FaPo = \frac{fPo}{fPo + tNe} \tag{32}$$
$$MCC = \frac{(tPo \cdot tNe) - (fPo \cdot fNe)}{\sqrt{(tPo + fPo)(tPo + fNe)(tNe + fPo)(tNe + fNe)}} \tag{33}$$
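The following sketch computes these metrics directly from confusion-matrix counts (MCR is obtainable as 1 − Acc); the function name is illustrative.

```python
import numpy as np

def metrics_from_counts(tPo, tNe, fPo, fNe):
    """Evaluation metrics of Eqs. (26)-(33) from confusion-matrix counts."""
    acc = (tPo + tNe) / (tPo + fPo + tNe + fNe)        # Eq. (26); MCR = 1 - acc
    pre = tPo / (tPo + fPo)                            # Eq. (27)
    rcl = tPo / (tPo + fNe)                            # Eq. (28), sensitivity
    spe = tNe / (tNe + fPo)                            # Eq. (29)
    f1 = 2 * pre * rcl / (pre + rcl)                   # Eq. (30)
    fneg = fNe / (tPo + fNe)                           # Eq. (31)
    fpo = fPo / (fPo + tNe)                            # Eq. (32)
    mcc = (tPo * tNe - fPo * fNe) / np.sqrt(           # Eq. (33)
        (tPo + fPo) * (tPo + fNe) * (tNe + fPo) * (tNe + fNe))
    return dict(Acc=acc, Pre=pre, Rcl=rcl, Spe=spe,
                F1Sc=f1, FaNeg=fneg, FaPo=fpo, MCC=mcc)
```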

4.1. Phase-I Evaluation

  • On the ALL-AML dataset, fast mRMR-RF outperformed all other methods, achieving the highest accuracy of 93.50%. It surpassed the next best model, fast mRMR-AdaBoost, by 0.61%, and achieved an F1-Score of 94.78% and an MCC of 86.60%. Compared to the worst-performing algorithm, fast mRMR-SVM, the Random Forest model was better by 4.61%. Additionally, it achieved an FNR of 1.67%, indicating an impressive ability to detect true positives, with an FPR of 13.75%.
  • Fast mRMR-AdaBoost had the best results on the colon cancer dataset, with an accuracy of 93.94% and an F1-Score of 95.24%. This was an improvement of 0.51% over the next best algorithm, fast mRMR-WSVM. Compared to fast mRMR-SVM, which had an accuracy of 88.38%, AdaBoost demonstrates an improvement of 5.56%. Moreover, with an MCR of 6.06%, AdaBoost balanced a sensitivity of 97.56% with a specificity of 88%, pointing to an impressive reduction in misclassifications.
  • On the central nervous system dataset, fast mRMR-RF performed best, with 93.43% accuracy and an F1-Score of 94.65%. It surpasses the second-best model, fast mRMR-AdaBoost, by a margin of 1.01% in accuracy and improves on the fast mRMR-SVM method by 6.06%. Its MCC of 86.29%, combined with the lowest FPR of 11.39%, shows that the Random Forest model delivers strong classification results with few false alarms, while keeping the MCR low at 6.57%.
  • On the ovarian cancer dataset, the results differ: fast mRMR-AdaBoost exhibits the best performance, with an overall accuracy of 92.93% and an F1-Score of 94.89%. This model shows an improvement of 1.01% over the second-best model, fast mRMR-XGBoost, and of 3.03% over fast mRMR-SVM. With a specificity of 87.10% and an FNR of 4.41%, the AdaBoost model demonstrates stable, predictive performance on this dataset.
  • On the GSE4115 dataset, the fast mRMR-AdaBoost model outshines the rest with the highest accuracy of 94.47%, outperforming the second-best, fast mRMR-SVM, by 0.05% in accuracy and the fast mRMR-XGBoost method by 1.04%. Furthermore, the AdaBoost model boasts a low FNR of 1.36% and raises specificity to 82.69%, which solidifies it as the best methodology for this dataset.
  • The fast mRMR-WSVM and fast mRMR-AdaBoost models show the highest accuracy of 94.95% on the GSE10245 dataset, beating the third-ranked model, fast mRMR-RF, by 0.51%. Furthermore, the highest F1-Score of 96.58%, with an MCC of 87.46%, indicates that fast mRMR-AdaBoost establishes more reliable classification. The lowest-ranked model, fast mRMR-SVM, shows a notable accuracy gap of 4.04% compared to the AdaBoost model. In addition, an FNR of 0.70% makes the AdaBoost method the best choice for this dataset.
  • Table 2 shows the performance analysis of the model without using the BPSOA technique as the feature selection algorithm.
  • Figure 3 shows the ROC analysis of the model without using BPSOA for different datasets.

4.2. Phase-II Evaluation

  • For the ALL-MLL dataset, the leading algorithm is still the fast mRMR-BPOSA-XGBoost with an accuracy of 99.49%, F1-Score of 99.68%, and MCC score of 98.43%. This result is significantly better than the previously discussed fast mRMR-RF, with an accuracy of 93.50%, and means a 6.39% improvement.
  • Similarly for the colon cancer dataset, the top performer is the fast mRMR-BPOSA-RF approach with 98.99% accuracy, F1-Score of 99.36%, and MCC of 96.92%. The previous methodology known as fast mRMR-AdaBoost was also outperformed (its accuracy was 93.94%, so the RF model shows a remarkable improvement of 5.05%). The RF model has an extremely low MCR of 1.01% and an even lower FNR of 0.64%, showing its great reliability and effectiveness in true positive detection.
  • The methodology with the best results on the central nervous system dataset is the fast mRMR-BPOSA-AdaBoost, with an accuracy of 99.49%, F1-Score of 99.70%, and MCC of 98.18%. There is a noticeable improvement from the previously mentioned fast mRMR-RF (accuracy of 93.43%), with a 6.06% increase, so again there is some advantage.
  • Fast mRMR-BPOSA-WSVM has an accuracy of 98.99%, F1-Score of 99.40%, and MCC of 96.18%, making it the best for the ovarian cancer dataset. Its performance is better than fast mRMR-AdaBoost, the previous best performer, which had an accuracy of 92.93%. The accuracy improvement is 6.06%, a significant increase. Furthermore, WSVM has a low FNR of 0.60%, which allows for highly accurate positive predictions.
  • For the GSE4115 dataset, the best performing model is fast mRMR-BPOSA-RF with an accuracy of 99.50%, F1-Score of 99.72% and MCC of 97.32%. When comparing it to the previously mentioned model fast mRMR-AdaBoost with an accuracy of 94.47%, RF has an improvement of 5.03%. With a 0.56% FNR and a 100% perfect specificity, it is the most reliable method.
  • For the GSE10245 dataset, fast mRMR-BPSOA-RF achieves an accuracy of 97.98%, an F1-Score of 98.84%, and an MCC of 90.94%, a 3.03% increase over the Phase-I best, fast mRMR-AdaBoost (94.95%). With an MCR of only 2.02% and a low FNR of 1.72%, RF proves to be the most effective methodology here. (A sketch of the wrapper-style fitness evaluation underlying these BPSOA results follows this list.)
  • Table 3 shows the performance analysis of the proposed model using the fast mRMR and BPSOA techniques as the feature selection algorithm.
  • Figure 4 shows the ROC analysis of the proposed model using the fast mRMR and BPSOA techniques for selecting features for different datasets.
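The BPSOA position-update equations follow [24] and are not reproduced here. The sketch below illustrates only the generic wrapper step that such a binary optimizer repeats: scoring a candidate binary feature mask by cross-validated accuracy. The penalty weight alpha, the classifier choice, and the function name are illustrative assumptions, not details from the paper:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def mask_fitness(mask: np.ndarray, X: np.ndarray, y: np.ndarray,
                 alpha: float = 0.99) -> float:
    """Score a binary feature mask: cross-validated accuracy on the
    selected columns, lightly penalized by the fraction of features kept."""
    if not mask.any():                      # an empty subset is invalid
        return 0.0
    acc = cross_val_score(RandomForestClassifier(n_estimators=100),
                          X[:, mask], y, cv=5, scoring="accuracy").mean()
    return alpha * acc + (1.0 - alpha) * (1.0 - mask.sum() / mask.size)
```

In each iteration, every spider's binary position vector would be scored this way, and the higher-fitness positions guide the next update.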

4.3. Critical Analysis

  • The newly designed model raises accuracy on the ALL-AML dataset to 99.38%, exceeding every previously reported figure. It is 2.15% better than [5] (97.23%), and improves on [7] (98.61%) and [8] (94.17%) by 0.77% and 5.21%, respectively. Relative to [21], [22], and [23], the gains are 7.98%, 2.28%, and 4.50%, respectively (relative improvements; see the formulas after this list). These improvements demonstrate the efficiency of the fast mRMR-BPSOA model and its advanced feature selection mechanism on the ALL-AML dataset.
  • On the colon cancer dataset, the proposed model raises accuracy to 98.99%, a clear improvement over multiple existing methods: 5.66% over the 93.33% of [5], 5.87% over the 93.12% of [6], 13.51% over the 85.48% of [7], 3.16% over the 95.83% of [8], 7.06% over [11], and 0.73% over [20]. Relative to [21] and [23], the gains are 25.26% and 3.76%, respectively. These results substantiate the claim of the proposed technique against the different models applied to the colon cancer dataset.
  • On the CNS dataset the proposed model achieves an accuracy of 99.79%, outperforming the previously best accuracy of 83.33% reported by [7]. This 16.46% margin shows that the model copes well with complex, high-dimensional datasets. It also represents a relative improvement of 20.02% over [21], which reported 83.14%.
  • For ovarian cancer the proposed model reaches 98.99% accuracy, outperforming most existing benchmarks: it exceeds the 98.88% of [11] by 0.11% and the 96.98% of [20] by 2.01%, and it yields a relative improvement of 5.17% over [21] (94.10%). However, it falls 1.01% short of [6,7], which both reported 100%. The developed model therefore performs strongly but does not outperform every benchmark strategy on the ovarian cancer dataset.
  • The model's accuracy on the GSE4115 dataset is 99.50%, a relative improvement of 8.76% over [21] (91.49%). On the GSE10245 dataset the accuracy is 97.98%, corresponding to relative improvements of 11.04% and 8.75% over [21] and [23], respectively. Together, these results illustrate the robustness of the proposed model across different datasets.
  • Table 4 shows the comparative study of the proposed model with existing models.
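The improvement figures quoted in this subsection follow two conventions, a reading inferred from the values in Table 4 rather than stated explicitly: differences against [5,6,7,8,11,20] are absolute percentage points, while the gains over [21,22,23] are relative improvements:

$$\Delta_{\mathrm{abs}} = \mathrm{Acc}_{\mathrm{prop}} - \mathrm{Acc}_{\mathrm{ref}}, \qquad \Delta_{\mathrm{rel}} = \frac{\mathrm{Acc}_{\mathrm{prop}} - \mathrm{Acc}_{\mathrm{ref}}}{\mathrm{Acc}_{\mathrm{ref}}} \times 100\%.$$

For example, on the CNS dataset the gain over [21] is $(99.79 - 83.14)/83.14 \times 100\% \approx 20.02\%$, matching the figure quoted above.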

5. Conclusions

This research introduces an advanced methodology for cancer diagnosis that hybridizes machine learning techniques with feature selection, integrating fast mRMR with the Binary Portia Spider Optimization Algorithm (BPSOA). The combination extracts the most relevant and least redundant features, which increases classification performance while reducing computational cost. The selected features are passed to five classifiers: SVM, weighted SVM (WSVM), XGBoost, AdaBoost, and Random Forest (RF). The method was applied to six diverse cancer datasets: ALL-AML, colon, CNS, ovarian, GSE4115, and GSE10245. In terms of classification, fast mRMR-BPSOA-AdaBoost was the strongest performer, achieving 99.79% accuracy. These results suggest that, given fast mRMR-BPSOA feature selection, AdaBoost is a highly effective approach for accurate cancer classification on high-dimensional genomic datasets. They also underline the potential of the proposed hybrid machine learning framework to tackle the high dimensionality, feature redundancy, and class imbalance that challenge cancer prognosis and decision support systems (DSSs) for cancer detection. With such accuracy, the model addresses the central requirement of detection and diagnosis systems, namely robustness, and can thereby support timely treatment planning. Future work may explore cross-omics datasets, apply the model in actual clinical settings toward personalized medicine, and develop a parallel computational implementation of fast mRMR and BPSOA. A high-level sketch of the full pipeline follows.
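To make the two-stage workflow concrete, the following self-contained sketch chains both stages on synthetic data. It is illustrative only: mutual information stands in for the fast mRMR filter [27], and a random mask search stands in for the BPSOA position updates [24]; neither stand-in reproduces the paper's algorithms.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for a microarray dataset: few samples, many features.
X, y = make_classification(n_samples=70, n_features=2000, n_informative=20,
                           weights=[0.35, 0.65], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Stage 1 (filter): keep the 100 genes with the highest mutual information
# with the label (the relevance half of mRMR; redundancy is omitted here).
pool = np.argsort(mutual_info_classif(X_tr, y_tr))[::-1][:100]

# Stage 2 (wrapper): random binary masks stand in for the BPSOA updates;
# each candidate mask is scored by 3-fold cross-validated accuracy.
rng = np.random.default_rng(0)
best_mask, best_fit = np.ones(pool.size, dtype=bool), -np.inf
for _ in range(50):
    mask = rng.random(pool.size) < 0.5
    if not mask.any():
        continue
    fit = cross_val_score(AdaBoostClassifier(), X_tr[:, pool[mask]], y_tr,
                          cv=3, scoring="accuracy").mean()
    if fit > best_fit:
        best_mask, best_fit = mask, fit

# Final classifier trained on the selected gene subset.
clf = AdaBoostClassifier().fit(X_tr[:, pool[best_mask]], y_tr)
print("held-out accuracy:",
      accuracy_score(y_te, clf.predict(X_te[:, pool[best_mask]])))
```

In the actual framework, the random search loop would be replaced by the Portia-spider population updates of [24], and the final classifier would be whichever of the five models performs best on the dataset at hand.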

Author Contributions

Conceptualization, B.S. and A.P. (Amrutanshu Panigrahi); methodology, B.S., A.P. (Amrutanshu Panigrahi) and A.P. (Abhilash Pati); software, A.P. (Abhilash Pati) and M.N.D.; validation, B.S., A.P. (Amrutanshu Panigrahi) and A.P. (Abhilash Pati); formal analysis, G.S. and P.J.; investigation, B.S.; resources, A.P. (Amrutanshu Panigrahi) and A.P. (Abhilash Pati); data curation, A.P. (Abhilash Pati) and G.S.; writing—original draft preparation, B.S. and A.P. (Amrutanshu Panigrahi); writing—review and editing, A.P. (Abhilash Pati), P.J. and H.L.; visualization, A.P. (Abhilash Pati) and A.P. (Amrutanshu Panigrahi); supervision, B.S. and A.P. (Amrutanshu Panigrahi); project administration, P.J. and H.L.; funding acquisition, P.J. and H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in the manuscript are available at http://csse.szu.edu.cn/staff/zhuzx/Datasets.html and https://www.ncbi.nlm.nih.gov/geo, accessed on 14 December 2024.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Rajeswari, M.; Chandrasekar, A.; Nasiya, P.M. Disease Prognosis by Machine Learning over Big Data from Healthcare Communities. Int. J. Recent Technol. Eng. 2019, 8, 680–683.
  2. Sahu, B. A Combo Feature Selection Method (Filter + Wrapper) for Microarray Gene Classification. Int. J. Pure Appl. Math. 2018, 118, 389–401.
  3. Xue, B.; Zhang, M.; Browne, W.N.; Yao, X. A Survey on Evolutionary Computation Approaches to Feature Selection. IEEE Trans. Evol. Comput. 2016, 20, 606–626.
  4. Mahapatra, M.; Majhi, S.K.; Dhal, S.K. MRMR-SSA: A Hybrid Approach for Optimal Feature Selection. Evol. Intell. 2022, 15, 2017–2036.
  5. Yu, K.; Li, W.; Xie, W.; Wang, L. A Hybrid Feature-Selection Method Based on MRMR and Binary Differential Evolution for Gene Selection. Processes 2024, 12, 313.
  6. Alomari, O.A.; Khader, A.T.; Al-Betar, M.A.; Abualigah, L.M. MRMR BA: A Hybrid Gene Selection Algorithm for Cancer Classification. J. Theor. Appl. Inf. Technol. 2017, 95, 2610–2618.
  7. Alomari, O.A.; Khader, A.T.; Al-Betar, M.A.; Awadallah, M.A. A Novel Gene Selection Method Using Modified MRMR and Hybrid Bat-Inspired Algorithm with β-Hill Climbing. Appl. Intell. 2018, 48, 4429–4447.
  8. Alshamlan, H.; Badr, G.; Alohali, Y. MRMR-ABC: A Hybrid Gene Selection Algorithm for Cancer Classification Using Microarray Gene Expression Profiling. BioMed Res. Int. 2015, 2015, 1–15.
  9. Wang, S.; Kong, W.; Aorigele; Deng, J.; Gao, S.; Zeng, W. Hybrid Feature Selection Algorithm MRMR-ICA for Cancer Classification from Microarray Gene Expression Data. Comb. Chem. High Throughput Screen. 2018, 21, 420–430.
  10. Sahu, B.; Panigrahi, A.; Sukla, S.; Biswal, B.B. MRMR-BAT-HS: A Clinical Decision Support System for Cancer Diagnosis. Leukemia 2020, 7129, 48.
  11. Qin, X.; Zhang, S.; Dong, X.; Shi, H.; Yuan, L. Improved Aquila Optimizer with MRMR for Feature Selection of High-Dimensional Gene Expression Data. Clust. Comput. 2024, 27, 13005–13027.
  12. Vincentina, N.; Nagarajan, R. Feature Selection in Microarray Using Proposed Hybrid Minimum Redundancy-Maximum Relevance (MRMR) and Modified Genetic Algorithm (MGA). Int. J. Exp. Res. Rev. 2024, 39, 82–91.
  13. Yaqoob, A. Combining the MRMR Technique with the Northern Goshawk Algorithm (NGHA) to Choose Genes for Cancer Classification. Int. J. Inf. Technol. 2024.
  14. Tasci, E.; Jagasia, S.; Zhuge, Y.; Camphausen, K.; Krauze, A.V. GradWise: A Novel Application of a Rank-Based Weighted Hybrid Filter and Embedded Feature Selection Method for Glioma Grading with Clinical and Molecular Characteristics. Cancers 2023, 15, 4628.
  15. Varzaneh, Z.A.; Hossein, S.; Mood, S.E.; Javidi, M.M. A New Hybrid Feature Selection Based on Improved Equilibrium Optimization. Chemom. Intell. Lab. Syst. 2022, 228, 104618.
  16. Yaqoob, A.; Verma, N.K.; Aziz, R.M. Improving Breast Cancer Classification with MRMR + SS0 + WSVM: A Hybrid Approach. Multimed. Tools Appl. 2024.
  17. Mahmoud, A.; Takaoka, E. An Enhanced Machine Learning Approach with Stacking Ensemble Learner for Accurate Liver Cancer Diagnosis Using Feature Selection and Gene Expression Data. Healthc. Anal. 2025, 7, 100373.
  18. Paramita, D.P.; Mohapatra, P. A Modified Metaheuristic Algorithm Integrated ELM Model for Cancer Classification. Sci. Iran. 2022, 29, 613–631.
  19. Razmjouei, P.; Moharamkhani, E.; Hasanvand, M.; Daneshfar, M.; Shokouhifar, M. Metaheuristic-Driven Two-Stage Ensemble Deep Learning for Lung/Colon Cancer Classification. Comput. Mater. Contin. 2024, 80, 3855–3880.
  20. Mahto, R.; Ahmed, S.U.; Rahman, R.u.; Aziz, R.M.; Roy, P.; Mallik, S.; Li, A.; Shah, M.A. A Novel and Innovative Cancer Classification Framework through a Consecutive Utilization of Hybrid Feature Selection. BMC Bioinform. 2023, 24, 479.
  21. Panda, M. Elephant Search Optimization Combined with Deep Neural Network for Microarray Data Analysis. J. King Saud Univ.—Comput. Inf. Sci. 2017, 32, 940–948.
  22. Sahu, B.; Dash, S. Hybrid Multifilter Ensemble Based Feature Selection Model from Microarray Cancer Datasets Using GWO with Deep Learning. In Proceedings of the 2023 3rd International Conference on Intelligent Technologies (CONIT), Hubli, India, 23–25 June 2023; pp. 1–6.
  23. Das, A.; Neelima, N.; Deepa, K.; Özer, T. Gene Selection Based Cancer Classification with Adaptive Optimization Using Deep Learning Architecture. IEEE Access 2024, 12, 62234–62255.
  24. Son, H.; Nguyen, T. Portia Spider Algorithm: An Evolutionary Computation Approach for Engineering Application. Artif. Intell. Rev. 2024, 57, 24.
  25. Peng, H.; Ding, C. Minimum Redundancy and Maximum Relevance Feature Selection and Recent Advances in Cancer Classification. Feature Sel. Data Min. 2005, 52, 1–107.
  26. Panigrahi, A.; Pati, A.; Sahu, B.; Das, M.N.; Kumar, S.; Sahoo, G.; Kant, S. En-MinWhale: An Ensemble Approach Based on MRMR and Whale Optimization for Cancer Diagnosis. IEEE Access 2023, 11, 113526–113542.
  27. Ramírez-Gallego, S.; Lastra, I.; Martínez-Rego, D.; Bolón-Canedo, V.; Benítez, J.M.; Herrera, F.; Alonso-Betanzos, A. Fast-MRMR: Fast Minimum Redundancy Maximum Relevance Algorithm for High-Dimensional Big Data. Int. J. Intell. Syst. 2016, 32, 134–152.
  28. Huang, C.; Zhou, J.; Chen, J.; Yang, J.; Clawson, K.; Peng, Y. A Feature Weighted Support Vector Machine and Artificial Neural Network Algorithm for Academic Course Performance Prediction. Neural Comput. Appl. 2023, 35, 11517–11529.
  29. Elnaggar, A.H.; El-Hameed, A.S.A.; Yakout, M.A.; Areed, N.F.F. Machine Learning for Breast Cancer Detection with Dual-Port Textile UWB MIMO Bra-Tenna System. Information 2024, 15, 467.
  30. Mohi, M.; Mamun, A.A.; Chakrabarti, A.; Mostafiz, R.; Dey, S.K. An Ensemble Machine Learning-Based Approach to Predict Cervical Cancer Using Hybrid Feature Selection. Neurosci. Inform. 2024, 4, 100169.
Figure 1. Workflow model of FmRMR-BPSOA.
Figure 2. Initial population of fast mRMR.
Figure 3. ROC analysis for (a) ALL-AML, (b) colon, (c) CNS, (d) ovarian, (e) GSE4115, and (f) GSE10245 cancer datasets without using BPSOA as the feature selection algorithm.
Figure 4. ROC analysis for (a) ALL-AML, (b) colon, (c) CNS, (d) ovarian, (e) GSE4115, and (f) GSE10245 cancer datasets using fast mRMR and BPSOA as the feature selection algorithm.
Table 1. Summary of biomedical datasets.

| Dataset | Instances/Attributes | Class Distribution | Imbalance Ratio |
|---|---|---|---|
| A1 | 70/7129 | AML/ALL: 25/47 | 1.88 |
| A2 | 62/2000 | Tumor/Normal: 40/22 | 1.82 |
| A3 | 60/7129 | Class0/Class1: 39/21 | 1.85 |
| A4 | 235/15,154 | Cancer/Normal: 162/91 | 1.78 |
| A5 | 192/22,216 | Cancer/Normal: 97/95 | 1.02 |
| A6 | 58/54,676 | AC/SCC: 40/18 | 2.22 |
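The imbalance ratio in Table 1 is the majority-to-minority class-size ratio; the class counts confirm this reading (e.g., for A1, $47/25 = 1.88$):

$$\mathrm{IR} = \frac{\max(N_{1}, N_{2})}{\min(N_{1}, N_{2})},$$

where $N_{1}$ and $N_{2}$ are the two class sizes.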
Table 2. Performance metrics for different datasets using fast mRMR with various classifiers.

| Dataset | Model | Acc (%) | MCR (%) | Pre (%) | Rcl (%) | F1-Sc (%) | Spe (%) | FaNeg (%) | FaPo (%) | MCC (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| ALL-AML | Fast mRMR-SVM | 88.89 | 11.11 | 85.27 | 97.35 | 90.91 | 77.65 | 2.65 | 22.35 | 77.90 |
| | Fast mRMR-WSVM | 90.91 | 9.09 | 86.29 | 99.07 | 92.24 | 81.11 | 0.93 | 18.89 | 82.53 |
| | Fast mRMR-XGBoost | 91.41 | 8.59 | 91.74 | 94.07 | 92.89 | 87.50 | 5.93 | 12.50 | 82.10 |
| | Fast mRMR-AdaBoost | 92.89 | 7.11 | 91.27 | 97.46 | 94.26 | 86.08 | 2.54 | 13.92 | 85.27 |
| | Fast mRMR-RF | 93.50 | 6.50 | 91.47 | 98.33 | 94.78 | 86.25 | 1.67 | 13.75 | 86.60 |
| Colon | Fast mRMR-SVM | 88.38 | 11.62 | 86.61 | 94.83 | 90.53 | 79.27 | 5.17 | 20.73 | 76.10 |
| | Fast mRMR-WSVM | 93.43 | 6.57 | 90.55 | 99.14 | 94.65 | 85.37 | 0.86 | 14.63 | 86.79 |
| | Fast mRMR-XGBoost | 92.93 | 7.07 | 93.39 | 94.96 | 94.17 | 89.87 | 5.04 | 10.13 | 85.21 |
| | Fast mRMR-AdaBoost | 93.94 | 6.06 | 93.02 | 97.56 | 95.24 | 88.00 | 2.44 | 12.00 | 87.10 |
| | Fast mRMR-RF | 91.92 | 8.08 | 91.13 | 95.76 | 93.39 | 86.25 | 4.24 | 13.75 | 83.18 |
| CNS | Fast mRMR-SVM | 87.37 | 12.63 | 86.40 | 93.10 | 89.63 | 79.27 | 6.90 | 20.73 | 73.89 |
| | Fast mRMR-WSVM | 88.38 | 11.62 | 87.80 | 93.10 | 90.38 | 81.71 | 6.90 | 18.29 | 75.97 |
| | Fast mRMR-XGBoost | 91.41 | 8.59 | 88.52 | 97.30 | 92.70 | 83.91 | 2.70 | 16.09 | 82.87 |
| | Fast mRMR-AdaBoost | 92.42 | 7.58 | 91.27 | 96.64 | 93.88 | 86.08 | 3.36 | 13.92 | 84.20 |
| | Fast mRMR-RF | 93.43 | 6.57 | 92.74 | 96.64 | 94.65 | 88.61 | 3.36 | 11.39 | 86.29 |
| Ovarian | Fast mRMR-SVM | 89.90 | 10.10 | 90.07 | 95.49 | 92.70 | 78.46 | 4.51 | 21.54 | 76.70 |
| | Fast mRMR-WSVM | 91.92 | 8.08 | 92.48 | 95.35 | 93.89 | 85.51 | 4.65 | 14.49 | 82.04 |
| | Fast mRMR-XGBoost | 92.42 | 7.58 | 89.60 | 98.25 | 93.72 | 84.52 | 1.75 | 15.48 | 84.79 |
| | Fast mRMR-AdaBoost | 92.93 | 7.07 | 94.20 | 95.59 | 94.89 | 87.10 | 4.41 | 12.90 | 83.44 |
| | Fast mRMR-RF | 90.82 | 9.18 | 90.00 | 94.74 | 92.31 | 85.37 | 5.26 | 14.63 | 81.10 |
| GSE4115 | Fast mRMR-SVM | 94.42 | 5.58 | 93.06 | 99.26 | 96.06 | 83.87 | 0.74 | 16.13 | 87.06 |
| | Fast mRMR-WSVM | 93.94 | 6.06 | 92.03 | 99.22 | 95.49 | 84.29 | 0.78 | 15.71 | 86.87 |
| | Fast mRMR-XGBoost | 93.43 | 6.57 | 93.88 | 97.18 | 95.50 | 83.93 | 2.82 | 16.07 | 83.54 |
| | Fast mRMR-AdaBoost | 94.47 | 5.53 | 94.16 | 98.64 | 96.35 | 82.69 | 1.36 | 17.31 | 85.42 |
| | Fast mRMR-RF | 93.43 | 6.57 | 91.95 | 99.28 | 95.47 | 80.00 | 0.72 | 20.00 | 84.42 |
| GSE10245 | Fast mRMR-SVM | 90.91 | 9.09 | 92.95 | 95.39 | 94.16 | 76.09 | 4.61 | 23.91 | 73.84 |
| | Fast mRMR-WSVM | 94.95 | 5.05 | 96.88 | 96.88 | 96.88 | 86.84 | 3.13 | 13.16 | 83.72 |
| | Fast mRMR-XGBoost | 93.94 | 6.06 | 95.63 | 96.84 | 96.23 | 82.50 | 3.16 | 17.50 | 80.89 |
| | Fast mRMR-AdaBoost | 94.95 | 5.05 | 94.00 | 99.30 | 96.58 | 83.93 | 0.70 | 16.07 | 87.46 |
Table 3. Performance metrics for different datasets using fast mRMR-BPSOA with various classifiers.

| Dataset | Model | Acc (%) | MCR (%) | Pre (%) | Rcl (%) | F1-Sc (%) | Spe (%) | FaNeg (%) | FaPo (%) | MCC (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| ALL-AML | Fast mRMR-BPSOA-SVM | 95.48 | 4.52 | 95.48 | 98.67 | 97.05 | 85.71 | 1.33 | 14.29 | 87.60 |
| | Fast mRMR-BPSOA-WSVM | 97.47 | 2.53 | 97.44 | 99.35 | 98.38 | 91.11 | 0.65 | 8.89 | 92.73 |
| | Fast mRMR-BPSOA-XGBoost | 99.38 | 0.62 | 99.37 | 100.00 | 99.68 | 97.50 | 0.00 | 2.50 | 98.43 |
| | Fast mRMR-BPSOA-AdaBoost | 98.48 | 1.52 | 98.74 | 99.37 | 99.05 | 95.00 | 0.63 | 5.00 | 95.27 |
| | Fast mRMR-BPSOA-RF | 98.99 | 1.01 | 99.38 | 99.38 | 99.38 | 97.30 | 0.62 | 2.70 | 96.68 |
| Colon | Fast mRMR-BPSOA-SVM | 97.98 | 2.02 | 99.39 | 98.20 | 98.80 | 96.77 | 1.80 | 3.23 | 92.61 |
| | Fast mRMR-BPSOA-WSVM | 96.97 | 3.03 | 98.11 | 98.11 | 98.11 | 92.31 | 1.89 | 7.69 | 90.42 |
| | Fast mRMR-BPSOA-XGBoost | 97.98 | 2.02 | 99.38 | 98.17 | 98.77 | 97.06 | 1.83 | 2.94 | 93.12 |
| | Fast mRMR-BPSOA-AdaBoost | 98.48 | 1.52 | 99.39 | 98.79 | 99.09 | 96.97 | 1.21 | 3.03 | 94.63 |
| | Fast mRMR-BPSOA-RF | 98.99 | 1.01 | 99.36 | 99.36 | 99.36 | 97.56 | 0.64 | 2.44 | 96.92 |
| CNS | Fast mRMR-BPSOA-SVM | 95.96 | 4.04 | 97.62 | 97.62 | 97.62 | 86.67 | 2.38 | 13.33 | 84.29 |
| | Fast mRMR-BPSOA-WSVM | 97.98 | 2.02 | 100.00 | 97.63 | 98.80 | 100.00 | 2.37 | 0.00 | 92.63 |
| | Fast mRMR-BPSOA-XGBoost | 97.34 | 2.66 | 99.35 | 97.45 | 98.39 | 96.77 | 2.55 | 3.23 | 90.85 |
| | Fast mRMR-BPSOA-AdaBoost | 99.79 | 0.21 | 99.40 | 100.00 | 99.70 | 96.97 | 0.00 | 3.03 | 98.18 |
| | Fast mRMR-BPSOA-RF | 97.98 | 2.02 | 99.39 | 98.19 | 98.79 | 96.88 | 1.81 | 3.13 | 92.79 |
| Ovarian | Fast mRMR-BPSOA-SVM | 97.47 | 2.53 | 98.74 | 98.13 | 98.43 | 94.74 | 1.88 | 5.26 | 91.95 |
| | Fast mRMR-BPSOA-WSVM | 98.99 | 1.01 | 99.40 | 99.40 | 99.40 | 96.77 | 0.60 | 3.23 | 96.18 |
| | Fast mRMR-BPSOA-XGBoost | 97.98 | 2.02 | 98.85 | 98.85 | 98.85 | 91.67 | 1.15 | 8.33 | 90.52 |
| | Fast mRMR-BPSOA-AdaBoost | 98.48 | 1.52 | 99.43 | 98.87 | 99.15 | 95.24 | 1.13 | 4.76 | 92.21 |
| | Fast mRMR-BPSOA-RF | 96.95 | 3.05 | 98.16 | 98.16 | 98.16 | 91.18 | 1.84 | 8.82 | 89.34 |
| GSE4115 | Fast mRMR-BPSOA-SVM | 97.47 | 2.53 | 98.20 | 98.80 | 98.50 | 90.63 | 1.20 | 9.38 | 90.58 |
| | Fast mRMR-BPSOA-WSVM | 99.49 | 0.51 | 100.00 | 99.39 | 99.70 | 100.00 | 0.61 | 0.00 | 98.22 |
| | Fast mRMR-BPSOA-XGBoost | 98.99 | 1.01 | 99.39 | 99.39 | 99.39 | 96.97 | 0.61 | 3.03 | 96.36 |
| | Fast mRMR-BPSOA-AdaBoost | 97.47 | 2.53 | 99.37 | 97.53 | 98.44 | 97.22 | 2.47 | 2.78 | 91.89 |
| | Fast mRMR-BPSOA-RF | 99.50 | 0.50 | 100.00 | 99.44 | 99.72 | 100.00 | 0.56 | 0.00 | 97.32 |
| GSE10245 | Fast mRMR-BPSOA-SVM | 94.44 | 5.56 | 97.59 | 95.86 | 96.72 | 86.21 | 4.14 | 13.79 | 78.83 |
| | Fast mRMR-BPSOA-WSVM | 96.97 | 3.03 | 98.29 | 98.29 | 98.29 | 86.96 | 1.71 | 13.04 | 85.24 |
| | Fast mRMR-BPSOA-XGBoost | 96.97 | 3.03 | 98.16 | 98.16 | 98.16 | 91.43 | 1.84 | 8.57 | 89.59 |
| | Fast mRMR-BPSOA-AdaBoost | 96.46 | 3.54 | 99.40 | 96.51 | 97.94 | 96.15 | 3.49 | 3.85 | 86.13 |
| | Fast mRMR-BPSOA-RF | 97.98 | 2.02 | 99.42 | 98.28 | 98.84 | 95.83 | 1.72 | 4.17 | 90.94 |
Table 4. Performance comparison (proposed vs. existing models) (%).

| Methodology | ALL-AML | Colon | CNS | Ovarian | GSE4115 | GSE10245 |
|---|---|---|---|---|---|---|
| [5] | 97.23 | 93.33 | – | – | – | – |
| [6] | – | 93.12 | – | 100 | – | – |
| [7] | 98.61 | 85.48 | 83.33 | 100 | – | – |
| [8] | 94.17 | 95.83 | – | – | – | – |
| [11] | – | 91.93 | – | 98.88 | – | – |
| [20] | – | 98.27 | – | 96.98 | – | – |
| [21] | 92.11 | 79.03 | 83.14 | 94.10 | 91.49 | 88.24 |
| [22] with RNN | 97.11 | – | – | – | – | – |
| [22] with LSTM | 97.17 | – | – | – | – | – |
| [23] with CNN | 94.50 | 93.60 | – | – | – | 87.70 |
| [23] with LSTM | 94.10 | 94.30 | – | – | – | 88.20 |
| [23] with DCNN | 95.10 | 95.40 | – | – | – | 90.10 |
| Proposed | 99.38 | 98.99 | 99.79 | 98.99 | 99.50 | 97.98 |