Escaping Stagnation through Improved Orca Predator Algorithm with Deep Reinforcement Learning for Feature Selection

Abstract: Stagnation at local optima represents a significant challenge in bio-inspired optimization algorithms, often leading to suboptimal solutions. This paper addresses this issue by proposing a hybrid model that combines the Orca predator algorithm with deep Q-learning. The Orca predator algorithm is an optimization technique that mimics the hunting behavior of orcas, solving complex optimization problems by exploring and exploiting search spaces efficiently. Deep Q-learning is a reinforcement learning technique that combines Q-learning with deep neural networks. This integration aims to turn the stagnation problem into an opportunity for more focused and effective exploitation, enhancing the optimization technique's performance and accuracy. The proposed hybrid model leverages the biomimetic strengths of the Orca predator algorithm to identify promising regions of the search space, complemented by the fine-tuning capabilities of deep Q-learning to navigate these areas precisely. The practical application of this approach is evaluated on the feature selection problem using the high-dimensional Heartbeat Categorization Dataset. This dataset, comprising complex electrocardiogram signals, provides a robust platform for testing the feature selection capabilities of our hybrid model. Our experimental results are encouraging, showcasing the hybrid strategy's capability to identify relevant features without significantly compromising the performance metrics of machine learning models. This analysis was performed by comparing the improved Orca predator algorithm against its native version and a set of state-of-the-art algorithms.


Introduction
Over the years, optimization mechanisms have evolved significantly alongside the growth of scientific knowledge, as evidenced by the considerable progress in this field of study [1]. In this context, artificial intelligence has led the most significant innovations, and within this realm, bio-inspired optimization algorithms, especially those employing biomimetic approaches, have achieved significant benefits [2]. Despite their proven efficiency in solving a variety of complex optimization problems [3], these methods face a crucial challenge: stagnation at local optima. Many works treat this issue as a problem to be solved through sophisticated exploration techniques that visit other regions of the search space [4][5][6]. However, this stagnation can also be viewed as an opportunity to exploit promising zones of the search space with more focused and controlled diversification, potentially enhancing the algorithm's performance and accuracy.
In swarm intelligence algorithms, the stagnation problem typically occurs when solutions converge to a local optimum rather than the desired global optimum [7]. This is primarily due to artificial agents being guided by local information in their vicinity, leading to premature convergence to suboptimal solutions. When the swarm becomes trapped in a local optimum, exploring other regions of the search space becomes challenging. This convergence is usually disrupted through random exploration processes, which implies that the search for the global optimum is not always enriched by targeted tactics but is sometimes guided by a more random process. Although such randomness may open new areas for exploration [8,9], it does not necessarily ensure a more effective path to optimal solutions.
In this research, we pioneer the synergy of the Orca predator algorithm (OPA) and deep Q-learning (DQL), aiming to harness the strengths of both to identify and exploit promising solutions more effectively, building on the foundational work detailed in [10]. Traditional methodologies often fall short by prematurely converging on local optima; our strategy, however, capitalizes on such convergences, treating them as gateways for directed exploration towards areas of promise. The selection of OPA was deliberate, driven by its intricate parameter requirements and the profound influence of its calibration. This methodology is designed to be adaptable and suitable for application to other techniques with similar functional characteristics. The fusion of OPA and DQL marks a notable innovation in tackling optimization challenges, merging OPA's biomimetic prowess in swiftly pinpointing areas worth exploring with DQL's precision in refining searches within these zones.
Despite our proposal's significant advancements, we also face challenges inherent in integrating optimization techniques with deep learning algorithms [11]. One of the main challenges in this field is the management of high-dimensional search spaces, which often lead to a higher probability of generating similar solutions. Additionally, the computational complexity of combining OPA with DQL raises practical concerns regarding computational resources and execution time, particularly in large-scale problems [12]. We acknowledge the potential of our approach in various optimization scenarios while also considering that its applicability and effectiveness may need to be tailored to the specific characteristics and complexities of each unique problem set.
The proposal will be evaluated on the feature selection (FS) problem, aiming to identify a subset of features that offers comparable results with less computational effort. This is particularly relevant in datasets with irrelevant, noisy, and high-dimensional data [13,14]. The dataset selected for evaluation is the Heartbeat Categorization Dataset [15], available on Kaggle. This dataset contains electrocardiogram signals of heartbeats, which can be normal or affected by various arrhythmias and myocardial infarctions. The dataset includes 188 features, reflecting the dimensionality of our problem. To validate our proposal, the improved Orca predator algorithm will be rigorously compared with its native version and various established state-of-the-art techniques, underscoring its competitive edge and potential for setting new benchmarks in the field.
This research is committed to exploring the frontiers of bio-inspired optimization methods and machine learning, seeking to advance not only theory and methodology but also the practical application of these techniques to real-world problems. With a focus on feature selection in complex datasets, this study aims to overcome existing limitations and pave new paths for future research in this exciting and rapidly evolving field.
The structure of this work is presented as follows: Section 2 provides a robust analysis of current work on hybridizations between learning techniques and metaheuristics. Section 3 details the conceptual framework of the study, emphasizing the biomimetic principles of OPA, the formal description of DQL, and the formulation of FS. Section 4 explains how reinforcement learning is integrated into the Orca predator algorithm. Section 5 outlines the phases of the experimental design. Section 6 discusses the results achieved, underscoring the approach's effectiveness in feature selection. Finally, the conclusions and future work are presented in Section 7.

Related Work
In the realm of computational intelligence, particularly biomimetic algorithms, there has been a concerted and vigorous effort to enhance the search capabilities of these algorithms. This endeavor has involved not only the augmentation of existing models with more sophisticated decision-making and optimization strategies but also the integration of innovative computational techniques that mimic natural processes more closely [16]. These improvements have been manifested in numerous ways, from increased speed and accuracy in finding optimal solutions to a more profound ability to tackle high-dimensional and dynamically changing environments. Initial strides have been made towards devising new strategies for swarm intelligence algorithms, augmenting their search capabilities through formulations specifically designed to modulate the balance between exploration and exploitation. These efforts underscore the potential of bio-inspired methods to autonomously modulate their behavior. Some recent examples can be seen in [17][18][19][20][21][22][23][24][25][26].
Swarm-based intelligence methods have firmly established themselves as a robust framework for navigating and optimizing search spaces. However, the integration of machine learning into these algorithms represents a pivotal advancement, significantly amplifying their efficacy [27]. By harnessing machine learning, these algorithms gain unprecedented capabilities to learn from, adapt to, and effectively address the intricacies of complex problem spaces.
This synthesis of swarm intelligence and machine learning facilitates the creation of algorithms that not only traverse search spaces with enhanced efficiency but also demonstrate a superior ability to identify and utilize optimal solutions across varied problem domains. This interdisciplinary fusion is driving noteworthy progress in computational intelligence, equipping swarm-based optimization strategies with a level of adaptability and learning proficiency previously unattainable [28,29].
Such evolutionary strides are crucial for the development of sophisticated methods tailored to meet the challenges posed by complex optimization landscapes. At the heart of this transformative shift is the goal to harmonize the inherent exploratory capabilities of swarm intelligence algorithms with the predictive and analytical prowess of machine learning. Achieving this equilibrium ensures a dynamic interplay between exploration and exploitation, enabling the algorithms to navigate and solve complex problems more effectively [30].
A closer examination of the fusion of metaheuristics with reinforcement learning unveils a robust corpus of research dedicated to augmenting the efficiency of optimization algorithms through this multidisciplinary approach [31]. This integration has led to significant advancements across several research domains. Innovations in local search optimization [32], dynamic parameter tuning [33], and the identification of promising search areas represent key areas of progress [34]. Studies have demonstrated the efficacy of this integration in enhancing algorithmic intelligence and adaptability, contributing to fields ranging from optimization challenges in theoretical contexts to practical applications such as neural network training and cloud computing load balancing [35,36]. This collective body of work highlights a pivotal shift towards more sophisticated, efficient, and adaptive optimization strategies capable of addressing complex problems across a wide range of disciplines [37].
Integrating bio-inspired techniques and machine learning also plays a fundamental role in feature selection. This approach is an essential component in the effectiveness of machine learning models. Recent research has explored various methodologies and techniques to enhance feature selection. For instance, ref. [38] offers a comprehensive review of the use of metaheuristic algorithms in feature selection, suggesting paths for future research in developing or modifying these algorithms for classification tasks. In [39], the performance of recent metaheuristic algorithms in feature selection is evaluated, highlighting the challenges and opportunities in this field.
The models presented in [40,41] demonstrate the application of hybrid approaches that combine optimization techniques and deep reinforcement learning for feature selection in network intrusion detection. Additionally, ref. [42] compiles a series of research on feature selection, where a taxonomy of objective functions is presented and related metrics are classified into four categories: classifiers, metaheuristics, features, and statistical tests. This classification facilitates effective comparison of different feature selection methodologies, focusing on metaheuristics' implementation and hybridization strategies.
Finally, ref. [43] introduces a technique for predicting customer churn in the telecommunications sector. The Archimedes optimization algorithm is employed for optimal feature selection, combined with a hybrid deep learning model for customer churn prediction. The study is notable for its focus on efficient feature selection and hyperparameter tuning, achieving significant improvements in the accuracy and effectiveness of the predictive model. The study's results indicate the high efficiency of the proposed technique compared to other methods, achieving outstanding accuracy.

Preliminaries
In this section, we introduce the conceptual framework of our study, focusing on the biomimetic principles underlying the Orca predator algorithm, providing a formal description of deep reinforcement learning through deep Q-learning, and outlining the formulation of feature selection. This framework forms the foundation of our approach, integrating these distinct yet complementary components to address the challenges at hand.

Biomimetic Orca Predator Algorithm
The Orca predator algorithm is inspired by the predatory tactics of orcas, leveraging their strategic hunting behaviors as a foundation for its operational framework [44]. Recognized for their complex social structures and cooperative hunting strategies, orcas utilize echolocation to navigate and communicate within their aquatic territories, orchestrating coordinated assaults on their prey. In this algorithmic model, a potential solution is conceptualized as an n-dimensional vector, with the collective solution space denoted by X = [x_1, x_2, ..., x_n]^T.
OPA operates through two primary stages: the chase and attack phases, each embodying distinct aspects of orca hunting patterns. The chase phase comprises two distinct actions: herding the prey towards the water's surface and surrounding it to prevent escape. This is regulated by a parameter p, which dictates the likelihood of executing either action, based on a comparison with a random value r within the interval [0, 1]. Should p < r, the algorithm opts for the herding strategy; otherwise, it proceeds with encirclement. The subsequent attack phase mirrors the orcas' approach to narrowing the gap with their prey, employing precision and cooperation to secure their target. The efficacy of OPA hinges on its ability to mimic these intricate behaviors, requiring accurate sensory data and collaborative decision-making processes to optimize the search for solutions.
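The p-versus-r gating described above can be sketched in a few lines. This is an illustrative fragment only; the function and variable names are ours, not part of the original algorithm:

```python
import random

def select_phase(p: float) -> str:
    """Choose the chase sub-strategy as described in the text:
    draw r ~ U(0, 1); if p < r, herd the prey, otherwise encircle it."""
    r = random.random()
    return "herding" if p < r else "encirclement"
```

With p close to 0 the herding branch dominates, while p close to 1 almost always yields encirclement, matching the role of p as a likelihood parameter.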

Chase Phase
This strategy operates under two distinct environmental scenarios. The initial scenario emerges when a diminutive shoal of fish constricts the spatial domain available for the orca's navigation. Conversely, the second scenario unfolds as an expansive shoal of fish broadens the orca's operational hunting territory. In response to these varying conditions, the algorithm delineates two specific approaches for engaging with the prey.
The dynamics of velocity and spatial positioning are computed by Equations (1)-(4). In this strategy, v^t_{chase,1,i} delineates the velocity according to the initial chase tactic, while x^t_{chase,1,i} marks the corresponding spatial coordinate of the i-th orca at moment t. Analogously, v^t_{chase,2,i} and x^t_{chase,2,i} are defined for the alternative chase strategy. The variable x^t_i is the general coordinate of the i-th orca at time t, and x^t_{best} is the optimal position among the orcas, signifying proximity to the prey or the most efficient strategy at time t. The term M quantifies the mean position within the orca assembly, with a, b, and d representing random variables uniformly distributed in [0, 1], and e spanning the range [0, 2]. The parameter F encapsulates the influence or attractive force exerted by one agent upon another, and q, lying within [0, 1], dictates the likelihood of selecting a specific method for prey pursuit.
With the prey now accessible at the water's surface, the subsequent maneuver involves encircling the target. Orcas, utilizing sonar for communication, ascertain their ensuing positions through coordination with proximate members of their pod. It is posited that orcas adjust their locations based on the coordinates of three orcas chosen at random. The calculation of their positions after this maneuver is articulated through Equations (5)-(7), providing a mathematical representation of this coordinated movement.
where MaxIter denotes the maximal iteration count, whereas j_1, j_2, and j_3 represent three distinct orcas chosen at random from the collective, ensuring j_1 ≠ j_2 ≠ j_3. The variable x^t_{chase,3,i,k} corresponds to the spatial coordinate of the i-th orca after adopting the tertiary pursuit strategy at iteration t.
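The three-orca coordination can be illustrated as follows. This is a schematic sketch only: Equations (5)-(7) are not reproduced here, so the centroid-based move and the `weight` step factor are our own illustrative assumptions; the only property taken from the text is that j_1, j_2, and j_3 are distinct and chosen at random.

```python
import random

def encircle_update(positions, i, weight=1.0 / 3.0):
    """Illustrative encirclement step: pick three distinct pod members
    j1 != j2 != j3 (excluding orca i) and move orca i toward their centroid.
    `weight` is a hypothetical step factor, not a parameter from the paper."""
    candidates = [j for j in range(len(positions)) if j != i]
    j1, j2, j3 = random.sample(candidates, 3)  # sample() guarantees distinctness
    dims = len(positions[i])
    centroid = [(positions[j1][k] + positions[j2][k] + positions[j3][k]) / 3.0
                for k in range(dims)]
    # step partway from the current position toward the three-orca centroid
    return [positions[i][k] + weight * (centroid[k] - positions[i][k])
            for k in range(dims)]
```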
Throughout the chase, orcas leverage acoustic signals to ascertain the prey's whereabouts, modulating their spatial orientation in response. This instinctive behavior prompts orcas to either persist in the pursuit, contingent on the perceived proximity of the prey, or to halt and maintain their current position if the prey seems to be distancing itself.

Attack Phase
After encircling their prey, orcas execute a coordinated attack by taking turns entering the formed circuit to feed. Upon feeding, they revert to their original positions within the enclosure to allow another orca to partake in the feeding process. Assuming that four orcas correspond to the optimal attack positions within the circle, other orcas may choose to join the circuit based on their positional preferences. The direction in which orcas return to the enclosure circle after feeding, in preparation for replacement, is determined by the positions of randomly selected neighboring orcas. This behavioral pattern is precisely characterized through Equations (8)-(10).
In this context, v^t_{attack,1,i} symbolizes the velocity vector of the i-th orca engaged in prey pursuit at iteration t, while v^t_{attack,2,i} denotes the velocity vector of the i-th orca returning to the rendezvous circuit during the same iteration. The positions of the four orcas optimally positioned within the circuit are represented by x^t_{1st}, x^t_{2nd}, x^t_{3rd}, and x^t_{4th}. Additionally, j_1, j_2, and j_3 are three orcas randomly selected from the population during the pursuit, with the constraint j_1 ≠ j_2 ≠ j_3. The positional update of the i-th orca after the attack phase at iteration t is expressed as x^t_{attack,i}. Furthermore, g_1 is a random value chosen from the interval [0, 2], and g_2 is a random number within the range [−2.5, 2.5].
Subsequent to this phase, orcas employ sonar to detect the prey's location and adjust their positions accordingly, mirroring the pursuit process. The lower boundary limit (lb) of the problem's potential range dictates the orcas' positional adjustments, conforming to the stipulations outlined in the algorithm proposed by [44].
Algorithm 1 details the orca predator procedure. The pseudocode requires the definition of inputs such as population size (S), probabilities for selecting methods (p and q), and dimensionality (n), aiming to produce the best solution. Thus, the algorithm begins by initializing the objective function and randomly computing the first generation of S orcas (Line 1). For each orca and each decision variable, positions and velocities are assigned randomly (Lines 4-6). The fitness of each orca is calculated using the objective function defined in the optimization problem, and the best orca is identified. If an orca is better than the best solution achieved so far, that solution is updated.
Algorithm 1: Pseudocode for the orca predator method.
Input: S: population size; p: probability to select the driving-prey or encircling-prey method; q: probability to select a method when the orca group is small or large; n: dimensionality. Result: the best solution achieved.
Subsequently, the algorithm enters a loop that produces generations of orcas until a predefined iteration limit is reached (Line 16). Within this loop, each orca is processed. The algorithm selects three random orcas (Line 18). Based on the probability p, the algorithm decides whether to perform the chase phase to drive or encircle the prey for each variable (Lines 19-29). If p is greater than a random value, the chase phase is executed by either driving the prey or encircling it based on the value of q (Lines 20-28). Otherwise, the attack phase is performed for each variable (Lines 29-33).
The fitness of each orca is computed again (Line 34), and the best solution achieved is updated if a better orca is found (Lines 36-40). The algorithm concludes by returning the post-processed results and visualization (Line 42).
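The overall control flow of Algorithm 1 can be summarized in a compact sketch. The position-update operators are passed in as functions because Equations (1)-(10) are not reproduced here; only the loop structure, the p-gated choice between chase and attack, and the best-solution bookkeeping follow the pseudocode.

```python
import random

def opa_sketch(fitness, init, chase, attack, S=20, p=0.5, max_iter=50):
    """Structural sketch of the OPA main loop (lower fitness assumed better).
    fitness: maps a position to a score; init: returns one random position;
    chase/attack: stand-ins for the paper's Equations (1)-(10)."""
    pop = [init() for _ in range(S)]          # Line 1: initial generation
    best = min(pop, key=fitness)              # identify the best orca
    for _ in range(max_iter):                 # Line 16: generation loop
        for i in range(S):
            if random.random() < p:           # Lines 19-28: chase phase
                pop[i] = chase(pop[i], best)
            else:                             # Lines 29-33: attack phase
                pop[i] = attack(pop[i], best)
            if fitness(pop[i]) < fitness(best):  # Lines 36-40: update best
                best = pop[i]
    return best
```

A usage example: minimizing |x| with both operators moving halfway toward the best orca plus a small random perturbation.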

Deep Reinforcement Learning
Reinforcement learning (RL) is an advanced machine learning technique focusing on autonomous agents striving to maximize rewards through their actions [45]. These agents learn by trial and error, identifying actions that lead to greater rewards, both immediate and long-term, a distinguishing feature of RL [46].
In the RL process, agents continually interact with their environment, involving key components such as the value function, policy, and occasionally a model of the environment [47][48][49][50]. The value function evaluates the effectiveness of the agent's actions in the environment, and the agent's policy adjusts based on received rewards.
Q-learning, a fundamental technique in RL, focuses on establishing a function to assess the effectiveness of a specific action a_t in a given state s_t at time t [51,52]. The Q function is updated using Equation (11):

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ],    (11)

where α is the learning rate, γ is the discount factor, and r_{t+1} is the reward obtained after performing action a_t. Deep reinforcement learning (DRL) integrates deep learning techniques with RL, enabling it to address problems of greater complexity and dimensionality [53]. In DRL, deep neural networks approximate value functions or policies. Deep Q-learning is a DRL technique that uses a neural network to approximate the Q value function, which measures the expected cumulative reward for acting in a particular state. The Q value function is learned through an iterative process as the agent takes actions in the environment and receives rewards.
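A minimal tabular version of the Q-learning update of Equation (11) can be written as follows, with the Q table held in a plain dictionary (the data structure is an illustrative choice):

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    Q maps (state, action) pairs to values; unseen pairs default to 0."""
    next_vals = [v for (s2, _), v in Q.items() if s2 == s_next]
    max_next = max(next_vals) if next_vals else 0.0
    td_target = r + gamma * max_next          # bootstrapped return estimate
    q_sa = Q.get((s, a), 0.0)
    Q[(s, a)] = q_sa + alpha * (td_target - q_sa)
```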
In DQL, the Q function is defined as Q(s_t, a_t; θ), where s_t represents the current state, a_t the action taken by the agent at time t, and θ the network weights [54]. The Q function is updated using Equation (12):

Q(s_t, a_t; θ) ← Q(s_t, a_t; θ) + α [ r_{t+1} + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; θ^−) − Q(s_t, a_t; θ) ],    (12)

where s_{t+1} and a_{t+1} represent the new state and action at time t + 1, respectively. The parameter α, known as the learning rate, regulates the magnitude of the update to the Q value function at each training step. A high value of α accelerates the adaptation of the Q function to environmental changes, which is beneficial in the initial stages of learning or in highly dynamic environments. However, an excessively high learning rate can cause a disproportionate response to random variations in rewards, leading to instability in learning [54]. On the other hand, a lower α promotes a more gradual and stable learning process but may prolong the time required to achieve convergence. The factor γ, known as the discount factor, assigns relevance to future rewards compared to immediate ones. A value close to 1 encourages the agent to give almost equal importance to long-term rewards as to current ones, thereby incentivizing strategies that aim to maximize long-term benefits.
In contrast, a lower γ prioritizes short-term rewards, which is advantageous in scenarios where future rewards are less predictable or when effective policies are required over reduced time horizons. Finally, r_{t+1} is the reward received after executing action a_t in state s_t, and θ^− represents the parameters of a target neural network that is periodically updated with the values of θ to improve training stability.
A distinctive feature of DQL is its use of replay memory, which constitutes a crucial part of its learning architecture [55,56]. Replay memory stores the agent's past experiences in the form of tuples ⟨s_t, a_t, r_{t+1}, s_{t+1}⟩. Each tuple represents an individual experience of the agent, where s_t is the current state, a_t the action taken, r_{t+1} the reward received, and s_{t+1} the subsequent state. This approach of storing and reusing past experiences enhances the efficiency and effectiveness of the learning process, allowing the agent to learn from a more diverse set of experiences. It also reduces the correlation between consecutive learning sequences, a critical aspect to avoid over-dependence on recent data and to promote more generalized learning. Additionally, the mini-batch technique is implemented for sampling experiences from replay memory during the training process [57]. Instead of learning from a single experience at each step, the algorithm randomly selects a mini-batch of experiences. This batch sampling method contributes to the stability of learning by promoting the independence of samples and allows for more efficient use of computational resources.
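A minimal replay memory with mini-batch sampling might look like this (the capacity and interface are illustrative choices, not the paper's configuration):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of <s_t, a_t, r_{t+1}, s_{t+1}> tuples.
    Mini-batches are drawn uniformly at random, which decorrelates
    consecutive experiences during training."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences evicted first

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```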
Finally, learning in DQL is guided by a loss function, given in Equation (13), which measures the discrepancy between the estimated and target Q values:

L(θ) = E[ (y − Q(s_t, a_t; θ))² ],    (13)

where y is the target value, calculated by Equation (14):

y = r_{t+1} + γ max_{a′} Q(s_{t+1}, a′; θ^−).    (14)

Here, r_{t+1} is the reward received after taking action a_t in state s_t, and γ is the discount factor, which balances the importance of short-term and long-term rewards. The term max_{a′} Q(s_{t+1}, a′; θ^−) represents the maximum estimated value for the next state s_{t+1}, according to the target network with parameters θ^−. Q(s_t, a_t; θ) is the Q value estimated by the evaluation network for the current state s_t and action a_t, using the current parameters θ. In each training step in DQL, the loss is backpropagated through the evaluation network based on a batch of experiences randomly sampled from the replay memory. The evaluation network's parameters θ are then updated by minimizing the loss function via stochastic gradient descent (SGD). Every several steps, the target network's parameters θ^− are updated by copying the latest parameters θ. After a period of training, the two neural networks train stably.
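The target computation of Equation (14) and the loss of Equation (13) can be sketched over a mini-batch, with a plain function standing in for the frozen target network (the representation of states and networks here is an illustrative simplification):

```python
def dqn_targets(batch, q_target, gamma=0.99):
    """Compute y = r_{t+1} + gamma * max_a' Q(s_{t+1}, a'; theta^-) per
    Equation (14). `q_target(s)` returns the target network's Q values
    for every action in state s (a stand-in for the frozen network)."""
    return [r + gamma * max(q_target(s_next)) for (_, _, r, s_next) in batch]

def mse_loss(y, q_pred):
    """Equation (13): mean squared error between targets and evaluated Q values
    (the empirical counterpart of the expectation over the mini-batch)."""
    return sum((yi - qi) ** 2 for yi, qi in zip(y, q_pred)) / len(y)
```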
Lastly, the epsilon-greedy strategy [58] in DQL crucially balances exploration and exploitation. It does this by initially favoring exploration, selecting actions randomly with probability ϵ, and gradually shifting towards exploitation, choosing actions that maximize the Q function with probability 1 − ϵ. As the agent gains more knowledge, ϵ decreases, effectively transitioning from exploration to exploitation. This approach prevents the agent from settling into suboptimal strategies, ensuring adaptability and learning from diverse experiences.
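A compact sketch of the epsilon-greedy rule with a decaying ϵ follows; the decay schedule shown is illustrative, not the paper's exact setting:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon explore (random action index);
    otherwise exploit the action with the highest estimated Q value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decay_epsilon(epsilon, rate=0.995, floor=0.01):
    """Anneal epsilon toward `floor`, gradually shifting the agent
    from exploration to exploitation (hypothetical schedule)."""
    return max(floor, epsilon * rate)
```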

Feature Selection
Feature selection is an integral component of data preprocessing that identifies and selects a subset of relevant features (variables, predictors) for use in model construction [59]. It is crucial in enhancing the efficiency of machine learning models by eliminating redundant and irrelevant data, thus streamlining the computational process and potentially improving model performance [60,61].
Formally, let us consider a dataset with a set of features D = {x_1, x_2, x_3, ..., x_n} and a class label Y. The objective of feature selection is to find a subset D′ ⊂ D such that |D′| < |D| and the predictive power of D′ with respect to Y is maximized. This process involves evaluating the importance of each feature in D and retaining only those that contribute significantly to our understanding of Y.
There are three primary approaches to feature selection [62,63]: filter methods, wrapper methods, and embedded methods. Filter methods rank features based on statistical measures independent of any learning algorithm. Wrapper methods use a predictive model to score feature subsets and are computationally intensive, as they involve training models on different subsets. Embedded methods perform feature selection as part of the model training process and are specific to particular algorithms.
We focus on the wrapper approach, particularly its application in selecting features that provide high classification accuracy while minimizing computational resources. In this approach, subsets of features are iteratively evaluated using a specific learning algorithm, and the subset that yields the highest classification performance is selected. This process can be computationally expensive due to the need to train a model for each subset.
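The wrapper loop reduces to the following sketch, where the `score` callback stands in for training and evaluating any learning algorithm on one candidate subset (e.g. cross-validated accuracy):

```python
def wrapper_select(subsets, score):
    """Wrapper-style selection: evaluate each candidate feature subset with
    a learner-backed scorer and keep the best-scoring one. `score` is a
    stand-in for any model-training-and-evaluation routine."""
    best_subset, best_score = None, float("-inf")
    for subset in subsets:
        s = score(subset)  # expensive: one model trained per subset
        if s > best_score:
            best_subset, best_score = subset, s
    return best_subset, best_score
```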
Our work utilizes an improved bio-inspired metaheuristic algorithm to optimize the feature selection process. Each potential solution in this algorithm represents a subset of features encoded as a binary string, where a one indicates the inclusion of a feature and a zero its exclusion. The algorithm iteratively refines these solutions, guided by the learning model's performance on the corresponding subsets [64]. The performance of these subsets is evaluated using metrics such as the F1 score and accuracy. The F1 score provides a balance between precision and recall, while accuracy measures the proportion of correctly predicted instances [65].
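The binary encoding and the two evaluation metrics can be made concrete as follows (the positive class is assumed to be labeled 1, as in standard binary F1):

```python
def decode_mask(mask):
    """Map a binary string to the selected column indices (1 = keep)."""
    return [i for i, bit in enumerate(mask) if bit == 1]

def accuracy_f1(y_true, y_pred):
    """Accuracy and binary F1 from label lists (positive class = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return acc, f1
```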
Feature selection is undoubtedly a vital step in the data preprocessing phase, aiding in reducing the dimensionality of the dataset, curtailing overfitting, and enhancing the generalizability and efficiency of the learning models [61]. Therefore, our work contributes to this field by applying an advanced metaheuristic approach to optimize the feature selection process, thereby striking a balance between model complexity and computational efficiency.

Developed Solution
Our proposed solution integrates the Orca predator algorithm with deep Q-learning to dynamically optimize OPA's parameters. This approach leverages the predatory behavior of orcas and DQL's management of large state and action spaces to enhance feature selection efficiency. OPA, inspired by orca hunting strategies [44], offers an innovative optimization approach, while DQL addresses high-dimensional space challenges in complex combinatorial problems [66].
DQL plays an essential role in transitioning towards exploitation, particularly in the later stages of the optimization process. As OPA explores the solution space by mimicking orcas' collaborative and adaptive strategies in their natural habitat, DQL refines this exploration by targeting the most promising regions. This shift from exploration to exploitation is efficiently managed through DQL's epsilon-greedy policy, significantly enhancing the selection of optimal actions as the model accumulates knowledge.
OPA operates as a metaheuristic where each orca represents an independent search agent, exploring the solution space characterized by a binary vector. These vectors represent selected features from the database, with each vector being individually evaluated through a variety of machine learning algorithms, such as decision trees (DT), random forests (RF), support vector machines (SVM), and extremely randomized trees (ERT). The F1 score and accuracy derived from these evaluations provide crucial feedback to OPA about the quality of the solutions, thereby allowing the orcas to adjust their strategies and explore bounded promising areas of the solution space.
DQL intervenes in this dynamic adaptive process by providing a sophisticated reinforcement learning-based strategy to regulate the operational parameters of OPA. DQL accumulates experiences from the evaluations of the orcas, including rewards based on the diversity of solutions and performance metrics. These experiences are essential for updating the decision policies in OPA, precisely calibrating the probabilities and parameters that guide both the exploration and exploitation of the search space. Through the implementation of replay memory, DQL learns from past experiences, enabling it to anticipate and maximize future rewards. This leads to a significant and iterative improvement in feature selection, constantly refining the solution search process.
Figure 1 illustrates in detail the flow of the proposed solution process, highlighting the synergy between OPA and DQL in this advanced optimization method. It underscores how the interaction between these two components leads to a more efficient and effective search for optimal solutions, combining OPA's nature-inspired exploration with DQL's outcome-oriented and adaptive learning strategy. The core of our methodology is embodied in the pseudocode of Algorithm 2. The algorithm begins by taking data from a dataset, such as its dimensionality and row data, with the goal of finding the best solution. It then initializes a swarm of orcas and sets up the initial state and action parameters based on performance metrics.
Algorithm 2: Pseudocode for the improved orca predator method. The main loop of the hybridization continues until an iteration limit equal to that of the original version is reached. During each iteration, every orca in the swarm is processed. First, three random orcas are selected, and the position and velocity of each orca are updated; these two steps are equivalent to the original version. Next, a machine learning technique is invoked to compute fitness (Line 8). This process uses the solution stored in the orca, an n-dimensional vector composed of zeros and ones. The machine learning method trains a model using only the selected columns (the ones of the vector).
We place strong emphasis on critical performance metrics in evaluating our machine learning model, notably the F1 score and accuracy. We have adopted a comprehensive multiobjective strategy to enhance the model's effectiveness. This approach does not just focus on maximizing these essential metrics but also includes a solution diversity metric and a reduction component aimed at optimizing the balance between the number of features used and the total available features. This optimization is crucial for streamlining our algorithm and making it more efficient in feature selection.
Integrating individual performance metrics into a unified measure known as the feature efficiency index (FEI) is central to our study. The FEI is designed to capture the holistic performance of the model by synthesizing the F1 score, accuracy, diversity of solutions, and effectiveness of the feature reduction strategy. This composite measure is calculated through a carefully designed linear scalarization method, as detailed in Equation (15), allowing us to gauge the overall efficiency and effectiveness of the model in utilizing the selected features. In this formula, ω(i,j) represents the weights assigned to each pair of objective functions, ensuring that the total influence of each function is balanced and that the sum of all weights equals 1. This normalization is critical, as it maintains the proportionality and relevance of each function relative to the others, ensuring that no single metric disproportionately influences the FEI. The c values represent upper bounds to minimize within single-objective functions, ensuring that the FEI strives for optimal values across all dimensions.
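To make the aggregation concrete, the sketch below shows a linear scalarization in the spirit of the FEI. It is an illustration, not a reproduction of Equation (15): the specific weights, the unit normalization of each objective, and the sign convention for diversity (minimized, so it enters as its complement) are all assumptions.

```python
def fei(f1, acc, diversity, reduction, weights=(0.3, 0.3, 0.2, 0.2)):
    """Aggregate the four objectives into one score in [0, 1].
    f1, acc, and reduction are maximized; diversity is minimized,
    so it contributes as (1 - diversity), assuming all inputs are
    already normalized to [0, 1]. Weights are illustrative."""
    assert abs(sum(weights) - 1.0) < 1e-9  # weights must sum to 1
    w1, w2, w3, w4 = weights
    return w1 * f1 + w2 * acc + w3 * (1.0 - diversity) + w4 * reduction

# Example: a solution with strong F1/accuracy, low diversity, good reduction.
score = fei(0.9, 0.92, 0.1, 0.8)
```

Because the weights sum to 1 and each normalized term lies in [0, 1], the composite score also stays in [0, 1], which keeps the index comparable across models and runs.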
The terms f_{i or j}(x⃗) and e_{i or j}(x⃗_best) refer to the current and best values of the objective functions, respectively, allowing the FEI to adjust dynamically as improvements are found during the optimization process. The objective functions f1 and f2 aim to maximize the F1 score and accuracy, respectively, while f3 seeks to minimize the diversity (heterogeneity) of solutions, and f4 focuses on maximizing the use of non-dominant features (the reduction strategy). With that, a reward is generated based on performance metrics: changes in the F1 score, accuracy, diversity, and reduction are computed by comparing the current values with previous values.
where δ, ϵ, ζ, and η serve as weights, each assigned to quantify the significance of the corresponding term in the reward calculation. The terms F1 score_new and F1 score_previous represent the F1 scores of the current and previous best orcas, respectively. Similarly, accuracy_new and accuracy_previous refer to the accuracies of the current and previous best orcas. For both F1 scores and accuracies, values close to 1 indicate good performance. Diversity is evaluated using the Euclidean distance between the positions of each pair of orcas across the entire swarm. For a set of orcas, with each individual's position denoted by the vector p_i, diversity is determined in accordance with Equation (17).
where S denotes the total number of orcas, n represents the dimension of the solution space, and p_ik and p_jk are the k-th decision variables of solutions p_i and p_j, respectively. Values close to 0 denote good performance. Finally, the reduction metric is calculated as the percentage of columns not accessed by the most proficient orca relative to the total number of available columns. Again, values close to 1 indicate good performance. The reward function is thus tailored to encourage solutions that not only enhance performance in terms of F1 score and accuracy but also maintain diversity and utilize a reduced number of features. Next, the updating of states and actions is conducted through the DQL process (Line 12). In this process, the state is a vector composed of the current performance metrics and the diversity of the orca swarm, represented as follows: state = [F1 score, accuracy, diversity, reduction]. In the training phase of the DQL process, the formulation of states is a critical aspect, as it provides the necessary context for decision-making [67]. This state design (Equation (18)) ensures that all critical dimensions of the problem are represented, allowing DQL to make informed adjustments to OPA's operational parameters. This information is compiled in real time and feeds into DQL to facilitate continuous and adaptive adjustment decisions.
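A small sketch of the diversity and reward computations described above follows. The mean pairwise Euclidean distance mirrors Equation (17); the reward is a weighted sum of metric improvements in the spirit of the δ, ϵ, ζ, η formulation, but the specific weight values and the sign used for the diversity term are assumptions.

```python
import numpy as np

def swarm_diversity(positions):
    """Mean pairwise Euclidean distance between orca positions;
    identical positions give exactly 0 (best possible value)."""
    S = len(positions)
    total, pairs = 0.0, 0
    for i in range(S):
        for j in range(i + 1, S):
            total += np.linalg.norm(positions[i] - positions[j])
            pairs += 1
    return total / pairs if pairs else 0.0

def reward(new, prev, weights=(0.4, 0.4, 0.1, 0.1)):
    """Weighted sum of improvements; `new`/`prev` are dicts with keys
    f1, acc, diversity, reduction. The weights stand in for δ, ϵ, ζ, η
    and are illustrative; diversity is rewarded when it decreases."""
    d, e, z, h = weights
    return (d * (new["f1"] - prev["f1"])
            + e * (new["acc"] - prev["acc"])
            + z * (prev["diversity"] - new["diversity"])
            + h * (new["reduction"] - prev["reduction"]))

r = reward({"f1": 0.9, "acc": 0.9, "diversity": 0.1, "reduction": 0.8},
           {"f1": 0.8, "acc": 0.8, "diversity": 0.2, "reduction": 0.7})
```

A positive reward thus signals that the swarm improved on the composite criteria since the previous iteration.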
The actions in the DQL process are decisions on how to adjust OPA's operational parameters, identified as a, b, d, and e. These parameters are vital in defining the movement and adaptation strategies of the orcas in OPA. Specifically, a and b control the intensity and direction of the orcas' movement, respectively, influencing their capacity to explore the solution space. On the other hand, d and e determine the orcas' sensitivity to environmental signals, affecting their ability to adapt and converge towards optimal solutions.
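One plausible way to decode the 40 discrete actions (the action size reported in the experimental setup) over these four parameters is for each action to pick one parameter and one of ten adjustment levels. This mapping is a hypothetical sketch; the paper does not specify the exact encoding.

```python
import numpy as np

# Hypothetical decoding: action // 10 selects the parameter,
# action % 10 selects a multiplicative step from damping to boosting.
PARAMS = ["a", "b", "d", "e"]
LEVELS = np.linspace(0.5, 1.5, 10)

def apply_action(params, action):
    """Return a copy of `params` with one parameter rescaled,
    clipped to the [0, 1] range the parameters are drawn from."""
    p = dict(params)
    name = PARAMS[action // 10]   # which of a, b, d, e to adjust
    scale = LEVELS[action % 10]   # how strongly to adjust it
    p[name] = float(np.clip(p[name] * scale, 0.0, 1.0))
    return p

new = apply_action({"a": 0.8, "b": 0.5, "d": 0.9, "e": 0.3}, action=7)
```

The clip keeps every adjusted parameter inside the [0, 1] interval used at initialization, so no action can push OPA into an invalid configuration.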
In the DQL process, two key neural networks are implemented: (1) the evaluation network, which is constantly updated with new experiences, and (2) the target network, which provides a stable estimation of Q-function values and is periodically updated to stabilize the learning process [68]. This approach helps mitigate issues associated with data correlation and rapid changes in Q-estimation, which can lead to unstable learning. The loss function in DQL is critical for calibrating the evaluation network. Huber loss, a combination of the mean squared and mean absolute errors, is employed, as it is less sensitive to outliers in predictions and provides more stable training. The loss is calculated by comparing the predictions of the evaluation network with the target values generated by the target network. Here, the gradient descent algorithm plays a crucial role, adjusting the network parameters by minimizing this loss function and allowing the model to learn from its errors effectively.
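The two ingredients just described can be sketched in a few lines: the Huber loss is quadratic for small residuals and linear for large ones, and the target network is refreshed only at fixed intervals. The 50-step interval below matches the experimental setup; the helper names are illustrative.

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """Huber loss: quadratic for |residual| <= delta, linear beyond,
    making it less sensitive to outlier Q-value targets."""
    r = np.abs(y_true - y_pred)
    quad = 0.5 * r ** 2
    lin = delta * (r - 0.5 * delta)
    return np.where(r <= delta, quad, lin)

def sync_target(step, eval_params, target_params, every=50):
    """Copy the evaluation network's weights into the target network
    every `every` training steps; otherwise leave the target as-is."""
    return list(eval_params) if step % every == 0 else target_params
```

Holding the target weights fixed between syncs is what decorrelates the regression targets from the network being trained, which is the stabilization effect cited above.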
During DQL training, each accumulated experience, based on the performance and diversity of the orcas' solutions, contributes to updating the parameters to be improved. The aim is to achieve an effective balance between exploring new areas and exploiting the most promising ones in the solution space. The parameter tuning is carried out through an iterative reinforcement learning process, where the DQL model evaluates the impact of different configurations of these parameters on the overall system performance. Through the reward function, DQL identifies which adjustments to these parameters enhance the efficacy of the search, thus guiding the evolution of the exploration and exploitation strategy in OPA.
The parameters of OPA are optimized using the DQL training function, which employs a neural network for continuous estimation of the Q-function values, along with a memory buffer to store experiential data. The neural network facilitates the DQL model's learning process, enabling it to maximize expected long-term rewards by refining its predictions based on feedback from environmental interactions. Concurrently, the memory buffer plays a crucial role by preserving a record of the experiences accumulated during these interactions. Each recorded experience encapsulates the system's current state, the action executed, the reward obtained, and the subsequent state reached after the action. This historical data empowers the model to extract lessons from previous scenarios, thereby boosting its predictive accuracy and adaptability in new situations, a critical aspect for enhancing decision-making efficiency over time.
Action selection follows an epsilon-greedy strategy, ensuring a proper mix between exploration and exploitation. After executing an action, the new state and reward are observed (Lines 21-22) and stored in the memory buffer. Once sufficient experiences are accumulated, a mini-batch is extracted to train the evaluation network. This training process involves updating the target network with state-action pairs and their corresponding target values, simultaneously adjusting the values of the parameters (Lines 26-27). This iterative DQL training method is reintegrated into the main algorithm, allowing optimization cycles to continue. The procedure progressively refines the solutions proposed by the orcas through a reward that reflects both the quality of a solution in terms of model performance and the exploitation of new potential solutions. With each iteration, DQL adapts and improves its policy, refining OPA's search for focused exploitation in a promising area of the solution space and leading to more precise and efficient feature selection for machine learning models (Line 16).
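The replay memory and epsilon-greedy selection described above can be sketched as follows; capacity and batch size are illustrative values, not the paper's exact configuration.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) tuples;
    old experiences are evicted automatically once capacity is reached."""
    def __init__(self, capacity=2000):
        self.buffer = deque(maxlen=capacity)

    def push(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation of experiences.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

memory = ReplayMemory()
for t in range(100):  # fill with dummy experiences for illustration
    memory.push(([0.0] * 4, t % 40, 0.0, [0.0] * 4))
batch = memory.sample(32)
```

Decaying epsilon from 1.0 towards its floor shifts this policy from nearly random exploration to nearly pure exploitation, which is exactly the transition the hybrid model relies on.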
Finally, regarding computational complexity, our metaheuristic exhibits O(kn) complexity, where n denotes the dimension of the problem and k represents either the number of iterations or the population size, encapsulating the total number of function evaluations throughout the algorithm's execution. We also evaluate the complexity of the DQL algorithm, which is generally O(MN) [69,70], with M indicating the sample size and N the number of network parameters, both essential for thorough dataset analysis. Despite the potentially large values of M and N, they remain fixed, making the integration of DQL a justifiable increase in computational demand for superior outcomes. Additionally, the continual advancement in computing technology significantly mitigates the impact of this heightened complexity.

Experimental Setup
We used a rigorous quantitative methodology to evaluate our proposal, comparing the hybrid approach with the traditional version of the optimization algorithm. Moreover, we tested each machine learning method independently. This involved detailed planning, execution, and assessment, including statistical testing and in-depth analysis.

Methodology
To evaluate the enhanced metaheuristic's performance rigorously, we have established a comprehensive quantitative methodology that aligns with the principles outlined in [71]. Our approach involves a comparative analysis between the solutions provided by our proposed hybridization, the native version of the optimization algorithm, and the results generated by machine learning working alone. To this end, we employed the following methodological strategy:

•
Preparation and planning: Define specific multi-objective goals for feature selection effectiveness, aiming to minimize the number of selected features while simultaneously maximizing accuracy and the F1 score. Design experiments to systematically evaluate the enhanced technique under controlled conditions, ensuring a balanced optimization of these criteria.

•
Execution and assessment: Perform a multi-faceted evaluation of the technique, assessing not only the quality of the solutions generated but also the computational efficiency and convergence properties. Employ rigorous statistical tests to compare the performance with baseline methods. Here, to evaluate data independence and statistical significance, we use the Kolmogorov-Smirnov-Lilliefors test for assessing sample autonomy and the Mann-Whitney-Wilcoxon test for comparative analysis. This approach involves calculating the fitness from each of the executions per instance.

•

Analysis and validation: Conduct a thorough in-depth analysis to understand the influence of deep Q-learning's parameters and the orca predator algorithm's behavior on the feature selection task. This involves iterating over a range of hyperparameters to fine-tune the model, using the dataset to validate the consistency and stability of the selected features. To ensure the validity of the results generated by our proposal, we conducted tests to evaluate the final outcomes, ensuring that all simulated experiments were carried out reliably.

Dataset
Our research utilized the "ECG Heartbeat Categorization Dataset" from Kaggle [15], which comprises records of ECG signals indicative of various heart conditions. These signals are classified into five categories: Normal (N), Supraventricular (S), Ventricular (V), Fusion (F), and Unclassified (Q). The dataset is rich, with 188 distinct features per record, encapsulating the complexity of our problem space, represented as n. The dataset is divided into two primary files: the training set, containing 87,554 records, and the test set, with 21,892 records. For our analysis, we further split the test set equally into two parts: one half (10,946 records) is utilized for cross-validation to fine-tune the model parameters, while the remaining half is reserved for the final testing phase to evaluate the model's performance. This significantly reduces the risk of overfitting.
We employed a strategic sampling approach to address class imbalance during training. For classes 1 (S), 2 (V), and 4 (Q), which had a larger number of records, we randomly selected a subset of 1000 records each. This was done to ensure that these classes did not overwhelm the learning process. In contrast, for class 3 (F), which had fewer instances, we included all 641 available records. This selective sampling was crucial to maintain a balanced representation across all classes, enhancing the classifier's ability to identify each category accurately and without bias.
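This capped subsampling can be sketched as below; the class sizes are synthetic stand-ins for the real record counts, and the helper name is illustrative.

```python
import numpy as np

def balanced_sample(y, caps, rng=None):
    """Return sorted row indices where each class c keeps at most caps[c]
    randomly chosen records (all records if the class has fewer)."""
    rng = rng or np.random.default_rng(0)
    keep = []
    for c, cap in caps.items():
        idx = np.flatnonzero(y == c)
        if len(idx) > cap:
            idx = rng.choice(idx, size=cap, replace=False)
        keep.extend(idx.tolist())
    return sorted(keep)

# Synthetic label vector: class 3 is the minority, as in the paper.
y = np.array([1] * 5000 + [2] * 3000 + [3] * 641 + [4] * 4000)
idx = balanced_sample(y, caps={1: 1000, 2: 1000, 3: 1000, 4: 1000})
```

The cap only bites for majority classes; the minority class passes through untouched, matching the 641-record class F case described above.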
During the testing or prediction phase, we utilized all available records. This comprehensive approach allowed us to assess the model's performance across a diverse and complete range of data. In both the training and testing stages, we applied all 188 features in the dataset to the machine learning techniques, ensuring a thorough analysis and utilization of the available data. This exhaustive feature application was key to comprehensively training and evaluating the machine learning models, providing us with a detailed understanding of the dataset and the effectiveness of our approach.

Implementation Aspects
We conducted empirical assessments of the Orca predator algorithm. These trials involved independent executions of four distinct machine learning models: decision tree (DT), random forest (RF), support vector machine (SVM), and extremely randomized trees (ERT). In subsequent stages, we evaluated the hybridization of the OPA algorithm with each of these classification techniques.
In [44], the parameter values yielding the best average performance regarding swarm behavior are described. Based on this, we developed a set of preliminary evaluations on a reduced test scenario in order to find the best initial configuration [72]. Setting q to 0.9 notably boosts OPA performance, because the algorithm has more opportunities to search widely during the exploration phase rather than relying on a local search procedure. These tests produced good results, and we observed that the a and d parameters take values closer to 1 than to 0. The b value is kept random without exhibiting unpredictable behavior. Finally, these experiments show that the F value is always close to 2. Consequently, our proposal will be challenged to operate and improve in an initially adverse situation. We extracted results from 30 runs per algorithm, each run iterating 100 times with S = 10 orcas. In order to identify how different parameter values affect the final performance, we plotted them against the performance metrics, enabling us to analyze trends and the optimal values that enhance the model's effectiveness. This analysis reveals which parameters are critical for the algorithm's performance.
We established the following parameters for the deep Q-learning configuration: the state size was set to 5, with an action size of 40 to encompass a broad range of potential parameter adjustments in OPA. We constructed a sequential neural network with dense layers and dropout for regularization. The network consists of hidden layers with 24 neurons each, using ReLU activation, and a final layer of size equal to the number of possible actions (40), with linear activation. The Huber loss function and the RMSprop optimizer with a learning rate of 0.001 were used. Regarding the epsilon-greedy policy, we started with an epsilon value of 1.0, decaying to a minimum of 0.01 to balance exploration and exploitation. Using a double Q-learning approach, the network trains with random experiences drawn from a minibatch during training and updating. The target network is updated every 50 training steps to stabilize the estimation of Q-values. The initial values for the parameters a, b, d, and e, used in moving the orcas and updating their positions and velocities, are randomly generated in the range [0, 1] at the start of each OPA execution. Table 1 summarizes the initial parameter values. Another key aspect is the adaptation of OPA to the binary domain, which necessitates a binarization phase following alterations to the solution vector. In this instance, the transformation function utilized was the conventional sigmoid function. This entails that if 1/(1 + e^(−x)) > ϕ, with ϕ representing a random value uniformly distributed within the interval [0, 1], then discretization is achieved by setting x ← 1. Conversely, should this condition not hold, x ← 0 is applied.
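The sigmoid binarization step can be sketched directly from the rule above; the function name and the seeded random generator are illustrative choices.

```python
import numpy as np

def binarize(x, rng=None):
    """Map a continuous position vector to {0, 1}: component x_k becomes 1
    when 1/(1 + exp(-x_k)) > phi_k, with phi_k drawn uniformly from [0, 1)."""
    rng = rng or np.random.default_rng(0)
    phi = rng.uniform(0.0, 1.0, size=len(x))
    sig = 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))
    return (sig > phi).astype(int)

bits = binarize([2.5, -3.0, 0.1, 10.0])
```

Strongly positive components are almost always mapped to 1 and strongly negative ones to 0, while values near zero remain stochastic, preserving some exploration even after discretization.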
Finally, we used Python 3.10 to code all algorithms. The computer used for each test had the following specifications: macOS 14.2.1 (Darwin kernel version 23) with an Ultra M2 chip and 64 GB of RAM. All codes are available in [73].

Results and Discussion
The initial comparative analysis between our approach and the traditional version of OPA focuses on runtime. We consider the time it takes to generate the solution vector, encompassing all necessary actions and the time required to process and evaluate fitness accuracy. The time spent on the classification phase using learning techniques is not considered, because this phase is part of the model's training process, where the primary focus is on achieving high accuracy and reliability rather than speed.
OPA achieves the best runtime in all the tests performed, with a value of 200 ms. In contrast, the best time achieved by OPADQL is 500 ms, which is even worse than the average time of OPA, equivalent to 220 ms. There are no instances where OPADQL's runtime is lower than OPA's. However, it is important to note that although OPADQL does not outperform OPA in runtime, the difference is negligible. The anticipated benefit in terms of prediction accuracy and feature reduction justifies the additional execution time required by OPADQL.
Furthermore, we acknowledge the importance of examining how the selected features vary across different runs. To this end, we conducted several runs of the OPA and OPADQL methods, analyzing the consistency in feature selection. We found that, although there is inherent variability due to the stochastic nature of the algorithms, certain features tend to be selected more frequently, indicating their potential relevance to the problem at hand. The discarded features are usually related to constant or poorly varying values, since they do not provide useful information for differentiating between heartbeat categories, as well as those that offer redundant information.
Continuing with the experimental results, Tables 2 and 3 present our computational findings. The tables are structured into five sections, each comprising six rows corresponding to evaluation metrics. We define the best value as the optimal performance attained for each combination of model and feature selection technique, highlighted using bold font. Conversely, the worst value signifies the least effective yield. The mean value provides an average of the results, while the standard deviation (std) quantifies the variability in findings. Additionally, the median and interquartile range (iqr) represent the middle value and the spread of the middle 50% of performances. Regarding the columns, the notation n/o denotes the absence of an optimizer method, indicating that algorithms were run without optimization techniques to enhance performance. In this context, all machine learning methods are employed to classify the dataset using all features. The best-performing methods are then identified, and their results are stored for comparative analysis. OPA stands for the bio-inspired optimizer lacking a learning component, and OPADQL represents our enhanced version of OPA. Finally, we employ the F1 score, accuracy, diversity, reduction, and FEI as metrics, all defined in the previous section. In terms of the F1 score and accuracy, the OPADQL model exhibits a remarkable and consistent enhancement in performance when contrasted with the native OPA and non-optimized learning methods.
This improvement is particularly pronounced in models employing RF and SVM, underscoring the advanced capability of OPADQL in managing intricate feature selection challenges. This integration enables a more nuanced and precise feature selection process. Consequently, it leads to a significant boost in model performance, as evidenced by the higher F1 scores and accuracy rates, which are critical indicators of a model's predictive power and reliability in classification tasks.
Concerning the diversity metric, an essential measure of the variation and uniqueness of the solutions generated, both OPA and OPADQL showcase commendable performance. However, OPADQL distinguishes itself by achieving marginally lower (i.e., better) values. This slight but critical advantage highlights the efficacy of the hybrid model in sustaining comprehensive exploitation of the solution space.
OPADQL's ability to maintain a diverse set of solutions while simultaneously homing in on the most effective ones exemplifies its balanced approach to optimization. This diverse exploitation ensures that the algorithm does not prematurely converge to suboptimal solutions, thereby enhancing the robustness and reliability of the outcomes.
The results of the reduction metric are particularly striking. In this aspect, OPADQL outshines OPA by a significant margin, indicating a more proficient approach to minimizing the number of features while preserving, or even enhancing, model accuracy. This attribute of OPADQL is especially beneficial in dealing with high-dimensional data, where reducing the feature set without compromising the model's effectiveness is a challenging but crucial task. The ability to selectively reduce features contributes to more streamlined models, reducing computational complexity and enhancing efficiency.
Finally, when considering the feature efficiency index, which amalgamates all the above-discussed metrics, the superiority of OPADQL becomes even more evident.
The FEI values for OPADQL consistently surpass those of the native OPA and non-optimized methodologies across all tested machine learning techniques. This comprehensive index affirms the overall effectiveness of OPADQL in feature selection. By excelling in multiple metrics, including accuracy, diversity, reduction, and the FEI, OPADQL proves itself a robust, versatile, and effective tool for feature selection. Its ability to deliver high-quality results across different evaluation criteria and machine learning models signifies a significant advancement in the field, offering a sophisticated solution for complex feature selection challenges.
The charts deployed in Figure 2 enrich our comprehension of the effectiveness of OPA (left side) and OPADQL (right side) across different ML models. These graphical illustrations reveal the data's distribution and density, highlighting OPADQL's supremacy in several dimensions. Specifically, Figure 2a,b, showcasing the F1 score and accuracy of OPADQL, respectively, demonstrate its notable superiority, indicating more accurate and dependable classification performance.
Regarding diversity, OPADQL exhibits an outstanding yield by achieving slightly lower values than its native version (see Figure 2c). This performance is due to the intrinsic nature of the native algorithm, which attempts to explore promising areas.
Regarding feature reduction, OPADQL is noted for its capability to efficiently reduce the number of features without sacrificing model precision (refer to Figure 2d). Moreover, in the context of the feature efficiency index (refer to Figure 2e), OPADQL showcases a comprehensive advantage, underscoring its efficiency in feature selection across diverse ML models. This evidence underlines OPADQL's robustness and adaptability, positioning it as a crucial tool for overcoming feature selection challenges. Additionally, we employed two-dimensional charts to represent the multivariate variables influencing the performance of these algorithms (see Figure 3). In general terms, OPADQL exhibits superiority over OPA across the various ML models. For example, Figure 3a, which shows the DT learning model, reveals OPADQL's distinct advantage across all performance metrics, accentuating its selection efficacy. Similarly, Figure 3b, related to the RF technique, highlights OPADQL's exceptional outcomes, particularly in F1 score and accuracy, with a noted reduction in diversity, suggesting a more targeted feature selection.
In the SVM context (see Figure 3c), OPADQL's performance in F1 score and accuracy further proves its effective classification capability, alongside a marked decrease in solution diversity. Lastly, Figure 3d, corresponding to the ERT model, demonstrates OPADQL's superiority in all evaluated metrics, showcasing its ability to balance feature selection efficiency with precision and diversity maintenance. These findings affirm OPADQL's ability to navigate the complexities of various ML models, optimizing feature selection to enhance model performance while ensuring a balance between diversity and accuracy.

Statistical Test
To show that our proposal overcomes the native version, we employ a robust statistical strategy based on the generated results. This strategy includes two contrasting hypotheses: (a) evaluation of normality and (b) hypothesis testing to ascertain whether the samples originate from a uniformly distributed sequence. To assess whether the observations (runs per instance) follow a Gaussian distribution, we define H0 to propose that the samples adhere to a normal distribution, with H1 positing the contrary. For that, the Kolmogorov-Smirnov-Lilliefors test was utilized. The significance level for the p-value was set at 0.05, indicating that results falling below this threshold are significant and consequently lead to the rejection of the null hypothesis H0.
The findings revealed that the samples do not conform to a normal distribution, allowing the use of the non-parametric Mann-Whitney-Wilcoxon test. In this context, H0 states that there is no difference between the distributions of the two samples, while H1 states that P(samples_OPA < samples_OPADQL), i.e., OPADQL tends to produce better values. The following table shows the Mann-Whitney-Wilcoxon test results. The acronym sws indicates statistically without significance.
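The one-sided comparison can be sketched with SciPy as follows. The two samples below are synthetic stand-ins for the 30 fitness values per method, not the paper's actual results, so only the testing mechanics (not the numbers) carry over.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Synthetic fitness samples: 30 runs per method, OPADQL shifted upward.
rng = np.random.default_rng(0)
fitness_opa = rng.normal(0.80, 0.02, size=30)
fitness_opadql = rng.normal(0.85, 0.02, size=30)

# H1: OPADQL fitness tends to be larger than OPA fitness (one-sided test).
stat, p_value = mannwhitneyu(fitness_opadql, fitness_opa, alternative="greater")
significant = p_value < 0.05
```

A p-value below the 0.05 threshold rejects the null hypothesis of equal distributions in favor of OPADQL's superiority; results above the threshold would be reported as sws.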
Table 4 shows that for the F1 score and accuracy metrics, the results are highly significant across all models, with p-values far below the 0.05 threshold, illustrating OPADQL's superior performance. In terms of diversity, significant differences are observed in the RF and SVM models, as indicated by their p-values, whereas the DT and ERT models show statistically insignificant differences. For the reduction and FEI metrics, the p-values again highlight OPADQL's enhanced efficiency across most models, demonstrating its effectiveness in feature selection and overall model performance optimization. Tables 5 and 6 present a comparative analysis between OPADQL and various state-of-the-art algorithms, including the genetic algorithm (GA) [74], particle swarm optimization (PSO) [75], the Bat Algorithm (BAT) [76], Black Hole Optimization (BH) [77], the Grey Wolf Optimizer (GWO) [78], Golden Eagle Optimization (GEO) [79], the Reptile Search Algorithm (RSA) [80], and a random strategy.
This analysis is conducted using the machine learning models and key metrics used in Tables 2 and 3. Additionally, we include a random strategy for the construction of the solution vector for feature selection; however, this strategy is not applicable to the diversity metric due to the generation of a unique random solution. The best results are highlighted using bold font. In relation to DT, OPADQL surpasses all other algorithms in F1 score, with a notable difference compared to the second-best result (GWO). This trend remains consistent across other metrics such as accuracy and FEI, where OPADQL also achieves the best results. In RF, OPADQL continues to be the most effective algorithm, reaching the highest values in accuracy and F1 score and surpassing the results of the other state-of-the-art algorithms. Additionally, OPADQL exhibits an exceptional ability to reduce the number of features without compromising model accuracy, as reflected in superior reduction and FEI values compared with the other algorithms. In SVM, OPADQL performs outstandingly, surpassing the other algorithms in several key metrics. Although GWO presents a slightly higher F1 score, OPADQL stands out in the accuracy metric and in FEI, highlighting its ability to identify feature sets that are not only predictively powerful but also efficient in terms of the number of features used. Finally, in ERT, OPADQL continues to demonstrate its superiority, achieving the highest values in both accuracy and F1 score compared to the state-of-the-art algorithms and the random strategy. This efficacy is particularly notable in the reduction metric, where OPADQL equals or surpasses the other methods, evidencing its exceptional skill in maintaining a simplified model without compromising performance.
The superiority of OPADQL is significantly attributable to its advanced configuration, which incorporates a higher number of adjustable parameters in comparison to other state-of-the-art algorithms. This additional complexity, far from being an obstacle, becomes a decisive advantage thanks to the integration of DQL. As a central component of OPADQL, DQL grants a decisive edge by enabling a dynamic and detailed adaptation of these parameters according to the complexities of the data and the variety of the solution space.
OPADQL's ability to effectively adjust these additional parameters contrasts with the rigidity of traditional state-of-the-art algorithms, which, while efficient within their own frameworks, lack the flexibility to adapt to the complex variations inherent in high-dimensional data.
The integration of DQL into OPADQL not only enhances performance through a more accurate and relevant feature selection but also reflects the algorithm's capacity to learn and continuously improve through experience, adjusting its internal parameters to efficiently explore the search space and exploit the most promising regions.
The convergence graphs for the DT, RF, SVM, and ERT models presented in Figure 4 consistently demonstrate OPADQL's superior performance on the FEI metric, surpassing the state-of-the-art algorithms and the random strategy across iterations. While the other algorithms also show progress, they do not reach OPADQL's level of efficacy. The random strategy shows the lowest performance, highlighting the importance of meticulous parameter selection and adjustment. OPADQL not only achieves a higher FEI score, but its curve also suggests faster convergence than the other algorithms, which is particularly notable in the SVM and ERT models, where the competition is closer. Across these models, OPADQL demonstrates its ability to efficiently optimize feature selection, adapting to different dynamics and data requirements. Its robustness and versatility become evident as it maintains a clear advantage over the alternative algorithms in a wide variety of machine learning contexts, demonstrating its potential for practical applications by effectively improving model performance. Finally, once the optimal result is achieved, we analyze the solution vectors and observe ten dominant columns, consistently active, indexed by [124, 23, 0, 112, 36, 38, 54, 9, 134, 35]. Conversely, the ten columns least present in the vectors are indexed by [72, 132, 4, 18, 91, 113, 76, 102, 92, 2]. In both cases, the lists are sorted from best to worst. ECG datasets typically do not label columns; instead, they use serial time or continuous signals, so indexes are generally used to identify them.
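The dominant and least-present column indexes above come from counting how often each feature is active across the final binary solution vectors. A short sketch of that ranking, on synthetic masks rather than the actual OPADQL output:

```python
import numpy as np

rng = np.random.default_rng(42)
n_features, n_solutions = 188, 30
# Each row is one binary solution vector (True = feature selected).
solutions = rng.random((n_solutions, n_features)) < 0.5

counts = solutions.sum(axis=0)        # activations per column
order = np.argsort(counts)            # ascending by presence
dominant = order[::-1][:10].tolist()  # ten most frequently active columns
least_present = order[:10].tolist()   # ten least frequently active columns

print("dominant:", dominant)
print("least present:", least_present)
```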

Conclusions
This study introduces a hybrid optimization method that synergizes the biomimetic prowess of the Orca predator algorithm with the adaptive intelligence of deep Q-learning. The approach was designed to enhance feature selection in machine learning, enabling the construction of more efficient models by selecting the most relevant subset of features. Our approach, however, is not confined to classification tasks. When applying the method to regression tasks, certain adaptations are required: instead of relying on the F1 score and accuracy, we recommend analogous metrics such as mean squared error, root mean squared error, and mean absolute error. These metrics are broadly used for their ability to capture the discrepancy between predicted and actual values, providing a clear measure of prediction accuracy and model performance in continuous output spaces.
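The metric swap suggested for regression can be sketched as follows; since these errors are minimized rather than maximized, the fitness sign would flip relative to the classification setting. The values below are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared residuals
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute residuals

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}")
# residuals are (0.5, -0.5, 0.0, -1.0), so MSE = 0.375 and MAE = 0.5
```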
To rigorously evaluate the efficacy of our proposed model, we deployed it on the ECG Heartbeat Categorization Dataset, characterized by its high-dimensional feature space of 188 attributes. We further validated our approach across a diverse array of machine learning models to ensure a comprehensive assessment under varied modeling techniques.
The results of the extensive evaluation clearly demonstrate that the Orca predator algorithm hybridized with deep Q-learning significantly outperforms the standalone OPA across all tested machine learning models. The findings show that the proposal tends to increase the values of a and b while reducing the value of q, signifying a decrease in exploration probability, that is, an intensification of the search process. Accordingly, our analysis revealed that OPADQL consistently achieved higher F1 scores and accuracy, effectively enhancing the predictive performance of decision tree, random forest, support vector machine, and extremely randomized tree models. Furthermore, OPADQL exhibited a superior ability to reduce the feature space without compromising model effectiveness, as evidenced by its higher reduction scores and feature efficiency index (FEI). These results underscore OPADQL's enhanced optimization capability and highlight its potential to significantly streamline the feature selection process, making it a valuable tool for building more robust and efficient machine learning models.
Based on all generated results, we can infer that our proposal outperforms traditional a posteriori techniques such as random permutation feature importance, thanks to the use of sophisticated metaheuristics for an exhaustive analysis of the feature space. This approach not only pinpoints essential features and their synergies but also significantly enhances model generalization. Although our technique may require more time, it delivers superior accuracy, robustness, and dependability, vital for applications where precise outcomes are paramount.
Finally, some future studies in this domain could focus on exploring the integration of DQL with other bio-inspired optimization algorithms and advanced machine learning techniques, aiming to further enhance its feature selection capabilities. Additionally, adapting OPADQL to address multi-objective optimization problems presents an intriguing avenue for expanding its applicability across various disciplines. Another promising line of inquiry involves investigating the scalability and efficiency of OPADQL in processing extremely large datasets, particularly in fields such as genomics and text mining, where high-dimensional data are prevalent. These future directions not only promise to refine the efficacy and versatility of OPADQL but also open up new possibilities for innovative applications in the ever-evolving landscape of machine learning and data analysis.

Figure 1. Scheme of the hybrid proposal combining OPA and DQL for the feature selection problem.

Figure 2. Distributions for all metrics employed to analyze the effectiveness of OPA and OPADQL on machine learning methods. (a) Distribution of F1 score values obtained by OPA (left side) and OPADQL (right side). (b) Distribution of accuracy values obtained by OPA (left side) and OPADQL (right side). (c) Distribution of diversity values obtained by OPA (left side) and OPADQL (right side). (d) Distribution of reduction values obtained by OPA (left side) and OPADQL (right side). (e) Distribution of FEI values obtained by OPA (left side) and OPADQL (right side).

Figure 3. Radar analysis comparing OPA and OPADQL performance across all machine learning techniques. (a) DT radar analysis: performance comparison between OPA and OPADQL. (b) RF radar analysis: performance comparison between OPA and OPADQL. (c) SVM radar analysis: performance comparison between OPA and OPADQL. (d) ERT radar analysis: performance comparison between OPA and OPADQL.

Algorithm: OPADQL
Input: OPA's parameters; data from the dataset, i.e., dimensionality and row data for training.
Result: The best solution achieved.
    Initialize a swarm S of orcas
    state ← performance metrics
    action ← initial values for OPA's parameters
    // evaluate the initial swarm
    foreach orca o (∀ i = {1, . . ., S}) do
        if orca o is better than best solution achieved then
            Update best solution achieved
        end
    end
    // produce generations of S orcas
    while iteration up to limit do
        foreach orca o (∀ i = {1, . . ., S}) do
            Select randomly three orcas
            Update position and velocity of orca o
            Call training phase for orca o and compute its fitness via Equation (15)
            if orca o is better than best solution achieved then
                Update best solution achieved
            end
            Calculate reward based on performance metrics via Equation (16)
            state, action ← DQLProcess(state, action, reward) via Equation (18)
        end
    end
    return the best solution achieved; post-process and visualize results
Function DQLProcess(state, action, reward): updates the agent from the reward and returns the next state and action.
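The hybrid loop above can be sketched in a few lines. In this illustration the fitness is a toy sphere function and `dql_process` is a stub that merely nudges OPA's parameters toward exploitation (raising a and b, lowering the exploration probability q, as observed in the conclusions); it stands in for the paper's deep Q-network, and all names and update rules here are illustrative assumptions, not the published method.

```python
import numpy as np

rng = np.random.default_rng(1)
S, dim, iterations = 10, 5, 50
params = {"a": 0.5, "b": 0.5, "q": 0.9}  # action: OPA's tunable parameters

def fitness(x):
    """Toy objective (minimization); replaces the training-phase fitness."""
    return float(np.sum(x ** 2))

def dql_process(params, reward):
    """Stub for DQLProcess: reward-driven parameter update (not a real DQN)."""
    step = 0.01 if reward > 0 else -0.01
    params["a"] = min(1.0, max(0.0, params["a"] + step))
    params["b"] = min(1.0, max(0.0, params["b"] + step))
    params["q"] = min(0.9, max(0.1, params["q"] - step))
    return params

orcas = rng.standard_normal((S, dim))
best = min(orcas, key=fitness).copy()

for _ in range(iterations):
    for i in range(S):
        # Pick three random peers and chase the best of them (simplified OPA move).
        peers = orcas[rng.choice(S, 3, replace=False)]
        leader = min(peers, key=fitness)
        explore = rng.random() < params["q"]
        move = rng.standard_normal(dim) if explore else params["a"] * (leader - orcas[i])
        orcas[i] = orcas[i] + params["b"] * move
        if fitness(orcas[i]) < fitness(best):
            best = orcas[i].copy()
            params = dql_process(params, reward=1.0)   # improvement → positive reward
        else:
            params = dql_process(params, reward=-1.0)

print("best fitness:", round(fitness(best), 4), " q:", round(params["q"], 2))
```

For feature selection, the continuous orca positions would be binarized into selection masks and the fitness replaced by the classifier-based evaluation described earlier.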

Table 2. Comparative analysis of algorithms' performance: OPA vs. OPADQL for DT and RF.

Table 3. Comparative analysis of algorithms' performance: OPA vs. OPADQL for SVM and ERT.

Table 5. OPADQL for DT and RF vs. state-of-the-art algorithms.

Table 6. OPADQL for SVM and ERT vs. state-of-the-art algorithms.