A Novel Learning-Based Binarization Scheme Selector for Swarm Algorithms Solving Combinatorial Problems

Currently, industry is undergoing an exponential increase in binary-based combinatorial problems. In this regard, metaheuristics have been a common trend in the field when designing approaches to solve them successfully. Thus, a well-known strategy includes the employment of continuous swarm-based algorithms transformed to perform in binary environments. In this work, we propose a hybrid approach that contains discrete, smartly adapted population-based strategies to efficiently tackle binary-based problems. The proposed approach employs a reinforcement learning technique, known as SARSA (State–Action–Reward–State–Action), in order to exploit knowledge gathered at run time. To test the viability and competitiveness of our proposal, we compare discrete state-of-the-art algorithms smartly assisted by SARSA. Finally, we illustrate interesting results where the proposed hybrid outperforms other approaches, thus providing a novel option to tackle these types of problems in industry.


Introduction
High complexity problems in binary domains are a common sight in industry, along with high digitalization and the incorporation of artificial intelligence. Among well-known problems with great complexity, we can find the Set Covering Problem (SCP) [1], Knapsack Problem [2], Set-Union Knapsack Problem [3], and Feature Selection [4]. In order to solve these kinds of problems, the employment of exact methods can be unmanageable within restricted resources, such as computational time. Thus, approximate methods, such as metaheuristics (MH), which do not guarantee the optimality, but do obtain solutions as close as possible to the optimal in a reasonable computational time, have been a recurrent answer from the scientific community.
In the literature, there exist MH designed to address these problems without the need for modifications. However, it has been demonstrated that MH designed to work in continuous domains and assisted by a discretization scheme outperform the classic binary-based approaches [5]. The classic design includes the transformation of domains through two-step techniques. However, novel learning-based hybrids have been reported, which focus on improvements in the transformation process [2,6].
Hybrid methods have been designed as novel approaches that use multiple optimization tools. They have been a hot topic in the field, and several improvements over MH have been reported. Among the most relevant lines of research in the literature, four can be clearly distinguished: MH with "Mathematical Programming" [7]; hybridization between MH [8]; "Simheuristics", which interrelate MH with the simulation of problems [9]; and MH with Machine Learning (ML) [10][11][12].
In this work, we propose a hybrid approach composed of MH and ML, which includes continuous-based population algorithms supported by a learning-based binarization scheme. The novelty in the proposition concerns a multiple binarization scheme being balanced by a Reinforcement Learning (RL) technique, named SARSA, which is based on the run time. The main idea is to provide an adaptive binary-selector mechanism based on the knowledge generated by the processed dynamic data generated through the search, such as the diversity of solutions.
In RL approaches, the employment and management of rewards are well-known. In this work, five different types are considered in the reward system implemented: the global best, with a penalty, root adaptation, without penalty, and escalating adaptation. Regarding the population-based algorithms, in this work, the Sine-Cosine Algorithm (SCA), Harris Hawk Optimization (HHO), Whale Optimization Algorithm (WOA), and Grey Wolf Optimizer (GWO) are employed. This complete set of components profiting from the data generated at run time by population-based algorithms motivated the challenge of proposing a learning-based approach with the capability to self-adapt and improve through the search.
In order to prove the competitiveness of the proposed hybrid algorithm, experimentation tests were carried out against multiple state-of-the-art binarization strategies solving the SCP. Lastly, we highlight the good performance illustrated by the proposed approach proving to be a good alternative to solving binary optimization problems.
The rest of this paper is organized as follows. In Section 2, we present a detailed description of all the implemented population-based MH, the state-of-the-art binarization scheme, how MH has been supported by ML, and the optimization problem tackled. The proposed hybrid is illustrated in Section 3, where we describe the designed learning model and the details employing SARSA with the reward system. Section 4 presents the results obtained together with their respective tables and figures. Finally, a proper analysis and discussions are illustrated in Section 5, followed by our conclusions and future lines of work.

Related Work
In this section, we present all the required concepts related to the proposal in order to understand the ideas and objectives behind the design.

Sine-Cosine Algorithm
This MH was designed by Mirjalili in 2016 [13] and takes inspiration from the sine and cosine trigonometric functions. Sine-Cosine can be classified as a population-based algorithm where the population is randomly generated and subsequently perturbed by the following rule:

\[
X_{i,j}^{t+1} =
\begin{cases}
X_{i,j}^{t} + r_1 \cdot \sin(r_2) \cdot \left| r_3 P_j^{t} - X_{i,j}^{t} \right|, & r_4 < 0.5 \\
X_{i,j}^{t} + r_1 \cdot \cos(r_2) \cdot \left| r_3 P_j^{t} - X_{i,j}^{t} \right|, & r_4 \ge 0.5
\end{cases}
\]

where parameter r_1 and the uniform random numbers r_2, r_3, and r_4 are given in Equations (2)-(5), respectively. In this regard, parameter r_1 determines the direction of the motion, that is, towards or away from the best known solution. r_2 indicates the magnitude of the motion. r_3 determines how random the motion will be; thus, when r_3 > 1, the motion is highly stochastic. r_4 determines which equation will be employed; in other words, it determines the phase of the algorithm (exploration or exploitation) [14].
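As an illustration of this perturbation rule, the following Python sketch applies it to a whole population. The linear decrease of r_1 and the sampling ranges of r_2 and r_3 follow the usual SCA settings and are assumptions here, since Equations (2)-(5) are not reproduced in the text.

```python
import numpy as np

def sca_update(X, P, t, T, a=2.0, rng=np.random.default_rng()):
    """One SCA iteration: perturb every individual around the best position P.

    X : (n, d) population matrix, P : (d,) best solution found so far,
    t : current iteration, T : total iterations, a : upper bound of r_1.
    """
    r1 = a - t * (a / T)                      # moves towards/away from P, shrinking over time
    X_new = X.copy()
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            r2 = rng.uniform(0, 2 * np.pi)    # magnitude of the motion
            r3 = rng.uniform(0, 2)            # random weight on the best position
            r4 = rng.uniform(0, 1)            # switch between sine and cosine phases
            if r4 < 0.5:
                X_new[i, j] = X[i, j] + r1 * np.sin(r2) * abs(r3 * P[j] - X[i, j])
            else:
                X_new[i, j] = X[i, j] + r1 * np.cos(r2) * abs(r3 * P[j] - X[i, j])
    return X_new
```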

Harris Hawk Optimization
This MH was designed by Heidari et al. in 2019 [15] and named Harris Hawk Optimization (HHO). It was inspired by the cooperative hunting behavior of Harris hawks chasing their prey. In this regard, in each iteration, the best solution is assigned as X_rabbit and becomes the target for the rest of the population. Initially, for each hawk, its energy E and jump J are computed. E determines whether exploitation or exploration is performed. This energy decreases over time and can be interpreted as the prey getting weaker after eluding the attacks of the hawks. This situation can be mathematically modeled as follows.
\[
E = 2 E_0 \left( 1 - \frac{t}{T} \right), \quad t \in \{1, 2, ..., T\}, \; T = \text{maximum iterations} \quad (6)
\]

In each iteration, the initial energy level E_0 is randomly drawn from [−1, 1]. When E_0 decreases from 0 to −1, the prey is losing energy; when E_0 increases from 0 to 1, the prey is gaining energy. However, as the iterations progress, the current energy E follows a decreasing trend. While |E| ≥ 1, HHO performs exploration, and this changes to exploitation when |E| < 1. The exploration is mathematically modeled as follows.
\[
X_{i}^{t+1} =
\begin{cases}
X_{rand}^{t} - r_1 \cdot \left| X_{rand}^{t} - 2 r_2 X_{i}^{t} \right|, & q \ge 0.5 \\
\left( X_{rabbit}^{t} - X_{m}^{t} \right) - r_3 \cdot \left( LB + r_4 (UB - LB) \right), & q < 0.5
\end{cases}
\]

Thus, when q ≥ 0.5, the hawks randomly search the solution space. When q < 0.5, this represents a scenario where the hawks perch around the prey.
Additionally, X_i^{t+1} corresponds to the updated position of the current hawk, X_rand^t is a randomly selected hawk, X_rabbit^t is the position of the best solution, and X_i^t is the current position of the hawk. r_1 to r_4 and q are uniform random numbers in [0, 1], while LB and UB are the limits of the search space, and X_m^t is the mean location of the population.
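A minimal Python sketch of the exploration phase and the escaping-energy schedule described above is given below; the q ≥ 0.5 threshold and the energy schedule follow the standard HHO description and are assumptions where the text does not fix them.

```python
import numpy as np

def hho_energy(E0, t, T):
    """Escaping energy of the prey; |E| >= 1 triggers exploration, |E| < 1 exploitation."""
    return 2 * E0 * (1 - t / T)

def hho_exploration(X, X_rabbit, LB, UB, rng=np.random.default_rng()):
    """Exploration phase of HHO for a hawk population X of shape (n, d)."""
    n, d = X.shape
    X_new = X.copy()
    X_mean = X.mean(axis=0)                       # mean location of the population (X_m)
    for i in range(n):
        q, r1, r2, r3, r4 = rng.uniform(size=5)
        if q >= 0.5:                              # perch based on a randomly selected hawk
            X_rand = X[rng.integers(n)]
            X_new[i] = X_rand - r1 * np.abs(X_rand - 2 * r2 * X[i])
        else:                                     # perch relative to the prey and the mean position
            X_new[i] = (X_rabbit - X_mean) - r3 * (LB + r4 * (UB - LB))
    return np.clip(X_new, LB, UB)
```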
The exploitation strategies are carried out according to Equations (8), (9), (11), and (12). To decide which type of exploitation behavior is going to be used, the values of the current energy |E| and r are used. In this context, r corresponds to a random number in [0, 1], and when |E| ≥ 0.5 and r ≥ 0.5, we employ Equation (8).
where ΔX_i^t is the distance between the best position discovered thus far and the i-th hawk's present position. r_5 corresponds to a random number in [0, 1] and represents the rabbit's erratic hop in an attempt to escape the predator. When |E| ≥ 0.5 and r < 0.5, we apply Equation (9).
where D and S are, respectively, the dimensions of the problem and a D-size vector containing random numbers, while f (Y) and f (Z) are values of the objective functions for the given vectors. LF represents the Lévy flight, which can be represented with the Equation (10).
where µ and v are random numbers between [0,1], β is a constant with a value of 1.5. The value of 0.01 is used to control the step length, which can be changed to fit the problem landscape. When |E| < 0.5 and r ≥ 0.5, we apply Equation (11).
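A sketch of the Lévy flight of Equation (10), using β = 1.5 and the 0.01 step factor mentioned above, is given below; the closed-form expression for σ is the one commonly used with this operator and is an assumption, since the exact variant adopted by the authors is not reproduced here.

```python
import numpy as np
from scipy.special import gamma

def levy_flight(dim, beta=1.5, step=0.01, rng=np.random.default_rng()):
    """Lévy flight step as commonly used in HHO: LF = step * u * sigma / |v|^(1/beta).

    u and v are random numbers in [0, 1], as stated in the text; sigma follows the
    usual closed-form expression for the chosen beta.
    """
    sigma = (gamma(1 + beta) * np.sin(np.pi * beta / 2) /
             (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.uniform(size=dim)
    v = rng.uniform(size=dim)
    return step * u * sigma / np.abs(v) ** (1 / beta)
```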

Whale Optimization Algorithm
The Whale Optimization Algorithm (WOA) was designed by Mirjalili and Lewis in 2016 [16] and is inspired by humpback whale hunting behavior, particularly how they utilize a technique known as "bubble netting". WOA begins with a set of randomly generated solutions. The whales change their locations considering either a randomly selected whale or the best solution found thus far at each iteration. In this context, when Equation (13) yields |A| ≥ 1, a new random whale is picked. However, when |A| < 1, the best solution is chosen. On the other hand, through the parameter p, WOA decides between a spiral and a circular motion. In this regard, there are three motions that are critical:

1. Searching for prey (p < 0.5 and |A| ≥ 1): The whales randomly search for prey based on the position of each other. When the algorithm determines that |A| ≥ 1, we may say that it is exploring, allowing WOA to carry out a global search. This first motion is mathematically represented as follows: where t indicates the current iteration, A and C are coefficient vectors, and X_rand is a randomly selected position vector (i.e., a random whale) from the current population. The vectors A and C may be calculated using Equation (14): where a linearly decreases from 2 to 0 over the iterations (in both the exploration and exploitation phases) and r is a uniform random vector with values in [0, 1].

2. Encircling the prey (p < 0.5 and |A| < 1): When the whales find their target, they proceed to surround it. At the beginning, the optimal location is unknown; thus, each agent focuses on the nearest prey. After the best search agent has been identified, the other agents attempt to update their locations towards that agent. This movement is mathematically represented by Equation (15): where X* is the position vector of the best solution found thus far, and X is the current position vector. Equation (14) is used to compute the vectors A and C. It is worth noting that X* must be updated at each iteration if a better solution exists.

3. Bubble net attack (p ≥ 0.5): The "shrinking net method" is given by this movement. This behavior is accomplished by reducing the value of a in Equation (14). As the whale spins, the bubble net shrinks until the prey is captured. The following Equation (16) is used to represent this motion: where D is the distance between the i-th whale and the prey (the best solution obtained thus far), b is a constant employed to define the shape of the logarithmic spiral, and l is a random number in [−1, 1].
Moreover, humpback whales swim around their prey in a decreasing circle while also following a spiral trajectory. To simulate this behavior, there is a 50% chance of selecting either the encircling prey mechanism (2) or the spiral model (3) to update the location of the whales during optimization.
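Since the combined model is not reproduced here, the following Python sketch illustrates how the three motions and the 50% choice between encircling and the spiral could be put together; the scalar treatment of A and C and the linear decrease of a are assumptions based on the usual WOA formulation.

```python
import numpy as np

def woa_update(X, X_best, t, T, b=1.0, rng=np.random.default_rng()):
    """One WOA iteration combining search, encircling, and the bubble-net spiral.

    X : (n, d) population, X_best : (d,) best whale so far, b : spiral shape constant.
    """
    a = 2 - 2 * t / T                              # "a" decreases linearly from 2 to 0
    X_new = X.copy()
    for i in range(X.shape[0]):
        p = rng.uniform()
        A = 2 * a * rng.uniform() - a              # coefficient of Eq. (14), scalar form
        C = 2 * rng.uniform()
        if p < 0.5:
            if abs(A) >= 1:                        # search for prey (exploration)
                X_rand = X[rng.integers(X.shape[0])]
                D = np.abs(C * X_rand - X[i])
                X_new[i] = X_rand - A * D
            else:                                  # encircle the best solution (exploitation)
                D = np.abs(C * X_best - X[i])
                X_new[i] = X_best - A * D
        else:                                      # bubble-net (logarithmic spiral) attack
            D = np.abs(X_best - X[i])
            l = rng.uniform(-1, 1)
            X_new[i] = D * np.exp(b * l) * np.cos(2 * np.pi * l) + X_best
    return X_new
```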

Grey Wolf Optimizer
The Grey Wolf Optimizer (GWO) is inspired by the behavior of gray wolves. The hierarchy employed is led by the alpha wolf (α), which is followed by the beta (β) and delta (δ) wolves. The remaining members of the pack are referred to as omegas [17]. The global optimum represents the location of the prey, and the alpha, beta, and delta wolves are the closest to the prey. The rest of the pack, namely the omegas, update their positions in the search space based on the leaders. In GWO, in order to hunt prey, the following steps are required: encircle, stalk, raid, and search for prey.

1. Encircling the prey: The objective is for the pack to surround the prey; in order to carry out this movement, each wolf moves toward the target.

where t denotes the current iteration, X_p(t) denotes the prey's location in the t-th iteration, X(t) denotes the wolf's position, and D may be described as follows: Additionally, the coefficient vectors A and C of Equations (18) and (19) are computed as follows: where a is a parameter, and r_1 and r_2 are uniform random vectors with values from 0 to 1.

2. Stalking the prey until it stops: This action is carried out by the whole pack based on information provided by the α, β, and δ wolves, which are supposed to be aware of the position of the prey. This action may be mathematically represented as in Equation (20); a sketch of the resulting combined update is given after this list.

3. Attack the prey: The main parameter in this movement is a, which manages the exploration or exploitation performed by GWO, i.e., moving closer to or further away from the prey. In this regard, a is defined in [0, 2] and is mathematically illustrated as follows: where t is the current iteration and T is the total number of iterations. According to the original authors, the range of possible values for a enables a seamless transition between exploration and exploitation. Thus, when a is close to 0, the wolves are attacking the prey or, rather, the MH is exploiting the search space.

4. Search for prey: In order to hunt down their prey, the wolves disperse. This behavior is mimicked by setting the parameter a to a value closer to 2. It is worth noting that every wolf can discover a more suitable (ideal) prey. If a wolf approaches the prey, it becomes the new alpha, and the remaining wolves are classified as beta, delta, or omega according to their distance from the prey.
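The sketch below illustrates the combined GWO update referenced in Equation (20): each wolf averages the positions suggested by the α, β, and δ leaders. The coefficient computations follow the standard GWO formulas and are assumptions where the equations are not reproduced in the text.

```python
import numpy as np

def gwo_update(X, X_alpha, X_beta, X_delta, t, T, rng=np.random.default_rng()):
    """One GWO iteration: every wolf moves towards the average of the positions
    suggested by the alpha, beta, and delta leaders.

    a decreases linearly from 2 to 0, switching the pack from exploration to attack.
    """
    a = 2 - 2 * t / T
    X_new = np.empty_like(X)
    for i in range(X.shape[0]):
        candidates = []
        for leader in (X_alpha, X_beta, X_delta):
            r1 = rng.uniform(size=X.shape[1])
            r2 = rng.uniform(size=X.shape[1])
            A = 2 * a * r1 - a                     # coefficient vector A
            C = 2 * r2                             # coefficient vector C
            D = np.abs(C * leader - X[i])          # distance to this leader
            candidates.append(leader - A * D)      # position suggested by this leader
        X_new[i] = np.mean(candidates, axis=0)
    return X_new
```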
The four metaheuristics described above were created to solve continuous optimization problems. In order for continuous metaheuristics to solve discrete optimization problems, the solutions must be transferred to the discrete domain.

Two-Step Binarization Scheme
The methodology behind binarization techniques for continuous MH consists in transferring the values from the continuous domain of the MH to a binary domain; this is done to preserve the quality movements of continuous MH in order to generate quality binary solutions. Although there are MH that work in binary domains without the need to incorporate a binary scheme, continuous MH assisted by a binary scheme have proven to achieve great performance on multiple combinatorial NP-hard problems, for instance, the Binary Bat Algorithm [18], Particle Swarm Optimization [19], Binary Salp Swarm Algorithm [20], Binary Dragonfly [21], and Binary Magnetic Optimization Algorithm [22].
In the literature, among the binary schemes, two large groups can be defined. The first group contains the operators that do not alter the internal operations of the MH. In this regard, the two-step techniques, which have been the most used in the last decade [5], and the Angle Modulation technique [23] stand out. The second group includes the methods that alter the normal functioning of an MH, for instance, Quantum Binary [24] and Set-Based approaches, in addition to techniques based on clustering [2,6].
In the scientific community, the two-step binary schemes are of great relevance. They have been employed to tackle multiple types of problems [25]. This binarization scheme, as its name implies, is composed of two steps. The first step is the transfer function [19], which maps the values generated by the continuous MH to the continuous interval between 0 and 1. The second step is binarization, which consists in converting the number in that interval into a binary value (Figure 1). The authors of [26] introduced transfer functions to the optimization field. Their main advantage is the delivery of a probability between 0 and 1 at a low computational cost. There are two types of functions, the S-shaped [19,27] and the V-shaped [28], which are illustrated in Figure 2. For each type of function, four variations have been proposed (Table 1) [27].
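For reference, the commonly used S-shaped and V-shaped families (the usual candidates for Table 1) can be written as follows; the exact variants adopted in this work are assumptions based on the standard definitions in [27,28].

```python
import numpy as np
from scipy.special import erf

# Commonly used S-shaped and V-shaped transfer functions; each maps a continuous
# value produced by the MH to a probability in [0, 1].
S_SHAPED = {
    "S1": lambda x: 1 / (1 + np.exp(-2 * x)),
    "S2": lambda x: 1 / (1 + np.exp(-x)),
    "S3": lambda x: 1 / (1 + np.exp(-x / 2)),
    "S4": lambda x: 1 / (1 + np.exp(-x / 3)),
}
V_SHAPED = {
    "V1": lambda x: np.abs(erf(np.sqrt(np.pi) / 2 * x)),
    "V2": lambda x: np.abs(np.tanh(x)),
    "V3": lambda x: np.abs(x / np.sqrt(1 + x ** 2)),
    "V4": lambda x: np.abs(2 / np.pi * np.arctan(np.pi / 2 * x)),
}
```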

Second Step: Binarization
The binarization functions discretize the probability obtained from the transfer function and deliver a binary value. For this step, there are different techniques in the literature [29], such as those exemplified in Table 2 (Techniques of binarization [5]).
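Since Table 2 is not reproduced here, the sketch below illustrates typical textbook versions of some of the binarization rules mentioned later in the paper (Standard, Complement, Static probability, and Elitist); the exact formulas used by the authors may differ.

```python
import numpy as np

def binarize(prob, x_old, best_bit, rule, rng=np.random.default_rng()):
    """Second binarization step: map a transfer-function probability to a bit.

    prob : output of the transfer function, x_old : current bit of the solution,
    best_bit : corresponding bit of the best solution found so far.
    """
    r = rng.uniform()
    if rule == "standard":            # set the bit to 1 with probability `prob`
        return 1 if r < prob else 0
    if rule == "complement":          # take the complement of the current bit
        return 1 - x_old if r < prob else 0
    if rule == "static":              # static probability band with threshold alpha
        alpha = 1 / 3
        if prob <= alpha:
            return 0
        return x_old if prob <= (1 + alpha) / 2 else 1
    if rule == "elitist":             # copy the bit of the best solution found so far
        return best_bit if r < prob else 0
    raise ValueError(f"unknown rule {rule!r}")
```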

Set Covering Problem
The SCP is defined over a binary matrix A of m rows and n columns, where a_{i,j} ∈ {0, 1} is the value of each cell in the matrix A, with i and j indexing the m rows and n columns, respectively. A column j covers a row i if a_{ij} is equal to 1 and does not cover it otherwise. In addition, each column has an associated cost c_j ∈ C, where C = {c_1, c_2, ..., c_n}, and I = {1, 2, ..., m} and J = {1, 2, ..., n} are the sets of rows and columns, respectively.
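For reference, the standard integer-programming formulation of the SCP, consistent with the symbols defined above (with x_j = 1 if column j is selected and 0 otherwise), is:

\[
\min \sum_{j \in J} c_j x_j \quad \text{s.t.} \quad \sum_{j \in J} a_{ij} x_j \ge 1 \;\; \forall i \in I, \qquad x_j \in \{0, 1\} \;\; \forall j \in J
\]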

The Proposal: Binarization Scheme Selector
This paper proposes a binarization scheme selector that incorporates multiple transfer functions and discretization methods. The main objective is the smart selection, employment, and correct balancing of these methods, led by SARSA.
This novel binarization approach is based on the behavior of hyperheuristics, which have been proven to be effective for several issues [30]. The proposed design determines which types of binarization are more appropriate to apply in each iteration. The decision is based on the knowledge processed from dynamic data generated through the search at run time. In this context, more adequate binarization methods can be applied with a higher probability of achieving good results. Figure 3 and Algorithm 1 illustrate the proposed design of the binary scheme selector, where a key element is depicted as ∆, which represents the dimensional perturbation on each bit of the solution vector; in other words, it represents the perturbations performed by the MHs.
Algorithm 1 Data-driven dynamic discretization framework
1: Initialize a random swarm
2: Initialize Q-Table
3: for iteration (t) do
4:    Select action a_t for s_t from the Q-Table
5:    for solution (i) do
6:       for dimension (d) do
7:    Get immediate reward r_t
11:   Get the maximum Q value for the next state s_{t+1}

SARSA
Temporal Difference (TD) algorithms are well-known RL approaches that focus on the study of the environment, generate knowledge, and update the current state [31]. TD algorithms work with the difference between the current estimate of a state's value and the sum of the reward and the discounted value of the next state. These algorithms concentrate on state-to-state transitions and the values learned for the states.
Among the TD algorithms, there is SARSA, an online control algorithm and on-policy method [31]. In other words, SARSA algorithms are online because they update the action-value function estimate at the end of each step, without waiting for the terminal condition. Due to this, the Q-value is available to be used in the next state. They are control algorithms since they perform actions to achieve their purpose, which is the estimation of the optimal state-action value function.
On the other hand, they are on-policy, which means that the agent learns the value of the state-action pair based on the action performed and, thus, evaluating the current policy, unlike other techniques, such as Q-learning, which performs one policy and evaluates another.
These kinds of policies allow agents to learn to act optimally by experiencing the consequences of their actions without having to develop domain maps. The "environment" includes the current "states" in which the agent interacts and makes decisions, and there are several recognized states. In this context, each agent has a set of actions that cause a modification in the "reward" as well as in the subsequent state.
Thus, when the value reached is equal to one, the state is modified. In addition, when the agent selects an action to perform, it receives a reward for its decision. Rewards are delayed, and the agents must learn from the system to receive them. The value of the state-action pair is learned by the agent as a function of the action performed. Thus, when the value of the current state-action pair is updated, the next action a_{t+1} is performed.
In Figure 4, the state-to-state transitions are considered, and the values of each have been learned. To understand the algorithm, let us consider the transitions as pairs, state-action to state-action, where the values of the state-action pairs are learned. Formally, these cases are identical: both are Markov chains with a reward process. The theorems ensuring the convergence of state values under TD also apply to the corresponding algorithm for the action values. The update performed on the state-action pair can be defined as in Equation (24). After each transition, the update is performed, until a terminal state is reached. When a state s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) is defined as zero. Each transition is composed of five events: s_t, a_t, r_{t+1}, s_{t+1}, and a_{t+1} (State-Action-Reward-State-Action), providing the name of the SARSA algorithm. The generic SARSA procedure is given in Algorithm 2.
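Algorithm 2 itself is not reproduced here; the following Python sketch shows the standard on-policy SARSA update of Equation (24), together with an ε-greedy action selection, which is an assumption since the text does not detail the selection policy. The α and γ values follow the parameter settings reported in the experimental section.

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, eps, rng):
    """Pick a binarization-scheme action for the current state."""
    if rng.uniform() < eps:
        return int(rng.integers(n_actions))      # explore the action set
    return int(np.argmax(Q[state]))              # exploit the best-known action

def sarsa_step(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.4):
    """On-policy SARSA update: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```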

Rewards
The rewards in RL algorithms are a critical component of performance. It is such an important issue that several methods have been proposed in the literature [32][33][34][35]. The value assigned to r in the generic SARSA is determined by the type of reward computed from the chosen metric, as illustrated in Figure 5. In Table 3, we present detailed information about the rewards employed by SARSA. First, we use the reward with penalty employed by Xu Yue and Song Shin Siang in [32,33], respectively. This reward applies a fixed increment or reduction in value for actions that result in an improvement, or the absence of one, in the overall fitness. The reward without penalty, employed by Abed-alguni [34], attaches no penalty to the action committed. On the other hand, we have three additional sorts of incentives, which were reviewed and presented by Nareyek Alexander in [35].
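As a rough sketch, the with-penalty and without-penalty variants described above could look as follows; the fixed increment w and the exact values are assumptions, since the precise formulas are given in Table 3, which is not reproduced here.

```python
def reward(scheme, improved, w=1.0):
    """Hedged sketch of two of the reward types mentioned above.

    `improved` is True when the fitness metric improved in this iteration.
    """
    if scheme == "with_penalty":       # fixed increment on improvement, fixed reduction otherwise
        return +w if improved else -w
    if scheme == "without_penalty":    # no penalty is attached to a non-improving action
        return +w if improved else 0.0
    raise ValueError(f"unknown scheme {scheme!r}")
```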

Metric (getMetric)
Several metrics have been proposed in the literature. In this work, the fitness improvement metric is employed. The objective is directly related to the following scenarios: if the fitness function improves, SARSA will reward the action; if the fitness function remains unchanged, SARSA will apply a penalty as indicated by the type of reward.

State Determination (getState)
As is well known, metaheuristics have two phases that allow them to perform the optimization process. The phases are the exploration of the search space to find tentative regions with good solutions and the exploitation phase where the search for the best regions to find better solutions is intensified. Our proposal will have, as states for both Q-Learning and SARSA, the exploration and exploitation phases. To use these phases, we need to measure the exploration and exploitation of our algorithms. One of the techniques that stands out is the use of diversity metrics.
There are numerous methods for determining diversity [36]. In this work, diversity is computed by using the method proposed by Hussain Kashif et al. [37], which is expressed mathematically as:

\[
Div = \frac{1}{l \cdot n} \sum_{d=1}^{l} \sum_{i=1}^{n} \left| \bar{x}^{d} - x_{i}^{d} \right| \qquad (25)
\]

where Div represents the diversity status determination, \bar{x}^d denotes the mean value of the individuals in dimension d, x_i^d denotes the value of the i-th individual in dimension d, n denotes the population size, and l denotes the size of the individuals' dimension.
We consider the exploration and exploitation percentages to be XPL% and XPT%, respectively. The percentages XPL% and XPT% are computed from the study of Morales-Castañeda et al. [38] as follows: where Div represents the diversity state determined by Equation (25), and Div max denotes the maximum value of the diversity state discovered throughout the optimization process.
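A Python sketch of the diversity of Equation (25) and the percentages of Equations (26) and (27) is given below; the XPL%/XPT% formulas follow Morales-Castañeda et al. [38].

```python
import numpy as np

def dimensional_hussain_diversity(X):
    """Diversity of a binary/continuous population X with shape (n, l): mean absolute
    deviation from the dimension-wise mean, averaged over individuals and dimensions."""
    n, l = X.shape
    return np.sum(np.abs(X - X.mean(axis=0))) / (n * l)

def exploration_exploitation(div, div_max):
    """Percentages of exploration (XPL%) and exploitation (XPT%) from the diversity state."""
    xpl = 100.0 * div / div_max
    xpt = 100.0 * abs(div - div_max) / div_max
    return xpl, xpt
```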

Experimental
In order to determine if the integration of SARSA as a binary scheme selector improves the performance of an MH, five versions of SARSA corresponding to different types of rewards were implemented and compared against Q-Learning [39]. Table 4 illustrates the details corresponding to the name assigned to each approach and the type of reward employed. In this work, the performance comparison is carried out by analyzing five versions of SARSA, five versions of Q-Learning, and two well-known and recommended binarization strategies (Table 5). The 12 approaches described were applied to HHO, GWO, WOA, and SCA solving the SCP, as shown in Table 6, in order to demonstrate the robustness of our hybridization proposal. The configuration of the parameters of the four metaheuristics was carried out based on the recommendations of the original authors of each one.

Experimental Results
In Tables 7-14, the achieved results are illustrated. The detailed information presented in each table can be described as follows: the first column corresponds to the name of Beasley's instances (45 in total) [40], the second column is the best value known to date, and the next three columns (Best, Avg, and RPD) present the best value reached, the average, and the RPD obtained from the independent runs. The RPD corresponds to the Relative Percentage Deviation as defined in Equation (28). These three columns are repeated for all versions (BCL, MIR, QL1, SA1, QL2, SA2, QL3, SA3, QL4, SA4, QL5, and SA5).
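Equation (28) is not reproduced in the text; a common definition of the RPD, assuming Opt is the best known value and Best is the best value reached, is:

\[
RPD = 100 \times \frac{Best - Opt}{Opt}
\]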
Finally, the last row corresponds to the mean of each column, and we highlight in bold the best values reached. For each MH implemented, the population size employed was 40, and 1000 iterations were performed per run. With this, the stopping condition was 40,000 evaluations of the objective function, as employed in [29]. The implementation was developed in Python 3.8.5 and processed using the free Google Colaboratory services [41]. The parameter settings for the SARSA and QL algorithms were as follows: γ = 0.4 and α = 0.1.

In Tables 7 and 8, the approach SA1 leads with mean values for the columns Best, Avg, and RPD of 259.56, 263.21, and 1.78, respectively. Nevertheless, in terms of the computed performance, other SA approaches follow right behind SA1. In Tables 9 and 10, the lead in performance is shared by SA1, with the best mean value for the column Best with 259.91, and SA5, with the best computed mean values for the columns Avg and RPD with 265.11 and 2.02, respectively. Here, we can observe more robustness in the performance of SA5 and some inconsistency from SA1.
In Tables 11 and 12, the approach BCL leads the overall performance with the minimum mean values for the columns Best, Avg, and RPD, followed by SA5 and SA1. In Tables 13 and 14, the approach SA1 leads the best values reached (Best) with the smallest RPD values with 259.44 and 1.84, respectively. The approach with the best mean for the column Avg is BCL with 264.47. Nevertheless, approaches employing SARSA follow close to the leaders in the performance, which proves the effectiveness of the proposal.

Statistical Results
A p-value less than 0.05 means that the difference between the techniques is statistically significant and, thus, that the comparison of their averages is valid.
The results obtained are grouped in Tables 15-18. For each table, a matrix of the averages obtained from the 45 instances is illustrated. The description of each table is as follows: the first row and the first column present the versions of the MH to be compared. We can read the information by row as follows: obtaining a p-value less than 0.05 means that the version in the row obtained a better performance than the version located in the column for the SCP and that the difference between the results is statistically significant.
p-values greater than 0.05 have been replaced by ">0.05" to facilitate the reading of the comparison matrix. In this context, significant differences in performance can be observed in Tables 15 and 16. First, all the hybrid approaches employing learning-based components outperformed the classic approach employing the two-step transformation (BCL). The approaches employing the same RL technique, such as Q-learning (QL1-QL5 vs. QL1-QL5) and SARSA (SA1-SA5 vs. SA1-SA5), performed equally.
Lastly, approaches employing the RL technique SARSA significantly outperformed the ones employing Q-learning in almost all instances. On the other hand, an interesting phenomenon can be observed in Tables 17 and 18. The only major differences in performance were observed between BCL and the approaches employing Q-learning (QL1-QL5) and SA4-SA5. The performances between approaches employing the same and different RL techniques were not significantly different.

Action Charts
The charts of the average actions performed are illustrated in Figures 6-13. They are the graphical representation of the average choice made by Q-Learning or SARSA during the iterative process. These techniques generate a weight system in order to properly select the binarization schemes, and the objective is the identification of preferences according to the exploration or exploitation state of the environment at run time. The x-axis of the graphs shows the average number of times an action was selected for the respective state, while the y-axis lists the 40 possible actions available to the binarization scheme selector.

Exploration and Exploitation Charts
The visualization of MH metrics is fundamental to understanding their behavior through the search; thus, the exploration and exploitation graphs presented in Morales-Castañeda et al. [38] are of great contribution to the analysis of exploration and exploitation in terms of diversity among the solutions. In this work, the decision-making of the state was calculated by means of the Dimensional-Hussain diversity (Section 3.4). In the exploration and exploitation plots illustrated in Figures 14-21, the x-axis is the number of total iterations, while the y-axis is the percentage of exploration and exploitation, measured by Equations (26) and (27).

Discussions and Analysis
In this section, a discussion and detailed analysis of the results achieved in Section 4 are presented.

Best, Average, and RPD
First, we observe the results obtained and summarized in Figures 22-25. When conducting a comparison of the behaviors of the MH, we can observe that SCA and WOA achieved improvements in their performance when using Q-Learning and SARSA. On the other hand, regarding HHO and GWO, statistically significant improvements were not obtained. In this context, one of the reasons for this behavior lies in the movement operators of each MH. In HHO and GWO, we observe operators of higher complexity, which follow different logic according to the behavior of their internal parameters.
For instance, in the case of HHO, the energy E decreases during the iterations, thus influencing the motion operator employed. This is based on the logic that the first iterations should explore and the last ones should exploit. In the case of SCA and WOA, we observe simpler movement operators, where the use of the exploration or exploitation operators depends on random decisions, for instance, on the parameter r_4 in SCA and on p in WOA.

Average Wilcoxon Test
The Wilcoxon-Mann-Whitney test is the non-parametric test we used to compare two independent samples. In Tables 15-18, the average p-values obtained are presented in order to simplify their visualization. From these tables, the following is observed:

Choice of Binarization Schemes
It is known from the literature that binarization schemes have a strong impact on the performance of the MH [5]; thus, Figures 6-12 give us detailed information related to their employment during the exploration and exploitation processes, which can be described as follows:
• The action corresponding to MIR (V4 and Complement) had the lowest selection rate by both the Q-Learning and SARSA versions.
• For both the Q-Learning and SARSA binarization scheme selectors, when in an exploration state, the preference observed was for S-type transfer functions.
• For both the Q-Learning and SARSA binarization scheme selectors, when in an exploration state, there was a preference for Standard and Static binarization, followed by Complement.
• In the binarization scheme selectors with Q-Learning, when in an exploitation state, there was a preference for V-type transfer functions.
• In the binarization scheme selectors with SARSA, when in an exploitation state, there was a preference for the transfer function types S1, S2, S3, and V1.
• For both the Q-Learning and SARSA binarization scheme selectors, the Elitist and Elitist Roulette binarizations were mainly preferred when in an exploitation state.

Exploration and Exploitation
The exploration and exploitation measurements proposed in [38] provide detailed information related to the behaviors observed through the exploration and exploitation metrics during the iterative process. This allows the proper observation of the influence of the binarization schemes employed in the different versions. Figures 14-21 illustrate the following information:
• In SCA, WOA, GWO, and HHO, the BCL version presented a sharp increase in the exploitation percentage in the initial iterations, which then remained mostly constant.
• In SCA and WOA, the approaches based on MIR presented high exploitation values during the whole search process. This behavior, according to Morales-Castañeda, can be attributed to random search behaviors, which is consistent with these approaches achieving the worst performance among all the versions compared by RPD.
• In GWO, the approach based on MIR presented a slight increase in the exploitation percentage performed. However, the values reached were of low quality. The same was observed for the approaches employing BCL, but with a worse performance when compared by RPD.
• In HHO, the approach based on MIR obtained higher exploration values in the early runs compared to the approaches employing static schemes. This can be explained by the similarities to the Q-Learning and SARSA approaches, where the movements are focused on exploration, which is well-known in HHO.
• In WOA and GWO, the approach SA1 presented a behavior similar to the one obtained by MIR in GWO. However, greater amplitudes in the variations were observed, a recurrent result of Q-Learning and SARSA in the experimentation phase. These were the third and fourth approaches with the best performance among the versions with a binarization scheme selector compared by RPD.
• In SCA and GWO, the Q-Learning and SARSA versions presented exploration and exploitation graphs with constant changes in each iteration.
• In HHO, the Q-Learning version presented an equal amount of variation as observed in SCA and GWO. However, a change during the second half of the iterative process was observed, a common change of movements in HHO.
Along with this, we can observe a different influence of the binarization schemes on each MH. In the literature, the recommendations for the case of BCL were V4, which is associated with exploration, and Elitist, which is associated with exploitation. On the other hand, concerning MIR, both V4 and Complement are associated with exploration. The different behaviors observed in the performance and in the balance of exploration and exploitation open the following question: are some MH more susceptible to binarization schemes than others? In this regard, future works will focus on exposing this relationship and building scientific evidence regarding this issue.

Conclusions
In this work, a novel learning-based binarization scheme selector was proposed. In this context, novel approaches have proven to be highly efficient in tackling hard optimization problems. The designed learning-based method employed a Reinforcement Learning technique, named SARSA, which utilizes the dynamic data generated through the search by continuous population-based algorithms. The main objective behind the proposed approach was to design a balanced binarization scheme selector.
Regarding the results achieved, the five different versions of SARSA demonstrated competitive performances. The experimentation solving the SCP illustrated that WOA and SCA assisted by Q-Learning and SARSA obtained statistically significantly better results. However, regarding HHO and GWO, the opposite phenomenon was observed for the versions applying Q-Learning. In this regard, the implementations employing static binarization schemes (V4 and Elitist) presented better performance in most of the 45 instances. Nevertheless, the implementations applying SARSA maintained a good overall performance.
On the other hand, observing the real benefits given by the rewards employed with Q-Learning, we could not demonstrate a significant difference. The results achieved were similar, and thus it cannot be concluded that, for the solved problems, the type of reward used directly impacts the quality of the solutions. In the case of SARSA, an equal phenomenon was observed: no statistically significant differences were determined. However, when comparing the versions of Q-Learning and SARSA, the latter achieved significance for SCA and WOA.
Within future works, along with answering the question in Section 5.3, the option of evaluating other MH with exploration and exploitation behaviors must be pursued in order to further exploit the benefits and continue building solid evidence using the improvements of learning-based models. Likewise, different diversities can be evaluated to determine if there are significant differences in their results and the possibility of grouping them under another classification according to the exploration and exploitation percentages they generate.
This is an area of great interest due to a large number of methods for calculating diversity. Other future works can contemplate the increase of actions for the proposed selection scheme, i.e., to add more transfer and binarization functions, such as O-Shaped [42], Z-Shaped [43], Q-Shaped [44], and U-Shaped [45]. In addition to evaluating other techniques of Temporal Difference, it is possible to explore new options, such as using techniques focused on large multi-dimensional variable sizes. This context includes the "Deep Q-Network" and others from deep learning.