1. Introduction
The resolution of real problems can be approached through mathematical modeling to find a solution with an optimization algorithm [1]; under this scheme, it is increasingly common for different industries to solve combinatorial problems in their normal operation to minimize costs and times, as well as to maximize profits. Such is the case of the forestry industry [2], flight planning for unmanned aircraft [3], and the detection of cracks in pavements [4]. Combinatorial problems are mostly NP-hard, which makes it difficult to find solutions with polynomial-time algorithms [5]; this is why intelligent optimization algorithms, mainly metaheuristics (MHs) [6], have considerably supported the growth of combinatorial problem solving: through their search processes, they intelligently explore the search space, finding quasi-optimal solutions in reasonable computational times.
MHs are general-purpose algorithms widely used to solve optimization problems. Talbi [7] indicated that metaheuristics can be classified according to how they perform the search process. Single-solution metaheuristics transform a single solution during the search process; classic examples are simulated annealing [8] and tabu search [9]. Population-based metaheuristics, on the other hand, evolve a set of solutions as the optimization process progresses; classic examples are particle swarm optimization [10], cuckoo search [11], and the genetic algorithm [12]. In the literature, population-based metaheuristics are more widely used than single-solution metaheuristics.
The no-free-lunch (NFL) theorem [13] states that no optimization algorithm performs best on all optimization problems. This theorem motivates researchers to keep developing new and innovative algorithms, and it has led to the creation of new metaheuristics with very good performance, such as grey wolf optimization [14], the whale optimization algorithm [15], and the sine–cosine algorithm (SCA) [16].
The grey wolf optimizer has been used, for example, in feature selection [17], training neural networks [18], optimizing support vector machines [19], designing and tuning controllers [20], economic dispatch problems [21], robotics and path planning [22], and scheduling [23].
The whale optimization algorithm has been used, for example, in the optimal power flow problem [24], the economic dispatch problem [25], the electric vehicle charging station location problem [26], image segmentation [27], feature selection [28], drug toxicity prediction [29], and $\mathrm{CO}_2$ emissions prediction and forecasting [30].
The sine–cosine algorithm has been used, for example, in the trajectory controller problem [31], feature selection [32], power management [33], network integration [34], engineering problems [35], and image processing [36].
The grey wolf optimizer, sine–cosine algorithm, and whale optimization algorithm were developed to perform in continuous domains; only a small number of techniques are capable of operating natively in binary domains, as is the case of genetic algorithms [37] and some variations of ant colony optimization (ACO) [38]. However, these have not been able to match the performance obtained by continuous metaheuristics equipped with an operator capable of transforming their continuous solutions into the binary space. In recent years, the literature on new and novel binarization operators has grown, including operators based on machine learning, specifically clustering techniques such as K-means [39] and DBSCAN [40]; operators based on reinforcement learning, such as Q-learning [41] and SARSA [42]; and operators with other inspirations, such as quantum computing [43], logical operators [44], crossovers [45], and percentiles [46]. Among the most common operators is the two-step operator, which consists of normalizing the continuous values through a transfer function (Step 1) and subsequently binarizing them with a binarization rule, finally obtaining a value of 0 or 1 (Step 2) [47]. In this context, it is necessary to continue searching for new variations of binarization techniques, since it has been proven that they directly influence the performance of the MHs [48,49,50,51,52].
In this work, we propose a new intelligent operator using binarization scheme selection (BSS), capable of adapting any continuous MH to work in the binary domain. BSS is based on the two-step technique, where an intelligent operator chooses the transfer function and the binarization rule to be used; this scheme was first proposed in [53]. BSS has been previously used with reinforcement learning techniques under the machine learning umbrella, namely Q-learning and SARSA; in this work, a new intelligent operator called backward Q-learning (BQSA) [54] is presented, which combines Q-learning and SARSA, updating the Q-values with SARSA and, in a delayed way, with Q-learning. In addition, the set of schemes in this proposal is more extensive, going from 40 possible combinations to 80. All of the above points to the need to investigate hybrid methods to improve algorithm performance. The contributions of this work are presented below:
The implementation of a BSS as a binarization operator capable of operating in any continuous MH.
The use of the BQSA as a smart operator in BSS.
A larger set of transfer functions obtained from the literature, which increases the number of possible binarization schemes from 40 to 80.
Experimental tests were carried out against multiple state-of-the-art binarization strategies solving the set covering problem. The proposed work obtained considerably competitive performance, without statistically significant differences with respect to the best alternatives, although the difference with the static versions was validated. In these quantitative comparisons, we can observe differences in convergence behavior between the 80-action and 40-action versions, as well as in the balance between exploration and exploitation.
The rest of the paper is structured as follows: Section 2 presents the work related to metaheuristic techniques, machine learning, and their hybridization. Section 3 presents the proposal for the incorporation of the BQSA in BSS, while in Section 4, its implementation is validated with the results obtained and the respective statistical tests, ending with the analysis and conclusions in Section 5.
2. Related Work
In the following subsections, we present some of the concepts necessary to understand this work.
2.1. Reinforcement Learning
Machine learning (ML) aims to analyze data in search of patterns and to develop predictions of future results [55]. When decomposing ML, we can find four main types (Figure 1): supervised, unsupervised, semi-supervised, and reinforcement learning (RL). In RL, an agent receives a set of instructions, actions, or guidelines, with which it then makes its own decisions based on a process of reward and penalty that guides the agent toward the optimal solution of the problem.
On this basis, an RL agent is constituted by four sub-elements: the policy, the value function, the reward function, and the environment model; however, the latter is usually optional [56]. These elements are defined as follows:
A policy defines the agent’s behavior at each instant of time, i.e., it is a mapping from the set of perceived states to the set of actions to be performed when the agent is in those states.
The value function allows the agent to maximize the sum of total rewards in the long run. It calculates the value of a state–action pair as the total amount of rewards that the agent can expect to accumulate in the future, starting from the state it is in. Thus, the agent selects the action based on value judgments. Indeed, while the reward determines the immediate and intrinsic desirability of a state–action pair, the value indicates the longterm desirability of a state–action pair considering likely future state–action pairs and their rewards.
A reward function represents the agent’s goal, i.e., it translates each perceived state–action pair into a single number. In other words, a reward indicates the intrinsic desirability of that state–action pair. It is a way of communicating to the agent what we want it to achieve, but not how to achieve it.
An environment model is intended to reproduce the behavior of the environment, i.e., the model that directs the agent to the next state and the subsequent reward based on the current state–action pair. The environment model is not always available and is therefore an optional element.
Among the existing RL classifications, we find the temporal difference techniques [57], where the Q-learning (QL) algorithm is the most popular for its contributions in several areas [58]. Still, there are other algorithms, such as SARSA [59] and the BQSA [54], which are variations of QL but have obtained different performances on different problems.
2.2. Q-Learning
Among the algorithms present in RL, we find the QL algorithm, first proposed in [60], which provides agents with the ability to learn to act optimally without the need to build domain maps. There are several possible states s; from the “environment”, we obtain the current state s in which the agent interacts and makes decisions. The agent has a set of possible actions, which affect the reward and the next state. Once an action is performed, the state changes, and the agent receives a reward for the decision made. The rewards received by the agent consequently generate learning in the agent. To solve the problem, the agent learns the best course of action it can take, i.e., the one with the maximum cumulative reward. The sequence of actions from the first state to the terminal state is called an episode. The transition of states is given by Equation (1).
where ${Q}_{new}({s}_{t},{a}_{t})$ denotes the new value of the action taken in state ${s}_{t}$; ${r}_{n}$ is the reward received when action ${a}_{t}$ is taken; $maxQ({s}_{t+1},{a}_{t+1})$ is the maximum action value for the next state; $\alpha$ must satisfy $0<\alpha \le 1$ and corresponds to the learning factor; and $\gamma$ must satisfy $0\le \gamma \le 1$ and corresponds to the discount factor. If $\gamma$ takes the value 0, only the immediate reward is considered, while as it approaches 1, the future reward receives greater emphasis relative to the immediate reward. QL is presented in Algorithm 1:
Algorithm 1 Q-learning.
1: Initialize $Q(s,a)$
2: for each episode do
3:   Initialize state s
4:   while $s\ne {s}_{terminal}$ do
5:     Choose action a from state s
6:     Take action a
7:     Observe reward r
8:     Observe next state ${s}^{\prime}$
9:     Update $Q(s,a)$ using Equation (1)
10:     $s\leftarrow {s}^{\prime}$
11:   end while
12: end for

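As a concrete illustration of Algorithm 1, the following sketch runs tabular Q-learning on a small deterministic chain environment; the environment, its rewards, and the epsilon-greedy policy are illustrative assumptions, not part of the original work:

```python
import random

def q_learning(n_states=5, n_actions=2, episodes=200,
               alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy chain: action 1 moves right (reward 1 on
    reaching the terminal state), action 0 moves left. Illustrative only."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]     # 1: initialize Q(s, a)
    terminal = n_states - 1
    for _ in range(episodes):                            # 2: for each episode
        s = 0                                            # 3: initialize state s
        while s != terminal:                             # 4: while s != s_terminal
            # 5: choose action a from state s (epsilon-greedy, random tie-break)
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                best = max(Q[s])
                a = rng.choice([i for i in range(n_actions) if Q[s][i] == best])
            # 6-8: take action a, observe reward r and next state s'
            s_next = min(s + 1, terminal) if a == 1 else max(s - 1, 0)
            r = 1.0 if s_next == terminal else 0.0
            # 9: update Q(s, a) using Equation (1)
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next                                   # 10: s <- s'
    return Q

Q = q_learning()
# After learning, the greedy policy moves right (action 1) in every state.
```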
2.3. State–Action–Reward–State–Action
This is an RL method that uses generalized policy iteration, which consists of two processes that are performed simultaneously and interact with each other: one evaluates the value function under the current policy while, at the same time, the other improves the current policy. These two processes complement each other in each iteration, but neither needs to be completed before the other begins.
The learning agent learns the current value function derived from the policy currently in use. To understand how it works, the first step is to learn an action-value function instead of a state-value function. In particular, for the on-policy method, we must estimate ${Q}_{\pi}(s,a)$ for the current policy $\pi$ and all states s and actions a.
To understand the algorithm, let us consider the transitions as pairs of values, from state–action pair to state–action pair, where the values of the state–action pairs are learned. These cases are identical: both are Markov chains with a reward process. The theorems that ensure convergence of state values also apply to the corresponding algorithm for action values, with Equation (2).
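For reference, the on-policy SARSA update that Equation (2) refers to, written with the five events listed below, is:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]
\tag{2}
```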
After each transition, the state is updated until a terminal state is reached. When a state ${s}_{t+1}$ is terminal, $Q({s}_{t+1},{a}_{t+1})$ is defined as zero. Each transition process is composed of five events: ${s}_{t},{a}_{t},{r}_{t+1},{s}_{t+1},{a}_{t+1}$ (State–Action–Reward–State–Action), giving the SARSA algorithm its name. SARSA is presented in Algorithm 2:
Algorithm 2 SARSA.
1: Initialize $Q(s,a)$
2: for each episode do
3:   Initialize state s
4:   Choose action a from state s
5:   while $s\ne {s}_{terminal}$ do
6:     Take action a
7:     Observe reward r
8:     Observe next state ${s}^{\prime}$
9:     Choose next action ${a}^{\prime}$ from next state ${s}^{\prime}$
10:     Update $Q(s,a)$ using Equation (2)
11:     $s\leftarrow {s}^{\prime}$
12:     $a\leftarrow {a}^{\prime}$
13:   end while
14: end for

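The practical difference from Algorithm 1 lies in steps 9 and 10: SARSA chooses the next action before updating and bootstraps on $Q(s', a')$ for the action actually chosen, rather than on $\max Q(s', \cdot)$. A minimal sketch of one SARSA transition follows; the epsilon-greedy helper and the toy state indices are illustrative assumptions:

```python
import random

def epsilon_greedy(Q, s, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q[s]))
    best = max(Q[s])
    return rng.choice([a for a in range(len(Q[s])) if Q[s][a] == best])

def sarsa_step(Q, s, a, r, s_next, alpha, gamma, epsilon, rng, terminal):
    """One SARSA transition: choose a' from s' with the SAME policy being
    learned (on-policy), then update Q(s, a) toward r + gamma * Q(s', a')."""
    a_next = epsilon_greedy(Q, s_next, epsilon, rng)
    target = 0.0 if s_next == terminal else Q[s_next][a_next]  # terminal Q is 0
    Q[s][a] += alpha * (r + gamma * target - Q[s][a])
    return s_next, a_next   # both carried into the next step (S, A, R, S', A')

rng = random.Random(1)
Q = [[0.0, 0.0] for _ in range(3)]
s, a = 0, 1
s, a = sarsa_step(Q, s, a, r=1.0, s_next=2, alpha=0.5, gamma=0.9,
                  epsilon=0.1, rng=rng, terminal=2)
```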
2.4. Backward Q-Learning
Backward Q-learning is another RL technique. Although its name is similar to QL, it does not rely only on the ordinary Q-function update (Equation (1)); this time, a backward update is added, hence the name backward Q-learning.
In this structure, the action is directly affected, while the policy is indirectly affected. As the agent increases its interaction with the environment, the agent’s precise knowledge also increases. Due to its structure, the agent can improve the learning speed, balance the explore–exploit dilemma, and converge to the global minimum by using the previous states, actions, and information in an episode. The states the agent went through, the actions it chose, and the rewards it acquired in an episode are recorded, and this information is then used to update the Q-function again.
When the agent reaches the target state in the current episode, the produced data are used to update the Q-function backward. For example, state ${s}_{0}$ is defined as an initial state, and state ${s}_{n}$ is defined as a terminal state. The agent updates the Q-function N times from the initial state ${s}_{0}$ to the terminal state ${s}_{n}$ in an episode thanks to the recorded events “s, a, r, s”. Therefore, we redefine Equation (1) for a step as:
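Given that the BQSA performs its forward updates in the SARSA style (Section 1), the per-step update of Equation (3) presumably takes the form, with the episode-step index $i$:

```latex
Q(s_t^i, a_t^i) \leftarrow Q(s_t^i, a_t^i)
  + \alpha \left[ r_{t+1}^i + \gamma \, Q(s_{t+1}^i, a_{t+1}^i) - Q(s_t^i, a_t^i) \right]
\tag{3}
```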
where $i=1,2,\cdots ,N$ indexes the number of times the Q-function is updated in the current episode. In turn, the agent simultaneously records the four events in ${M}^{i}$, represented mathematically by Equation (4):
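Consistent with step 11 of Algorithm 3, the record of Equation (4) is the tuple:

```latex
M^i = \left\{ s_t^i, \; a_t^i, \; r_t^i, \; s_{t+1}^i \right\}
\tag{4}
```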
Once the agent reaches the terminal state, it will backward update the Q-function based on the information obtained from Equation (4), as follows (Equation (5)):
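Since the delayed pass is the Q-learning-style update (Section 1), Equation (5) presumably reads, with the backward factors $\alpha_b$ and $\gamma_b$:

```latex
Q(s_t^j, a_t^j) \leftarrow Q(s_t^j, a_t^j)
  + \alpha_b \left[ r_t^j + \gamma_b \, \max_{a} Q(s_{t+1}^j, a) - Q(s_t^j, a_t^j) \right]
\tag{5}
```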
where $j=N,N-1,N-2,\cdots ,1$, and ${\alpha}_{b}$ and ${\gamma}_{b}$ are the learning and discount factors, respectively, for the backward update of the Q-function. Algorithm 3 is included for a better understanding of the above.
Algorithm 3 Backward Q-learning.
1: Initialize arbitrarily all $Q(s,a)$, M and set $\alpha$ and $\gamma$
2: for each episode do
3:   Initialize state ${s}_{t}$
4:   Choose an action ${a}_{t}$ from state ${s}_{t}$
5:   while $N\ge 1$ do
6:     for each step in the episode do
7:       Execute the selected action ${a}_{t}^{i}$ in the environment
8:       Observe reward ${r}_{t+1}$
9:       Observe new state ${s}_{t+1}^{i}$
10:       Choose an action ${a}_{t+1}^{i}$ from state ${s}_{t+1}^{i}$
11:       Record the four events in ${M}^{i}\longleftarrow {s}_{t}^{i},{a}_{t}^{i},{r}_{t}^{i},{s}_{t+1}^{i}$
12:       Update $Q({s}_{t}^{i},{a}_{t}^{i})$ using Equation (3)
13:       ${s}_{t}^{i}\leftarrow {s}_{t+1}^{i}$
14:       ${a}_{t}^{i}\leftarrow {a}_{t+1}^{i}$
15:       $i\leftarrow i+1$
16:     end for
17:   end while
18:   for $j=N$ to 1 do
19:     Backward update $Q({s}_{t}^{j},{a}_{t}^{j})$ using Equation (5)
20:   end for
21:   Initialize all M values
22: end for

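A compact sketch of the backward pass of Algorithm 3: the episode's transitions are recorded in M and then replayed from j = N down to 1 with a Q-learning-style update. The toy transitions and parameter values are illustrative assumptions:

```python
def backward_update(Q, M, alpha_b=0.5, gamma_b=0.9, terminal=None):
    """Replay the recorded episode M = [(s, a, r, s_next), ...] in reverse
    (j = N, N-1, ..., 1), applying a Q-learning-style update each time."""
    for s, a, r, s_next in reversed(M):
        future = 0.0 if s_next == terminal else max(Q[s_next])
        Q[s][a] += alpha_b * (r + gamma_b * future - Q[s][a])

# Toy 3-state chain: the episode 0 -> 1 -> 2 reached the terminal state 2.
Q = [[0.0, 0.0] for _ in range(3)]
M = [(0, 1, 0.0, 1), (1, 1, 1.0, 2)]   # recorded (s, a, r, s') events
backward_update(Q, M, terminal=2)
# Because the replay runs backward, the terminal reward already propagates
# to state 0 within a single backward pass.
```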
2.5. Metaheuristics
MHs are used to solve optimization problems and can be considered strategies that guide and modify other heuristics to produce solutions beyond those normally generated in a search for local optimality [61]. Blum and Roli [6] mention two main components of any metaheuristic algorithm: diversification and intensification, also known as exploration and exploitation. Exploration means generating diverse solutions to explore the search space on a global scale, while exploitation means focusing the search on a local region, knowing that a good solution can be found in this neighborhood. Global optimization can be achieved with a good combination of these two components, as a good balance while selecting the best solutions improves the convergence rate of the algorithms. Choosing the best solutions helps ensure that the solutions converge to the optimum, while diversification through randomization allows the search to move from local optima toward the whole search space, causing the diversity of the solutions to grow.
The great advantage of MHs is that they are able to generate near-optimal solutions in reduced computational times, as opposed to exact methods; in addition, they are able to adapt to the problem, in contrast to heuristic methods [7].
A high percentage of MHs are nature-based, mainly because their development is based on some abstraction of nature. The existing taxonomy can be defined in various ways and at various levels of an MH; at first, MHs can be classified as trajectory-based or population-based. A popular classification is the one presented in [62,63], which decomposes the population-based ones into four main categories: physics-based, human-based, evolutionary-based, and swarm-based (Figure 2). Furthermore, the taxonomy presented in [64] goes deeper into the main components of these algorithms: solution evaluation, parameters, encoding, initialization of the agents or the population, management of the population, operators, and finally, local search.
In this regard, and going deeper into the metaheuristic components and behavior, a metaheuristic can be represented algorithmically as a triple nested cycle. The first cycle corresponds to the iterations performed during the optimization process, the second cycle corresponds to the solutions held by the agents, and the third cycle corresponds to the dimensions associated with the problem. It is necessary to mention that within the third cycle there is a $\Delta$ that is characteristic of each MH. Algorithm 4 presents the above.
Algorithm 4 Discrete general scheme of metaheuristics.
1: Random initialization
2: for $iteration\ (t)$ do
3:   for $solution\ (i)$ do
4:     for $dimension\ (d)$ do
5:       ${X}_{i,d}^{t+1} = {X}_{i,d}^{t} + \Delta$
6:     end for
7:   end for
8: end for

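The triple nested cycle of Algorithm 4 can be sketched as follows, with a random perturbation pulled toward the best solution standing in for the MH-specific $\Delta$; the objective function and the particular $\Delta$ used here are illustrative assumptions, not any specific MH:

```python
import random

def generic_metaheuristic(objective, n_solutions=10, n_dims=3,
                          iterations=100, seed=0):
    """Triple nested cycle of a population-based MH: iterations -> solutions
    -> dimensions. Delta here is a random perturbation plus attraction to the
    best solution; each real MH (GWO, SCA, WOA, ...) defines its own Delta."""
    rng = random.Random(seed)
    X = [[rng.uniform(-5, 5) for _ in range(n_dims)]     # random initialization
         for _ in range(n_solutions)]
    best = list(min(X, key=objective))
    for _ in range(iterations):              # cycle 1: iterations (t)
        for i in range(n_solutions):         # cycle 2: solutions (i)
            for d in range(n_dims):          # cycle 3: dimensions (d)
                delta = rng.gauss(0, 0.5) + 0.1 * (best[d] - X[i][d])
                X[i][d] = X[i][d] + delta    # X_{i,d}^{t+1} = X_{i,d}^t + Delta
            if objective(X[i]) < objective(best):
                best = list(X[i])            # keep the best solution found
    return best

sphere = lambda x: sum(v * v for v in x)     # minimize the sphere function
best = generic_metaheuristic(sphere)
```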
2.6. Hybridizations
Hybridizations between MHs and other approaches have been thoroughly studied in the literature; among these approaches is ML, including RL. Various authors propose taxonomies and types of interaction, such as [65,66,67,68]. According to Song et al. [65], optimization and ML approaches interact in four ways:
Optimization supports ML.
ML supports optimization.
ML supports ML.
Optimization supports optimization.
Interaction number 2 is also structured in the classification presented in [66,67], where ML can support optimization at the problem level, replacing, for example, objective functions or constraints that are costly; at a low level, i.e., in the components or operators of the MH; and finally at a high level, where ML techniques can be used to choose between different MHs or components of an MH.
2.7. Binarization
As mentioned above, a high percentage of population-based metaheuristics are continuous in nature and are therefore not suitable for directly solving binary optimization problems. For this reason, an adaptation to the 0/1 domain is necessary. The two-step sequential mechanism and the BSS used in other works are detailed below [41,42].
Two-Step
Within the literature, the transfer from the continuous to the binary domain through the two-step mechanism is one of the most common. Its name is due to its sequential mechanism: the first step consists of transferring values from the reals to a bounded interval, i.e., $[0,1]$, through transfer functions; then, in the second step, the bounded interval is transformed into a value in $\{0,1\}$ by means of binarization rules. In the last few years, several transfer functions have been presented; among the most common are the S-shaped and V-shaped functions [49], but others exist, such as the X-shaped [69,70], Z-shaped [71], U-shaped [72,73], and Q-shaped [74] functions, and in the literature we can find versions that change during the iterative process by means of a decreasing parameter [75,76,77,78].
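A minimal sketch of the two-step mechanism, using the common S-shaped sigmoid transfer function for Step 1 and the standard binarization rule for Step 2; the specific function and rule shown are common choices from the literature, used here only for illustration:

```python
import math
import random

def s_shaped_transfer(x):
    """Step 1: map a real value into [0, 1] via the classic S-shaped sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def standard_rule(t, rng):
    """Step 2: binarize the transferred value against a uniform draw."""
    return 1 if rng.random() < t else 0

def binarize(solution, rng):
    """Two-step binarization of a continuous solution, dimension by dimension."""
    return [standard_rule(s_shaped_transfer(x), rng) for x in solution]

rng = random.Random(42)
continuous = [-6.0, -0.5, 0.5, 6.0]
binary = binarize(continuous, rng)
# Large negative values map near 0, large positive values near 1.
```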
2.8. Binarization Scheme Selector
As mentioned in Section 1, other works proposed the idea of combining RL with MHs, following the framework proposed by Talbi et al. in [66,67]; from this combination was born an intelligent selector based on the two-step technique mentioned above, called BSS. The breakthrough of this smart selector is that it is able to learn autonomously, through trial and error, which transfer function (first step) and binarization rule (second step) best fit the binarization of the solutions of the optimization process. In the proposals of [42], there are S-shaped and V-shaped transfer functions, and the binarization rules are standard, static, complement, elitist, and elitist roulette, resulting in a total of 40 possible combinations; the intelligent selector is able to determine which one to use in each iteration.
Figure 3 shows in general how the BSS is applied, and Algorithm 5 explains the operation of the BSS.
Algorithm 5 Binarization scheme selector.
1: Initialize a random swarm
2: Initialize Q-table
3: for $iteration\ (t)$ do
4:   Select action ${a}_{t}$ for ${s}_{t}$ from the Q-table
5:   for $solution\ (i)$ do
6:     for $dimension\ (d)$ do
7:       ${X}_{i,d}^{t+1} \longleftarrow {X}_{i,d}^{t} + \Delta\left({a}_{t}\right)$
8:     end for
9:   end for
10:   Get immediate reward ${r}_{t}$
11:   Get the maximum Q-value for the next state ${s}_{t+1}$
12:   Update Q-table using Equation (1) or Equation (2)
13:   Update the current state ${s}_{t} \longleftarrow {s}_{t+1}$
14: end for

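The action selection in step 4 of Algorithm 5 can be sketched as an epsilon-greedy lookup over a Q-table whose actions are (transfer function, binarization rule) pairs; the 80-action set is the Cartesian product of 16 transfer functions and 5 rules. The identifiers below are illustrative, not the exact names used in the experiments:

```python
import itertools
import random

# 4 families x 4 functions = 16 transfer functions, and 5 binarization rules.
TRANSFER_FUNCTIONS = [f"{fam}{i}" for fam in ("S", "V", "X", "Z")
                      for i in range(1, 5)]
BINARIZATION_RULES = ["standard", "static", "complement",
                      "elitist", "elitist_roulette"]
ACTIONS = list(itertools.product(TRANSFER_FUNCTIONS, BINARIZATION_RULES))

def select_action(q_row, epsilon, rng):
    """Step 4 of Algorithm 5: epsilon-greedy choice of a binarization scheme
    for the current state from one row of the Q-table."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))                 # explore: random scheme
    best = max(q_row)
    return rng.choice([a for a, q in enumerate(q_row) if q == best])  # exploit

rng = random.Random(7)
q_row = [0.0] * len(ACTIONS)     # one state's Q-values, initially zero
a = select_action(q_row, epsilon=0.1, rng=rng)
scheme = ACTIONS[a]              # e.g., a (transfer function, rule) pair
```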
4. Experimental Results
To validate the performance of our proposal, a comparison of eight different versions of GWO, SCA, and WOA was carried out. Three of these versions incorporate BQSA, QL, and SARSA, selecting among 80 binarization schemes (80aBQSA, 80aQL, 80aSA). Another three versions incorporate BQSA, QL, and SARSA, selecting among 40 binarization schemes (40aBQSA, 40aQL, 40aSA). The last two use fixed binarization schemes. Regarding the binarization schemes and versions, the schemes, also called actions, are given by the product $\{Transfer\phantom{\rule{4pt}{0ex}}Functions\times Binarization\phantom{\rule{4pt}{0ex}}Rules\}$; in this case, we have four families of transfer functions with four functions each, 16 in total, and five binarization rules: $16\times 5=80$. Regarding the versions, 40aQL and 40aSA are the original 40-action versions (families S and V) applied to QL [41] and SARSA [42], respectively; in order to assess the performance of our proposal, we replicated QL and SARSA with 80 actions and ran BQSA with both 40 and 80 actions. Finally, of the two static versions, the first, BCL, uses V4-Elitist [48], and the second, MIR, uses V4-Complement [49].
The benchmark instances of the set covering problem solved are those proposed in Beasley’s OR-Library [81]. In particular, we solved 45 instances from this library.
The algorithms were implemented in Python 3.7 and executed using the free services of Google Colaboratory [82]. The results were stored and processed in databases provided by the Google Cloud Platform. The authors of [48] suggest making 40,000 calls to the objective function; to this end, we used a population of 40 individuals and 1000 iterations for all GWO, SCA, and WOA runs. Thirty-one independent executions were performed for each instance. All parameters used for GWO, SCA, WOA, BQSA, QL, and SARSA are detailed in Table 2.
The results obtained from the experimentation process are summarized in Table 3, Table 4 and Table 5, where the results are presented for each of the eight versions and the 45 benchmark instances. The first row presents the names of the versions; in the second row, the column titles are as follows: the first column names the OR-Library instance used (Inst.) and the second column the known optimal value of each instance (Opt.), while for each version three columns report the best result obtained in the 31 independent runs (Best), the average of these 31 runs (Avg), and the relative percentage deviation (RPD), which is defined in Equation (6). In the last row, each column’s average is presented to facilitate the comparison between versions.
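The RPD is conventionally computed as the percentage gap between the best value found and the known optimum; Equation (6) is assumed here to follow that usual definition:

```python
def rpd(best, opt):
    """Relative percentage deviation of a result from the known optimum,
    as conventionally defined: 100 * (Best - Opt) / Opt."""
    return 100.0 * (best - opt) / opt

# Example: an instance with known optimum 429 and a best found value of 433
# gives an RPD of roughly 0.93%.
value = rpd(433, 429)
```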
4.1. Convergence and Exploration–Exploitation Charts
During the experimentation, several data from the optimization process were recorded, such as the fitness obtained at each iteration and the diversity among individuals, as presented in [41,42], in order to analyze their behavior across the iterations. One graphical representation is the convergence plots shown in Figure 5, Figure 6 and Figure 7, where the X-axis corresponds to the 1000 iterations, while the Y-axis presents the best fitness obtained up to that iteration. The plots correspond to the best runs for representative instances; the fitness value found is recorded in the subtitle of each plot, while the known optimum is presented in the title of each plot set in order to make the comparison simpler. Another representation is the exploration–exploitation plots presented in Figure 8, Figure 9 and Figure 10, where the X-axis presents the iterations while the Y-axis displays the exploration (XPL) and exploitation (XPLT) percentages.
4.2. Statistical Results
To validate the comparison between averages, it is necessary to determine, by means of the corresponding statistical test, whether the difference between the results is significant; for this, we used the Wilcoxon–Mann–Whitney test [83], comparing all the versions with a significance level of 0.05 for each of the MHs used. These results are presented in Table 6, Table 7 and Table 8, which are structured as follows: the first column presents the techniques used, and the following columns present the average p-values over the 45 instances compared with the version indicated in the column title. If the value of a comparison is greater than 0.05, it is presented as “$\ge$0.05”; when the comparison is against the same version, the symbol “-” is presented; and the values have been rounded to two decimal places.
5. Conclusions
The increase in computing capacity at more accessible costs has allowed the democratization of the use of machine learning, which has generated an increase in research in different areas; every day, the use of these techniques becomes more common in both academia and industry. The use of machine learning to improve the metaheuristic search process is a field in constant development, where several research efforts seek to validate the use of ML as a process improvement. The literature presents two explicit schemes for these hybridizations: the high-level ones, where we can find the hyperheuristics, and the low-level ones, as in this research, where the ML technique is one more operator of the MH.
In this paper, a new intelligent operator is presented for use in binarization scheme selection (BSS), able to adapt any continuous MH to work in the binary domain. The main contributions are the implementation of a BSS capable of operating in any MH, the use of a new intelligent operator, the BQSA, and the increase in the possible actions of the intelligent operator from 40 to 80. The increase in actions comes from the incorporation of two novel families that are rarely used in binarization schemes: Z-shaped and X-shaped.
The two-step binarization schemes are the most-used binarization methods in the literature [47], both for their versatility in programming and for their low computational cost; they require choosing a transfer function (Step 1) and a binarization rule (Step 2). However, in the literature, there are many different ways to binarize, since there is a combinatorial problem between the options of Step 1 and Step 2, reaching the extreme of having infinite alternatives [75,78]; with so many options, it is necessary to choose intelligently among all the possibilities.
In the literature, to address the problem of choosing the binarization scheme, in most cases a combination that has presented good performance is chosen, while in some more exhaustive works, the combination is validated against an extensive experimental analysis, as in [48,49]; but these only confirm a good two-step combination for a given problem and set of instances, which is not necessarily replicable to other problems in the same domain. In this context, an intelligent scheme was proposed in [41] to select among different actions (two-step combinations), where, by means of QL, an action is chosen for each iteration and rewarded or penalized according to its performance, this being a hybridization where a machine learning technique supports metaheuristics. There are other works where BSS is used, presenting different intelligent selectors such as QL [41] and SARSA [42], but always with the most-used set of 40 actions (eight choices of V-shaped and S-shaped transfer functions and five binarization rules), which have presented diverse favorable performances. In this work, besides replicating these experiments, we analyzed the performance of 80 actions (16 options of V-shaped, S-shaped, X-shaped, and Z-shaped transfer functions, and five binarization rules), with the objective of validating that, by having a wider range of actions, the intelligent selector is able to choose in a better way, avoiding the biases of a reduced action set, in addition to directing the research so that the intelligent selector has more options to choose from.
In response to this proposal, an extensive set of experiments was carried out, detailed in Section 4, where eight different versions were compared across 3 MHs, solving 45 different instances of the set covering problem, all of them executed in 31 independent runs in order to perform the respective statistical tests. The versions containing 80 actions (80aBQSA, 80aQL, and 80aSA) presented competitive performances, obtaining a similar average RPD and, in some cases, a better one, but with differences that were not significant under the respective statistical tests; therefore, we cannot conclude that they perform better than the versions with 40 actions (40aBQSA, 40aQL, and 40aSA). However, for the MHs WOA and SCA, there were significant differences when compared with the static version MIR (V4-Complement), which had the worst performance of the eight versions. After analyzing the convergence plots, we can observe that, although the results are diverse, the static schemes present early convergence compared to the dynamic versions (40 and 80 actions), which is related to a good search process that does not become trapped early in local optima. Along with this, the exploration and exploitation plots give us a different perspective on the behavior during the search process, providing information on the diversity between individuals, as defined in [84]; from these plots, we can conclude that the versions with 80 actions tend to have a more predominant tendency to exploit. On the other hand, we confirmed that the recommendations of the literature are not necessarily applicable to every problem, as in the case of the MIR combination, which was validated for another problem, confirming what is stated in the no-free-lunch theorem [13].
During the theoretical and experimental development of this work, new research questions have arisen, which remain as possible future work, where we can highlight the need to address the use of variable transfer functions [75,78]. This would take advantage of the richness of varying the transfer function under a continuous parameter, but it in turn generates a problem to solve: BQSA, SARSA, QL, and other temporal difference techniques are defined to choose among a discrete set of actions, not allowing the direct choice of actions in continuous domains. It is also necessary to study the influence of transfer functions versus binarization rules, i.e., to use a variety of actions composed of either individual transfer functions or individual binarization rules. Along with answering the above questions, we contemplate evaluating other MHs from the literature on other binary-domain problems in order to confirm that the incorporation of reinforcement learning techniques generates the same effect on them. Another area to investigate is the behavior of this hybridization under smaller action subsets in order to evaluate the impact of each of the combinations.