1. Introduction
Nature–inspired optimization techniques are a set of algorithms or methods designed to adapt to and solve complex optimization problems [1]. Metaheuristics achieve this adaptability because they do not depend on the mathematical structure of the problem; instead, they rely on heuristic procedures and intelligent search strategies [2,3]. A subset of metaheuristics, known as swarm intelligence methods, comprises procedures that operate on a population of artificial individuals that cooperate with each other to search for a solution to the problem. Solutions that are good enough are found within a certain time, which is configured through the input parameters that dictate the internal behavior of the execution [4].
Particle swarm optimization (PSO) is probably the most widely applied bio-inspired optimization algorithm of recent decades [5]. This method uses inertia weights, acceleration coefficients, and social coefficients to calculate the movement of its individuals in the search space. These parameters are so relevant to the execution and performance of the algorithm that small adjustments can directly impact the result found [6]. Based on the “No Free Lunch” theorem, we can infer that no universal configuration of this algorithm can provide the best possible solution for all optimization problems [7]. Therefore, the algorithm must be adapted to the problem at hand, which means the parameters must be readjusted when facing different problems. It has been shown that parameter setting drastically affects the final result of the algorithm, and it remains a hot topic [8]. From this, the problem of parameter adjustment arises, which can itself be considered an optimization problem [9]. There are at least two ways to approach parameter tuning: (a) offline tuning, which identifies the best values for a problem during a testing phase and does not modify the parameters during the execution of the algorithm, and (b) online control, which adapts the parameter values during execution according to strategies that can be deterministic, adaptive, or self–adaptive. Due to the lack of a single solution to this problem, the scientific community has searched for hybrid modules inspired by different disciplines. One of these methods is Learnheuristics, which combines machine learning (ML) techniques with metaheuristic algorithms [10].
This study aims to investigate and develop different online parameter control strategies for swarm intelligence algorithms at runtime through reinforcement learning [11]. The proposal integrates several variants of the Q–Learning algorithm into PSO to control its parameters online. Each variant of Q–Learning has its own characteristics and adapts to different situations, making it possible to effectively address a variety of scenarios. In the first strategy, the Q–table stores new parameter values of PSO; the second integrates one table for each parameter; and the third is state–free. In order to demonstrate that the proposed techniques are viable, and to compare their performance, some of the most challenging instances of the multidimensional knapsack problem (MKP) are solved. MKP is a well–known NP-complete optimization problem consisting of items, each with a profit and an n–dimensional weight, and a set of knapsacks to be filled [12]. The objective is to choose a subset of items with maximum profit without exceeding the knapsack capacities. This problem was selected because it has a wide range of practical applications and continues to be a topic of interest in the operations research community [13,14,15]. For the computational experiments, 70 of the most challenging MKP instances taken from the OR–Library [16] were used. The results were evaluated through descriptive analysis and statistical inference, mainly hypothesis contrasts applying non–parametric tests.
The rest of the manuscript is organized as follows: Section 2 presents a robust analysis of current related work on hybridizations between learning techniques and metaheuristics. Section 3 details the conceptual framework of the study. Section 4 explains how reinforcement learning techniques are applied to particle swarm optimization. Section 5 describes the phases of the experimental design, while Section 6 discusses the results achieved. Finally, the conclusions and future work are given in Section 7.
2. Related Work
In recent years, the integration between swarm intelligence algorithms and machine learning has been extensively investigated [10]. To this end, various approaches have been described to implement self-adaptive and learning capabilities in these techniques. For example, in ref. [17], the virus optimization algorithm is modified to add self-adaptive capabilities to its parameters. The performance was compared on optimization instances of different sizes, and similar or better performance was observed for the improved version. Similarly, in ref. [18], the firefly algorithm was enhanced to automatically compute the parameter a that controls the balance between exploration and exploitation. In ref. [19], an analogous strategy modifies the cuckoo search algorithm to balance the intensification and diversification phases. The work published in [20] proposes improving the artificial bee colony by incorporating self-adaptive capabilities in its agents; that study aims to improve the convergence ratio by altering, during the run, the parameter that controls it. A comparable work can be seen in [21], where the differential evolution strategy is modified by adding auto–tuning qualities to the scalability factor and the crossover ratio to increase the convergence rate. The manuscript [22] describes an improvement of the discrete particle swarm optimization algorithm, which includes adaptive parameter control to balance social and cognitive learning: a new formulation updates the probability factor of the Bernoulli distribution, which in turn updates the social learning and cognitive learning parameters. Following the same line, in [23,24], self-adaptive evolutionary algorithms were proposed. The first details an enhancement through a population of operators that change based on a punishment-and-reward scheme, depending on the operator’s quality. The second presents an improvement where the crossover and mutation probability parameters are adapted to balance exploration and exploitation in the search for solutions. Both cases show outstanding results. In ref. [25], self-adjustment was applied to the flower pollination algorithm. This proposal balances the exploration and exploitation phases during the run through an adaptive parameter strategy. A recent wolf pack algorithm was altered to auto-tune its parameter w that controls prey odor perception [26]; the new version intensifies the local search toward more promising zones. Finally, the work in [27] proposes integrating the autonomous search paradigm into the dolphin echolocation algorithm for population self-regulation. The paradigm is applied when stagnation at a local optimum is detected.
Integrating metaheuristics with machine learning, regression, and clustering techniques has also been the subject of studies [28,29,30,31,32]. For example, in ref. [33], the authors propose an evolutionary algorithm that controls its parameters and operators. This is accomplished by integrating a controller module that applies learning rules, measuring the impact of changes and assigning restarts to the parameter set. Under this same paradigm, the work reported in [34] explores the integration of the variable neighborhood search algorithm with reinforcement learning, applying reactive techniques for parameter adjustment and selecting local searches to balance the exploration and exploitation phases. In ref. [35], a machine learning model was developed using a support vector machine, which can predict the quality of solutions for a problem instance. This model then adjusts the parameters and guides the metaheuristic to more promising search regions. In refs. [36,37], the authors propose the integration of PSO with regression models and clustering techniques for population management and parameter tuning, respectively. In ref. [38], another combination of PSO and classifier algorithms is presented with the goal of deparameterizing the optimization method. In this approach, a previously trained model is used to classify the solutions found by the particles, which improves the exploration of the search space and the quality of the solutions obtained. Similar to previous works, in ref. [39], PSO is again enhanced with a learning model to control its parameters, obtaining competitive performance compared to other parameter adaptation strategies. The manuscript [40] presents the hybridization of PSO with Gaussian process regression and support vector machines for real–time parameter adjustment. The study concluded that the hybrid offers superior performance compared to traditional approaches. The work presented in ref. [41] integrates randomized priority search with an inductive decision tree data mining algorithm for parameter adjustment through a feedback loop. Finally, in ref. [42], the authors propose the integration of algorithms derived from ant colony optimization with fuzzy logic to control the pheromone evaporation rate, the exploration probability factor, and the number of ants for solving the feature selection problem.
More specifically, reviewing studies on integrating metaheuristics and reinforcement learning, we find many works combining these two techniques to improve the search in optimization problems [43]. For example, in [44], the authors proposed the integration of bee swarm optimization with Q–Learning to improve its local search. In this approach, the artificial individuals become intelligent agents that gain and accumulate knowledge as the algorithm progresses, thus improving the effectiveness of the search. Along the same lines, ref. [45] proposes integrating a learning–based approach into the ant colony algorithm to control its parameters. This is carried out by assigning rewards to parameter changes in the algorithm, storing them in an array, and learning the best values to apply to each parameter at runtime. In [46], another combination of a metaheuristic algorithm with reinforcement learning techniques is proposed; in this case, tabu search was integrated with Q–Learning to find promising regions in the search space when the algorithm is stuck at a local optimum. The work published in [47] also explored the application of reinforcement learning in the context of an optimization problem; there, a biased–randomized heuristic was combined with reinforcement learning techniques to account for the variations generated by changes in the rewards obtained. Finally, ref. [48] presents the implementation of a Q–Learning algorithm to assist in training neural networks to classify medical data. In that approach, a parameter tuning process was carried out in radial–basis neural networks using stateless Q–Learning. Although the latter is not an optimization algorithm, the work is relevant to our research.
Even though machine learning techniques have already been explored in bio-inspired algorithms, it is worth continuing research on this type of hybridization. Our strategy applies reinforcement learning to PSO, using Q–Learning and new variations that have not yet been studied. In this context, we examined how Q–Learning can be modified to provide better results; done properly, this approach can be fruitful.
4. Developed Solution
In this section, we detail different ways to apply Q–Learning to PSO and how these implementations improve its performance when solving NP-complete combinatorial optimization problems.
4.1. Particle Swarm Optimization
Particle swarm optimization is a swarm intelligence algorithm inspired by the group behavior observed in flocks of birds and schools of fish [64]. In this algorithm, each particle represents a possible solution to the problem and has a velocity vector and a position vector. PSO consists of a cyclical process in which each particle sees its trajectory influenced by two types of learning: social learning, acquired through knowledge of the other particles in the swarm, and cognitive learning, acquired through the particle's own experience [5,65]. In the traditional PSO, the velocity of particle $i$ at iteration $t$ is represented as a vector $v_{i}^{t}$, while its position is described as $x_{i}^{t}$. Initially, each particle's position vector and velocity vector are randomly created. Then, during the execution of the algorithm, the particles are moved using Equations (3) and (4):

$v_{i}^{t+1} = w\,v_{i}^{t} + c_{1} r_{1} (pBest_{i} - x_{i}^{t}) + c_{2} r_{2} (gBest - x_{i}^{t})$ (3)

$x_{i}^{t+1} = x_{i}^{t} + v_{i}^{t+1}$ (4)

where $w$ is the inertia weight, $c_{1}$ and $c_{2}$ are the acceleration coefficients that weight cognitive and social learning, respectively, and $r_{1}$ and $r_{2}$ are uniformly distributed random values in the range $[0,1]$. The method needs a memory, called $pBest_{i}$, representing the best position met by the $i$–th particle. The best particle is stored in $gBest$ and is reported when the algorithm ends. Algorithm 2 summarizes the PSO search procedure.
Algorithm 2: PSO pseudocode
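To make Equations (3) and (4) concrete, the following is a minimal Java sketch of the per-particle update; the class, method, and variable names are illustrative assumptions, not the exact code used in this work.

```java
import java.util.Random;

// Minimal sketch of the PSO update of Equations (3) and (4).
// Names (move, pBest, gBest) are illustrative assumptions.
final class PsoUpdate {
    static final Random RNG = new Random();

    static void move(double[] x, double[] v, double[] pBest, double[] gBest,
                     double w, double c1, double c2) {
        for (int j = 0; j < x.length; j++) {
            double r1 = RNG.nextDouble();          // r1, r2 ~ U(0, 1)
            double r2 = RNG.nextDouble();
            v[j] = w * v[j]
                 + c1 * r1 * (pBest[j] - x[j])     // cognitive component
                 + c2 * r2 * (gBest[j] - x[j]);    // social component, Equation (3)
            x[j] = x[j] + v[j];                    // position update, Equation (4)
        }
    }
}
```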
As a solution to the parameter optimization problem, integrating the Q–Learning algorithm into a traditional PSO is proposed. The objective of this combination is for PSO to acquire the ability to adapt its parameters online, that is, during the execution of the algorithm.
The approach starts by declaring the swarm, its particles, the necessary velocity vectors, and the Q–table. The normal course of a PSO algorithm is then followed. The Q–Learning module is invoked when the algorithm stagnates; stagnation is detected following the approximate theory of nature-inspired optimization algorithms, using a uniformly distributed random value. This module analyzes the environment for possible state changes and then updates the Q–table with the appropriate reward. If this is the first call to the module, these steps are skipped. Subsequently, a decision must be made between two possible actions to adjust the algorithm's parameters: one comes from the policy derived from Q, while the other is to change the parameters randomly. The difference between these two options is that the first provides a better return based on the existing knowledge, whereas the second allows previously unknown knowledge to be discovered. Finally, the Q–Learning module transfers control back to PSO to continue operating. Figure 1 depicts the flow of the proposal.
In this work, the proposed strategies integrate different levels of reinforcement learning into swarm intelligence algorithms: Classic Q–Learning, Modified Q–Learning, and Single State Q–Learning. A sketch of the data structures behind each strategy is given after the list below.
- (a) Classic Q–Learning (CQL): The first strategy directly applies Q–Learning theory, with a single Q–table. This table represents states as combinations of the possible values of each parameter (in intervals between 0 and 1) and actions as transitions from one state to another. The process computes the fitness variation from the previous invocation of the module to the current call, assigning a positive or negative reward according to the objective function. Then, the Q–table is updated for the corresponding state/action pair. Finally, the most favorable action is derived from the current state and the greedy policy, and the new set of parameters is applied to the swarm.
- (b) Modified Q–Learning (MQL): The second strategy is a variation of classic Q–Learning. It splits the Q–table by parameter and, furthermore, decreases the number of possible actions, allowing the agent only to move forward, move backward, or stay in a state. In this method, each particle is used individually for training the Q–tables and modifying the parameters. This means that, unlike the previous method, this strategy is invoked by a single individual of the swarm instead of the entire swarm. Each particle must therefore store its current state and action as attributes, alongside its velocity and position.
- (c) Single State Q–Learning (SSQL): The third strategy replaces the Q–table with an array, removing states and looking only at the changes produced by actions. The new Q–array contains actions representing the amount by which a parameter is modified. As in the previous version, the Q–array is split by parameter and each particle of the swarm can use it individually. These changes remove the state dependency, allowing more precise parameter changes.
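To visualize how the three strategies differ, the following sketch contrasts the data structures each one maintains; the sizes, discretization levels, and field names are assumptions made for illustration only.

```java
// Illustrative data structures behind the three strategies; all sizes and
// names are assumptions for this sketch, not the exact values of this work.
final class QStructures {
    static final int NUM_LEVELS        = 10;  // discretization levels of a parameter in [0, 1]
    static final int NUM_JOINT_STATES  = NUM_LEVELS * NUM_LEVELS * NUM_LEVELS; // 3 parameters combined
    static final int NUM_JOINT_ACTIONS = NUM_JOINT_STATES;                     // transition to any state
    static final int NUM_DELTAS        = 5;   // candidate amounts of change for a parameter

    // (a) CQL: a single joint table indexed by [state][action].
    double[][] cqlTable = new double[NUM_JOINT_STATES][NUM_JOINT_ACTIONS];

    // (b) MQL: one small table per parameter; only 3 actions
    //     (move backward, stay, move forward) within the parameter's interval.
    double[][] wTable  = new double[NUM_LEVELS][3];
    double[][] c1Table = new double[NUM_LEVELS][3];
    double[][] c2Table = new double[NUM_LEVELS][3];

    // (c) SSQL: no states; one value per action (amount to modify) per parameter.
    double[] wActions  = new double[NUM_DELTAS];
    double[] c1Actions = new double[NUM_DELTAS];
    double[] c2Actions = new double[NUM_DELTAS];
}
```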
4.2. Integration
Algorithm 3 shows the steps to follow to integrate reinforcement learning into PSO.
Algorithm 3: Integration of reinforcement learning into PSO
Firstly, the procedure determines the states that represent the current condition or configuration of the PSO algorithm. These states could include the positions of the particles, their velocities, or any other relevant variables. Next, the algorithm identifies the actions that can be taken to modify the parameters of the PSO algorithm. These actions could involve changing the inertia weight, acceleration coefficients, or any other parameter that influences the behavior of the particles. In the third step, the method defines the reward function that evaluates the performance of the PSO algorithm based on the solutions obtained. The reward function should provide feedback on how well the algorithm is performing and guide the Q–Learning process. The fourth step creates the Q–table, a lookup table that maps state–action pairs to their corresponding Q–values; it is initially filled with random values.
The iterative process runs for as long as PSO needs it. At each step, the current state of PSO captures information such as the positions and velocities of the particles and the best solution found so far. Next, the most appropriate action is chosen using the ε–greedy method, which is applicable here because all actions are equally available: with a small probability the action is chosen uniformly at random, and otherwise the action with the highest Q–value for the current state is selected. The selected action modifies the parameter configuration of PSO within the allowed value range. Then, PSO is executed to obtain solutions, its performance is evaluated using the reward function, and the Q–value of the previous state–action pair is modified by applying the Q–Learning update rule. Finally, the previous state is updated to the current state, preparing for the next iteration of the algorithm. Over time, the Q–table is updated, and the PSO algorithm learns to select actions that lead to better solutions.
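As an illustration of this loop, and assuming a discretized state/action table, the ε–greedy choice and the standard Q–Learning update rule could be sketched as follows; the names and structure are assumptions, not the exact implementation of this work.

```java
import java.util.Random;

// Sketch of the epsilon-greedy action choice and the standard Q-Learning
// update rule used to adjust PSO parameters; names and values are assumptions.
final class QLearningStep {
    static final Random RNG = new Random();

    // Exploit the best known action with probability 1 - epsilon,
    // otherwise explore a uniformly random action.
    static int selectAction(double[][] q, int state, double epsilon) {
        if (RNG.nextDouble() < epsilon) {
            return RNG.nextInt(q[state].length);
        }
        int best = 0;
        for (int a = 1; a < q[state].length; a++) {
            if (q[state][a] > q[state][best]) best = a;
        }
        return best;
    }

    // Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    static void update(double[][] q, int s, int a, double reward, int sNext,
                       double alpha, double gamma) {
        double maxNext = q[sNext][0];
        for (int aNext = 1; aNext < q[sNext].length; aNext++) {
            maxNext = Math.max(maxNext, q[sNext][aNext]);
        }
        q[s][a] += alpha * (reward + gamma * maxNext - q[s][a]);
    }
}
```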
Before implementing, we analyze the time complexity of each component and of their integration. Firstly, the time complexity of PSO depends mainly on the number of particles and the number of iterations. At each iteration, the positions and velocities of the particles are updated, and the objective function is evaluated. In this case, these quantities are constant, and the cost really depends on the dimensionality of the problem. Therefore, the complexity of the PSO algorithm itself is $O(K \cdot n)$, where $K$ represents the number of particles times the number of iterations, and $n$ is the number of decision variables. On the other hand, the time complexity of Q–Learning is based on the size of the Q–table, which is determined by the number of possible states and actions; if the search space and the Q–table are large, the complexity increases. In our case, we use a value range for the parameters that remains constant during the run of PSO. We can therefore guarantee that the three proposals are efficient, because none of them exceeds polynomial time.
5. Experimental Setup
In order to comprehensively evaluate the performance of the proposed hybridizations, it is crucial to conduct a robust analysis that encompasses various aspects. One essential step in this analysis is to compare the solutions obtained by each strategy with those of the classic version of PSO. By benchmarking the solutions against PSO's results, we establish a reliable reference point for evaluating the effectiveness and efficiency of the proposed hybridizations. This enables us to gauge the extent to which the algorithmic enhancements contribute to improving solution quality and reaching optimality. Moreover, this comparative analysis serves to validate the credibility and competitiveness of the proposed hybridizations in the field. By showcasing their ability to achieve results on par with or surpassing PSO, we can establish the strength of our approach and its potential to outperform existing methods.
To ensure the robustness of the performance analysis, it is important to employ a diverse set of benchmark problems that accurately represent the challenges and complexities encountered in real-world scenarios. By testing the proposed hybridizations on these benchmarks, we can assess their adaptability, generalizability, and ability to handle various problem instances effectively.
Figure 2 indicates the steps taken to examine the three proposals’ performance thoroughly. In addition, we establish objectives and recommendations for the experimental phase, in order to demonstrate that the proposed approaches allow for improving the optimization of metaheuristic parameters.
The analyses include: (a) the resolution time to determine the difference produced when applying the different methods, (b) the best value found by each method, which is an important indicator to assess future results, and finally, (c) an ordinal analysis and statistical tests to determine if one method is significantly better than another.
For the experimental phase, several optimization instances were solved in order to measure the performance of the different proposed methods. These instances were taken from the OR–Library, a virtual library first described by J.E. Beasley in 1990 [16], in which various test data sets can be found. In this study, 70 binary instances of the multidimensional knapsack problem were used (from MKP1 to MKP70). Table 1 details each instance, indicating its optimal solution, the number of backpacks, and the number of objects. For instances MKP56 to MKP70, there are no recorded optimal values because they could not be solved by exact methods; for this reason, we use “unknown” to indicate that this value has not been found to date.
Equation (5) defines the formulation of the MKP:

$\max \; Z = \sum_{j=1}^{n} p_{j} x_{j}$, subject to $\sum_{j=1}^{n} w_{kj} x_{j} \le c_{k}$, $k = 1, \dots, m$, with $x_{j} \in \{0, 1\}$ (5)

where $x_{j}$ describes whether or not object $j$ is included in a backpack, $n$ represents the total number of objects, and $m$ is the number of backpacks. Each object has a real value $p_{j}$ that represents its profit and is used to calculate the objective function. Finally, $w_{kj}$ stores the weight of object $j$ in backpack $k$, which has maximum capacity $c_{k}$. As can be seen, this is a combinatorial problem that deals with the dilemma of including or not an object in a certain backpack that has a certain capacity.
To execute a metaheuristic of a continuous nature in a binary domain, a binarization phase is required after the solution vector changes [69]. Here, a standard sigmoid function was used as the transfer function, that is, $S(v_{i}^{j}) = 1/(1 + e^{-v_{i}^{j}})$, and a value $r$ drawn uniformly at random from $[0,1]$ is compared against it. If $r < S(v_{i}^{j})$, the discretized variable is set to 1; otherwise, it is set to 0.
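A minimal sketch of this two-step binarization (sigmoid transfer followed by comparison with a uniform random value), with names chosen only for illustration:

```java
import java.util.Random;

// Sketch of the sigmoid-based binarization applied to each continuous
// component of a particle after the position update (names are assumptions).
final class Binarization {
    static final Random RNG = new Random();

    static int[] toBinary(double[] position) {
        int[] x = new int[position.length];
        for (int j = 0; j < position.length; j++) {
            double s = 1.0 / (1.0 + Math.exp(-position[j])); // sigmoid transfer function
            double r = RNG.nextDouble();                      // uniform value in [0, 1]
            x[j] = (r < s) ? 1 : 0;                           // discretization step
        }
        return x;
    }
}
```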
The performance of each method is evaluated after solving each of the 70 instances a total of 30 times. Once the complete set of results is obtained from all executions and instances, an outlier analysis is performed to study possible irregular results. Influential outliers were detected using the Tukey test, which takes as reference the interquartile range, that is, the difference between the first quartile (Q1) and the third quartile (Q3). In our case, a result is considered a mild outlier if it lies 1.5 times that distance beyond one of those quartiles, or an extreme outlier if it lies three times that distance. This test was implemented using a spreadsheet to calculate the statistical values automatically. All outliers were removed to avoid distorting the samples, and new runs were performed to replace the removed solutions. Moreover, we use the metric of the relative percentage difference (RPD) between the best known solution to the problem and the best solution found, calculated as $RPD = 100 \cdot (Z_{opt} - Z_{best}) / Z_{opt}$, where $Z_{opt}$ is the best known value and $Z_{best}$ is the best value found.
As a next step, a descriptive and statistical analysis of the results was carried out. For the former, metrics such as the maximum and minimum values, the mean, the sample standard deviation, the median, and the interquartile range are used to compare the results generated by the three methods. The latter corresponds to statistical inference, in which two hypotheses are contrasted to reveal the one with the greatest statistical significance. The tests employed were: (a) the Shapiro–Wilk test for normality and (b) the Wilcoxon–Mann–Whitney test for heterogeneity. In addition, to better understand the robustness of the analysis, it is important to highlight that, given the independent nature of the instances, the results obtained for any of them do not affect the results of the others; likewise, repetitions of the same instance are independent of each other.
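For reference, the sketch below computes the RPD metric and the Mann–Whitney–Wilcoxon p-value for two independent result samples; the use of Apache Commons Math for the test is an assumption made for this illustration, not a statement about the tooling used in the original analysis.

```java
import org.apache.commons.math3.stat.inference.MannWhitneyUTest;

// Sketch of two metrics used in the analysis; relying on Apache Commons Math
// for the Mann-Whitney test is an assumption made for illustration.
final class ResultAnalysis {
    // Relative percentage difference between the best known value and the best value found.
    static double rpd(double bestKnown, double bestFound) {
        return 100.0 * (bestKnown - bestFound) / bestKnown;
    }

    // Two-sided Mann-Whitney-Wilcoxon p-value for two independent samples of results.
    static double mannWhitneyPValue(double[] sampleA, double[] sampleB) {
        return new MannWhitneyUTest().mannWhitneyUTest(sampleA, sampleB);
    }
}
```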
In [70], the parameter values with the best average results in terms of swarm performance are described. Considering this, the initial values for the PSO parameters were set according to that reference. A sampling phase was carried out for the Q–Learning parameters to determine the values that offer the best results, and the best initial configuration was selected from it. Finally, all the methods were coded in the Java 1.8 programming language and executed on a workstation with a Windows 10 Enterprise operating system, an AMD Ryzen 7 1700m 8–core 3.64 GHz processor, and 16 GB of RAM at 1197.1 MHz. It is important to note that parallel implementation was not required. Instances, data, and codes are available in [71,72,73].
6. Discussion
All algorithms were run 30 times in the testing phase for each instance. Results were recorded, distinguishing each method to be further compared (native PSO or NPSO, classic Q–Learning or CQL, modified Q–Learning or MQL, and single state Q–Learning or SSQL).
Table 2 summarizes how many known optimums were found by each version and, for the instances with unknown optimums, how many of them could reach the best solution found within a limited testing time. We employ a cut–off equal to five minutes; if an approach exceeds this bound, it is not included in the results.
Analyzing only these results, we can see that the number of optimal values achieved by the native PSO is higher than that achieved by the basic Q–Learning implementation; in contrast, both are overshadowed by our modified version of Q–Learning and the single state Q–Learning. Thus, we can preliminarily observe that: (a) the performance of basic Q–Learning is inferior to that of native PSO, (b) a significant difference between modified Q–Learning and single state Q–Learning cannot yet be detected in terms of known optimums, and (c) there is a noticeable difference between modified Q–Learning and single state Q–Learning in terms of the unknown optimums reached. In general, the single state Q–Learning obtained the best results. All the results obtained by each version of the algorithms are presented in Table 3, Table 4, Table 5 and Table 6.
Now, to demonstrate more robustly which approach works best, we take more restricted instances of MKP to graph the distribution of best values generated by each strategy. These instances have many objects to select and a small number of backpacks to use.
Figure 3 shows the convergences of each method. For the MKP06 instance, we can see a similar convergence among strategies, with classic Q–Learning being the version with the latest convergence compared to the others. For the MKP35 instance, large convergence differences are observed in the four strategies. Here, PSO is the algorithm with the latest convergence, and the modified version of Q–Learning and the Single State version have the earliest convergence. For the MKP70 instance, a similar performance can be seen between the modified Q–Learning and single state Q–Learning, while the convergences between default PSO and classic Q–Learning are the latest.
Observing the results presented in the distributions of Figure 4, it can again be concluded that, in general, the standard PSO obtains final results very similar to those of its version assisted by classic Q–Learning. With these results and the observations mentioned above, we dare say that a possible explanation for this phenomenon is the high time cost required to train all the action/state pairs of the Q–table, so that, by the end of the execution, the algorithm cannot find a better parametric configuration than the initial one. This possible problem is mitigated in the other two implemented methods due to the considerable reduction and division of the Q–table. Lastly, the PSO algorithms assisted by the modified Q–Learning and the single state Q–Learning obtain significantly better results. Here, we observe that both algorithms train during their runtime and obtain parameter configurations adjusted to the instance.
Following up with a robust review of the results, we employed the two statistical tests mentioned in Section 5: (a) a normality assessment and (b) a contrast of hypotheses to determine whether or not the samples come from an equidistributed sequence. To determine whether the observations (runs per instance) follow a Gaussian distribution, we establish $H_0$ as “the samples follow a normal distribution” and $H_1$ as the opposite. The cutoff for the $p$–value is 0.05; results under this threshold mean the test is significant ($H_0$ is rejected). The results confirmed that the samples do not follow a normal distribution, so we employ the non-parametric Mann–Whitney–Wilcoxon test. Here, we assume as the null hypothesis $H_0$ that the native methods generate better values than their versions improved by Q–Learning, and $H_1$ suggests otherwise. In total, six tests were carried out, and the results are presented in Table 7 and Table 8. In the comparison between native PSO and the classic Q–Learning, we can note that the former exceeds the 95% confidence threshold in 59 of the 70 instances, while the latter does so in only one instance.
On the other hand, in the comparison between native PSO and the modified Q–Learning, it is possible to observe that MQL surpasses the threshold in 44 instances, while NPSO does so only in two instances. Regarding the comparison between NPSO and SSQL, we observe that the latter outperforms the threshold in 53 instances, while NPSO does not exceed the threshold in any instance. Finally, in comparing MQL and the single state version, we can analyze that SSQL beats the threshold in 10 instances, while the modified version of Q–Learning only does so in six instances. Furthermore, in the remaining comparisons, MQL and SSQL exceed the 95% confidence threshold in all instances, while classic Q–Learning and NPSO do not exceed the threshold in any instance.
From all the obtained results, we can conclude that the modified Q–Learning and SSQL perform better than the classic version and the standard PSO.
7. Conclusions
This article presents an approach to improve the efficiency of a swarm intelligence algorithm when solving complex optimization problems by integrating reinforcement learning techniques. Specifically, we use Q–Learning to adjust the parameters of particle swarm optimization for solving several instances of the multidimensional knapsack problem.
The analysis of the data obtained in the testing phase shows that the algorithms assisted by reinforcement learning obtained better results in multiple aspects when compared to the native version of PSO. In particular, the single state Q–Learning assisting PSO finds solutions that, as a whole, have better quality in terms of mean, median, standard deviation, and interquartile range. In addition, it is observed that SSQL achieves earlier convergence in significant instances when compared to the other methods. Notwithstanding the preceding, it is observed that the native PSO has a slightly better general performance than PSO improved by classic Q–Learning. This is attributed to the high time cost required to train all the action/state pairs; here, Q–Learning cannot guarantee that the algorithm finds a better-than-initial parametric configuration at the end of each run. It is suggested to explore the performance of Q–Learning with PSO in other optimization problems beyond the multidimensional knapsack problem. More generally, the effect of different parameter settings for the Q–Learning algorithm, such as the learning rate and the discount factor, should also be explored to evaluate their effectiveness in the reinforcement learning method in conjunction with PSO.
Finally, it is suggested to explore the comparisons of using the Q–learning method with other bio-inspired algorithms, such as the gray wolf optimizer, whale optimization, bald eagle search optimization, and Harris hawks optimization, among others.
In conclusion, this is a promising approach for improving the performance of swarm intelligence algorithms through reinforcement learning when solving optimization problems. More research is needed to explore its effectiveness in other complex optimization problems and to compare the results obtained with other existing methods.