1. Introduction
Nature–inspired optimization techniques are a set of algorithms or methods designed to adapt to and solve complex optimization problems [1]. Metaheuristics achieve this adaptability because they do not depend on the mathematical structure of the problem; instead, they rely on heuristic procedures and intelligent search strategies [2,3]. A subset of metaheuristics, known as swarm intelligence methods, comprises procedures that operate on a population of artificial individuals that cooperate with each other to search for a solution to the problem. Solutions that are good enough are found within a certain time, which is configured through the input parameters that dictate the internal behavior of the execution [4].
Particle swarm optimization (PSO) is probably the most widely applied bio-inspired optimization algorithm of recent decades [5]. This method uses inertia weights, acceleration coefficients, and social coefficients to calculate the movement of its individuals in the search space. These parameters are so relevant to the execution and performance of the algorithm that small adjustments can directly impact the result found [6]. Based on the “No Free Lunch” theorem, we can infer that no universal configuration of this algorithm can provide the best possible solution for all optimization problems [7]. Therefore, the algorithm must be adapted to the problem at hand, which means the parameters must be readjusted when facing different problems. It has been shown that parameter setting drastically affects the final result of the algorithm, and it remains a hot topic [8]. From this, the problem of parameter adjustment arises, which can itself be considered an optimization problem [9]. There are at least two ways to approach parameter tuning: (a) offline tuning, which identifies the best values for a problem during a testing phase and does not modify the parameters during the execution of the algorithm, and (b) online control, which adapts the parameter values during execution according to strategies that can be deterministic, adaptive, or self–adaptive. Due to the lack of a single solution to this problem, the scientific community has searched for hybrid modules inspired by different disciplines. One of these methods is Learnheuristics, which combines machine learning (ML) techniques with metaheuristic algorithms [10].
This study aims to investigate and develop different online parameter control strategies for swarm intelligence algorithms at runtime through reinforcement learning [11]. The proposal integrates several variants of the Q–Learning algorithm into PSO to control its parameters online. Each variant of Q–Learning has its own characteristics and adapts to different situations, making it possible to effectively address a variety of scenarios. In the first strategy, the Q–table stores new parameter values of PSO; the second integrates one table for each parameter; and the third is state–free. In order to demonstrate that the proposed techniques are viable, and to compare their performance, some of the most challenging instances of the multidimensional knapsack problem (MKP) are solved. MKP is a well–known NP-complete optimization problem consisting of items, each with a profit and an n–dimensional weight, and a set of knapsacks to be filled [12]. The objective is to choose a subset of items with maximum profit without exceeding the knapsack capacities. This problem was selected because it has a wide range of practical applications and continues to be a topic of interest in the operations research community [13,14,15]. For the computational experiments, 70 of the most challenging MKP instances taken from the OR–Library [16] were used. The results were evaluated through descriptive analysis and statistical inference, mainly hypothesis contrasts applying non–parametric tests.
The rest of the manuscript is organized as follows: Section 2 presents a robust analysis of current related work on hybridizations between learning techniques and metaheuristics. Section 3 details the conceptual framework of the study. Section 4 explains how reinforcement learning techniques are applied to particle swarm optimization. Section 5 describes the phases of the experimental design, while Section 6 discusses the results achieved. Finally, the conclusions and future work are given in Section 7.
2. Related Work
In recent years, the integration between swarm intelligence algorithms and machine learning has been extensively investigated [10]. To this end, various approaches have been described to implement self-adaptive and learning capabilities in these techniques. For example, in ref. [17], the virus optimization algorithm is modified to add self-adaptive capabilities to its parameters. The performance was compared on optimization instances of different sizes, and similar or better performance was observed for the improved version. Similarly, in ref. [18], the firefly algorithm was enhanced to automatically compute the parameter a that controls the balance between exploration and exploitation. In ref. [19], an analogous strategy modifies the cuckoo search algorithm to balance the intensification and diversification phases. The work published in [20] proposes improving the artificial bee colony by incorporating self-adaptive capabilities in its agents; that study aims to improve the convergence ratio by altering, during the run, the parameter that controls it. A comparable work can be seen in [21], where the differential evolution strategy is modified by adding auto–tuning qualities to the scalability factor and the crossover ratio to increase the convergence rate. The manuscript [22] describes an improvement of the discrete particle swarm optimization algorithm, which includes adaptive parameter control to balance social and cognitive learning: a new formulation updates the probability factor of the Bernoulli distribution, which in turn updates the social learning and cognitive learning parameters. Following the same line, in [23,24], self-adaptive evolutionary algorithms were proposed. The first details an enhancement through a population of operators that change based on a punishment-and-reward scheme, depending on the operator’s quality. The second presents an improvement where the crossover and mutation probability parameters are adapted to balance exploration and exploitation in the search for solutions. Both cases show outstanding results. In ref. [25], self-adjustment was applied to the flower pollination algorithm. This proposal balances the exploration and exploitation phases during the run through an adaptive parameter strategy. A recent wolf pack algorithm was altered to auto-tune its parameter w that controls prey odor perception [26]; the new version intensifies the local search toward more promising zones. Finally, the work in [27] proposes integrating the autonomous search paradigm into the dolphin echolocation algorithm for population self-regulation. The paradigm is applied when stagnation at a local optimum is detected.
Integrating metaheuristics with machine learning, regression, and clustering techniques has also been the subject of studies [28,29,30,31,32]. For example, in ref. [33], the authors propose an evolutionary algorithm that controls its parameters and operators. This is accomplished by integrating a controller module that applies learning rules, measuring the impact of changes and assigning restarts to the parameter set. Under this same paradigm, the work reported in [34] explores the integration of the variable neighborhood search algorithm with reinforcement learning, applying reactive techniques for parameter adjustment and selecting local searches to balance the exploration and exploitation phases. In ref. [35], a machine learning model was developed using a support vector machine, which can predict the quality of solutions for a problem instance. This model then adjusts the parameters and guides the metaheuristic to more promising search regions. In refs. [36,37], the authors propose the integration of PSO with regression models and clustering techniques for population management and parameter tuning, respectively. In ref. [38], another combination of PSO and classifier algorithms is presented with the goal of deparameterizing the optimization method. In this approach, a previously trained model is used to classify the solutions found by the particles, which improves the exploration of the search space and the quality of the solutions obtained. Similar to previous works, in ref. [39], PSO is again enhanced with a learning model to control its parameters, obtaining competitive performance compared to other parameter adaptation strategies. The manuscript [40] presents the hybridization of PSO with Gaussian process regression and support vector machines for real–time parameter adjustment. The study concluded that the hybrid offers superior performance compared to traditional approaches. The work presented in ref. [41] integrates randomized priority search with an inductive decision tree data mining algorithm for parameter adjustment through a feedback loop. Finally, in ref. [42], the authors propose the integration of algorithms derived from ant colony optimization with fuzzy logic to control the pheromone evaporation rate, the exploration probability factor, and the number of ants for solving the feature selection problem.
More specifically, reviewing studies on integrating metaheuristics and reinforcement learning, we find many works combining these two techniques to improve the search in optimization problems [43]. For example, in [44], the authors proposed the integration of bee swarm optimization with Q–Learning to improve its local search. In this approach, the artificial individuals become intelligent agents that gain and accumulate knowledge as the algorithm progresses, thus improving the effectiveness of the search. Along the same lines, ref. [45] proposes integrating a learning–based approach into the ant colony algorithm to control its parameters. This is carried out by assigning rewards to parameter changes in the algorithm, storing them in an array, and learning the best values to apply to each parameter at runtime. In [46], another combination of a metaheuristic algorithm with reinforcement learning techniques is proposed; in this case, tabu search was integrated with Q–Learning to find promising regions in the search space when the algorithm is stuck at a local optimum. The work published in [47] also explored the application of reinforcement learning in the context of an optimization problem; there, a biased–randomized heuristic was combined with reinforcement learning techniques to account for the variations generated by changes in the rewards obtained. Finally, ref. [48] presents the implementation of a Q–Learning algorithm to assist in training neural networks to classify medical data. In that approach, a parameter tuning process was carried out in radial–basis neural networks using stateless Q–Learning. Although the latter is not an optimization algorithm, the work is relevant to our research.
Even though machine learning techniques have already been explored in bio-inspired algorithms, it is worth continuing research on this type of hybridization. Our strategy applies reinforcement learning to PSO, using Q–Learning and new variations that have not yet been studied. In this context, we examined how Q–Learning can be modified to provide better results; done properly, this approach can be fruitful.
4. Developed Solution
In this section, we detail different ways to apply Q–Learning to PSO and how these implementations improve its performance when solving NP-complete combinatorial optimization problems.
4.1. Particle Swarm Optimization
Particle swarm optimization is a swarm intelligence algorithm inspired by the group behavior observed in flocks of birds and schools of fish [64]. In this algorithm, each particle represents a possible solution to the problem and has a velocity vector and a position vector. PSO consists of a cyclical process in which each particle sees its trajectory influenced by two types of learning: social learning, acquired through knowledge of the other particles in the swarm, and cognitive learning, acquired through the particle's own experience [5,65]. In the traditional PSO, the velocity of particle $i$ at iteration $t$ is represented as a vector $v_{i}^{t}$, while its position is described as $x_{i}^{t}$. Initially, each particle's position vector and velocity vector are randomly created. Then, during the execution of the algorithm, the particles are moved using Equations (3) and (4):

$v_{i}^{t+1} = w\,v_{i}^{t} + c_{1} r_{1} (pBest_{i} - x_{i}^{t}) + c_{2} r_{2} (gBest - x_{i}^{t})$ (3)

$x_{i}^{t+1} = x_{i}^{t} + v_{i}^{t+1}$ (4)

where $w$ is the inertia weight, $c_{1}$ and $c_{2}$ are the acceleration coefficients that weight cognitive and social learning, respectively, and $r_{1}$ and $r_{2}$ are uniformly distributed random values in the range $[0,1]$. The method needs a memory, called $pBest_{i}$, representing the best position met by the $i$–th particle. The best particle is stored in $gBest$ and is reported when the algorithm ends. Algorithm 2 summarizes the PSO search procedure.
Algorithm 2: PSO pseudocode
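To make Equations (3) and (4) concrete, the following is a minimal Java sketch of the per-particle update; the class, method, and variable names are illustrative assumptions, not the exact code used in this work.

```java
import java.util.Random;

// Minimal sketch of the PSO update of Equations (3) and (4).
// Names (move, pBest, gBest) are illustrative assumptions.
final class PsoUpdate {
    static final Random RNG = new Random();

    static void move(double[] x, double[] v, double[] pBest, double[] gBest,
                     double w, double c1, double c2) {
        for (int j = 0; j < x.length; j++) {
            double r1 = RNG.nextDouble();          // r1, r2 ~ U(0, 1)
            double r2 = RNG.nextDouble();
            v[j] = w * v[j]
                 + c1 * r1 * (pBest[j] - x[j])     // cognitive component
                 + c2 * r2 * (gBest[j] - x[j]);    // social component, Equation (3)
            x[j] = x[j] + v[j];                    // position update, Equation (4)
        }
    }
}
```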
As a solution to the parameter optimization problem, integrating the Q–Learning algorithm into a traditional PSO is proposed. The objective of this combination is for PSO to acquire the ability to adapt its parameters online, that is, during the execution of the algorithm.
The approach starts by declaring the swarm, its particles, the necessary velocity vectors, and the Q–table. The normal course of a PSO algorithm is then followed. The Q–Learning module is invoked when the algorithm stagnates; stagnation is detected following the approximate theory of nature-inspired optimization algorithms, using a uniformly distributed random value. This module analyzes the environment for possible state changes and then updates the Q–table with the appropriate reward. If this is the first call to the module, these steps are skipped. Subsequently, a decision must be made between two possible actions to adjust the algorithm's parameters: one comes from the policy derived from Q, while the other is to change the parameters randomly. The difference between these two options is that the first provides a better return based on the existing knowledge, whereas the second allows previously unknown knowledge to be discovered. Finally, the Q–Learning module transfers control back to PSO to continue operating. Figure 1 depicts the flow of the proposal.
In this work, the proposed strategies integrate different levels of reinforcement learning into swarm intelligence algorithms: Classic Q–Learning, Modified Q–Learning, and Single State Q–Learning. A sketch of the data structures behind each strategy is given after the list below.
- (a) Classic Q–Learning (CQL): The first strategy directly applies Q–Learning theory, with a single Q–table. This table represents states as combinations of the possible values of each parameter (in intervals between 0 and 1) and actions as transitions from one state to another. The process computes the fitness variation from the previous invocation of the module to the current call, assigning a positive or negative reward according to the objective function. Then, the Q–table is updated for the corresponding state/action pair. Finally, the most favorable action is derived from the current state and the greedy policy, and the new set of parameters is applied to the swarm.
- (b) Modified Q–Learning (MQL): The second strategy is a variation of classic Q–Learning. It splits the Q–table by parameter and, furthermore, decreases the number of possible actions, allowing the agent only to move forward, move backward, or stay in a state. In this method, each particle is used individually for training the Q–tables and modifying the parameters. This means that, unlike the previous method, this strategy is invoked by a single individual of the swarm instead of the entire swarm. Each particle must therefore store its current state and action as attributes, alongside its velocity and position.
- (c) Single State Q–Learning (SSQL): The third strategy replaces the Q–table with an array, removing states and looking only at the changes produced by actions. The new Q–array contains actions representing the amount by which a parameter is modified. As in the previous version, the Q–array is split by parameter and each particle of the swarm can use it individually. These changes remove the state dependency, allowing more precise parameter changes.
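To visualize how the three strategies differ, the following sketch contrasts the data structures each one maintains; the sizes, discretization levels, and field names are assumptions made for illustration only.

```java
// Illustrative data structures behind the three strategies; all sizes and
// names are assumptions for this sketch, not the exact values of this work.
final class QStructures {
    static final int NUM_LEVELS        = 10;  // discretization levels of a parameter in [0, 1]
    static final int NUM_JOINT_STATES  = NUM_LEVELS * NUM_LEVELS * NUM_LEVELS; // 3 parameters combined
    static final int NUM_JOINT_ACTIONS = NUM_JOINT_STATES;                     // transition to any state
    static final int NUM_DELTAS        = 5;   // candidate amounts of change for a parameter

    // (a) CQL: a single joint table indexed by [state][action].
    double[][] cqlTable = new double[NUM_JOINT_STATES][NUM_JOINT_ACTIONS];

    // (b) MQL: one small table per parameter; only 3 actions
    //     (move backward, stay, move forward) within the parameter's interval.
    double[][] wTable  = new double[NUM_LEVELS][3];
    double[][] c1Table = new double[NUM_LEVELS][3];
    double[][] c2Table = new double[NUM_LEVELS][3];

    // (c) SSQL: no states; one value per action (amount to modify) per parameter.
    double[] wActions  = new double[NUM_DELTAS];
    double[] c1Actions = new double[NUM_DELTAS];
    double[] c2Actions = new double[NUM_DELTAS];
}
```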
4.2. Integration
Algorithm 3 shows the steps to follow to integrate reinforcement learning into PSO.
Algorithm 3: Integration of reinforcement learning into PSO
Firstly, the procedure determines the states that represent the current condition or configuration of the PSO algorithm. These states could include the positions of the particles, their velocities, or any other relevant variables. Next, the algorithm identifies the actions that can be taken to modify the parameters of the PSO algorithm. These actions could involve changing the inertia weight, acceleration coefficients, or any other parameter that influences the behavior of the particles. In the third step, the method defines the reward function that evaluates the performance of the PSO algorithm based on the solutions obtained. The reward function should provide feedback on how well the algorithm is performing and guide the Q–Learning process. The fourth step creates the Q–table, a lookup table that maps state–action pairs to their corresponding Q–values; it is initially filled with random values.
The iterative process runs for as long as PSO needs it. At each step, the current state of PSO captures information such as the positions and velocities of the particles and the best solution found so far. Next, the most appropriate action is chosen using the ε–greedy method, which is applicable here because all actions are equally available: with a small probability the action is chosen uniformly at random, and otherwise the action with the highest Q–value for the current state is selected. The selected action modifies the parameter configuration of PSO within the allowed value range. Then, PSO is executed to obtain solutions, its performance is evaluated using the reward function, and the Q–value of the previous state–action pair is modified by applying the Q–Learning update rule. Finally, the previous state is updated to the current state, preparing for the next iteration of the algorithm. Over time, the Q–table is updated, and the PSO algorithm learns to select actions that lead to better solutions.
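As an illustration of this loop, and assuming a discretized state/action table, the ε–greedy choice and the standard Q–Learning update rule could be sketched as follows; the names and structure are assumptions, not the exact implementation of this work.

```java
import java.util.Random;

// Sketch of the epsilon-greedy action choice and the standard Q-Learning
// update rule used to adjust PSO parameters; names and values are assumptions.
final class QLearningStep {
    static final Random RNG = new Random();

    // Exploit the best known action with probability 1 - epsilon,
    // otherwise explore a uniformly random action.
    static int selectAction(double[][] q, int state, double epsilon) {
        if (RNG.nextDouble() < epsilon) {
            return RNG.nextInt(q[state].length);
        }
        int best = 0;
        for (int a = 1; a < q[state].length; a++) {
            if (q[state][a] > q[state][best]) best = a;
        }
        return best;
    }

    // Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    static void update(double[][] q, int s, int a, double reward, int sNext,
                       double alpha, double gamma) {
        double maxNext = q[sNext][0];
        for (int aNext = 1; aNext < q[sNext].length; aNext++) {
            maxNext = Math.max(maxNext, q[sNext][aNext]);
        }
        q[s][a] += alpha * (reward + gamma * maxNext - q[s][a]);
    }
}
```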
Before implementing, we analyze the time complexity of each component and of their integration. Firstly, the time complexity of PSO depends mainly on the number of particles and the number of iterations. At each iteration, the positions and velocities of the particles are updated, and the objective function is evaluated. In this case, these quantities are constant, and the cost really depends on the dimensionality of the problem. Therefore, the complexity of the PSO algorithm itself is $O(K \cdot n)$, where $K$ represents the number of particles times the number of iterations, and $n$ is the number of decision variables. On the other hand, the time complexity of Q–Learning is based on the size of the Q–table, which is determined by the number of possible states and actions; if the search space and the Q–table are large, the complexity increases. In our case, we use a value range for the parameters that remains constant during the run of PSO. We can therefore guarantee that the three proposals are efficient, because none of them exceeds polynomial time.
5. Experimental Setup
In order to comprehensively evaluate the performance of the proposed hybridizations, it is crucial to conduct a robust analysis that encompasses various aspects. One essential step in this analysis is to compare the solutions obtained by each strategy with those of the classic version of PSO. By benchmarking the solutions against PSO's results, we establish a reliable reference point for evaluating the effectiveness and efficiency of the proposed hybridizations. This enables us to gauge the extent to which the algorithmic enhancements contribute to improving solution quality and reaching optimality. Moreover, this comparative analysis serves to validate the credibility and competitiveness of the proposed hybridizations in the field. By showcasing their ability to achieve results on par with or surpassing PSO, we can establish the strength of our approach and its potential to outperform existing methods.
To ensure the robustness of the performance analysis, it is important to employ a diverse set of benchmark problems that accurately represent the challenges and complexities encountered in real-world scenarios. By testing the proposed hybridizations on these benchmarks, we can assess their adaptability, generalizability, and ability to handle various problem instances effectively.
Figure 2 indicates the steps taken to examine the three proposals’ performance thoroughly. In addition, we establish objectives and recommendations for the experimental phase, in order to demonstrate that the proposed approaches allow for improving the optimization of metaheuristic parameters.
The analyses include: (a) the resolution time to determine the difference produced when applying the different methods, (b) the best value found by each method, which is an important indicator to assess future results, and finally, (c) an ordinal analysis and statistical tests to determine if one method is significantly better than another.
For the experimental phase, several optimization instances were solved in order to measure the performance of the different proposed methods. These instances were taken from the OR–Library, a virtual library first described by J.E. Beasley in 1990 [16], in which various test data sets can be found. In this study, 70 binary instances of the multidimensional knapsack problem were used (from MKP1 to MKP70). Table 1 details each instance, indicating its optimal solution, the number of backpacks, and the number of objects. For instances MKP56 to MKP70, there are no recorded optimal values because they could not be solved by exact methods; for this reason, we use “unknown” to indicate that this value has not been found to date.
Equation (5) defines the formulation of the MKP:

$\max \; Z = \sum_{j=1}^{n} p_{j} x_{j}$, subject to $\sum_{j=1}^{n} w_{kj} x_{j} \le c_{k}$, $k = 1, \dots, m$, with $x_{j} \in \{0, 1\}$ (5)

where $x_{j}$ describes whether or not object $j$ is included in a backpack, $n$ represents the total number of objects, and $m$ is the number of backpacks. Each object has a real value $p_{j}$ that represents its profit and is used to calculate the objective function. Finally, $w_{kj}$ stores the weight of object $j$ in backpack $k$, which has maximum capacity $c_{k}$. As can be seen, this is a combinatorial problem that deals with the dilemma of including or not an object in a certain backpack that has a certain capacity.
To execute a metaheuristic of a continuous nature in a binary domain, a binarization phase is required after the solution vector changes [69]. Here, a standard sigmoid function was used as the transfer function, that is, $S(v_{i}^{j}) = 1/(1 + e^{-v_{i}^{j}})$, and a value $r$ drawn uniformly at random from $[0,1]$ is compared against it. If $r < S(v_{i}^{j})$, the discretized variable is set to 1; otherwise, it is set to 0.
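A minimal sketch of this two-step binarization (sigmoid transfer followed by comparison with a uniform random value), with names chosen only for illustration:

```java
import java.util.Random;

// Sketch of the sigmoid-based binarization applied to each continuous
// component of a particle after the position update (names are assumptions).
final class Binarization {
    static final Random RNG = new Random();

    static int[] toBinary(double[] position) {
        int[] x = new int[position.length];
        for (int j = 0; j < position.length; j++) {
            double s = 1.0 / (1.0 + Math.exp(-position[j])); // sigmoid transfer function
            double r = RNG.nextDouble();                      // uniform value in [0, 1]
            x[j] = (r < s) ? 1 : 0;                           // discretization step
        }
        return x;
    }
}
```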
The performance of each method is evaluated after solving each of the 70 instances a total of 30 times. Once the complete set of results is obtained from all executions and instances, an outlier analysis is performed to study possible irregular results. Influential outliers were detected using the Tukey test, which takes as reference the interquartile range, that is, the difference between the first quartile (Q1) and the third quartile (Q3). In our case, a result is considered a mild outlier if it lies 1.5 times that distance beyond one of those quartiles, or an extreme outlier if it lies three times that distance. This test was implemented using a spreadsheet to calculate the statistical values automatically. All outliers were removed to avoid distorting the samples, and new runs were performed to replace the removed solutions. Moreover, we use the metric of the relative percentage difference (RPD) between the best known solution to the problem and the best solution found, calculated as $RPD = 100 \cdot (Z_{opt} - Z_{best}) / Z_{opt}$, where $Z_{opt}$ is the best known value and $Z_{best}$ is the best value found.
As a next step, a descriptive and statistical analysis of the results was carried out. For the former, metrics such as the maximum and minimum values, the mean, the sample standard deviation, the median, and the interquartile range are used to compare the results generated by the three methods. The latter corresponds to statistical inference, in which two hypotheses are contrasted to reveal the one with the greatest statistical significance. The tests employed were: (a) the Shapiro–Wilk test for normality and (b) the Wilcoxon–Mann–Whitney test for heterogeneity. In addition, to better understand the robustness of the analysis, it is important to highlight that, given the independent nature of the instances, the results obtained for any of them do not affect the results of the others; likewise, repetitions of the same instance are independent of each other.
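For reference, the sketch below computes the RPD metric and the Mann–Whitney–Wilcoxon p-value for two independent result samples; the use of Apache Commons Math for the test is an assumption made for this illustration, not a statement about the tooling used in the original analysis.

```java
import org.apache.commons.math3.stat.inference.MannWhitneyUTest;

// Sketch of two metrics used in the analysis; relying on Apache Commons Math
// for the Mann-Whitney test is an assumption made for illustration.
final class ResultAnalysis {
    // Relative percentage difference between the best known value and the best value found.
    static double rpd(double bestKnown, double bestFound) {
        return 100.0 * (bestKnown - bestFound) / bestKnown;
    }

    // Two-sided Mann-Whitney-Wilcoxon p-value for two independent samples of results.
    static double mannWhitneyPValue(double[] sampleA, double[] sampleB) {
        return new MannWhitneyUTest().mannWhitneyUTest(sampleA, sampleB);
    }
}
```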
In [70], the parameter values with the best average results in terms of swarm performance are described. Considering this, the initial values for the PSO parameters were set according to that reference. A sampling phase was carried out for the Q–Learning parameters to determine the values that offer the best results, and the best initial configuration was selected from it. Finally, all the methods were coded in the Java 1.8 programming language and executed on a workstation with a Windows 10 Enterprise operating system, an AMD Ryzen 7 1700m 8–core 3.64 GHz processor, and 16 GB of RAM at 1197.1 MHz. It is important to note that parallel implementation was not required. Instances, data, and codes are available in [71,72,73].
6. Discussion
All algorithms were run 30 times in the testing phase for each instance. Results were recorded, distinguishing each method to be further compared (native PSO or NPSO, classic Q–Learning or CQL, modified Q–Learning or MQL, and single state Q–Learning or SSQL).
Table 2 summarizes how many known optimums were found by each version and, for the instances with unknown optimums, how many of them could reach the best solution found within a limited testing time. We employ a cut–off equal to five minutes; if an approach exceeds this bound, it is not included in the results.
Analyzing only these results, we can see that the number of optimal values achieved by the native PSO is higher than that achieved by the basic Q–Learning implementation; in contrast, both are overshadowed by our modified version of Q–Learning and the single state Q–Learning. Thus, we can preliminarily observe that: (a) the performance of basic Q–Learning is inferior to that of native PSO, (b) a significant difference between modified Q–Learning and single state Q–Learning cannot yet be detected in terms of known optimums, and (c) there is a noticeable difference between modified Q–Learning and single state Q–Learning in terms of the unknown optimums reached. In general, the single state Q–Learning obtained the best results. All the results obtained by each version of the algorithms are presented in Table 3, Table 4, Table 5 and Table 6.
Now, to demonstrate more robustly which approach works best, we take more restricted instances of MKP to graph the distribution of best values generated by each strategy. These instances have many objects to select and a small number of backpacks to use.
Figure 3 shows the convergences of each method. For the MKP06 instance, we can see a similar convergence among strategies, with classic Q–Learning being the version with the latest convergence compared to the others. For the MKP35 instance, large convergence differences are observed in the four strategies. Here, PSO is the algorithm with the latest convergence, and the modified version of Q–Learning and the Single State version have the earliest convergence. For the MKP70 instance, a similar performance can be seen between the modified Q–Learning and single state Q–Learning, while the convergences between default PSO and classic Q–Learning are the latest.
Observing the results presented in the distributions of Figure 4, it can again be concluded that, in general, the standard PSO obtains final results very similar to those of its version assisted by classic Q–Learning. With these results and the observations mentioned above, we dare say that a possible explanation for this phenomenon is the high time cost required to train all the action/state pairs of the Q–table, so that, by the end of the execution, the algorithm cannot find a better parametric configuration than the initial one. This possible problem is mitigated in the other two implemented methods due to the considerable reduction and division of the Q–table. Lastly, the PSO algorithms assisted by the modified Q–Learning and the single state Q–Learning obtain significantly better results. Here, we observe that both algorithms train during their runtime and obtain parameter configurations adjusted to the instance.
Following up with a robust review of the results, we employed the two statistical tests mentioned in Section 5: (a) a normality assessment and (b) a contrast of hypotheses to determine whether or not the samples come from an equidistributed sequence. To determine whether the observations (runs per instance) follow a Gaussian distribution, we establish $H_0$ as “the samples follow a normal distribution” and $H_1$ as the opposite. The cutoff for the $p$–value is 0.05; results under this threshold mean the test is significant ($H_0$ is rejected). The results confirmed that the samples do not follow a normal distribution, so we employ the non-parametric Mann–Whitney–Wilcoxon test. Here, we assume as the null hypothesis $H_0$ that the native methods generate better values than their versions improved by Q–Learning, and $H_1$ suggests otherwise. In total, six tests were carried out, and the results are presented in Table 7 and Table 8. In the comparison between native PSO and the classic Q–Learning, we can note that the former exceeds the 95% confidence threshold in 59 of the 70 instances, while the latter does so in only one instance.
On the other hand, in the comparison between native PSO and the modified Q–Learning, it is possible to observe that MQL surpasses the threshold in 44 instances, while NPSO does so only in two instances. Regarding the comparison between NPSO and SSQL, we observe that the latter outperforms the threshold in 53 instances, while NPSO does not exceed the threshold in any instance. Finally, in comparing MQL and the single state version, we can analyze that SSQL beats the threshold in 10 instances, while the modified version of Q–Learning only does so in six instances. Furthermore, in the remaining comparisons, MQL and SSQL exceed the 95% confidence threshold in all instances, while classic Q–Learning and NPSO do not exceed the threshold in any instance.
From all the obtained results, we can conclude that the modified Q–Learning and SSQL perform better than the classic version and the standard PSO.
7. Conclusions
This article presents an approach to improve the efficiency of a swarm intelligence algorithm when solving complex optimization problems by integrating reinforcement learning techniques. Specifically, we use Q–Learning to adjust the parameters of particle swarm optimization for solving several instances of the multidimensional knapsack problem.
The analysis of the data obtained in the testing phase shows that the algorithms assisted by reinforcement learning obtained better results in multiple aspects when compared to the native version of PSO. In particular, the single state Q–Learning assisting PSO finds solutions that, as a whole, have better quality in terms of mean, median, standard deviation, and interquartile range. In addition, it is observed that SSQL achieves earlier convergence in significant instances when compared to the other methods. Notwithstanding the preceding, it is observed that the native PSO has a slightly better general performance than PSO improved by classic Q–Learning. This is attributed to the high time cost required to train all the action/state pairs; here, Q–Learning cannot guarantee that the algorithm finds a better-than-initial parametric configuration at the end of each run. It is suggested to explore the performance of Q–Learning with PSO in other optimization problems beyond the multidimensional knapsack problem. More generally, the effect of different parameter settings for the Q–Learning algorithm, such as the learning rate and the discount factor, should also be explored to evaluate their effectiveness in the reinforcement learning method in conjunction with PSO.
Finally, it is suggested to explore the comparisons of using the Q–learning method with other bio-inspired algorithms, such as the gray wolf optimizer, whale optimization, bald eagle search optimization, and Harris hawks optimization, among others.
In conclusion, this is a promising approach for improving the performance of swarm intelligence algorithms through reinforcement learning when solving optimization problems. More research is needed to explore its effectiveness in other complex optimization problems and to compare the results obtained with other existing methods.