AutoMH: Automatically Create Evolutionary Metaheuristic Algorithms Using Reinforced Learning

Abstract: Machine learning research has solved problems in many domains. An open area of research is the use of machine learning to solve optimisation problems. An optimisation problem can be solved by a metaheuristic algorithm, which is able to find a solution in a reasonable amount of time. However, a problem remains: the time required to find a metaheuristic algorithm whose configuration is suitable for solving a given set of optimisation problems. This paper presents an approach that automatically creates metaheuristic algorithms aided by reinforcement learning. In the experiments performed, the approach succeeded in creating a metaheuristic algorithm that solved a large number of different continuous-domain optimisation problems. The implications of this work are immediate, as it describes a basis for the generation of metaheuristic algorithms in real time.


Introduction
The use of metaheuristic algorithms has become a widely adopted approach to solving a variety of optimisation problems [1], in fields such as health, logistics, agriculture, mining, space, and robotics, to name a few. The diversity of metaheuristic algorithms has grown widely in the last decade [1], with a great variety of components, routines, selectors, internals, and, especially, parameters. This diversity leads to difficulties such as finding a specific parameter configuration for a specific type of optimisation problem, a situation that makes it hard to choose a metaheuristic algorithm adequately.
Various strategies have been adopted to minimise the effort of manual configuration. One such area is machine learning, specifically reinforcement learning [2], where various advances have been made: for example, a general method that reformulates reinforcement learning problems as optimisation tasks and applies the particle swarm metaheuristic algorithm to find optimal solutions [3], a solution to the vehicle routing problem [4], feature selection [5], the design of a plane frame [6], and resource allocation problems [7]. Other approaches include learnheuristics [8], Q-learning [9], meta-learning [10], and hyper-heuristics [11,12], which provide diverse perspectives on tackling optimisation problems. In [13], the use of multi-agent reinforcement learning is proposed, which allows for an upgrade over the single agent generally used in the reinforcement learning area.
In algorithm generation, one approach constructs a centralised hybrid metaheuristic cooperative strategy to solve optimisation problems [14]. Another approach [15] uses a set of instructions to create machine learning algorithms in real time. A basis for understanding the scope of these approaches can be found in [16], which provides a taxonomy of combinations of metaheuristics with mathematical programming, constraint programming, and machine learning. Open problems and the current status of the area can be found in [17] and [18].
This research proposes an extension of the basic reinforcement learning model. The agent, called the learning agent, contains two processes: the analysis process and the action process. The environment integrates the most significant changes: in addition to incorporating a set of optimisation problems to solve, it combines a swarm of non-intelligent agents, whose purpose is to execute the optimisation problems with a metaheuristic algorithm that improves its internal structure during each evolutionary cycle. The final objective is to find, through an evolutionary generation process, the best metaheuristic algorithm(s) that solve the set of problems entered by the user.
Within its scope, this research contributes to the area of high-level data-driven metaheuristics, specifically to the topic of metaheuristic generation by reinforcement learning. The main benefits expected from this work are as follows:
• Find new metaheuristic algorithms in real time that solve one or more optimisation problems given by the user.
• Allow the extension of new components to generate new metaheuristic algorithms, such as new operators or new indivisible exploration and intensification functions.
• Contribute to the area of machine learning, specifically to the integration of reinforcement learning to solve optimisation problems.
The rest of this paper is structured as follows: Section 2 details the proposed design and the formalisation of its parts. Section 3 details the tests performed and their results. Finally, Section 4 concludes and provides guidelines for future work.

Proposed reinforcement learning framework for the automatic creation of metaheuristics
The purpose of the proposed reinforcement learning framework (AutoMH) is to find, through a learning agent, one or more metaheuristic algorithms capable of finding the best solution for a portfolio of optimisation problems P.

Definition 1.
A continuous optimisation problem $p \in P$ is defined by: minimise $p(x)$ subject to $l \leq x \leq u$, where $x = \{x_1, x_2, \ldots, x_d\}$, $d$ is the number of dimensions of the optimisation problem, and $l = \{l_1, l_2, \ldots, l_d\}$ and $u = \{u_1, u_2, \ldots, u_d\}$ are the lower and upper bounds of the corresponding variables in $x$, which define the feasible domain of the problem.
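As an illustration of Definition 1, the following minimal Python sketch encodes a continuous optimisation problem as an objective function plus bound vectors. The sphere function and the specific bounds are illustrative assumptions, not problems taken from the paper's portfolio.

import numpy as np

def sphere(x: np.ndarray) -> float:
    # Example objective p(x): minimise the sum of squares (assumed benchmark).
    return float(np.sum(x ** 2))

d = 20                   # number of dimensions (matches the experiments)
l = np.full(d, -100.0)   # lower bounds l (illustrative values)
u = np.full(d, 100.0)    # upper bounds u (illustrative values)

# A feasible candidate solution sampled uniformly within [l, u].
x = np.random.uniform(l, u)
print(sphere(x))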
Figure 1 details the main parts of the framework architecture. It consists of the two basic parts of a reinforcement learning system: the learning agent and the environment.
Learning Agent: The learning agent has the function of analysing the data generated by the environment through the analysis process, and of taking, through the action process, actions that affect the internal behaviour of each agent in the swarm of agents.
Environment: The environment is composed of three elements:
• An Optimisation Problem Instance Set: a portfolio of optimisation problems $P = \{p_1, p_2, \ldots, p_n\}$ that must be solved by the agents. Example optimisation problems are described in Appendix A and in table 5.
• A Swarm of Agents: a set of non-intelligent agents $A = \{A_1, A_2, \ldots, A_n\}$.
• Two Swarm Processes: the swarm status rewards process, which extracts information, and the swarm action process, which adds new information to the agents.
Swarm Agents: Each agent is separately in charge of executing the tests on the set of optimisation problems P. Formally, an agent is determined by Definition 2.

Definition 2.
An agent $A_i$ is defined by the 3-tuple $A_i = \langle M, Q, R \rangle$, where:
• A metaheuristic algorithm $M$ is an empty structure named template. This structure is modified at runtime through the swarm action process.
• A qualification $Q$ is a variable that indicates the value of the rank assigned to the agent.
• A report $R$ is a set of data structures in which the results of the optimisation tests are stored. The stored data correspond to a matrix of summaries with best, worst, mean, std, fitness, and solution for each optimisation problem, and a matrix of details of each iteration for each execution of each optimisation problem.
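A minimal Python sketch of this 3-tuple is shown below; the field names and the report layout are assumptions for illustration, not the paper's implementation.

import dataclasses

@dataclasses.dataclass
class Agent:
    metaheuristic: list      # M: the template, here a list of instructions
    qualification: int = 0   # Q: rank assigned by the analysis process
    report: dict = dataclasses.field(default_factory=dict)  # R: test results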

Instruction
An instruction is an ordered grouping of elements with the objective of producing a change in the value of a variable. An instruction is made up of four elements: a variable, an assignment operator, an operator, and a function.
The composition of an instruction is detailed in equation 1, where, from right to left: $f(x_t)$ is the function applied to the current value of the variable $x_t$ in order to generate a new value, $\Delta$ is the operator applied to the value of the variable $x_t$ and the value obtained by applying the function $f(x_t)$, and the symbol $\leftarrow$ is the assignment operator for the new value that will be assigned to $x_{t+1}$.

$$\underbrace{x_{t+1}}_{\text{variable}} \;\underbrace{\leftarrow}_{\text{assignment operator}}\; \underbrace{x_t}_{\text{variable}} \;\underbrace{\Delta}_{\text{operator}}\; \underbrace{f(x_t)}_{\text{function}} \qquad (1)$$

Formally, an instruction is determined by Definition 3. Additionally, instructions can derive into instruction types, such as an exploration instruction $\varepsilon$, defined by function 2, or an intensification instruction $\gamma$, defined by function 3.
For this research, a portfolio of functions with exploration and intensification instructions, described in tables 1, 2 and 3, will be used.
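To make the composition concrete, the following sketch applies one instruction of the form of equation 1; the particular operator (addition) and function (a uniform random perturbation) are assumed for illustration only.

import numpy as np

def uniform01(x):
    # Function f: a uniform random perturbation (illustrative choice).
    return np.random.uniform(0.0, 1.0, size=x.shape)

def apply_instruction(x, operator, function):
    # Implements equation 1: x_{t+1} <- x_t DELTA f(x_t).
    return operator(x, function(x))

x = np.zeros(20)
x = apply_instruction(x, np.add, uniform01)  # one exploration-style step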

[Tables 1–3. Portfolio of exploration and intensification functions; each entry lists an Identifier, Name, Function, and Code.]

Definition 4.
An operator $\Delta$ is a mathematical symbol that indicates that a specific operation must be performed on a variable and a function.

[Table 4. List of operators, with columns Identifier, Name, Math, and Code; the first entry, O00, is the None operator.]
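As a sketch, the operator table can be represented as a dictionary from operator codes to binary functions. Apart from O00 (None), which appears in Table 4, the semantics assigned to the remaining codes here are hypothetical examples.

import operator

# Hypothetical operator table; only O00 (None) is taken from the paper.
OPERATORS = {
    "O00": lambda x, fx: x,   # None: keep the variable unchanged
    "O02": operator.add,      # hypothetical: x + f(x)
    "O03": operator.mul,      # hypothetical: x * f(x)
}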

Evolutionary metaheuristic algorithm
A metaheuristic algorithm M is a template that changes in each cycle depending on the decisions made by the learning agent through the action process. Changes to its structure are made at runtime through the swarm action process. The structure of the metaheuristic algorithm template is constituted by the STARTED, STEP, and END functions.
The functions are described below:
• The STARTED function is in charge of initialising the variables of the optimisation problem. Initialisation is carried out using one or more exploration instructions. Subsequently, the current fitness is calculated and the solution is stored.
• The STEP function is the main core of the template. In this function, the main modifications to the evolutionary metaheuristic algorithm are made: actions such as adding, modifying, or deleting instructions, both exploration instructions and intensification instructions. Subsequently, the new fitness of the solution is calculated, and the new fitness and solution are stored in the event that they are better than the previous ones.
• The END function is executed when the end condition of the metaheuristic algorithm is met. Its function is to extract the solution found and its associated fitness.
Figure 2 describes an example of a template that has already been modified by the learning agent. The STARTED function has a single instruction composed of the operator NONE with the code O00 and the UNIFORM10(0, 1) function with the code I109. The STEP function is composed of an exploration instruction <O02, I06> and two intensification instructions <O02, I06> and <O03, I14>.
[Figure 2. Example metaheuristic template with instructions; the listing begins: def Started(P=0.5): # Exploration instruction.]
The RUN function has the role of executing the STARTED, STEP, and END functions: it first executes the Started function to initialise the algorithm, then executes a while loop with calls to the STEP function, and ends the algorithm when the stopping criterion is met. When this condition is fulfilled, the End function is called to extract the final solution of the optimisation problem P. The pseudocode of the RUN function is described in figure 3; a sketch of the control flow follows.
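The following Python sketch reconstructs the template's control flow under stated assumptions: the instruction representation as (operator, function) pairs, the fixed iteration budget as stopping criterion, and the minimisation comparison are illustrative choices; only the Started/Step/End/Run decomposition comes from the paper.

import numpy as np

class Template:
    # Minimal sketch of the metaheuristic template M.
    def __init__(self, problem, lower, upper, instructions):
        self.problem, self.l, self.u = problem, lower, upper
        self.instructions = instructions  # filled by the swarm action process

    def started(self):
        # Initialise with an exploration instruction (uniform sample here).
        self.x = np.random.uniform(self.l, self.u)
        self.fitness = self.problem(self.x)

    def step(self):
        # Apply the current instruction list and keep improvements.
        y = self.x.copy()
        for op, fn in self.instructions:
            y = np.clip(op(y, fn(y)), self.l, self.u)
        fy = self.problem(y)
        if fy < self.fitness:          # minimisation, per Definition 1
            self.x, self.fitness = y, fy

    def end(self):
        return self.x, self.fitness

    def run(self, iterations=100):
        self.started()
        t = 0
        while t < iterations:          # stopping criterion (assumed budget)
            self.step()
            t += 1
        return self.end()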

Swarm Status Rewards Process
The Swarm Status Rewards Process collects the information generated by the swarm of agents when executing the metaheuristic algorithm. It stores, for each problem p ∈ P and for each execution of the problem, the results regarding the best solution, the fitness, and the fitness values for each iteration. The values are stored in a structure called Report: Report = {best_solutions, best_fitness, fitness_iterations}.
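A minimal sketch of the Report structure, assuming nested containers keyed by problem identifier; the exact layout is not specified in the paper.

# Hypothetical layout of the Report structure for one agent.
report = {
    "best_solutions": {"P01": [[0.0] * 20]},           # per problem, per execution
    "best_fitness": {"P01": [0.0]},                    # best fitness per execution
    "fitness_iterations": {"P01": [[3.2, 1.1, 0.0]]},  # fitness at each iteration
}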

Analysis Process
The objective of the analysis process is to rank each agent in the swarm. The measure used to order the agents is the average fitness value obtained for each optimisation problem. The average values are represented by the matrix $Q \in \mathbb{R}^{m \times n}$ (see equation 4), where $m \in \{1, \ldots, p\}$, $n \in \{1, \ldots, a\}$, $p$ is the number of optimisation problems, and $a$ is the number of agents in the swarm.
Subsequently, a series of operations is carried out:
1. Rank assignment is performed using the data provided by the matrix Q. The method used is the minimum (competition) method, in which all tied values are assigned the minimum of the ranks that would otherwise have been assigned to them. The minimum method is applied to each row of the matrix Q, so that each problem has its own ranking over all the agents. The ranking result for each row is stored in the matrix $R \in \mathbb{N}^{m \times n}$ (see equation 5), where $m \in \{1, \ldots, p\}$, $n \in \{1, \ldots, a\}$, $p$ is the number of optimisation problems, and $a$ is the number of agents in the swarm.

$$R_{m,n} = \begin{pmatrix} r_{1,1} & r_{1,2} & \cdots & r_{1,n} \\ r_{2,1} & r_{2,2} & \cdots & r_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ r_{m,1} & r_{m,2} & \cdots & r_{m,n} \end{pmatrix} \qquad (5)$$

2. The values in each column of the matrix R are summed. Each sum corresponds to the final ranking value of one agent in the swarm. The sums are stored in a vector $S \in \mathbb{N}^{n}$ (see equation 6), where $n \in \{1, \ldots, a\}$ and $a$ is the number of agents in the swarm.
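A sketch of this analysis step in Python, assuming lower average fitness is better; scipy's rankdata with method='min' implements the minimum (competition) tie-breaking described above.

import numpy as np
from scipy.stats import rankdata

# Q[i, j]: average fitness of agent j on problem i (p problems x a agents).
Q = np.array([[0.0, 0.0, 5.1],
              [2.0, 3.0, 1.0]])

# Rank each row with the minimum (competition) method: tied values share
# the minimum rank, so the two tied 0.0 values both receive rank 1.
R = rankdata(Q, method="min", axis=1).astype(int)   # matrix R, shape (p, a)

# Column sums give the final ranking value S for each agent (lower = better).
S = R.sum(axis=0)
print(R)   # [[1 1 3], [2 3 1]]
print(S)   # [3 4 4]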

Action Process
The action process takes the information generated by the analysis process and performs a swarm modification procedure: based on its ranking, each agent is assigned the status NONE or MODIFY. The status NONE means that no modifications will be made to the agent's metaheuristic algorithm. The status MODIFY means that the agent is enabled to carry out modifications to the structure of its metaheuristic algorithm.
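A minimal sketch of a plausible status assignment, assuming the best-ranked agent is left unchanged (NONE) and all others are enabled for modification (MODIFY); the paper does not spell out this policy, so the threshold here is an assumption.

import numpy as np

def assign_status(S):
    # S: final ranking values per agent; lower is assumed to be better.
    best = np.argmin(S)
    return ["NONE" if j == best else "MODIFY" for j in range(len(S))]

print(assign_status(np.array([3, 4, 4])))  # ['NONE', 'MODIFY', 'MODIFY']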

Swarm Action Process
The Swarm Action Process has the function of making a modification in the agents of the swarm that have the MODIFY status. To carry out the modifications, each agent obtains a random integer from a uniform discrete distribution; the value obtained corresponds to a type of action that modifies the instruction structure of the metaheuristic algorithm. Figure 4 shows the three types of modifications made in the metaheuristic algorithm, as sketched below.
From these modifications, the agent is able to repeat the optimisation tests, in order to observe whether the changes in its structure generate better or worse results.
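A sketch of the modification step, assuming the three actions are adding, modifying, and deleting an instruction (the action types named for the STEP function earlier), drawn with equal probability from a discrete uniform distribution; the instruction pool is illustrative.

import random

def modify_template(instructions, pool):
    # Draw an action code from a discrete uniform distribution.
    action = random.randint(0, 2)
    if action == 0:                                   # add an instruction
        instructions.append(random.choice(pool))
    elif action == 1 and instructions:                # modify an instruction
        instructions[random.randrange(len(instructions))] = random.choice(pool)
    elif action == 2 and instructions:                # delete an instruction
        instructions.pop(random.randrange(len(instructions)))
    return instructions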

Experiments
This section describes the characteristics of the tests and the results obtained by their execution.
Methodology: The methodology consists of collecting a portfolio of continuous function optimisation problems; configuring the learning system with a fixed number of agents, the number of evolutionary iterations to perform, and the type of mutation; configuring each non-intelligent agent, including the number of iterations and the number of executions each metaheuristic algorithm will perform; and, finally, incorporating a list of intensification functions, exploration functions, and operators that will be used to create the instructions. Table 5 indicates the parameters used in the experiment.

[Table 5. Experiment parameters; only one row is reproduced here: Dimension = 20, the dimension of the optimisation problems.]
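As a sketch, the experiment configuration can be captured in a small dictionary. Dimension = 20, the 200 evolutionary iterations, and the 100 iterations per algorithm run are taken from the text; the remaining fields are placeholders for values listed in Table 5 that do not survive in this extract.

# Experiment configuration sketch; None marks values elided from Table 5.
CONFIG = {
    "dimension": 20,                  # dimension of the optimisation problems
    "evolutionary_iterations": 200,   # learning-agent cycles
    "metaheuristic_iterations": 100,  # iterations per algorithm run
    "agents": None,                   # swarm size, from Table 5
    "executions_per_problem": None,   # runs per problem, from Table 5
}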
Description of the dataset: Table 6 describes 13 types of problems collected from the literature. Problems P01 to P07 correspond to unimodal functions, and problems P08 to P13 correspond to multimodal functions. Detailed descriptions of these functions are given in Appendix A.
Table 7 describes the results obtained at the end of the test with the 200 evolutionary iterations. These results correspond to the best agent in the swarm: the best and worst fitness found for the 13 continuous optimisation problems, together with the average and standard deviation of the fitness. The instructions of the best algorithm found during the 200 evolutionary iterations are:

Results
Started function: an instruction for generating a random solution.

Discussion
According to the results in table 7, the best algorithm found through the evolutionary process was able to find the global optimum of 9 out of the 13 optimisation problems: P1, P2, P3, P4, P6, P7, P9, P10, and P11, where the best fitness found was 0.0000, with a standard deviation and an average fitness of 0.0000. Considering the convergence graphs 5, 6, 7, 8, 10, 11, 13, and 14, it can be seen that the global optimum was reached during the first 10 of the 100 iterations for which the algorithm was executed. These data provide visual evidence that the algorithm managed to find the global optimum within a limited number of iterations.
For problem P12, the best fitness obtained was 0.1167, with a standard deviation of 0.0749 and an average value of 0.2373. Problems P5, P8, and P13 did not achieve good fitness values; their standard deviation is 0.0000. A visual inspection of convergence graphs 9, 6, and 7 shows that, from a certain point in their execution, the algorithms cannot obtain a new fitness value, maintaining a single value for the remaining iterations. This may be because the instructions that make up the proposed algorithm are not suitable for those specific problems. This is important information: this group of problems might be solved with another technique, or tests could be generated with only those optimisation problems.
Considering the instructions of the best algorithm found, reported in section 3, these instructions can be translated into pseudocode, so that the final algorithm found can be seen in Figure 18. In its Step function, the exploration instructions are applied when a random number is less than the probability P.
Finally, this work introduces a reinforcement learning framework to generate metaheuristic algorithms (AutoMH). According to the exploratory tests carried out, AutoMH has managed to find a metaheuristic algorithm that finds the global optimum of 9 of the 13 continuous optimisation problems.
[Appendix A fragment: the problem has one global minimum at $f_{\min}(x^*) = 0 + \sum_{i=1}^{d} \eta_i$ with $x^* = [0, 0, \ldots, 0]$, where $\eta$ is a random number bounded in $[0, 1)$.]
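A sketch of the exploration/intensification branch suggested by the surviving fragment of Figure 18 ("if random_number < P: # Exploration instructions."); the specific instructions in each branch are placeholders, since only the branching condition survives in the text.

import random

def step(x, P, exploration_instructions, intensification_instructions):
    # Branch suggested by Figure 18: explore with probability P,
    # otherwise intensify; instructions are (operator, function) pairs.
    if random.random() < P:
        for op, fn in exploration_instructions:      # exploration instructions
            x = op(x, fn(x))
    else:
        for op, fn in intensification_instructions:  # intensification instructions
            x = op(x, fn(x))
    return x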