Comparative Analysis of Selection Hyper-Heuristics for Real-World Multi-Objective Optimization Problems

Abstract: As exact algorithms are infeasible for solving real optimization problems due to their computational complexity, meta-heuristics are usually used instead. However, choosing a meta-heuristic to solve a particular optimization problem is a non-trivial task, and often requires a time-consuming trial-and-error process. Hyper-heuristics, which are heuristics to choose heuristics, have been proposed as a means to both simplify and improve algorithm selection or configuration for optimization problems. This paper presents a novel cross-domain evaluation for multi-objective optimization: we investigate how four state-of-the-art online hyper-heuristics with different characteristics perform in order to find solutions for eighteen real-world multi-objective optimization problems. These hyper-heuristics were designed in previous studies and tackle the algorithm selection problem from different perspectives: election-based, based on Reinforcement Learning, and based on a mathematical function. All studied hyper-heuristics control a set of five Multi-Objective Evolutionary Algorithms (MOEAs) as Low-Level (meta-)Heuristics (LLHs) while finding solutions for the optimization problem. To our knowledge, this work is the first to deal conjointly with the following issues: (i) selection of meta-heuristics instead of simple operators; (ii) focus on multi-objective optimization problems; and (iii) experiments on real-world problems and not just function benchmarks. In our experiments, we computed, for each algorithm execution, Hypervolume and IGD+ and compared the results using the Kruskal–Wallis statistical test. Furthermore, we ranked all the tested algorithms considering three different Friedman rankings to summarize the cross-domain analysis. Our results showed that hyper-heuristics have a better cross-domain performance than single meta-heuristics, which makes them excellent candidates for solving new multi-objective optimization problems.


Introduction
Choosing a meta-heuristic to solve a particular optimization problem is a non-trivial task. Without detailed prior information as to which particular algorithm to use, it demands the evaluation of several algorithms in order to find out which is most suitable for a given problem. However, due to the non-deterministic nature of these algorithms, this process needs to be repeated several times. Hyper-heuristics, which are heuristics to choose heuristics, have been proposed as a means to both simplify and improve algorithm selection or configuration for optimization problems [1,2]. The idea is, through automation of the heuristic search, to provide effective and reusable cross-domain search methodologies that are applicable to problems with different characteristics from various domains without requiring much expert involvement [3].
Hyper-heuristics (HHs) employ learning methods, by using some feedback from the search process. Based on the source of this feedback, HHs can be classified as online or offline. A hyper-heuristic employs online learning if learning takes place while the algorithm is solving an instance of a problem. It is offline if the knowledge is gathered in the form of rules or programs from a set of training instances that hopefully generalize to solving unseen instances [1,2].
Hyper-heuristics can also be classified as selection or generation methodologies [4,5]. Selection hyper-heuristics, at the high level, control and mix low-level (meta)heuristics (LLHs), automatically deciding which one(s) to apply to the candidate solution(s) at each decision point of the iterative search process [4]. On the other hand, generation methodologies produce new heuristics or heuristic parts using pre-defined components.
Much of the previous research focuses on online selection hyper-heuristics. The interest in such algorithms has been growing in recent years, but the majority of research in this area has been limited to single-objective optimization [6].
The majority of the research in the HH literature focuses on treating operators, such as crossover, mutation, and differential evolution, as LLHs [2,6]. In [7], Cowling et al. proposed the Choice Function, an equation responsible for ranking heuristics considering the algorithm performance (in this case, a fitness function for a mono-objective problem) and the computational time.
In [8], Li et al. introduced FRRMAB, a variation of the original Multi-Armed Bandit [9] (MAB) where the Fitness Rate Ranking (FRR) was proposed as the reward assignment. In this hyper-heuristic, a set of Differential Evolution operators was considered as the LLHs to choose from. Also choosing among DE operators, Gonçalves et al. [10] proposed a hyper-heuristic based on the choice function to control a set of five Differential Evolution operators. Results showed this hyper-heuristic outperforming the standard MOEA (which uses a single operator) when solving ten unconstrained benchmark functions with two and three objectives. In [11], the authors applied a similar approach using several versions of MAB instead of a Choice Function. In this case, the CEC 2009 benchmark [12] was employed for performance evaluation. In [13], Almeida et al. evaluated three different versions of MAB by applying them to the permutation flow shop problem.
In [14], Guizzo et al. designed a hyper-heuristic to solve the Class Integration Test Order Problem [15], a software engineering problem in which the nodes of a graph, representing the classes to be tested, have to be visited. This hyper-heuristic was built using a choice function [6] and a Multi-Armed Bandit [9] to select an LLH from a set of nine to operate together with a fixed MOEA, in this case, the NSGA-II algorithm. This set was built by combining different crossover and mutation operators. The evaluation of LLHs was performed based on the dominance relationship between parent solutions and their offspring. In [16], this approach was tested considering SPEA2 as the fixed MOEA. In [17], de Carvalho tackled the same problem by creating a hyper-heuristic based on FRRMAB [8] and considering the same set of LLHs. Among all these versions, the Choice Function applied together with NSGA-II (the [14] version) obtained the best results.
Only a few studies focus on online learning hyper-heuristics selecting/mixing multi-objective evolutionary algorithms (MOEAs) for solving multi-objective optimization problems. Among them, Maashi et al. [6] proposed the Choice Function Hyper-Heuristic (HHCF), an interesting approach employing different quality indicators to evaluate a set of LLHs. In this approach, each LLH executes for some generations; then the resulting population is evaluated based on m different quality indicators, generating an n × m table, where n is the number of LLHs and m is the number of quality indicators. Following that, a second table containing the rankings is generated, which is used as input to a mathematical choice function responsible for determining which LLH should execute next. This approach was tested using the Walking Fish Group (WFG) benchmark [18] and on a single real-world problem: the Crashworthiness problem [19]. The authors used the algorithms NSGAII [20], SPEA2 [21], and MOGA [22] as LLHs, and compared their results to Amalgam [23] and to all single MOEAs. The experimental results indicated the success of HHCF in outperforming them all. In [24], Li et al. replaced MOGA with IBEA [25], and evaluated the approach on another real-world problem: Wind Farm layout optimization [26]. Although this hyper-heuristic yielded good results, the use of a two-level ranking approach was not properly justified and had no theoretical background.
The following studies have employed well-established computational methods within the design of new hyper-heuristics. Li et al. [3] proposed MOHH-RILA, an online selection hyper-heuristic treating the problem as a learning automaton, where the automaton's action is the selection of an LLH from a set of MOEAs. As reward and penalty, they used the Hypervolume and Spread indicators (the latter in case of ties). Thus, this study can be classified as a Reinforcement Learning based online selection hyper-heuristic. The authors employed IBEA, SPEA2, and NSGAII as LLHs to find solutions for the WFG and DTLZ [27] benchmarks and variants of the crashworthiness problem. The results showed that this approach outperformed HHCF, making it the state-of-the-art online selection hyper-heuristic for multi-objective optimization.
de Carvalho et al. [28,29] proposed the Multi-Objective Agent-Based Hyper-Heuristic (MOABHH), an online selection hyper-heuristic enabling the best performing LLH to have a greater share of the new solutions/offspring. They designed this HH as a multi-agent system by treating all LLHs and quality indicators as agents, and employed Social Choice Theory [30] voting methods in order to summarize quality indicator preferences for the participation assignment mechanism. In the first study [28], they employed NSGAII, IBEA, and SPEA2 as LLHs and applied MOABHH to the WFG Suite. GDE3 [31] was additionally included within the LLH set in [32]. In the latter [29], they evaluated how MOABHH performs on real-world problems, including the Crashworthiness, Water [33], Car Side Impact [34], and Machining [35] problems. In [36], the authors evaluated the proposed approach considering different voting criteria on these four real-world problems and the multi-objective traveling salesperson problem [37]. This proposed HH outperformed each individual MOEA and a random-choice hyper-heuristic; however, MOABHH was not compared to any of the other state-of-the-art hyper-heuristics.
Hence, the goal of this paper is to perform a thorough investigation of four reportedly top-ranking hyper-heuristics across an extensive set of problems: (i) MOABHH, which achieved promising results in a previous study; (ii) two variants of MOHH-RILA (HHLA and HHRL); and (iii) HHCF.
We used eighteen real-world multi-objective problems in total: four that were already used in [32], ten problems from the Black Box Optimization Competition [38], and four other bi-objective problems: FourBarTruss [39], Golinski [40], Quagliarella [41], and Poloni [42]. We also increased the number of LLHs used by all HHs to five: NSGAII, IBEA, SPEA2, GDE3, and mIBEA [43]. As a consequence, the task of selecting an LLH at each decision point was harder for all tested HHs. This paper focuses on comparing online selection hyper-heuristics specialized in selecting multi-objective evolutionary algorithms. For this reason, hyper-heuristics focused on selecting/generating parts of algorithms such as heuristics (i.e., crossover and mutation) cannot be considered. This is also the case of micro-genetic algorithms [44], which also focus on selecting parts of the evolutionary algorithms. Other approaches, such as parameter control [45], which focuses on diminishing the effort of setting up parameters by setting them automatically, will also not be considered since they are not directly related to hyper-heuristics.
The rest of the paper is organized as follows: Section 2 describes the studied hyper-heuristics and succinctly describes the MOEAs employed as low-level heuristics. Section 3 describes our methodology for the performed experiments. In Section 4, we present and discuss our obtained results. Finally, we present our conclusions and further work in Section 5.

Multi-Objective Optimization
Multi-objective optimization problems (MOPs) consist of finding solutions which simultaneously consider two or more conflicting objectives to be minimized or maximized. Thus, the search aims to find a set of solutions, each one reflecting a trade-off between the objectives. MOPs are tackled today using Evolutionary Algorithms by engineers, computer scientists, biologists, and operations researchers alike [46]. These algorithms are heuristic techniques that allow a flexible representation of the solutions and do not impose continuity conditions on the functions to be optimized. Moreover, MOEAs are extensions of EAs to multi-objective problems that usually apply the concept of Pareto dominance [47].
MOEAs have been applied to solve MOPs from different areas, such as logistics (the Ridesharing problem [48]), Software Engineering (architecture optimization [49]), machine learning (feature selection [50]), and medicine (optimizing antibiotic treatments [51]).
Besides EAs, there are other nature-based algorithms that have been successfully applied to solve optimization problems. This is the case of Ant Colony Optimization [52] and related improved versions such as [53], and Particle Swarm Optimization [54,55]. The latter has several different applications, such as a vehicle routing problem [56][57][58] and engineering problems such as the design of a near-field time delay equalizer metasurface [59], the Artificial Magnetic Conductor Surface [60], and the design of a dielectric phase-correcting structure for antennas [61].
There are several MOEAs proposed in the literature; some of them are based on genetic algorithms and differ from each other in their replacement strategies, while others are based on differential evolution. All of them use a population of current solutions P to generate offspring solutions O, combine them in P ∪ O, and then employ a replacement strategy to generate a new population of solutions P′. In the following, we describe the five MOEAs used in this work.

NSGAII
Non-dominated Sorting Genetic Algorithm-II [20] performs its replacement strategy considering Pareto Dominance and Crowding Distance selection, the latter being its major contribution. This selection evaluates how close solutions are to their neighbors, favoring larger distance values in order to maintain diversity in the population. Thus, NSGAII selects surviving solutions from P ∪ O by first taking non-dominated solutions to compose P′. Two situations may occur: this set may be smaller than or equal to the maximum population size, or larger. In the first case, NSGAII iteratively adds dominated solutions with higher Crowding Distance values until P′ is complete. In the second case, it discards the solutions with lower Crowding Distance values.
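For illustration, the Crowding Distance computation described above can be sketched as follows (a minimal Python sketch for minimization; the function name and list-based representation are our own, not taken from NSGAII's reference implementation):

```python
def crowding_distance(front):
    """Crowding distance for a list of objective vectors (minimization).

    Boundary solutions get infinite distance; interior solutions accumulate
    the normalized gap between their neighbors along each objective.
    """
    n = len(front)
    if n == 0:
        return []
    m = len(front[0])
    dist = [0.0] * n
    for obj in range(m):
        order = sorted(range(n), key=lambda i: front[i][obj])
        lo, hi = front[order[0]][obj], front[order[-1]][obj]
        dist[order[0]] = dist[order[-1]] = float("inf")
        if hi == lo:
            continue  # degenerate objective: all values equal
        for k in range(1, n - 1):
            dist[order[k]] += (front[order[k + 1]][obj]
                               - front[order[k - 1]][obj]) / (hi - lo)
    return dist
```

Solutions with larger crowding distance lie in less crowded regions and are therefore preferred during truncation.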

SPEA2
Differently from NSGAII, Strength Pareto Evolutionary Algorithm 2 [21] performs its replacement strategy considering Pareto Dominance and Strength values, which capture the relative difference between the number of other solutions that a particular solution dominates and the number that dominate it. Higher values, meaning that a solution is more dominant, are better. Like NSGAII, SPEA2 starts to fill the P′ population with non-dominated solutions and, if it is necessary to complete P′, selects dominated solutions with higher Strength values. In the case of more non-dominated solutions than allowed, SPEA2 follows the same procedure as NSGAII.
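A minimal sketch of strength-based fitness, following the original SPEA2 formulation (where each solution's strength counts how many others it dominates, and a raw fitness sums the strengths of a solution's dominators, so that 0 means non-dominated); function names are our own:

```python
def dominates(a, b):
    """Pareto dominance for minimization: a dominates b."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def spea2_raw_fitness(points):
    """Strength-based raw fitness (lower is better; 0 => non-dominated)."""
    n = len(points)
    strength = [sum(dominates(points[i], points[j]) for j in range(n))
                for i in range(n)]
    return [sum(strength[j] for j in range(n) if dominates(points[j], points[i]))
            for i in range(n)]
```

Note this is the raw-fitness view of SPEA2's paper (lower is better); the text above phrases the same preference in terms of more dominant solutions being better.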

IBEA
Indicator-Based Evolutionary Algorithm [25] performs replacement considering the contribution of a particular solution to a specific quality indicator. This algorithm selects surviving solutions from P ∪ O by removing the ones which contribute least to the given quality indicator. Usually, Hypervolume is adopted as the quality indicator.
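The environmental selection loop described above can be sketched generically (our own simplified sketch; `indicator` stands for any set-quality function, e.g., Hypervolume, and is not IBEA's exact binary-indicator fitness):

```python
def indicator_based_reduce(points, indicator, target_size):
    """Iteratively drop the solution whose removal least degrades `indicator`."""
    pts = list(points)
    while len(pts) > target_size:
        # contribution of point i = indicator(all) - indicator(all without i)
        worst = min(range(len(pts)),
                    key=lambda i: indicator(pts) - indicator(pts[:i] + pts[i + 1:]))
        pts.pop(worst)
    return pts
```

Recomputing the indicator per candidate is quadratic per removal; real implementations cache contributions, but the sketch conveys the selection principle.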

mIBEA
The Modified Indicator-Based Evolutionary Algorithm [43], based on IBEA, employs Hypervolume as the quality indicator. Different from its predecessor, which considers all solutions in P ∪ O when selecting solutions to compose P′ based on quality indicator contribution, this algorithm uses only the non-dominated solutions from the union set, and then selects the ones which contribute most. After this point, it works in the same way as IBEA. This modification improves the algorithm's convergence and removes solutions that have a high quality indicator contribution but are far away from the Pareto Front.

GDE3
Differently from the previous MOEAs, Generalized Differential Evolution 3 does not employ a crossover operator to generate offspring. Instead, this algorithm employs the differential evolution (DE) operator [62]. This operator generates offspring by combining three or more different parent solutions, and it is applied until O is filled. The algorithm behaves like NSGAII in the further steps, when generating the new population P′.
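The classic DE/rand/1/bin variant of the operator, which combines three randomly chosen parents with the target vector, can be sketched as follows (our own illustrative sketch; parameter names `F` and `CR` are the standard DE scale factor and crossover rate):

```python
import random

def de_rand_1(pop, i, F=0.5, CR=0.9):
    """DE/rand/1/bin trial vector for target index i (a common DE variant)."""
    # pick three distinct parents, none of which is the target
    r1, r2, r3 = random.sample([k for k in range(len(pop)) if k != i], 3)
    x, a, b, c = pop[i], pop[r1], pop[r2], pop[r3]
    jrand = random.randrange(len(x))  # ensure at least one mutated coordinate
    return [a[j] + F * (b[j] - c[j]) if random.random() < CR or j == jrand
            else x[j]
            for j in range(len(x))]
```

GDE3 would generate one such trial vector per target solution until O is filled, then apply its NSGAII-like replacement.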

Selecting a MOEA
MOEAs do not generate a single final solution but a set of non-dominated solutions which are considered to be of the same quality. The performance comparison and selection of the best one among alternative algorithms can only be done using quality indicators. We can classify these indicators according to their focus: convergence or diversity.
Convergence indicators measure the closeness of a given non-dominated solution set to the Pareto optimal front. Diversity indicators measure how diverse the obtained solution set is along the Pareto optimal front [63].
Some quality indicators focus on both convergence and diversity: Hypervolume [64], Hyper-area Ratio (HR) [65], and the Pareto Dominance Indicator (ER) [66]. Others focus on diversity: Uniform Distribution (UD) [67] and the Ratio of Non-dominated Solutions (RNI) [67]. There are also other quality indicators, for example, Algorithm Effort (AE) [67], which focuses on diversity but at the same time considers the computational time.
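As an illustration of the most widely used of these indicators, the bi-objective Hypervolume can be computed with a simple sweep (a minimal sketch for minimization, assuming a reference point that is worse than every solution in both objectives):

```python
def hypervolume_2d(front, ref):
    """Hypervolume of a 2-objective minimization front w.r.t. a reference point.

    Sorts the points by the first objective and accumulates the area of the
    rectangle each point adds between itself and the reference point.
    """
    pts = sorted(p for p in front if p[0] < ref[0] and p[1] < ref[1])
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y < prev_y:  # dominated points add nothing
            hv += (ref[0] - x) * (prev_y - y)
            prev_y = y
    return hv
```

Higher-dimensional Hypervolume requires more elaborate algorithms, but the 2-D case conveys the geometric meaning of the indicator.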
We can use quality indicators to determine which MOEA is the best for solving a given problem. However, due to the stochastic nature of these algorithms, we can only say that an algorithm A is better than B according to a given quality indicator after running multiple trials of both, taking indicator averages, and performing a statistical comparison. One way to reduce the overall computational effort is to assign an algorithm to this task: in this case, a hyper-heuristic.

HHCF
Maashi et al. [6] proposed an online selection hyper-heuristic based on the Choice Function [7], named the Hyper-Heuristic based on Choice Function (HHCF). Their work aims to select, one at a time, an LLH from a set H (of size n) and apply it for g generations. To evaluate the performance of the LLHs, this approach uses a two-level ranking system. First, each LLH is evaluated according to a group of m quality indicators, here composed of Hypervolume, RNI, UD, and AE.
Each quality indicator evaluates every LLH, assigning a performance value to each of them. Then, the LLHs are ranked, by each quality indicator, from the best performing LLH (rank 1) to the worst (rank n). At this point, a table of size n × m containing all the quality indicator values for each LLH is created. This paper defines RNI_rank(h) as the function that returns, for a given LLH h, its rank according to the RNI quality indicator. A second table (Freq_rank) is generated by computing how many times each LLH has the best value for each quality indicator. This is how HHCF summarizes the preferences of several quality indicators.
In order to select which LLH to execute, HHCF selects the LLH that maximizes Equation (1), which is composed of an exploitation term f1 and an exploration term f2 weighted by α, a fixed parameter of this algorithm. The exploitation term f1 is calculated by Equation (2). In this equation, n is the number of low-level heuristics, Freq_rank(h_i) is the number of times that a given low-level heuristic h_i is the best one according to all quality indicators, and RNI_rank(h_i) is the rank of the low-level heuristic according to the RNI quality indicator. The exploration term f2 is the computational waiting time (WT) that a given algorithm has spent inactive. In this paper, due to the different computational effort demanded by the problems, we normalize f2 using Equation (3).

Algorithm 1 illustrates how HHCF works. First, a random population of solutions is generated (Line 7) and used in the initialization process (Line 8). In this process, each h ∈ H executes for g generations (Line 9), the current population Pop is updated, and the values of the quality indicators for Pop are computed and stored (Line 10). Afterwards, the algorithm continues until the stopping criteria are met, ranking each h ∈ H (Line 12) and calculating Equation (2) (Line 13) and Equation (3) (Line 14). With this information, the LLH which maximizes Equation (1) is selected (Line 15) and used to generate solutions during g generations (Line 16). Finally, the new population Pop is created using Pop and the offspring population (Line 17), and all the quality indicators are recalculated for h_i using Pop (Line 18).

Algorithm 1: HHCF pseudocode.
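The selection step can be sketched as follows. Since Equations (1)-(3) are not reproduced in this text, the concrete combination of Freq_rank and RNI_rank inside f1 below is a hypothetical stand-in that matches the textual description (exploitation plus normalized waiting time), not the original paper's exact formulas:

```python
def hhcf_select(llhs, freq_rank, rni_rank, waiting_time, alpha=0.5):
    """Choose the LLH maximizing F(h) = alpha * f1(h) + f2(h).

    f1 rewards LLHs that often top the indicator rankings (exploitation);
    f2 rewards LLHs that have waited long since last use (exploration).
    The exact shape of f1 here is illustrative, not the published equation.
    """
    n = len(llhs)
    max_wt = max(waiting_time.values()) or 1.0
    def score(h):
        f1 = freq_rank[h] + (n - rni_rank[h])  # hypothetical combination
        f2 = waiting_time[h] / max_wt          # normalized waiting time
        return alpha * f1 + f2
    return max(llhs, key=score)
```

With α large, recently successful LLHs dominate the choice; with α small, long-inactive LLHs get re-tried.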

HHLA and HHRL
The Learning Automata-based Multi-Objective Hyper-Heuristic with a Ranking Scheme Initialization [3] implements a learning automaton whose action is to select an LLH at each decision point while the optimization problem is solved. Two versions of this algorithm are available: HHLA and HHRL. The only difference resides in the fact that HHRL employs an initialization process in order to reduce the number of LLHs in the pool.
Algorithm 2 illustrates how both hyper-heuristics work. First, the whole initialization process is performed (Line 6). While a stopping criterion is not reached, the hyper-heuristic applies the current LLH h_i to the current population (Pop) for g generations, producing new offspring (Pop′). In the following, Pop and Pop′ are combined to generate the new current population Pop. In Line 11, the HH verifies whether it is time to switch to another LLH: this is done by checking whether there is an improvement in the Hypervolume value (compared to previous iterations). If there is, the current LLH keeps running; if not, the reinforcement learning scheme updates the transition matrix P. Finally, another LLH is selected from A by the ε-RouletteGreedy method, considering the transition matrix P.
The ε-RouletteGreedy method focuses on exploring different transition pairs by performing a given number of trials in order to get a better view of pairwise LLH performance at the early stage. Then, it becomes more and more greedy, exploiting the accumulated knowledge.
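A minimal sketch of this exploration/exploitation mechanism (our own simplification: with probability ε it samples proportionally to the transition probabilities, otherwise it picks the current best; in the actual method ε is decayed over time):

```python
import random

def epsilon_roulette_greedy(probs, epsilon):
    """Pick an action index: greedy with prob. 1-epsilon, else roulette wheel."""
    if random.random() > epsilon:
        # greedy: action with the highest transition probability
        return max(range(len(probs)), key=lambda i: probs[i])
    # roulette wheel: sample proportionally to probs
    r, acc = random.uniform(0, sum(probs)), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1
```

Decreasing ε along the search shifts the method from exploration to exploitation, as described above.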
As mentioned before, the only difference between HHLA and HHRL lies in the initialization method. Algorithm 3 describes this process. First, the method creates a random population of solutions (Line 5) and the transition matrix P (Line 6), which describes the selection probabilities of transitions between LLHs. If HHLA is being run (Line 7), all LLHs are allowed to execute. Otherwise, an LLH may only be selected from the allowed set A; for this purpose, the set of LLHs (H) is reduced in order to eliminate poor-performing LLHs. This works as follows: first, all LLHs are executed in sequence for a number of stages. Every time an LLH executes, HHRL computes the resulting population's Hypervolume, using the same reference points. The scheme counts how many times each LLH is the best one across all stages. These counts are then used to determine which LLHs compose the allowed set A: LLHs with a performance worse than the average are excluded from A. The algorithm continues by selecting a current LLH h_i according to ε-RouletteGreedy and returning all the generated information.

MOABHH
The Multi-Objective Agent-Based Hyper-Heuristic [28,29] is a hyper-heuristic designed as a multi-agent system. According to Wooldridge [68], an agent is "a computer system that is situated in some environment, and is capable of autonomous action in this environment in order to meet its objectives".
This hyper-heuristic treats LLHs and quality indicators as agents in an election: LLHs are candidates voted on by quality indicator agents (the voters). The election happens every g generations, and its outcome tells us which LLH is performing best. The election winners then generate more offspring.
MOABHH differs from HHCF, HHRL, and HHLA regarding how the current population is processed by the LLHs. In this algorithm, all LLHs execute in parallel, each acting on a share of the main population and generating offspring in the same generation. This is achieved by splitting the main population into subpopulations according to the election outcomes, where the best LLH receives a bigger subpopulation and can generate more offspring. In the beginning, the main population is split equally into subpopulations, each one processed by a different LLH agent, which receives a subpopulation, generates new offspring solutions, and finds the surviving solutions.
At election time, all quality indicator agents (voters) evaluate each LLH agent (candidate), rank them, and send their rankings to the HH agent, which is responsible for taking all votes and processing them according to an election method in order to generate an election outcome. In our approach, we use the Copeland [69] voting method, in which candidates are ordered by their number of pairwise victories minus their number of pairwise defeats. The election outcome is used by the HH agent to increase the participation in generating offspring of the election winners and to decrease that of the losers in the next cycles.
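The Copeland aggregation over the voters' rankings can be sketched as follows (our own minimal sketch; each voter submits a full ranking, best first, and ties in a pairwise contest leave both scores unchanged):

```python
from itertools import combinations

def copeland(rankings):
    """Copeland scores from voters' rankings (each a list of candidates, best first)."""
    candidates = rankings[0]
    score = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        # a voter prefers a over b if a appears earlier in its ranking
        a_wins = sum(r.index(a) < r.index(b) for r in rankings)
        b_wins = len(rankings) - a_wins
        if a_wins > b_wins:
            score[a] += 1; score[b] -= 1
        elif b_wins > a_wins:
            score[b] += 1; score[a] -= 1
    return score
```

In MOABHH terms, the candidates are the LLHs and each quality indicator acts as one voter; the LLH with the highest Copeland score wins the election.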
Algorithm 4 details the MOABHH steps. First, all agents, components, and global variables are initialized (Line 3). A random population of solutions is generated (Line 4). The execution continues by creating a Participation array that is responsible for determining how many solutions each LLH can generate per generation. At this time, this array is created uniformly, assigning each LLH the same participation in generating offspring at the beginning of the search.
The algorithm continues by splitting the population into subpopulations according to the Participation array. Then, all LLHs generate offspring and update the main population in parallel in a synchronized task.
Every g generations the voting process starts: the Voter agents evaluate the solutions produced by the LLHs, rank them according to their preferences (Line 18), and send the rankings to the HH agent. After that, the HH agent calculates the social ranking according to the Copeland voting method (Line 19) and assigns a bigger participation in the population to the election winner and a lower one to the election losers (Line 20).

Real-World Multi-Objective Problems
Over the years, several artificially constructed test problems have been proposed to compose benchmarks for evaluating meta-heuristics. These problems offer many advantages over real-world problems for the purpose of general performance testing [70], by allowing users to compare the results of their algorithms (regarding effectiveness and efficiency) with others, over a spectrum of algorithm instantiations [46].
In the literature, one can find several such MOP benchmarks, such as WFG, DTLZ [27], and UF [12]. However, even if an algorithm has successfully solved these problems, this does not guarantee effectiveness and efficiency in solving real-world problems [46].
In terms of multi-objective hyper-heuristics, researchers have been using both benchmarks and real-world applications. However, few studies consider more than one real-world problem. This choice can, in fact, diminish the accuracy of evaluating hyper-heuristics on cross-domain applications.
Table 1 presents the eighteen real-world problems that we have used in this work. Ten of them were picked from the Black Box Optimization Competition (https://www.ini.rub.de/PEOPLE/glasmtbl/projects/bbcomp/downloads/realworld-problems-bbcomp-EMO-2017.zip, accessed on 10 April 2021) [38], a group of problems created/selected for the 9th International Conference on Evolutionary Multi-Criterion Optimization (EMO'2017) (http://www.emo2017.org/, accessed on 10 April 2021). We included the CrashWorthiness problem because it was already considered in previous papers for all the hyper-heuristics. The problems Water, Machining, and CarSideImpact were studied using MOABHH, but not with HHCF, HHLA, and HHRL. The other four problems were selected from the optimization literature and picked from the jMetal framework [71], the framework used by all the studied hyper-heuristics. The table details their numbers of objectives, variables, and constraints. All the problems are in continuous space.

Methodology
We set up the four studied hyper-heuristics, each controlling five LLHs (GDE3, IBEA, NSGAII, SPEA2, and mIBEA), to solve the eighteen continuous real-world optimization problems presented in Table 1. Different configurations were employed for these problems. In particular, the number of generations and the population size were slightly different for P17-P18. This was due to the high computational effort demanded by these applications, which would take almost three months of experiments using the same setup considered for problems P01-P16.
Table 2 presents all the parameters used in our experiments. We set up the HHLA and HHRL parameters according to [3], MOABHH according to [32], and HHCF according to [6] for P01-P16. For P17 and P18, we set g = 1 and β = 0.5 for MOABHH, and g = 1 for HHLA and HHRL. For these two problems, the parameters were defined empirically.
All the original hyper-heuristics were designed to work with population-based genetic algorithms and have specific procedures to manipulate the current population of solutions. In this comparison, we add two other MOEAs (mIBEA and GDE3), which are modeled in the same way as the other three. For this reason, their inclusion does not demand deep changes in the compared hyper-heuristics, changes that would define new hyper-heuristics. Thus, algorithms such as MOEA/D [76] and SMPSO [55] cannot be considered without proposing four new algorithms.
We employed Hypervolume and IGD+ averages obtained from the 30 executions. First, for each problem, we joined all the results obtained by all algorithms, found the nadir point, necessary for the Hypervolume calculation, and took the non-dominated set in order to generate the known Pareto Front (PF_known), necessary for the IGD+ calculation. Then, we calculated the quality indicator averages and compared them using the Kruskal-Wallis statistical test with a confidence level of 95%. To do so, we first identify which algorithm has the best average according to the quality indicator. Then, all the other algorithms are compared to the best one, generating a set of p-values. We consider an algorithm statistically tied with the best when its p-value is greater than the significance level, which in this case is 0.05.
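The tie-with-the-best procedure described above can be sketched as follows (a minimal sketch using SciPy's Kruskal-Wallis test; the function name and the dictionary-based input are our own):

```python
from scipy.stats import kruskal

def tied_with_best(samples, maximize=True, alpha=0.05):
    """Return the algorithms statistically tied with the best-average one.

    `samples` maps algorithm name -> list of indicator values over runs.
    Each non-best algorithm is compared to the best; a p-value above
    `alpha` counts as a statistical tie.
    """
    mean = lambda v: sum(v) / len(v)
    best = (max if maximize else min)(samples, key=lambda k: mean(samples[k]))
    tied = [best]
    for name, vals in samples.items():
        if name == best:
            continue
        if kruskal(samples[best], vals).pvalue > alpha:
            tied.append(name)
    return tied
```

For Hypervolume `maximize=True` applies; for IGD+ (lower is better) one would pass `maximize=False`.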

Experimental Results
In this section, we present the empirical results and our analysis. Tables 3 and 5 present, respectively, the averages for Hypervolume and IGD+. For each problem, the algorithms' results were submitted to a Kruskal-Wallis statistical test addressing the following hypothesis: Hypothesis 1. Considering a quality indicator (Hypervolume or IGD+), a given algorithm's output is equivalent to that of the algorithm with the best output.
In these tables, we highlighted (in grey) the algorithm with the best average output; moreover, considering the statistical test outcomes, we represented in bold those algorithms whose outputs are equivalent to the best one. Comparing the results obtained by all nine algorithms, GDE3 has higher Hypervolume values on four problems (P04, P14, P16, and P17), IBEA has better averages on three problems (P05, P07, and P13), NSGAII has the best result on three problems (P06, P10, and P18), SPEA2 on three problems (P08, P09, and P11), MOABHH has better averages on three problems (P01, P12, and P15), and HHRL on two of them (P02 and P03). Moreover, mIBEA (among the MOEAs) and HHLA and HHCF (both among the HHs) did not excel in any problem.

Hypervolume Analysis
Finally, it is interesting to compare the results obtained by HHs with individual MOEAs. Table 4 presents a summary of the statistical pairwise comparison between HHs and individual MOEAs considering Hypervolume averages. An x in a given cell means that the corresponding HH could not achieve a result as good as the one obtained by a particular MOEA, considering the Hypervolume indicator. First of all, we can notice that all HHs were overcome in problems P17 and P18. Moreover, MOABHH could not achieve statistically tied results with the best algorithms in three other problems (P10, P11, and P14), HHLA was beaten in six other problems (P04, P06, P08, P09, P10, and P12), HHRL could not achieve the best result in two other problems (P06 and P10), and HHCF did not achieve good results in six other problems (P05, P07, P08, P09, P11, and P16). Finally, comparing the results obtained by all nine algorithms, GDE3 is the best algorithm on three problems (P04, P16, and P17), IBEA performs better on three problems (P07, P11, and P15), NSGAII on two problems (P06 and P18), SPEA2 excels on four problems (P02, P08, P09, and P11), mIBEA just on P05, MOABHH performs better on four problems (P01, P10, P12, and P14), and HHRL on two problems (P03 and P13). HHLA and HHCF (both HHs) did not excel in any problem.

IGD+ Analysis
Finally, as done in Section 4.1, it is also interesting to compare the results obtained by HHs with individual MOEAs. Table 6 presents a summary of the statistical pairwise comparison between HHs and individual MOEAs considering IGD+ averages. An x in a given cell means that the corresponding HH could not achieve a result as good as the one obtained by a particular MOEA, considering the IGD+ indicator. First of all, we can notice that all HHs were overcome in problems P15, P17, and P18. Moreover, MOABHH could not achieve statistically tied results with the best algorithms in three other problems (P03, P08, and P11), HHLA was also beaten in three other problems (P04, P06, and P08), HHRL could not achieve the best result in four other problems (P06, P07, P14, and P16), and HHCF did not achieve good results in six other problems (P05, P07, P08, P10, P11, and P13).
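The x marks in Tables 4 and 6 come from per-problem pairwise significance tests over the 30 executions. As a rough illustration of how such a pairwise Kruskal–Wallis comparison can be decided, the following Python sketch computes the H statistic by hand; it omits the tie correction and any post-hoc procedure, and the exact statistical setup of the study may differ:

```python
from itertools import chain

def average_ranks(values):
    """1-based ranks with ties resolved by averaging (mid-ranks)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mid = (i + j) / 2.0 + 1.0  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mid
        i = j + 1
    return ranks

def kruskal_h(*groups):
    """Kruskal-Wallis H statistic (no tie correction)."""
    data = list(chain.from_iterable(groups))
    n = len(data)
    ranks = average_ranks(data)
    h, idx = 0.0, 0
    for g in groups:
        m = len(g)
        rank_sum = sum(ranks[idx:idx + m])
        h += rank_sum * rank_sum / m
        idx += m
    return 12.0 / (n * (n + 1)) * h - 3.0 * (n + 1)

# For two groups, reject equality at the 5% level when H exceeds the
# chi-square critical value with df = 1:
CRITICAL_5PCT_DF1 = 3.841
```

Two samples of 30 indicator values (one per algorithm) would be compared this way for each problem; an x is recorded when the HH's sample is significantly worse.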

Hyper-Heuristics' Utilization of Low-Level Meta-Heuristics
In this section, we address the following issue: how often each LLH is chosen by a particular hyper-heuristic. Figure 1 graphically presents this usage: for MOABHH, it represents the percentage of participation in generating offspring along the search. For HHRL, HHLA, and HHCF, on the other hand, the figure presents the percentage of times that each LLH was chosen. The data consider all problem instances, each of them running 30 times. The particular behavior of each HH may be found by analyzing Figure 1. For example, if we consider problem P03, one may notice that MOABHH chose GDE3 (blue) and SPEA2 (red) more often, while HHLA chose SPEA2 more often. On the other hand, HHRL and HHCF chose LLHs more uniformly. Table 7 presents a summarized evaluation of this analysis, where we classified the HHs' behavior into four classes: (i) One Elitist: problems where one LLH is clearly selected more often than any other, i.e., more than 50% of the time; (ii) Two Elitist: problems where two LLHs are privileged, i.e., each selected more than 40% of the time; (iii) Three Elitist: problems where three LLHs are selected more often than the others, i.e., each selected more than 30% of the time; (iv) Not Elitist: when there is no clear LLH preference. HHLA and HHCF had the most problems classified as One Elitist, while MOABHH and HHRL had more problems in the Two Elitist category. For Three Elitist, HHLA, HHCF, and HHRL had four problems classified while MOABHH had three. Considering all elitist-classified problems (One Elitist + Two Elitist + Three Elitist), all HHs behave in a similar way.
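The four classes of Table 7 follow directly from the selection percentages. A minimal sketch of such a classifier follows; the function name and threshold encoding are illustrative, not taken from the study's code:

```python
def classify_elitism(shares):
    """Map an HH's LLH selection shares (fractions summing to ~1)
    to the four behavior classes used in Table 7.

    Profiles are checked from most to least concentrated."""
    s = sorted(shares.values(), reverse=True)
    if s[0] > 0.50:                      # one LLH dominates
        return "One Elitist"
    if len(s) >= 2 and s[1] > 0.40:      # top two each above 40%
        return "Two Elitist"
    if len(s) >= 3 and s[2] > 0.30:      # top three each above 30%
        return "Three Elitist"
    return "Not Elitist"
```

For instance, five LLHs each selected about 20% of the time fall into the Not Elitist class.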

Generality Analysis
In order to perform a cross-domain evaluation of the algorithms, we followed [77] and generated the average and aligned Friedman rankings considering both Hypervolume and IGD+ values. These rankings consider which position each algorithm takes on each problem. We also concatenated the Hypervolume (Table 3) and IGD+ (Table 5) tables, both with 18 lines of data, in order to create a new table of mixed quality indicators with 36 lines of data. As the best Hypervolume and IGD+ values are, respectively, the highest and the lowest ones, we used 1 − IGD+ values instead. Table 8 presents the statistical evaluation; in this table, smaller statistical values are better. Considering just MOEA results, SPEA2 is the best single MOEA according to almost all statistical scores. The second best is GDE3, while IBEA and mIBEA are the worst-performing MOEAs according to this cross-domain analysis, with mIBEA performing slightly better than IBEA. Considering all nine studied algorithms, MOABHH performs best from a cross-domain perspective, as highlighted (in grey) in Table 8. HHRL is the second best, except when considering the mixed average ranking values (fifth column of the table): in this specific case, HHLA is the second-best algorithm.
Another interesting result is that an HH does not always get better results than the LLHs it is composed of; one can notice that, in general, SPEA2 and GDE3 achieved better statistical values than both HHLA and HHCF.
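The average and aligned Friedman rankings used above can be sketched as follows, assuming lower scores are better (as with IGD+, or 1 − Hypervolume); this is an illustrative reimplementation, not the study's code:

```python
def average_ranks(values):
    """1-based mid-ranks (ties averaged); lower value -> better rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mid = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mid
        i = j + 1
    return ranks

def friedman_avg(scores):
    """Average Friedman ranking.
    scores: list of {algorithm: value} dicts, one per problem."""
    algos = list(scores[0])
    total = {a: 0.0 for a in algos}
    for row in scores:
        for a, rank in zip(algos, average_ranks([row[a] for a in algos])):
            total[a] += rank
    return {a: total[a] / len(scores) for a in algos}

def aligned_friedman_avg(scores):
    """Aligned variant: subtract each problem's mean score, then rank
    all aligned observations jointly across problems."""
    algos = list(scores[0])
    aligned = []
    for row in scores:
        mean = sum(row[a] for a in algos) / len(algos)
        aligned.extend(row[a] - mean for a in algos)
    ranks = average_ranks(aligned)
    out = {a: 0.0 for a in algos}
    for i, rank in enumerate(ranks):
        out[algos[i % len(algos)]] += rank
    return {a: out[a] / len(scores) for a in algos}
```

The aligned variant removes per-problem offsets before ranking, so an algorithm that wins by large margins on easy problems cannot mask losses elsewhere.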

Discussion
The cross-domain tests showed that mIBEA is one of the worst-performing algorithms for this group of problems, while GDE3 is one of the top-performing ones. Thus, the inclusion of mIBEA and GDE3 in the LLH pool increases the difficulty, for the hyper-heuristics, of choosing the best algorithm. In previous studies, mainly three algorithms were considered in the LLH pool: IBEA (one of the worst algorithms in our study), SPEA2 (the best one in this study), and NSGAII.
MOABHH and HHRL performed quite well, showing clearly superior results when compared to HHLA and HHCF. Both MOABHH and HHRL removed the poor-performing LLHs at the beginning of the search, letting the best LLH run for more time than the others, whereas HHLA and HHCF kept trying poor-performing LLHs. HHRL removed the poor-performing LLHs right away, without giving them a proper chance during its initialization process, even though those LLHs might have performed well in the later stages of the search. MOABHH, on the other hand, kept all the algorithms running in parallel while shifting a percentage of the offspring generation away from the worse-performing ones. This increases MOABHH's capability of exploring the search space and avoids removing from the LLH pool, at the beginning of the execution, an LLH with potentially good performance in the later stages.
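MOABHH's behavior described above, keeping every LLH alive while gradually shifting offspring shares, can be illustrated with a small sketch. The update rule and the parameter names (`step`, `floor`) are hypothetical simplifications, not MOABHH's actual election mechanism:

```python
def reallocate_shares(shares, quality, step=0.05, floor=0.02):
    """Move a fraction `step` of the offspring-generation share from the
    worst-rated LLH to the best-rated one, never dropping an LLH below
    `floor` so that every LLH keeps running in parallel.

    shares:  {llh_name: fraction of offspring it generates} (sums to ~1)
    quality: {llh_name: quality score, higher is better}
    """
    best = max(quality, key=quality.get)
    worst = min(quality, key=quality.get)
    moved = max(0.0, min(step, shares[worst] - floor))
    new = dict(shares)
    new[worst] -= moved
    new[best] += moved
    return new
```

Applied every g generations, such a rule progressively concentrates offspring generation on the best-performing LLH while the `floor` preserves a minimum share for LLHs that may only pay off late in the search.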

Conclusions
In this study, we investigated four reportedly state-of-the-art online selection hyper-heuristics across eighteen real-world optimization problems. The hyper-heuristics controlled and mixed a set of five low-level MOEAs to produce improved trade-off solutions. The performance of the hyper-heuristics was also compared against that of each individual MOEA.
To the best of the authors' knowledge, this work is the first one to address the problem of using real-world problem instances for cross-domain performance evaluation of hyper-heuristics. In particular, we addressed in this paper the following issues: (i) an evaluation of four state-of-the-art online hyper-heuristics (MOABHH, HHRL, HHLA, and HHCF) using exclusively real-world problems; (ii) a harder selection task for these four hyper-heuristics by increasing the number of low-level heuristics used: in our work, we used five LLHs, whereas in previous publications the number of LLHs used was three or four; (iii) a cross-domain test set formed by eighteen real-world optimization problems (presented in Table 1) in order to evaluate multi-objective hyper-heuristics, which gives a more realistic overview of their performance.
As expected, the empirical results showed that individual MOEAs deliver different performances on different problems, making those real-world problem instances very useful for cross-domain performance evaluation of hyper-heuristics. Moreover, our results showed that hyper-heuristics have a better cross-domain performance than single meta-heuristics. This means that, when a new multi-objective optimization problem must be solved, hyper-heuristics are excellent candidates, reducing the user's effort of repeatedly running several meta-heuristics with different parameter settings in order to get a solution. In particular, MOABHH turned out to be the best algorithm, delivering the best overall cross-domain performance and beating the other state-of-the-art hyper-heuristics with respect to two quality indicators: IGD+ and Hypervolume.
As future work, these hyper-heuristics will be studied across various applications in the discrete multi-objective optimization domain, such as Search-Based Software Engineering problems [78], and we will explore the potential of these algorithms in helping to address the current pandemic, in terms of diagnosis and treatment [79] and drug discovery [80]. Many-objective problems also pose a challenge for researchers and practitioners, and some many-objective approaches were recently proposed in the literature. Another research direction would be to study the performance of the top-performing online learning selection hyper-heuristics across problems with more than eight objectives, varying their LLHs.

Figure 1. Utilization rate for the four hyper-heuristics.

1  Input:
2    Problem;
3    g: generations before evaluating an LLH;
4    H: set of LLHs {h_1, ..., h_i, ..., h_n};
5    α: exploitation parameter;
6  begin
7    Generate a random population of solutions Pop;
8    Initialize components using H;
9    All h ∈ H use Pop to generate Pop' during g generations;
10   Compute all quality indicators for all h ∈ H;
11   while a stopping criterion is not reached do
12     Compute Freq_rank and RNI_rank for all h ∈ H;
13     Equation (2) is computed for all h ∈ H;
14     Equation (3) is computed for all h ∈ H;
15     Select h_i according to Equation (1);
16     h_i executes for g generations and generates Pop';
17     Pop ← Pop' // acceptance criterion;
18     Compute all quality indicators for h_i;
19   end
20   return Pop
21 end

Table 1. A brief description of the real-world multi-objective problems, containing the number of objectives, variables, constraints, and source.

Table 2. Parameters used in experiments.

Table 3 presents Hypervolume averages. From the experiments, we could conclude the following:

Table 3. Hypervolume averages considering 30 executions. Highlighted values are the best Hypervolume values among all nine algorithms. Bold values are statistically tied with the best value.

Table 4. Summary of the statistical pairwise comparison between HHs and individual MOEAs considering Hypervolume averages. An x means that the HH could not achieve a result as good as the one obtained by an MOEA.

Table 5. IGD+ averages considering 30 executions. Highlighted values are the best IGD+ values among all nine algorithms. Bold values are statistically tied with the best value.

Table 6. Summary of the statistical pairwise comparison between HHs and individual MOEAs considering IGD+ averages. An x means that the HH could not achieve a result as good as the one obtained by an MOEA.

Table 7. How elitist the HHs are when selecting LLHs.


Table 8. Friedman Ranking and Aligned Friedman Ranking of the algorithms for Hypervolume and IGD+. Highlighted values are the best values among all nine algorithms.