Sampling Effects on Algorithm Selection for Continuous Black-Box Optimization

Abstract: In this paper, we investigate how systemic errors due to random sampling impact automated algorithm selection for bound-constrained, single-objective, continuous black-box optimization. We construct a machine learning-based algorithm selector, which uses exploratory landscape analysis features as inputs. We test the accuracy of the recommendations experimentally using resampling techniques and the hold-one-instance-out and hold-one-problem-out validation methods. The results demonstrate that the selector remains accurate even with sampling noise, although not without trade-offs.


Introduction
A significant challenge in continuous optimization is to quickly find a target solution to a problem lacking algebraic expressions, calculable derivatives, or a well-defined structure. For such black-box problems, the only reliable information available is the set of responses to a group of candidate solutions. Bound-constrained, single-objective, continuous Black-Box Optimization (BBO) problems arise in domains such as science and engineering [1]. Topologically, these problems can have local and global optima, large quality fluctuations between adjacent solutions, and inter-dependencies among variables. Such problem characteristics are typically unknown. Furthermore, BBO problems commonly involve computationally intensive simulations. Hence, finding a target solution is difficult and expensive.
Automated algorithm selection and configuration [2][3][4][5][6][7] addresses this problem by constructing a machine learning model that uses Exploratory Landscape Analysis (ELA) features [8] as inputs, where the output is a recommended algorithm. To calculate the ELA features, a sample of candidate solutions must be evaluated. Since performance is measured in terms of the number of function evaluations, the added cost of extracting the features must be low enough to avoid significant drops in performance. Precise features are obtained as the sample size increases [9][10][11][12][13][14]; hence, approximated features are used in practice [5][6][7][15]. However, even these successful works acknowledge that with small sample sizes, feature convergence cannot be guaranteed and uncertainty in their values can exist.
There is limited literature exploring the reliability of approximated features. In [16], we explored the effect that translations have on the features when the cost function is bound-constrained, demonstrating that translations led to phase transitions, hence providing evidence of the non-generality of the features across instances. Renau et al. [17] explored the robustness of the features against random sampling, the number of sample points, and their expressiveness in terms of ability to discriminate problems. Focusing on a fixed dimension of five and seven feature sets, they determined that most features are not robust against the sampling. Saleem et al. [12] proposed a method to evaluate features based on a ranking of similar problems and Analysis of Variance (ANOVA), which does not require machine learning models or confounding experimental factors. Focusing on 12 features, four benchmark sets in two and five dimensions, and four one-dimensional transformed functions, they identified that some features effectively capture certain characteristics of the landscape. Moreover, they emphasize the necessity of examining the variability of a feature with different sample sizes, as some can be estimated with small sizes, while others cannot. Škvorc et al. [13] used ELA to develop a generalized method for visualizing a set of arbitrary optimization functions, focusing on two- and ten-dimensional functions. By applying feature selection, they showed that many features are redundant, and most are non-invariant to simple transformations such as scaling and shifting. Then, Renau et al.
[18] identified that the choice of sample generator influences the convergence of the features, depending on their space-covering characteristics. Finally, in more recent work [14], we explored the reliability of five well-known ELA feature sets on a larger set of dimensions, across multiple sample sizes, using a systematic experimental methodology combining exploratory and statistical validation stages, with statistical significance tests and resampling techniques to minimize the computational costs. The results showed that some features are highly volatile for particular functions and sample sizes. In addition, a feature may have significantly different values between two instances of a function due to the bounds of the input space. This implies that the results from an instance should not be generalized across all instances. Moreover, the results show evidence of a curse of modality, which means that the sample size should increase with the number of local optima in the function.
However, none of these works explore the effect that uncertainty in the features has on the performance of automated algorithm selectors. In this paper, we examine this issue by constructing an automated algorithm selector based on a multi-class, cost-sensitive Random Forest (RF) classification model, which weights each observation according to the difference in performance between algorithms. We validated our selector on new instances of observed and unobserved problems, achieving good performance with relatively small sample sizes. Then, using resampling strategies, we estimate the systemic error of the selector, demonstrating that randomness in the sample has a minor effect on the accuracy. These results demonstrate that as long as the features are informative and have low variability, the systemic error of the selector will be small. This conclusion is the main contribution of our paper. The data generated from our experiments has been collected and is available in the LEarning and OPtimization Archive of Research Data (LEOPARD) v1.0 [19].
The remainder of the paper is organized as follows: Section 2 describes the continuous BBO problem and its characteristics using the concepts of fitness landscape, neighborhood, and basin of attraction. In addition, this section describes the algorithm selection framework [20] on which the selector is based. Section 3 presents the components of the selector. Section 4 describes the implementation and validation procedure. We present the results in Section 6. Finally, Section 7 concludes the paper with a discussion of the findings and the limitations of this work.

Black-Box Optimization and Fitness Landscapes
Before describing the methods employed, let us define our notation. Without loss of generality, we assume minimization throughout this paper. The goal in a bound-constrained, single-objective, continuous black-box optimization problem is to minimize a cost function f : X → Y, where X ⊂ R^D is the input space, Y ⊂ R is the output space, and D ∈ N* is the dimensionality of the problem. A candidate solution x ∈ X is a D-dimensional vector, and the scalar y ∈ Y is the candidate's cost. A continuous BBO problem is one in which f is unknown; the gradient information is unavailable or uninformative; and there is a lack of specific goals or solution paths. Only the inputs and the outputs of the function are reliable information, and they are analyzed without considering the internal workings of the function. Therefore, a target solution, x_t ∈ X, is found by sampling the input space.
To find a target solution we may use a search algorithm, which is a systematic procedure that takes previous candidates and their costs as inputs, and generates new and hopefully less costly candidates as outputs. For this purpose, the algorithm maintains a simplified model of the relationship between solutions and costs. The model assumes that the problem has an exploitable structure. Hence, the algorithm can infer details about the structure of the problem by collecting information during the run. We call this structure the fitness landscape [21], which is defined as the tuple L = (X, f, d), where d denotes a distance metric that quantifies the similarity between candidates in the input space. Similar candidates are those where the distance between each one of them is less than a radius r → 0 [22], forming a neighborhood, N_r(x) ⊂ X, which defines associations between the candidates in the input space.
A local optimum is a candidate x_l ∈ X such that y > y_l for all x ∈ N_r(x_l). We denote the set of local optima as X_l ⊂ X. The global optimum for a landscape L is a candidate x_o ∈ X such that y_l ≥ y_o for all x_l ∈ X_l. We denote the set of global optima as X_o ⊆ X_l. A fitness landscape can be described using the cardinality of X_l and X_o: a unimodal landscape is one with a single local optimum, whereas a multimodal landscape has several. A closely related concept is smoothness, which refers to the magnitude of the change in cost between neighbors. A landscape can be informally classified as rugged, smooth, or neutral. A rugged landscape has large cost fluctuations between neighbors. It presents several local optima and steep ascents and descents. A neutral landscape has large flat areas or plateaus, where changes in the input fail to generate significant changes in the output. At these extremes, both gradient and correlation values are uninformative [23].
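The neighborhood-based definition of a local optimum can be approximated on a finite sample. The sketch below is only an illustration of the definition (the test function, radius, and sampling here are our own choices, not from the paper): a sampled candidate is flagged as locally optimal when no other sample point within distance r has a strictly lower cost.

```python
import numpy as np

def local_optima(X, y, r):
    """Approximate the set of local optima of a sampled landscape.

    A candidate x_l is flagged as a local optimum if no other sampled
    candidate within Euclidean distance r has a strictly lower cost.
    This is a finite-sample approximation of the definition in the text.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    optima = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)   # distances to candidate x_i
        neighbors = (d < r) & (d > 0)          # the neighborhood N_r(x_i)
        if not np.any(y[neighbors] < y[i]):    # no strictly cheaper neighbor
            optima.append(i)
    return optima
```

On a one-dimensional sphere function sampled on a regular grid, only the point at the origin is flagged, matching the unimodal case described above.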
The set of local optima X_l can constitute a pattern known as the global structure, which can be used to guide the search for x_t. If the global structure is smooth, it may provide exploitable information. If the global structure is rugged or nonexistent, finding an optimum can be challenging, even impossible [24]. It is also possible to find problems that exhibit deceptiveness, i.e., the global structure and gradient information lead the algorithm away from an optimum, resulting in a performance similar to a random search [23].
A landscape can also be described using its basins of attraction, each of which is the subset of X from which a local search always converges to the same local optimum. The shape of the basins of attraction is a key attribute of the fitness landscape. Landscapes with isotropic or spherical basins have a uniform gradient direction. Landscapes with anisotropic or non-spherical basins are the result of non-uniform variable scaling. This means that changes in one variable result in smaller changes in cost compared to changes in other variables. A successful search in such landscapes requires a small change in one variable and a large change in the others. Anisotropic basins might also be non-elliptical. Such problems are non-separable: they cannot be broken into smaller problems of lower dimensionality, each of which is easier to solve [24].
In summary, the concepts of basin of attraction, neighborhood, and fitness landscape are useful to describe the characteristics of a given optimization problem. The characteristics can be described qualitatively when the structure of the function is known, using descriptors such as rugged or unimodal [24]. On the other hand, the characteristics of the function are unknown in a BBO problem. In addition, the only information available for black-box problems is the set of candidate solutions and their cost values. As such, data-driven methods are the only approach to acquire appropriate knowledge about the structure of the landscape.
A search algorithm should produce x_t in finite time. However, the algorithm may converge prematurely, generating candidates in a small area without finding x_t. This effect is a direct consequence of the model the algorithm maintains of the problem, which is a hard-coded bias in the algorithm [25]. The impact that the problem characteristics have on algorithm performance is significant [26]. Both theoretical and experimental studies demonstrate that each algorithm exploits the problem structure differently. As a consequence, no algorithm can solve all possible problems with the best performance [27]. In addition, the relationship between problem structure and algorithm performance is unclear. Furthermore, the significant algorithmic diversity compromises our ability to deeply understand all reported algorithms. Consequently, practitioners may find the algorithm selection stage to be a stumbling block. Most likely, a practitioner will select and adjust/modify one or two algorithms, including their parameters, hoping to obtain the best performance. However, unless the hard-coded bias in the algorithm matches the landscape characteristics, we misplace computational resources and human effort. Ideally, we should use an automatic algorithm selection framework.

The Algorithm Selection Framework
Rice [20] proposed an algorithm selection framework based on measurements of the characteristics of the problem. Successful implementations of this framework can be found in graph coloring, program induction, and satisfiability, among other problem domains [28][29][30][31]. Our interpretation of the framework results in the selector illustrated in Figure 1, whose components are:
• Problem set (F) contains all the relevant BBO problems. It is hard to formally define and has infinite cardinality.
• Algorithm set (A) contains all search algorithms. This set potentially has infinite cardinality.

• Performance evaluation is a procedure to evaluate the solution quality or speed of a search algorithm α on a function f, using a performance measure, ρ.

• Feature extraction is a procedure where an input sample, X, is evaluated on the problem instance, resulting in an output sample, Y. These two samples are then used to calculate the ELA features, c, as illustrated in Figure 2.

• Algorithm selection is a procedure to find the best algorithm α_b using c. Its main component is the learning model [28], as illustrated in Figure 2. The learning model maps the characteristics to the performance of a subset of complementary or well-established algorithms, A ⊂ A, also known as the algorithm portfolio. A method known as a pre-selector constructs A. The learning model is trained using experimental data contained in the knowledge base [32]. To predict the best performing algorithm for a new problem, it is necessary to estimate c and evaluate the model.


Methods
In this section, we describe the architecture of the selector. Section 3.1 describes the structure of the learning model. Section 3.2 describes the details of the ELA features used as inputs. Section 3.3 describes the expected running time [33], which is used as the measure of algorithm performance.

Learning Model
As the learning model, we use a multi-class classification model, in which each class represents the best performing algorithm, instead of a regression model, which predicts the actual performance of the algorithms. Classification avoids the censoring issue that exists in regression, i.e., one or more algorithms cannot find a target solution for all the problems in the training set, leaving gaps in the knowledge base. Although these gaps may be ignored or corrected, it is still possible for a regression model to overestimate the performance [30]. On the other hand, a classification model will be properly fitted if there is at least one algorithm that finds the target solution.
An ensemble of Binary Classification (BC) models is used, where each one produces a probability score, R(α_i, α_j) ∈ [0, 1], as output, where α_i, α_j are candidate algorithms in a set A. This score represents the truth level of the statement "α_i performs better than α_j". Correspondingly, R(α_j, α_i) = 1 − R(α_i, α_j). The pairwise scores are combined to generate an algorithm score, Σ(α_i) = ∑_{α_j ≠ α_i} R(α_i, α_j). The selected algorithm is the one that maximizes Σ(·) [34]. The architecture of the model is illustrated in Figure 3 for a portfolio of four algorithms. This architecture does have some shortcomings worth acknowledging. The number of individual BC models increases as O(|A|²). However, as shown by Bischl et al. [35], only a small number of algorithms are required to obtain good performance in many domains. Moreover, it is possible for the model to select the worst performing algorithm due to a classification error. This results in performance degradation on problems for which all the algorithms find target solutions, or none do, or for which only one algorithm finds a target solution.
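The score-aggregation step can be sketched as follows; the pairwise matrix in the usage example is hypothetical, standing in for the outputs of trained BC models.

```python
import numpy as np

def select_algorithm(scores):
    """Aggregate pairwise probability scores into a recommendation.

    scores[i][j] holds R(a_i, a_j), the probability that algorithm i
    outperforms algorithm j (with R(a_j, a_i) = 1 - R(a_i, a_j)).
    Each algorithm's total score is Sigma(a_i) = sum over j != i of
    R(a_i, a_j); the selected algorithm maximizes Sigma.
    """
    R = np.array(scores, dtype=float)   # copy, so the input is not mutated
    np.fill_diagonal(R, 0.0)            # an algorithm is not compared to itself
    sigma = R.sum(axis=1)               # Sigma(a_i)
    return int(np.argmax(sigma)), sigma
```

For example, with three algorithms and R(a_0, a_1) = 0.9, R(a_0, a_2) = 0.8, R(a_1, a_2) = 0.6, algorithm a_0 obtains the largest total score and is recommended.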

BC model
The six BC models take the vector of ELA features, c, as input and generate the probability scores, R(α_i, α_j), which are used to produce a recommended algorithm, α_b, as output.

Exploratory Landscape Analysis Features
As inputs to the selector, we use the features summarized in Table 1, as they are quick and simple to calculate. Furthermore, these features can share a sample obtained through a Latin Hypercube Design (LHD), guaranteeing that the differences observed between features depend only on the instance, and not on sample size or sampling method. For example, the convexity and local search methods described by [8] use independent sampling approaches; hence, they cannot share a sample between them, nor with the methods in Table 1. The convexity method takes two candidates, x_i, x_j, and forms a linear combination with random weights, x_k. Then, the difference between y_k and the convex combination of y_i and y_j is computed. The result is the number of iterations out of 1000 in which the difference is less than a threshold. Meanwhile, the local search method uses the Nelder-Mead algorithm, starting from 50 random points. The solutions are hierarchically clustered to identify the local optima of the function. The basin of attraction of a local optimum, x_l, i.e., the subset of X from which a local search converges to x_l, is approximated by the number of algorithm runs that terminate at each x_l. Neither sampling approach guarantees that the resulting sample is unbiased and, hence, reusable. Besides ensuring a fair comparison, sharing an LHD sample reduces the overall computational cost, as no new candidates must be taken from the space. The adjusted coefficient of determination, R², measures the fit of linear or quadratic least squares regression models. This approach estimates the level of modality and the global structure [8], and it can be thought of as measuring the distance between the problem under analysis and a reference one [29]. To provide information about variable scaling, the ratio between the minimum and the maximum absolute values of the quadratic term coefficients in the quadratic model, CN, is also calculated [8].
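A minimal sketch of the two regression-based features follows. We assume a pure-term quadratic model (intercept, linear, and squared terms only); the full meta-model feature set described in [8] also includes interaction-term variants, which are omitted here.

```python
import numpy as np

def quad_model_features(X, y):
    """Fit a quadratic least-squares model and return the adjusted R^2
    and the quadratic-coefficient ratio CN described in the text.

    Assumes the model y ~ b0 + sum_i b_i x_i + sum_i c_i x_i^2; the
    exact term set used by the ELA meta-model features may differ.
    """
    X, y = np.asarray(X, float), np.asarray(y, float)
    n, D = X.shape
    A = np.hstack([np.ones((n, 1)), X, X ** 2])     # design matrix
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    p = A.shape[1] - 1                               # predictors, no intercept
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2_adj = 1 - (ss_res / (n - p - 1)) / (ss_tot / (n - 1))
    quad = np.abs(coef[1 + D:])                      # quadratic coefficients
    cn = quad.min() / quad.max()                     # variable scaling ratio
    return r2_adj, cn
```

On an exactly quadratic, ill-scaled function such as y = x_1² + 4 x_2², the fit is essentially perfect and CN recovers the 1:4 scaling between the variables.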
The mutual information can be used as a measure of non-linear variable dependency [36]. Let V = {1, . . . , D} be the set of variable indexes, and V ⊆ V be a combination of such variables. The information significance, ξ(V), is calculated as follows:

ξ(V) = I(X_V; Y) / H(Y)

where H(Y) is the information entropy, I(X_V; Y) = H(X_V) + H(Y) − H(X_V, Y) is the mutual information, and Y ⊂ Y is the cost of X. The information entropy of a continuous distribution is estimated through a kd-tree partition method [37], and it is not bounded between [0, 1]. Let k = |V| be the order of information significance. The mean information significance of order k is defined as the average of ξ(V) over all combinations V with |V| = k. Please note that as D increases, the number of possible variable combinations increases combinatorially. Therefore, we limit ourselves to k = {1, D}.
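The quantity ξ(V) can be illustrated with a simple sketch. Here a histogram entropy estimate stands in for the kd-tree partition method of [37]; the bin count and the estimator itself are illustrative assumptions.

```python
import numpy as np

def _entropy(Z, bins=8):
    """Histogram estimate of the entropy (in nats). The paper uses a
    kd-tree partition estimator; this is a simple stand-in."""
    Z = np.asarray(Z, float)
    if Z.ndim == 1:
        Z = Z[:, None]
    hist, _ = np.histogramdd(Z, bins=bins)
    p = hist.ravel() / hist.sum()
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def info_significance(X, y, V, bins=8):
    """xi(V) = I(X_V; Y) / H(Y), with the mutual information expanded
    as I(X_V; Y) = H(X_V) + H(Y) - H(X_V, Y)."""
    XV = np.asarray(X, float)[:, V]
    y = np.asarray(y, float)
    hY = _entropy(y, bins)
    joint = _entropy(np.column_stack([XV, y]), bins)
    return (_entropy(XV, bins) + hY - joint) / hY
```

As a sanity check, a cost that depends only on the first variable yields a much larger ξ({1}) than ξ({2}) for an irrelevant variable.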
The probability distribution of Y may indicate whether the function is neutral, smooth, or rugged, as well as the existence or lack of a global structure [8]. The shape of the distribution is characterized by estimating the sample skewness, γ(Y), and kurtosis, κ(Y) [8], and entropy, H(Y) [38].
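These three shape descriptors can be computed directly from the output sample. The sketch below uses standard estimators from scipy; the paper's exact kurtosis convention and entropy estimator may differ.

```python
import numpy as np
from scipy import stats

def cost_distribution_features(y):
    """Shape descriptors of the output sample: sample skewness
    gamma(Y), kurtosis kappa(Y), and a histogram-based entropy H(Y)."""
    y = np.asarray(y, float)
    gamma = stats.skew(y)
    kappa = stats.kurtosis(y, fisher=False)     # Pearson's definition
    hist, _ = np.histogram(y, bins='auto')
    p = hist[hist > 0] / hist.sum()
    H = -(p * np.log(p)).sum()
    return gamma, kappa, H
```

For a large Gaussian output sample, the skewness is near zero and the kurtosis is near 3, as expected under Pearson's convention.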

Algorithm Performance Measure
To measure the efficiency of an algorithm, and indirectly the selector, we use the expected running time, T, which estimates the average number of function evaluations required by an algorithm to reach y_t within a target precision for the first time [33]. A normalized, log-transformed version is calculated as follows:

T = log10( #FEs((y_b − y_t) ≥ e_t) / (#succ · D) )

where e_t is the target precision, #FEs((y_b − y_t) ≥ e_t) is the number of function evaluations over all runs during which the best cost, y_b, remained above the target, and #succ is the number of successful runs. The distribution of T is log-transformed to compensate for its heavy tail, i.e., most problems require many function evaluations to reach the target.
When some runs are unsuccessful, T depends on the termination criteria of the algorithm.
The average expected running time over a subset of problems, F, is defined as the mean of T over all functions in F, i.e., T̄(F) = (1/|F|) ∑_{f ∈ F} T(f).
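A minimal sketch of the expected running time and its normalized, log-transformed version follows, assuming the standard ERT definition from [33]; the exact normalization used in the paper is an assumption here.

```python
import numpy as np

def expected_running_time(fevals, success):
    """Expected running time: function evaluations summed over all runs,
    divided by the number of successful runs [33]. Infinite when no run
    succeeds, in which case T depends on the termination criteria."""
    fevals = np.asarray(fevals, float)
    nsucc = int(np.sum(success))
    if nsucc == 0:
        return np.inf
    return fevals.sum() / nsucc

def normalized_log_ert(fevals, success, D):
    """Dimension-normalized, base-10 log-transformed ERT (assumed form)."""
    return np.log10(expected_running_time(fevals, success) / D)
```

For example, four runs of 100 evaluations each with three successes give an ERT of 400/3 ≈ 133.3 evaluations.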

Selector Implementation
To implement the selector, our first step is to choose representative subsets of the problem, F, and algorithm, A, spaces. We use the MATLAB implementation of the Comparing Continuous Optimizers (COCO) noiseless benchmarks v13.09 [33] as representative of F. The motivation for this choice of benchmark problems revolves around practical advantages. For example, there is a wealth of data collected about the performance of a large set of search algorithms, and there are established conventions on the number of dimensions and the limits on the function evaluation budget. The software implementation of COCO generates instances by translating and rotating the function in the input and output spaces. For example, let f(x) = R(x − x_o)² + y_o be a function in COCO, where R is an orthogonal rotation matrix, and x_o and y_o cause translational shifts in the input and output spaces, respectively. A function instance is generated by providing values for R, x_o, and y_o. To allow replicable experiments, the software uniquely identifies each instance using an index. For each function at each dimension, we analyze instances {1, . . . , 15} at D = {2, 5, 10, 20}, giving us a total of 1440 problem instances for analysis.
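The instance-generation mechanism can be sketched for the sphere function. The ranges for R, x_o, and y_o below are illustrative assumptions, not COCO's actual implementation; the point is only that a seed (instance index) reproducibly determines the transformation.

```python
import numpy as np

def make_instance(D, seed):
    """Generate a sphere-function instance in the style described for
    COCO: rotate and translate the input space and shift the output.
    R, x_o, and y_o are derived from the seed, so the same index always
    reproduces the same instance."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((D, D))
    R, _ = np.linalg.qr(A)                 # orthogonal rotation matrix
    x_o = rng.uniform(-4, 4, size=D)       # input-space translation
    y_o = rng.uniform(-100, 100)           # output-space translation

    def f(x):
        z = R @ (np.asarray(x, float) - x_o)
        return float(z @ z + y_o)

    return f, x_o, y_o
```

By construction, f attains its minimum value y_o at x_o, and any other candidate has a strictly larger cost.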
As a representative subset of the algorithm space, A ⊂ A, we have used a subset of algorithms from the Black-Box Optimization Benchmarking (BBOB) sessions, composed of (1+2)_m^s CMA-ES, BIPOP-CMA-ES, LSstep, and Nelder-Mead with Resize and Half-Runs. Corresponding results are publicly available at the benchmark website (http://coco.gforge.inria.fr/doku.php). The subset was selected using ICARUS (Identification of Complementary algoRithms by Uncovered Sets), a heuristic method to construct portfolios based on uncovered sets, a concept derived from voting systems theory [41]. As a comparison, we use a portfolio composed of the BFGS, BIPOP-CMA-ES, LSfminbnd, and LSstep algorithms (BBLL) [2]. In contrast to our automatically selected portfolios, the algorithms in this portfolio were manually selected to minimize T over all the benchmark functions within a target of 10^−3. The descriptions of all the algorithms employed are presented in Table 2.
Table 2. Summary of the algorithms employed in this paper, which were selected using ICARUS [41] and the publicly available results from the BBOB sessions at the 2009 and 2010 GECCO Conferences. Algorithm names are as used for the dataset descriptions available at http://coco.gforge.inria.fr/doku.php?id=algorithms.

Algorithm Description Reference
(1+2)_m^s CMA-ES A simple CMA-ES variant, with one parent and two offspring, sequential selection, mirroring, and random restarts. All other parameters are set to the defaults. [42]

BFGS
The MATLAB implementation (fminunc) of this quasi-Newton method, which is randomly restarted whenever a numerical error occurs. The Hessian matrix is iteratively approximated using forward finite differences, with a step size equal to the square root of the machine precision. Other than the default parameters, the function and step tolerances were set to 10^−11 and 0, respectively. [43]

BIPOP-CMA-ES
A multistart CMA-ES variant with equal budgets for two interlaced restart strategies. After completing a first run with a population of size λ_def = 4 + ⌊3 ln D⌋, the first strategy doubles the population size at each restart, while the second one keeps a small population given by λ_s = ⌊λ_def (λ/(2 λ_def))^(U[0,1]²)⌋, where λ is the latest population size from the first strategy, and U[0,1] is an independent uniformly distributed random number. All other parameters are at default values. [44]

LSfminbnd
The MATLAB implementation (fminbnd) of this axis-parallel line search method, which is based on golden section search and parabolic interpolation. It can identify the optimum of quadratic functions in a few steps. On the other hand, it is a local search technique; it can miss the global optimum (of the 1D function). [45]

LSstep

An axis-parallel line search method effective only on separable functions. To find a new solution, it optimizes over each variable independently, keeping every other variable fixed. The STEP version of this method uses interval division, i.e., it starts from an interval corresponding to the upper and lower bounds of a variable, which is halved at each iteration. The next sampled interval is chosen based on its "difficulty," i.e., the belief of how hard it would be to improve the best-so-far solution by sampling from the respective interval. The measure of difficulty is the coefficient a of a quadratic function f(x) = ax² + bx + c, which must pass through both interval boundary points. [45]

Nelder-Mead

A version of the Nelder-Mead algorithm that uses random restarts, resizing, and half-runs. In a resizing step, the current simplex is replaced by a "fat" simplex, which maintains the best vertex but relocates the remaining ones such that they have the same average distance to the center of the simplex. Such steps are performed every 1000 algorithm iterations. In a half-run, the algorithm is stopped after 30 × D iterations, with only the most promising half-runs being allowed to continue. [46]

As our next step, we calculate the ELA features. Ideally, if the features are to be used for algorithm selection, their additional computational cost should be a fraction of the cost associated with a single search algorithm run, which is usually bounded at 10⁴ × D function evaluations [33]. Belkhir et al. [47] and Kerschke et al.
[15] use sample sizes of 30 × D and 50 × D, respectively, which are close to the population size of an evolutionary algorithm. However, Belkhir et al. [47] establish that 30 × D produces poor approximations of the feature values, although these can be improved by training and resampling a surrogate model. Nevertheless, their experiments also demonstrate that most features for the COCO benchmark set level off between 10² × D and 10³ × D. These results are supported by Škvorc et al. [13], who determined that for D = 2 a sample size of at least 200 × D was necessary to guarantee convergence.
Given this evidence, and to balance the cost of our computations, we set the lower bound for the sample size at 10² × D, corresponding to 1% of the budget, and the upper bound at 10³ × D, corresponding to 10% of the budget, which we consider to be reasonable sample sizes. We divided the range between 10² × D and 10³ × D into five equally sized intervals on a base-10 logarithmic scale, aiming to produce a geometric progression analogous to the progression in D. As a result, we have five samples for each dimension. The samples have sizes n equal to {100, 178, 316, 562, 1000} × D points, where each one is roughly 80% larger than the previous. We generate the input samples, X, using MATLAB's Latin Hypercube Sampling function lhsdesign with default parameters.
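The size progression and a basic Latin Hypercube generator can be sketched as follows. The generator is a simplified stand-in for MATLAB's lhsdesign, keeping only the defining property that every axis-aligned slice of the space receives exactly one point.

```python
import numpy as np

# Five sample sizes between 10^2*D and 10^3*D, equally spaced in log10:
n_per_d = np.round(np.logspace(2, 3, 5)).astype(int)   # [100, 178, 316, 562, 1000]

def lhd_sample(n, D, lower, upper, rng):
    """Basic Latin Hypercube Design: each of the n equal-width slices
    of every axis receives exactly one point."""
    u = np.column_stack([rng.permutation(n) for _ in range(D)])
    u = (u + rng.uniform(size=(n, D))) / n    # jitter within each slice
    return lower + u * (upper - lower)
```

Consecutive sizes in the progression differ by a factor of 10^(1/4) ≈ 1.78, which is where the "roughly 80% larger" figure comes from.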
We generate the output sample, Y_{f,i}, by evaluating X on one of the first 15 instances of the 24 functions from COCO, guaranteeing that the differences observed in the ELA features only depend on the function and sample size, as illustrated in Figure 4. As D increases, the size of X becomes relatively smaller, because the input space has increased geometrically, while the sample has increased linearly. This is an unavoidable limitation due to the curse of dimensionality. The combinations of input/output samples are analyzed using the features. To sum up, the input patterns database for a sample size n has a total of 1440 observations, one per instance, with ten predictors. All data used for training is made available in the LEarning and OPtimization Archive of Research Data (LEOPARD) v1.0 [19].

Figure 4. Instance evaluation procedure. An input sample X ∈ X is evaluated on all the problem instances f_{i,j}, i = 1, . . . , 24, j = 1, . . . , 15, where i is the function index and j the instance index. The result is an output sample Y_{f,i} ∈ Y. The procedure guarantees that the differences among measures only depend on the output sample.

Selector Validation
Given the data-driven nature of the ELA features, the randomness and size of the input sample used to calculate them may also affect accuracy. Therefore, through the validation process we aim to answer the following questions:
Q1 Does the selector's accuracy increase with the size of the sample used to calculate the ELA features?
Q2 What is the effect of the sample randomness on the selector's performance?
The validation process is based on the Hold-One-Instance-Out (HOIO) and Hold-One-Problem-Out (HOPO) approaches proposed by Bischl et al. [2]. For HOIO, the data from one instance of each problem at all dimensions is removed from the training set. For HOPO, all the instances of one problem at all dimensions are removed from the training set. HOIO measures the performance on instances of previously observed problems, whereas HOPO measures the performance on instances of unobserved problems.
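The two splits can be sketched as boolean masks over the knowledge base; the per-observation function and instance identifiers are assumptions for illustration.

```python
import numpy as np

def hoio_split(func_id, inst_id, held_instance):
    """Hold-One-Instance-Out: the held instance of *every* function,
    at all dimensions, goes to the test set."""
    inst_id = np.asarray(inst_id)
    test = inst_id == held_instance
    return ~test, test          # (train mask, test mask)

def hopo_split(func_id, inst_id, held_function):
    """Hold-One-Problem-Out: all instances of the held function,
    at all dimensions, go to the test set."""
    func_id = np.asarray(func_id)
    test = func_id == held_function
    return ~test, test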
To answer Q1, we train 30 different models for each validation approach, and average the performance over them.We use a cost-sensitive Random Forest (RF) classification model with 100 binary trees, which weighted each observation according to the difference in performance between algorithms [48] as BCs.The performance is measured for each value of n in terms of the success, best and worst-case selection, and total failure rates.The Success Rate (SR) is the percentage of solved problems by the selector; the Best Case Selection Rate (BCSR) is the percentage of problems for which the best algorithm was selected; the Worst-Case Selection Rate (WCSR) is the percentage of problems for which the worst algorithm was selected; and the Total Failure Rate (TFR) is the percentage of unsolved problems due to selecting the worst performing algorithm.To summarize the efficiency and accuracy results of the selector, we define a normalized performance score, ρ α T , as follows: where α T is a baseline algorithm for which ρ α T = 1.For our cases, the baseline algorithm will be the BBLL portfolio with random selection, ρ RB .
Once we have verified that the architecture of the selector produces satisfactory results, we address Q2.To do this, we estimate the empirical probability distribution of each feature, pr(c k ).There are multiple approaches to this problem.For example, to take multiple, independent trial sets per function [17], which guarantees the most accurate estimator of pr(c k ), although with the substantial computational of cost collecting the function responses and calculating the features, particularly as n and D increase.Another approach is to train a surrogate model with a small sample, and then resample from the model [47], which provides a low computational cost estimate of pr(c k ), although the resample also includes assumptions generated by the surrogate which may not correspond to the actual function.Moreover, parametric assumptions cannot be made, and there is no guarantee of fulfilling asymptotic convergence.Therefore, we use bootstrapping [49] to estimate pr(c k ), which allows us to use one, potentially small sample, hence it can be applied on expensive problems, without adding assumptions to the methodology.
Bootstrapping is a type of resampling method that uses the empirical distribution to learn about the population distribution as follows: Let Z = {z 1 , . . . ,z n }, z i = (x i , y i ), i = 1, . . ., n be a sample of n independent and identically distributed random variables drawn from the distribution pr(Z).We can consider c k = T(pr(Z)) to be a summary statistic of pr(Z), for example the mean, the standard deviation, or in our case an ELA feature.Let Z * j , j = 1, . . ., N be a bootstrap sample, which is a new set of n points independently sampled with replacement from Z. From each one of the N bootstrap samples, we calculate a bootstrap statistic, ĉ * k,j , also in our case a feature.In other words, from each sample of size n we create N resamples with replacement, also of size n, from which we estimate all the features.This feature vectors are fed into the meta-models during the HOIO and HOPO validation.With N = 2000, we estimate the 95% confidence intervals and bias of T, SR, BCSR, WCSR, TFR for all values of n.The confidence intervals are estimated using Efron's non-parametric corrected percentile method [50], i.e., we compute the 2.5th and 97.5th percentiles of the bootstrap distributions for each performance measure.

Performance of the Selector
We now provide an answer to the questions formulated above. The results show that a larger value of n does not translate to higher selection accuracy, while greatly increasing the cost for easier problems. The remaining budget to find a solution decreases due to an increase of f_evals/D used to estimate the ELA features. Furthermore, changes in the input sample have minor effects on the performance. This implies that the selector is reliable even with noisy features, suggesting that their variance is small. Let us examine the results in detail.
The performance of the best algorithm for a given problem is illustrated in Figure 5 for the ICARUS/HOPO and BBLL portfolios. We observe that BIPOP-CMA-ES is the dominant algorithm for {10, 20} dimensions in both portfolios. The Nelder-Doerr algorithm replaces BFGS in the ICARUS/HOPO portfolio, resulting in a deterioration of T for {f_1, f_5}. There are specialized algorithms for particular problems regardless of the number of dimensions, e.g., LSstep is the best for {f_3, f_4}. Table 3 shows the performance of the individual algorithms and of the selectors during HOIO and HOPO. In addition, the results from an oracle, i.e., a method which always selects the best algorithm without incurring any cost, and from a random selector are presented. The performance is measured as T, the 95% Confidence Interval (CI) of T, SR, BCSR, WCSR, TFR and ρ_RB. In boldface are the best values of each performance measure over each validation method and n/D. The table shows that only the oracles achieve SR = 100%, with the ICARUS portfolio having the best overall performance with ρ_RB = 2.136 during HOPO. The table also shows that T is always the lowest for n/D = 100, given that less of the budget is used to calculate the ELA features. The highest SR is achieved with n/D = 562 for the ICARUS portfolios and n/D = 1000 for the BBLL portfolio. However, the differences with smaller n may not justify the additional cost. For example, an SR of {HOIO: 99.7%, HOPO: 93.6%} and a WCSR of {HOIO: 0.5%, HOPO: 4.2%} are achieved with n/D = 100 for the ICARUS sets, a difference of {HOIO: −0.1%, HOPO: −1.7%} in SR and {HOIO: 0.1%, HOPO: −1.6%} in WCSR with respect to n/D = 562. Compared with totally random selection, the selectors' WCSR and TFR are at least one order of magnitude smaller. Table 3 also shows, on average, a decrease in BCSR of ≈40%, and increases in WCSR and TFR of ≈7% and ≈3%, respectively; although the results for the ICARUS sets are below average for WCSR and TFR, indicating better performance. Furthermore, SR is
always above 90% for the ICARUS sets, while SR falls below 90% during HOPO for the BBLL set. Only BIPOP-CMA-ES, with ρ_RB = 1.823, can match the overall performance of a selector. Figure 6 illustrates the SR against T for the complete benchmark set. Figure 6a shows the results of the individual algorithms, while Figure 6b shows the performance of the oracles as solid lines and the random selectors as dashed lines for the ICARUS/HOIO, ICARUS/HOPO, and BBLL portfolios. As expected from Figure 5a, the ICARUS/HOPO oracle has a degradation in performance in the easier 10% of the problems. On the top 10% of the problems the performance of the three oracles is equivalent. The ICARUS/HOPO oracle has a small improvement in performance between 10% and 90% of the problems. Figure 6c,d illustrate the performance of the selectors for n/D = {100, 1000}. The shaded region indicates the theoretical area of performance, with the upper bound representing the ICARUS oracle and the lower bound random selection. Although it is unfeasible to match the performance of the oracle, due to the cost of extracting the ELA measures and a BCSR less than 100%, the figure shows the performance improvement against random selection. During HOIO, the ICARUS selector surpasses random selection at 106 f_evals/D with n/D = 100, representing 5.9% of the problems; and at 1349 f_evals/D with n/D = 1000, representing 31.2% of the problems. During HOPO, the ICARUS selector surpasses random selection at 168 f_evals/D with n/D = 100, representing 9.9% of the problems; and at 1620 f_evals/D with n/D = 1000, representing 40.3% of the problems.

To understand where the selectors fail, we calculate the average SR, BCSR, WCSR and TFR for a given problem or a given dimension during the HOIO and HOPO validations. Table 4 shows these results, where in boldface are the SR and BCSR values less than 90%, and the WCSR and TFR values higher than 10%. The table shows that an instance of either {f_23, f_24} has the highest probability of remaining unsolved, despite the model having information about other instances of these functions. Some problems, such as {f_1, f_2, f_5}, have high WCSR and low TFR values, which can be explained by their simplicity, as most algorithms can solve them. The performance appears to be evenly distributed in terms of dimension. Overall, the selector appears to struggle with those problems where only one algorithm may be the best. Nevertheless, these results indicate that the architecture of the selector is adequate for testing the effect of systemic errors.

Sampling Effects on Selection
Table 5 shows the 95% CI of T, SR, BCSR, WCSR, and TFR. The sub-index represents the estimation bias, defined as the difference between the bootstrap mean and the original mean, which was presented in Table 3. A large bias implies that the estimated CI is less reliable. In boldface are those intervals whose range is the smallest for a given validation. The lowest bias is also in boldface. The table shows that the smallest range is achieved with n/D = 1000, implying that more stable predictions are made by the selector with a large sample size. This also implies more stable ELA features. However, the differences are relatively minor. For example, the differences between the upper and lower bounds of T represent 65 f_evals/D for n/D = 100, 63 f_evals/D for n/D = 316, and 37 f_evals/D for n/D = 1000. For SR, the difference is 0.8% for n/D = 100, 0.6% for n/D = 316, and 0.4% for n/D = 1000. Hence, the ELA features are sufficiently stable with a small n to achieve reliable selection.
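The comparison of interval ranges reduces to simple arithmetic on the bootstrap bounds. The sketch below is illustrative only: the interval bounds are hypothetical placeholders, chosen so that their widths match the 65, 63, and 37 f_evals/D quoted above.

```python
def interval_width(ci):
    """Width of a bootstrapped confidence interval (upper minus lower bound)."""
    lo, hi = ci
    return hi - lo

# Hypothetical CI bounds on T (in f_evals/D), keyed by sample size n/D;
# only the widths echo the values reported in the text.
cis = {100: (410.0, 475.0), 316: (402.0, 465.0), 1000: (420.0, 457.0)}
widths = {n: interval_width(ci) for n, ci in cis.items()}
most_stable = min(widths, key=widths.get)  # smallest range -> most stable
```

The smallest width identifies the sample size with the most stable predictions, here n/D = 1000, matching the pattern reported in Table 5.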
Table 5. Bootstrapped 95% CI of T, SR, BCSR, WCSR, and TFR. The sub-index represents the bias of the estimation. In boldface are those intervals whose range is the smallest for a given validation, and the lowest bias.

Discussion
The overarching aim of this paper was to explore the effect that uncertainty in the features has on the performance of automated algorithm selectors. To meet this aim, we have designed, implemented and validated an algorithm selector, and analyzed the effects that feature accuracy has on the selector's performance. These methods were described in Section 3. We have carried out an experimental validation using the COCO noiseless benchmark set [33] as representative problems. The results described in Section 6 are encouraging, as they demonstrate that as long as the features are informative and have low variability, the selection error will be small. This conclusion is the main contribution of our paper. However, there are several trade-offs in the selector's performance associated with the methodology employed. We discuss these points in detail below.
The results have shown that an increase in the sample size n does not improve SR and BCSR. On the contrary, large sample sizes deteriorate the performance in terms of T for the easier problems. Although n = D × 10^2 is a reasonable cost for calculating the ELA features, this cost could be amortized if each candidate used to calculate the features also served as the starting point of a random restart. It is known that random restarts improve the heavy-tailed behavior of optimization algorithms [51]. It may also be the case that an optimal n exists for every problem instance and ELA measure. Potentially, n could be decreased until the accuracy of the selector deteriorates. However, it should be noted that the learning model does not predict the variance of T, which is an indicator of the reliability of an algorithm. Therefore, the selector does not differentiate between reliable and unreliable algorithms. A model that predicts the variance may improve the selector's accuracy. Further investigation of these issues is left for future research.
The results confirm that the 'no-free-lunch' theorem [27] still holds for the selector. The cost of the ELA stage decreases the selector's performance on the simplest problems compared to a method such as BFGS. Moreover, a parallel implementation of the portfolio would certainly achieve SR = 100%, with a trade-off on T. However, it should be noted that the component algorithms were selected using the complete information from the knowledge base; hence, this is an estimate of the ideal performance. A key advantage of the selector is that once the allocated budget exceeds a threshold, performance gains are evident, particularly if the budget is a few orders of magnitude larger. Therefore, the selector is competitive when the precision of the solution is more important than the computational cost. For a BBO problem, where there is no certainty about the difficulty, the selector gives some assurance that a solution can be found with a small trade-off. Additionally, the results show that the selector is robust to noise in the ELA features.
Due to their algorithmic simplicity and scalability with D, the ELA features allowed us to 'sample once and measure many'. Hence, they reduced the cost, measured as f_evals/D, of the measuring phase. However, this does not mean that they are sufficient, complete or even necessary. Other features may therefore be investigated that could potentially improve accuracy.
Finally, the reduced number of problems in the COCO benchmark set limits the generality of our results. The set is considered to be a space-filling benchmark set [2]. However, this conclusion was based on a graphical representation, which is dependent on the employed ELA features. Therefore, we consider it necessary to add functions to the training set. This can be accomplished by using an automatic generator [52] or by examining other existing benchmark sets. In conjunction with a visualization technique, this approach may reveal new details about the nature of the problem set [31]. This is a significant challenge in continuous BBO, for which the selector is a fundamental part of the solution.

Figure 1 .
Figure 1. Diagram of the algorithm selector based on Rice's framework [20]. The selector has three main components: Performance evaluation, Feature extraction, and Algorithm selection.

Figure 2 .
Figure 2. Details of the feature extraction and algorithm selection processes.

Figure 3 .
Figure 3. Structure of the multi-class learning model for A = {α_1, α_2, α_3, α_4}. The six BC models take the vector of ELA features c as input and generate probability scores, R(α_i, α_j), which are used to produce a recommended algorithm α_b as output.
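A minimal sketch of how such pairwise probability scores could be aggregated into a single recommendation is given below. It assumes a simple weighted-voting rule over the one-vs-one scores; the toy models and scores are placeholders, not the trained BC models.

```python
import numpy as np
from itertools import combinations

def select_algorithm(feature_vec, bc_models, algorithms):
    """Aggregate pairwise probability scores into one recommendation.

    bc_models maps each pair (i, j) to a callable returning R(alpha_i, alpha_j),
    a score in [0, 1] favouring alpha_i over alpha_j for this feature vector.
    """
    votes = np.zeros(len(algorithms))
    for (i, j), model in bc_models.items():
        r = model(feature_vec)
        votes[i] += r          # credit toward alpha_i
        votes[j] += 1.0 - r    # remaining credit toward alpha_j
    return algorithms[int(np.argmax(votes))]

# Toy models with fixed preferences standing in for trained classifiers:
# every pair involving the first algorithm favours it.
algorithms = ["alpha1", "alpha2", "alpha3", "alpha4"]
bc_models = {pair: (lambda c, p=pair: 0.8 if p[0] == 0 else 0.4)
             for pair in combinations(range(4), 2)}
best = select_algorithm(np.zeros(8), bc_models, algorithms)
```

With four algorithms there are exactly six pairwise models, matching the six BC models in Figure 3.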

Figure 5 .
Figure 5. Performance of the best algorithm in terms of T for each problem. The lines represent the dimension of the problem. The Nelder-Doerr algorithm replaces BFGS in the ICARUS/HOPO portfolio for most of the problems.

Figure 6 .
Figure 6. Performance of the methods from Table 3 in terms of SR against T. Figure (a) illustrates the performance of the individual algorithms, Figure (b) of the oracles and random selectors, Figure (c) of the selectors during HOIO, and Figure (d) of the selectors during HOPO. For the latter two figures, solid lines represent selectors using n/D = 100, while dashed lines represent selectors using n/D = 1000. The shaded region indicates the theoretical area of performance, with the upper bound representing the ICARUS oracle and the lower bound random selection.

Table 1 .
Summary of the ELA features employed in this paper. Each one was scaled using the listed approach.

Table 3 .
Performance of each algorithm and of the selectors, in terms of T, the 95% confidence interval of T, SR and ρ_RB, which is the normalized performance score against random selection using the BBLL portfolio. In boldface are the lowest values for each performance measure over each validation method and n.

Table 4 .
Average SR for a given problem and a given dimension during the HOIO and HOPO validations, and average BCSR and WCSR for a given problem and a given dimension during the HOPO validation. In boldface are the SR and BCSR values less than 90%, and the WCSR values higher than 10%.