This section discusses the need for robust exploratory landscape analysis (ELA) measures and presents an approach to determine the sample size that results in robust ELA measures. This approach is then applied to a large collection of benchmark functions, and the results are presented and analyzed. Finally, the sample size to use for the ELA measures in the remainder of the study is chosen.
3.1. Determining Robustness
The choice of the sample size for calculating ELA measures presents a trade-off between accuracy and computational cost. For smaller sample sizes, the computational cost is low, but the accuracy of the resulting ELA measures is poor. For larger sample sizes, the computational cost is high, but the accuracy of the resulting ELA measures improves.
As noted by Muñoz et al. [
35], an ELA measure
$c(f,n)$ which is calculated on an objective function
f from a sample size
n is a random variable. Therefore,
$c(f,n)$ has a probability distribution whose variance, ${\sigma}^{2}$, depends on both f and n, and should converge to zero as n tends to infinity. Several independent runs of a measure
$c(f,n)$ can be conducted to approximate the probability distribution. When the variance,
${\sigma}^{2}$, is small then
$c(f,n)$ is said to be robust. Defining when the variance is small results in extra hyperparameters when using ELA, as a threshold needs to be defined for each ELA measure. Rather than defining an absolute threshold, this study makes use of a procedure that determines when the variance becomes small enough, relative to increasing sample sizes. This, coupled with the fact that the variance tends to zero as the sample size increases, allows one to determine the sample size needed to provide robust ELA measures.
To determine the sample size needed to produce a robust measure, non-parametric statistical tests are needed, since the
$c(f,n)$ distributions are unknown and are unlikely to follow the normal distribution. In the literature, there are many hypothesis tests for equality of variance, also called homogeneity of variance, the most common being the
F-test, which tests the hypothesis that two normal distributions have the same variance. The Levene test [
36] is used to determine whether
k samples have equal variances, and can be used as an alternative to the
F-test when the population data does not follow the normal distribution. Consider
k random samples, where the
i-th sample has observations
${x}_{i1},{x}_{i2},\dots ,{x}_{i{n}_{i}}$. Levene considers the absolute differences between each observation and its corresponding group mean, i.e.,
${d}_{ij}=|{x}_{ij}-{\overline{x}}_{i.}|$,
$i=1,2,\dots ,k$,
$j=1,2,\dots ,{n}_{i}$, where
${n}_{i}$ is the number of observations in the
i-th group, and
${\overline{x}}_{i.}$ is the sample mean for the
i-th group. Then, the Levene test statistic is defined as:

$W=\frac{(N-k)}{(k-1)}\cdot \frac{{\sum }_{i=1}^{k}{n}_{i}{({\overline{d}}_{i.}-{\overline{d}}_{..})}^{2}}{{\sum }_{i=1}^{k}{\sum }_{j=1}^{{n}_{i}}{({d}_{ij}-{\overline{d}}_{i.})}^{2}}$

where ${\overline{d}}_{i.}$ is the mean of the ${d}_{ij}$ in the i-th group, ${\overline{d}}_{..}$ is the mean of all the ${d}_{ij}$, and $N={\sum }_{i=1}^{k}{n}_{i}$ is the total number of observations. Under the null hypothesis of equal variances, W approximately follows an F-distribution with $k-1$ and $N-k$ degrees of freedom.
In the above, ${d}_{ij}=|{x}_{ij}-{\overline{x}}_{i.}|$ are the absolute differences between each observation and its corresponding group mean, as defined earlier. Brown and Forsythe [37] proposed a modification to the Levene test that provides more robust results, in which the absolute difference between each observation and its corresponding group median is used instead. That is, ${d}_{ij}=|{x}_{ij}-{\tilde{x}}_{i.}|$, where ${\tilde{x}}_{i.}$ is the median of the i-th group.
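For illustration, the Brown and Forsythe variant is available in SciPy as scipy.stats.levene with center='median' (the data below are hypothetical; the study itself works in R):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Two hypothetical groups of ELA measure values with different spread.
group_a = rng.normal(loc=0.5, scale=0.30, size=30)
group_b = rng.normal(loc=0.5, scale=0.10, size=30)

# Brown-Forsythe test: deviations are taken from the group medians.
stat, p_value = stats.levene(group_a, group_b, center='median')
print(f"W = {stat:.3f}, p = {p_value:.4f}")
```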
However, it is only of interest whether the variance of the measures decreases as the sample size increases, not whether the variances for different sample sizes are equal, since a two-sided hypothesis test for equality of variances does not indicate which of the two samples has the larger variance. For this purpose, the Levene trend test [38] can be used to determine whether there is a monotonically increasing or decreasing trend in a group of variances. As described in [39], such a hypothesis test can be set up as follows: the null hypothesis ${H}_{0}:{\sigma }_{1}^{2}={\sigma }_{2}^{2}=\dots ={\sigma }_{k}^{2}$ is tested against the alternative of a monotonic trend, e.g., ${H}_{1}:{\sigma }_{1}^{2}\ge {\sigma }_{2}^{2}\ge \dots \ge {\sigma }_{k}^{2}$ with at least one strict inequality for a decreasing trend.
Then, all observations in a group
i are assigned a score
${w}_{i}$, for each group
$i=1,\dots ,k$. Now, regress the transformed data,
${d}_{ij}$, on
${w}_{i}$ and consider the regression slope

$\widehat{\beta }=\frac{{\sum }_{i=1}^{k}{\sum }_{j=1}^{{n}_{i}}({w}_{i}-\overline{w})({d}_{ij}-\overline{d})}{{\sum }_{i=1}^{k}{n}_{i}{({w}_{i}-\overline{w})}^{2}}$

where $\overline{w}={\sum }_{i=1}^{k}{n}_{i}{w}_{i}/N$ is the weighted mean of the scores and $\overline{d}$ is the mean of all the ${d}_{ij}$.
Under the null hypothesis, $\widehat{\beta}=0$ and the test statistic follows a t-distribution with $(N-1)$ degrees of freedom, where N is the total number of observations across all groups. Scores can be assigned as either linear or non-linear functions of the group index, which allows testing for linear or non-linear trends in the variances, respectively. In this study, linear scores are used, that is, ${w}_{i}=i$ for $i=1,\dots ,k$.
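The slope-based trend test can be sketched as follows, assuming Brown-Forsythe-style deviations from the group medians, linear scores ${w}_{i}=i$, and an ordinary least-squares slope with its t-test (an illustrative Python sketch, not the lawstat implementation used in the study):

```python
import numpy as np
from scipy import stats

def levene_trend(groups):
    """Test for a monotonic trend in variance across ordered groups.

    groups: list of 1-D arrays, ordered by increasing sample size.
    Returns the regression slope of the absolute deviations from the
    group medians on the linear scores w_i = i, and a two-sided p-value.
    """
    d, w = [], []
    for i, g in enumerate(groups, start=1):
        dev = np.abs(np.asarray(g) - np.median(g))  # Brown-Forsythe deviations
        d.extend(dev)
        w.extend([i] * len(dev))                    # linear score per group
    res = stats.linregress(w, d)                    # OLS slope and its t-test
    return res.slope, res.pvalue

rng = np.random.default_rng(0)
# Hypothetical groups whose spread shrinks as the (simulated) sample size grows.
groups = [rng.normal(0, s, size=30) for s in (0.5, 0.3, 0.1)]
slope, p = levene_trend(groups)
print(f"slope = {slope:.3f}, p = {p:.4f}")
```

A negative slope indicates that the deviations, and hence the variances, decrease across the ordered groups.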
The lawstat package [
39] in R is used to perform the Levene trend test.
Now, to determine at what sample size a measure $c(f,n)$, for a particular objective function f, becomes robust, the following procedure is performed:
Choose the sample sizes $s={s}_{1},...,{s}_{M}$ to be investigated.
For each sample size ${s}_{i}$, calculate the measure $c(f,{s}_{i})$ for r independent runs.
Perform the Levene trend test on the above samples for each pair of consecutive sample sizes, ${s}_{i}$ and ${s}_{i+1}$. In this case, there are $k=2$ groups. Obtain the test statistic and p-value.
For each pair of sample sizes, if the resulting p-value is less than or equal to the predefined significance level, $\alpha $, then the null hypothesis is rejected. This implies that it is likely that there is a monotonic decrease in the variance between the two sample sizes. If the p-value is greater than $\alpha $, then the null hypothesis cannot be rejected, and there is no evidence that the variance decreases between the two sample sizes.
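The pairwise testing loop in the steps above might look as follows (an illustrative Python sketch with invented measure values; the study uses lawstat in R, and the one-sided slope test stands in for the Levene trend test):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
D = 10
sizes = [50 * D, 100 * D, 200 * D]  # a truncated set of the investigated sizes
r = 30                              # independent runs per sample size
alpha = 0.05

# Step 2: hypothetical measure values c(f, s_i) over r runs, with variance
# shrinking as the sample size grows (a toy stand-in for real ELA values).
runs = {s: rng.normal(0.0, 1.0 / np.sqrt(s), size=r) for s in sizes}

# Step 3: a trend test per consecutive pair (k = 2 groups), sketched as an
# OLS slope test on the Brown-Forsythe deviations with scores w = 1, 2.
for s_small, s_large in zip(sizes, sizes[1:]):
    d, w = [], []
    for score, s in enumerate((s_small, s_large), start=1):
        dev = np.abs(runs[s] - np.median(runs[s]))
        d.extend(dev)
        w.extend([score] * r)
    res = stats.linregress(w, d)
    # One-sided p-value for a decreasing trend in variance.
    p_one_sided = res.pvalue / 2 if res.slope < 0 else 1 - res.pvalue / 2
    print(s_small, s_large, p_one_sided <= alpha)
```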
When using the procedure described above for a particular ELA measure, there are several possibilities with regard to the number of occurrences of p-values $<\alpha $:
Zero occurrences: This implies that there is no evidence that the variance is lower for any sample size. The smallest sample size is chosen as the point of robustness since there is no decrease in variance from increasing sample size.
One occurrence: The first sample size after the occurrence is chosen to be the point of robustness.
Two or more consecutive occurrences: The first sample size after the chain of consecutive occurrences is chosen as the point of robustness.
Two or more non-consecutive occurrences: The first sample size after the first chain of consecutive occurrences is chosen as the point of robustness.
Please note that when the null hypothesis is rejected for a pair of sample sizes, it implies that the variance of the larger sample size is statistically likely to be lower than the variance of the smaller sample size. Therefore, the larger sample size is chosen as the point of robustness.
Based on the observation of Muñoz et al. [
35], the variance of a particular ELA measure tends to zero as the sample size increases. For the case of two or more non-consecutive occurrences of statistically significant pairs, Muñoz et al.’s observation implies that the first chain of statistically significant pairs is more likely to provide practically significant differences in variance than the second, or later, chain of statistically significant pairs. Therefore, the first sample size after the first chain of statistically significant pairs is chosen as the point of robustness.
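The selection rules above can be sketched as a small function (names and inputs are illustrative, not the study's implementation); it receives, for each consecutive pair of sample sizes, whether the trend test's p-value was at most $\alpha $:

```python
def point_of_robustness(significant, sizes):
    """Select the point of robustness from pairwise trend test results.

    significant: list of booleans, one per consecutive pair (sizes[i], sizes[i+1]),
                 True when p <= alpha (the variance decreased significantly).
    sizes: the investigated sample sizes, in increasing order.
    """
    if not any(significant):
        # Zero occurrences: no evidence of decreasing variance anywhere,
        # so the smallest sample size is already robust.
        return sizes[0]
    # Find the first chain of consecutive significant pairs and return the
    # sample size immediately after that chain.
    first = significant.index(True)
    end = first
    while end < len(significant) and significant[end]:
        end += 1
    return sizes[end]

sizes = [50, 100, 200, 300, 400]
print(point_of_robustness([False, False, False, False], sizes))  # 50
print(point_of_robustness([False, True, True, False], sizes))    # 300
```

The same rule covers the one-occurrence and non-consecutive cases, since both reduce to "the first sample size after the first chain of significant pairs."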
3.2. Empirical Procedure
As noted in
Section 2, there are several benchmark functions defined in the literature. In this study, the following benchmark functions are investigated:
Then, the ELA measures described in
Section 2 are calculated for varying sample sizes for the 340 benchmark functions listed above. To calculate the ELA measures, the flacco library [
23] is used. In particular, the following sample sizes are investigated:
$50\times D$,
$100\times D$,
$200\times D$,
…,
$1000\times D$, where
D is the dimensionality of the decision variable space. The improved Latin hypercube sampling [
19] algorithm is used to sample the points for the ELA measures. This study focuses on the case when
$D=10$, since it is the only dimensionality for which both the CEC and BBOB benchmark suites are defined. All of the investigated feature sets, and therefore all features, are calculated from the same generated sample. The ELA measures that are investigated are described in
Section 2. The breakdown of the ELA measures is as follows: 16 dispersion measures, three y-distribution measures, 18 level-set measures, nine meta-model measures, five information content measures, five nearest better clustering measures, and eight principal component analysis measures, for a total of 64 measures. Each measure is calculated over 30 independent runs for every combination of function and sample size.
For this hypothesis test, the level of significance,
$\alpha $, is chosen as 5% a priori. Please note that the choice of the level of significance has a strong impact on the procedure for determining the point of robustness. If
$\alpha $ is large, then it is likely that the Levene trend test will find statistically significant differences in the variance between the pairs of smaller sample sizes, and therefore the point of robustness will be a relatively small sample size. If
$\alpha $ is small, then it is likely that the Levene trend test will either (i) find statistically significant differences in the variance between pairs of larger sample sizes, and the point of robustness will occur at larger sample sizes, or (ii) find no pairs of statistically significant differences in variances. To estimate the sampling distribution more accurately, bootstrapping is performed on the samples used as input to the Levene trend test, as described by Lim and Loh [
40]. In the experiments, bootstrapping is performed with replacement, and the number of bootstrap samples is set to 10,000.
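As a rough sketch of the idea (the study follows the procedure of Lim and Loh [40]; the centring-and-pooling scheme, the reduced resample count, and the values below are illustrative assumptions, not their exact algorithm):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_boot = 2_000  # the experiments use 10,000 bootstrap samples

# Hypothetical ELA measure values over 30 runs at two sample sizes.
runs_small = rng.normal(0.0, 0.4, size=30)
runs_large = rng.normal(0.0, 0.2, size=30)

observed, _ = stats.levene(runs_small, runs_large, center='median')

# Approximate the null distribution of the statistic: centre each group at
# its median, pool the values, and resample both groups with replacement.
pooled = np.concatenate([runs_small - np.median(runs_small),
                         runs_large - np.median(runs_large)])
boot_stats = np.empty(n_boot)
for b in range(n_boot):
    a = rng.choice(pooled, size=runs_small.size, replace=True)
    c = rng.choice(pooled, size=runs_large.size, replace=True)
    boot_stats[b], _ = stats.levene(a, c, center='median')

boot_p = np.mean(boot_stats >= observed)
print(f"observed W = {observed:.3f}, bootstrap p = {boot_p:.4f}")
```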
3.3. Results and Discussion
Figure 1 provides the distribution of the point of robustness for each of the investigated ELA measures on all the investigated benchmark functions.
Figure 1 indicates that most of the ELA measures have two dominating points of robustness, with peaks at sample sizes
$50\times D$ and
$200\times D$. The features that provide the lowest point of robustness are pca.expl_var.cov_x, pca.expl_var.cov_init, pca.expl_var.cor_init, ela_meta.lin_simple.coef.max_by_min, and ela_distr.number_of_peaks, with large peaks at sample size
$50\times D$. The measures which have platykurtic distributions, i.e., wide and flat distributions, are the dispersion and level-set feature sets. These platykurtic distributions indicate that the number of sample sizes needed to produce robust ELA measures often differs for different benchmark functions.
Figure 1 also shows that measures in a particular feature set tend to have the same distribution for the point of robustness. This observation is most prominent for the dispersion feature set. These similar distributions may indicate that measures within a feature set are highly correlated.
Figure 2 contains the plots of the distribution of the point of robustness for all investigated benchmark suites.
Figure 3 contains the plots of the distribution of the point of robustness for the combination of all investigated benchmark functions. These two figures aggregate the point of robustness across all investigated ELA measures for a particular benchmark suite. Thus, the focus in Figure 1 is on the different ELA measures, while the focus in Figure 2 and Figure 3 is on the different benchmark suites.
Figure 2 indicates that the distribution of the point of robustness is roughly the same for the BBOB and CEC benchmark suites. These distributions appear to follow a negative binomial distribution, and this is validated with a goodness-of-fit test.
Figure 2f contrasts the robustness results of the BBOB and CEC benchmark suites, and indicates that the miscellaneous functions generally have a point of robustness at
$50\times D$. It is hypothesized that the oscillation functions used in the BBOB and CEC benchmark suites induce more phase transitions in the fitness landscapes, whereas the miscellaneous functions are not oscillated. However, further research is required to determine the true cause.
As seen in
Figure 3, most benchmark functions provide robust ELA measures at sample sizes of
$50\times D$,
$100\times D$ and
$200\times D$. Since the different benchmark suites are defined for various search space dimensionalities, it is interesting to note that the distributions of the point of robustness do not change significantly between the benchmark suites. This implies that the improved Latin hypercube sampling algorithm is a good choice to generate samples for ELA, and is likely to provide good coverage of the function space, regardless of the size of the search space.
As noted above, an ELA measure $c(f,n)$ depends on both the function f and the sample size n. Since the procedure to determine the point of robustness for an ELA measure holds the function constant and varies the sample size, a summary statistic is needed to generalize the robustness of a particular ELA measure across a collection of functions. For this purpose, percentiles may be used. A percentile describes the percentage of observations that fall below a particular value. For example, the median is the 50th percentile, which implies that $50\%$ of the observations in a data set lie below the median.
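For illustration, percentiles of a hypothetical set of points of robustness can be computed as follows (the values are invented, and NumPy's default linear interpolation is assumed):

```python
import numpy as np

# Hypothetical points of robustness (as multiples of D) for one ELA measure
# across ten benchmark functions.
points = np.array([50, 50, 100, 100, 200, 200, 200, 400, 800, 1000])

median = np.percentile(points, 50)  # 50th percentile
p95 = np.percentile(points, 95)     # 95th percentile
print(median, p95)
```

Choosing the 95th percentile here selects a sample size at least as large as the point of robustness of 95% of the functions.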
Table 1 contains the percentiles for the point of robustness over all the investigated benchmark functions.
To select the appropriate percentile, the practitioner should consider the sensitivity and consequences that the choice of the sample size will have on the later stages of the application of landscape analysis. As noted earlier, there is a trade-off between accuracy and computational cost when calculating ELA measures. For example, if ELA measures are used in automated algorithm selection, a practitioner may be satisfied with lower accuracy to keep computational costs down. In contrast, selecting a benchmark suite with which to compare algorithms has significant effects on the algorithm selection problem; in this case, large computational costs for landscape analysis are acceptable so that a comprehensive, representative benchmark suite may be found. The larger the chosen percentile, the larger the sample size, and the higher the computational cost and accuracy will be. For this reason, the 95th percentile is chosen to determine the sample size used for the remainder of the study.
All ELA measures that belong to a particular feature set are calculated from the same sample. Additionally, several feature sets can be calculated from the same sample, as is the case for the feature sets used in this study. Using such a group of measures is advantageous, since multiple measures are obtained from a single sample, which allows for a more accurate characterization of a benchmark problem. However, as shown in
Table 1, different measures provide robust results at different sample sizes. Since these measures are calculated from the same sample, the point of robustness of the group of measures should be defined as the largest sample size needed for any single measure within the collection of ELA measures.
When using the 95th percentile,
Table 1 indicates that a sample size of
$800\times D$ is the largest point of robustness for the whole collection of ELA measures. Therefore, for the remainder of the study, the ELA measures are calculated from a sample size of
$800\times D$.