Multi-Objective Optimization Benchmarking Using DSCTool

Abstract: When analyzing data, statistical approaches are a valuable tool for exploring it. Nowadays, with the increase in computational power and the availability of big data in different domains, it is not enough to perform exploratory data analysis (descriptive statistics) to obtain some prior insights from the data; it is also necessary to apply higher-level statistics, which in turn require much greater knowledge from the user to be applied properly. One research area where the proper use of statistics is important is multi-objective optimization, where the performance of a newly developed algorithm should be compared with the performances of state-of-the-art algorithms. In multi-objective optimization, we are dealing with two or more usually conflicting objectives, which results in high-dimensional data that needs to be analyzed. In this paper, we present a web-service-based e-Learning tool called DSCTool that can be used to perform a proper statistical analysis for multi-objective optimization. The tool does not require any special statistical knowledge from the user. Its usage and the influence of a proper statistical analysis are shown using data taken from a benchmarking study performed at the 2018 IEEE CEC (IEEE Congress on Evolutionary Computation) Competition on Evolutionary Many-Objective Optimization.


Introduction
Nowadays, comparing the performance of a newly developed multi-objective optimization algorithm typically involves calculating descriptive statistics such as means, medians, and standard deviations of the algorithm's performance. Recently, we have published several studies showing that calculating descriptive statistics is not enough to compare algorithms' performances; higher-level statistics are also required [1][2][3].
In this paper, we will show how different benchmarking practices can potentially have a big influence on the outcome of a statistical analysis performed on multi-objective optimization algorithms. Additionally, we will present a web-service-based e-Learning tool called DSCTool [4], which guides the user through all the steps needed to perform a proper statistical analysis. The DSCTool also reduces the statistical knowledge required from the user, which includes knowing which conditions must be fulfilled to select a relevant and proper statistical test (e.g., parametric or nonparametric) [5]. The conditions that should be checked before selecting a relevant statistical test include data independence, normality, and homoscedasticity of variances.
In multi-objective optimization, there is no single solution that simultaneously optimizes each objective. Since the objectives are said to be conflicting, there exists a set of alternative solutions. Each solution in this set is optimal in the sense that no other solution from the search space is superior to it when all objectives are considered. When a statistical analysis is applied to such data, it is very important to understand all its limitations and caveats, because improper usage can lead to wrong conclusions. For example, improper usage comes from misunderstanding or not knowing the requirements under which a certain statistical test can be applied. Recently, it was shown that outliers and small differences in the data can have a negative impact on the results obtained from statistical tests [11].

Outliers
An outlier is an observation that lies outside the overall distribution of the data [12]. If outliers are not properly handled, the descriptive statistics computed from the data can be affected and might not reflect the actual probability distribution. As a consequence, the statistical analysis may indicate a statistically significant difference while the probability distributions show the opposite.
Outliers are data points that differ significantly from the other data points in the data set and can cause serious problems in statistical analyses. For example, means are the most commonly used descriptive statistics in comparison studies because they are unbiased estimators. However, they are sensitive to outliers, so by using them for statistical analysis we inherently transfer this sensitivity to the results of any analysis that builds on them. One option to reduce the influence of outliers is to use medians, a more robust statistic that is less sensitive to outliers.
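A toy numeric example (with synthetic values, purely for illustration) makes the contrast concrete: a single outlying run shifts the mean dramatically while leaving the median almost unchanged.

```python
import numpy as np

# five well-behaved runs of a hypothetical quality indicator
runs = np.array([0.10, 0.11, 0.09, 0.10, 0.12])
# the same runs plus a single failed run that produced an outlier
runs_with_outlier = np.append(runs, 5.0)

# The mean is pulled far away from the bulk of the data by the outlier...
mean_before, mean_after = np.mean(runs), np.mean(runs_with_outlier)
# ...while the median barely moves.
median_before, median_after = np.median(runs), np.median(runs_with_outlier)
print(mean_before, mean_after, median_before, median_after)
```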

Small Differences in the Data
Though medians reduce the sensitivity to outliers, both means and medians can have a negative impact on results obtained from applying statistical tests. Such an impact can be observed when the differences between means or medians lie in some ε-neighborhood. An ε-neighborhood is a range in function value space in which the distance from a given number is less than some specified number ε. When the means or medians are not the same, but lie in some ε-neighborhood, the algorithms receive different rankings. However, if the distributions of multiple runs are the same, indicating that there are no performance differences between the compared algorithms, they should obtain the same rankings. Conversely, it can also happen that the distributions of multiple runs are not the same, indicating that the algorithms should obtain different rankings, while the means or medians are the same.
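This effect can be illustrated with a short sketch on synthetic data (a toy example, not DSCTool output): two hypothetical algorithms whose results differ only by a negligible offset ε = 10⁻⁴ receive different ranks under a mean-based comparison, while a distribution-based test finds no significant difference.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# 20 runs of a (minimized) quality indicator for a hypothetical algorithm
a = rng.normal(loc=1.0, scale=0.05, size=20)
# a second hypothetical algorithm whose runs differ only by a negligible
# offset eps = 1e-4, far below the spread of the data
b = a + 1e-4

# Mean-based ranking: the means differ (by exactly eps), so the two
# algorithms would receive different ranks.
means_differ = a.mean() != b.mean()

# Distribution-based comparison (the DSC idea): a two-sample
# Kolmogorov-Smirnov test finds no significant difference, so both
# algorithms should share the same rank.
_, p = stats.ks_2samp(a, b)
same_distribution = p >= 0.05
print(means_differ, same_distribution)
```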

The DSCTool
The DSCTool is presented in [4] as an e-Learning tool whose goal is to cover all required scenarios for different performance evaluations of single- and multi-objective optimization algorithms. It guides the user through the steps of a statistical analysis pipeline: preparing and providing the input data (optimization algorithm results), selecting the desired comparison scenario, and obtaining the final result. In this way, the statistical knowledge required from the user is greatly reduced. In fact, the DSCTool allows users not to think about how the statistical analysis should be performed, but only to define the comparison scenario that is relevant for their experimental setup. Basically, the user must select a significance level used by the statistical test and follow a pipeline for selecting the most commonly used statistical tests in benchmarking evolutionary algorithms. For multi-objective optimization algorithm analysis, the user must provide the data in the form of one or more quality indicators, decide on a significance level, and choose which kind of ensemble (i.e., average, hierarchical majority vote, or data-driven) is desired for comparing the data calculated by the quality indicators. Following the pipeline is straightforward, as will be shown in the next sections.
The DSCTool was developed for the Deep Statistical Comparison (DSC) approach [2] and its variants, which all provide a robust statistical analysis with reduced influence of outliers and small differences in the data, since the comparison is made on data distributions. In this paper, we will only show the web services that implement the DSC variants used with multi-objective optimization, i.e., the basic DSC ranking scheme used for comparing data for one quality indicator [13], and the average, hierarchical majority vote, and data-driven ensembles that are used to fuse the information from several quality indicators [1,3].
The DSC ranking scheme is based on comparing distributions using a two-sample statistical test [14] with a predefined significance level. The obtained rankings are then used for further statistical comparison. The DSC average ensemble is a ranking scheme that uses a set of user-selected quality indicators [1] and calculates the mean of the DSC rankings over the quality indicators on each specific problem [15]. The DSC hierarchical majority vote ensemble is a ranking scheme [1] that checks which algorithm wins in the most quality indicators, i.e., which algorithm most often achieves the best DSC ranking, on each benchmark problem separately. Finally, in the DSC data-driven ensemble [3], the preference of each quality indicator is estimated using its entropy, which is calculated by the Shannon entropy weighted method [16]. The preference ranking organization method (PROMETHEE) [17] is then used to determine the rankings.
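To make the core idea of the DSC ranking scheme concrete, the following is a simplified sketch on one problem, using the two-sample Kolmogorov-Smirnov test and synthetic data; the actual DSC implementation additionally handles transitivity violations among the pairwise comparisons and supports other tests such as the Anderson-Darling test.

```python
import numpy as np
from scipy import stats

def dsc_rank(samples, alpha=0.05):
    """Simplified DSC-style ranking of algorithms on one problem.
    samples: one 1-D array of runs per algorithm (quality indicator
    values, lower is better). Algorithms whose run distributions are
    statistically indistinguishable are grouped together and share the
    average of the ranks their group occupies (fractional ranking)."""
    n = len(samples)
    order = sorted(range(n), key=lambda i: np.mean(samples[i]))
    groups, current = [], [order[0]]
    for idx in order[1:]:
        # open a new group only when this algorithm's distribution
        # differs significantly from every member of the current group
        if all(stats.ks_2samp(samples[idx], samples[j]).pvalue < alpha
               for j in current):
            groups.append(current)
            current = [idx]
        else:
            current.append(idx)
    groups.append(current)
    ranks = np.zeros(n)
    pos = 1
    for g in groups:
        for i in g:
            ranks[i] = pos + (len(g) - 1) / 2.0  # fractional rank
        pos += len(g)
    return ranks

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 50)
runs = [a,                          # algorithm A
        a + 1e-4,                   # algorithm B: negligible offset from A
        rng.normal(5.0, 1.0, 50)]   # algorithm C: clearly worse
ranks = dsc_rank(runs)
print(ranks)
```

Here A and B share rank 1.5 because their distributions are indistinguishable, while a mean-based ranking would have separated them; C alone receives rank 3.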
The DSCTool offers web services that are accessible at https://ws.ijs.si:8443/dsc-1.5/service/ and follow the REST software architectural style. The web services use the JSON format to transfer input and output data. More detailed information about using the DSCTool can be found at https://ws.ijs.si:8443/dsc-1.5/documentation.pdf.
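As an illustration, a request to such a service could be assembled along the following lines. Note that the payload field names below are purely illustrative assumptions, not the real schema; the authoritative JSON format is described in the DSCTool documentation linked above.

```python
import json

# Hypothetical input for a rank-style web service: two algorithms with
# their per-problem quality indicator values. All field names here
# ("method", "alpha", "algorithm", "problems", ...) are illustrative
# assumptions only -- consult the DSCTool documentation for the real schema.
payload = {
    "method": {"name": "AD", "alpha": 0.05},  # e.g. two-sample Anderson-Darling
    "data": [
        {"algorithm": "CVEA3",
         "problems": [{"name": "MaF1", "data": [0.21, 0.20, 0.22]}]},
        {"algorithm": "AGE-II",
         "problems": [{"name": "MaF1", "data": [0.25, 0.24, 0.26]}]},
    ],
}
body = json.dumps(payload)
# The service would then be invoked with an HTTP POST, e.g. with the
# requests package:
# requests.post("https://ws.ijs.si:8443/dsc-1.5/service/rank",
#               data=body, headers={"Content-Type": "application/json"})
print(body[:40])
```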
The CEC 2018 Competition on Evolutionary Many-Objective Optimization

As performance metrics, the competition organizers selected the inverted generational distance (IGD), for which 10,000 uniformly distributed reference points were sampled on the Pareto front, and the hypervolume (HV), for which the population was normalized by the nadir point of the Pareto front and a Monte Carlo estimation method with 1,000,000 points was adopted. Each algorithm was executed 20 times on each problem with 5, 10, and 15 objectives, resulting in 900 approximation sets. Using these approximation sets, both quality indicators were calculated and the mean of each quality indicator was taken as a comparison metric for each problem and number of objectives. The algorithms were then ranked according to the comparison metric, and the final score of each algorithm was determined as the sum of the reciprocal values of its rankings. In Tables 1 and 2, the official competition ranking results for 5 and 10 objectives are presented for both quality indicators, respectively.
From the tables, it can be seen that CVEA3 is the best-performing algorithm, as it obtained a total ranking of 1 for both quality indicators with both 5 and 10 objectives.
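The two quality indicators used by the organizers can be sketched as follows (a minimal implementation for minimization problems; the competition's exact normalization and reference-point sampling are not reproduced here):

```python
import numpy as np

def igd(approx, ref_points):
    """Inverted generational distance: mean distance from each reference
    point on the Pareto front to its nearest point in the approximation
    set (lower is better)."""
    approx = np.asarray(approx, float)
    ref_points = np.asarray(ref_points, float)
    d = np.linalg.norm(ref_points[:, None, :] - approx[None, :, :], axis=2)
    return d.min(axis=1).mean()

def hv_monte_carlo(approx, ref, n_samples=100_000, seed=0):
    """Monte Carlo estimate of the hypervolume dominated by `approx`
    with respect to the reference point `ref` (higher is better)."""
    rng = np.random.default_rng(seed)
    approx = np.asarray(approx, float)
    ref = np.asarray(ref, float)
    lo = approx.min(axis=0)
    samples = rng.uniform(lo, ref, size=(n_samples, len(ref)))
    # a sample is dominated if some solution is <= it in every objective
    dominated = (approx[None, :, :] <= samples[:, None, :]).all(axis=2).any(axis=1)
    return np.prod(ref - lo) * dominated.mean()
```

For example, an approximation set equal to the reference points has an IGD of 0, and a single point (0.5, 0.5) with reference point (1, 1) dominates a hypervolume of 0.25.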
Next, a statistical analysis of the same data using the DSCTool is shown. First, the quality indicators are calculated from the approximation sets obtained by the algorithms. In addition to IGD and HV, the generational distance (GD) and the epsilon indicator (EI) were also calculated. After all quality indicators are calculated, the data for each quality indicator must be organized in an appropriate JSON input for the rank web service, which performs the DSC ranking scheme. The JSON inputs for each quality indicator and each number of objectives are available at http://cs.ijs.si/dl/dsctool/rank-json.zip. For our analysis, the two-sample Anderson-Darling test was selected to compare the data distributions (since it is more powerful than the Kolmogorov-Smirnov test). To obtain the DSC rankings for each quality indicator, the rank web service was executed with the appropriate JSON input. The result is a JSON response in which the rankings are calculated based on the DSC ranking scheme. The ranking results for all quality indicators for 5 and 10 objectives are provided in Tables 3-6.

First, let us compare the results obtained in the official competition (see Tables 1 and 2) with the results obtained by the DSCTool, where the comparisons are done using only one quality indicator at a time (see Tables 3-6).
Here we would like to remind the reader that the competition ranking is based on a simple statistic, the mean value obtained for a specific quality indicator, while the DSCTool uses the DSC approach, in which the comparison is based on the distribution of the quality indicator values over several runs. One can quickly see that for some algorithms the obtained ranking is the same (e.g., for the algorithm CVEA3 on 5 objectives both approaches returned a ranking of 1 according to the IGD and HV quality indicators), while for others there is a large difference between the rankings (e.g., the algorithm AGE-II on 5 objectives with the HV quality indicator was ranked 3rd in the competition, while the DSCTool ranked it 9th). Since this paper is not about comparing different statistical approaches, we will not go into details. Nevertheless, we would like to remind the reader that although mean values are unbiased estimators, they can be heavily influenced by outliers and small differences in the data; for further details on this topic we refer the reader to [2]. Therefore, merely using different (in our case more powerful) statistics can drastically change the conclusions obtained from the rankings. In our experiment, the two additional quality indicators (i.e., GD and EI) were calculated on purpose, so that the influence of the indicator selection can be seen even better. Looking at Tables 3-6, drastic changes in the rankings can be observed. Looking again at the algorithm AGE-II, it can be seen that if only EI and GD were used, AGE-II would change from an average-performing algorithm to the best one, also outperforming the algorithm CVEA3 in all cases.
Though it is well known that the selection of the quality indicator can have a big influence on the outcome of a statistical analysis, this influence can also be clearly observed in Tables 3-6. In such situations, it is better to use an ensemble of several quality indicators and estimate performance according to all of them. The DSCTool provides the ensemble service, which calculates rankings according to inputs from several quality indicators. To do this, the ranking results from running the rank web service on all quality indicators are taken and used as inputs to the ensemble service. In addition to putting all the quality indicator data into a proper JSON form, one needs to decide which ensemble technique to use. The DSCTool provides three ensemble options: average, hierarchical, and data-driven. With the average method, the rankings are simply averaged; with the hierarchical method, the input rankings are viewed hierarchically and the algorithm with the highest rankings obtains the best ensemble ranking; and with the data-driven method, the information gain provided by the quality indicators is also taken into account when determining the ensemble ranking. Examples of the JSON input for the average ensemble method can be found at http://cs.ijs.si/dl/dsctool/M05-average.json and http://cs.ijs.si/dl/dsctool/M10-average.json for 5 and 10 objectives, respectively. To prepare the JSON input for the hierarchical or data-driven ensemble, the name of the method should be changed from average to hierarchical or data-driven, while everything else remains the same. The ensemble rankings were then obtained by executing the ensemble service with the appropriate JSON input. The rankings obtained from the ensemble services for 5 and 10 objectives are shown in Tables 7-9. Unsurprisingly, it can again be seen that the rankings have changed with respect to the individual rankings shown in Tables 3-6.
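The averaging and majority-vote ideas can be sketched on a toy ranking matrix (synthetic numbers, not the competition data; the precise ensemble definitions are given in [1,3]):

```python
import numpy as np

# rankings[q][a] = DSC ranking of algorithm a under quality indicator q
# (rows: e.g. IGD, HV, EI; columns: three hypothetical algorithms)
rankings = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 1.0, 3.0],
    [1.5, 1.5, 3.0],
])

# Average ensemble: simply average the rankings over quality indicators.
average = rankings.mean(axis=0)

# Hierarchical majority vote (simplified): count how often each algorithm
# attains the best ranking across the quality indicators.
wins = (rankings == rankings.min(axis=1, keepdims=True)).sum(axis=0)
print(average, wins)
```

In this toy case the first two algorithms are tied under both ensembles, while the third is clearly last; the data-driven ensemble would additionally weight the indicators by their entropy before aggregating.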
Now, the rankings no longer represent individual quality indicators, but a fusion of the information from all of them. All ensembles provide similar ranking results, but there are still some differences between them. So, the selection of the ensemble method can also have an influence and must be made with great care. If we are interested purely in average performance, the average ensemble should be selected; if we are interested in the algorithm that performs best most often, the hierarchical ensemble should be selected (this is also recommended for dynamic optimization); while if we care about which algorithm seems the most relevant in general, the data-driven ensemble method should be selected.

The algorithm rankings obtained on each benchmark problem carry statistical significance that can be used to compare the algorithms only on that specific problem (single-problem analysis). However, if we are interested in a more general conclusion, i.e., comparing the algorithms over the set of all benchmark problems (multiple-problem analysis), it is not enough to only look at the mean ranking of the algorithms over all benchmark problems. In such situations, we should analyze the data by applying an appropriate statistical test. For this purpose, an omnibus test must be performed, which is implemented in the DSCTool as the omnibus web service. The omnibus test can be performed on the results from any of the above tables. When using the DSCTool rank or ensemble web service, we receive a JSON result that consists of the rankings for all algorithms and all benchmark problems (shown in Tables 3-9). Additionally, it provides the information on whether parametric tests can be applied to our data and which statistical tests are relevant for it. In our case, the Friedman test was recommended as appropriate. To make a general conclusion, we continue by using it with a significance level of 0.05.
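For the multiple-problem analysis, the omnibus step can be sketched with SciPy's Friedman test on a toy example with synthetic per-problem rankings (the DSCTool omnibus service additionally checks whether a parametric test would be applicable):

```python
import numpy as np
from scipy import stats

# Rankings of three hypothetical algorithms on 15 benchmark problems:
# algorithm A is consistently ranked best, C consistently worst.
algo_a = np.full(15, 1.0)
algo_b = np.full(15, 2.0)
algo_c = np.full(15, 3.0)

# Friedman omnibus test over the problem set: a p-value below the
# significance level means the null hypothesis (no performance
# difference over the problem set) is rejected, so a post-hoc test
# should follow to locate the differences.
stat, p = stats.friedmanchisquare(algo_a, algo_b, algo_c)
print(p < 0.05)
```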
Instead of showing the omnibus results for all the obtained rankings (presented in the tables), only one scenario was selected to show how this can be done, and its results were compared to the outcome of the competition. We decided to perform the omnibus test on the results obtained with the EI quality indicator on the five-objective problems. The reason is that, here, the algorithm AGE-II outperformed CVEA3, which was shown to be the best-performing algorithm at the competition. After preparing the JSON input (accessible at http://cs.ijs.si/dl/dsctool/M05-ei-omnibus.json), the omnibus web service was executed. The obtained results indicate that the null hypothesis is rejected, since the calculated p-value of 1.09 × 10⁻¹⁰ is smaller than our previously set significance level (0.05). This means that there is statistical significance in the data, and a post-hoc test should be performed to identify where these significant differences come from. In our case, the algorithm with the lowest mean ranking (AGE-II) was selected as the control algorithm and compared to the other algorithms. To apply the post-hoc test, the algorithm mean rankings obtained from the omnibus test were taken, the number of algorithms and the number of benchmark functions were set (in our case 10 and 15, respectively), and the control algorithm was specified (in our case AGE-II). Since the post-hoc statistic depends on the omnibus statistical test, the same statistical test must be applied in the post-hoc step (in our case, the Friedman test). After creating the JSON input (accessible at http://cs.ijs.si/dl/dsctool/M05-ei-posthoc.json), the posthoc web service was executed. The results show that the algorithm AGE-II significantly outperforms every other algorithm. Therefore, we have shown that not only can the rankings change by applying a different quality indicator, but the differences between performances can be statistically significant.
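The post-hoc comparison against a control algorithm can be sketched with the standard z-statistic on Friedman mean ranks (a simplified sketch of the usual procedure, e.g. as described by Demšar; the multiple-comparison p-value adjustment such as Holm's method, which is normally applied on top, is omitted here):

```python
import numpy as np
from scipy import stats

def friedman_posthoc(mean_ranks, n_problems, control):
    """Raw two-sided p-values for comparing each algorithm against the
    control algorithm, given Friedman mean ranks over n_problems."""
    mean_ranks = np.asarray(mean_ranks, float)
    k = len(mean_ranks)
    # standard error of the difference of two mean ranks
    se = np.sqrt(k * (k + 1) / (6.0 * n_problems))
    z = (mean_ranks - mean_ranks[control]) / se
    return 2 * stats.norm.sf(np.abs(z))

# e.g. 3 hypothetical algorithms on 15 problems, control = algorithm 0
# (the one with the best mean rank)
p = friedman_posthoc([1.2, 2.4, 2.4], n_problems=15, control=0)
print(p)
```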
If we performed a similar test with the HV quality indicator on the five-objective problems, the results would be reversed: here, CVEA3 would be shown as the algorithm that significantly outperforms every other algorithm. However, if we looked at the results of any of the ensemble methods, we would see that there is no statistical significance between the algorithms AGE-II and CVEA3, which in the end would be the general conclusion (assuming we are not searching for the best algorithm according to some problem-specific quality indicator).
With our experiments, we have shown how important it is to perform a proper analysis, since we were able to change the outcomes of the study merely by selecting different quality indicators and/or statistical approaches.

Conclusions
Performing a proper statistical analysis is an important task that must be carried out with great care. This was clearly shown in Section 5, where we showed that the selection of the quality indicators and of the statistical approach used to analyze the data has a big influence on the end results of the comparison. For these reasons, the DSCTool can help users make a proper statistical analysis quickly and error-free. It guides users through all analysis steps and provides them with all the information needed to perform a proper statistical analysis and draw final conclusions from their studies. Some decisions must still be made by the user, such as selecting the significance level and choosing the relevant ensemble, but this requires much less knowledge than performing all of the statistics on one's own.

Author Contributions: Conceptualization, P.K. and T.E.; methodology, T.E. and P.K.; software, P.K. and T.E.; validation, P.K. and T.E.; formal analysis, T.E. and P.K.; investigation, P.K. and T.E.; writing-original draft preparation, P.K.; writing-review and editing, T.E. and P.K.; funding acquisition, T.E. and P.K. All authors have read and agreed to the published version of the manuscript.
Funding: This work was supported by the financial support from the Slovenian Research Agency (research core funding No. P2-0098 and project No. Z2-1867) and from the European Union's Horizon 2020 research and innovation program under grant agreement No. 692286.