RHOASo: An Early Stop Hyper-Parameter Optimization Algorithm

This work proposes a new algorithm for optimizing hyper-parameters of a machine learning algorithm, RHOASo, based on conditional optimization of concave asymptotic functions. A comparative analysis of the algorithm is presented, giving particular emphasis to two important properties: the capability of the algorithm to work efficiently with a small part of a dataset and to finish the tuning process automatically, that is, without the user having to specify the number of iterations that the algorithm must perform. Statistical analyses over 16 public benchmark datasets comparing the performance of seven hyper-parameter optimization algorithms with RHOASo were carried out. The efficiency of RHOASo shows statistically significant positive differences with respect to the other hyper-parameter optimization algorithms considered in the experiments. Furthermore, it is shown that, on average, the algorithm needs around 70% of the iterations needed by other algorithms to achieve competitive performance. The results show that the algorithm presents significant stability regarding the size of the used dataset partition.


Introduction
Tuning the hyper-parameter configuration of a machine learning (ML) algorithm is a recommended procedure to obtain a successful ML model for a given problem. Different ML algorithms have specific hyper-parameters whose configuration requires a deep understanding of both the model and the task. Since the hyper-parameter configuration greatly impacts the models' performance, the research in automatic hyper-parameter optimization (HPO) is focused on developing techniques that efficiently find optimal values for the hyper-parameters, maximizing accuracy while avoiding complex and expensive operations. However, this process remains a challenge because not all optimization methods are always suitable for a given problem.
Although there are several methods for tuning both continuous and discrete hyper-parameters, they do not perform equally for all ML algorithms, displaying different consumption of computational resources and stability. The process becomes computationally expensive if too many function evaluations of hyper-parameter values must be carried out to obtain a suitable accuracy. Since the size of the dataset in the HPO phase influences the dynamic complexity of the classifier but not its accuracy [1], another possible limitation is that an HPO algorithm may require a large dataset to work efficiently. Finally, most HPO algorithms are iterative, which suggests that stopping the algorithm when the expected improvement of testing new configurations is low can be a good option [2]. Nevertheless, the ML user does not have information about the rate of convergence and the loss function values. Therefore, the user usually tends to leave the default parameters (more than 50 iterations) or set a high number of iterations to ensure good performance [3]. This fact implies that the algorithm may perform more iterations than needed to obtain an adequate accuracy, with the consequent increased computational cost.
We analyze the performance of RHOASo when it is run together with the three ML algorithms mentioned above, and we compare it with other HPO algorithms. This is done in two different ways. On the one hand, we let the number of iterations of the HPO algorithms take their default values. On the other hand, we allow the HPO algorithms to run for the same number of iterations that RHOASo needs until it stops.
Additionally, we have included an appendix with the results concerning the first experiment in the case where the ML algorithms are decision tree (DT) and K-nearest neighbor (KNN). It is shown that the performance of RHOASo compared with the other HPO algorithms is substantially better. Due to the superiority of the performance of RHOASo with respect to the rest of the HPO algorithms when they are run with DT and KNN, we have carried out the complete analysis only with the RF, GB and MLP algorithms.
In order to facilitate the reading of the article, we include below a scheme describing the experimentation and validation phases that were carried out.

Related Work
A hyper-parameter of an ML algorithm is a hidden component that directly influences the algorithm's behavior. Tuning it allows the user to control the performance of the algorithm.

Problem Statement
Definition 1. Let X be a tuple of random variables and let Y be the space of labels. Let D_train ⊂ X × Y be an i.i.d. sample whose distribution function is P.
A machine learning algorithm A is a functional that maps a training sample D_train to a model h_{A,D_train} ∈ H. The model h_{A,D_train}(x) predicts the label of an (unseen) instance x, minimizing the expected loss function L(D_train, h_{A,D_train}).
This loss function measures the discrepancy between a hypothesis h ∈ H and an ideal predictor. The target space of functions of the algorithm, H, depends on specific hyper-parameters, λ = (λ_1, . . . , λ_n) ∈ Λ, that may take discrete or continuous values and have to be fixed before applying the algorithm. We use the notation A_λ to refer to the algorithm with the hyper-parameter configuration λ.
In this scenario, an independent data set D_test, with D := D_train ∪ D_test, serves to evaluate the loss function L(D_test, h_{A_λ,D_train}) provided by the algorithm A_λ(D_train). The hyper-parameters λ = (λ_1, . . . , λ_n) remain free in L(D_test, h_{A_λ,D_train}).
In the case of a classification ML problem, we can take the loss function L as the error rate, that is, one minus the cross-validation accuracy. In this situation, one can define the functional of Equation (2), Φ_{A,D}(λ) := mean_{D_test} L(D_test, h_{A_λ,D_train}). The hyper-parameter optimization (HPO) problem consists of trying to reach λ* := argmin_λ Φ_{A,D}(λ).
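To make the definition concrete, the following minimal Python sketch treats the HPO problem as a search over a finite grid. The cross-validated loss is replaced by a synthetic error surface; the names `error_rate`, `hpo_argmin`, and the grid bounds are illustrative assumptions, not from the paper:

```python
from itertools import product

def error_rate(lam):
    """Toy stand-in for mean_{D_test} L(D_test, h_{A_lam, D_train}):
    a synthetic error surface over two discrete hyper-parameters."""
    depth, n_trees = lam
    # Error decreases as model capacity grows, flattening out asymptotically.
    return 0.05 + 0.5 / depth + 0.3 / n_trees

def hpo_argmin(grid):
    """lambda* := argmin_lambda of the mean error rate over a finite grid."""
    return min(grid, key=error_rate)

grid = list(product(range(1, 11), range(1, 11)))
best = hpo_argmin(grid)
```

Brute-force enumeration like this is exactly the grid search baseline mentioned later; the HPO literature exists because real grids are too large to enumerate.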

Overview of the State-of-the-Art Methods
Since the hyper-parameter configuration has a significant effect on the performance of a ML model, the main goal in the HPO research is to find optimal values for the hyper-parameters that maximize the accuracy of the model while minimizing the costs and avoiding manual tuning. In the case where hyper-parameters are continuous, HPO algorithms usually work using gradient descent-based methods ( [4][5][6]) in which the search direction in the hyper-parameter space is determined by the gradient of a model selection criterion at any step.
The discrete case has several approaches that perform differently depending on the ML algorithm and the dataset. Bayesian HPO is a type of surrogate-based optimization ( [7]) that tunes the hyper-parameters by keeping the assumed prior distribution of the loss function updated, taking into account the new observations that are selected by the acquisition function. The construction of this surrogate model and the hyper-parameter selection criteria result in several types of sequential model-based optimization (SMBO). The main methods model the error distribution with a Gaussian process ( [8]) or tree-based algorithms, such as sequential model-based algorithm configuration (SMAC) or the Tree Parzen Estimators (TPE) method ( [9,10]). Another perspective is radial basis function optimization (RBFOpt), which proposes a deterministic surrogate model to approximate the error function of the hyper-parameters through dynamic coordinate search. These methods require fewer evaluations, improving on the associated costs of Gaussian process methods ( [11]). Regarding the selection function used to choose the next promising hyper-parameter configuration to test in surrogate-based optimization, the typical approach is to use the expected improvement ( [8]). There are other alternatives, such as predictive entropy search ( [12]). Other variants of SMBO can be found in [13,14], where different datasets and tasks are characterized by several measurements that allow predicting a ranking of several combinations of hyper-parameter values. Another important HPO approach is the decision-theoretic method, where the algorithm obtains the hyper-parameter setting by searching the hyper-parameter space directly, following some particular strategy. As examples, we have grid search, which uses brute force, and the simple and effective random search (RS), which tests randomly sampled configurations from the hyper-parameter space ( [15,16]).
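As a reference point for the decision-theoretic family, random search fits in a few lines. The objective below is a synthetic error function chosen only for illustration; the fixed budget `n_iter=50` mirrors the default iteration count discussed later in the paper:

```python
import random

def random_search(objective, space, n_iter=50, seed=0):
    """Random search: evaluate n_iter configurations sampled uniformly
    from a discrete hyper-parameter space and keep the best one found."""
    rng = random.Random(seed)
    best_lam, best_val = None, float("inf")
    for _ in range(n_iter):
        lam = tuple(rng.choice(axis) for axis in space)
        val = objective(lam)
        if val < best_val:
            best_lam, best_val = lam, val
    return best_lam, best_val

# Toy objective: error shrinks as both hyper-parameter values grow.
space = [range(1, 51), range(1, 51)]
toy = lambda lam: 1.0 / lam[0] + 1.0 / lam[1]
best_lam, best_val = random_search(toy, space, n_iter=50)
```

Note that the user must still fix `n_iter` in advance, which is precisely the meta-parameter issue the paper raises.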
Other optimization algorithms are applied to the problem of discrete hyper-parameter values selection. This is the case, for instance, of the evolutionary algorithms, such as the covariance matrix adaptation evolutionary (CMA-ES) method [17], the simplex Nelder-Mead (NM) method ( [18,19]) or the application of continuous techniques over the discrete case such as the particle swarm (PS) ( [20,21]).
Although there are several options, these methods provide different results and consumption of computational resources, and they do not perform equally well with all ML algorithms. Therefore, to choose an HPO method, we need to consider its costs, the size of the data required to run the optimization process effectively, and the human interaction needed. These issues give rise to several open research challenges, which we have summarized in Table 1. RHOASo is an HPO algorithm designed so that the potential user does not have to configure the end of the process. Currently, the termination of a general HPO algorithm can be carried out in several ways ( [3]): (1) an amount of runtime fixed by the user based on intuition; (2) a lower bound of the generalization error specified by the user; and (3) considering the HPO converged if no progress is identified. All of these procedures can lead to over-optimistic bounds or excessive runtime that increases the computational cost. In this scenario, RHOASo is able to stop automatically, without losing accuracy and with minimal user intervention.
Additionally, many of the state-of-the-art HPO algorithms have, in turn, parameters that must be set up before running them. For instance, when a user wants to tune an (unbounded) integer-valued hyper-parameter of a given ML algorithm, the HPO algorithm requires the user to pre-configure a grid over which it is to be run. In many cases, the larger the grid, the higher the execution cost of the HPO algorithm. A natural way to proceed in these cases is to accelerate the hyper-parameter tuning process using early-stopping techniques ( [2,22-24]). However, these algorithms still have other parameters that must be set up. Therefore, in some sense, HPO algorithms move the hyper-parameter tuning problem from ML algorithms to themselves, which increases the complexity and cost of the whole process. Table 2 below shows the parameters on which the HPO algorithms used in this work depend. Thus, the natural question is how to tune the hyper-parameters of HPO algorithms without increasing the complexity. Since using HPO algorithms over themselves does not solve the problem, it is natural to ask for HPO algorithms depending on as few hyper-parameters as possible while achieving good performance, compared with state-of-the-art HPO algorithms.
Our aim is to present a novel HPO algorithm with only one parameter to be tuned and to analyze its performance, compared with other state-of-the-art HPO algorithms.

The Proposed Algorithm: RHOASo
RHOASo is an approach to the HPO problem whose underlying idea comes from the reversible gradient-based HPO method proposed for the continuous case ( [5]).
Open source code for RHOASo is hosted on GitHub (https://github.com/amunc/RHOASo, accessed on: 25 July 2021), and it is available under the GPL license (version 3). Users do not need to install any software other than the Python language. Additionally, RHOASo is included in an intelligent ML system, RADSSo (RIASC automated decision support software), and it has been used in several research works ( [29,30]).

The Setup
Recall from Section 2.1 that the space of functions in which a learning algorithm takes values is assumed to depend on certain hyper-parameters λ = (λ_1, . . . , λ_n), and this space of functions is denoted by H_λ. We make the following assumptions:
1. The hyper-parameters λ_i are discrete.
2. If λ ≤ λ′ componentwise, then H_λ ⊆ H_{λ′}; that is, increasing a hyper-parameter enlarges the space of models.
Typical examples of hyper-parameters satisfying such assumptions are the maximum depth in any tree-based machine learning model, the number of trees if the output of the model is a weighted average of the outputs of all the trees, or the number of neurons in the hidden layers of a multilayer perceptron (the weights of the inputs of a given neuron may be zero).
Let Φ_{A,D} be the functional given in Equation (2) for a given machine learning model defined by a dataset D and a space of models H_λ. If we plot the functional Φ_{A,D} considering, for example, the random forest model with hyper-parameters maximum depth (x-axis) and number of trees (y-axis), we obtain figures like those shown in Figure 1 (the plots are obtained from two different datasets).
From the expression Φ_{A,D}(λ) = mean_{D_test} L(D_test, h_{A_λ,D_train}), it follows that if we let both the size of the dataset and the number of iterations in the cross-validation go to infinity, the surface shown in Figure 1 becomes smoother and takes the form shown in Figure 2, which is a concave surface with an asymptote in the plane z = z_0 ≤ 1.
At this point, one can ask for an algorithm to find a value of λ = (λ 1 , λ 2 ) at which Φ A,D attains a sufficiently high value while keeping λ 1 and λ 2 as small as possible.

Motivation of the Algorithm
In order to motivate the algorithm, consider the logistic function f(x) = 1/(1 + e^{-x}) restricted to R_{>0}. This is a concave function with an asymptote at y = 1; thus, it has no maximum. Although the maximization problem for this function makes no sense, we can ask for the point x_0 ∈ R with the highest value f(x_0) subject to the condition that making x_0 smaller produces an important decrease in f. One way to formalize this question is by considering the maximization problem of the function Stb(x) = f(x) f′(x) = e^{-x} f(x)^3. In a certain sense, maximizing Stb(x) consists of choosing a point x_0 with a sufficiently large image f(x_0) but whose slope at that point is not too low. Note that Stb′(x) = e^{-x} f(x)^3 (−1 + 3 e^{-x} f(x)), and therefore Stb′(x) = 0 has only one solution, x = ln(2), which is a maximum.
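The location of the maximum of Stb can be checked numerically. The sketch below evaluates Stb(x) = f(x) f′(x) on a fine grid (the grid bounds and step size are arbitrary choices for illustration) and confirms that the maximizer sits at x = ln(2):

```python
import math

def f(x):
    """Logistic function, restricted here to x > 0."""
    return 1.0 / (1.0 + math.exp(-x))

def stb(x):
    """Stabilizer Stb(x) = f(x) * f'(x); for the logistic, f'(x) = e^{-x} f(x)^2."""
    return f(x) * math.exp(-x) * f(x) ** 2

# Locate the maximizer on a fine grid over (0, 3): theory predicts x = ln 2.
xs = [i / 10000 for i in range(1, 30000)]
x_star = max(xs, key=stb)
```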
Consider now the function Stb(x, n) = x^n f(x) f′(x), n being a natural number. Then, since n ≥ 1 > ln(2), the equation Stb′(x, n) = 0 has only one solution x_0, which is greater than ln(2) and becomes larger as n increases. Thus, we can control how small the slope is at the solution x_0 by varying n. In order to explain how RHOASo works, and to link directly with the form in which it is presented below, let us denote the function f by Φ_{A,D} and restrict its domain of definition to the set of natural numbers. The variable is now denoted by λ. Then, instead of the derivative, we may consider the discrete difference ΔΦ_{A,D}(λ) := Φ_{A,D}(λ + h) − Φ_{A,D}(λ), where h is a natural number. In order to simplify the notation, we set h = 1. Thus, we may consider the optimization problem max_λ Stb(λ, n), with Stb(λ, n) := λ^n Φ_{A,D}(λ) ΔΦ_{A,D}(λ). Now, we can give a simple iterative algorithm to find a value of λ close to that at which Φ_{A,D} attains a sufficiently high value while keeping the magnitude of such a coordinate as low as possible; at iteration i: if Stb(λ_i + 1) > Stb(λ_i), then λ_{i+1} := λ_i + 1; stop otherwise. This is just the most basic algorithm to solve the optimization problem max_λ {Stb(λ, n)}. Observe that convergence is always ensured because of the properties of the function Stb(λ).
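The one-dimensional iterative rule just described can be run on a toy surface. Here Φ is a synthetic logistic-shaped accuracy curve, an assumption for illustration only, not a real cross-validated accuracy:

```python
import math

def phi(lam):
    """Toy concave, asymptotic accuracy surface in one discrete hyper-parameter."""
    return 1.0 / (1.0 + math.exp(-0.5 * lam))

def stb(lam, n=1, h=1):
    """Discrete stabilizer: lam^n * phi(lam) * (phi(lam + h) - phi(lam))."""
    return lam ** n * phi(lam) * (phi(lam + h) - phi(lam))

def early_stop_1d(n=1):
    """Increase lam while the stabilizer keeps improving; stop otherwise."""
    lam = 1
    while stb(lam + 1, n) > stb(lam, n):
        lam += 1
    return lam

lam_star = early_stop_1d()
```

With this toy Φ the rule stops at λ = 3, a small value where the accuracy is already close to its asymptote; larger n would push the stopping point to the right.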
Figure 3 shows how the stabilizer function, Stb, behaves in two particular cases.

The Algorithm
There are two important features we want the algorithm to have. On the one hand, we want to avoid meta-parameters; that is, we want the algorithm not to depend (strongly) on extra parameters. In state-of-the-art HPO algorithms, the user has to set as input the exact number of iterations that the algorithm must perform. On the other hand, we want the algorithm to give good results when it is run with only a small part of the dataset as input, rather than the whole dataset. The proposed HPO algorithm exploits the consequences of assumptions 1 and 2 to reach this objective in a simple way.
Since the hyper-parameters we will work with are discrete, we may assume that the hyper-parameter space is Γ = N^n. Suppose that we are at the point λ ∈ Γ = N^n of the hyper-parameter space. The decision about the next point to which the algorithm must jump is based on two basic rules, followed by a final step:

1.
Fix a natural number h. Let Shifts = {0, h}^n \ {(0, . . . , 0)}. We look at the points λ + Shifts. These are the next points in each possible direction of the space Γ. Since H_λ ⊂ H_{λ+η} for each η ∈ Shifts, it is likely that each λ + η will give a better result than λ.

2.
Among the points λ + Shifts, the algorithm jumps to the one at which the stabilizer Stb attains its maximum value, provided that this value improves on the stabilizer at λ; otherwise, the algorithm stops.

3.
Once the algorithm stops at some point λ* ∈ Ω, there is a final step at which the point λ ∈ λ* + Shifts with maximum Φ_{A,D} is found and given as the final output.
The pseudo-code of the algorithm is included in Algorithms 1-4.
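The rules above can be sketched as a short, runnable loop. This is a hedged illustration only: the toy surface `phi` and the exact stabilizer form are assumptions made here for a two-dimensional example, while the actual procedure is the one given in Algorithms 1-4:

```python
import math
from itertools import product

def phi(lam):
    """Toy accuracy surface: concave, asymptotic, and increasing in the sum
    of the two discrete hyper-parameters (a synthetic stand-in for the
    cross-validated accuracy functional)."""
    return 1.0 / (1.0 + math.exp(-0.3 * sum(lam)))

def stb(lam, h=1):
    """Stabilizer at lam: product of the coordinates times phi times its increment."""
    bumped = tuple(x + h for x in lam)
    prod = 1
    for x in lam:
        prod *= x
    return prod * phi(lam) * (phi(bumped) - phi(lam))

def rhoaso_sketch(h=1, max_steps=100):
    lam = (1, 1)
    shifts = [s for s in product((0, h), repeat=2) if s != (0, 0)]
    for _ in range(max_steps):
        # Rules 1-2: look at the next point in each direction and jump to the
        # one maximizing the stabilizer, if it improves on the current one.
        cands = [tuple(a + b for a, b in zip(lam, s)) for s in shifts]
        best = max(cands, key=stb)
        if stb(best) > stb(lam):
            lam = best
        else:
            break
    # Rule 3 (final step): among lam and its neighbors, output the point
    # with maximum phi.
    cands = [tuple(a + b for a, b in zip(lam, s)) for s in shifts] + [lam]
    return max(cands, key=phi)

lam_star = rhoaso_sketch()
```

On this toy surface the loop stops once the stabilizer starts decreasing and then returns the best neighbor, so small hyper-parameter values are reached without a user-supplied iteration budget.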

Materials and Methods
Some experiments were carried out in order to answer the research questions formulated in the introduction:

1.
RQ1: Given a dataset, how good is the performance (accuracy, time complexity, sensitivity, and specificity) of an ML algorithm when RHOASo is applied?
The analyses aim to measure the quality of the proposed algorithm and to decide whether there are statistically meaningful differences between the performance of the selected HPO methods and RHOASo.

ML and HPO Algorithms
We have evaluated the efficiency of three well-known ML algorithms:

1.
RF is an ensemble classifier consisting of a set of decision trees. Each tree is constructed by applying bootstrap re-sampling (bagging) to the training set, which extracts a subset of samples for training each tree. Therefore, the trees will have a weak correlation and give independent results. In the case of RF, we have two main hyper-parameters: the number of decision trees to be used and the maximum depth for each of them ( [31]).

2.
GB is another ensemble technique in which the predictors are built sequentially, each one learning from the previous predictor's mistakes to improve the subsequent learner. It usually takes fewer iterations to reach close-to-actual predictions, but the stopping criteria have to be chosen carefully. This technique reduces bias and variance but can induce overfitting if too much importance is assigned to the previous errors. We tune two discrete hyper-parameters: the number of predictors and their maximum depth ( [32,33]).

3.
A MLP is a graph-type model that is organized in ordered layers (input layer, output layer, and hidden layers). Each layer consists of a set of nodes with no connections between them, so the connections occur between nodes belonging to different and contiguous layers. In this study, we set two hidden layers, and we tune the number of neurons in each of these hidden layers.
In Table 3, we give a summary of the ML algorithms we have used together with the hyper-parameters we have tuned. The search space for all hyper-parameters is the integer interval [1,50]. All hyper-parameters not being tuned are set to their default values as per the scikit-learn implementation ( [34]). We have used 10-fold cross-validation to assess the performance of all ML models combined with the HPO algorithms.
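The search space of Table 3 can be written down explicitly. The sketch below is a configuration fragment; the hyper-parameter names follow the scikit-learn convention (`max_depth`, `n_estimators`, `hidden_layer_sizes`), which is an assumption about the mapping from the paper's names, and all other parameters are left at their defaults:

```python
# Discrete search space [1, 50] for each tuned hyper-parameter (Table 3).
SEARCH_RANGE = range(1, 51)

search_space = {
    "RF":  {"max_depth": SEARCH_RANGE, "n_estimators": SEARCH_RANGE},
    "GB":  {"max_depth": SEARCH_RANGE, "n_estimators": SEARCH_RANGE},
    # Two hidden layers; the number of neurons in each one is tuned.
    "MLP": {"hidden_layer_sizes": [(a, b) for a in SEARCH_RANGE
                                           for b in SEARCH_RANGE]},
}
```

Even in this two-hyper-parameter setting each model has 2500 candidate configurations, which is why exhaustive grid search is not among the compared methods.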
On the other hand, in Table 4 a summary of the HPO algorithms selected for this study is given.

Datasets
The datasets selected for the experiments are described in Table 5. The choice was motivated by different reasons: availability on public servers to verify the results, the number of instances and classes, and the type of features. There are 16 public benchmark datasets with different numbers of variables (8-300), rows (4601-284,807), and classes (2-10). Since the size of the dataset in the HPO phase influences the performance of the classifier ( [1]), the performances of the algorithms are analyzed with four different-sized partitions of each dataset (P = {P_1 = 8.3%, P_2 = 16.6%, P_3 = 50%, P_4 = 100%, i.e., the full dataset D_i}). These proportions are chosen because, in this way, the numbers of instances are in different orders of magnitude. In addition, each dataset is divided into train data (80%) and validation data (20%), D_train_i ∪ D_valid_i = D_i, and the partitioning scheme above is applied to obtain train and validation subsets with the corresponding proportions P_j(D_i).
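The partitioning scheme can be sketched as follows. This is a hedged illustration: the paper does not specify the exact sampling procedure, so a seeded shuffle followed by a simple truncation is assumed here:

```python
import random

# Partition proportions from the text; P4 is the full dataset.
PROPORTIONS = {"P1": 0.083, "P2": 0.166, "P3": 0.5, "P4": 1.0}

def split_and_partition(dataset, proportion, seed=0):
    """80/20 train/validation split of D_i, then keep the given proportion
    of each part so that train and validation shrink together."""
    rng = random.Random(seed)
    data = list(dataset)
    rng.shuffle(data)
    cut = int(0.8 * len(data))
    train, valid = data[:cut], data[cut:]
    n_train = max(1, round(proportion * len(train)))
    n_valid = max(1, round(proportion * len(valid)))
    return train[:n_train], valid[:n_valid]

train_p1, valid_p1 = split_and_partition(range(10000), PROPORTIONS["P1"])
```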
The feature transformations needed to obtain datasets that can be input directly to each model are implemented manually in Python.

Construction of the Response Variables
In order to build response variables to measure the performance of the HPO algorithms, we apply them to each ML model H k over each partition, P j (D i ), obtaining a hyper-parameter configuration λ k i,j for H k . Then, the learning algorithm with the obtained hyper-parameter configuration is run over D train i to construct a classifier that is validated over D valid i . At this step, we collect the obtained accuracy. This scenario is repeated a number of times (trials), giving rise to two different experiments.
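The trial loop just described can be sketched generically. The callables `run_hpo` and `train_and_validate` are hypothetical stand-ins for steps the paper performs with the real HPO and ML algorithms, and `n_trials=50` corresponds to Experiment 1:

```python
def run_trials(run_hpo, train_and_validate, n_trials=50):
    """Repeat: tune hyper-parameters on a partition P_j(D_i), train on
    D_train_i with the found configuration, validate on D_valid_i; collect
    one accuracy per trial (one component of the response array Acc)."""
    acc = []
    for trial in range(n_trials):
        lam = run_hpo(trial)                 # hyper-parameter configuration
        acc.append(train_and_validate(lam))  # validation accuracy
    return acc

# Toy stand-ins so the loop is runnable end to end.
acc = run_trials(run_hpo=lambda t: (t % 5 + 1, 10),
                 train_and_validate=lambda lam: 0.8 + 0.01 * lam[0])
```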

1.
Experiment 1: the experimentation is repeated 50 times (trials) for all HPOs, except for RHOASo, which automatically stops when it considers that it has obtained an optimal hyper-parameter configuration. 2.
Experiment 2: the experimentation is repeated for all HPOs as many times (trials) as RHOASo has carried out until stopping.
The time complexity of the whole process is stored as well. The time complexity, measured in seconds, is the sum of the time needed by the HPO algorithm to find the optimal hyper-parameter configuration and the time that the ML algorithm uses for training. Then, we create two response variables. We denote by Acc^k_{i,j} the (number of trials) × 1 array whose m-th component is the accuracy of the predictive model tested on D_valid_i that was trained over D_train_i with the hyper-parameters (λ^k_{i,j})_m at the m-th trial. TC^k_{i,j} is the analogous notation for time complexity.
We have also collected the sensitivity and the specificity at each iteration for RHOASo to measure its performance more accurately.
Additionally, we have collected the MCC (Matthews correlation coefficient), which is defined as MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. This coefficient works as a substitute metric for accuracy on unbalanced datasets [55,56]. Since some datasets contain a certain degree of imbalance, we present our results with both indicators, accuracy and MCC.
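The MCC formula translates directly into code; the convention of returning 0 when the denominator vanishes is an assumption made here for degenerate confusion matrices:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient:
    (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# A perfect binary classifier attains MCC = 1; chance-level predictions give 0.
perfect = mcc(tp=50, tn=50, fp=0, fn=0)
```

Unlike accuracy, a majority-class classifier on a 99/1 split scores near 0 here rather than 0.99, which is why the paper reports MCC alongside accuracy.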

Statistical Analyses
Our main objective is to analyze the quality of the RHOASo algorithm and compare it with other HPO algorithms. In order to analyze whether there are meaningful statistical differences among the obtained results by the HPO algorithms and RHOASo, we perform the following statistical analysis:

1.
We have conducted descriptive and exploratory analyses of Acc^RHOASo for all datasets and for the three selected ML algorithms.
2.
We have applied Wilcoxon's signed-rank test to compare, pairwise, the response variables obtained by RHOASo with those obtained by each of the other HPO algorithms. The choice of Wilcoxon's test is motivated by the fact that the response variables we compare are obtained by applying the ML algorithms over the same dataset but with different settings (λ^k_{i,j}). We obtain the results at a significance level of α = 0.05.

4.
Once we apply the inference described above, we obtain the p-values of 7 comparisons along 16 datasets with 4 partitions each, providing a total of 448 decisions on statistical difference for each ML algorithm and for each response variable, thus obtaining 2688 p-values. From these results, we have computed how many times we obtain a positive difference (validity(RHOASo) > validity(H^k)), a negative difference (validity(RHOASo) < validity(H^k)) or equality (validity(RHOASo) = validity(H^k)); see Table 6. Table 6. Conditions of validity. The symbols =, >, < denote statistically meaningful equality and difference, and Me denotes the median.
5.
Since the blue cells may be understood as being either a positive or a negative difference, depending on the improvement that we obtain, we have reclassified the results, creating a new table that corrects these cases by the rule described in Equation (5).
6.
We have completed the analysis by computing the rate of each type of validity (red, yellow and green cells) as R_V := n_V / N, where n_V is the number of comparisons of validity type V and N denotes the total number of possible comparisons. Since we have performed the computations for each ML algorithm, we have N = 448.
7.
Finally, we compute R_V per partition and per dataset to analyze the consistency of the results.
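The rate computation per validity class is a simple normalization by the total number of comparisons; the counts below are illustrative numbers, not results from the paper:

```python
def validity_rates(counts, n_total=448):
    """Rate of each validity class (e.g., positive / equal / negative
    difference) as a fraction of all N comparisons per ML algorithm."""
    return {label: count / n_total for label, count in counts.items()}

# Illustrative counts only; they must sum to N = 448 per ML algorithm.
rates = validity_rates({"positive": 320, "equal": 60, "negative": 68})
```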
Note that, to study the cases of unbalanced datasets, we have carried out the analyses described above replacing accuracy with the MCC.

Technical Details
The analyses are carried out on a high-performance computing node, an HP ProLiant SL270s Gen8 SE with two Intel Xeon E5-2670 v2 @ 2.50 GHz processors (10 cores each), 128 GB of RAM, and one 1 TB hard disk. The analysis script is implemented in Python.

Results and Discussion
This section is organized according to the research questions we have formulated in the introduction. Since the behavior of RHOASo is similar in both Experiments 1 and 2, in this section, we detail the performance of RHOASo in Experiment 1, and we include a summary of the results for Experiment 2.

Performance in Experiment 1
We can see in Figure 4 the median of the accuracy when RHOASo is applied. The median for RF is 0.92, for GB it is 0.9058, and for MLP it is 0.8182; in all of these cases, it is greater than 0.80. We can observe that the accuracy achieved by RHOASo presents great stability with respect to the partitions of the dataset, except for dataset D 6 with model RF and dataset D 16 with model GB. In the case of D 6, the variation appears when we change from P 3 to P 4. This dataset is the largest one in the study, with the highest rate of unbalanced data, so the most appropriate metric is the MCC. As is discussed later, the MCC remains stable for D 6 in all the partitions. The case of D 16 is more involved. The most frequent hyper-parameter configurations that RHOASo computes for each partition are max. depth: 5, number of trees: 9 for P 1 (14 times out of 50); max. depth: 9, number of trees: 3 for P 2 (50 times out of 50); max. depth: 3, number of trees: 9 for P 3 (31 times out of 50); and max. depth: 9, number of trees: 3 for P 4 (50 times out of 50). Since the number of features in D 16 is 12, there are few instances, and the dataset is unbalanced, the most probable explanation is that the model is overfitting the training data. This behavior also appears in the rest of the metrics with the combination of D 16 and GB. The stability is not as evident when the dataset changes, although the general trend is maintained across the ML models. For instance, the models obtained with D 10 provide the worst accuracy for the three ML algorithms. Although it may seem that the accuracy is not very high, the rest of the HPO algorithms do not achieve better results, as is outlined below. For this reason, the fit achieved by RHOASo is considered to be sufficient.
We can see in Figure 5 the median of the MCC (∈ [−1, 1]) when RHOASo is applied. The median for RF is 0.51, for GB it is 0.40, and for MLP it is 0.32. The MCC takes class imbalance into account, so the results worsen with respect to accuracy, particularly in datasets 3 and 12, which are highly unbalanced: the minority classes contain less than 1% of the total instances. Apart from the unbalanced datasets, the trends are similar to those presented when evaluating the accuracy.
As far as the time complexity is concerned, the median results are included in Figure 6. The median value for RF is 1.3028 s, and for GB it is 4.2567 s. The MLP stands out for its high computational cost, with 259.8138 s. Regarding the stability, as expected, the larger the partition size that is used, the larger the time complexity, independently of the ML algorithm used together with RHOASo. Nevertheless, this behavior is different for each ML algorithm. It is worth noting that in the case of RF, the increase in time complexity as the size of the partitions increases is much smoother for most of the datasets, compared to GB and MLP.
Sensitivity and specificity are plotted in Figures 7 and 8. The results are stable across partitions and datasets, with similar values even when using different models, except for D 16 with GB. The median sensitivity is 0.9 for GB, 0.91 for MLP, and 0.88 for RF. In contrast, specificity has a median of 0.7 for GB, 0.69 for MLP, and 0.733 for RF. The lower specificity could be caused by the imbalance between classes in specific datasets (see Table 5), which causes the models to be biased toward the majority class. However, D 11 has both low specificity and low sensitivity. Overall, the trends are similar to those found in the evaluation of accuracy.

Performance in Experiment 2
Since the behavior of RHOASo in both experiments is the same, we include in Table 7 a summary of the medians of all metrics (without taking partitions into account) that RHOASo obtained in Experiment 2. The number of iterations that RHOASo needed until stopping in Experiment 1 is included in Figure 9; for Experiment 2, this information can be observed in Table 7. In Experiment 1, we can see the median of the number of iterations needed by RHOASo for each ML algorithm, each dataset and each partition. As a general result, the median number of iterations per partition (computed over all datasets) is 35 for RF, 33.5 for GB, and 34 for MLP. Taking into account that the number of iterations given as input (by default) to the rest of the HPO algorithms is 50, this implies that, on average, RHOASo needs approximately 70% of the iterations required by the other algorithms. Additionally, it stops the process by itself. As a consequence, RHOASo requires less time to obtain a good enough accuracy, which makes it more competitive than other algorithms. This is most significant in the case of MLP, where each iteration is highly resource consuming. There is no clear trend relating the partition size and the number of iterations, especially in the case of GB, where there is greater variability in the number of iterations. It could be expected that a greater amount of data would contribute to faster convergence, but this is not the case. Therefore, it is likely that the functional Φ_{A,D} is not as concave as would be desirable to perform an effective early stopping.
By the very design of Experiment 2, we have not compared whether RHOASo is faster or slower than the other HPO algorithms.

Research Question 3: Are There Statistically Meaningful Differences between the Performance of RHOASo and the Other HPO Algorithms?
We recall that we have carried out two experiments: 1.
Experiment 1: the experimentation is repeated 50 times (trials) for all HPOs except for RHOASo, which automatically stops when it considers that it has obtained an optimal hyper-parameter configuration.

2.
Experiment 2: the experimentation is repeated for all HPOs as many times (trials) as RHOASo has carried out until stopping.

Experiment 1
In Figures 10 and 11, the performances (accuracies and time complexities) that are achieved by the HPO algorithms over each dataset are shown.
However, if we want to compare whether RHOASo obtains any gain against the other HPO algorithms, we need to carry out more detailed analyses. This is the study of the validity of RHOASo.
The rates of validity obtained by RHOASo, compared to the rest of the HPO algorithms, are included in Table 8. Note that these computations are carried out with the accuracies and time complexities by the analyses that are explained in Section 4.4. Figure 10. Accuracy achieved by each HPO algorithm (one line per algorithm) over each dataset (x-axis) for (a) RF, (b) GB, and (c) MLP. Figure 11. Time complexity, in seconds, achieved by each HPO algorithm (one line per algorithm) over each dataset (x-axis) for (a) RF, (b) GB, and (c) MLP.
On average, the class corresponding to positive statistically significant differences (green class) exceeds 50%. This can be considered a good result, since RHOASo achieves better results than the rest of the algorithms in more than half of the cases analyzed. However, there is still a high rate in the blue class. Once we have transformed the blue class (see Section 4.4), we can analyze whether RHOASo is more effective than the rest of the HPO algorithms. The results are included in Table 9, which shows that, on average, the class corresponding to positive statistically significant differences exceeds 70%. After confirming that RHOASo is more efficient than the rest of the HPO algorithms in about 70% of cases, the question that arises is whether there is a pattern in the remaining 30% of cases in which it does not succeed. For example, RHOASo might fail for datasets with a certain dimensionality, or for a specific ML algorithm. Another possibility is that RHOASo always loses against the same HPO algorithm. For this reason, we study the consistency of the previous results in depth.
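The precise statistical tests are those specified in Section 4.4. As a self-contained sketch of how a "green class" case can be detected, the following one-sided permutation test (with hypothetical accuracy samples, not taken from the experiments) checks whether one algorithm's mean accuracy is significantly higher than another's:

```python
import random
from statistics import mean

def permutation_test(a, b, n_perm=5000, seed=0):
    """Two-sample permutation test on the difference of mean scores.
    Returns the p-value for the one-sided hypothesis mean(a) > mean(b)."""
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling under the null hypothesis
        if mean(pooled[:len(a)]) - mean(pooled[len(a):]) >= observed:
            hits += 1
    return hits / n_perm

# Hypothetical accuracy samples of two HPO algorithms over repeated trials.
acc_rhoaso = [0.91, 0.92, 0.90, 0.93, 0.92, 0.91, 0.94, 0.92]
acc_other  = [0.88, 0.89, 0.87, 0.90, 0.88, 0.89, 0.88, 0.87]
p = permutation_test(acc_rhoaso, acc_other)
print(p < 0.05)  # True: a positive statistically significant difference
```

A significant positive difference such as this one would be counted in the green class, while a non-significant difference would fall outside it.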
Since we have dealt with unbalanced datasets, such as D1, D3, D4, D6, or D12, we have repeated the analyses replacing accuracy with MCC so as to avoid over-optimistic scores. Figure 12 shows the MCCs achieved by the HPO algorithms over each dataset.
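MCC is computed from the binary confusion matrix as MCC = (TP·TN − FP·FN)/√((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The following sketch (toy data, not one of the benchmark datasets) shows why accuracy is over-optimistic on unbalanced data while MCC is not:

```python
from math import sqrt

def mcc(y_true, y_pred):
    """Matthews correlation coefficient from the binary confusion matrix."""
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Unbalanced toy set: 18 negatives, 2 positives. A classifier that always
# predicts the majority class looks good on accuracy but not on MCC.
y_true = [0] * 18 + [1] * 2
y_pred = [0] * 20
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, mcc(y_true, y_pred))  # 0.9 0.0
```

The 0.9 accuracy is an artifact of the class imbalance, whereas the MCC of 0.0 correctly reports that the classifier is uninformative; this is the rationale for repeating the analyses with MCC.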
The rates of validity obtained by RHOASo compared to the rest of the HPO algorithms are included in Table 10. Note that these computations are carried out with the MCCs and time complexities via the analyses explained in Section 4.4. We can observe that RHOASo maintains its rate of gain, winning in more than 70% of the cases. In Section 5.4, we study the consistency of these results, as well as the situations in which RHOASo does not obtain a gain.

Experiment 2
Figures 13 and 14 show the performances (accuracies and time complexities) achieved by the HPO algorithms over each dataset.
Figure 13. Experiment 2: accuracy in (a) RF, (b) GB, and (c) MLP. Each panel plots the accuracy obtained (Y axis) over each dataset (X axis), with one line per HPO algorithm.
The rates of validity obtained by RHOASo compared to the rest of the HPO algorithms are included in Table 11. Note that these computations are carried out with the accuracies and time complexities via the analyses explained in Section 4.4 for Experiment 2. In this experiment, RHOASo loses some of its advantage over the other HPO algorithms, mainly because the improvement in execution time is smaller: the number of iterations of the other algorithms is fixed to be equal to that of RHOASo. The average gain is 53.86%, lower than the 71.96% obtained in Experiment 1. Nevertheless, RHOASo performs better for GB, has a slight advantage for RF, and is outperformed for MLP. This may be related to the search strategy of RHOASo: for MLP, which performs worse than the other models across all datasets, configurations with a larger number of neurons tend to perform better, whereas RHOASo favors configurations with low-magnitude hyper-parameter values. Therefore, the performance when tuning MLP can be expected to be less satisfactory.
For the unbalanced datasets, we have repeated the analyses replacing accuracy with MCC. Figure 15 shows the MCCs achieved by the HPO algorithms over each dataset.
The rates of validity obtained by RHOASo compared to the rest of the HPO algorithms are included in Table 12. Note that these computations are carried out with the MCCs and time complexities via the analyses explained in Section 4.4. When using MCC as the reference metric, the results follow trends similar to those of accuracy. RHOASo gains a slight advantage for RF and incurs a slight loss for MLP, but there are no significant differences.

Research Question 4: Are the above Results Consistent?
In this section, we analyze whether RHOASo achieves a significant performance improvement consistently across datasets and partitions.

Experiment 1
The rates of validity for each partition, computed with the accuracy and time complexity, are included in Figure 16. The consistency of the green class is clear when we discriminate by partitions. That is to say, if we only consider the partitions, RHOASo always outperforms the rest of the HPO algorithms. This may be due to the early stop of RHOASo, which consumes less time while achieving good accuracy. As we can see in Figure 17, the above conclusion is not so general when we discriminate by datasets, except in the case of GB. Nonetheless, the results concerning RF and MLP are not far from being consistent (RF being closer than MLP), and a deeper analysis should be performed in this regard. In the case of MLP, there is a clear trend: the datasets on which RHOASo loses are D1, D2, D3, and D4, that is, datasets with a low number of instances (<10,000) but a high number of features (>50). This failure on high-dimensional, small-sized datasets may be due to the early-stop characteristic of the algorithm. Datasets with high dimensionality and few instances tend to increase the variance of the results, which contributes to creating an irregular surface in Φ_{A,D}, possibly trapping RHOASo in a local maximum. A possible solution could be to substitute the function that maps elements of the hyper-parameter space to the performance of the trained ML model on a validation dataset with an approximate probabilistic model of that function. This suggests that combining the underlying ideas of Bayesian hyper-parameter optimization algorithms with those presented in this paper could yield an early-stop algorithm that works efficiently in high dimensions.
In the case of RF, the gain of RHOASo is not enough for datasets D5, D9, and D14. Unlike in the case of MLP, these datasets have no clear commonalities, so we can only hypothesize. We believe that the problem is the same as in the case of MLP: high variance in the results diminishes the effectiveness of RHOASo. However, in this case, the variance could be ascribed to the inherent randomness in the training of RF or in the cross-validation sampling.
As we can see in Figure 18, the behavior of the rates of validity for each partition remains consistent when they are computed with the MCC and time complexity. As we can see in Figure 19, the same conclusions as in the case of accuracy are obtained when we discriminate by datasets; again, the results concerning RF and MLP are not far from being consistent (RF being closer than MLP) and deserve deeper analysis. We note that, in some cases, the green class increases; this has not occurred for the red class on the unbalanced datasets.

Experiment 2
The rates of validity for each partition, computed with the accuracy and time complexity, are included in Figure 20. In this experiment, RHOASo is more inconsistent, since it loses the execution-time advantage that it had over the other HPO algorithms. RHOASo has a consistent advantage for all partition sizes in GB, but for RF it only outperforms the other algorithms for partitions 3 and 4, and it is inferior for MLP in all partitions, although the results improve as the partition size increases. This trend is also present for RF and GB. The reasons are probably the same as those discussed in Section 5.4.1: the variance in the results traps RHOASo in local maxima. As we can see in Figure 21, the validity per dataset is negatively affected. For RF, there is a general decrease, with only datasets 10, 12, and 15 showing a clear advantage for RHOASo. For GB, RHOASo maintains better results than the other algorithms for all datasets except 5 and 6; however, the consistency is lower than in Experiment 1. Finally, for MLP, the validity of RHOASo is the lowest among all models, being consistently outperformed on five datasets: 1, 2, 3, 4, and 13.
The datasets with the worst results for each model have no clear commonalities, so it is possible that the neutralization of RHOASo's execution-time advantage is a significant factor in the deterioration of the results. In Figure 22, we show the results of the above analysis with the MCC and the time complexity. The results are very similar to those achieved with accuracy; the main difference is that the validity for RF improves for partitions 1 and 2, increasing the consistency of the results. In Figure 23, we show the rate of validity per dataset, using the MCC as the reference metric. As in Experiment 1, the trends are mostly the same as those observed when evaluating the accuracy.

Conclusions
ML provides several powerful tools for data processing that find applications in different fields. Most existing models have several hyper-parameters that need to be tuned and have a noticeable impact on their performance. Therefore, HPO algorithms are essential to achieve the highest possible accuracy with minimal human intervention.
In this work, a new HPO algorithm has been described as a generalization of the discrete analog of a basic iterative algorithm for obtaining the solutions to certain conditional optimization problems for the logistic function. It has been shown that its performance is only weakly affected by changes in the size of the data subset with which it is run. The algorithm shows positive, statistically significant differences in efficiency with respect to the other HPO algorithms considered in this study. The algorithm can finish the tuning process by itself, requiring only an upper bound on the number of iterations to perform. Furthermore, it has been shown that, on average, it needs around 70% of the iterations needed by the other hyper-parameter optimization algorithms to achieve competitive results.
The results show that the algorithm achieves high accuracy, with similar results for all classifiers on each dataset. In addition, RHOASo can effectively use a small partition size to accelerate the HPO process without sacrificing the final accuracy of the model. Lastly, the automatic early stop ends the tuning process before reaching the fixed number of iterations (M_e = 34), further increasing its efficiency.
Future work can be aimed at several lines:
• Test RHOASo's performance with more machine learning algorithms, such as decision trees or k-nearest neighbors.

Data Availability Statement: The datasets supporting this work are from previously reported studies and datasets, which have been cited. The processed data are available from the corresponding author upon request.

Appendix A. Additional Results
In this appendix, we show the results concerning Experiment 1 when the ML algorithms are DT and KNN. Due to the great difference in the performance of RHOASo with respect to the rest of the HPO algorithms when run with these ML algorithms, we have excluded these analyses from the body of the article. However, we believe that the obtained results may be of interest.
We give a brief description of DT and KNN below.

1. DT is a tree-like model, where the internal nodes and their edges encode possibilities, and the terminal nodes (leaves) encode decisions. The maximum length of the paths joining the root node and a leaf is called the depth of the tree. There are a number of DT training algorithms, among which we can point out ID3, ID4, ID5, and CART. In this study, we have chosen CART.

2. KNN is a non-parametric classification model that may also be used in regression problems. The training examples are simply vectors in the feature space, carrying their class labels. The training phase does not consist of constructing an internal mathematical model, but simply of allocating the training data instances in the feature space. The classification phase is done by looking at the majority label of the k nearest neighbors of each point. This implies the choice of a metric on the feature space, which by default is usually taken to be the p = 2 Minkowski distance.
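The classification phase just described can be sketched in a few lines (a toy illustration, not the scikit-learn implementation used in the experiments):

```python
from collections import Counter

def knn_predict(train, query, k=3, p=2):
    """Classify `query` by the majority label among its k nearest
    training points under the Minkowski distance of order p."""
    def minkowski(a, b):
        return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)
    # "Training" is just the stored labelled vectors; classification
    # sorts them by distance to the query and takes the majority label.
    neighbors = sorted(train, key=lambda item: minkowski(item[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Labelled vectors allocated in the feature space (the "training phase").
train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((0.2, 0.1), "A"),
         ((1.0, 1.0), "B"), ((0.9, 1.1), "B"), ((1.1, 0.9), "B")]
print(knn_predict(train, (0.15, 0.15)))  # A
print(knn_predict(train, (1.0, 0.95)))   # B
```

Here k and p are exactly the two KNN hyper-parameters tuned in this appendix: the number of neighbors used for queries and the order of the Minkowski distance.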
In Table A1, we include the hyper-parameters that we have tuned. We have chosen them because of their influence on the corresponding ML algorithms (see [3], Appendices A1 and A4). Concerning the hyper-parameters of DT, we have chosen the minimum number of samples required to split an internal node (min_samples_split) and the minimum number of samples required to be at a leaf node (min_samples_leaf). Regarding the hyper-parameters of KNN, we have considered the number of neighbors to use for queries and p, the order of the Minkowski distance.
The search space for all hyper-parameters is the interval [1,50]. All hyper-parameters not being tuned are set to their default values in the scikit-learn implementation ( [34]). We have used 10-fold cross-validation to assess the performance of all ML models combined with the HPO algorithms.

In Figures A1-A6, we have plotted the median of the accuracy, MCC, total time, sensitivity, specificity, and number of iterations when RHOASo is applied together with DT and KNN. The median accuracy over all datasets is 0.85 for DT and 0.78 for KNN. We can observe that the accuracy achieved by RHOASo presents great stability in terms of partitions of the dataset, except for partition P4 and some datasets. The median MCC is 0.58 for DT and 0.41 for KNN; again, RHOASo presents behavior similar to that of the accuracy. This may be because P4 does not contribute to improving the fit of the models; see [1]. Regarding the time complexity, the median value is 1.11 s for DT and 22.53 s for KNN. It can be seen that the total time registered for D4 in KNN is higher for P3 than for P4. This may be caused by the stabilizer of RHOASo achieving an optimum value earlier on the total dataset than on P3. The median sensitivity is 0.88 for DT and 0.87 for KNN; in contrast, the median specificity is 0.87 for DT and 0.63 for KNN. These plots inherit the same trends as the graphics of the accuracy and MCC. The median number of iterations is 14 for DT and 19 for KNN. It is worth pointing out that the number of iterations for P3 is larger than for P4 when we work with KNN on D4; this is probably caused by the same reason as in the plot of the total time.

The rates of validity of RHOASo against the rest of the HPO algorithms are included in Table A2, which shows that, on average, the class corresponding to positive statistically significant differences exceeds 90%.
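As a minimal sketch of this evaluation setup, assuming scikit-learn [34] and a synthetic stand-in dataset (the real benchmark datasets are described earlier in the paper), one point of the DT search space can be scored with 10-fold cross-validation as follows; the specific hyper-parameter values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one of the benchmark datasets.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# One point of the DT search space: both tuned hyper-parameters lie in
# [1, 50] (note that scikit-learn requires min_samples_split >= 2).
model = DecisionTreeClassifier(min_samples_split=4, min_samples_leaf=2,
                               random_state=0)
scores = cross_val_score(model, X, y, cv=10)  # 10-fold cross-validation
print(len(scores), round(scores.mean(), 3))
```

An HPO algorithm repeats such an evaluation for each candidate configuration, which is why the number of iterations until stopping directly drives the total time complexity reported in Figures A1-A6.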
Note that the computations in Table A2 are carried out with the accuracies and time complexities via the analyses explained in Section 4.4. Since we have dealt with unbalanced datasets, we have repeated the analyses replacing accuracy with MCC so as to avoid over-optimistic scores. Figure A9 shows the MCCs achieved by the HPO algorithms over each dataset. The rates of validity obtained by RHOASo compared to the rest of the HPO algorithms, computed with the MCCs and time complexities, are included in Table A3.

We now analyze whether RHOASo achieves a significant performance improvement consistently across datasets and partitions.
The rates of validity for each partition computed with the accuracy and time complexity are included in Figure A10. As can be seen, RHOASo presents a much higher performance than any other HPO algorithm taken into account in this work, and this behavior is independent of the partition. As we can see in Figure A11, the above conclusion is the same when we discriminate rates of validity by datasets.
As we can see in Figure A12, the behavior of the rates of validity for each partition remains consistent when these are computed with the MCC and time complexity. As we can see in Figure A13, the same conclusion as in the case of accuracy is obtained when we discriminate by datasets.