High Per Parameter: A Large-Scale Study of Hyperparameter Tuning for Machine Learning Algorithms

Hyperparameters in machine learning (ML) have received a fair amount of attention, and hyperparameter tuning has come to be regarded as an important step in the ML pipeline. But just how useful is said tuning? While smaller-scale experiments have been previously conducted, herein we carry out a large-scale investigation, specifically one involving 26 ML algorithms, 250 datasets (regression and both binary and multinomial classification), 6 score metrics, and 28,857,600 algorithm runs. Analyzing the results, we conclude that for many ML algorithms, we should not expect considerable gains from hyperparameter tuning on average; however, there may be some datasets for which default hyperparameters perform poorly, the latter being truer for some algorithms than others. By defining a single hp_score value, which combines an algorithm's accumulated statistics, we are able to rank the 26 ML algorithms from those expected to gain the most from hyperparameter tuning to those expected to gain the least. We believe such a study may serve ML practitioners at large.


Introduction
In machine learning (ML), a hyperparameter is a parameter whose value is given by the user and used to control the learning process. This is in contrast to other parameters, whose values are obtained algorithmically via training.
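To make the distinction concrete, here is a minimal scikit-learn sketch (our illustration, not from the study's code): max_depth is a hyperparameter set by the user, while the fitted tree's split structure consists of parameters learned from data.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# max_depth is a hyperparameter: its value is given by the user
# and controls the learning process.
clf = DecisionTreeClassifier(max_depth=3).fit(X, y)

# The fitted tree's split features and thresholds are (ordinary)
# parameters: obtained algorithmically via training.
print(clf.get_depth(), clf.tree_.node_count)
```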
While software packages invariably provide hyperparameter defaults, practitioners will often tune these, either manually or through some automated process, to gain better performance.
Herein, we propose to examine the issue of hyperparameter tuning through an extensive empirical study, involving multitudinous algorithms, datasets, metrics, and hyperparameters. Our aim is to assess just how much of a performance gain can be had per algorithm by employing a performant tuning method.
The next section presents a brief account of previous work. Section 3 describes the experimental setup, followed by results and discussion in Section 4. Finally, we end with concluding remarks in Section 5.

Previous Work
There has been a fair amount of work on hyperparameters and it is beyond this paper's scope to provide a review. For that, we refer the reader to the recent, comprehensive review, "Hyperparameter Optimization: Foundations, Algorithms, Best Practices and Open Challenges" [3]. They wrote that "we would like to tune as few HPs [hyperparameters] as possible. If no prior knowledge from earlier experiments or expert knowledge exists, it is common practice to leave other HPs at their software default values..." [3] also noted that "more sophisticated HPO [hyperparameter optimization] approaches in particular are not as widely used as they could (or should) be in practice." We shall use a sophisticated HPO approach herein. That review, however, does not include an empirical study.
We present below only the most recent papers that are directly relevant to the current study.
A major work by [8] formalized the problem of hyperparameter tuning from a statistical point of view, defined data-based defaults, and suggested general measures quantifying the tunability of hyperparameters. The overall tunability of an ML algorithm, or that of a specific hyperparameter, was essentially defined by comparing the gain attained through tuning with some baseline performance, usually that attained when using default hyperparameters. They also conducted an empirical study involving 38 binary classification datasets from OpenML and six ML algorithms: elastic net, decision tree, k-nearest neighbors, support vector machine, random forest, and xgboost. Tuning was done through random search. They found that some algorithms benefited from tuning more than others, with elastic net and svm showing the most improvement and random forest showing the least.
[14] presented a methodology to determine the importance of tuning a hyperparameter, based on a non-inferiority test and tuning risk: the performance loss that is incurred when a hyperparameter is not tuned but set to a default value. They performed an empirical study involving 59 datasets from OpenML and two ML algorithms: support vector machine and random forest. Tuning was done through random search. Their results showed that leaving particular hyperparameters at their default values is non-inferior to tuning these hyperparameters. In some cases, leaving a hyperparameter at its default value even outperformed tuning it.
Finally, [13] recently presented results and insights pertaining to the black-box optimization (BBO) challenge at NeurIPS 2020. Analyzing the performance of 65 submitted entries, they concluded that "Bayesian optimization is superior to random search for machine learning hyperparameter tuning" (indeed, this is the paper's title). (NB: random search is usually better than grid search, e.g., [2].) We shall use Bayesian optimization herein.
Examining these recent studies, we made the following decisions regarding our setup, which is described in the next section:
• Consider significantly more algorithms.
• Consider significantly more datasets.
• Consider Bayesian optimization, rather than the lesser-performing random search or grid search (a minimal sketch contrasting the two samplers follows this list).
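To make the last decision concrete, the following minimal Optuna sketch (ours; the toy objective and search space are illustrative assumptions) shows that switching between Bayesian optimization and random search amounts to switching samplers:

```python
import optuna

def objective(trial):
    # Toy search space standing in for a model's hyperparameters;
    # 'C' is sampled in the log domain, 'degree' uniformly.
    c = trial.suggest_float("C", 1e-3, 1e3, log=True)
    degree = trial.suggest_int("degree", 1, 5)
    return (c - 1.0) ** 2 + degree  # placeholder score to minimize

# Bayesian optimization: Optuna's default Tree-structured Parzen Estimator.
bayes = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler(seed=42))
bayes.optimize(objective, n_trials=50)

# Random search: same interface, different sampler.
rand = optuna.create_study(direction="minimize",
                           sampler=optuna.samplers.RandomSampler(seed=42))
rand.optimize(objective, n_trials=50)

print(bayes.best_params, rand.best_params)
```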

Experimental Setup
Our setup involves numerous runs across a plethora of algorithms and datasets, comparing tuned and untuned performance over six distinct metrics. Below we detail the following setup components: datasets, algorithms, metrics, hyperparameter tuning, and overall flow.
Datasets. We used the recently introduced PMLB repository [9], which includes 166 classification datasets and 122 regression datasets. As we were interested in performing numerous runs, we retained the 144 classification datasets with number of samples ≤ 10992 and number of features ≤ 100, and the 106 regression datasets with number of samples ≤ 8192 and number of features ≤ 100. Figure 1 presents a summary of dataset characteristics. Note that the classification problems are both binary and multinomial.
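The size-based filtering described above is straightforward to reproduce; the sketch below is our illustration using the pmlb package's dataset-name lists and fetch_data function (note that fetch_data downloads each dataset on first call):

```python
from pmlb import fetch_data, classification_dataset_names, regression_dataset_names

def within_limits(name, max_samples, max_features=100):
    """Check a PMLB dataset against the study's size limits."""
    X, y = fetch_data(name, return_X_y=True)
    return X.shape[0] <= max_samples and X.shape[1] <= max_features

# Limits used in the study: <= 10992 samples (classification), <= 8192 (regression).
clf_names = [n for n in classification_dataset_names if within_limits(n, 10992)]
reg_names = [n for n in regression_dataset_names if within_limits(n, 8192)]
print(len(clf_names), "classification datasets;", len(reg_names), "regression datasets")
```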

Results and Discussion
A total of 96,192 replicates were performed, each comprising 300 algorithm runs (3 metrics × 50 Optuna trials, 3 metrics × 50 default trials), the final tally thus being 28,857,600 algorithm runs. Table 2 presents our results.
(Defaults were run repeatedly, rather than once, because examining the respective implementations revealed possible randomness; e.g., for DecisionTree, when max_features < n_features, the algorithm will select max_features features at random. Though the default is max_features = n_features, we still took no chances of there being some hidden randomness deep within the code.)
Examining Table 2 brings to the fore several interesting points. First, regressors are somewhat more susceptible to hyperparameter tuning, i.e., there is more to be gained by tuning vis-a-vis the default hyperparameters.
For most classifiers and, to a lesser extent, regressors, the median value shows little to be gained from tuning; yet the mean value, along with the standard deviation, suggests that for some algorithms there is a wide range in terms of tuning effectiveness. Indeed, examining the collected raw experimental results, we noted that at times there was a "low-hanging fruit" case: the default hyperparameters yielded very poor performance on some datasets, leaving room for considerable improvement through tuning.
It would seem useful to define a "bottom-line" measure: a summary score that essentially summarizes an entire table row, i.e., an ML algorithm's sensitivity to hyperparameter tuning. We believe any such measure would be inherently arbitrary to some extent; that said, we nonetheless put forward the following definition of hp_score (a sketch of the computation appears after the list):
• Consider (separately for classifiers and regressors) the 13 algorithms and 9 measures of Table 2 as a dataset with 13 samples and the following 9 features: metric1_median, metric2_median, metric3_median, metric1_mean, metric2_mean, metric3_mean, metric1_std, metric2_std, metric3_std.
• Apply scikit-learn's RobustScaler, which scales features using statistics that are robust to outliers: "This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile). Centering and scaling happen independently on each feature..." [10]
• The hp_score of an algorithm is then simply the mean of its 9 scaled features.
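To make the definition concrete, here is a minimal sketch of the hp_score computation, assuming the nine summary statistics of Table 2 have been collected into a pandas DataFrame; the algorithm names and values below are placeholders, not the paper's numbers:

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Rows: ML algorithms; columns: the 9 features derived from Table 2
# (median, mean, and std of percent improvement, for each of 3 metrics).
# All values are made-up placeholders for illustration only.
table2 = pd.DataFrame(
    {
        "metric1_median": [0.0, 0.1, 2.3],
        "metric2_median": [0.0, 0.2, 1.9],
        "metric3_median": [0.1, 0.0, 2.5],
        "metric1_mean":   [0.4, 1.2, 7.8],
        "metric2_mean":   [0.3, 1.5, 6.9],
        "metric3_mean":   [0.5, 1.1, 8.2],
        "metric1_std":    [1.0, 3.4, 20.1],
        "metric2_std":    [0.9, 3.1, 18.7],
        "metric3_std":    [1.2, 3.8, 21.5],
    },
    index=["AlgoA", "AlgoB", "AlgoC"],
)

# Remove the median and scale by the IQR, independently per feature;
# hp_score is then the mean of an algorithm's 9 scaled features.
scaled = RobustScaler().fit_transform(table2)
hp_score = pd.Series(scaled.mean(axis=1), index=table2.index)

print(hp_score.sort_values(ascending=False))  # higher => more to gain from tuning
```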
This hp_score is unbounded, because improvements or impairments can be arbitrarily high or low. A higher value means that the algorithm is expected to gain more from hyperparameter tuning (on average), while a lower value means that it is expected to gain less.

Table 3: The hp_score of each ML algorithm, computed from the values in Table 2. A higher value means that the algorithm is expected to gain more from hyperparameter tuning, while a lower value means that the algorithm is expected to gain less.
Table 3 presents the hp_scores of all 26 algorithms, sorted from highest to lowest per algorithm category (classifier or regressor). While simple and inherently imperfect, hp_score nonetheless seems to summarize fairly well the trends observable in Table 2.
The main insight from Tables 2 and 3 is: for most ML algorithms, we should not expect huge gains from hyperparameter tuning on average; however, there may be some datasets for which default hyperparameters perform poorly, the latter being truer for some algorithms than others. In particular, those algorithms at the bottom of the lists in Table 3 would likely not benefit greatly from a significant investment in hyperparameter tuning. Some algorithms are robust to hyperparameter selection, others somewhat less so.
Table 3 can be used in practice by an ML practitioner: 1) to decide how much to invest in hyperparameter tuning of a particular algorithm, and 2) to select algorithms that require less tuning. Hopefully, this will save time and energy [5]. For algorithms at the top of the lists, we may inquire as to whether particular hyperparameters are the root cause of their hyperparameter sensitivity; further, we may seek out better defaults. For example, [12] recently focused on hyperparameter tuning for KernelRidge, which is at the top of the regressor list in Table 3. [8] discussed the tunability of a specific hyperparameter, though they noted the problem of hyperparameter dependency.

Concluding Remarks
We performed a large-scale experiment of hyperparameter-tuning effectiveness, across multiple ML algorithms and datasets. More algorithms can be added in the future, as well as more datasets and more hyperparameters. Further, one can consider tweaking specific components of the setup (e.g., the metrics and the scaler of Algorithm 1). The code is available at https://github.com/moshesipper.
Given the findings herein, it seems that more often than not hyperparameter tuning will not provide huge gains over the default hyperparameters of the respective software packages examined (indeed, our findings suggest that most defaults were judiciously selected). A modicum of tuning would seem to be advisable, though other factors will likely play a stronger role in final model performance, including, to name a few: quality of raw data, solidity of data preprocessing, and choice of ML algorithm (curiously, this latter can itself be considered a tunable hyperparameter [11]).

Figure 1: Characteristics of the 144 classification datasets and 106 regression datasets used in our study.

Table 1: Value ranges or sets used by Optuna for hyperparameter tuning. For ease of reference, we use the function names of the respective software packages: scikit-learn, xgboost, and lightgbm. Values sampled from a range in the log domain are marked 'log'; otherwise, sampling is linear (uniform).
Metrics. We used three classification metrics (accuracy, balanced accuracy, F1) and three regression metrics (R2, adjusted R2, complement RMSE); see Table 2. Balanced accuracy is an accuracy score that takes into account class imbalances, essentially the accuracy score with class-balanced sample weights [10]; ∈ [0, 1]. Adjusted R2 is defined as 1 − (1 − r2) * (n − 1)/(n − p − 1), with r2 being the R2 score, n being the number of samples, and p being the number of features; ∈ [−∞, 1]. Complement RMSE is the complement of root mean squared error (RMSE), defined as 1 − RMSE; ∈ [−∞, 1]. The latter two thus have the same range as the R2 score.

Hyperparameter tuning. For hyperparameter tuning we used Optuna, a state-of-the-art automatic hyperparameter optimization software framework [1]. Optuna offers a define-by-run-style user API with which one can dynamically construct the search space, along with efficient sampling and pruning algorithms; moreover, our experience has shown it to be fairly easy to set up. Optuna formulates hyperparameter optimization as a process of minimizing or maximizing an objective function that takes a set of hyperparameters as input and returns its (validation) score. We used the default Tree-structured Parzen Estimator (TPE) Bayesian sampling algorithm. Optuna also provides pruning: automatic early stopping of unpromising trials [1].

Overall flow. Algorithm 1 presents the top-level flow of the experimental setup (per algorithm and dataset). For each combination of algorithm and dataset we perform 30 replicate runs. Each replicate separately assesses model performance over the respective three classification or regression metrics. A replicate begins by splitting the dataset into training and test sets, fitting a MinMaxScaler to the training set, and applying the fitted scaler to both the training set and the test set. Then, per each metric: 1. Run Optuna with the algorithm for n_trials = 50 trials over the training set to tune the model's hyperparameters and obtain the best model; compute the best model's test-set metric score. 2. Run the model with default hyperparameters and compute its test-set metric scores (50 default trials per metric, per the replicate accounting in the Results section).
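As a minimal sketch of a single replicate of this flow, consider the code below. This is our illustration rather than the paper's released implementation: the regressor and its two-hyperparameter search space are stand-ins for the ranges of Table 1, and scoring each trial via 3-fold cross-validation on the training set is an assumption. The adjusted R2 and complement RMSE helpers follow the definitions above.

```python
import numpy as np
import optuna
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import MinMaxScaler

def adjusted_r2(y_true, y_pred, p):
    # 1 - (1 - r2) * (n - 1)/(n - p - 1), per the definition above
    n = len(y_true)
    return 1 - (1 - r2_score(y_true, y_pred)) * (n - 1) / (n - p - 1)

def complement_rmse(y_true, y_pred):
    # 1 - RMSE, per the definition above
    return 1 - np.sqrt(mean_squared_error(y_true, y_pred))

def run_replicate(X, y, n_trials=50):
    # Split, then fit MinMaxScaler on the training set only.
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    scaler = MinMaxScaler().fit(X_train)
    X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

    def objective(trial):
        # Illustrative two-hyperparameter search space (cf. Table 1).
        model = RandomForestRegressor(
            n_estimators=trial.suggest_int("n_estimators", 10, 500),
            max_depth=trial.suggest_int("max_depth", 2, 20),
        )
        return cross_val_score(model, X_train, y_train, cv=3, scoring="r2").mean()

    study = optuna.create_study(direction="maximize",
                                sampler=optuna.samplers.TPESampler())
    study.optimize(objective, n_trials=n_trials)

    # Compare the tuned best model against default hyperparameters.
    tuned = RandomForestRegressor(**study.best_params).fit(X_train, y_train)
    default = RandomForestRegressor().fit(X_train, y_train)
    p = X.shape[1]
    for name, model in (("tuned", tuned), ("default", default)):
        pred = model.predict(X_test)
        print(name, r2_score(y_test, pred),
              adjusted_r2(y_test, pred, p), complement_rmse(y_test, pred))
```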

Table 2: Compendium of final results over 26 ML algorithms, 250 datasets, 96,192 replicates, and 28,857,600 algorithm runs. A table row presents the results of a single ML algorithm, showing a summary over all replicates and datasets. A table cell summarizes the results of an algorithm-metric pair. Cell values show median and mean(std), where median: median over all replicates and datasets of Optuna's percent improvement over default; mean(std): mean (with standard deviation) over all replicates and datasets of Optuna's percent improvement over default. Also shown is the total number of replicates for which these statistics were computed. Acc: accuracy score; Bal: balanced accuracy score; F1: F1 score; R2: R2 score; Adj R2: adjusted R2 score; C-RMSE: complement RMSE; Reps: total number of replicates. NB: the number of replicates may be smaller than the maximal possible value: 144 × 30 = 4320 for classification datasets, and 106 × 30 = 3180 for regression datasets. This is due to edge cases that cause a single replicate to terminate with an error, the vicissitudes of life on the cluster, and in small part to long runtimes triggering the 72-hour timeout (this happened with GradientBoostingClassifier for 14 datasets and with XGBClassifier for 8 datasets).
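For concreteness, a Table 2 cell could be computed along the following lines; this is our hedged reading of the caption, with an assumed raw-results layout and a plausible (unconfirmed) percent-improvement formula, since the exact computation lives in the study's repository:

```python
import pandas as pd

# Assumed layout: one row per (algorithm, dataset, replicate, metric),
# holding the tuned (Optuna) and default test-set scores.
runs = pd.DataFrame({
    "algorithm": ["AlgoA"] * 4,
    "metric":    ["Acc", "Acc", "F1", "F1"],
    "tuned":     [0.92, 0.88, 0.81, 0.79],
    "default":   [0.90, 0.88, 0.60, 0.78],
})

# One plausible definition of Optuna's percent improvement over default.
runs["pct_improvement"] = 100 * (runs["tuned"] - runs["default"]) / runs["default"]

# A cell of Table 2: median and mean(std) over all replicates and datasets,
# plus the replicate count (the Reps column).
cell = runs.groupby(["algorithm", "metric"])["pct_improvement"].agg(
    ["median", "mean", "std", "count"])
print(cell)
```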