Article

High Per Parameter: A Large-Scale Study of Hyperparameter Tuning for Machine Learning Algorithms

Moshe Sipper
Department of Computer Science, Ben-Gurion University, Beer-Sheva 8410501, Israel
Algorithms 2022, 15(9), 315; https://doi.org/10.3390/a15090315
Submission received: 27 July 2022 / Revised: 22 August 2022 / Accepted: 30 August 2022 / Published: 2 September 2022
(This article belongs to the Special Issue Algorithms for Natural Computing Models)

Abstract

Hyperparameters in machine learning (ML) have received a fair amount of attention, and hyperparameter tuning has come to be regarded as an important step in the ML pipeline. However, just how useful is said tuning? While smaller-scale experiments have been previously conducted, herein we carry out a large-scale investigation, specifically one involving 26 ML algorithms, 250 datasets (regression and both binary and multinomial classification), 6 score metrics, and 28,857,600 algorithm runs. Analyzing the results, we conclude that for many ML algorithms, we should not expect considerable gains from hyperparameter tuning on average; however, there may be some datasets for which default hyperparameters perform poorly, especially for some algorithms. By defining a single hp_score value, which combines an algorithm’s accumulated statistics, we are able to rank the 26 ML algorithms from those expected to gain the most from hyperparameter tuning to those expected to gain the least. We believe such a study will serve ML practitioners at large.

1. Introduction

In machine learning (ML), a hyperparameter is a parameter whose value is given by the user and used to control the learning process. This is in contrast to other parameters, whose values are obtained algorithmically via training.
Hyperparameter tuning, or optimization, is often costly and software packages invariably provide hyperparameter defaults. Practitioners will often tune these—either manually or through some automated process—to gain better performance. They may resort to previously reported “good” values or perform some hyperparameter-tuning experiments.
In recent years, there has been increased interest in software that performs automated hyperparameter tuning, such as Hyperopt [1] and Optuna [2]. The latter, for example, is a state-of-the-art hyperparameter tuner which formulates the hyperparameter optimization problem as a process of minimizing or maximizing an objective function that takes a set of hyperparameters as an input and returns its (validation) score. It also provides pruning, i.e., automatic early stopping of unpromising trials. Moreover, our experience has shown it to be fairly easy to set up, and indeed we used it successfully in our research [3,4].
A number of recent works, which we shall review, have tried to assess the importance of hyperparameters through experimentation. We propose herein to examine the issue of hyperparameter tuning through a significantly more extensive empirical study than has been performed to date, involving multitudinous algorithms, datasets, metrics, and hyperparameters. Our aim is to assess just how much of a performance gain can be had per algorithm by employing a performant tuning method.
The next section presents an account of relevant previous work. Section 3 describes the experimental setup, followed by results in Section 4. We discuss our findings in Section 5, and end with concluding remarks in Section 6.

2. Previous Work

There has been a fair amount of work on hyperparameters and it is beyond this paper’s scope to provide a detailed review. For that, we refer the reader to the recent comprehensive review: “Hyperparameter Optimization: Foundations, Algorithms, Best Practices and Open Challenges” [5].
Interestingly, Ref. [5] wrote that “we would like to tune as few HPs [hyperparameters] as possible. If no prior knowledge from earlier experiments or expert knowledge exists, it is common practice to leave other HPs at their software default values⋯”.
Ref. [5] also noted that “more sophisticated HPO [hyperparameter optimization] approaches in particular are not as widely used as they could (or should) be in practice” (the paper does not include an empirical study). We shall use a sophisticated HPO approach herein.
We present below only recent papers that are directly relevant to ours, “ancestors” of the current study, as it were.
A major work by [6] formalized the problem of hyperparameter tuning from a statistical point of view, defined data-based defaults, and suggested general measures quantifying the tunability of hyperparameters. The overall tunability of an ML algorithm or that of a specific hyperparameter was essentially defined by comparing the gain attained through tuning with some baseline performance, usually attained when using default hyperparameters. They also conducted an empirical study involving 38 binary classification datasets from OpenML, and six ML algorithms: elastic net, decision tree, k-nearest neighbors, support vector machine, random forest, and xgboost. Tuning was performed through a random search. They found that some algorithms benefited from tuning more than others, with elastic net and svm showing the highest improvement and random forest showing the lowest.
Ref. [7] presented a methodology to determine the importance of tuning a hyperparameter based on a non-inferiority test and tuning risk, i.e., the performance loss that is incurred when a hyperparameter is not tuned, but set to a default value. They performed an empirical study involving 59 datasets from OpenML and two ML algorithms: support vector machine and random forest. Tuning was performed through random search. Their results showed that leaving particular hyperparameters at their default value is noninferior to tuning these hyperparameters. In some cases, leaving the hyperparameter at its default value even outperformed tuning it.
Finally, Ref. [8] recently presented results and insights pertaining to the black-box optimization (BBO) challenge at NeurIPS 2020. Analyzing the performance of 65 submitted entries, they concluded that, “Bayesian optimization is superior to random search for machine learning hyperparameter tuning” (indeed this is the paper’s title) (NB: a random search is usually better than a grid search, e.g., [9]). We shall use Bayesian optimization herein.

The Current Study

After examining these recent studies, we made the following decisions regarding the experiments that we shall carry out herein:
  • Consider significantly more algorithms;
  • Consider significantly more datasets;
  • Consider Bayesian optimization, rather than weaker-performing random search or grid search.

3. Experimental Setup

Our setup involves numerous runs across a plethora of algorithms and datasets, comparing tuned and untuned performance over six distinct metrics. Below, we detail the following setup components:
  • Datasets;
  • Algorithms;
  • Metrics;
  • Hyperparameter tuning;
  • Overall flow.

3.1. Datasets

We used the recently introduced PMLB repository [10], which includes 166 classification datasets and 122 regression datasets. As we were interested in performing numerous runs, we retained the 144 classification datasets with number of samples ≤ 10,992 and number of features ≤ 100, and the 106 regression datasets with number of samples ≤ 8192 and number of features ≤ 100. Figure 1 presents a summary of dataset characteristics. Note that classification problems are both binary and multinomial.
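For reference, the size-based filtering just described might be applied with the pmlb package roughly as follows; this is a minimal sketch of our own (not the paper's code), shown for the classification limits, with the regression limits (8192 samples, 100 features) handled analogously via regression_dataset_names. Fetching every dataset just to inspect its shape is simple but slow, so this is purely illustrative.

# Sketch: filter PMLB classification datasets by size (assumed approach).
# Requires: pip install pmlb
from pmlb import classification_dataset_names, fetch_data

MAX_SAMPLES, MAX_FEATURES = 10_992, 100  # classification limits stated in the text

selected = []
for name in classification_dataset_names:
    X, y = fetch_data(name, return_X_y=True)
    if X.shape[0] <= MAX_SAMPLES and X.shape[1] <= MAX_FEATURES:
        selected.append(name)
print(f"retained {len(selected)} classification datasets")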

3.2. Algorithms

We investigated 26 ML algorithms—13 classifiers and 13 regressors—using the following software packages: scikit-learn [11], xgboost [12], and lightgbm [13]. The algorithms are listed in Table 1, along with the hyperparameter ranges or sets used in the hyperparameter search (described below).
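To illustrate how a row of Table 1 maps onto an Optuna search space, below is a hedged sketch for AdaBoostClassifier; the objective function, its name, and the cross-validation scoring are our own minimal illustration, while the ranges and log-sampling flags follow Table 1.

import optuna
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

def adaboost_objective(trial: optuna.Trial, X_train, y_train) -> float:
    # Table 1: n_estimators in [10, 1000] (log), learning_rate in [0.1, 10] (log)
    n_estimators = trial.suggest_int("n_estimators", 10, 1000, log=True)
    learning_rate = trial.suggest_float("learning_rate", 0.1, 10.0, log=True)
    model = AdaBoostClassifier(n_estimators=n_estimators, learning_rate=learning_rate)
    # 5-fold cross-validation over the training set serves as the trial's score
    return cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy").mean()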

3.3. Metrics

We used three separate metrics for classification problems:
  • Accuracy: the fraction of correct predictions (∈ [0, 1]).
  • Balanced accuracy: an accuracy score that takes class imbalance into account, essentially the accuracy score with class-balanced sample weights [14] (∈ [0, 1]).
  • F1 score: the harmonic mean of precision and recall; in the multi-class case, this is the weighted average of the per-class F1 scores (∈ [0, 1]).
We used three separate metrics for regression problems (a short computational sketch follows this list):
  • R2 score: the R² (coefficient of determination) regression score (∈ (−∞, 1]).
  • Adjusted R2 score: a modified version of the R2 score that adjusts for the number of predictors in a regression model. It is defined as 1 − (1 − r²)(n − 1)/(n − p − 1), with r² being the R2 score, n the number of samples, and p the number of features (∈ (−∞, 1]).
  • Complement RMSE: the complement of the root mean squared error (RMSE), defined as 1 − RMSE (∈ (−∞, 1]). This has the same range as the previous two metrics.
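The three regression metrics above can be computed from scikit-learn primitives; the helper functions below are a sketch with names of our own choosing, not the paper's code. (The classification metrics correspond to scikit-learn's accuracy_score, balanced_accuracy_score, and f1_score with average='weighted'.)

import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

def adjusted_r2(y_true, y_pred, n_features):
    # 1 - (1 - r^2)(n - 1)/(n - p - 1), as defined above
    r2 = r2_score(y_true, y_pred)
    n = len(y_true)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)

def complement_rmse(y_true, y_pred):
    # 1 - RMSE, as defined above
    return 1.0 - np.sqrt(mean_squared_error(y_true, y_pred))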

3.4. Hyperparameter Tuning

For hyperparameter tuning, we used Optuna, a state-of-the-art automatic hyperparameter optimization software framework [2]. Optuna offers a define-by-run-style user API where one can dynamically construct the search space, and an efficient sampling algorithm and pruning algorithm. Moreover, our experience has shown it to be fairly easy to set up. Optuna formulates the hyperparameter optimization problem as a process of minimizing or maximizing an objective function that takes a set of hyperparameters as an input and returns its (validation) score. We used the default tree-structured Parzen estimator (TPE) Bayesian sampling algorithm. Optuna also provides pruning, i.e., the automatic early stopping of unpromising trials [2].
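As a concrete illustration, a minimal Optuna study might be set up as follows. This is a sketch under our own assumptions: the one-hyperparameter ridge-regression objective and the synthetic data are ours; TPESampler and MedianPruner are Optuna's defaults, made explicit here for clarity.

import optuna
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X_train, y_train = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)

def objective(trial):
    # Illustrative single-hyperparameter search space
    alpha = trial.suggest_float("alpha", 1e-3, 10.0, log=True)
    model = Ridge(alpha=alpha)
    # Validation score via 5-fold cross-validation, to be maximized
    return cross_val_score(model, X_train, y_train, cv=5, scoring="r2").mean()

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(),  # Bayesian (TPE) sampling
    pruner=optuna.pruners.MedianPruner(),  # early stopping of unpromising trials
)
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)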

3.5. Overall Flow

Algorithm 1 presents the top-level flow of the experimental setup. For each combination of algorithm and dataset, we perform 30 replicate runs. Each replicate separately assesses model performance over the respective three classification or regression metrics. A replicate begins by splitting the dataset into training and test sets, and scaling them. Then, for each metric:
  • Optuna is run over the training set for 50 trials to tune the model’s hyperparameters, the best model is retained, and the best model’s test-set metric score is computed.
  • Fifty models are evaluated over the training set with default hyperparameters, the best model is retained, and the best model’s test-set metric score is computed. Strictly speaking, a few algorithms (decision tree, KNN, Bayesian) are essentially deterministic, but for consistency we still performed the 50 default-hyperparameter trials. Further, our examination of the respective implementations revealed possible randomness; e.g., for decision tree, when max_features < n_features, the algorithm selects max_features features at random. Though the default is max_features = n_features, we took no chances on hidden randomness deep within the code.
An evaluation of the model is carried out through five-fold cross-validation. At the end of each replicate, the test-set percent improvement in Optuna’s best model is computed over the default’s best model.
Algorithm 1 Experimental setup (per algorithm and dataset)
  • Input:
  •   algorithm ← algorithm to run
  •   dataset ← dataset to be used
  •   n_replicates ← 30 (number of replicates)
  •   n_trials ← 50 (number of Optuna trials; also the number of runs with default values)
  •   time_limit ← 72 h (for all replicates)
  • Output:
  •   Final scores (over test sets)
# 'metric1', 'metric2', 'metric3' are, respectively:
#     · For classification: accuracy, balanced accuracy, F1
#     · For regression: R2, adjusted R2, complement RMSE
# eval_score is the 5-fold cross-validation score
 1: Load dataset
 2: for rep ← 1 to n_replicates do
 3:     Randomly split dataset into 70% training_set and 30% test_set
 4:     Fit MinMaxScaler to training_set and apply fitted scaler to training_set and test_set
 5:     for metric in 'metric1', 'metric2', 'metric3' do
 6:         Run Optuna with algorithm for n_trials trials over training_set and obtain best_model    # use eval_score for single-trial evaluation
 7:         Train best_model over training_set
 8:         Compute metric for best_model over test_set
 9:         for i ← 1 to n_trials do
10:             Initialize a model with default hyperparameters
11:             Evaluate model over training_set using eval_score
12:             if eval_score is best obtained so far then
13:                 Save model as best_model
14:         Train best_model over training_set
15:         Compute metric for best_model over test_set
16:     imp1, imp2, imp3 ← percent improvement of Optuna over default for 'metric1', 'metric2', 'metric3'    # compute and record replicate scores
17:     if runtime > time_limit then
18:         break
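In Python, one replicate of this flow (for a single metric) might look roughly as follows. This is a condensed sketch under our own assumptions: the function, its arguments, and the guard against a zero default score are ours, and the timeout and per-metric bookkeeping are omitted.

import numpy as np
import optuna
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import MinMaxScaler

def run_replicate(model_cls, objective, metric, X, y, n_trials=50):
    # Steps 3-4: split 70/30, then fit MinMaxScaler on the training set only
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3)
    scaler = MinMaxScaler().fit(X_tr)
    X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

    # Steps 6-8: tune with Optuna over the training set, then score on the test set
    study = optuna.create_study(direction="maximize")
    study.optimize(lambda trial: objective(trial, X_tr, y_tr), n_trials=n_trials)
    tuned = model_cls(**study.best_params).fit(X_tr, y_tr)
    tuned_score = metric(y_te, tuned.predict(X_te))

    # Steps 9-15: evaluate n_trials default-hyperparameter models, keep the best
    best_cv, best_default = -np.inf, None
    for _ in range(n_trials):
        model = model_cls()
        cv = cross_val_score(model, X_tr, y_tr, cv=5).mean()  # eval_score
        if cv > best_cv:
            best_cv, best_default = cv, model
    best_default.fit(X_tr, y_tr)
    default_score = metric(y_te, best_default.predict(X_te))

    # Step 16: percent improvement of the tuned model over the default model
    denom = abs(default_score) if default_score != 0 else 1.0
    return 100.0 * (tuned_score - default_score) / denom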

4. Results

A total of 96,192 replicates were performed, each comprising 300 algorithm runs (3 metrics × 50 Optuna trials, 3 metrics × 50 default trials), with the final tally thus being 28,857,600 algorithm runs. Note that for each run, we used the fit method of the respective algorithm five times during five-fold cross-validation, i.e., the learning algorithm was executed five times per run. Table 2 presents our results.
Table 2 shows several interesting points. First, regressors are somewhat more susceptible to hyperparameter tuning, i.e., there is more to be gained by tuning vis-à-vis the default hyperparameters.
For most classifiers and, to a lesser extent, regressors, the median value shows little to be gained from tuning, yet the mean value along with the standard deviation suggests that for some algorithms there is a wide range in terms of tuning effectiveness. Indeed, by examining the collected raw experimental results, we noted occasional “low-hanging fruit” cases: the default hyperparameters yielded very poor performance on some datasets, leaving room for considerable improvement through tuning.
It would seem useful to define a “bottom-line” measure—a summary score, as it were, which essentially summarizes an entire table row, i.e., an ML algorithm’s sensitivity to hyperparameter tuning. We believe any such measure would be inherently arbitrary to some extent; that said, we nonetheless put forward the following definition of hp_score:
  • The 13 algorithms and 9 measures of Table 2 are considered (separately for classifiers and regressors) as a dataset with 13 samples and the following 9 features: metric1_median, metric2_median, metric3_median, metric1_mean, metric2_mean, metric3_mean, metric1_std, metric2_std, metric3_std.
  • Scikit-learn’s RobustScaler is applied, which scales features using statistics that are robust to outliers: “This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile). Centering and scaling happen independently on each feature…” [14].
  • The hp_score of an algorithm is then simply the mean of its nine scaled features.
This hp_score is unbounded because improvements or impairments can be arbitrarily high or low. A higher value means that the algorithm is expected to gain more from hyperparameter tuning, while a lower value means that the algorithm is expected to gain less from hyperparameter tuning (on average).
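For concreteness, the computation just described might be expressed as follows; this is a sketch in which the function name is ours and the caller supplies a 13-row, 9-column table of the Table 2 statistics.

import pandas as pd
from sklearn.preprocessing import RobustScaler

def hp_score(summary: pd.DataFrame) -> pd.Series:
    # summary: rows = algorithms, columns = the nine Table 2 statistics
    # (metric{1,2,3}_median, metric{1,2,3}_mean, metric{1,2,3}_std)
    scaled = RobustScaler().fit_transform(summary)  # IQR-based scaling, robust to outliers
    # hp_score of an algorithm = mean of its nine scaled features
    return pd.Series(scaled.mean(axis=1), index=summary.index).sort_values(ascending=False)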
Table 3 presents the hp_scores of all 26 algorithms, sorted from highest to lowest per algorithm category (classifier or regressor). While simple and inherently imperfect, hp_score nonetheless seems to summarize the trends observable in Table 2 fairly well.

5. Discussion

The main takeaway from Table 2 and Table 3 is as follows. For most ML algorithms, we should not expect huge gains from hyperparameter tuning on average; however, there may be some datasets for which default hyperparameters perform poorly, especially for some algorithms. In particular, those algorithms at the bottom of the lists in Table 3 would likely not benefit greatly from a significant investment in hyperparameter tuning. Some algorithms are robust to hyperparameter selection, while others are somewhat less robust.
Perhaps the main limitation of this work (as in others involving hyperparameter experimentation) pertains to the somewhat subjective choice of value ranges (Table 1). This is, ipso facto, unavoidable in empirical research such as this. While this limitation cannot be completely overcome, it can be offset given that the code is publicly available at https://github.com/moshesipper (accessed on 1 September 2022), and we and others may enhance our experiment and add additional findings. Indeed, we hope this to be the case.
Table 3 can be used in practice by an ML practitioner to:
  • Decide how much to invest in hyperparameter tuning of a particular algorithm;
  • Select algorithms that require less tuning to hopefully save time—as well as energy [15].

6. Concluding Remarks

We performed a large-scale experiment of hyperparameter-tuning effectiveness, across multiple ML algorithms and datasets. We found that for many ML algorithms, we should not expect considerable gains from hyperparameter tuning on average; however, there may be some datasets for which default hyperparameters perform poorly, especially for some algorithms. By defining a single hp_score value, which combines an algorithm’s accumulated statistics, we were able to rank the 26 ML algorithms from those expected to gain the most from hyperparameter tuning to those expected to gain the least. We believe such a study may serve ML practitioners at large, in several ways, as noted above.
There are many avenues for future work:
  • Algorithms may be added to the study.
  • Datasets may be added to the study.
  • Hyperparameters that have not been considered herein may be added.
  • Specific components of the setup may be managed (e.g., the metrics and the scaler of Algorithm 1).
  • Additional summary scores, like the hp_score, may be devised.
  • For algorithms at the top of the lists in Table 3, we may inquire as to whether particular hyperparameters are the root cause of their hyperparameter sensitivity; further, we may seek out better defaults. For example, Ref. [16] recently focused on hyperparameter tuning for KernelRidge, which is at the top of the regressor list in Table 3. Ref. [6] discussed the tunability of a specific hyperparameter, though they noted the problem of hyperparameter dependency.
Given the findings herein, it seems that, more often than not, hyperparameter tuning will not provide huge gains over the default hyperparameters of the respective software packages examined. A modicum of tuning would seem to be advisable, though other factors will likely play a stronger role in final model performance, including, to name a few, the quality of raw data, the solidity of data preprocessing, and the choice of ML algorithm (curiously, the latter can be considered a tunable hyperparameter [3]).

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Acknowledgments

I thank Raz Lapid for helpful comments.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Bergstra, J.; Yamins, D.; Cox, D.D. Hyperopt: A python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in Science Conference, Austin, TX, USA, 11–17 July 2013; Volume 13, p. 20.
  2. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-Generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 2623–2631.
  3. Sipper, M.; Moore, J.H. AddGBoost: A gradient boosting-style algorithm based on strong learners. Mach. Learn. Appl. 2022, 7, 100243.
  4. Sipper, M. Neural networks with à la carte selection of activation functions. SN Comput. Sci. 2021, 2, 1–9.
  5. Bischl, B.; Binder, M.; Lang, M.; Pielok, T.; Richter, J.; Coors, S.; Thomas, J.; Ullmann, T.; Becker, M.; Boulesteix, A.L.; et al. Hyperparameter Optimization: Foundations, Algorithms, Best Practices and Open Challenges. arXiv 2021, arXiv:2107.05847.
  6. Probst, P.; Boulesteix, A.L.; Bischl, B. Tunability: Importance of Hyperparameters of Machine Learning Algorithms. J. Mach. Learn. Res. 2019, 20, 1–32.
  7. Weerts, H.J.P.; Mueller, A.C.; Vanschoren, J. Importance of Tuning Hyperparameters of Machine Learning Algorithms. arXiv 2020, arXiv:2007.07588.
  8. Turner, R.; Eriksson, D.; McCourt, M.; Kiili, J.; Laaksonen, E.; Xu, Z.; Guyon, I. Bayesian Optimization is Superior to Random Search for Machine Learning Hyperparameter Tuning: Analysis of the Black-Box Optimization Challenge 2020. In Proceedings of the NeurIPS 2020 Competition and Demonstration Track, Virtual Event/Vancouver, BC, Canada, 6–12 December 2020; Volume 133, pp. 3–26.
  9. Bergstra, J.; Bengio, Y. Random Search for Hyper-Parameter Optimization. J. Mach. Learn. Res. 2012, 13, 281–305.
  10. Romano, J.D.; Le, T.T.; La Cava, W.; Gregg, J.T.; Goldberg, D.J.; Chakraborty, P.; Ray, N.L.; Himmelstein, D.; Fu, W.; Moore, J.H. PMLB v1.0: An open source dataset collection for benchmarking machine learning methods. arXiv 2021, arXiv:2012.00058v2.
  11. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
  12. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; ACM: New York, NY, USA, 2016; pp. 785–794.
  13. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3146–3154.
  14. Scikit-Learn: Machine Learning in Python. 2022. Available online: https://scikit-learn.org/ (accessed on 22 June 2022).
  15. García-Martín, E.; Rodrigues, C.F.; Riley, G.; Grahn, H. Estimation of energy consumption in machine learning. J. Parallel Distrib. Comput. 2019, 134, 75–88.
  16. Stuke, A.; Rinke, P.; Todorović, M. Efficient hyperparameter tuning for kernel ridge regression with Bayesian optimization. Mach. Learn. Sci. Technol. 2021, 2, 035022.
Figure 1. Characteristics of the 144 classification datasets and 106 regression datasets used in our study.
Table 1. Value ranges or sets used by Optuna for hyperparameter tuning. For ease of reference, we use the function names of the respective software packages: scikit-learn, xgboost, and lightgbm. Values sampled from a range in the log domain are marked as ‘log’, otherwise sampling is linear (uniform).
Classification (algorithm: hyperparameter values)
AdaBoostClassifier: n_estimators [10, 1000] (log); learning_rate [0.1, 10] (log)
DecisionTreeClassifier: max_depth [2, 10]; min_impurity_decrease [0.0, 0.5]; criterion {gini, entropy}
GradientBoostingClassifier: n_estimators [10, 1000] (log); learning_rate [0.01, 0.3]; subsample [0.1, 1]
KNeighborsClassifier: weights {uniform, distance}; algorithm {auto, ball_tree, kd_tree, brute}; n_neighbors [2, 20]
LGBMClassifier: n_estimators [10, 1000] (log); learning_rate [0.01, 0.2]; bagging_fraction [0.5, 0.95]
LinearSVC: max_iter [10, 10,000] (log); tol [1 × 10^−5, 0.1] (log); C [0.01, 10] (log)
LogisticRegression: penalty {l1, l2}; solver {liblinear, saga}
MultinomialNB: alpha [0.01, 10] (log); fit_prior {True, False}
PassiveAggressiveClassifier: C [0.01, 10] (log); fit_intercept {True, False}; max_iter [10, 1000] (log)
RandomForestClassifier: n_estimators [10, 1000] (log); min_weight_fraction_leaf [0.0, 0.5]; max_features {auto, sqrt, log2}
RidgeClassifier: solver {auto, svd, cholesky, lsqr, sparse_cg, sag, saga}; alpha [0.001, 10] (log)
SGDClassifier: penalty {l2, l1, elasticnet}; alpha [1 × 10^−5, 1] (log)
XGBClassifier: n_estimators [10, 1000] (log); learning_rate [0.01, 0.2]; gamma [0.0, 0.4]
Regression (algorithm: hyperparameter values)
AdaBoostRegressor: n_estimators [10, 1000] (log); learning_rate [0.1, 10] (log)
BayesianRidge: n_iter [10, 1000] (log); alpha_1 [1 × 10^−7, 1 × 10^−5] (log); lambda_1 [1 × 10^−7, 1 × 10^−5] (log); tol [1 × 10^−5, 0.1] (log)
DecisionTreeRegressor: max_depth [2, 10]; min_impurity_decrease [0.0, 0.5]; criterion {squared_error, friedman_mse, absolute_error}
GradientBoostingRegressor: n_estimators [10, 1000] (log); learning_rate [0.01, 0.3]; subsample [0.1, 1]
KNeighborsRegressor: weights {uniform, distance}; algorithm {auto, ball_tree, kd_tree, brute}; n_neighbors [2, 20]
KernelRidge: kernel {linear, poly, rbf, sigmoid}; alpha [0.1, 10] (log); gamma [0.1, 10] (log)
LGBMRegressor: lambda_l1 [1 × 10^−8, 10.0] (log); lambda_l2 [1 × 10^−8, 10.0] (log); num_leaves [2, 256]
LinearRegression: fit_intercept {True, False}; normalize {True, False}
LinearSVR: loss {epsilon_insensitive, squared_epsilon_insensitive}; tol [1 × 10^−5, 0.1] (log); C [0.01, 10] (log)
PassiveAggressiveRegressor: C [0.01, 10] (log); fit_intercept {True, False}; max_iter [10, 1000] (log)
RandomForestRegressor: n_estimators [10, 1000] (log); min_weight_fraction_leaf [0.0, 0.5]; max_features {auto, sqrt, log2}
SGDRegressor: alpha [1 × 10^−5, 1] (log); penalty {l2, l1, elasticnet}
XGBRegressor: n_estimators [10, 1000] (log); learning_rate [0.01, 0.2]; gamma [0.0, 0.4]
Table 2. Compendium of final results over 26 ML algorithms, 250 datasets, 96,192 replicates, and 28,857,600 algorithm runs. A table row presents results of a single ML algorithm, showing a summary of all replicates and datasets. A table cell summarizes the results of an algorithm–metric pair. Cell values show median and mean(std), where median: median over all replicates and datasets of Optuna’s percent improvement over default; mean(std): mean (with standard deviation) over all replicates and datasets of Optuna’s percent improvement over default. The total number of replicates for which these statistics were computed is also shown. Acc: accuracy score; Bal: balanced accuracy score; F1: F1 score; R2: R2 score; Adj R2: adjusted R2 score; C-RMSE: complement RMSE; Reps: total number of replicates.
Classification (cell format: median / mean (std) of percent improvement)
Algorithm | Acc | Bal | F1 | Reps
AdaBoostClassifier | 1.9 / 20.9 (65.3) | 2.2 / 21.5 (57.4) | 1.9 / 39.3 (150.1) | 4320
DecisionTreeClassifier | 0.0 / 115.6 (2.3 × 10^3) | 0.0 / 96.0 (2.2 × 10^3) | 0.0 / 55.7 (1.3 × 10^3) | 4220
GradientBoostingClassifier | 0.5 / 45.1 (1.4 × 10^3) | 0.6 / 48.9 (1.4 × 10^3) | 0.6 / 42.0 (1.1 × 10^3) | 4096
KNeighborsClassifier | 0.8 / 3.8 (13.8) | 1.8 / 5.9 (16.7) | 1.5 / 5.1 (17.7) | 4254
LGBMClassifier | 0.0 / 1.2 (11.5) | 0.0 / 1.0 (11.9) | 0.0 / 0.9 (12.1) | 4287
LinearSVC | 0.0 / 1.0 (8.3) | 0.0 / 1.9 (8.5) | 0.0 / 1.7 (8.2) | 4299
LogisticRegression | 0.0 / 1.5 (8.2) | 0.0 / 3.4 (12.5) | 0.0 / 3.4 (11.8) | 4307
MultinomialNB | 0.0 / 9.8 (58.9) | 8.5 / 27.5 (48.9) | 10.5 / 40.5 (128.6) | 4149
PassiveAggressiveClassifier | 1.9 / 7.8 (24.0) | 1.8 / 5.9 (18.6) | 3.0 / 10.8 (28.8) | 4301
RandomForestClassifier | 0.0 / 153.6 (2.3 × 10^3) | 0.0 / 218.8 (3.0 × 10^3) | 0.0 / 134.1 (2.0 × 10^3) | 4320
RidgeClassifier | 0.0 / 1.0 (6.8) | 0.0 / 1.4 (7.3) | 0.0 / 1.9 (7.9) | 4273
SGDClassifier | 1.2 / 5.0 (20.6) | 1.6 / 5.2 (16.7) | 2.0 / 8.6 (26.7) | 4212
XGBClassifier | 0.0 / 13.5 (643.1) | 0.0 / 11.4 (431.2) | 0.0 / 10.1 (467.6) | 4111
Regression (cell format: median / mean (std) of percent improvement)
Algorithm | R2 | Adj R2 | C-RMSE | Reps
AdaBoostRegressor | 2.0 / 3.6 (33.5) | 2.1 / −741.3 (9.5 × 10^3) | 3.8 / 5.1 (20.9) | 3179
BayesianRidge | 0.0 / 6.8 × 10^3 (3.7 × 10^5) | −0.0 / −3.3 (55.3) | 0.0 / 1.0 (9.8) | 3117
DecisionTreeRegressor | 3.8 / 61.5 (788.0) | 4.0 / 49.1 (841.5) | 7.0 / 63.4 (1.3 × 10^3) | 3150
GradientBoostingRegressor | 1.6 / 17.3 (430.9) | 1.7 / −6.6 × 10^5 (2.4 × 10^7) | 4.1 / 2.3 (126.4) | 3180
KNeighborsRegressor | 3.5 / 77.8 (627.4) | 3.5 / 18.8 (471.6) | 4.5 / 203.5 (5.3 × 10^3) | 3160
KernelRidge | 69.5 / −9.3 × 10^5 (5.0 × 10^7) | 65.9 / 3.6 × 10^3 (1.7 × 10^5) | 49.5 / 1.7 × 10^3 (8.1 × 10^4) | 3053
LGBMRegressor | 0.0 / 0.0 (25.6) | 0.0 / −1.2 (34.4) | 0.0 / 0.4 (2.1) | 3179
LinearRegression | 0.0 / 2.3 (70.4) | 0.0 / −35.1 (469.4) | 0.0 / −1.7 (62.8) | 3170
LinearSVR | 25.1 / 86.4 (2.7 × 10^3) | 24.3 / 173.5 (2.8 × 10^3) | 23.9 / 159.7 (2.2 × 10^3) | 3161
PassiveAggressiveRegressor | 71.6 / 180.7 (1.7 × 10^3) | 58.5 / −304.3 (4.1 × 10^3) | 62.0 / 331.9 (5.5 × 10^3) | 3167
RandomForestRegressor | −0.1 / 1.5 (44.2) | −0.2 / −1.2 × 10^3 (4.6 × 10^4) | −0.5 / −1.5 (13.2) | 3180
SGDRegressor | 0.0 / 2.6 (68.6) | 0.0 / −41.4 (2.0 × 10^3) | 0.0 / 2.2 (39.8) | 3167
XGBRegressor | 0.9 / 20.0 (717.1) | 0.8 / −675.6 (7.4 × 10^3) | 2.3 / 6.8 (164.6) | 3180
The number of replicates may be smaller than the maximal possible value (144 × 30 = 4320 for classification datasets, and 106 × 30 = 3180 for regression datasets). This is due to edge cases that cause a single replicate to terminate with an error, the vicissitudes of life on the cluster, and (in small part) long runtimes triggering the 72 h timeout (this happened with GradientBoostingClassifier for 14 datasets and with XGBClassifier for 8 datasets).
Table 3. The hp_score of each ML algorithm, computed from the values in Table 2. A higher value means that the algorithm is expected to gain more from hyperparameter tuning, while a lower value means that the algorithm is expected to gain less from hyperparameter tuning.
Classification | hp_score | Regression | hp_score
RandomForestClassifier | 3.89 | KernelRidge | 2110.75
DecisionTreeClassifier | 2.43 | GradientBoostingRegressor | 183.97
GradientBoostingClassifier | 1.52 | BayesianRidge | 35.19
MultinomialNB | 1.36 | PassiveAggressiveRegressor | 5.34
AdaBoostClassifier | 0.79 | LinearSVR | 2.13
PassiveAggressiveClassifier | 0.56 | KNeighborsRegressor | 0.59
XGBClassifier | 0.38 | DecisionTreeRegressor | 0.35
SGDClassifier | 0.35 | RandomForestRegressor | 0.09
KNeighborsClassifier | 0.27 | AdaBoostRegressor | −0.07
LogisticRegression | −0.08 | XGBRegressor | −0.10
LinearSVC | −0.09 | SGDRegressor | −0.23
RidgeClassifier | −0.10 | LinearRegression | −0.25
LGBMClassifier | −0.10 | LGBMRegressor | −0.26
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
