Article

Empirical Calibration of XGBoost Model Hyperparameters Using the Bayesian Optimisation Method: The Case of Bitcoin Volatility

by Saralees Nadarajah 1,*, Jules Clement Mba 2, Ndaohialy Manda Vy Ravonimanantsoa 3, Patrick Rakotomarolahy 4 and Henri T. J. E. Ratolojanahary 4

1 Department of Mathematics, University of Manchester, Manchester M13 9PL, UK
2 School of Economics, College of Business and Economics, University of Johannesburg, Johannesburg 2092, South Africa
3 Ecole Supérieur Polytechnique, Université d’Antananarivo, Antananarivo 101, Madagascar
4 Laboratory of Mathematics and Their Applications, University of Fianarantsoa, Fianarantsoa 301, Madagascar
* Author to whom correspondence should be addressed.
J. Risk Financial Manag. 2025, 18(9), 487; https://doi.org/10.3390/jrfm18090487
Submission received: 8 July 2025 / Revised: 27 August 2025 / Accepted: 28 August 2025 / Published: 2 September 2025
(This article belongs to the Section Mathematics and Finance)

Abstract

Ensemble learning techniques continue to attract growing interest for forecasting the volatility of cryptocurrency assets. In particular, XGBoost, an ensemble learning technique, has been shown in recent studies to provide the most accurate forecasts of Bitcoin volatility. However, the performance of XGBoost depends largely on the tuning of its hyperparameters. In this study, we examine the effectiveness of the Bayesian optimization method for tuning the XGBoost hyperparameters for Bitcoin volatility forecasting. We chose this method over the more commonly used manual, grid, and random searches because it uses acquisition functions to balance exploitation and exploration and thereby identify the most promising regions of the hyperparameter space, and because it can minimize the error while requiring less time and fewer resources to find an optimal configuration. The resulting XGBoost configuration improves the forecast accuracy of Bitcoin volatility. Our empirical results, obtained by letting the data speak for themselves, could be used in comparative studies on Bitcoin volatility forecasting. They are also relevant for volatility trading, option pricing, and the management of portfolios that include Bitcoin.

1. Introduction

Volatility is a key indicator of uncertainty and risk in financial markets. In the case of Bitcoin, an innovative and highly speculative digital asset, volatility far exceeds that observed in traditional markets. While this instability creates opportunities for substantial returns, it also complicates forecasting and risk management. In this context, reliable predictive models that can estimate volatility are essential for investors and quantitative finance researchers alike.
Among modern predictive modeling approaches, the XGBoost (Extreme Gradient Boosting) algorithm is one of the most effective at processing complex, nonlinear financial data. However, the quality of the predictions obtained depends heavily on calibrating its hyperparameters correctly, such as tree depth, learning rate, and regularization coefficients. Inadequate calibration can lead to overfitting or a loss of performance, thereby compromising the robustness of the model.
To optimize this calibration, Bayesian optimization (BO) (Brochu et al., 2010; Garnett, 2023) offers an effective, mathematically rigorous solution. Unlike traditional methods such as grid or random searches, BO uses a probabilistic model of the objective function and an acquisition function to intelligently guide exploration of the hyperparameter space. This approach reduces computational cost while increasing the likelihood of identifying optimal configurations.
This paper proposes performing empirical calibration of the XGBoost model using BO for Bitcoin volatility forecasting. The objective is twofold: firstly, to evaluate the effectiveness of this tuning strategy compared to conventional calibration methods; and secondly, to demonstrate how tools from probability, statistics, and optimization can be integrated into a practical approach that links mathematical rigour with real-world financial market issues.

2. Literature Review

In recent years, modeling and forecasting the volatility of financial assets, particularly cryptocurrencies such as Bitcoin, has attracted increasing attention in the literature. Several traditional approaches, such as GARCH models and stochastic models (Nadarajah et al., 2025), have been studied extensively to capture the complex dynamics and high variability inherent in these markets. However, machine learning methods such as gradient boosting algorithms, notably XGBoost, have emerged as powerful tools capable of modeling complex nonlinear relationships in large, noisy datasets.
Nevertheless, the performance of these models hinges on the precise calibration of their hyperparameters. In this regard, Bayesian optimization has become the preferred method for efficiently exploring the hyperparameter space, often outperforming classical methods such as exhaustive or random search. While several recent studies have demonstrated the advantages of this approach in various fields, few have specifically examined the combined use of XGBoost (Friedman, 2001) and Bayesian optimization for predicting cryptocurrency volatility.
The authors in Cowen-Rivers et al. (2022) studied black-box optimization of hyperparameter settings using HEBO (Heteroscedastic Evolutionary Bayesian Optimization), which enriches standard Bayesian Optimization (BO) algorithms (Brochu et al., 2010). They constructed a wide range of hyperparameter tuning tasks (108) covering classification and regression problems, using nine models (such as multi-layer perceptrons and support vector machines), six datasets (two regression and four classification) from the University of California Irvine (UCI) repository, and two metrics per dataset (such as negative log-likelihood or mean squared error). Each model had tunable hyperparameters, such as the number of hidden units in a neural network, and the objective was to adjust them so as to maximize or minimize one of the specified metrics. The black-box objective values were stochastic, with noise arising from the train–test splits used to calculate the losses. The paper reported results for a wide range of solvers that either rely on BO strategies or follow zero-order techniques such as differential evolution or particle swarms.
The authors in Joy et al. (2016) proposed a framework for fitting hyperparameters to big data using BO. The framework divides the big data set into chunks and generates the hyperparameter fit on these chunks in parallel. The framework then uses this information to tune the hyperparameters of the big data set using transfer learning, and evaluates its performance on the hyperparameter tuning tasks of two machine learning algorithms: a deep neural network and an SVM with a radial basis function kernel. These tasks were performed on two real-world data sets.
The authors in Bentéjac et al. (2021) observed that XGBoost and gradient boosting, when trained using the package defaults, were the least efficient methods. Consequently, the paper concluded that a meticulous search for parameters was necessary to create accurate gradient boosting-based models. This was not the case for random forest, whose performance was slightly better on average when the default parameter values were used. In fact, adjusting the randomization parameters, the subsampling rate and the number of features selected at each division in XGBoost proved useless as long as a certain randomization was used. In experiments, the paper set the subsampling rate to 0.75 without replacement and the number of features to the square root of the total number of features. This reduced the size of the parameter search grid by a factor of 16 and improved the average performance of XGBoost.
The authors in Putatunda and Rama (2019) proposed a new method for optimizing the hyperparameters of an extreme gradient boosting algorithm: Randomized-Hyperopt. Its performance was compared with other existing techniques, such as grid search, random search, and hyperopt, taking into account both prediction accuracy and time required. The paper found that Randomized-Hyperopt outperforms Grid Search and Hyperopt, and is close to or better than the average Gini of Hyperopt for all datasets. Furthermore, the Randomized-Hyperopt method has the shortest execution time for all datasets.
The authors in van Hoof and Vanschoren (2021) discussed Hyperboost, an AutoML (Automated Machine Learning) tool based on SMAC (Sequential Model-Based Algorithm Configuration). They described the challenges of adapting the Gradient Boosting model to BO and then explained how they adapted this model using quantile regression. However, they acknowledged that quantile regression alone is insufficient as it does not account for epistemic uncertainty. When estimating the 90th quantile, they provided optimistic estimates for the candidate configurations and combined these with the quantile regression estimates. They then combined these with a bonus for Manhattan distance from the nearest configuration. This bonus encourages users to avoid configurations that have already been observed. They demonstrated that Hyperboost outperformed SMAC on a reasonable set of classification problems when used with a small set of configuration spaces.
The authors in Lévesque (2018) discussed hyperparameter tuning using BO. This work evaluated the impact of overfitting on the optimization and proposed strategies to reduce it, applied BO to generate heterogeneous sets of classifiers, and proposed methods for optimizing hyperparameters with conditional structures. The hyperparameters of the following models were optimized: LDA, QDA, SVM, KNN, and AdaBoost. The work also used the Laplace kernel and Mondrian with several types of Gaussian processes (GPs) (Rasmussen & Williams, 2005).
The authors in Eggensperger et al. (2013) worked on a benchmark library for hyperparameter optimization and provided the first in-depth comparison of three optimizers built on surrogate models: Spearmint, SMAC, and TPE (van Hoof & Vanschoren, 2021). Of these, Spearmint uses a GP surrogate, whereas SMAC relies on random forests and TPE on tree-structured Parzen estimators. The paper provided a common interface for the three optimization software packages and made it easy to integrate new ones. The authors in Anghel et al. (2018) carried out work similar to van Hoof and Vanschoren (2021), focusing on the hyperparameters of XGBoost, LightGBM, and CatBoost.
The authors in Joy et al. (2020) developed a rapid hyperparameter tuning framework called Hypertune that draws on insights from the learning theory of CAPs. The paper also identified a new way of exploiting trends in generalization performance from a subset of the data to the whole dataset. The paper used directional derivatives to report monotonic trends in model generalization performance for hyperparameters that directly dictate model complexity. The paper then evaluated the effectiveness of the algorithm at fitting hyperparameters from several machine learning algorithms to real-world benchmark datasets. The results show that Hypertune outperforms Fabolas (Klein et al., 2017) and generic BO. The paper also showed that HyperTune is compatible with the state-of-the-art hyperparameter tuning algorithm, Hyperband (Li et al., 2018), and that it is widely applicable to optimization of hyperparameters.
Our objective is to optimize the hyperparameters of the XGBoost model using BO. Our study will begin with an introduction to the theory of Bayesian optimization and its application to Bitcoin volatility. This will be followed by a discussion of the results and a conclusion. All computations for the application were performed using R, version 4.1.3. The R codes used are given in Appendix A.

3. Bayesian Optimization

3.1. Overview

BO is a powerful strategy for locating the extrema of objective functions when evaluations are costly and an explicit analytical form of the function is unavailable. Instead of requiring derivatives or convexity assumptions, BO relies on observations obtained from sampled evaluations, making it particularly well suited for problems where the following apply:
  • The objective function has no closed-form expression.
  • Evaluations are expensive or time-consuming.
  • Derivatives are unavailable.
  • The optimization problem is non-convex.
A key strength of BO lies in its efficiency—it typically requires fewer evaluations than alternative optimization methods. This efficiency comes from incorporating prior knowledge about the problem, which guides the sampling process and reduces unnecessary exploration of the search space.
The term “Bayesian” reflects the use of Bayes’ theorem to update beliefs about the objective function as new data is observed. The prior distribution encodes assumptions about plausible functions, such as smoothness. These assumptions make certain function shapes more likely than others.
Formally, let $x_i$ denote the $i$th sample point and $f(x_i)$ the value of the objective function at that point. After $t$ observations, we have the data set
$$D_{1:t} = \{x_{1:t}, f(x_{1:t})\},$$
and the prior distribution over functions $P(f)$ is updated using the likelihood $P(D_{1:t} \mid f)$ to yield the posterior
$$P(f \mid D_{1:t}) = \frac{P(D_{1:t} \mid f)\, P(f)}{P(D_{1:t})}.$$
This posterior distribution then guides the selection of the next evaluation point, balancing the exploration of uncertain regions with the exploitation of promising areas.

3.2. Bayesian Optimization Approach

We begin by formulating the problem as $x^* = \arg\min_x f(x)$, where the objective function $f$ is assumed to be Lipschitz continuous. That is, there is a constant $C$ such that, for all $x_1, x_2$,
$$|f(x_2) - f(x_1)| \le C \, \|x_2 - x_1\|.$$
Our focus is on global rather than local optimization. In a minimization setting, a local minimizer $x^*$ need only satisfy, for some $\varepsilon > 0$,
$$f(x^*) \le f(x), \quad \forall x \ \text{such that} \ \|x^* - x\| < \varepsilon,$$
whereas a global minimizer must satisfy $f(x^*) \le f(x)$ for every $x$ in the search space.
If $f$ is convex, every local minimum is also a global minimum. In practice, however, convexity cannot be confirmed in advance, as the explicit form of $f$ is unknown and its gradient is unavailable. We do, however, assume that the boundaries of the search space are known.

3.3. A Priori Function

Every Bayesian method begins with an initial distribution. A BO method can be considered optimal if the following conditions hold:
  • The acquisition function is continuous and approximately minimizes the risk with respect to the global minimum at a fixed point.
  • The variance converges to zero (or to a positive minimum in the presence of noise) if and only if the distance to the nearest observation is zero.
  • The objective function itself is continuous.
  • The a priori function is homogeneous.
  • The method is independent of specific variants and applicable to a wide range of optimization tasks.
A GP on a space $\mathcal{X}$ is a random process mapping $\mathcal{X} \to \mathbb{R}$. In non-parametric Bayesian methods, GPs are often used as priors over functions. A GP is fully specified by its mean function $\mu(x)$ and its covariance function $k : \mathcal{X}^2 \to \mathbb{R}$. For all $x \in \mathcal{X}$, the distribution of $f(x)$ is given by
$$f(x) \sim \mathcal{N}\big(\mu(x), k(x, x)\big).$$
Formally,
$$\mu : \mathcal{X} \to \mathbb{R}, \qquad f \sim \mathcal{GP}(\mu, k).$$
Intuitively, a GP can be thought of as a distribution over functions. Instead of returning a single value $f(x)$ at a point $x$, it returns a mean and variance describing the uncertainty over possible values of $f$ at that point. This is why stochastic processes are sometimes referred to as “random functions”, by analogy to random variables.

3.3.1. Covariance Function

We choose the squared exponential (radial basis function) kernel
$$k(x, x') = \exp\left(-\gamma \, \|x - x'\|^2\right), \qquad \gamma > 0,$$
which takes values close to 1 for nearby points and approaches 0 for distant points. This reflects the intuitive idea that close points are strongly correlated, while far-apart points influence each other minimally.
Given a set of inputs $x_{1:t}$ and their function values $f_{1:t} = f(x_{1:t})$, the joint distribution is
$$f_{1:t} \sim \mathcal{N}(0, K),$$
where the kernel matrix $K$ is
$$K = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_t) \\ \vdots & \ddots & \vdots \\ k(x_t, x_1) & \cdots & k(x_t, x_t) \end{bmatrix}.$$
The diagonal entries are 1, reflecting perfect correlation of each point with itself, which is valid in the noise-free case. For simplicity, we assume a zero mean function $\mu(x) = 0$.
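As an illustration of the kernel matrix described above, the following is a minimal Python sketch (Python is used here for illustration only; the paper's own computations were done in R). The value `gamma = 1.0` and the three input points are assumptions chosen for the example, not values from the study.

```python
import math

def sq_exp_kernel(x, y, gamma=1.0):
    """Squared exponential kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def kernel_matrix(points, gamma=1.0):
    """Build the t-by-t matrix K with entries k(x_i, x_j)."""
    return [[sq_exp_kernel(p, q, gamma) for q in points] for p in points]

# Illustrative one-dimensional inputs (hypothetical, not from the paper).
points = [(0.0,), (0.1,), (2.0,)]
K = kernel_matrix(points)
for row in K:
    print([round(v, 4) for v in row])
```

The two nearby points (0.0 and 0.1) receive a covariance close to 1, while the distant pair (0.0 and 2.0) receives a covariance close to 0, and every diagonal entry equals 1.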

3.3.2. Posterior Prediction

Suppose we have previous observations $\{x_{1:t}, f_{1:t}\}$. We wish to use BO to select the next evaluation point $x_{t+1}$. Let $f_{t+1} = f(x_{t+1})$ be the unknown function value at this point. By the properties of GPs, $f_{1:t}$ and $f_{t+1}$ are jointly Gaussian:
$$\begin{bmatrix} f_{1:t} \\ f_{t+1} \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K & \mathbf{k} \\ \mathbf{k}^T & k(x_{t+1}, x_{t+1}) \end{bmatrix}\right),$$
where
$$\mathbf{k} = \big[\, k(x_{t+1}, x_1), \ldots, k(x_{t+1}, x_t) \,\big]^T.$$
Using the Sherman–Morrison–Woodbury formula (Sherman & Morrison, 1949), the predictive distribution is
$$P(f_{t+1} \mid D_{1:t}, x_{t+1}) = \mathcal{N}\big(\mu_t(x_{t+1}), \sigma_t^2(x_{t+1})\big)$$
with $\mu_t(x_{t+1}) = \mathbf{k}^T K^{-1} f_{1:t}$ and $\sigma_t^2(x_{t+1}) = k(x_{t+1}, x_{t+1}) - \mathbf{k}^T K^{-1} \mathbf{k}$.
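The predictive mean and variance above can be computed directly for a small example. The sketch below is a minimal pure-Python illustration under stated assumptions: one-dimensional inputs, a squared exponential kernel with `gamma = 1.0`, a small `jitter` added to the diagonal for numerical stability, and made-up observations. The helper names (`solve`, `gp_posterior`) are hypothetical and not taken from the paper's R code.

```python
import math

def kernel(x, y, gamma=1.0):
    """Squared exponential kernel for scalar inputs."""
    return math.exp(-gamma * (x - y) ** 2)

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z

def gp_posterior(xs, fs, x_new, gamma=1.0, jitter=1e-10):
    """Return (mean, variance) of f(x_new) given noise-free observations."""
    K = [[kernel(a, b, gamma) + (jitter if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    k_vec = [kernel(x_new, a, gamma) for a in xs]
    mean = sum(k * w for k, w in zip(k_vec, solve(K, fs)))       # k^T K^{-1} f
    var = kernel(x_new, x_new, gamma) - sum(
        k * w for k, w in zip(k_vec, solve(K, k_vec)))           # k(x,x) - k^T K^{-1} k
    return mean, max(var, 0.0)

# Hypothetical observations of an unknown objective.
xs, fs = [0.0, 1.0, 2.0], [1.0, 0.5, -0.3]
print(gp_posterior(xs, fs, 1.0))   # at an observed point
print(gp_posterior(xs, fs, 1.5))   # between observations
print(gp_posterior(xs, fs, 10.0))  # far from all observations
```

At an observed point the posterior reproduces the observed value with variance close to zero, while far from the data it reverts to the zero prior mean with variance close to 1, which is the behavior the acquisition functions exploit.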

3.4. Choice of Covariance Function

Selecting an appropriate GP covariance function is critical, as it governs the smoothness and regularity of functions drawn from the process. The commonly used squared exponential kernel makes a simplifying assumption: it treats all input features as contributing equally to the covariance, ignoring potential differences in their influence.
To address this limitation, isotropic kernels can be extended by introducing hyperparameters. In the isotropic case, a single length-scale parameter θ controls the kernel width
$$k(x_i, x_j) = \exp\left(-\frac{1}{2\theta^2} \, \|x_i - x_j\|^2\right).$$
Reducing $\theta$ narrows the kernel, allowing the GP to model functions that vary more rapidly.
For anisotropic (direction-dependent) models, the most common choice is the squared exponential kernel with a vector of length-scale parameters $\theta$ for automatic relevance determination (ARD)
$$k(x_i, x_j) = \exp\left(-\frac{1}{2} (x_i - x_j)^T \operatorname{diag}(\theta)^{-1} (x_i - x_j)\right),$$
where $\operatorname{diag}(\theta)$ is a diagonal matrix with $d$ entries, one per dimension. Intuitively, with the kernel written in this form, a large $\theta_l$ implies that differences along the $l$th input dimension have little influence on the covariance, effectively rendering that dimension irrelevant and removing it from the model.
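To make the role of the per-dimension parameters concrete, here is a small Python sketch of the ARD kernel in the form given above. The two points and the values of `theta` are hypothetical and chosen only to show how one dimension can be switched off.

```python
import math

def ard_kernel(x, y, theta):
    """ARD squared exponential: exp(-0.5 * (x - y)^T diag(theta)^{-1} (x - y))."""
    quad = sum((a - b) ** 2 / t for a, b, t in zip(x, y, theta))
    return math.exp(-0.5 * quad)

# Two points that differ only in the second coordinate (illustrative values).
x_a, x_b = (0.0, 0.0), (0.0, 3.0)

k_equal = ard_kernel(x_a, x_b, theta=(1.0, 1.0))   # both dimensions matter
k_ard = ard_kernel(x_a, x_b, theta=(1.0, 1e6))     # second dimension nearly ignored
print(k_equal, k_ard)
```

With equal parameters the two points look almost uncorrelated, whereas a very large entry for the second dimension makes the kernel treat them as nearly identical, so that dimension no longer affects the model.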
Hyperparameters are typically learned by seeding the GP with a small set of random samples and maximizing the log marginal likelihood (Jones et al., 1998; Santner, 2003; Sasena, 2002; Williams & Rasmussen, 2006). This process can be aided by informative priors over the hyperparameters (Frean & Boyle, 2008; Lizotte, 2008). Efficient and robust methods for hyperparameter learning remain an active research area (Brochu et al., 2010; Osborne & Osborne, 2010).

3.5. Acquisition Function in BO

After covering the implementation of the prior on smooth functions and its updates with new observations, we now turn to the acquisition function in BO.
The acquisition function directs the search toward the optimum by assigning high values to locations that are promising—due to high predicted objective values, high uncertainty, or both. By maximizing this function, we determine the next point to evaluate. Formally, the next sample location is chosen as
$$x_{\text{next}} = \arg\max_x u(x \mid D),$$
where $u(\cdot)$ denotes the acquisition function, and $D$ is the set of current observations.

3.5.1. Expected Improvement (EI) and Probability of Improvement (PI)

The earliest acquisition function proposed aims to maximize the Probability of Improvement (PI) over the current best observation. Let
$$x^+ = \arg\max_{x_i \in x_{1:t}} f(x_i)$$
denote the best location found so far. The PI function is defined as
$$PI(x) = P\big(f(x) \ge f(x^+)\big) = \Phi\left(\frac{\mu(x) - f(x^+)}{\sigma(x)}\right),$$
where $\Phi(\cdot)$ is the standard normal cumulative distribution function. While intuitive, PI is purely exploitative, favoring points with a high chance of beating $f(x^+)$, even if the potential improvement is small. To encourage exploration, an offset parameter $\xi \ge 0$ is often introduced:
$$PI(x) = P\big(f(x) \ge f(x^+) + \xi\big) = \Phi\left(\frac{\mu(x) - f(x^+) - \xi}{\sigma(x)}\right).$$
The choice of $\xi$ is left to the user. The authors in Kushner and Yin (1997) recommend starting with a large $\xi$ (to promote exploration) and decreasing it over time. Empirical studies (Jones, 2001; Lizotte, 2008; Törn & Žilinskas, 1989) show that the choice of $\xi$ can have a substantial effect depending on the problem domain.
PI can be useful in perceptual or preference-based models, where thresholds of noticeable change exist, since it naturally targets points likely to achieve a specified level of improvement.
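The PI formula with offset can be evaluated with nothing more than the error function. The following Python sketch is illustrative only: the posterior mean, standard deviation, and best value are made-up numbers, and the handling of the degenerate case `sigma = 0` is an assumption, since the formula itself is undefined there.

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probability_of_improvement(mu, sigma, f_best, xi=0.0):
    """PI(x) = Phi((mu - f_best - xi) / sigma) in the maximization form used above."""
    if sigma <= 0.0:
        # Assumption: with no uncertainty, improvement is certain or impossible.
        return 1.0 if mu >= f_best + xi else 0.0
    return norm_cdf((mu - f_best - xi) / sigma)

# Hypothetical posterior at a candidate point.
mu, sigma, f_best = 1.0, 2.0, 1.0
for xi in (0.0, 0.5, 2.0):
    print(xi, probability_of_improvement(mu, sigma, f_best, xi))
```

Raising `xi` lowers the score of points whose predicted mean is only marginally above the incumbent, so points with larger uncertainty become relatively more attractive.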
A more balanced approach is the Expected Improvement (EI), which accounts for both the probability and magnitude of improvement. Let the improvement function be
$$I(x) = \max\{0, \, f_{t+1}(x) - f(x^+)\}.$$
Then EI is defined as
$$EI(x) = \int_0^{\infty} I \cdot L(I) \, dI,$$
where $L(I)$ is the likelihood of achieving improvement $I$ given the posterior normal distribution:
$$L(I) = \frac{1}{\sqrt{2\pi} \, \sigma(x)} \exp\left(-\frac{\big(\mu(x) - f(x^+) - I\big)^2}{2\sigma^2(x)}\right).$$
Evaluating the integral yields the closed form (Jones, 2001; Mockus et al., 1978)
$$EI(x) = \begin{cases} \big(\mu(x) - f(x^+)\big) \Phi(Z) + \sigma(x) \phi(Z), & \sigma(x) > 0, \\ 0, & \sigma(x) = 0, \end{cases}$$
with $Z = \dfrac{\mu(x) - f(x^+)}{\sigma(x)}$, where $\phi(\cdot)$ is the standard normal probability density function.
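The closed-form expression for EI translates directly into a few lines of code. The Python sketch below follows the formula above term by term; the numerical inputs are hypothetical and serve only to show how the mean and the uncertainty each contribute.

```python
import math

def norm_pdf(z):
    """Standard normal probability density function."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI: (mu - f_best) * Phi(Z) + sigma * phi(Z), or 0 when sigma = 0."""
    if sigma <= 0.0:
        return 0.0
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm_cdf(z) + sigma * norm_pdf(z)

# Hypothetical candidates: same predicted mean, different uncertainty.
f_best = 0.0
print(expected_improvement(mu=0.0, sigma=1.0, f_best=f_best))
print(expected_improvement(mu=0.0, sigma=2.0, f_best=f_best))
print(expected_improvement(mu=0.5, sigma=1.0, f_best=f_best))
```

For a fixed predicted mean, EI grows with the posterior standard deviation, and for a fixed standard deviation it grows with the predicted mean, which is how the criterion weighs exploration against exploitation.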
While EI is typically applied myopically (one step ahead), multi-step extensions exist, including closed-form two-step EI (Ginsbourger et al., 2008) and multi-step Bayesian optimization strategies (Garnett et al., 2010).

3.5.2. Upper Confidence Bound (UCB) Acquisition Functions

Unlike PI and EI, which focus on local improvement near the current maximum, UCB-based strategies (Srinivas et al., 2009) aim to reduce uncertainty across the entire search space. A naïve rule is to select
$$x_t = \arg\max_{x \in D} \mu_{t-1}(x),$$
which is purely exploitative and risks premature convergence. (Here $D$ denotes the search domain rather than the set of observations.)
The Gaussian Process Upper Confidence Bound (GP-UCB) balances exploration and exploitation via
$$x_t = \arg\max_{x \in D} \big[\, \mu_{t-1}(x) + \beta_t \, \sigma_{t-1}(x) \,\big],$$
where $\beta_t$ controls the exploration–exploitation trade-off. The first term favors high predicted values, while the second favors uncertain regions. The quantity inside the maximization can be interpreted as an upper quantile of the posterior marginal $P(f(x) \mid y_{t-1})$.
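The difference between the naive rule and GP-UCB can be seen on a small finite candidate set. In the Python sketch below, the candidate values, posterior means, and posterior standard deviations are invented for illustration, and `gp_ucb_choice` is a hypothetical helper name; setting `beta = 0` recovers the purely exploitative rule.

```python
def gp_ucb_choice(candidates, mu, sigma, beta):
    """Pick the candidate maximizing mu(x) + beta * sigma(x)."""
    scores = [m + beta * s for m, s in zip(mu, sigma)]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]

# Hypothetical candidate values of one hyperparameter and their posterior summaries.
candidates = [0.1, 0.2, 0.3]
mu = [0.50, 0.60, 0.40]       # posterior means
sigma = [0.05, 0.01, 0.30]    # posterior standard deviations

greedy_pick, _ = gp_ucb_choice(candidates, mu, sigma, beta=0.0)
ucb_pick, _ = gp_ucb_choice(candidates, mu, sigma, beta=2.0)
print(greedy_pick, ucb_pick)
```

With `beta = 0` the candidate with the highest mean is selected, whereas with `beta = 2` the candidate with a lower mean but much larger uncertainty wins, illustrating how the second term drives exploration.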
From a theoretical perspective, GP-UCB is related to functions in a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}_k(D)$ (Wahba, 1990) associated with the kernel $k(\cdot, \cdot)$. In an RKHS, the norm
$$\|f\|_k = \sqrt{\langle f, f \rangle_k}$$
quantifies smoothness: kernels producing smoother functions induce larger penalties. This RKHS framework underpins theoretical regret bounds for GP-UCB under both Bayesian and non-Bayesian settings.

3.5.3. Extreme Gradient Boosting (XGBoost)

Consider the regularized objective function of the XGBoost model:
$$L(y) = \sum_{i=1}^{n} l(\hat{y}_i, y_i) + \sum_{m=1}^{M} \Omega(\delta_m)$$
with
$$\Omega(\delta) = \alpha \, |\delta| + \frac{1}{2} \beta \, \|\omega\|^2,$$
where $l$ is the loss function, $M$ is the number of trees, $|\delta|$ is the number of leaves in the regression tree $\delta$, $\omega$ is the vector of values assigned to its leaves, $\alpha$ is the penalty coefficient on the number of leaves, and $\beta$ is the ridge regularization coefficient. Our objective is to find $\theta^* = \arg\min_{\theta} L(y)$, with $\theta$ denoting the set of hyperparameters.
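A short numerical example may help to read the objective above. The Python sketch below evaluates the loss term plus the per-tree penalty for a toy model; the squared-error loss, the two small trees (represented only by their leaf values), and the coefficients `alpha = 1` and `beta = 2` are all assumptions made for illustration.

```python
def tree_penalty(leaf_weights, alpha, beta):
    """Omega(delta) = alpha * (number of leaves) + 0.5 * beta * ||omega||^2."""
    return alpha * len(leaf_weights) + 0.5 * beta * sum(w * w for w in leaf_weights)

def regularized_objective(y_true, y_pred, trees, alpha, beta):
    """Sum of per-observation losses plus the penalty of every tree.

    Assumption: squared-error loss l(y_hat, y) = (y_hat - y)^2.
    """
    loss = sum((p - y) ** 2 for y, p in zip(y_true, y_pred))
    return loss + sum(tree_penalty(t, alpha, beta) for t in trees)

# Toy example: two observations and two trees given by their leaf values.
y_true, y_pred = [1.0, 2.0], [1.5, 1.5]
trees = [[0.5, -0.5], [0.1, 0.2, 0.3]]
objective = regularized_objective(y_true, y_pred, trees, alpha=1.0, beta=2.0)
print(objective)
```

Here the loss term equals 0.5, the first tree contributes 2 + 0.5 = 2.5, and the second contributes 3 + 0.14 = 3.14, so trees with more leaves or larger leaf values are penalized more heavily.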
The hyperparameters include eta, max-depth, subsample, colsample-by-tree, and min-child-weight (named eta, max_depth, subsample, colsample_bytree, and min_child_weight in the XGBoost API). eta controls the learning rate, which determines how much each tree’s predictions contribute to the overall model: it scales the contribution of each new tree by multiplying its leaf weights before adding them to the model’s running prediction. A smaller eta helps prevent overfitting by making the boosting process more conservative.
The max-depth is a parameter that controls the maximum number of levels (depth) a decision tree can grow during training. A higher max-depth allows the model to capture more complex patterns and interactions between features, but it also increases the risk of overfitting, especially when combined with a small dataset or insufficient regularization. A shallower tree (low max-depth) tends to be more generalizable but may underfit by missing important feature relationships. In practice, max-depth works in tandem with other parameters like min-child-weight and gamma to balance model complexity and performance, and it is typically tuned through cross-validation.
The subsample parameter controls the fraction of the training data randomly sampled (without replacement) for growing each individual tree. It is a form of stochastic gradient boosting, similar to bagging, that helps reduce overfitting and improve model generalization. For example, setting subsample = 0.8 means each tree is built using only 80% of the training samples, chosen at random, which introduces diversity among trees and makes the ensemble less prone to fitting noise. Lower values can improve robustness but may also increase bias if set too low, while subsample = 1.0 means all data is used for every tree.
colsample-by-tree is a hyperparameter that controls the fraction of features (columns) randomly selected to train each individual tree in the boosting process. Its value ranges from 0 to 1, where 1 means all features are used, and smaller values make the model use only a subset of features per tree, introducing randomness that can help reduce overfitting and improve generalization—similar to the feature bagging technique in Random Forests. For example, setting colsample-by-tree = 0.8 means each tree will be built using 80% of the available features, chosen randomly before the tree grows. This is especially useful when working with high-dimensional datasets or when some features are highly correlated.
min-child-weight is a regularization parameter that controls the minimum sum of instance weights (or Hessians) needed in a child node before further splitting is allowed. It essentially sets a threshold on how much “training data mass” must exist in a branch, helping to prevent overfitting by stopping overly specific splits on small, noisy subsets of data. Higher values make the model more conservative in that it requires larger, more statistically significant splits, while lower values allow the model to create deeper, more fine-grained splits. In regression, this value relates to the sum of Hessians; in classification, it is influenced by class probabilities.
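The mechanics described in the preceding paragraphs can be mimicked in a few lines. The Python sketch below is a simplified illustration, not XGBoost's internal implementation: the helper names are hypothetical, the rounding of subsample sizes is an assumption, and the numbers are invented.

```python
import random

def boosted_prediction(base, tree_outputs, eta):
    """Running prediction: each tree's output is scaled by eta before being added."""
    pred = base
    for out in tree_outputs:
        pred += eta * out
    return pred

def draw_subsample(n_rows, n_cols, subsample, colsample_bytree, rng):
    """Draw row and column indices for one tree, without replacement."""
    rows = sorted(rng.sample(range(n_rows), max(1, int(subsample * n_rows))))
    cols = sorted(rng.sample(range(n_cols), max(1, int(colsample_bytree * n_cols))))
    return rows, cols

def split_allowed(hessians_left, hessians_right, min_child_weight):
    """A split is kept only if both children carry enough total Hessian weight."""
    return (sum(hessians_left) >= min_child_weight
            and sum(hessians_right) >= min_child_weight)

rng = random.Random(0)  # fixed seed so the illustration is reproducible
rows, cols = draw_subsample(n_rows=10, n_cols=5, subsample=0.8,
                            colsample_bytree=0.8, rng=rng)
print(len(rows), len(cols))
print(boosted_prediction(0.0, [1.0, 1.0, 1.0], eta=0.1))
print(split_allowed([1.0, 1.0], [1.0], min_child_weight=2.0))
```

With eta = 0.1, three trees that each output 1.0 move the prediction by only 0.3 in total, and the last call shows a split being rejected because one child would fall below the minimum weight.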

4. Application with the Volatility of Bitcoin

The volatility of Bitcoin has been the subject of extensive academic research, indicating that it exhibits significantly higher fluctuations compared to traditional asset classes like stocks and commodities. Empirical studies have consistently shown that Bitcoin’s price volatility is notably distinct due to factors including speculative trading, market sentiment, and external economic events. Some studies emphasize the asymmetric nature of Bitcoin’s volatility, whereby positive shocks exert a more substantial influence on price fluctuations than negative shocks of equal magnitude (Bouri et al., 2023; Tunahan et al., 2020). Additionally, Bitcoin options display similar implied volatility patterns to traditional financial markets, highlighting its speculative nature and susceptibility to shifts in investor sentiment (Zulfiqar & Gulzar, 2021). Various econometric models, particularly GARCH-type models, have been utilized to capture the inherent volatility dynamics of Bitcoin, revealing that trading volume and market sentiment are crucial determinants of Bitcoin’s pricing behavior (Jalal et al., 2020; Sapuric et al., 2022). The increase in Bitcoin’s popularity and its integration into financial markets has led to heightened volatility spillovers to other assets, reflecting its evolving role as both an investment vehicle and a speculative bubble (Bouri et al., 2018; Melawat & Gunarsih, 2023). Furthermore, the onset of the COVID-19 pandemic intensified volatility across cryptocurrency markets, demonstrating how macroeconomic uncertainties can impact Bitcoin’s trading patterns (Karim et al., 2014; Maghyereh & Abdoh, 2022). Overall, the literature indicates that Bitcoin’s volatility arises from a complex interplay of market factors, including its relatively young market status, regulatory developments, and global economic contexts, making it a focal point for research within financial economics.

4.1. Data

The daily Bitcoin data used is available at https://fr.investing.com/crypto/bitcoin/historical-data (accessed on 28 August 2025). The price series covers the period from 1 January 2017 to 31 January 2024. The Bitcoin price series $p_t$ is transformed into returns using $r_t = \log p_t - \log p_{t-1}$.
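The return transformation can be sketched as follows in Python. The price values are invented for illustration and are not taken from the dataset, and the natural logarithm is assumed.

```python
import math

def log_returns(prices):
    """Log returns: log(p_t) - log(p_{t-1}) for consecutive prices."""
    return [math.log(p1) - math.log(p0) for p0, p1 in zip(prices, prices[1:])]

# Hypothetical daily closing prices.
prices = [100.0, 105.0, 98.0, 101.0]
returns = log_returns(prices)
print(returns)
```

A convenient property of log returns is that they add up over time: the sum of the daily values equals the log of the last price divided by the first.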
The series of Bitcoin returns $r_t$ is shown in Figure 1. Between 2017 and 2018, the series shows high-frequency oscillations with occasional sharp spikes, suggesting periods of heightened volatility. In 2019, similar dense oscillations continue, though volatility appears slightly lower than in 2017. In early 2020, a sharp, pronounced downward spike occurs, likely corresponding to the COVID-19 market crash. Between 2021 and 2023, the returns continue fluctuating with relatively consistent volatility, though the frequency of extreme spikes appears somewhat diminished.
Some basic statistics are summarized in Table 1.
The largest single-day loss in the dataset is about −21.60%, showing that Bitcoin can drop very sharply. On average, Bitcoin’s daily return is a small positive value of roughly 0.063%, indicating a slight upward drift over the period. The largest single-day gain is about 9.88%, which is much smaller in magnitude than the worst loss. The variance of returns is 0.0003 (a measure of dispersion in squared units), which corresponds to a standard deviation of about 1.7% per day, reflecting high volatility. The skewness is −0.8287, meaning the distribution of returns is moderately left-skewed: large negative returns are more frequent or more severe than large positive ones. At 12.855, the kurtosis is far above the normal distribution’s value of 3, indicating very heavy tails: extreme returns (both gains and losses) occur much more often than under a normal distribution. Overall, the numbers suggest Bitcoin returns are volatile, prone to large swings, and have a distribution with a bias toward severe losses and frequent extreme movements.
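For readers who wish to reproduce this kind of summary, the Python sketch below computes the same descriptive statistics on a toy series. It assumes population moments (division by n) and reports raw kurtosis, for which the normal distribution gives 3; the estimator conventions behind Table 1 are not stated here, so results from other software may differ slightly. The input values are invented.

```python
import math

def describe(values):
    """Minimum, maximum, mean, variance, std, skewness and (raw) kurtosis."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # population variance
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0.0:
        raise ValueError("constant series: skewness and kurtosis undefined")
    return {
        "min": min(values), "max": max(values), "mean": mean,
        "variance": m2, "std": math.sqrt(m2),
        "skewness": m3 / m2 ** 1.5, "kurtosis": m4 / m2 ** 2,
    }

# Toy symmetric series (hypothetical, not Bitcoin data).
stats = describe([-2.0, -1.0, 0.0, 1.0, 2.0])
print(stats)
```

A negative skewness, as reported for Bitcoin, would indicate a longer left tail, and a kurtosis well above 3 would indicate heavier tails than the normal distribution.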

4.2. Tuning Hyperparameters with Bayesian Optimization

We used a GP with exponential kernel during the optimization. The hyperparameter ranges and their initializations are given in Table 2. The search space for the hyperparameters was taken from Verma (2024).
In Table 2, runif(b, c, a) denotes the uniform distribution between a and b in steps of c, and c(a, b) denotes the minimum and maximum bounds of a hyperparameter. sample(a:b, c, replace = TRUE) draws c values at random from the integers a to b; replace = TRUE means that sampling is done with replacement, so the same value can be drawn more than once.
The range defines the search space that the Bayesian optimization will explore, while the initializations are the random evaluation of a number of initial points before the Bayesian optimization begins using the GP to guide the search. At each iteration, cross-validation based on time series splitting was performed.
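One common way to implement cross-validation that respects time ordering is an expanding-window scheme, sketched below in Python. This is an assumption for illustration: the exact splitting scheme, the number of folds, and the minimum training size used in the study are not specified in this section, and `expanding_window_splits` is a hypothetical helper.

```python
def expanding_window_splits(n_obs, n_folds, min_train):
    """Return (train_indices, test_indices) pairs where training always precedes testing."""
    fold_size = (n_obs - min_train) // n_folds
    if fold_size < 1:
        raise ValueError("not enough observations for the requested folds")
    splits = []
    for k in range(n_folds):
        train_end = min_train + k * fold_size   # training window grows each fold
        test_end = train_end + fold_size
        splits.append((range(0, train_end), range(train_end, test_end)))
    return splits

# Illustrative sizes (not those of the Bitcoin series).
splits = expanding_window_splits(n_obs=100, n_folds=4, min_train=60)
for train, test in splits:
    print(len(train), test.start, test.stop - 1)
```

Each fold trains on all observations up to a cutoff and validates on the block that immediately follows, so no future information leaks into the training set, unlike in a randomly shuffled k-fold split.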
After the optimization, Table 3 gives the optimal hyperparameter values with the acquisition functions used and the best root mean squared error (RMSE). RMSE is a common metric for evaluating models that measures the average magnitude of prediction errors, giving higher weight to larger errors due to squaring. It is calculated by taking the square root of the mean of the squared differences between predicted values and actual values. Because the errors are squared before averaging, RMSE penalizes large deviations more severely, making it useful when large mistakes are especially undesirable. The result is expressed in the same units as the target variable, which makes it more interpretable than some other metrics, though it can be sensitive to outliers.
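The definition of RMSE given above, and its sensitivity to large errors, can be checked on a toy example. In the Python sketch below the data are invented, and the mean absolute error is included only as a point of comparison.

```python
import math

def rmse(actual, predicted):
    """Square root of the mean squared difference between predictions and actuals."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error, shown for comparison."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [0.0, 0.0, 0.0, 0.0]
spread_errors = [1.0, 1.0, 1.0, 1.0]   # four errors of size 1
one_big_error = [0.0, 0.0, 0.0, 2.0]   # a single error of size 2

print(rmse(actual, spread_errors), mae(actual, spread_errors))
print(rmse(actual, one_big_error), mae(actual, one_big_error))
```

Both prediction sets have the same RMSE, although the second has only half the mean absolute error, which shows how squaring gives a single large deviation as much weight as several moderate ones.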

4.3. Graphical Representation of Various Results

The three figures (Figure 2—EI; Figure 3—UCB; Figure 4—POI) all visualize the behavior of different Bayesian Optimization acquisition functions when tuning hyperparameters for XGBoost. Each figure includes the hyperparameter search space, optimization convergence, and the resulting learning curve. Although the structure of the plots is consistent, the outcomes and optimization behaviors differ noticeably.
Starting with the explored hyperparameter spaces, the EI and UCB acquisition functions (Figure 2 and Figure 3) explore a much wider and more uniform grid across the eta and max-depth dimensions. Their plots show a broader scatter of evaluated points, reflecting a tendency to explore more. In contrast, POI (Figure 4) focuses its search more tightly in a specific region, particularly at lower eta values and mid-to-lower max-depth levels. This tighter clustering implies that POI tends to exploit early and refine its search in a focused region, while EI and UCB prioritize broader exploration.
Looking at the color scale of the scores (the negative of the RMSE, so values closer to zero are better), the POI plot (Figure 4) covers a much narrower and higher-performing range, from around −0.02 to −0.05. In contrast, the scores for EI and UCB (Figure 2 and Figure 3) reach as low as −0.4, which reflects a different color scaling, likely because some of the evaluated configurations had much larger errors. The key difference is that POI’s plot shows consistently better scores (smaller errors), suggesting a more effective optimization in that specific range.
In the 3D hyperparameter space plots, the same trends persist. POI (Figure 4) shows its best results clustered in a narrow zone, especially at lower eta and higher subsample values. The UCB and EI (Figure 3 and Figure 2) have a more scattered spread, with the UCB (Figure 3) displaying several outlier points—evidence of aggressive exploration, which sometimes leads to poor-performing configurations.
The convergence plots offer some of the clearest distinctions. POI (Figure 4) converges the fastest and most smoothly, reaching a low RMSE within just a few iterations and staying stable afterward. The EI (Figure 2) shows one major dip (possibly an early good point) and then flattens, indicating slightly slower convergence. The UCB (Figure 3), however, is the most erratic. Its convergence curve fluctuates sharply several times, which reflects instability and inconsistent improvement. This instability, while potentially leading to better global optima, makes the UCB (Figure 3) less reliable for fast or resource-constrained tuning.
Finally, examining the XGBoost learning curves, all three acquisition functions eventually achieve low RMSE values for both training and test sets, suggesting successful model training. However, POI (Figure 4) achieves this with the smoothest and steepest descent, stabilizing around iteration 25 and maintaining close performance between training and test curves. The EI (Figure 2) also converges well, though over a slightly longer horizon. The UCB curve (Figure 3) takes the longest to settle, needing almost 70 iterations, but still manages to reach comparable final performance.
In summary, POI (Figure 4) demonstrates the most efficient and stable convergence, exploiting early good results and focusing its search effectively. The EI (Figure 2) represents a middle ground, balancing exploration and exploitation. The UCB (Figure 3) is the most exploratory and erratic, showing more variation in optimization quality, which could be beneficial in avoiding local minima but comes at the cost of stability and speed.
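The contrast between the three acquisition functions can be illustrated with their standard textbook forms for a maximization problem. The sketch below is an illustration under stated assumptions, not the implementation used by the ParBayesianOptimization package: the function names, the margin `xi`, and the two candidate points are hypothetical, while kappa = 2.576 is the value used in Appendix A.

```r
# Sketch: textbook forms of three acquisition functions for maximization,
# given the GP posterior mean `mu` and standard deviation `sigma` at a
# candidate point. `best` is the best score observed so far; `xi` and the
# function names are illustrative assumptions.
acq_poi <- function(mu, sigma, best, xi = 0) {
  pnorm((mu - best - xi) / sigma)                   # probability of beating `best`
}

acq_ei <- function(mu, sigma, best, xi = 0) {
  z <- (mu - best - xi) / sigma
  (mu - best - xi) * pnorm(z) + sigma * dnorm(z)    # expected size of the improvement
}

acq_ucb <- function(mu, sigma, kappa = 2.576) {
  mu + kappa * sigma                                # bonus grows with uncertainty
}

# Two hypothetical candidates (scores are negative RMSE values):
# A lies close to the incumbent with low uncertainty,
# B has a worse posterior mean but high uncertainty.
best <- -0.020
cand <- data.frame(mu = c(-0.019, -0.030), sigma = c(0.001, 0.020),
                   row.names = c("A", "B"))
cand$poi <- acq_poi(cand$mu, cand$sigma, best)
cand$ei  <- acq_ei(cand$mu, cand$sigma, best)
cand$ucb <- acq_ucb(cand$mu, cand$sigma)
print(cand)
```

In this toy case, POI ranks the low-uncertainty candidate A higher, whereas EI and UCB both rank the uncertain candidate B higher, which is consistent with the more exploitative behavior of POI and the more exploratory behavior of EI and UCB described above.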

5. Conclusions

This study focused on empirically calibrating the XGBoost model for predicting Bitcoin volatility using Bayesian optimization to adjust the hyperparameters. The explanatory variables included technical indicators, historical volatility measures, and exogenous factors related to the cryptocurrency market. This allowed us to capture the complexity and specific dynamics of this volatile financial asset.
The main innovation lies in applying a Bayesian optimization method based on a Gaussian process, which explores the hyperparameter space efficiently. This reduces computational cost while improving the model’s predictive accuracy. The approach optimized performance in terms of root mean squared error (RMSE) and helped preserve the robustness of the model against overfitting through the dynamic management of the exploration–exploitation balance during the optimization process.
The results highlight the importance of adopting advanced optimization techniques in financial time series modeling, especially for highly volatile assets like Bitcoin. By strengthening the model’s predictive capacity and generalization, this approach opens up promising prospects for risk management and decision-making in digital financial markets.
The limited interpretability of machine learning techniques with respect to features and stylized facts, such as those seen in Bitcoin volatility, is not a barrier to their use when forecasting accuracy is the focus. Work on the interpretability of machine learning techniques continues in various fields.

Author Contributions

Conceptualization, S.N., J.C.M., N.M.V.R., P.R., and H.T.J.E.R.; methodology, S.N., J.C.M., N.M.V.R., P.R., and H.T.J.E.R.; investigation, S.N., J.C.M., N.M.V.R., P.R., and H.T.J.E.R. All authors have read and agreed to the published version of the manuscript.

Funding

No funds were received for this research.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data can be obtained from the corresponding author.

Acknowledgments

The authors would like to thank the editor and the three referees for their careful reading and comments, which improved the paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. R Codes

library(xgboost)
library(ParBayesianOptimization)
library(forecast)
library(ggplot2)
library(scatterplot3d)

# 1. Load data
setwd("C:/R/package/3.5/code")
data <- read.csv2("exg.csv", sep = ";", dec = ",", header = TRUE)

# Residuals of an automatically selected ARIMA model fitted to column 5
x <- residuals(auto.arima(as.numeric(data[1:2587, 5])))

# Lag matrix: embed() puts the current value in column 1
# and lags 1 to 3 in columns 2 to 4
ac <- as.data.frame(embed(x, 4))
colnames(ac) <- c("r4", "r3", "r2", "r1")

# 2. Scoring function
scoring_function <- function(eta, max_depth, subsample,
                             colsample_bytree, min_child_weight) {

  params <- list(
    booster = "gbtree",
    eta = eta,
    max_depth = as.integer(max_depth),
    subsample = subsample,
    colsample_bytree = colsample_bytree,
    min_child_weight = as.integer(min_child_weight),
    objective = "reg:squarederror",
    eval_metric = "rmse"
  )

  k <- 5  # Number of time splits
  n <- nrow(ac)
  fold_size <- floor(n / (k + 1))

  rmses <- c()

  for (i in 1:k) {
    # Expanding training window, followed by the next block as the test set
    train_index <- 1:(fold_size * i)
    test_index <- (fold_size * i + 1):(fold_size * (i + 1))

    dtrain <- xgb.DMatrix(data = as.matrix(ac[train_index, 2:4]),
                          label = ac[train_index, 1])

    dtest <- xgb.DMatrix(data = as.matrix(ac[test_index, 2:4]),
                         label = ac[test_index, 1])

    model <- xgb.train(
      params = params,
      data = dtrain,
      nrounds = 100,
      verbose = 0,
      watchlist = list(train = dtrain, test = dtest),
      early_stopping_rounds = 10
    )

    pred <- predict(model, dtest)
    rmse <- sqrt(mean((ac[test_index, 1] - pred)^2))
    rmses <- c(rmses, rmse)
  }

  mean_rmse <- mean(rmses)
  # bayesOpt maximizes Score, so return the negative of the mean RMSE
  return(list(Score = -mean_rmse, Pred = NA))
}

# 3. Bayesian optimization with time series splits
bounds <- list(
  eta = c(0.001, 0.3),
  max_depth = c(3L, 10L),
  subsample = c(0.5, 1.0),
  colsample_bytree = c(0.5, 1.0),
  min_child_weight = c(1L, 10L)
)

set.seed(123)
opt_results <- bayesOpt(
  FUN = scoring_function,
  bounds = bounds,
  initPoints = 10,
  iters.n = 20,
  acq = "ucb",
  kappa = 2.576
)

print(opt_results$scoreSummary)

References

  1. Anghel, A. S., Papandreou, N., Parnell, T., De Palma, A., & Pozidis, H. (2018, December 2–8). Benchmarking and optimization of gradient boosting decision tree algorithms. Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada.
  2. Bentéjac, C., Csörgő, A., & Martínez-Muñoz, G. (2021). A comparative analysis of gradient boosting algorithms. Artificial Intelligence Review, 54, 1937–1967.
  3. Bouri, E., Das, M., Gupta, R., & Roubaud, D. (2018). Spillovers between Bitcoin and other assets during bear and bull markets. Applied Economics, 50, 5935–5949.
  4. Bouri, E., Salisu, A. A., & Gupta, R. (2023). The predictive power of Bitcoin prices for the realized volatility of US stock sector returns. Financial Innovation, 9, 62.
  5. Brochu, E., Brochu, T., & De Freitas, N. (2010, July 2–4). A Bayesian interactive optimization approach to procedural animation design. 2010 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (pp. 103–112), Madrid, Spain.
  6. Brochu, E., Cora, V. M., & De Freitas, N. (2010). A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv, arXiv:1012.2599.
  7. Cowen-Rivers, A. I., Lyu, W., Tutunov, R., Wang, Z., Grosnit, A., Griffiths, R. R., Maraval, A. M., Jianye, H., Wang, J., Peters, J., & Ammar, H. B. (2022). HEBO: Pushing the limits of sample-efficient hyper-parameter optimisation. Journal of Artificial Intelligence Research, 74, 1269–1349.
  8. Eggensperger, K., Feurer, M., Hutter, F., Bergstra, J., Snoek, J., Hoos, H., & Leyton-Brown, K. (2013). Towards an empirical foundation for assessing Bayesian optimization of hyperparameters. In NIPS workshop on Bayesian optimization in theory and practice (Vol. 10, pp. 1–5). NeurIPS.
  9. Frean, M., & Boyle, P. (2008, December 3–5). Using Gaussian processes to optimize expensive functions. Australasian Joint Conference on Artificial Intelligence (pp. 258–267), Auckland, New Zealand.
  10. Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29, 1189–1232.
  11. Garnett, R. (2023). Bayesian optimization. Cambridge University Press.
  12. Garnett, R., Osborne, M. A., & Roberts, S. J. (2010, April 12–16). Bayesian optimization for sensor set selection. 9th ACM/IEEE International Conference on Information Processing in Sensor Networks (pp. 209–219), Stockholm, Sweden.
  13. Ginsbourger, D., Le Riche, R., & Carraro, L. (2008). A multi-points criterion for deterministic parallel global optimization based on Gaussian processes. HAL Science.
  14. Jalal, R. N. U. D., Sargiacomo, M., & Sahar, N. U. (2020). Commodity prices, tax purpose recognition and Bitcoin volatility: Using ARCH/GARCH modeling. Journal of Asian Finance Economics and Business, 7, 251–257.
  15. Jones, D. R. (2001). A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21, 345–383.
  16. Jones, D. R., Schonlau, M., & Welch, W. J. (1998). Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13, 455–492.
  17. Joy, T. T., Rana, S., Gupta, S., & Venkatesh, S. (2016, December 4–8). Hyperparameter tuning for big data using Bayesian optimisation. 2016 23rd International Conference on Pattern Recognition (ICPR) (pp. 2574–2579), Cancun, Mexico.
  18. Joy, T. T., Rana, S., Gupta, S., & Venkatesh, S. (2020). Fast hyperparameter tuning using Bayesian optimization with directional derivatives. Knowledge-Based Systems, 205, 106247.
  19. Karim, M. M., Noman, A. H. M., Kabir Hassan, M., Khan, A. M., & Kawsar, N. H. (2024). Volatility spillover and dynamic correlation between Islamic, conventional, cryptocurrency and precious metal markets during the immediate outbreak of COVID-19 pandemic. International Journal of Islamic and Middle Eastern Finance and Management, 17, 662–692.
  20. Klein, A., Falkner, S., Bartels, S., Hennig, P., & Hutter, F. (2017). Fast Bayesian optimization of machine learning hyperparameters on large datasets. In Artificial intelligence and statistics (pp. 528–536). PMLR.
  21. Kushner, H., & Yin, G. (1997). Stochastic approximation algorithms and applications. Springer.
  22. Lévesque, J.-C. (2018). Bayesian hyperparameter optimization: Overfitting, ensembles and conditional spaces [Ph.D. thesis, Université Laval].
  23. Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., & Talwalkar, A. (2018). Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(185), 1–52.
  24. Lizotte, D. J. (2008). Practical Bayesian optimization. University of Alberta.
  25. Maghyereh, A., & Abdoh, H. (2022). COVID-19 and the volatility interlinkage between Bitcoin and financial assets. Empirical Economics, 63, 2875–2901.
  26. Melawati, & Gunarsih, T. (2023). EGARCH model: Volatility spillover analysis of Bitcoin price on altcoin and S&P 500 index. International Journal of Business Humanities Education and Social Sciences (IJBHES), 5, 101–110.
  27. Mockus, J., Tiesis, V., & Zilinskas, A. (1978). Bayesian methods for seeking the extremum. In Toward global optimization (Vol. 2). Elsevier.
  28. Nadarajah, S., Mba, J. C., Rakotomarolahy, P., & Ratolojanahary, H. T. J. E. (2025). Ensemble learning and adaptive neuro-fuzzy inference system for cryptocurrency volatility forecasting. Journal of Risk and Financial Management, 18, 52.
  29. Osborne, M., & Osborne, M. A. (2010). Bayesian Gaussian processes for sequential prediction, optimisation and quadrature [Ph.D. thesis, Oxford University].
  30. Putatunda, S., & Rama, K. (2019, December 20–22). A modified Bayesian optimization based hyper-parameter tuning approach for extreme gradient boosting. 2019 Fifteenth International Conference on Information Processing (ICINPRO) (pp. 1–6), Bengaluru, India.
  31. Rasmussen, C. E., & Williams, C. K. I. (2005). Gaussian processes for machine learning. The MIT Press.
  32. Santner, T. J. (2003). The design and analysis of computer experiments. Springer.
  33. Sapuric, S., Kokkinaki, A., & Georgiou, I. (2022). The relationship between Bitcoin returns, volatility and volume: Asymmetric GARCH modeling. Journal of Enterprise Information Management, 35, 1506–1521.
  34. Sasena, M. J. (2002). Flexibility and efficiency enhancements for constrained global design optimization with kriging approximations. University of Michigan.
  35. Sherman, J., & Morrison, W. J. (1949). Adjustment of an inverse matrix corresponding to changes in the elements of a given column or a given row of the original matrix. Annals of Mathematical Statistics, 20, 621–625.
  36. Srinivas, N., Krause, A., Kakade, S. M., & Seeger, M. (2009). Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv, arXiv:0912.3995.
  37. Törn, A., & Žilinskas, A. (1989). Global optimization (Vol. 350). Springer.
  38. Tunahan, H., Akkuş, H. T., & Çelik, I. (2020). Modeling, forecasting the cryptocurrency market volatility and value at risk dynamics of Bitcoin. Muhasebe Bilim Dünyası Dergisi, 22(2), 296–312.
  39. van Hoof, J., & Vanschoren, J. (2021). Hyperboost: Hyperparameter optimization by gradient boosting surrogate models. arXiv, arXiv:2101.02289.
  40. Verma, V. (2024). Exploring key XGBoost hyperparameters: A study on optimal search spaces and practical recommendations for regression and classification. International Journal of All Research Education and Scientific Methods, 12, 3259–3266.
  41. Wahba, G. (1990). Spline models for observational data. SIAM.
  42. Williams, C. K. I., & Rasmussen, C. E. (2006). Gaussian processes for machine learning (Vol. 2). MIT Press.
  43. Zulfiqar, N., & Gulzar, S. (2021). Implied volatility estimation of Bitcoin options and the stylized facts of option pricing. Financial Innovation, 7, 67.
Figure 1. Returns series.
Figure 2. Explored hyperparameter space (2D), explored hyperparameter space (3D), convergence of Bayesian optimization, and the XGBoost learning curve for the EI acquisition function.
Figure 3. Explored hyperparameter space (2D), explored hyperparameter space (3D), convergence of Bayesian optimization, and the XGBoost learning curve for the UCB acquisition function.
Figure 4. Explored hyperparameter space (2D), explored hyperparameter space (3D), convergence of Bayesian optimization, and the XGBoost learning curve for the POI acquisition function.
Table 1. Basic summary statistics of the returns.
| Statistics | Bitcoin Returns |
| --- | --- |
| Minimum | −0.2160 |
| Average | 0.0006 |
| Maximum | 0.0988 |
| Variance | 0.0003 |
| Standard deviation | 0.0170 |
| Skewness | −0.8287 |
| Kurtosis | 12.8550 |
Table 2. Initializations and hyperparameter ranges.
| Hyperparameters | Initialization | Ranges During Optimization |
| --- | --- | --- |
| eta | runif(5, 0.001, 0.3) | c(0.001, 0.3) |
| max-depth | sample(3:15, 5, replace = TRUE) | c(3, 15) |
| subsample | runif(5, 0.5, 1) | c(0.5, 1) |
| colsample-bytree | runif(5, 0.1, 1) | c(0.1, 1) |
| min-child-weight | sample(1:20, 5, replace = TRUE) | c(1, 20) |
Table 3. Optimal hyperparameter values for each acquisition function and the best RMSE.
| Hyperparameter | EI | POI | UCB |
| --- | --- | --- | --- |
| eta | 0.2020 | 0.0697 | 0.1026 |
| max-depth | 3 | 4 | 4 |
| subsample | 0.7003 | 0.7278 | 0.8393 |
| colsample-bytree | 1 | 0.7046 | 0.6788 |
| min-child-weight | 10 | 6 | 4 |
| RMSE | 0.0170 | 0.0170 | 0.0170 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Nadarajah, S.; Mba, J.C.; Ravonimanantsoa, N.M.V.; Rakotomarolahy, P.; Ratolojanahary, H.T.J.E. Empirical Calibration of XGBoost Model Hyperparameters Using the Bayesian Optimisation Method: The Case of Bitcoin Volatility. J. Risk Financial Manag. 2025, 18, 487. https://doi.org/10.3390/jrfm18090487
