Article

Enhancing Short-Term Wind Energy Forecasting with XGBoost and Conformal Prediction for Robust Uncertainty Quantification

by
Rabelani Innocent Nthangeni
,
Caston Sigauke
*,
Thakhani Ravele
and
Thinawanga Hangwani Tshisikhawe
Department of Mathematical and Computational Sciences, University of Venda, Thohoyandou 0950, South Africa
*
Author to whom correspondence should be addressed.
Computation 2026, 14(3), 56; https://doi.org/10.3390/computation14030056
Submission received: 22 January 2026 / Revised: 19 February 2026 / Accepted: 27 February 2026 / Published: 1 March 2026
(This article belongs to the Section Computational Engineering)

Abstract

This paper presents probabilistic wind energy forecasting using quantile regression averaging combined with a conformal prediction modelling framework. The study uses data from Eskom, South Africa’s power utility company, covering April 2019 to November 2023. A partial linear additive quantile regression (PLAQR) averaging method is used to combine forecasts from two competing forecasting models: eXtreme Gradient Boosting (XGBoost) and Principal Component Regression (PCR). To compare the predictive abilities of the models, two data splits are used for training, validation and testing: 80%, 10% and 10% for the first set, and 85%, 10% and 5% for the second set. Empirical results suggest that the combined predictions from PLAQR outperform the individual models, significantly improving calibration and accuracy. The proposed combination has the smallest root mean square error (RMSE) and the highest probability of change in direction (POCID); it captures nonlinearities and produces well-calibrated probabilistic forecasts, as validated by probability integral transform histograms. The additional gain observed with the larger training split reflects the importance of data volume. These findings reinforce that the PLAQR model, which combines the benefits of tree-based approaches and linear models, is a robust modelling approach for reliable renewable energy forecasting. Future research directions should consider more varied ensembles.

1. Introduction

1.1. Research Motivation

The need for accurate probabilistic wind energy forecasting is not an abstract concept but an imperative for South Africa’s economy. As the country moves towards a diversified energy mix, the inherent variability of wind-based electricity introduces substantial uncertainty into the power system. Deterministic forecasting methods, which provide only a single point estimate of the electricity that wind turbines will generate, are not sufficient on their own. Not only do they fail to convey the range of possible generation outcomes, but they can also result in substantial economic losses, as grid operators must keep costly fossil-fuel reserves on standby or risk curtailing wind generation when output deviates from the forecast.

1.2. Literature Review

Wind energy is one of the cornerstones of renewable energy production worldwide [1]. Nevertheless, the unpredictable nature of wind speed, influenced by spatial and temporal variability, has been a major challenge for power grid management [2]. As a result, short-term wind energy forecasting, which aims to predict wind energy production from minutes to days in advance [3], has become a crucial task for ensuring the stability of power grids, minimising costs, and optimising the use of renewable energy resources [4,5]. Despite the development of sophisticated forecasting models, wind energy forecasts remain error-prone owing to the chaotic nature of wind [2]. This paper aims to address these challenges by developing a short-term wind energy prediction model using the XGBoost algorithm and comparing it with the Principal Component Regression method. Furthermore, it aims to improve the model by incorporating conformal prediction. To situate the present work within the existing body of research, this literature review highlights the development of wind energy forecasting techniques both chronologically and topically.

1.2.1. Evolution of Forecasting Methods: From Statistics to Machine Learning

In the early stages of wind forecasting, statistical and time series models were used. Methods such as the Autoregressive Integrated Moving Average (ARIMA) model and the Persistence Model (PM) were popular owing to their simplicity [6]. Nevertheless, these classical models are based on linearity and stationarity, which makes it difficult for them to represent the nonlinear relationships between wind power and meteorological variables [7]. This inherent drawback led to the use of more sophisticated models.
The emergence of machine learning (ML) algorithms brought a major shift in both theme and timeline. The development of Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs) demonstrated greater ability to model nonlinear processes [8]. Recent studies have shown that ML algorithms, including Random Forest (RF), XGBoost, K-Nearest Neighbour (KNN), and Multi-Layer Perceptron (MLP), have been able to produce more accurate results than their statistical counterparts [4]. For example, gradient boosting machines (GBMs) achieved a normalised mean absolute error (NMAE) of 5.15% in short-term forecasting, outperforming traditional methods [9]. RF and GBM algorithms have further pushed the boundaries of forecasting by allowing multiple learners to work together to produce more accurate predictions [10].

1.2.2. XGBoost and Hyperparameter Optimisation in Wind Forecasting

Under the ML framework, XGBoost has proven to be a highly effective algorithm because of its efficiency, scalability, and regularisation properties [11]. Its use in wind energy forecasting has been well established. For instance, ref. [12] developed a Bayesian-optimised XGBoost (BO-XGBoost) model that performed better than SVMs, Kernel Extreme Learning Machine (KELM), and Long Short-Term Memory (LSTM) models under different testing scenarios, including adverse weather conditions [13].
The performance of XGBoost relies heavily on hyperparameter optimisation. Bayesian optimisation is highly effective, consistently outperforming grid search by exploring the complex hyperparameter space more efficiently [12]. Optimisation techniques have also been coupled with feature engineering; ref. [5] coupled XGBoost with LSTM and technical analysis tools such as MACD, resulting in a highly accurate normalised mean absolute error of 0.0396. Although deep learning architectures such as CNN-GRU may occasionally outperform XGBoost, it remains a highly competitive and reliable option for day-ahead forecasting problems [14]. Hybrid models continue to break new ground in performance, with techniques such as Boost-LR (combining XGBoost, CatBoost, and RF) resulting in substantial reductions in error [15].

1.2.3. Uncertainty Quantification and the Role of PCR

The literature indicates a critical gap in quantifying prediction uncertainty. Most studies, including those mentioned above, focus on point forecast accuracy (e.g., MAE, RMSE) and provide little information about forecast uncertainty. This is a major drawback for grid managers who require risk assessment and informed decision-making under uncertainty. Although [2] investigated probabilistic forecasting using ensemble approaches, and statistical tests such as the Diebold–Mariano test exist for model comparison, a comprehensive framework for building prediction intervals with statistical guarantees has yet to be explored [14].
Conformal prediction, proposed by [16], provides a remedy by offering valid prediction intervals with a prescribed confidence level [17]. Although its integration with ML models has demonstrated potential for quantifying uncertainty measures in other applications [18], its extension to wind energy forecasting has yet to be explored. A detailed discussion of conformal prediction is given in [19].
Moreover, there is a significant gap in relation to Principal Component Regression (PCR). Although PCR combines PCA and linear regression to address multicollinearity, a problem often encountered in meteorological data, it is rarely reported in the literature for wind energy estimation. There are no comparisons of its performance with the latest ML approaches.

1.2.4. Summary of the Literature and Research Gap

While the literature clearly shows an evolution from statistical models to sophisticated machine learning methods like XGBoost, which has been shown to improve point forecast accuracy considerably, there has been a lack of research on uncertainty quantification. Moreover, there has been a lack of research on the potential of simple yet effective techniques such as PCR. This research aims to bridge this gap by developing a framework that combines the power of XGBoost with the accuracy of conformal prediction to provide not only point forecasts but also prediction intervals, serving as a benchmark for techniques like PCR. This will address the first and foremost requirement of modern power management systems. Table 1 summarises the key studies discussed, highlighting their methodologies, focus, and limitations.

1.3. Contributions and Research Highlights

The key innovation in this work lies in the combination of tree-based and linear modelling methods, using Partial Linear Additive Quantile Regression (PLAQR), which is theoretically grounded in the unique properties of wind energy time-series data. This is justified as follows:
  • The time series of wind energy data has two essential characteristics that underpin our hybrid model: (i) nonlinear transitions between regimes driven by atmospheric stability constraints, which are addressed by the tree-based model, and (ii) linear trend components during stable atmospheric regimes, which are addressed by the linear component of PLAQR. Our hybrid model is not a result of heuristics, but rather a consequence of the physical insight that wind energy production operates on multiple scales: nonlinear atmospheric processes control transitions between regimes, while linear relationships dominate during stable operation. PLAQR captures this hierarchical structure by permitting tree-based models to divide the feature space into regions where linear relationships hold.
  • The stochastic process of wind generation, which is heteroscedastic and non-stationary, makes it difficult to apply conventional parametric uncertainty analysis. Conformal prediction is especially useful in this case because it is a distribution-free method for uncertainty analysis that holds under the actual data-generating process, which is unknown. In the context of wind energy prediction, where the distribution varies with weather and seasonal patterns, this is especially useful because it ensures that the prediction intervals have the correct coverage regardless of the true error distribution.
  • Instead of offering a heuristic combination, PLAQR provides calibrated probabilistic predictions via the theoretical guarantee of finite-sample coverage provided by the conformal framework for prediction sets. The incorporation of tree-based nonlinearities improves point prediction accuracy (lower RMSE) and directional correctness (higher POCID), while preserving valid estimates of uncertainty, as verified by Probability Integral Transform (PIT) histograms. This tackles the inherent trade-off between sharpness and calibration in probabilistic forecasting.
  • The improvement in performance with an increase in the amount of training data from 80% to 85% is not only empirical but also reflects the consistency properties of both the ensemble technique and conformal prediction. As data volumes increase, tree-based techniques will be able to distinguish between smaller regimes. On the other hand, the non-asymptotic properties of the conformal framework will be more refined.
  • The proposed modelling framework offers a template for forecasting renewable energy, addressing the key challenge of producing point forecasts and uncertainty measures for a non-standard data-generating process. The underlying principles of the approach, including regime-based hybrid modelling and distribution-free uncertainty quantification, can be applied to other areas of renewable energies.
The remainder of this paper is structured as follows. Section 2 introduces the models, Section 3 presents the empirical results, and Section 4 provides a detailed discussion of these findings. Finally, Section 5 offers concluding remarks.

2. Models

The modelling framework proposed in this study is given in Figure 1. The wind energy prediction system uses the eXtreme Gradient Boosting (XGBoost) model with conformal prediction. A comparative analysis is done with Principal Component Regression (PCR). After preparing the data with wind energy as the target variable, the model selection step is where the flowchart branches out. In the XGBoost part of the flowchart, the model uses 80–10–10 and 85–10–5 data splits for training, validation and testing sets, respectively. The training is done with parameters such as max_depth = 6 and η = 0.1, uses early stopping, and computes feature importance before performing conformal prediction with α = 0.05. In the PCR part of the flowchart, the model applies the same techniques, uses leave-one-out cross-validation, and finally selects the optimal components. The models then generate forecasts, compute MAE, RMSE, PICP, and MPIW, and produce visualisations.

2.1. eXtreme Gradient Boosting

Tianqi Chen and Carlos Guestrin introduced XGBoost in 2016 [22]. It builds on gradient boosting [23]. XGBoost is a highly scalable and efficient gradient-boosting algorithm [24]. Gradient boosting is an ensemble learning technique that sequentially constructs multiple decision trees. Each new tree is trained to predict the errors (residuals) made by the previous trees, enabling iterative improvements in overall prediction accuracy. This process results in a strong predictive model.
The key principles of gradient boosting are as follows:

2.1.1. Additive Learning

Additive learning in XGBoost is a boosting ensemble technique where the predictive model is built iteratively. This process involves sequentially adding new decision trees to an already trained ensemble. In XGBoost ensemble methods, additive learning builds the final prediction model gradually. It starts with an initial model, and in each iteration, a new weak learner f_m(x) is trained to address the errors of the ensemble F_{m-1}(x). This weak learner is then added to the ensemble, thereby improving the overall model. A mathematical representation of this is given in Equation (1).
F_m(x) = F_{m-1}(x) + f_m(x)    (1)

2.1.2. Loss Function

The loss function quantifies the discrepancy between the predicted values and the actual values. XGBoost can handle a wide range of loss functions, depending on the problem being tackled. The loss function used in regression problems is the Mean Squared Error, given in Equation (2).
L(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2,    (2)
where y_i is the actual value and \hat{y}_i is the predicted value.

2.1.3. Regularisation

Regularisation is a crucial set of techniques to prevent the model from overfitting the training data. XGBoost incorporates a regularisation term, Ω(f_j), into its overall objective function Θ. The objective function aims to minimise both the prediction error (measured by the loss function L(y_i, \hat{y}_i)) and the model complexity (measured by the regularisation term). The objective function is as follows:
Θ = \sum_{i=1}^{M} L(y_i, \hat{y}_i) + \sum_{j=1}^{J} Ω(f_j)
The specific form of the regularisation term used here is
Ω(f_j) = γT + \frac{1}{2} λ ||w||_2^2,
where T is the number of leaf nodes and w is the vector of leaf weights, which together specify the complexity of the tree; γ and λ are the parameters controlling complexity. Larger values impose a heavier penalty on complexity and therefore favour simpler tree structures.
By adding this regularisation term to the objective function, XGBoost does not just minimise error on the training data; it also builds an inherently simpler, more generalisable model. Rather than fitting the entire model at once, it is optimised iteratively. We begin with an initial prediction \hat{y}_i^{(0)} = 0. At each step, we add a new tree to enhance the model. The updated prediction after adding the t-th tree can be expressed as follows:
\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)
In decision trees, boosting is used during the model’s training to minimise the objective function. This technique involves iteratively adding a new function f to the existing model. Therefore, in the t-th iteration, a new function is added as follows:
Θ^{(t)} = \sum_{i=1}^{M} L(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + Ω(f_t)
The algorithm can also handle missing data and make precise decisions on where to split data based on gains. Further, XGBoost relies on post-pruning for improving efficiency. It is well-known for its scalability and flexibility, making it one of the most favourable algorithms for handling big data. Though XGBoost has many advantages over other learning algorithms, its parameters must be carefully tuned for better performance [20].
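As a concrete illustration of this training loop, the sketch below uses the Python xgboost package with the squared-error objective of Equation (2), the max_depth = 6 and η = 0.1 settings described in Section 2, and early stopping on a validation set. The arrays, feature count and round limit are illustrative placeholders rather than the study’s actual configuration.

```python
import numpy as np
import xgboost as xgb

# Illustrative data; in the study the columns would be the lagged wind-energy
# features (difLag1, difLag2, difLag12, difLag24, Hour, Day, noltrend).
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(1000, 7)), rng.normal(size=1000)
X_val, y_val = rng.normal(size=(200, 7)), rng.normal(size=200)

params = {
    "objective": "reg:squarederror",  # squared-error loss, Equation (2)
    "eta": 0.1,                       # learning rate
    "max_depth": 6,                   # maximum tree depth
    "lambda": 1.0,                    # L2 penalty on leaf weights (regularisation term)
    "gamma": 0.0,                     # minimum loss reduction required to split
}
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

# Early stopping monitors validation RMSE and retains the best boosting round.
booster = xgb.train(
    params,
    dtrain,
    num_boost_round=500,
    evals=[(dval, "validation")],
    early_stopping_rounds=20,
    verbose_eval=False,
)
print("best number of rounds:", booster.best_iteration + 1)
```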

2.2. Principal Component Regression

Principal Component Regression (PCR) is used as a benchmark model in this study. It is a dimension-reduction technique useful when multicollinearity exists among explanatory variables in the multiple regression framework. A standard multiple linear regression model is defined as:
Y = Xβ + ε,
where Y represents the vector of observed values, X denotes a matrix of explanatory variables, β represents the parameter vector, and ε is the vector of error terms. The least squares estimator of the parameter vector is expressed as follows:
\hat{β} = (X^T X)^{-1} X^T Y
The challenge is that X^T X may at times be singular, owing either to multicollinearity or to the number of variables exceeding the number of sample observations. PCR addresses this issue by transforming the original matrix X into a lower-dimensional orthogonal space using Singular Value Decomposition (SVD).
To get our first m principal components, we use SVD to approximate the X matrix:
X = \tilde{X}^{(m)} + ε_X = (U^{(m)} D^{(m)}) V^{(m)T} + ε_X = T^{(m)} P^{(m)T} + ε_X,
where T represents orthogonal scores, while P denotes loadings. Both U and V are orthonormal and the matrix D is diagonal with positive real entries. Consequently, regressing Y on the scores results in:
\hat{β} = P (T^T T)^{-1} T^T Y
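A minimal sketch of PCR is shown below, assuming scikit-learn: the predictors are standardised, projected onto the first m principal components (the scores T), and the response is then regressed on those scores by ordinary least squares. The data and the choice m = 4 are purely illustrative; in the study the number of components is chosen by cross-validation (see Sections 3.3 and 3.4).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative, possibly collinear predictors and a response.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 7))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=500)   # induce multicollinearity
y = X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=500)

# PCR: standardise -> project onto m principal components -> least squares.
m = 4
pcr = make_pipeline(StandardScaler(), PCA(n_components=m), LinearRegression())
pcr.fit(X, y)
y_hat = pcr.predict(X)
```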

2.3. Quantile Regression

Quantile regression (QR) is a modelling framework for estimating conditional quantiles of the response variable and was developed by Koenker and Bassett [25]. If we let Y represent a random variable with corresponding covariates X, then the conditional quantile q_{Y|X}(τ), where τ ∈ (0, 1), is defined as q_{Y|X}(τ) = \inf\{y ∈ \mathbb{R} : F_{Y|X}(y) ≥ τ\}, where F_{Y|X} represents the conditional distribution of Y given X. The conditional quantile q_{Y|X}(τ) is a solution to
q_{Y|X}(τ) = \arg\min_{g} E[ρ_τ(Y - g(X)) | X],    (11)
where ρ_τ(·) is the pinball loss function defined as ρ_τ(u) = u(τ - I(u < 0)) and I(·) is an indicator function. Now, let Y_t = X_t β + ε_t be a linear quantile regression model where Y_t denotes wind energy, X_t is the design matrix, β is a vector of parameters and ε_t is the error term; then, the estimates of β are given as
\hat{β}_τ = \arg\min_{β ∈ \mathbb{R}^p} \sum_{t=1}^{n} ρ_τ(Y_t - X_t β).
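For reference, the pinball loss can be evaluated with a few lines of NumPy; this is only a sketch of the loss itself, with illustrative numbers, not the full quantile regression fit.

```python
import numpy as np

def pinball_loss(y, y_hat, tau):
    """Average pinball loss rho_tau(u) = u * (tau - I(u < 0)), with u = y - y_hat."""
    u = y - y_hat
    return np.mean(u * (tau - (u < 0).astype(float)))

# The tau = 0.5 loss is half the absolute error, so it is minimised at the median.
y = np.array([1.0, 2.0, 4.0])
print(pinball_loss(y, np.full(3, 2.0), tau=0.5))   # evaluated at the sample median
```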

2.4. Partial Linear Additive Quantile Regression Framework for Forecast Combination

To go beyond the linear combination of forecasts, we propose a partial linear additive quantile regression model. The proposed model enables us to combine predictions from XGBoost and PCR models. The proposed model assumes that the true conditional quantile can be expressed as a linear combination of base forecasts plus a nonlinear adjustment term [26].
Let:
  • f t XGB be the point forecast from an XGBoost model at time t.
  • f t PCR be the point forecast from a PCR model at time t.
  • Y t be the actual realized value.
We define our combined forecast as the output of a partial linear additive quantile regression model. For a given quantile τ, the conditional quantile function is:
q_{Y|F}(τ | f_t^{XGB}, f_t^{PCR}) = β_0(τ) + β_1(τ) f_t^{XGB} + β_2(τ) f_t^{PCR} + g(f_t^{XGB}, f_t^{PCR}),
where β_0(τ) is the intercept; β_1(τ) and β_2(τ) are the linear weights for the two base forecasts, forming the linear component; and g(·,·) is a smooth, unknown function that captures the nonlinear interactions and residual patterns not accounted for by the linear combination. The parameters β_0, β_1, β_2 and the function g are estimated simultaneously for a given τ by minimising a regularised version of the quantile loss function:
\arg\min_{β_0, β_1, β_2, g} \sum_{t=1}^{T} ρ_τ(Y_t - [β_0 + β_1 f_t^{XGB} + β_2 f_t^{PCR} + g(f_t^{XGB}, f_t^{PCR})]) + λ · J(g),
where T is the number of time points in the training set; ρ_τ(·) is the pinball loss as defined in Equation (11); J(g) is a penalty term that enforces smoothness on the function g (e.g., the integral of the squared second derivatives); and λ is a smoothing parameter that controls the trade-off between fitting the data and the smoothness of g. This is an additive model estimated with a quantile loss objective.
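The sketch below is one way to approximate this combination in Python; it is not the exact PLAQR estimator used in the study. The linear part consists of the two base forecasts, the smooth term g(·,·) is approximated by additive B-spline expansions of each forecast (scikit-learn’s SplineTransformer), and the pinball-loss fit with a small L1 penalty (scikit-learn’s QuantileRegressor) stands in for the smoothness penalty λ·J(g). The data, knot counts and penalty value are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import QuantileRegressor
from sklearn.preprocessing import SplineTransformer

# Illustrative base forecasts; in the study these come from XGBoost and PCR.
rng = np.random.default_rng(2)
f_xgb = rng.uniform(0, 3000, size=400)
f_pcr = f_xgb + rng.normal(scale=150, size=400)
y = 0.6 * f_xgb + 0.4 * f_pcr + rng.normal(scale=100, size=400)

# Linear component: the two base forecasts. Nonlinear adjustment: additive
# B-spline expansions of each forecast, standing in for g(f_xgb, f_pcr).
base = np.column_stack([f_xgb, f_pcr])
splines = SplineTransformer(n_knots=5, degree=3).fit_transform(base)
design = np.column_stack([base, splines])

tau = 0.5                                  # repeat over a grid of quantiles if needed
combiner = QuantileRegressor(quantile=tau, alpha=1e-4, solver="highs")
combiner.fit(design, y)
combined_forecast = combiner.predict(design)
```

In practice the combiner would be fitted on validation-period forecasts and then applied to test-period forecasts, with a separate fit for each quantile τ of interest.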

2.5. Evaluation Metrics

Evaluation metrics are essential for determining the ability of forecasting methods to predict future time series data. In the subsequent sections, we discuss some performance evaluation metrics.

2.5.1. Root Mean Square Error

The root mean square error (RMSE) is the square root of the average squared difference between the predicted and actual values, indicating how well the model’s predictions fit the actual data. The RMSE is calculated using Equation (15).
RMSE = \sqrt{\frac{1}{m} \sum_{t=1}^{m} (y_t - \hat{y}_t)^2},    (15)
where y_t is the actual value, \hat{y}_t denotes the predicted value of the t-th observation and m represents the total number of predictions. A lower RMSE value indicates that the predictions are closer to the actual data points and hence better model performance.

2.5.2. Mean Absolute Error

The mean absolute error (MAE) is the average absolute difference between the predicted and actual values. The formula for MAE is given in Equation (16).
MAE = \frac{1}{m} \sum_{t=1}^{m} |y_t - \hat{y}_t|,    (16)
where y_t, \hat{y}_t and m are as defined in Equation (15). A lower MAE indicates that the model’s predictions are closer to the actual values.

2.5.3. Mean Absolute Scaled Error

The mean absolute scaled error (MASE) is a forecast evaluation metric which compares the forecast errors to those of a naive benchmark model. The MASE is defined by Equation (17).
MASE = \frac{\frac{1}{m} \sum_{t=1}^{m} |y_t - \hat{y}_t|}{\frac{1}{m-1} \sum_{t=2}^{m} |y_t - y_{t-1}|},    (17)
where y_t, \hat{y}_t and m are as defined in Equation (15), and y_{t-1} denotes the actual value of the (t-1)-th observation used by the naive benchmark. A low MASE value is desirable as it indicates better predictive performance.

2.5.4. Mean Bias Error

The mean bias error (MBE) measures the average bias in the predictions. It can be positive or negative, indicating whether the model overestimates or underestimates the actual values. The MBE is given by Equation (18).
MBE = \frac{1}{m} \sum_{t=1}^{m} (y_t - \hat{y}_t),    (18)
where y_t, \hat{y}_t and m are as defined in Equation (15).
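A compact NumPy sketch of these point-forecast metrics (Equations (15)–(18)) is given below; the MASE denominator uses the one-step naive forecast of the evaluation series, consistent with Equation (17).

```python
import numpy as np

def point_metrics(y, y_hat):
    """RMSE, MAE, MASE and MBE as in Equations (15)-(18)."""
    err = y - y_hat
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    naive_mae = np.mean(np.abs(np.diff(y)))   # one-step naive benchmark error
    mase = mae / naive_mae
    mbe = np.mean(err)
    return {"RMSE": rmse, "MAE": mae, "MASE": mase, "MBE": mbe}
```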

2.5.5. Prediction of Change in Direction

A major disadvantage of using evaluation metrics such as RMSE, MAE, MASE, and MBE is that they do not account for changes in forecast direction. This drawback is overcome by using the prediction of change in direction (POCID), which counts the number of correct direction changes [27]. The POCID is calculated using Equation (19).
POCID = \frac{100}{m} \sum_{t=1}^{m} D_t,    (19)
where m is the total number of forecasted periods and D_t is a directional indicator function for time period t, defined as:
D_t = 1 if (y_t - y_{t-1})(\hat{y}_t - \hat{y}_{t-1}) > 0, and D_t = 0 otherwise.
The condition (y_t - y_{t-1})(\hat{y}_t - \hat{y}_{t-1}) > 0 checks if the predicted and actual movements have the same sign. The actual change is denoted by y_t - y_{t-1}, while \hat{y}_t - \hat{y}_{t-1} is the predicted change. If the product is positive, both changes are in the same direction (both positive or both negative), and the prediction is counted as correct (D_t = 1). However, if the product is zero or negative, the directions did not match, and the prediction is counted as incorrect (D_t = 0).
A major drawback of POCID is that it does not consider the closeness of the predictions to the actual values. To address this, ref. [27] developed a fitness metric that combines POCID and MSE. In this study, we extend the fitness metric by using the RMSE together with a data-driven weight w. The modified fitness metric is given in Equation (20).
Fitness = \frac{POCID}{1 + w · RMSE},    (20)
where w is calculated as follows
w = \frac{1}{\frac{1}{k} \sum_{i=1}^{k} RMSE_i} = \frac{k}{\sum_{i=1}^{k} RMSE_i},
with k representing the number of models. Therefore
Fitness = \frac{POCID}{1 + \frac{k}{\sum_{i=1}^{k} RMSE_i} · RMSE_i}
Higher values of the fitness metric indicate better model performance in predicting fluctuations and greater prediction accuracy.
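The directional and fitness metrics can be computed as in the sketch below, with the data-driven weight w taken as the inverse of the average RMSE across the k competing models; the example values are illustrative only.

```python
import numpy as np

def pocid(y, y_hat):
    """Share of correctly predicted changes in direction (Equation (19)),
    computed over the consecutive pairs of observations."""
    return 100.0 * np.mean((np.diff(y) * np.diff(y_hat)) > 0)

def fitness(pocid_value, rmse, rmse_all_models):
    """Modified fitness metric of Equation (20), with w = k / sum(RMSE_i)."""
    w = len(rmse_all_models) / np.sum(rmse_all_models)
    return pocid_value / (1.0 + w * rmse)

# Three competing models with RMSEs around 180-190 give w close to the
# 0.006 weight reported in Section 3.6.
rmses = [179.0, 184.0, 189.0]
print(fitness(71.2, rmses[0], rmses))
```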

2.6. Conformal Prediction

Conformal Prediction (CP) is a modelling approach which produces prediction sets or prediction intervals with a specified coverage probability [19]. Given a user-specified error probability α, CP guarantees that the true target output Y_new for a new input X_new is included in the predicted set C(X_new) with probability at least 1 - α [28]. CP is model-agnostic and provides finite-sample guarantees under only the assumption of exchangeability.

2.6.1. Mathematical Framework

The most computationally efficient variant of CP is Split Conformal Prediction (SCP) [19]. This variant requires a pre-fitted model \hat{f}, trained on a proper training set, together with a separate calibration set.
Nonconformity Measure
A nonconformity function V(x, y) measures how unusual a data point is relative to the previously observed data. For regression, the absolute residual is a popular choice:
V(x, y) = |y - \hat{f}(x)|
Calibration
Given a calibration set J = {(x_1, y_1), …, (x_n, y_n)} of size n = n_cal:
  • Compute nonconformity scores for all points in J:
    s_i = V(x_i, y_i) = |y_i - \hat{f}(x_i)|,  i = 1, …, n
  • For a desired miscoverage rate α, calculate the critical quantile q from the empirical distribution of the scores:
    q = Quantile({s_1, …, s_n}; ⌈(n + 1)(1 - α)⌉ / n)
Prediction Interval
For a new test point x_new, the prediction interval is [19]:
C(x_new) = [\hat{f}(x_new) - q, \hat{f}(x_new) + q]
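A minimal NumPy sketch of split conformal prediction with absolute-residual scores is given below; `model` is assumed to be any pre-fitted regressor exposing a predict method (e.g., the XGBoost or PCR models above), and the quantile call assumes a recent NumPy that supports the `method` argument.

```python
import numpy as np

def split_conformal_interval(model, X_cal, y_cal, X_new, alpha=0.05):
    """Split conformal prediction: calibrate q on held-out residuals, then
    return the symmetric interval [f(x_new) - q, f(x_new) + q]."""
    scores = np.abs(y_cal - model.predict(X_cal))         # s_i = |y_i - f(x_i)|
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)  # finite-sample correction
    q = np.quantile(scores, level, method="higher")
    pred = model.predict(X_new)
    return pred - q, pred + q
```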

2.6.2. Coverage Guarantee

The fundamental guarantee of CP is marginal coverage. For a new exchangeable test point (X_new, Y_new) [19]:
P(Y_new ∈ C(X_new)) ≥ 1 - α
This probability holds over the randomness in both the calibration set and the new test point.
Proof. 
Consider the combined set of calibration scores {s_1, …, s_n} and the test score s_new = V(x_new, y_new). By exchangeability, s_new is equally likely to rank anywhere among these n + 1 scores. The probability that s_new is less than or equal to the (1 - α) empirical quantile of the calibration scores is therefore at least 1 - α. Since Y_new ∈ C(X_new) if and only if s_new ≤ q, the coverage guarantee holds for any finite n. □

2.7. Prediction Interval Evaluation Metrics

2.7.1. Prediction Interval Coverage Probability

The Prediction Interval Coverage Probability (PICP) evaluation metric calculates the percentage of actual observations that lie within their corresponding predicted intervals. It is the key measure of the validity or reliability of an interval [29]. The formula for PICP is given in Equation (22).
PICP = \frac{1}{m} \sum_{t=1}^{m} c_t,  where c_t = 1 if y_t ∈ [L_t, U_t] and c_t = 0 if y_t ∉ [L_t, U_t],    (22)
where y_t is the actual value of the t-th observation, [L_t, U_t] is the prediction interval with lower bound L_t and upper bound U_t for that observation, and m is the total number of predicted observations.

2.7.2. Mean Prediction Interval Width

The Mean Prediction Interval Width (MPIW) measures the average width of prediction intervals and quantifies their sharpness or precision. A narrower interval is more informative, provided it maintains valid coverage [29]. The formula for MPIW is given by:
MPIW = \frac{1}{m} \sum_{t=1}^{m} (U_t - L_t),
where U_t, L_t and m are as defined in Equation (22). A lower MPIW is desirable, but it should only be used to compare models that have already achieved a valid PICP.

2.7.3. Coverage Width-Based Criterion

The Coverage Width-based Criterion (CWC) is a score that penalises both low coverage and large interval widths. It provides a single value to be minimised [29]. One common formulation is:
CWC = MPIW · [1 + γ(PICP) · e^{-η(PICP - μ)}],  where γ(PICP) = 1 if PICP < μ and γ(PICP) = 0 if PICP ≥ μ,
where μ is the nominal coverage rate, γ(PICP) is an indicator function and η is a scaling parameter that controls how heavily under-coverage is penalised.
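The three interval metrics can be computed jointly as sketched below; the nominal coverage μ = 0.95 matches α = 0.05, while the penalty η = 50 is an illustrative choice rather than a value taken from the study.

```python
import numpy as np

def interval_metrics(y, lower, upper, mu=0.95, eta=50.0):
    """PICP, MPIW and CWC for a set of prediction intervals."""
    covered = (y >= lower) & (y <= upper)
    picp = np.mean(covered)
    mpiw = np.mean(upper - lower)
    gamma = 1.0 if picp < mu else 0.0          # penalise only under-coverage
    cwc = mpiw * (1.0 + gamma * np.exp(-eta * (picp - mu)))
    return {"PICP": picp, "MPIW": mpiw, "CWC": cwc}
```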

2.7.4. Probability Integral Transform Histogram

The Probability Integral Transform (PIT) is a graphical method for evaluating the probabilistic calibration of an entire predictive distribution, not just a single interval. It is calculated by applying the forecasted Cumulative Distribution Function (CDF) to its corresponding actual observation [30]. The CDF is given in Equation (25).
z_t = F_t(y_t),    (25)
where y_t is the actual value of the t-th observation and F_t is the forecasted CDF for that observation. The resulting set of values {z_1, …, z_m} is then plotted as a histogram. For a perfectly calibrated forecast, the z_t values should be uniformly distributed on [0, 1], resulting in a flat histogram.
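When the predictive distribution is available only through conformal scores, an empirical PIT value can be obtained by ranking each test residual among the calibration residuals, as in the sketch below; this mirrors the conformal p-values used for Figures 10 and 11, though the exact construction in the study may differ.

```python
import numpy as np

def pit_values(calibration_residuals, test_residuals):
    """Empirical PIT values z_t from signed residuals: each test residual is
    ranked within the calibration residuals (a conformal p-value); a roughly
    flat histogram of the returned values indicates good calibration."""
    cal = np.sort(np.asarray(calibration_residuals))
    ranks = np.searchsorted(cal, np.asarray(test_residuals), side="right")
    return (ranks + 1.0) / (cal.size + 1.0)
```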

2.7.5. Diebold–Mariano Test

In addition to computing forecast accuracy measures, it may be necessary to test whether the differences in forecast accuracy are statistically significant. For the purpose of checking the predictive accuracy of the rival models, the Diebold–Mariano test is applied in this research, as proposed by [31] and discussed in [32].
Let y_{t,τ} for t = 1, …, m be the observed wind energy values, and let \hat{y}_{i,t,τ} and \hat{y}_{j,t,τ} be two different forecasts obtained from models i and j, respectively, for i ≠ j and i, j = 1, 2, …, K. The forecast errors are defined as ε_{i,t,τ} = \hat{y}_{i,t,τ} - y_{t,τ} for each model. Let g(ε_{i,t,τ}) be a loss function of the forecast errors, for example
g(ε_{i,t,τ}) = e^{λ ε_{i,t,τ}} - 1 - λ ε_{i,t,τ}.
The loss differential series between the two forecasts is then constructed as [32]:
d_t = g(ε_{1,t,τ}) - g(ε_{2,t,τ})
The null hypothesis tested by the Diebold–Mariano test is that both forecasts have equal forecasting accuracy, which is expressed as H_0: E(d_t) = 0. In contrast, the alternative hypothesis is H_1: E(d_t) ≠ 0, which indicates a statistically significant difference in forecast performance.
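A simple implementation for one-step-ahead forecasts is sketched below, using the squared-error loss for g; for multi-step forecasts the variance of the loss differential should be replaced by a HAC (Newey–West) long-run estimate, and the exponential loss above can be substituted for the lambda function.

```python
import numpy as np
from scipy import stats

def diebold_mariano(e1, e2, loss=lambda e: e ** 2):
    """Diebold-Mariano test on the loss differential d_t = g(e1_t) - g(e2_t).
    Returns the DM statistic and a two-sided p-value (Student-t reference)."""
    d = loss(np.asarray(e1)) - loss(np.asarray(e2))
    m = d.size
    dm_stat = np.mean(d) / np.sqrt(np.var(d, ddof=1) / m)
    p_value = 2.0 * (1.0 - stats.t.cdf(abs(dm_stat), df=m - 1))
    return dm_stat, p_value
```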

3. Results

3.1. Exploratory Data Analysis

3.1.1. Data Source

This study uses wind energy data sourced from Eskom, South Africa’s power utility company. This data is freely available at https://www.eskom.co.za/dataportal/, accessed on 5 February 2025. The data is from 2 April 2019 to 28 November 2023.

3.1.2. Data Characteristics

The data will be split into two sets: (80% training, 10% validation, and 10% test) and (85% training, 10% validation, and 5% test). The response variable in this study is wind energy, for which we will make predictions. The explanatory variables in the dataset are as follows:
  • difLag1 is the first-order hourly difference of the wind energy produced time series: Y_t - Y_{t-1}
  • difLag2 is the second-order hourly difference of the wind energy produced time series: Y_t - Y_{t-2}
  • difLag12 is the difference from half a day prior: Y_t - Y_{t-12}
  • difLag24 is the daily seasonal difference: Y_t - Y_{t-24}
  • Hour represents the hour of the day (0 to 23).
  • Day represents the day of the week.
  • noltrend is the estimated nonlinear trend component of the wind energy produced time series. This component was extracted through seasonal and trend decomposition using Loess.
Figure 2 displays eight histograms that show different patterns across various variables. Day and Hour have uniform distributions, while difLag1, difLag2, difLag12 and difLag24 show bell-shaped distributions centred around zero, showing small lagged differences. In contrast, noltrend and wind exhibit right-skewed distributions, with more low values and fewer high values.

3.1.3. Summary Statistics

Table 2 shows the summary statistics for wind energy from 2 April 2019 until 27 November 2023. The data have a minimum wind energy of 19.8 MWh and a maximum of 3102.2 MWh for the specified period. The central tendency shows a median of 903.0 MWh and a mean of 982.5 MWh, which are close to each other, suggesting a moderately right-skewed distribution with the mean slightly higher than the median. This observation is supported by the skewness of 0.7557.
Furthermore, the kurtosis value of 3.3057 suggests a leptokurtic distribution, indicating heavier tails and a sharper peak than a normal distribution. This may reflect occasional high-production spikes or outliers in wind energy data. The interquartile range (IQR), calculated as the difference between the third quartile (Q3) of 1306.8 MWh and the first quartile (Q1) of 568.4 MWh, is 738.4 MWh, indicating a moderately wide middle 50% of the data.

3.2. Data Processing

3.2.1. Dataset Description

Figure 3 shows the time series plot of wind energy. There has been a steady increase in wind energy production over the years 2019 to 2023.

3.2.2. Missing Values

Figure 4 shows that the data has no missing values in any of the variables.

3.2.3. Relationship Between Variables

Figure 5 displays the Pearson correlation coefficients among various variables. Red indicates strong positive correlations, while white shows no linear relationship. Each feature is perfectly correlated with itself along the diagonal.
The analysis reveals strong multicollinearity among difLag variables, with difLag1 and difLag2 showing particularly high correlation of 0.87. Wind exhibits moderate positive relationships with difLag24 (0.52), difLag12 (0.53), and a weaker link with difLag2 (0.19). Additionally, noltrend correlates strongly with Wind (0.66) and moderately with difLag24 (0.41). In contrast, the Day and Hour features show weak correlations with most other variables, suggesting they may not be strongly linked to the rest of the dataset.

3.2.4. Variable Importance

A comparison of variable importance is presented in Table 3 for two different sizes of the training dataset (80% vs. 85%). The variable noltrend is the dominant predictor in both cases, with the highest importance value (0.66 in both models). Feature difLag12 is the second most important predictor, but its importance drops slightly, from 0.2309 to 0.2223, as the training dataset size increases. The relative order of most variables remains unchanged, indicating that the structure of variable importance is stable. There are some small changes in the less important variables, like Hour (from 0.0199 to 0.0302) and difLag24 (from 0.0605 to 0.0508). Day remains insignificant in both models.
Table 4 reports the optimal number of boosting rounds (nrounds) that minimises generalisation error and avoids overfitting for the two data-splitting strategies. For the 80% training, 10% validation, and 10% test split, the best results were obtained at nrounds = 143, achieving the lowest RMSE and MAE. For the 85% training, 10% validation, and 5% test split, the optimal value was nrounds = 152. In both cases, exceeding these values led to overfitting, emphasising the need for early stopping.
Figure 6, which is a zoomed sample of Figure A1 in Appendix A, shows that the XGBoost model delivers strong and consistent performance in both point predictions and uncertainty quantification throughout the entire dataset. The Conformal Prediction Intervals effectively bound the actual wind energy across all segments, confirming that they are well-calibrated and reliable.
Figure 7 shows the zoomed samples from Figure A2 in Appendix A, which indicate that the predicted values closely align with the actual production values.

3.3. Choosing Number of Components

When selecting the optimal number of components for Principal Component Regression (PCR), the one-sigma heuristic was used. This approach is often recommended in the literature as a way to balance model simplicity and predictive accuracy. According to [33], the one-sigma heuristic involves choosing the model with the fewest components that still has a prediction error within one standard error of the minimum error observed across all models. In essence, rather than selecting the model with the absolute lowest prediction error, this method prioritises a simpler model that achieves nearly the same level of performance, thereby minimising the risk of overfitting.

3.4. Selecting Number of Components

Figure 8 illustrates the procedure for selecting the optimal number of components using the RMSEP criterion for both the 80% and 85% training sets. The plots show a sharp decline in RMSEP with the addition of the first few components, followed by a stabilisation in RMSEP as more components are incorporated. While the absolute minimum RMSEP occurs at six components, the results indicate that most of the relevant predictive information is already captured by the first three to four components. Adding further components provides only marginal improvements to model performance.
Table 5 highlights significant trade-offs between the two data splits. The 85% split indicates improved point forecast accuracy with reduced MASE (1.3224 vs. 1.3822) and MAE (144.61 vs. 149.98), although the RMSE marginally increases. Moving to prediction intervals, the 85% split indicates improved coverage (PICP: 0.9364 vs. 0.9275) with broader intervals (MPIW: 693.34 vs. 654.76), thus indicating improved uncertainty estimation. However, it is important to note that the CWC for the 85% split improves significantly (2064.1 vs. 2669.333), indicating better overall interval forecasting performance despite the broader prediction intervals.
Time series plots comparing actual versus predicted values in Figure 9 show both models closely tracking the observed data patterns. The 80% training split predictions align more precisely with actual values, particularly at peak points, explaining its lower MASE and MAE. Both models exhibit slight systematic underestimation, visible as predicted values frequently plotting below actual observations. The similar RMSE values reflect nearly identical prediction error magnitudes, though the 85% split demonstrates marginally better overall forecast accuracy, as shown in Table 5.
Figure 10 and Figure 11 present Probability Integral Transform (PIT) histograms derived from conformal p-values; these histograms serve as a standard diagnostic tool for assessing the probabilistic calibration of a forecasting model. Both histograms provide strong visual evidence that the conformal prediction model is well-calibrated. In both plots, the distribution of p-values is approximately uniform, as desired. This consistency across both graphs confirms that the model’s estimates of uncertainty are reliable.

3.5. Results of the Diebold–Mariano Tests

We present the results from the DM tests based on the 80% training, 10% validation and 10% test split. The two models considered are M1 (XGBoost) and M2 (PCR). The hypotheses are:
H0: 
Forecasts from M1 and M2 are equally accurate.
H1: 
Forecasts from M1 and M2 have different accuracy.
The XGBoost model significantly outperforms the PCR model in forecasting accuracy. As shown in Table 6, XGBoost achieves lower error across all evaluated metrics. This is confirmed by the Diebold–Mariano test (see Table 7), which indicates that the result is statistically significant (p-value = 0.0015), with a mean loss differential of −1651.345. XGBoost also yields improvements of 4.62% and 3.18% in MSE and MAE, respectively, relative to PCR.

3.6. Probability of Change in Direction and Fitness Tests

3.6.1. XGBoost and PCR (80% Training, 10% Validation and 10% Test)

Based on 80% training, 10% validation, and 10% test sets for both the XGBoost and PCR models, we combined test-set predictions using the partially linear additive quantile regression (PLAQR) averaging method. We refer to the combined predictions as fplaqrTest10. A detailed discussion of the PLAQR method is presented in [26].
The evaluation compared three forecasting models, fplaqrTest10, f1XG and f3PCR, using a combination of traditional error metrics, directional accuracy, and a composite fitness score. A summary of the model comparison (80% training, 10% validation and 10% test) is given in Table 8, while Figure 12 shows the probability of change in direction. In terms of directional accuracy, measured by the Prediction of Change in Direction (POCID), all models performed well, with scores ranging from 70.71% to 71.71%. Model f3PCR achieved the highest directional accuracy at 71.71%. However, when considering prediction error, fplaqrTest10 recorded the lowest Root Mean Squared Error (RMSE) of approximately 179, while f3PCR had the highest RMSE at about 189.
To provide an overall assessment, a fitness metric was used that balances directional accuracy with error magnitude, applying a weight factor optimised for the observed RMSE range. According to this combined measure, model fplaqrTest10 achieved the highest fitness score of 34.28, making it the best-performing model overall. This outcome reflects its superior balance between maintaining a relatively high POCID and keeping prediction errors low. Models f1XG and f3PCR followed with fitness scores of 33.76 and 33.60, respectively.
While each model demonstrated solid performance, fplaqrTest10 is identified as the most effective, offering the optimal trade-off between accurately predicting movement direction and minimising forecast error. The fitness metric, calibrated with a weight of 0.006, effectively highlighted these differences, confirming fplaqrTest10 as the recommended choice for practical application.

3.6.2. XGBoost and PCR (85% Training, 10% Validation and 5% Test)

Similarly, for the 85% training, 10% validation, and 5% test sets, the combined predictions using PLAQR are denoted fplaqrTest5. The results in Table 9 compare three models, fplaqrTest5, f2XG, and f4PCR, using error metrics, directional accuracy, and a combined fitness score, while Figure 13 shows the probability of change in direction. In traditional error measures, fplaqrTest5 performs best, with the lowest MAE (139.12), MSE (32,000.81), RMSE (178.89), and MASE (1.22), indicating the smallest average prediction errors. The models f2XG and f4PCR follow with progressively higher error values. All models show a low mean bias error (MBE), suggesting minimal systematic over- or under-prediction.
For directional accuracy, measured by POCID, fplaqrTest5 again leads by correctly predicting the direction of change 74.24% of the time, compared to 72.97% for f2XG and 73.31% for f4PCR. A combined fitness metric, which balances POCID and RMSE with a weighting factor of 0.006, ranks the models in the same order: fplaqrTest5 achieves the highest fitness score (35.81), followed by f2XG (34.80) and f4PCR (34.24). All are interpreted as having “good” overall performance.
The analysis confirms that fplaqrTest5 is the best-performing model across both accuracy and directional metrics. The chosen weight (w = 0.006) is justified as approximately the inverse of the average RMSE across models, providing an optimal penalty that allows for meaningful comparison without overly diminishing the fitness score. The results suggest this weighting is appropriate for models with RMSE values around 180–190.

4. Discussion

This research shows that the hybrid model produced by combining XGBoost and PCR via the PLAQR averaging method is better calibrated and more accurate. The finding that the PLAQR ensemble achieves the best composite performance (fplaqrTest10, fplaqrTest5) is consistent with the established notion within the research community that model averaging is preferable, as it helps reduce variance and improves generalisability [34,35]. Importantly, this was achieved within the conformal prediction paradigm, as indicated by the well-calibrated predictions and the PIT histograms. This is a notable development because it shows that advanced combination methods can improve both point accuracy and probabilistic calibration, even though machine learning models are typically developed with a primary focus on error minimisation.
The superior directional accuracy (POCID) and reduced error (RMSE) of the PLAQR model confirm our hypothesis: it can indeed capture complex nonlinearities by leveraging the complementary strengths of tree-based and linear methods, while maintaining structural stability. The competitive POCID of the PCR model, along with its higher error, flags its sensitivity to directional trends but with limitations in magnitude precision. XGBoost, on the other hand, provided a compromise solution. More importantly, the performance gain with the larger training dataset (85% vs. 80%) indicates that these data-intensive models continue to benefit from more data, underscoring the significance of scale in wind energy forecasting models.
The implications of these findings extend beyond this specific forecasting problem. They provide one possible solution to the problem of constructing reliable forecasting systems in the renewable energy sector. The success of the conformal prediction method is important for understanding how to construct more trustworthy AI systems in problem domains where reliable estimates of uncertainty are important.
The effectiveness of the PLAQR approach needs to be tested with a more diverse ensemble, including other model types, such as neural networks and scoring functions. The robustness of this approach can also be tested using high-frequency time-series data. Another area could be exploring adaptive weights for the PLAQR model.

5. Conclusions

This research shows that the PLAQR ensemble of XGBoost and PCR yields a better, well-calibrated model for forecasting, not only in terms of accuracy (RMSE, POCID), but also in terms of probabilistic calibration, as ensured by conformal prediction. The purpose of this research was to reaffirm the significance of employing advanced methods of averaging in machine learning. As shown, combining model classes and increasing the amount of training data are vital for renewable energy forecasting. As a result, the approach offers clear benefits to grid management, reducing the need for expensive balancing power reserves. The high level of accuracy enables more efficient use of reserves, thereby generating cost savings for the system. Additionally, the improved accuracy supports better dispatch decisions by grid operators, reducing unnecessary curtailment of wind energy as the share of renewables in the energy mix grows.

Author Contributions

Conceptualisation, R.I.N., T.H.T., T.R. and C.S.; methodology, R.I.N.; software, R.I.N.; validation, R.I.N., T.H.T., T.R. and C.S.; formal analysis, R.I.N. and C.S.; investigation, R.I.N., T.H.T., T.R. and C.S.; data curation, R.I.N.; writing—original draft preparation, R.I.N.; writing—review and editing, R.I.N., T.H.T., T.R. and C.S.; visualisation, R.I.N.; supervision, T.H.T., T.R. and C.S.; project administration, T.H.T., T.R. and C.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding, and the University of Venda funded the APC.

Data Availability Statement

The data that support the findings of this study are available at https://github.com/csigauke/Enhancing-Short-Term-Wind-Energy-Forecasting-with-XGBoost-and-Conformal-Prediction, accessed on 2 January 2026. The data is analytic data which was used for developing the models used in this study.

Acknowledgments

The authors are grateful to the many people who provided helpful comments on this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ANN: Artificial Neural Networks
ARIMA: AutoRegressive Integrated Moving Average
BH-XGBoost: Bayesian Hyperparameter-optimised XGBoost
Boost-LR: Boosting with Linear Regression
CNN-GRU: Convolutional Neural Network and Gated Recurrent Unit
CWC: Coverage Width-based Criterion
GBM: Gradient Boosting Machines
GPR: Gaussian Process Regression
KDJ: Stochastic Oscillator
KNN: K-Nearest Neighbors
LSTM: Long Short-Term Memory
MACD: Moving Average Convergence and Divergence
MAE: Mean Absolute Error
MASE: Mean Absolute Scaled Error
MBE: Mean Bias Error
MLP ANN: Multi-Layer Perceptron Artificial Neural Network
MPIW: Mean Prediction Interval Width
NMAE: Normalised Mean Absolute Error
NN: Neural Networks
PCA: Principal Component Analysis
PCR: Principal Component Regression
PICP: Prediction Interval Coverage Probability
PIT: Probability Integral Transform
RF: Random Forest
RMSE: Root Mean Square Error
SVM: Support Vector Machines
XGBoost: eXtreme Gradient Boosting

Appendix A. Supplementary Plots

The series of plots in Figure A1 and Figure A2 demonstrate that the predictive models are highly effective at forecasting wind energy production.
Figure A1. Actual vs. predicted values (80% training, 10% validation and 10% test) using the XGBoost model.
Figure A2. Actual vs. predicted values (85% training, 10% validation and 5% test) using the XGBoost model.

References

  1. Behabtu, H.A.; Vafaeipour, M.; Kebede, A.A.; Berecibar, M.; Van Mierlo, J.; Fante, K.A.; Messagie, M.; Coosemans, T. Smoothing Intermittent Output Power in Grid-Connected Doubly Fed Induction Generator Wind Turbines with Li-Ion Batteries. Energies 2023, 16, 7637. [Google Scholar] [CrossRef]
  2. Kim, D.; Hur, J. Short-term probabilistic forecasting of wind energy resources using the enhanced ensemble method. Energy 2018, 157, 211–226. [Google Scholar] [CrossRef]
  3. Foley, A.M.; Leahy, P.G.; Marvuglia, A.; McKeogh, E.J. Current methods and advances in forecasting of wind power generation. Renew. Energy 2012, 37, 1–8. [Google Scholar] [CrossRef]
  4. Ekinci, G.; Ozturk, H.K. Forecasting Wind Farm Production in the Short, Medium, and Long Terms Using Various Machine Learning Algorithms. Energies 2025, 18, 1125. [Google Scholar] [CrossRef]
  5. Zheng, Y.; Guan, S.; Guo, K.; Zhao, Y.; Ye, L. Technical indicator enhanced ultra-short-term wind power forecasting based on long short-term memory network combined XGBoost algorithm. IET Renew. Power Gener. 2025, 19, e12952. [Google Scholar] [CrossRef]
  6. Giebel, G.; Brownsword, R.; Kariniotakis, G.; Denhard, M.; Draxl, C. The State-of-the-Art in Short-Term Prediction of Wind Power: A Literature Overview, 2nd ed.; ANEMOS.plus: Crete, Greece, 2011. [Google Scholar] [CrossRef]
  7. Liu, Z.; Guo, H.; Zhang, Y.; Zuo, Z. A Comprehensive Review of Wind Power Prediction Based on Machine Learning: Models, Applications, and Challenges. Energies 2025, 18, 350. [Google Scholar] [CrossRef]
  8. Lei, M.; Shiyan, L.; Chuanwen, J.; Hongling, L.; Yan, Z. A review on the forecasting of wind speed and generated power. Renew. Sustain. Energy Rev. 2009, 13, 915–920. [Google Scholar] [CrossRef]
  9. Park, S.; Jung, S.; Lee, J.; Hur, J. A short-term forecasting of wind power outputs based on gradient boosting regression tree algorithms. Energies 2023, 16, 1132. [Google Scholar] [CrossRef]
  10. Lahouar, A.; Slama, J.B.H. Hour-ahead wind power forecast based on random forests. Renew. Energy 2017, 109, 529–541. [Google Scholar] [CrossRef]
  11. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; ACM: New York, NY, USA, 2016; pp. 785–794. [Google Scholar]
  12. Xiong, X.; Guo, X.; Zeng, P.; Zou, R.; Wang, X. A short-term wind power forecast method via XGBoost hyper-parameters optimization. Front. Energy Res. 2022, 10, 905155. [Google Scholar] [CrossRef]
  13. García-Puente, B.; Rodríguez-Hurtado, A.; Santos, M.; Sierra-García, J. Evaluation of XGBoost vs. other Machine Learning models for wind parameters identification. Renew. Energy Power Qual. J. 2023, 21, 388–393. [Google Scholar] [CrossRef]
  14. Sunku, V.S.R.P.; Namboodiri, V.; Mukkamala, R. The Short-Term Wind Power Forecasting by Utilizing Machine Learning and Hybrid Deep Learning Frameworks. Probl. Reg. Energetics 2025, 1, 1–11. [Google Scholar] [CrossRef]
  15. Ahmed, U.; Muhammad, R.; Abbas, S.S.; Aziz, I.; Mahmood, A. Short-term wind power forecasting using integrated boosting approach. Front. Energy Res. 2024, 12, 1401978.
  16. Vovk, V.; Gammerman, A.; Shafer, G. Algorithmic Learning in a Random World; Springer: Berlin/Heidelberg, Germany, 2005.
  17. Angelopoulos, A.N.; Bates, S. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv 2021, arXiv:2107.07511.
  18. Dheur, V. Distribution-Free and Calibrated Predictive Uncertainty in Probabilistic Machine Learning. Ph.D. Thesis, University of Mons (UMONS), Faculté des Sciences, Mons, Belgium, 2025.
  19. Angelopoulos, A.N.; Barber, R.F.; Bates, S. Theoretical Foundations of Conformal Prediction. arXiv 2025, arXiv:2411.11824.
  20. Kavzoglu, T.; Teke, A. Advanced hyperparameter optimization for improved spatial prediction of shallow landslides using extreme gradient boosting (XGBoost). Bull. Eng. Geol. Environ. 2022, 81, 201.
  21. Ponkumar, G.; Jayaprakash, S.; Kanagarathinam, K. Advanced machine learning techniques for accurate very-short-term wind power forecasting in wind energy systems using historical data analysis. Energies 2023, 16, 5459.
  22. Zhao, X.; Li, Q.; Xue, W.; Zhao, Y.; Zhao, H.; Guo, S. Research on Ultra-Short-Term Load Forecasting Based on Real-Time Electricity Price and Window-Based XGBoost Model. Energies 2022, 15, 7367.
  23. Friedman, J.H. Stochastic gradient boosting. Comput. Stat. Data Anal. 2002, 38, 367–378.
  24. Chen, T.; He, T.; Benesty, M.; Khotilovich, V.; Tang, Y.; Cho, H.; Chen, K.; Mitchell, R.; Cano, I.; Zhou, T.; et al. Xgboost: Extreme gradient boosting. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016.
  25. Koenker, R.W.; Bassett, G. Regression Quantiles. Econometrica 1978, 46, 33–50.
  26. Hoshino, T. Quantile regression estimation of partially linear additive models. J. Nonparametr. Stat. 2014, 26, 509–536.
  27. Fallahtafti, A.; Aghaaminiha, M.; Akbarghanadian, S.; Weckman, G.R. Forecasting ATM Cash Demand Before and During the COVID-19 Pandemic Using an Extensive Evaluation of Statistical and Machine Learning Models. SN Comput. Sci. 2022, 3, 164.
  28. Stocker, M.; Małgorzewicz, W.; Fontana, M.; Taieb, S.B. A Gentle Introduction to Conformal Time Series Forecasting. arXiv 2025, arXiv:2511.13608.
  29. Khosravi, A.; Nahavandi, S.; Creighton, D.; Atiya, A.F. Comprehensive review of neural network-based prediction intervals and new advances. IEEE Trans. Neural Netw. 2011, 22, 1341–1356.
  30. Gneiting, T.; Balabdaoui, F.; Raftery, A.E. Probabilistic forecasts, calibration and sharpness. J. R. Stat. Soc. Ser. B Stat. Methodol. 2007, 69, 243–268.
  31. Diebold, F.; Mariano, R. Comparing predictive accuracy. J. Bus. Econ. Stat. 1995, 13, 253–265.
  32. Triacca, U. Comparing Predictive Accuracy of Two Forecasts. 2018. Available online: https://www.lem.sssup.it/phd/documents/Lesson19.pdf (accessed on 17 September 2025).
  33. Mevik, B.H.; Wehrens, R.; Liland, K.H. Introduction to the pls Package. 2015. Available online: https://cran.r-project.org/web/packages/pls/vignettes/pls-manual.html (accessed on 23 August 2025).
  34. Nowotarski, J.; Weron, R. Computing electricity spot price prediction intervals using quantile regression and forecast averaging. Comput. Stat. 2015, 30, 791–803.
  35. Mpfumali, P.; Sigauke, C.; Bere, A.; Mulaudzi, S. Day Ahead Hourly Global Horizontal Irradiance Forecasting—Application to South African Data. Energies 2019, 12, 3569.
Figure 1. Flowchart of the modelling framework: XGBoost with conformal prediction vs. PCR.
Figure 2. Distribution of variables.
Figure 3. Time series plot of wind energy.
Figure 4. Missing values.
Figure 5. Pearson correlation coefficient matrix.
Figure 6. Conformal prediction intervals for actual vs. predicted values (80% training, 10% validation, 10% test), based on the first 500 observations.
Figure 7. Conformal prediction intervals for actual vs. predicted values (85% training, 10% validation, 5% test).
Figure 8. Selecting the optimal number of components for the 80% and 85% training sets.
Figure 9. Actual vs. predicted values for the benchmark model across the different data splits.
Figure 10. Probability integral transform histogram (80% training, 10% validation, 10% test).
Figure 11. Probability integral transform histogram (85% training, 10% validation, 5% test).
Figure 12. Probability of change in direction (80% training, 10% validation, 10% test).
Figure 13. Probability of change in direction (85% training, 10% validation, 5% test).
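Figures 6 and 7 display conformal prediction intervals around the XGBoost point forecasts for the two data splits. As a rough, minimal sketch of how split (inductive) conformal intervals of this kind can be constructed from a held-out calibration set, here in base R with hypothetical object names (cal_actual, cal_pred and test_pred are illustrative, not taken from the study's code):

# Split conformal prediction intervals: a minimal base R sketch.
# cal_actual, cal_pred : observed and predicted wind energy on a calibration set
# test_pred            : point forecasts on the test set
# alpha                : miscoverage level (e.g. 0.05 for a nominal 95% interval)
conformal_interval <- function(cal_actual, cal_pred, test_pred, alpha = 0.05) {
  scores  <- abs(cal_actual - cal_pred)                       # nonconformity scores
  n       <- length(scores)
  q_level <- min(1, ceiling((n + 1) * (1 - alpha)) / n)       # finite-sample correction
  q_hat   <- quantile(scores, probs = q_level, names = FALSE) # calibration quantile
  data.frame(lower = test_pred - q_hat, upper = test_pred + q_hat)
}

This symmetric-interval construction assumes the calibration and test residuals are exchangeable; for serially dependent wind energy data, time-series adaptations of conformal prediction such as those surveyed in [28] are typically required.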
Table 1. Summary of wind energy forecasting methods.
Author/Year | Methodology Type | Forecasting Horizon | Uncertainty Quantification | Key Performance Metrics | Main Limitations
[6] | Statistical (ARIMA, Persistence) | Short-term | Not addressed | Qualitative review | Struggles with nonlinear relationships between wind power and weather variables
[8] | Machine Learning (ANNs, SVMs) | Short-term | Not addressed | Comparative review | Early-stage ML applications, limited uncertainty quantification
[10] | Random Forest | Hour-ahead | Not addressed | Accuracy improvements demonstrated | Point predictions only, no uncertainty estimates
[2] | Ensemble methods (temporal and geographical ensembles) | Short-term | Probabilistic forecasting with analogue ensemble methods | Improved uncertainty estimation | Complex implementation, computational intensity
[12] | XGBoost with Bayesian hyperparameter optimisation (BH-XGBoost) | Short-term | Not addressed | Superior performance vs. SVM, KELM, LSTM in all test conditions | Point predictions only, no uncertainty quantification
[20] | Advanced optimisation algorithms | Not specified | Not addressed | Improved model performance | Focus on optimisation rather than uncertainty
[9] | Gradient Boosting Machine (GBM) | Short-term (15-min intervals) | Not addressed | NMAE: 5.15% | Point predictions only, limited to specific temporal resolution
[13] | XGBoost vs. SVR, GPR, NN | Short-term | Not addressed | XGBoost most effective for short-term predictions | No uncertainty quantification
[21] | LightGBM, RF, CatBoost, XGBoost | Very short-term | Not addressed | MAE, MSE, RMSE, R-squared comparisons | Point predictions only
[15] | Boost-LR (XGBoost, CatBoost, RF + Linear Regression) | Short-term | Not addressed | MAE improvements: 31.42%, 32.14%, 27.55% | Ensemble improves accuracy but lacks uncertainty intervals
[5] | XGBoost + LSTM + Technical Indicators (KDJ, SO, MACD) | Ultra-short-term | Not addressed | NMAE: 0.0396; processing time: 550 s | Computational complexity, no uncertainty quantification
[4] | XGBoost, RF, ANNs, KNN, MLP | Medium- to long-term | Not addressed | Superior stability and accuracy vs. statistical methods | Focus on accuracy, not forecast reliability
[14] | CNN-GRU vs. XGBoost, RF | Day-ahead | Statistical validation (Diebold–Mariano test) | Deep learning marginally better; XGBoost competitive | Hypothesis testing rather than operational uncertainty quantification
[7] | XGBoost, RF, LSTM vs. traditional methods | Comprehensive review | Not addressed | ML superiority demonstrated | Review format, no empirical uncertainty analysis
[18] | Machine Learning + conformal prediction | Various | Conformal prediction | Quantifiable uncertainty intervals | Not specifically applied to wind energy forecasting
Table 2. Summary statistics for wind energy produced.
Statistic | Value
Minimum | 19.8
First Quartile (Q1) | 568.4
Median (Q2) | 903.0
Third Quartile (Q3) | 1306.8
Maximum | 3102.2
Mean | 982.5
Skewness | 0.7557
Kurtosis | 3.3057
Table 3. Variable importance comparison for the 80% and 85% training sets.
Variable | Importance (80% Train) | Importance (85% Train)
noltrend | 0.6637 | 0.6670
difLag12 | 0.2309 | 0.2223
difLag24 | 0.0605 | 0.0508
Hour | 0.0199 | 0.0302
difLag2 | 0.0161 | 0.0198
difLag1 | 0.0075 | 0.0082
Day | 0.0015 | 0.0016
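The scores in Table 3 are gain-type variable importances of the kind produced by gradient-boosted tree software. Purely as an illustration of how such importances can be obtained with the xgboost R package (the data frame train_df and the response column wind_energy are hypothetical names; the hyperparameters shown are placeholders rather than the values tuned in the study):

library(xgboost)

# Predictors listed in Table 3 (hypothetical data frame `train_df`, response `wind_energy`)
features <- c("noltrend", "difLag1", "difLag2", "difLag12", "difLag24", "Hour", "Day")
dtrain   <- xgb.DMatrix(data = as.matrix(train_df[, features]), label = train_df$wind_energy)

# Fit a regression booster; nrounds = 143 echoes the optimal rounds reported for the 80% split,
# while other hyperparameters are left at their defaults purely for illustration
bst <- xgb.train(params = list(objective = "reg:squarederror"), data = dtrain, nrounds = 143)

# Gain-based variable importance, comparable in spirit to Table 3
xgb.importance(model = bst)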
Table 4. Model performance comparison using (80% training, 10% validation and 10% test) and (85% training, 10% validation and 5% test).
Data split: 80% training, 10% validation, 10% test
Evaluation Metric | Optimal Rounds = 143 | nrounds = 500 | nrounds = 1000
MASE | 1.3284 | 1.4037 | 1.4188
RMSE | 182.4441 | 193.5536 | 195.9357
MAE | 144.1475 | 152.3166 | 153.9614
MBE | 0.0677 | −7.1082 | −7.9733
Data split: 85% training, 10% validation, 5% test
Evaluation Metric | Optimal Rounds = 152 | nrounds = 500 | nrounds = 1000
MASE | 1.2551 | 1.3146 | 1.3421
RMSE | 182.781 | 192.4822 | 197.4568
MAE | 143.5088 | 150.3099 | 153.4492
MBE | −3.7311 | −17.5414 | −12.7386
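Table 4 contrasts the tuned (optimal) number of boosting rounds with fixed settings of 500 and 1000 rounds. For readers wanting to reproduce metrics of this kind, a minimal base R sketch of the usual definitions follows (the MBE sign convention and the MASE scaling by the in-sample naive forecast are standard choices assumed here, not confirmed from the paper):

# Point-forecast error metrics: a minimal base R sketch.
# actual, pred : test-set observations and forecasts
# train_actual : training series used to scale MASE by the one-step naive MAE
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
mae  <- function(actual, pred) mean(abs(actual - pred))
mbe  <- function(actual, pred) mean(pred - actual)          # sign convention assumed
mase <- function(actual, pred, train_actual) {
  mean(abs(actual - pred)) / mean(abs(diff(train_actual)))  # scale by in-sample naive MAE
}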
Table 5. Model performance comparison and prediction interval evaluation using (80% training, 10% validation, 10% test) and (85% training, 10% validation, 5% test).
Evaluation Metric | 80%/10%/10% Split | 85%/10%/5% Split
Point-forecast metrics:
MASE | 1.3822 | 1.3224
RMSE | 189.0324 | 190.2073
MAE | 149.9833 | 144.6136
MBE | −3.4655 | −4.1088
Prediction-interval metrics:
PICP | 0.9275 | 0.9364
MPIW | 654.7578 | 693.342
CWC | 2669.333 | 2064.1
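Table 5 evaluates the prediction intervals using the prediction interval coverage probability (PICP), mean prediction interval width (MPIW) and the coverage-width-based criterion (CWC) of Khosravi et al. [29]. A minimal base R sketch under a common CWC parameterisation follows; the penalty parameters (nominal coverage mu = 0.95, eta = 50) are assumptions, although with these defaults the sketch yields CWC values of roughly the magnitude reported in Table 5.

# Prediction interval evaluation metrics: a minimal base R sketch.
# actual        : observed values
# lower, upper  : interval bounds, same length as `actual`
picp <- function(actual, lower, upper) mean(actual >= lower & actual <= upper)
mpiw <- function(lower, upper) mean(upper - lower)

# Coverage-width-based criterion in the spirit of [29]: the width is penalised
# only when empirical coverage falls below the nominal level mu.
cwc <- function(actual, lower, upper, mu = 0.95, eta = 50) {
  p     <- picp(actual, lower, upper)
  gamma <- as.numeric(p < mu)
  mpiw(lower, upper) * (1 + gamma * exp(-eta * (p - mu)))
}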
Table 6. Performance metrics comparison (80% training, 10% validation and 10% test).
Metric | XGBoost | PCR | Winner
MSE | 34,081.89 | 35,733.24 | XGBoost
RMSE | 184.61 | 189.03 | XGBoost
MAE | 145.22 | 149.98 | XGBoost
MAPE (%) | 13.58 | 14.27 | XGBoost
Table 7. Model comparisons: Diebold–Mariano test.
Null Hypothesis | Test Statistic | p-Value | Mean Loss Differential | Result
M1 = M2 | −3.182 | 0.0015 | −1651.345 | Not equally accurate
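Table 7 reports a Diebold–Mariano test [31] of the null hypothesis that the two competing forecasts are equally accurate. A minimal sketch of how such a test can be run with dm.test() from the forecast package in R (e_xgb and e_pcr are hypothetical names for the two models' forecast errors on a common test set):

library(forecast)

# Forecast errors of the two competing models on the same test set (hypothetical objects)
e_xgb <- actual - pred_xgb
e_pcr <- actual - pred_pcr

# Two-sided test of equal predictive accuracy with squared-error loss
dm.test(e_xgb, e_pcr, alternative = "two.sided", h = 1, power = 2)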
Table 8. Comprehensive model comparison (80% training, 10% validation and 10% test).
Model | MASE | RMSE | MSE | MAE | MBE | POCID (%) | Fitness
fplaqrTest10 | 1.2993 | 179.2473 | 32,129.60 | 140.9943 | −0.7070 | 71.1487 | 34.2805
f1XG | 1.3284 | 182.4441 | 33,285.86 | 144.1475 | 0.0677 | 70.7078 | 33.7562
f3PCR | 1.3822 | 189.0324 | 35,733.24 | 149.9833 | −3.4655 | 71.7120 | 33.6014
Table 9. Performance metrics for the three forecasting models (85% training, 10% validation and 5% test).
Model | MASE | RMSE | MSE | MAE | MBE | POCID (%) | Fitness
fplaqrTest5 | 1.2168 | 178.8877 | 32,000.81 | 139.1227 | 0.8072 | 74.2409 | 35.8077
f2XG | 1.2551 | 182.7810 | 33,408.91 | 143.5088 | −3.7311 | 72.9677 | 34.8014
f4PCR | 1.3224 | 190.2073 | 36,178.81 | 151.2035 | −4.1088 | 73.3105 | 34.2373
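Tables 8 and 9 additionally report the probability of change in direction (POCID), i.e., the percentage of time steps at which the forecast moves in the same direction as the observed series. A minimal base R sketch of this standard definition (the treatment of ties, i.e., zero changes, is an assumption):

# Probability of change in direction (POCID), in percent: a minimal base R sketch.
# actual, pred : observed and forecast series of equal length
pocid <- function(actual, pred) {
  100 * mean(sign(diff(actual)) == sign(diff(pred)))
}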
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
