Article

Short-Term Electric Load Probability Forecasting Based on the BiGRU-GAM-GPR Model

1 Yunnan Power Grid Co., Ltd., 73# Tuodong Road, Kunming 650011, China
2 School of Civil and Hydraulic Engineering, Huazhong University of Science and Technology, 1037 Luoyu Road, Wuhan 430074, China
3 China Southern Power Grid Lancang-Mekong International Co., Ltd., 15 Guangfu Road, Kunming 650228, China
* Authors to whom correspondence should be addressed.
Sustainability 2025, 17(12), 5267; https://doi.org/10.3390/su17125267
Submission received: 25 April 2025 / Revised: 28 May 2025 / Accepted: 4 June 2025 / Published: 6 June 2025

Abstract
Accurate and reliable short-term electricity load forecasting plays an important role in ensuring the healthy operation of the power grid and promoting sustainable socio-economic development. This research proposes a novel hybrid load probability prediction model, BiGRU-GAM-GPR, which combines a bidirectional gated recurrent unit (BiGRU), global attention mechanism (GAM), and Gaussian process regression (GPR). First, BiGRU-GAM is used to generate preliminary predictions of the load sequence, and these results are then input into GPR to obtain more accurate deterministic and probabilistic predictions. To verify the effectiveness of the proposed model, a series of experiments are conducted on three real-world power load datasets. The experimental results show the following: (1) BiGRU achieves the best forecasting performance among the basic models. (2) The global attention mechanism improves the model's ability to perceive the spatial features of multi-feature sequences and plays a positive role in enhancing the model's forecasting performance. (3) The GPR model further explores the internal relationships of the data by expanding the deterministic prediction results into probabilistic results, thus improving the forecasting effect. (4) The proposed BiGRU-GAM-GPR model exhibits the best performance in both deterministic and probabilistic forecasting and has good robustness.

1. Introduction

A safe and reliable power supply is an important guarantee for promoting the healthy and stable development of the urban economy and the sustainable development of human society [1,2,3]. With the gradual advancement of China’s power market reform, the requirements for accurate and reliable load forecasting have been further increased [4,5,6]. According to the time scale of load forecasting, current power load forecasting is mainly divided into four categories: ultra-short-term load forecasting, short-term load forecasting, medium-term load forecasting, and long-term load forecasting [7]. Short-term load forecasting generally predicts the load demand for the next one hour to one week [8], and it is an important part of the field of power load forecasting [9]. Enhancing the accuracy and robustness of short-term load forecasting plays a positive role in optimizing the power generation plan, ensuring the safety and stability of the power grid, formulating bidding strategies, and integrating renewable energy sources [10,11,12].
Traditional short-term power load forecasting models mainly include the autoregressive moving average model (ARMA) [13], the autoregressive integrated moving average model (ARIMA) [14], multiple linear regression (MLR) [15], the Kalman filter [16], and gray relational degree models [17]. These traditional models are mainly based on statistical analysis and linear regression, and they feature high computational efficiency and strong interpretability [18]. However, with the integration of renewable energy sources into the grid and the introduction of market mechanisms, load forecasting has gradually evolved from a traditional linear problem into a high-dimensional nonlinear problem [19,20]. As a result, traditional models have difficulty capturing the more complex dynamic characteristics of high-dimensional nonlinear problems, and they struggle to meet the data-scale and computing-power requirements of modern load forecasting [21,22,23].
With the development of computer science, intelligent algorithms based on machine learning and deep learning have been developing rapidly, providing more ideas for short-term load forecasting models [24,25,26]. An increasing number of hybrid models and ensemble models based on intelligent computing methods have been applied in the field of power load forecasting. Machine-learning algorithms and deep learning algorithms have significant advantages in handling highly complex nonlinear relationships between inputs and outputs [27]. Commonly used machine-learning methods include support vector regression (SVR) [28], Gaussian process regression (GPR) [29], and artificial neural network (ANN) [30], etc. Commonly used deep learning methods include recurrent neural network (RNN), long short-term memory network (LSTM) [31], gated recurrent unit (GRU) [32], and bidirectional gated recurrent unit (BiGRU) [33], etc. Niu et al. [34] proposed a multi-energy load forecasting model for an integrated energy system (IES) based on CNN-BiGRU, which was optimized through an attention mechanism. This model adopted a multi-task learning approach to accurately predict short-term cooling, heating, and power loads. Li et al. [35] proposed an LSTM-based power load forecasting model combined with a simplex optimizer, and took into account the influence of social factors in the processing of input data. Lin et al. [36] established a novel ensemble model based on an extreme learning machine (ELM) optimized by variational mode decomposition (VMD) and differential evolution (DE) algorithms, achieving good results in multi-step power load forecasting. Wang et al. [37] proposed a power load forecasting method based on a novel combined interval forecasting system (CElif), aiming to improve the accuracy and reliability of power load forecasting through the synergistic effect of decomposition and denoising strategies, individual prediction modules, optimization modules, and evaluation modules, providing a decision-making basis for the scientific dispatching of smart grids. Lin et al. [38] proposed a short-term load forecasting framework based on graph neural network (GNN), aiming to predict both individual loads and aggregated loads simultaneously. This framework can capture different hidden spatial dependencies without any prior knowledge of geographical information, thus improving the prediction accuracy. Machine-learning and deep learning methods, in addition to their extensive applications in load forecasting, have also demonstrated a remarkable performance in other fields. Zhu et al. [39] constructed a miniature transducer suitable for high-frequency intravascular ultrasound (IVUS) imaging using machine-learning approaches and successfully fabricated a lead-free (100)-textured KNLN thick film with superior piezoelectric properties, thereby paving a new path for the application of lead-free piezoelectric materials in the field of high-frequency ultrasound imaging. Feng et al. [40] established a coupled prediction and analysis model that takes into account the sliding wear process, real machined surface, and mixed lubrication. For the first time, sliding wear and its influence on surface micro-topography were integrated with material hardness and wear time, forming an effective predictive model framework and providing a valuable tool for evaluating the lubrication performance of gas turbine bearings.
In actual operation, the load is prone to fluctuations due to the influence of external factors [41], which leads to prediction errors. Although traditional statistical models and existing intelligent algorithms have achieved significant progress in load forecasting accuracy, most existing methods focus on obtaining high-precision single-point load forecasting results through deterministic forecasting, that is, depicting load trends through single-point forecast values. This approach ignores the uncertainty caused by random load fluctuations in actual operation. Therefore, when forecasting errors are unavoidable, the reliability of single-point load forecasting results is inferior to that of probabilistic forecasting. Probabilistic load forecasting extends point prediction results to interval results. By giving the upper and lower limits of the predicted load, it determines the possible fluctuation range of the predicted value, making the prediction closer to actual operation and production and providing more reliable prediction results for power generation enterprises to formulate power generation plans and participate in market competition [42]. Xiao et al. [43] innovatively incorporated the influencing factors of consecutive multi-day meteorological conditions into the daily peak load forecasting method of the power grid and proposed a hybrid power load forecasting model based on decomposition and Fisher information. They also used the Gaussian process regression (GPR) method to expand the deterministic load prediction into interval results, providing an extended idea for peak load forecasting. Huang et al. [44] proposed a probabilistic load forecasting method based on CNN and load range discretization (LRD), which can directly generate the probability distribution of the load without presetting the type of probability distribution or using non-differentiable training functions. Lin et al. [45] proposed a long short-term memory (LSTM) model based on a dual attention mechanism for the probabilistic forecasting of short-term regional loads. Through the feature attention mechanism and the temporal attention mechanism, this model evaluates the correlation between the input features and the load data and captures their temporal dependence, respectively, thus improving its prediction accuracy. The above-mentioned studies have made outstanding contributions to short-term power load forecasting in terms of input factors, feature screening, and model integration.
In this study, a hybrid probabilistic forecasting model for short-term power load considering uncertainties, BiGRU-GAM-GPR, which couples a bidirectional gated recurrent unit (BiGRU), global attention mechanism (GAM), and Gaussian process regression (GPR), is proposed. The prediction of this model is divided into two stages. In the first stage, the data is input into BiGRU-GAM to obtain the preliminary prediction results; in the second stage, the preliminary prediction results are input into GPR to obtain more accurate deterministic and probabilistic prediction results. The innovations of this study are summarized as follows: (1) A hybrid probabilistic forecasting model composed of BiGRU, GAM, and GPR is proposed, aiming to obtain more accurate point prediction results and reliable probabilistic results of power loads. (2) The BiGRU model is compared with the basic models to verify the excellent performance of BiGRU in short-term load forecasting. (3) Through model comparison, the enhancing effect of the global attention mechanism on the forecasting performance is verified. (4) Gaussian process regression is used to expand the deterministic results into probabilistic results, verifying the feasibility of the proposed model in probabilistic forecasting. (5) BiGRU-GAM-GPR is comprehensively compared with six comparative models on three datasets divided by the sliding window method. The results show that the proposed BiGRU-GAM-GPR achieves accurate point prediction results and reliable probabilistic prediction results and has excellent forecasting ability and robustness.
The remainder of this paper is organized as follows: Section 1 introduces the research background; Section 2 presents the basic models and methods used in this paper, including the bidirectional gated recurrent unit, global attention mechanism, and Gaussian process regression, as well as the entire process of the proposed probabilistic forecasting framework; Section 3 introduces the research objects and describes the division of the research data, data preprocessing, and the design of the comparative experiments; Section 4 presents the experimental results of the proposed model and the comparative models, as well as the analysis of the results; Section 5 discusses the practical implications of the results; Section 6 presents the conclusions of this paper.

2. Methods

2.1. Bidirectional Gated Recurrent Unit (BiGRU)

The gated recurrent unit (GRU) is an improved structure of the long short-term memory (LSTM) network [46]. Compared with LSTM, GRU integrates the forget gate and input gate into an update gate and merges the hidden state with the cell state. It not only retains the LSTM’s ability to capture long-term and short-term dependencies but also simplifies the computational process, thereby improving computational efficiency.
GRU contains two gate structures: the reset gate and the update gate. The reset gate determines how new inputs are combined with the previous hidden state, while the update gate decides the extent to which the previous hidden state is retained and the degree to which the current candidate hidden information is incorporated. The structure of GRU is illustrated in Figure 1, where $z_t$ and $r_t$ represent the update gate and reset gate, respectively. The detailed computation process is as follows:
$z_t = \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right)$
$r_t = \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right)$
$\tilde{h}_t = \tanh\left(W x_t + U \left(r_t \odot h_{t-1}\right) + b\right)$
$h_t = \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t$
where $\tilde{h}_t$ and $h_t$ represent the candidate hidden state and the current output, respectively; $W_z$, $W_r$, and $W$ are the input weights, and $U_z$, $U_r$, and $U$ are the recurrent weights for the update gate, reset gate, and candidate hidden state, respectively; $b_z$, $b_r$, and $b$ denote the corresponding biases; $\sigma(\cdot)$ is the sigmoid activation function; and $\odot$ denotes element-wise multiplication.
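As an illustration of the gate equations above, the following is a minimal NumPy sketch of a single GRU step; the parameter dictionary layout and shapes are illustrative assumptions, not the implementation used in this study.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU update: x_t has shape (input_dim,), h_prev has shape (hidden_dim,)."""
    z_t = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])           # update gate
    r_t = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])           # reset gate
    h_tilde = np.tanh(p["W"] @ x_t + p["U"] @ (r_t * h_prev) + p["b"])  # candidate hidden state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                         # new hidden state
```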
The bidirectional gated recurrent unit (BiGRU) consists of two independent GRU layers, one processing the sequence in the forward direction and the other in the backward direction. Compared with the standard GRU model, this bidirectional structure enables BiGRU to capture long-range dependencies within the sequence and reflect the global information of the entire time series, thereby improving the model’s performance. The structure of BiGRU is illustrated in Figure 2, and the detailed computation process is as follows:
$\overrightarrow{h}_t = \mathrm{GRU}\left(x_t, \overrightarrow{h}_{t-1}\right)$
$\overleftarrow{h}_t = \mathrm{GRU}\left(x_t, \overleftarrow{h}_{t+1}\right)$
where $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ represent the hidden states of the forward and backward GRU units, respectively.
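Building on the gru_step sketch above, a BiGRU layer can be illustrated as two independent GRU passes over the same sequence, one forward and one backward, whose hidden states are concatenated at each time step. This is a sketch of the idea, not the authors' code.

```python
import numpy as np

def bigru(sequence, params_fwd, params_bwd, hidden_dim):
    """sequence: list/array of T input vectors -> list of T concatenated hidden states."""
    T = len(sequence)
    h_f, h_b = np.zeros(hidden_dim), np.zeros(hidden_dim)
    fwd, bwd = [None] * T, [None] * T
    for t in range(T):                          # forward GRU scan
        h_f = gru_step(sequence[t], h_f, params_fwd)
        fwd[t] = h_f
    for t in reversed(range(T)):                # backward GRU scan
        h_b = gru_step(sequence[t], h_b, params_bwd)
        bwd[t] = h_b
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```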

2.2. Global Attention Mechanism (GAM)

The global attention mechanism (GAM) is a crucial technique in deep learning [47] that aims to enable a model to focus on the most relevant parts of an input sequence or features by dynamically computing a context vector. The global attention mechanism was initially proposed by Bahdanau et al. [48] in the context of neural machine translation to address the limitation of traditional encoder–decoder architectures, where fixed-length vectors often fail to effectively capture key information from long sequences. Subsequently, this mechanism has been extended to various domains, including computer vision, natural language processing (NLP), and other fields, to enhance the model’s ability to capture global information.
The core of the global attention mechanism lies in dynamically computing attention weights to perform the weighted aggregation of input features. By assigning weights to features at all positions, the model is able to focus on the most important parts of the sequence, thereby enhancing its ability to perceive spatial features in multi-feature sequences. The key components of this mechanism include an encoder–decoder framework, a context vector, and the computation of attention weights. Within the encoder–decoder framework, the encoder encodes the input sequence into a set of hidden states, while the decoder progressively generates the output sequence, dynamically attending to the encoder’s hidden states. At each step of generating the output, the decoder computes a context vector, which represents the weighted sum of the most relevant encoder states during the current decoding step. The attention weights are calculated using an alignment model, typically employing scoring functions such as dot-product, concatenation, or general forms. The formula for computing the global attention mechanism is as follows:
$c_t = \sum_{i=1}^{n} \alpha_{ti} h_i$
$\alpha_{ti} = \mathrm{softmax}\left(\mathrm{score}\left(s_t, h_i\right)\right)$
where $c_t$ denotes the context vector of the decoder at the $t$th time step; $h_i$ represents the hidden state of the encoder at the $i$th position of the input sequence; $\alpha_{ti}$ is the attention weight of the $i$th encoder hidden state $h_i$ during the $t$th decoding step; and $s_t$ indicates the current hidden state of the decoder.
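The context-vector computation above can be sketched as follows; a dot-product score is used here purely for illustration, since the text notes that dot-product, concatenation, or general scoring functions are all possible.

```python
import numpy as np

def global_attention(s_t, encoder_states):
    """s_t: decoder state (d,); encoder_states: (n, d). Returns context (d,) and weights (n,)."""
    scores = encoder_states @ s_t                  # score(s_t, h_i) via dot product
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax -> attention weights alpha_ti
    context = weights @ encoder_states             # weighted sum of encoder hidden states
    return context, weights
```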

2.3. Gaussian Process Regression (GPR)

Gaussian process regression (GPR) is a nonparametric regression method based on Bayesian statistics [49] which exhibits significant advantages in addressing nonlinear, high-dimensional, and complex regression problems. It has been widely applied in fields such as machine learning, signal processing, and optimization control. The GPR model describes the relationships between sample points through the covariance matrix of a Gaussian process, enabling predictions for unknown data points. Unlike traditional linear regression models, GPR can capture complex nonlinear relationships by employing a specified kernel function while also providing estimates of uncertainty.
Consider a training dataset $D = \{(x_i, y_i) \mid i = 1, 2, \ldots, n\} = (X, y)$, where $X \in \mathbb{R}^{d \times n}$ is the matrix composed of the input vectors $x_i$, $y \in \mathbb{R}^{n}$ is the vector composed of the output data $y_i$, $n$ is the number of training samples, and $d$ is the dimensionality of the input vector $x_i$. We define the function space $g(x)$; then $g(x_1), g(x_2), \ldots, g(x_n)$ form a set of random variables that follow a joint Gaussian distribution. The Gaussian process is expressed as follows:
$g(x) \sim \mathcal{GP}\left(m(x), k(x, x')\right)$
where all of the statistical characteristics of the Gaussian process $g(x)$ are determined by the mean function $m(x)$ and the covariance function $k(x, x')$.
By incorporating Gaussian white noise $\varepsilon \sim N(0, \sigma_n^2)$ into the observed target values $y$, a general model for Gaussian process regression can be established, expressed as $y = g(x) + \varepsilon$. Since the noise $\varepsilon$ is independent of $g(x)$ and $g(x)$ follows a Gaussian distribution, $y$ also follows a Gaussian distribution. Therefore, the prior distribution of $y$ can be expressed as
$y \sim N\left(m(x), K + \sigma_n^2 I\right)$
where $K$ is the covariance matrix composed of elements $K_{ij} = k(x_i, x_j)$, and $k(\cdot)$ represents the covariance function.
Given an $n^*$-dimensional test sample set $D^* = \{(x_i, y_i) \mid i = n+1, \ldots, n+n^*\}$, the training sample observations $y$ and the output vector $y^*$ of the test data form a joint Gaussian distribution as follows:
$\begin{bmatrix} y \\ y^* \end{bmatrix} \sim N\left(0, \begin{bmatrix} K(X, X) + \sigma_n^2 I & K(X, X^*) \\ K(X^*, X) & k(X^*, X^*) \end{bmatrix}\right)$
where $K(X, X)$ is the symmetric positive definite covariance matrix of the training data; $K(X, X^*) = K(X^*, X)^{T}$ is the covariance matrix between the test data and the training data; and $k(X^*, X^*)$ is the covariance matrix of the test input variables.
Under the constraint of the joint prior distribution given the training dataset and the test points, the posterior probability distribution of the output $y^*$ can be derived according to the Bayesian principle as follows:
$y^* \mid X, y, X^* \sim N\left(\hat{y}^*, \mathrm{cov}(y^*)\right)$
$\hat{y}^* = K(X^*, X)\left[K(X, X) + \sigma_n^2 I\right]^{-1} y$
$\mathrm{cov}(y^*) = k(X^*, X^*) - K(X^*, X)\left[K(X, X) + \sigma_n^2 I\right]^{-1} K(X, X^*)$
where $\hat{y}^*$ and $\mathrm{cov}(y^*)$ represent the predicted value and the predictive variance of the GPR model during the testing period, respectively.
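The posterior mean and covariance above translate directly into code. The following NumPy sketch uses an RBF covariance function with illustrative (unfitted) hyperparameters; it is not the GPy implementation used later in the paper.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0, variance=1.0):
    """RBF covariance between row vectors of A (n, d) and B (m, d)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / length_scale ** 2)

def gpr_predict(X_train, y_train, X_test, noise_var=1e-2):
    """Posterior mean and covariance of a GP with Gaussian observation noise."""
    K = rbf_kernel(X_train, X_train)                  # K(X, X)
    K_s = rbf_kernel(X_train, X_test)                 # K(X, X*)
    K_ss = rbf_kernel(X_test, X_test)                 # k(X*, X*)
    A = K + noise_var * np.eye(len(X_train))          # K(X, X) + sigma_n^2 I
    mean = K_s.T @ np.linalg.solve(A, y_train)        # posterior predictive mean
    cov = K_ss - K_s.T @ np.linalg.solve(A, K_s)      # posterior predictive covariance
    return mean, cov
```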

2.4. Load Forecasting Framework

This paper innovatively proposes a hybrid probabilistic prediction model for short-term power load considering uncertainty, which combines a bidirectional gated recurrent unit, global attention mechanism, and Gaussian process regression, named BiGRU-GAM-GPR. The model aims to achieve more accurate point prediction results as well as reliable probabilistic prediction results for power load. The framework of the model is shown in Figure 3 below. A single-step prediction method with multi-factor inputs is adopted in this paper, taking into account the influence of historical meteorological factors and load. The point prediction results are extended to probabilistic prediction results by using Gaussian process regression. The detailed prediction process of the model is as follows.
Step 1: Data partitioning and preprocessing. Firstly, the data are tested for missing values and outliers to ensure the usability of the dataset. To verify the generalization ability of the proposed model, the sliding window method is employed to partition the data, generating three distinct datasets. These datasets are further divided into training and testing sets. The data in both the training and test sets are normalized to accelerate the gradient descent and eliminate the influence of dimensionality.
Step 2: The calculation of the prediction results of the BiGRU-GAM model. The divided training and testing datasets are fed into the bidirectional gated recurrent unit (BiGRU) model optimized by the global attention mechanism (GAM). At each time step, the sequence is first input into the BiGRU model for training to obtain the output. Simultaneously, the global attention mechanism dynamically learns the various parts of the input sequence and computes a weight vector. These weights are then applied to the output of BiGRU for weighted summation, resulting in the load prediction outcome of the BiGRU-GAM model.
Step 3: Repartitioning of the dataset. The load forecasting results from the BiGRU-GAM model and the observed load values are recombined to form new training and testing datasets. First, the load forecasting results from the BiGRU-GAM model are partitioned into a new training set and a new testing set. The training set for the GPR model is then composed of the training set of the BiGRU-GAM model’s prediction results and the observed load values from the training set partitioned in Step 1. The testing set for the GPR model is the testing set of the BiGRU-GAM model’s prediction results.
Step 4: The calculation of the point prediction and probabilistic prediction results. The training and testing sets obtained in Step 3 are input into the GPR model to obtain the final point prediction and probabilistic prediction results.
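Steps 3 and 4 can be summarized by the sketch below, which reuses the gpr_predict sketch from Section 2.3. The 1.96 multiplier corresponds to a 95% Gaussian interval and is an illustrative choice rather than the paper's fixed setting.

```python
import numpy as np

def stage_two_gpr(prelim_train, y_train, prelim_test, z=1.96):
    """Map stage-1 (BiGRU-GAM) predictions to final point and interval predictions via GPR."""
    X_tr = np.asarray(prelim_train, float).reshape(-1, 1)   # stage-1 predictions, training part
    X_te = np.asarray(prelim_test, float).reshape(-1, 1)    # stage-1 predictions, testing part
    mean, cov = gpr_predict(X_tr, np.asarray(y_train, float), X_te)
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    return mean, mean - z * std, mean + z * std              # point forecast, lower, upper
```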

2.5. Evaluation Metrics

2.5.1. Evaluation Metric of Point Prediction

To comprehensively evaluate the deterministic prediction performance of the designed models, this study chooses the coefficient of determination (R2), root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) as evaluation metrics. R2 represents the proportion of the variance in the target variable that is explained by the independent variables (features) in the model, with a range between 0 and 1. Typically, a higher R2 indicates a better model fit. RMSE evaluates the accuracy of model predictions by calculating the square root of the mean of the squared deviations between the predicted and observed values. MAE computes the average absolute deviation between the predicted and actual values, while MAPE calculates the percentage of relative error between the predicted and actual values. The smaller these three values are, the more accurate the model predictions are. The formulas for these evaluation metrics are as follows:
$R^2 = 1 - \dfrac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2}$
$RMSE = \sqrt{\dfrac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}$
$MAE = \dfrac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i - y_i\right|$
$MAPE = \dfrac{1}{N}\sum_{i=1}^{N}\left|\dfrac{\hat{y}_i - y_i}{y_i}\right| \times 100\%$
where $N$ is the total number of samples; $y_i$ is the observed value of the $i$th sample; $\hat{y}_i$ is the predicted value of the $i$th sample; and $\bar{y}$ is the mean of the observed samples.
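For reference, the four point-prediction metrics can be computed as follows, a straightforward NumPy transcription of the formulas above.

```python
import numpy as np

def point_metrics(y_true, y_pred):
    """Compute R2, RMSE, MAE, and MAPE for deterministic predictions."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_pred - y_true
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    mape = np.mean(np.abs(err / y_true)) * 100.0
    return {"R2": r2, "RMSE": rmse, "MAE": mae, "MAPE": mape}
```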

2.5.2. Evaluation Metric of Probability Prediction

Unlike the evaluation of deterministic forecasts, the assessment of probabilistic forecasting results requires not only consideration of prediction accuracy but also an analysis of the reasonableness of the predicted intervals. In this study, we employ the prediction interval coverage probability (PICP), continuous ranked probability score (CRPS), and mean prediction interval width (MPIW) to measure the predictive performance.
The core function of PICP is to quantify the probability that the model’s prediction interval covers the true value. A higher PICP indicates a better prediction performance. The calculation method is as follows:
$PICP = \dfrac{1}{N}\sum_{i=1}^{N}\delta_i$
where $N$ is the sample size, and $\delta_i$ is a Boolean function that takes the value 1 when the $i$th observed value lies within the prediction interval, and 0 otherwise.
CRPS is a metric used to assess the accuracy of probabilistic forecasts. It is an extension of the mean absolute error to continuous probability distributions and can be used to measure the inconsistency between the forecasted probability distribution and the observed distribution based on the differences in the cumulative distribution function (CDF). This metric characterizes the evaluation results through specific scores, where a lower score indicates a better overall model performance. The calculation method is as follows:
$CRPS = \dfrac{1}{N}\sum_{i=1}^{N}\int_{-\infty}^{+\infty}\left[F\left(\hat{y}_i\right) - \delta\left(\hat{y}_i - y_i\right)\right]^2 d\hat{y}_i$
$F\left(\hat{y}_i\right) = \int_{-\infty}^{\hat{y}_i} p(x)\, dx, \qquad \delta\left(\hat{y}_i - y_i\right) = \begin{cases} 0, & \hat{y}_i < y_i \\ 1, & \hat{y}_i \ge y_i \end{cases}$
where $F\left(\hat{y}_i\right)$ represents the cumulative distribution function obtained from the probabilistic model, and $\delta\left(\hat{y}_i - y_i\right)$ denotes the Heaviside step function, which represents the true value of the sample in terms of the CDF.
MPIW is used to measure the average width of the prediction intervals. A smaller MPIW indicates that the uncertainty of the model’s predicted values is lower, meaning the model can capture data features and patterns more accurately, and the prediction results are more concentrated and precise. Its calculation process involves summing up the widths of the prediction intervals for all prediction points and then dividing by the total number of prediction points to obtain the average width of the prediction intervals. The calculation method is as follows:
$MPIW = \dfrac{1}{N}\sum_{i=1}^{N}\left(Upper_i - Lower_i\right)$
where $N$ is the sample size, $Upper_i$ is the upper limit of the prediction interval for the $i$th prediction point, and $Lower_i$ is the lower limit of the prediction interval for the $i$th prediction point.
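PICP and MPIW follow their definitions directly; for CRPS, the sketch below uses the closed-form expression for a Gaussian predictive distribution, which is an assumption consistent with the GPR output but not necessarily how the scores were computed in this study.

```python
import numpy as np
from scipy.stats import norm

def picp(y_true, lower, upper):
    """Fraction of observations covered by the prediction interval."""
    return np.mean((y_true >= lower) & (y_true <= upper))

def mpiw(lower, upper):
    """Mean width of the prediction intervals."""
    return np.mean(upper - lower)

def crps_gaussian(y_true, mu, sigma):
    """Average CRPS assuming Gaussian predictive distributions N(mu, sigma^2)."""
    z = (np.asarray(y_true, float) - mu) / sigma
    return np.mean(sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi)))
```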

3. Case Study

3.1. Study Data

This study employs hourly-scale historical electricity load data from a province in southern China, covering the period from 1 July 2020 to 1 July 2021, as a case study. Meteorological factors such as the average temperature, dew point, and relative humidity were selected as input variables, with the meteorological data obtained from the local meteorological bureau. To validate the generalization capability of the proposed model, a sliding window approach [50] was adopted based on the existing dataset A. This method ensures that adjacent datasets share overlapping periods, preserving temporal continuity, while also introducing new data with which to evaluate the model's generalization ability. The sliding window length and single-step sliding distance were set to 80% and 20% of the original dataset A, respectively. Consequently, the original dataset was sequentially divided into new datasets B and C, as illustrated in Figure 4. For each dataset, the first 80% of the data (highlighted in yellow) and the last 20% of the data (highlighted in pink) were used as the training and test sets, respectively.
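One possible reading of this partitioning scheme is sketched below: datasets B and C are windows whose length is 80% of dataset A, offset by 20% of A, and each dataset is then split 80/20 into training and test sets. The actual window placement is defined by Figure 4, so the indexing here is an assumption for illustration only.

```python
import numpy as np

def sliding_window_datasets(series, window_frac=0.8, step_frac=0.2, train_frac=0.8):
    """Return {'A', 'B', 'C'} -> (train, test) splits; window placement is illustrative."""
    series = np.asarray(series)
    n = len(series)
    win, step = int(n * window_frac), int(n * step_frac)
    raw = {"A": series, "B": series[:win], "C": series[step:step + win]}
    splits = {}
    for name, data in raw.items():
        cut = int(len(data) * train_frac)
        splits[name] = (data[:cut], data[cut:])   # (training set, test set)
    return splits
```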

3.2. Data Preprocessing

In this study, three meteorological factors, namely average temperature, dew point, and relative humidity, along with historical load data, are employed as inputs for predicting future load. To account for the cumulative periodic effects in load time series, based on the autocorrelation analysis of hourly load data and the cross-correlation analysis of meteorological factors, load data and meteorological factors with a lag of 1 week (168 h) are used as model inputs to predict the load for the next hour.
Prior to feeding the data into the model, data normalization is conducted to accelerate the gradient descent speed, eliminate the influence of different units, avoid the gradient descent bias towards high numerical features due to dimensional differences, and enhance the computational efficiency of the model. The normalization formula is as follows:
$R_{norm} = \dfrac{R - R_{\min}}{R_{\max} - R_{\min}}$
where $R_{norm}$ is the normalized value, $R$ is the original value, and $R_{\max}$ and $R_{\min}$ represent the maximum and minimum values in the dataset, respectively.
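A minimal sketch of the preprocessing is given below, applying the min-max formula above and assembling input samples under the assumption that the previous 168 hours of meteorological factors and load form each input sequence for the next-hour target.

```python
import numpy as np

def minmax_normalize(data):
    """Min-max normalization following the formula above (min/max over the dataset)."""
    d_min, d_max = data.min(axis=0), data.max(axis=0)
    return (data - d_min) / (d_max - d_min), (d_min, d_max)

def make_samples(met_factors, load, lag=168):
    """Build (168-hour input window, next-hour load) pairs from features and load."""
    features = np.column_stack([met_factors, load])           # shape (T, n_features + 1)
    X = np.stack([features[t - lag:t] for t in range(lag, len(load))])
    y = np.asarray(load)[lag:]
    return X, y                                                # X: (N, 168, n_features + 1)
```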

3.3. Comparative Experiment Design

To comprehensively evaluate the effectiveness of the proposed hybrid prediction model BiGRU-GAM-GPR in this study, six comparison models, namely BiGRU-GPR, BiGRU-GAM, BiGRU, GRU, LSTM, and BiLSTM, were employed. Two sets of comparative experiments were conducted. The first set focused on deterministic prediction results, comparing the point prediction outcomes of the proposed model with those of the aforementioned six comparison models. The second set focused on probabilistic prediction results, comparing the probabilistic prediction outcomes of BiGRU-GPR with those of the proposed model under different confidence intervals. These two sets of comparative experiments were designed to verify the effectiveness of the proposed model and the role of the global attention mechanism in enhancing prediction performance.
To ensure the fairness of the experiments, the number and size of the neural layers, the training batch size and number of epochs, and the learning rate were kept consistent across all models. For deterministic prediction, the best-performing results were selected, while for probabilistic prediction, the average of the results from five runs was used as the final outcome. This study uses the GPy 1.9.9 library in Python 3.9 to implement the GPR model, with the RBF function as the kernel function. The hyperparameter settings for all models are shown in Table 1.
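A minimal sketch of the GPR stage using GPy 1.9.9 with an RBF kernel, as described above, is shown below; the array shapes and the 95% interval multiplier are illustrative choices, and the actual hyperparameter settings are those listed in Table 1.

```python
import numpy as np
import GPy

def fit_gpr_stage(prelim_train, y_train, prelim_test):
    """Fit GPR on stage-1 predictions vs. observed loads, then predict on the test part."""
    X = np.asarray(prelim_train, float).reshape(-1, 1)        # BiGRU-GAM training predictions
    Y = np.asarray(y_train, float).reshape(-1, 1)             # observed training loads
    model = GPy.models.GPRegression(X, Y, GPy.kern.RBF(input_dim=1))
    model.optimize(messages=False)                            # fit kernel hyperparameters
    mean, var = model.predict(np.asarray(prelim_test, float).reshape(-1, 1))
    std = np.sqrt(var)
    return mean.ravel(), (mean - 1.96 * std).ravel(), (mean + 1.96 * std).ravel()
```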

4. Results

4.1. Deterministic Prediction Results

The deterministic prediction results of the proposed model and the comparison models in this study are shown in Table 2, Table 3 and Table 4. Overall, the BiGRU-GAM-GPR model proposed in this study outperformed all other experimental models, achieving the best results in all four evaluation metrics. This preliminarily validates the significant advantage of the model in load forecasting.
Specifically, among the basic deep learning algorithms, BiGRU achieved the best prediction performance compared with GRU, LSTM, and BiLSTM. Compared with the worst results, in dataset A, the RMSE, MAE, and MAPE of BiGRU were reduced by 17.46%, 21.05%, and 21.28%, respectively, and the R2 was increased by 1.28%. Similarly, in dataset B, the RMSE, MAE, and MAPE were reduced by 18.29%, 20.79%, and 23.66%, respectively, and the R2 was increased by 1.73%. In dataset C, the RMSE, MAE, and MAPE were reduced by 5.41%, 7.76%, and 8.90%, respectively, and the R2 was increased by 0.68%. This is attributed to the bidirectional recurrent mechanism employed by BiGRU, which significantly enhances the model’s ability to capture the temporal dependencies in load time series. Additionally, the structural adjustments of the GRU units reduce the number of parameters and improve computational efficiency, making BiGRU more suitable for load time series prediction.
Compared with BiGRU, in dataset A, the RMSE, MAE, and MAPE of BiGRU-GAM were reduced by 16.12%, 15.50%, and 15.68%, respectively, and the R2 was increased by 0.81%. In dataset B, the RMSE was reduced by 4.11%, and the R2 was increased by 1.73%. In dataset C, the RMSE, MAE, and MAPE were reduced by 2.41%, 2.53%, and 1.17%, respectively, and the R2 was increased by 0.27%. Among the hybrid models, compared with BiGRU-GPR, the proposed BiGRU-GAM-GPR with a global attention mechanism achieved reductions in RMSE, MAE, and MAPE by 11.24%, 10.57%, and 11.03%, respectively, and an increase in R2 by 0.35% in dataset A; reductions in RMSE, MAE, and MAPE by 2.89%, 2.80%, and 2.90%, respectively, and an increase in R2 by 0.19% in dataset B; and reductions in RMSE, MAE, and MAPE by 8.79%, 9.79%, and 10.40%, respectively, and an increase in R2 by 0.90% in dataset C. This result indicates that the introduction of the global attention mechanism (GAM) can enhance prediction performance. This is because GAM enables BiGRU to dynamically focus on the most critical features or time intervals for load forecasting by calculating attention weights for features at each time step in the input sequence, thereby improving the model’s perception of spatial features in multi-feature sequences. When processing mixed inputs of load data and meteorological factors, GAM quantifies the contribution of different features to load forecasting. Although BiGRU captures bidirectional temporal dependencies, long sequences may contain redundant information, such as stable load data during non-peak periods. GAM suppresses interference from secondary information through weight allocation, mitigates the gradient vanishing problem, and directs the model to focus on fluctuation characteristics during peak and valley periods. The weighted aggregation process of GAM serves as an implicit feature fusion mechanism, nonlinearly combining hidden states from different layers of BiGRU to generate more discriminative feature representations, thereby enhancing the prediction accuracy of the hybrid model.
Compared with BiGRU-GAM, in dataset A, the RMSE, MAE, and MAPE of BiGRU-GAM-GPR were reduced by 17.90%, 21.59%, and 22.44%, respectively, and the R2 was increased by 0.61%. In dataset B, the RMSE, MAE, and MAPE were reduced by 3.28%, 10.56%, and 14.10%, respectively, and the R2 was increased by 0.20%. In dataset C, the RMSE, MAE, and MAPE were reduced by 6.26%, 7.90%, and 11.46%, respectively, and the R2 was increased by 1.03%. This result demonstrates that the hybrid model first captures temporal dependencies through the bidirectional architecture of BiGRU, then forms a collaborative mechanism of “temporal modeling–feature weighting” via GAM to perform feature dimension reduction and temporal feature extraction on raw data. The output features have captured complex load patterns through nonlinear transformations, enabling GPR to fit nonlinear relationships without relying on strongly assumed kernel functions. By reducing the input dimension of GPR, the model allows GPR to focus on modeling residual uncertainties rather than repeatedly learning temporal patterns. GPR further excavates the intrinsic connections in the data, avoids overfitting and underfitting, and reduces the impact of noise, thereby enhancing prediction performance.
Table 5, Table 6 and Table 7 and Figure 5, Figure 6 and Figure 7 illustrate the improvements of the BiGRU-GAM-GPR model over the comparison models in hourly-scale electricity load forecasting across various metrics. These tables and figures clearly demonstrate that the proposed model in this study exhibits significant enhancements in electricity load forecasting, outperforming the other models.
To further validate the performance of the proposed model in extreme value prediction, representative load processes with a time span of 72 h were selected from each of the three datasets, as shown in Figure 8, Figure 9 and Figure 10. Figure 8 explores the performance in peak value prediction, and Figure 9 and Figure 10 explore the performance in valley value prediction. Among them, Figure 9 represents the lowest valley value, and Figure 10 represents the secondary valley value. The closer the model’s predicted value is to the actual observed value, the better the prediction effect of the model. As can be seen from the figures, most models are capable of reflecting the general trends in load variations. This is particularly evident in dataset A, while the performance in datasets B and C is slightly inferior, with some models failing to accurately predict certain extreme values. This discrepancy may be attributed to the fact that dataset A encompasses an entire year of data, enabling the model to more effectively capture the overall annual trends and peak-to-valley variations in load. In dataset C, the standard deviation of the load is significantly higher than that in dataset A, indicating a notable increase in load volatility. The initial training period encompasses extremely high-temperature intervals and a greater number of holidays, leading to irregular abrupt changes. Consequently, the model’s ability to fit outliers is inferior to that achieved with dataset A.
From the enlarged views on the right side of the figures, it is evident that the proposed model in this study achieves the best performance in predicting both peaks and valleys. The predicted results exhibit the highest degree of consistency with the actual observed values. The proposed model also demonstrates a significantly better performance at extreme points and inflection points compared to the comparison models. Therefore, the proposed model exhibits a superior performance and distinct advantages in the field of load forecasting.

4.2. Probabilistic Prediction Results

Probabilistic predictions were conducted using BiGRU-GPR and BiGRU-GAM-GPR, and the results are presented in Table 8. As shown in the table, for dataset A, the average CRPS of BiGRU-GAM-GPR was reduced by 9.88%, the MPIW decreased by 3.90% under the 95% and 80% confidence intervals, and the average PICP increased by 1.46%. Similarly, for dataset B, the average CRPS was reduced by 2.65%, the MPIW decreased by 1.83% under the 95% and 80% confidence intervals, and the average PICP increased by 1.37%. For dataset C, the average PICP increased by 0.87%.
In dataset C, BiGRU-GPR outperformed BiGRU-GAM-GPR in terms of the CRPS and MPIW metrics. This may be attributed to the fact that CRPS evaluates the overall accuracy of the probability distribution and is sensitive to both the shape and location of the distribution. The global attention mechanism, when applied to dataset C, may have introduced biases, such as misfitting nonlinear relationships. Although the prediction intervals covered the true values, the overall distribution was not precise enough. Although the PICP was improved by increasing the interval width, that is, increasing the corresponding MPIW metric value, the overall accuracy of the probability distribution was sacrificed, resulting in the CRPS performance of the proposed model being inferior to that of the comparison model.
As demonstrated in the aforementioned study, the BiGRU-GAM-GPR model provides more accurate probabilistic load forecasting results compared with the BiGRU-GPR model. Figure 11, Figure 12 and Figure 13 illustrate the interval prediction results of the proposed model for the 72 h representative load profiles discussed in the previous section. These results encompass the majority of the load observations, thereby further validating the role of the global attention mechanism in enhancing prediction performance. Moreover, the incorporation of Gaussian process regression extends deterministic load forecasting results to probabilistic interval predictions, which better reflects the uncertainty associated with load forecasting. This approach also provides model prediction results that are applicable for decision-makers in practical engineering contexts.

5. Discussion

From the results, it can be seen that the model has high-precision power load forecasting capabilities, which can help power system operators to grasp the changing trends of electricity demand in different periods in advance. By obtaining accurate load information, operators can plan power generation schedules more scientifically, reasonably dispatch the power generation of traditional energy sources such as thermal, hydropower, and nuclear power, and reduce energy waste caused by excessive or insufficient power generation, thereby significantly reducing energy losses on both the power generation and transmission sides. In the power resource allocation process, precise load forecasting enables more reasonable delivery of power resources to different regions and user groups, avoiding resource idling or shortages caused by supply–demand mismatches, and effectively improving the overall utilization efficiency of power resources.
Meanwhile, as the proportion of intermittent renewable energy sources such as wind power and photovoltaic power in the power supply structure continues to rise, the instability and volatility of their power generation pose significant challenges to the stable operation of the power grid and power consumption. The model can help the grid to better balance power supply and demand through precise interval forecasting of power loads combined with renewable energy generation data. On the one hand, when renewable energy generation is excessive, the model’s forecasting results can guide the grid to adjust the operation strategy of energy storage equipment in a timely manner, storing excess electrical energy and avoiding wind and photovoltaic curtailment; on the other hand, when renewable energy generation is insufficient, advance knowledge of load demand can help the grid quickly deploy other energy sources for supplementation, ensuring the stability and reliability of the power supply. This efficient power dispatching and management model greatly promotes the effective consumption of renewable energy, reduces dependence on traditional fossil energy, and further drives the low-carbon and clean transformation of the entire power grid, providing strong technical support for achieving sustainable development.

6. Conclusions

This paper proposes a hybrid probabilistic forecasting model for short-term electricity load considering uncertainty, named BiGRU-GAM-GPR, which integrates a bidirectional gated recurrent unit, global attention mechanism, and Gaussian process regression. The model aims to achieve more accurate point predictions of electricity load and reliable probabilistic results. Initially, historical load data and meteorological data are fed into the BiGRU optimized by the global attention mechanism to obtain preliminary prediction results. Subsequently, the preliminary prediction results are divided into a new training set and a test set. The new training set is combined with the original training set of observed values to serve as the input training set for GPR, while the new test set derived from the preliminary prediction results is used as the test set for GPR. The final prediction results are obtained through the GPR prediction.
To validate the effectiveness of the proposed model in short-term load forecasting, this study first compared BiGRU with commonly used basic models (GRU, LSTM, and BiLSTM). BiGRU achieved the best prediction performance across three real-world electricity load datasets, with maximum improvements of 1.73% in R2, 18.29% in RMSE, 21.05% in MAE, and 23.66% in MAPE. In the deterministic prediction experiments, the proposed model BiGRU-GAM-GPR was compared with six comparison models (BiGRU-GPR, BiGRU-GAM, BiGRU, GRU, LSTM, and BiLSTM). The results indicated that the proposed model achieved maximum improvements of 2.73% in R2, 43.16% in RMSE, 47.69% in MAE, and 48.51% in MAPE. In the probabilistic prediction experiments, the proposed model was compared with BiGRU-GPR, and the results demonstrated that the proposed model had superior interval prediction capabilities, covering the vast majority of observed values and showing a better performance in terms of PICP, CRPS, and MPIW. The above results lead to the following conclusions:
(1)
BiGRU demonstrates a strong capability of capturing the temporal dependencies within load time series, making it more suitable for addressing short-term load forecasting problems compared with other commonly used deep learning models.
(2)
By incorporating the global attention mechanism, the model is able to focus on the most important features within the sequence, thereby enhancing its ability to perceive spatial features in multi-feature sequences. This indicates that the global attention mechanism plays a positive role in improving the model’s prediction performance.
(3)
The GPR model further explores the intrinsic relationships within the data by extending deterministic prediction results to probabilistic outcomes. It adaptively fits the nonlinear relationships in the data, thereby avoiding overfitting and underfitting and reducing the impact of noise, which ultimately enhances the prediction performance.
(4)
The proposed BiGRU-GAM-GPR model demonstrates a superior performance in both deterministic and probabilistic predictions, thereby validating its practical value and robustness in short-term electricity load forecasting. This model provides guidance for the integration and grid connection of new energy sources as well as participation in market competition.
However, this paper only uses meteorological factors as external factors influencing short-term load changes. In short-term load forecasting, social factors such as electricity price fluctuations and economic indicators, as well as date-related factors, can also be introduced, and the contribution of each factor can be quantified through interpretability tools. Currently, the model handles uncertainty based on Gaussian process assumptions. Future research could explore nonparametric methods to characterize the nonlinear dependence between different variables, thereby enhancing the adaptability of probabilistic forecasting to complex distributions. When evaluating the effectiveness of probability forecasting, future research will consider how to select more diverse evaluation metrics to comprehensively assess the predictive performance of the model. Moreover, further improving the accuracy of the overall probability distribution of electricity load and enhancing the ability to capture trends are issues that need to be addressed. Therefore, in future research, a more comprehensive set of influencing factors will be explored and integrated into the model to enhance its short-term load forecasting capabilities.

Author Contributions

Conceptualization, R.B. and L.M.; methodology, R.B.; software, K.F.; validation, Q.S. and K.F.; formal analysis, S.L.; investigation, S.L.; resources, S.L.; data curation, Q.S.; writing—original draft preparation, R.B.; writing—review and editing, R.B. and Q.S.; visualization, W.X.; supervision, L.M.; project administration, Q.S. and S.L.; funding acquisition, L.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 52379011) and the Fundamental Research Funds for the Central Universities (YCJJ20242210).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are unavailable due to privacy restrictions.

Conflicts of Interest

Authors Qizhuan Shao and Kaixiang Fu were employed by the company Yunnan Power Grid Co., Ltd. Author Shuangquan Liu was employed by the company China Southern Power Grid Lancang-Mekong International Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
BiGRU: Bidirectional gated recurrent unit
GAM: Global attention mechanism
GPR: Gaussian process regression
GRU: Gated recurrent unit
LSTM: Long short-term memory
BiLSTM: Bidirectional long short-term memory

References

  1. Abosedra, S.; Dah, A.; Ghosh, S. Electricity consumption and economic growth, the case of Lebanon. Appl. Energy 2009, 86, 429–432. [Google Scholar] [CrossRef]
  2. Adam, N.R.B.; Elahee, M.K.; Dauhoo, M.Z. Forecasting of peak electricity demand in Mauritius using the non-homogeneous Gompertz diffusion process. Energy 2011, 36, 6763–6769. [Google Scholar] [CrossRef]
  3. Ji, L.; Zhang, B.; Huang, G.; Xie, Y.; Niu, D. GHG-mitigation oriented and coal-consumption constrained inexact robust model for regional energy structure adjustment A case study for Jiangsu Province, China. Renew. Energy 2018, 123, 549–562. [Google Scholar] [CrossRef]
  4. Ruggles, T.H.; Dowling, J.A.; Lewis, N.S.; Caldeira, K. Opportunities for flexible electricity loads such as hydrogen production from curtailed generation. Adv. Appl. Energy 2021, 3, 100051. [Google Scholar] [CrossRef]
  5. He, W.; King, M.; Luo, X.; Dooner, M.; Li, D.; Wang, J. Technologies and economics of electric energy storages in power systems: Review and perspective. Adv. Appl. Energy 2021, 4, 100060. [Google Scholar] [CrossRef]
  6. Sabadini, F.; Madlener, R. The economic potential of grid defection of energy prosumer households in Germany. Adv. Appl. Energy 2021, 4, 100075. [Google Scholar] [CrossRef]
  7. He, F.; Zhou, J.; Feng, Z.; Liu, G.; Yang, Y. A hybrid short-term load forecasting model based on variational mode decomposition and long short-term memory networks considering relevant factors with Bayesian optimization algorithm. Appl. Energy 2019, 237, 103–116. [Google Scholar] [CrossRef]
  8. Zhang, X.; Wang, J.; Zhang, K. Short-term electric load forecasting based on singular spectrum analysis and support vector machine optimized by Cuckoo search algorithm. Electr. Power. Syst. Res. 2017, 146, 270–285. [Google Scholar] [CrossRef]
  9. Feng, C.; Wang, Y.; Chen, Q.; Ding, Y.; Strbac, G.; Kang, C. Smart grid encounters edge computing: Opportunities and applications. Adv. Appl. Energy 2021, 1, 100006. [Google Scholar] [CrossRef]
  10. Aslam, S.; Herodotou, H.; Mohsin, S.M.; Javaid, N.; Ashraf, N.; Aslam, S. A survey on deep learning methods for power load and renewable energy forecasting in smart microgrids. Renew. Sust. Energy Rev. 2021, 144, 110992. [Google Scholar] [CrossRef]
  11. Pramanik, A.S.; Sepasi, S.; Nguyen, T.; Roose, L. An ensemble-based approach for short-term load forecasting for buildings with high proportion of renewable energy sources. Energy Build. 2024, 308, 113996. [Google Scholar] [CrossRef]
  12. Waheed, W.; Xu, Q. Data-driven short term load forecasting with deep neural networks: Unlocking insights for sustainable energy management. Electr. Power Syst. Res. 2024, 232, 110376. [Google Scholar] [CrossRef]
  13. Yang, Z.; Ce, L.; Lian, L. Electricity price forecasting by a hybrid model, combining wavelet transform, ARMA and kernel-based extreme learning machine methods. Appl. Energy 2017, 190, 291–305. [Google Scholar] [CrossRef]
  14. de Oliveira, E.M.; Oliveira, F.L.C. Forecasting mid-long term electric energy consumption through bagging ARIMA and exponential smoothing methods. Energy 2018, 144, 776–788. [Google Scholar] [CrossRef]
  15. Li, J.; Deng, D.; Zhao, J.; Cai, D.; Hu, W.; Zhang, M.; Huang, Q. A Novel Hybrid Short-Term Load Forecasting Method of Smart Grid Using MLR and LSTM Neural Network. IEEE Trans. Ind. Inform. 2021, 17, 2443–2452. [Google Scholar] [CrossRef]
  16. Yang, D. On post-processing day-ahead NWP forecasts using Kalman filtering. Sol. Energy 2019, 182, 179–181. [Google Scholar] [CrossRef]
  17. Ye, J.; Dang, Y.; Yang, Y. Forecasting the multifactorial interval grey number sequences using grey relational model and GM (1, N) model based on effective information transformation. Soft Comput. 2020, 24, 5255–5269. [Google Scholar] [CrossRef]
  18. Qiu, X.; Suganthan, P.N.; Amaratunga, G.A.J. Ensemble incremental learning Random Vector Functional Link network for short-term electric load forecasting. Knowl-Based Syst. 2018, 145, 182–196. [Google Scholar] [CrossRef]
  19. Jain, R.; Mahajan, V. Load forecasting and risk assessment for energy market with renewable based distributed generation. Renew. Energy Focus 2022, 42, 190–205. [Google Scholar] [CrossRef]
  20. Chitsaz, H.; Shaker, H.; Zareipour, H.; Wood, D.; Amjady, N. Short-term electricity load forecasting of buildings in microgrids. Energy Build. 2015, 99, 50–60. [Google Scholar] [CrossRef]
  21. Hafeez, G.; Khan, I.; Jan, S.; Shah, I.A.; Khan, F.A.; Derhab, A. A novel hybrid load forecasting framework with intelligent feature engineering and optimization algorithm in smart grid. Appl. Energy 2021, 299, 117178. [Google Scholar] [CrossRef]
  22. Mughees, N.; Mohsin, S.A.; Mughees, A.; Mughees, A. Deep sequence to sequence Bi-LSTM neural networks for day-ahead peak load forecasting. Expert. Syst. Appl. 2021, 175, 114844. [Google Scholar] [CrossRef]
  23. Zhang, Z.; Hong, W.; Li, J. Electric Load Forecasting by Hybrid Self-Recurrent Support Vector Regression Model with Variational Mode Decomposition and Improved Cuckoo Search Algorithm. IEEE Access 2020, 8, 14642–14658. [Google Scholar] [CrossRef]
  24. Fan, G.; Han, Y.; Li, J.; Peng, L.; Yeh, Y.; Hong, W. A hybrid model for deep learning short-term power load forecasting based on feature extraction statistics techniques. Expert Syst. Appl. 2024, 238, 122012. [Google Scholar] [CrossRef]
Figure 1. Structure of the GRU unit.
Figure 2. Structure of the BiGRU model.
Figure 3. Flowchart of the proposed load forecasting model.
Figure 4. Research dataset. (a) Dataset A. (b) Dataset B. (c) Dataset C.
Figure 5. Improvements of the proposed model over the comparison models in dataset A.
Figure 6. Improvements of the proposed model over the comparison models in dataset B.
Figure 7. Improvements of the proposed model over the comparison models in dataset C.
Figure 8. Comparison of 72 h representative load profiles for different models in dataset A.
Figure 9. Comparison of 72 h representative load profiles for different models in dataset B.
Figure 10. Comparison of 72 h representative load profiles for different models in dataset C.
Figure 11. Interval coverage plots of the proposed model at 80% and 95% confidence levels in dataset A.
Figure 12. Interval coverage plots of the proposed model at 80% and 95% confidence levels in dataset B.
Figure 13. Interval coverage plots of the proposed model at 80% and 95% confidence levels in dataset C.
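Figures 11–13 report interval coverage at the 80% and 95% confidence levels. The sketch below illustrates how such intervals can be derived from a Gaussian predictive distribution by fitting a GPR to the first-stage point forecasts; scikit-learn's GaussianProcessRegressor, the RBF-plus-noise kernel, and the placeholder arrays are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Placeholder data standing in for the first-stage (BiGRU-GAM) point forecasts
# and the corresponding observed loads; shapes and values are illustrative.
rng = np.random.default_rng(0)
stage1_train = rng.random((500, 1))   # network outputs on the training split
y_train = rng.random(500)             # observed load on the training split
stage1_test = rng.random((72, 1))     # network outputs on a 72 h test window

# Second stage: GPR maps the point forecasts to a Gaussian predictive distribution.
gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gpr.fit(stage1_train, y_train)
mean, std = gpr.predict(stage1_test, return_std=True)

# Deterministic forecast = predictive mean; intervals from Gaussian quantiles.
z80, z95 = 1.2816, 1.9600
interval_80 = (mean - z80 * std, mean + z80 * std)
interval_95 = (mean - z95 * std, mean + z95 * std)
```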
Table 1. Hyperparameters of each model.

Study Case | Models | Hyperparameters
Case A | GRU | num layers = 2; hidden size = 64, 128; learning rate = 0.001; batch size = 64; epoch = 100
Case A | LSTM | num layers = 2; hidden size = 128, 64; learning rate = 0.001; batch size = 64; epoch = 100
Case A | BiLSTM | Same as LSTM
Case A | BiGRU | Same as GRU
Case A | BiGRU-GAM | Same as GRU
Case A | BiGRU-GPR | num layers = 2; hidden size = 64, 128; learning rate = 0.001; batch size = 64; epoch = 100
Case A | BiGRU-GAM-GPR | num layers = 2; hidden size = 64, 128; learning rate = 0.001; batch size = 64; epoch = 100
Case B | GRU | num layers = 2; hidden size = 64, 128; learning rate = 0.001; batch size = 64; epoch = 100
Case B | LSTM | num layers = 2; hidden size = 128, 64; learning rate = 0.001; batch size = 64; epoch = 100
Case B | BiLSTM | Same as LSTM
Case B | BiGRU | Same as GRU
Case B | BiGRU-GAM | Same as GRU
Case B | BiGRU-GPR | num layers = 2; hidden size = 64, 128; learning rate = 0.001; batch size = 64; epoch = 100
Case B | BiGRU-GAM-GPR | num layers = 2; hidden size = 64, 128; learning rate = 0.001; batch size = 64; epoch = 100
Case C | GRU | num layers = 2; hidden size = 64, 128; learning rate = 0.001; batch size = 64; epoch = 100
Case C | LSTM | num layers = 2; hidden size = 128, 64; learning rate = 0.003; batch size = 64; epoch = 100
Case C | BiLSTM | Same as LSTM
Case C | BiGRU | Same as GRU
Case C | BiGRU-GAM | Same as GRU
Case C | BiGRU-GPR | num layers = 2; hidden size = 64, 128; learning rate = 0.001; batch size = 64; epoch = 100
Case C | BiGRU-GAM-GPR | num layers = 2; hidden size = 64, 128; learning rate = 0.001; batch size = 64; epoch = 100
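For reference, the BiGRU settings in Table 1 (two stacked layers with hidden sizes 64 and 128, learning rate 0.001) could be realized in PyTorch roughly as in the minimal sketch below; the input feature dimension, forecast horizon, and class name are illustrative assumptions, and the GAM and GPR stages are omitted.

```python
import torch
import torch.nn as nn

class BiGRUForecaster(nn.Module):
    """Minimal sketch of a two-layer BiGRU point forecaster (Table 1 settings).

    Layer count, hidden sizes, and learning rate follow Table 1; the input
    feature dimension and forecast horizon are illustrative assumptions.
    """

    def __init__(self, n_features: int = 5, horizon: int = 1):
        super().__init__()
        # Two stacked bidirectional GRU layers; a bidirectional layer doubles
        # its output width, so the second layer takes 2 * 64 inputs.
        self.gru1 = nn.GRU(n_features, 64, batch_first=True, bidirectional=True)
        self.gru2 = nn.GRU(2 * 64, 128, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * 128, horizon)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time steps, n_features)
        h, _ = self.gru1(x)            # -> (batch, time, 128)
        h, _ = self.gru2(h)            # -> (batch, time, 256)
        return self.head(h[:, -1, :])  # last time step -> (batch, horizon)

model = BiGRUForecaster()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # learning rate = 0.001
loss_fn = nn.MSELoss()  # batch size = 64 and epoch = 100 belong in the training loop
```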
Table 2. Performance evaluation results of all models on dataset A.

Models | RMSE | MAE | MAPE | R²
GRU | 458.86 | 388.36 | 2.00% | 0.9700
LSTM | 521.30 | 457.76 | 2.35% | 0.9613
BiLSTM | 491.12 | 409.38 | 2.09% | 0.9657
BiGRU | 430.28 | 361.39 | 1.85% | 0.9736
BiGRU-GAM | 360.90 | 305.36 | 1.56% | 0.9815
BiGRU-GPR | 333.81 | 267.74 | 1.36% | 0.9841
BiGRU-GAM-GPR | 296.29 | 239.44 | 1.21% | 0.9875
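For readability, the four deterministic indices reported in Tables 2–4 are restated below under their standard definitions (a hedged restatement; the paper's own evaluation-metric section governs the exact forms), with $y_i$ the observed load, $\hat{y}_i$ the forecast, $\bar{y}$ the mean observed load, and $N$ the number of test samples:

$$\mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i-\hat{y}_i)^2},\qquad \mathrm{MAE}=\frac{1}{N}\sum_{i=1}^{N}\lvert y_i-\hat{y}_i\rvert,$$
$$\mathrm{MAPE}=\frac{100\%}{N}\sum_{i=1}^{N}\left\lvert\frac{y_i-\hat{y}_i}{y_i}\right\rvert,\qquad R^2=1-\frac{\sum_{i=1}^{N}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{N}(y_i-\bar{y})^2}.$$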
Table 3. Performance evaluation results of all models on dataset B.

Models | RMSE | MAE | MAPE | R²
GRU | 520.27 | 403.84 | 1.75% | 0.9506
LSTM | 460.72 | 357.14 | 1.59% | 0.9612
BiLSTM | 510.02 | 408.32 | 1.86% | 0.9525
BiGRU | 425.11 | 323.45 | 1.42% | 0.9670
BiGRU-GAM | 407.62 | 341.90 | 1.56% | 0.9697
BiGRU-GPR | 406.02 | 314.60 | 1.38% | 0.9698
BiGRU-GAM-GPR | 394.27 | 305.79 | 1.34% | 0.9716
Table 4. Performance evaluation results of all models on dataset C.

Models | RMSE | MAE | MAPE | R²
GRU | 605.74 | 519.50 | 2.81% | 0.9396
LSTM | 602.01 | 486.94 | 2.66% | 0.9403
BiLSTM | 606.78 | 496.49 | 2.72% | 0.9394
BiGRU | 573.94 | 479.17 | 2.56% | 0.9458
BiGRU-GAM | 560.13 | 467.04 | 2.53% | 0.9484
BiGRU-GPR | 575.68 | 476.81 | 2.50% | 0.9497
BiGRU-GAM-GPR | 525.06 | 430.15 | 2.24% | 0.9582
Table 5. Improvement of BiGRU-GAM-GPR over the comparison models on dataset A.

Models | RMSE | MAE | MAPE | R²
GRU | 35.43% | 38.35% | 39.51% | 1.80%
LSTM | 43.16% | 47.69% | 48.51% | 2.73%
BiLSTM | 39.67% | 41.51% | 42.07% | 2.26%
BiGRU | 31.14% | 33.74% | 34.34% | 1.43%
BiGRU-GAM | 17.90% | 21.59% | 22.38% | 0.61%
BiGRU-GPR | 11.24% | 10.57% | 10.63% | 0.35%
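The percentages in Tables 5–7 are consistent with the usual relative-improvement formulas (a hedged reconstruction; error metrics decrease while R² increases), with the subscripts "prop" and "cmp" denoting BiGRU-GAM-GPR and the comparison model:

$$P_{\mathrm{RMSE}}=\frac{\mathrm{RMSE}_{\mathrm{cmp}}-\mathrm{RMSE}_{\mathrm{prop}}}{\mathrm{RMSE}_{\mathrm{cmp}}}\times 100\%,\qquad P_{R^2}=\frac{R^2_{\mathrm{prop}}-R^2_{\mathrm{cmp}}}{R^2_{\mathrm{cmp}}}\times 100\%,$$

with MAE and MAPE treated analogously to RMSE. For example, for GRU on dataset A, $(458.86-296.29)/458.86\approx 35.43\%$, which matches the first entry of Table 5.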
Table 6. Improvement of BiGRU-GAM-GPR over the comparison models on dataset B.

Models | RMSE | MAE | MAPE | R²
GRU | 24.22% | 24.28% | 23.31% | 2.21%
LSTM | 14.42% | 14.38% | 15.58% | 1.08%
BiLSTM | 22.69% | 25.11% | 27.93% | 2.01%
BiGRU | 7.25% | 5.46% | 5.22% | 0.48%
BiGRU-GAM | 3.27% | 10.56% | 13.69% | 0.20%
BiGRU-GPR | 2.89% | 2.80% | 3.03% | 0.19%
Table 7. Improvement of BiGRU-GAM-GPR over the comparison models on dataset C.

Models | RMSE | MAE | MAPE | R²
GRU | 13.32% | 17.20% | 20.14% | 1.98%
LSTM | 12.78% | 11.66% | 15.67% | 1.90%
BiLSTM | 13.47% | 13.36% | 17.45% | 2.00%
BiGRU | 8.52% | 10.23% | 12.29% | 1.31%
BiGRU-GAM | 6.26% | 7.90% | 11.32% | 1.03%
BiGRU-GPR | 8.79% | 9.79% | 10.32% | 0.90%
Table 8. Probabilistic prediction results.

Index | Statistic | Case A, BiGRU-GPR | Case A, BiGRU-GAM-GPR | Case B, BiGRU-GPR | Case B, BiGRU-GAM-GPR | Case C, BiGRU-GPR | Case C, BiGRU-GAM-GPR
CRPS | min | 151.128 | 135.791 | 219.621 | 213.776 | 236.609 | 253.609
CRPS | mean | 151.132 | 136.204 | 219.716 | 213.899 | 236.944 | 254.064
CRPS | max | 151.138 | 136.337 | 219.760 | 213.959 | 237.043 | 254.694
PICP | min | 0.891 | 0.904 | 0.872 | 0.885 | 0.916 | 0.923
PICP | mean | 0.891 | 0.904 | 0.873 | 0.885 | 0.917 | 0.925
PICP | max | 0.891 | 0.904 | 0.874 | 0.885 | 0.917 | 0.928
MPIW | 95%CL | 732.865 | 704.306 | 1023.779 | 1005.092 | 1197.047 | 1350.239
MPIW | 80%CL | 374.985 | 360.372 | 523.836 | 514.275 | 612.493 | 690.876

In the table, 95%CL and 80%CL represent the 95% and 80% confidence levels, respectively.
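For reference, the probabilistic indices in Table 8 are restated below under their standard definitions (a hedged restatement), with $[L_i,U_i]$ the prediction interval for sample $i$ at the chosen confidence level:

$$\mathrm{PICP}=\frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\{L_i\le y_i\le U_i\},\qquad \mathrm{MPIW}=\frac{1}{N}\sum_{i=1}^{N}(U_i-L_i).$$

For a Gaussian predictive distribution $\mathcal{N}(\mu_i,\sigma_i^2)$, such as that produced by GPR, the continuous ranked probability score has the closed form

$$\mathrm{CRPS}=\frac{1}{N}\sum_{i=1}^{N}\sigma_i\left[z_i\bigl(2\Phi(z_i)-1\bigr)+2\varphi(z_i)-\frac{1}{\sqrt{\pi}}\right],\qquad z_i=\frac{y_i-\mu_i}{\sigma_i},$$

where $\Phi$ and $\varphi$ are the standard normal CDF and PDF. Lower CRPS and MPIW, together with a PICP close to the nominal confidence level, indicate better probabilistic forecasts.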