Short- and Medium-Term Power Demand Forecasting with Multiple Factors Based on Multi-Model Fusion

With the continuous development of the economy and society, power demand forecasting has become an important task for the power industry. Accurate power demand forecasting supports the operation and development of the power supply industry. However, because power consumption is affected by many factors, it is difficult to predict power demand accurately. With the accumulation of data in the power industry, machine learning has shown great potential in power demand forecasting. In this study, gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost) and light gradient boosting machine (LightGBM) are integrated by stacking to build an XLG-LR fusion model to predict power demand. First, 13 months of electricity and meteorological data were preprocessed. Second, the hyperparameters of each model were tuned. Third, based on the optimal hyperparameter configuration, a prediction model was built using the training set (70% of the data). Finally, the test set (30% of the data) was used to evaluate the performance of each model. Mean absolute error (MAE), root mean square error (RMSE), mean absolute percentage error (MAPE), and the goodness-of-fit coefficient (R^2) were used to analyze each model over different time spans, including seasonal, weekly, and monthly forecasts. Furthermore, the proposed fusion model was compared with neural network models such as GRU, LSTM and TCN. The results showed that the XLG-LR model achieved the best prediction results over the different time spans while requiring less time than the neural network models. This method can provide a more reliable reference for the operation and dispatch of power enterprises and for future power construction and planning.


Introduction
Electricity is one of the most important basic energy sources in the world. It can provide basic support for industrial production and processing, and sustain people's daily life. Since there is no high-quality storage carrier for electric energy at this stage, low storage efficiency occurs when battery packs or pumped energy storage power stations are solely data in a short period of time, and significantly improve the prediction accuracy, which has obvious advantages.
At present, the methods adopted by domestic researchers more often involve neural networks [12][13][14], support vector machines [15], and joint models [16]. Reference [17] proposes a power demand forecasting model based on a second-order gray neural network. First, the wavelet sequence is used to make the original data set stationary, and then the power demand is predicted using a second-order gray neural network. Reference [18] adopts a gray neural network model to predict the power demand and obtains a relatively good prediction effect. Reference [19] proposes an LSSVM_PSO model for power demand forecasting. The model uses a particle swarm optimization algorithm to adjust the learning rate, which reduces the prediction error of the support vector machine and improves its reliability. Compared with the plain least squares support vector machine, this method achieves faster convergence and better prediction performance. Reference [20] combines the feedback of the neural network and ARMA models to predict the power generation of wind power plants, and this model achieves high accuracy and interpretability. Reference [21] proposes a power load forecasting model based on extreme gradient boosting to address the difficulty that traditional forecasting models have in handling massive data when power data grows exponentially. Through the analysis of meteorological factors and the long-term regularity of the daily power load, the model achieves higher prediction accuracy and smoother prediction error than traditional machine learning algorithms. Reference [22] combines the XGBoost and ARMA models and uses the power consumption data of enterprise users to make predictions. A series of comparative experiments shows that this method achieves more accurate prediction results than traditional methods.
Through the above analysis, and in view of the problem that short- and medium-term power data is less informative and difficult to predict, this paper considers the impact of meteorological factors on power consumption and integrates LightGBM, XGBoost and GBDT, fully exploring the correlation between electricity demand and weather data through the integrated model. The model is trained using the time series relationships in the data so as to obtain more accurate predictions.

Data Source
The data in this paper came from the 13-month electricity consumption data of a city in China published on the Internet. The original data set contains five attribute items, including historical electricity consumption, temperature, humidity, wind speed and rainfall. All data was collected every 15 min, that is, the data of five attribute items was recorded once every 15 min. The specific data set is shown in Table 1, where time represents the time of data recording. In order to verify the effectiveness of the model, the data set was divided into a training set and a test set before model training according to the power demand forecasting tasks of different durations. The training set accounted for 70% of the original data, and the test set accounted for 30%.

Data Cleaning
The data item of electricity consumption in the data used in this paper was analyzed, and a data trend diagram was drawn, as shown in Figure 1. It can be seen that the data fluctuated significantly since the 41st day of Year 1. Since the original data came from a certain city in China, it was speculated that this period should be during the Chinese Lunar New Year, when a large number of urban migrant workers returned to their hometowns to celebrate the New Year, and a large number of enterprises and institutions stopped work and production during this period, resulting in large fluctuations in electricity consumption. In order to reduce the impact of the abnormal fluctuation on prediction results, this paper classified the data of 15 days after the 41st day of Year 1 as abnormal data and deleted it from the data set.

Data Normalization
The main goal of data normalization was to scale the original data within a fixed interval according to certain rules to eliminate the influence of different data dimensions in the original data so as to ensure that the model training results were not affected by the original data dimensions. In this paper, according to Equation (1), five attribute items including electricity consumption, temperature, humidity, wind speed and rainfall in the original data were normalized to the [0,1] interval [23].
$$x_i^{*} = \frac{x_i - X_{\min}}{X_{\max} - X_{\min}} \qquad (1)$$
In the equation, $x_i^{*}$ is the normalized value of the ith value of the sample, $x_i$ is the ith value of the sample, $X_{\min}$ is the minimum value of the sample, and $X_{\max}$ is the maximum value of the sample.
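As a minimal sketch of the min-max scaling in Equation (1), the following Python function normalizes one attribute column. The function name, the handling of a constant column, and the sample readings are illustrative assumptions, not part of the original study.

```python
def min_max_normalize(values):
    """Scale a sequence of numbers to [0, 1] using its own minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column makes Equation (1) undefined, so map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical temperature readings, one per 15-minute interval.
temperature = [12.0, 18.0, 15.0, 24.0]
scaled = min_max_normalize(temperature)
```

In practice the minimum and maximum would presumably be taken from the training set only and reused for the test set, so that no test information leaks into training.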

Boosting and Decision Tree
Ensemble learning completes the learning task by constructing and combining multiple learners. By combining multiple learners, it is often possible to obtain significantly better generalization performance compared to a single learner. There are three common ensemble learning ideas, including bagging, boosting, and stacking.
Boosting is a family of algorithms that can upgrade a weak learner to a strong learner. It works as follows: first, a base learner is trained from the initial training set; the distribution of training samples is then adjusted according to the performance of that base learner, so that the samples it predicted poorly receive more attention in subsequent rounds. The next base learner is trained on the adjusted sample distribution. This is repeated until the number of base learners reaches the specified value N, and finally the N base learners are combined by weighting. The flow chart of the algorithm is shown in Figure 2.
A decision tree is an important model in ensemble learning, and its core is a tree structure, as shown in Figure 3. The figure represents the mapping relationship between object attributes and object values. The root node and internal nodes represent splits on features, and each branch denotes the outcome of the split at its parent node, that is, the corresponding region of the input space.
Decision trees are generally divided into classification trees and regression trees. Classification trees are used for class assignment, while regression trees are used for numerical prediction [24]. As a regression tree grows, each leaf node yields a predicted value, and every candidate threshold of each feature is enumerated during splitting. The optimal splitting variable and splitting point are those that minimize the squared error, and splitting continues until the values in each node are identical or a preset threshold is reached. If the values in a leaf node are not identical, their average is used as the predicted value.
The growth of the above regression tree generally has the following five steps:
Step 1: Enter the training data set, as follows:
$$D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$$
Step 2: Traverse all feature variables $j$. For each fixed splitting variable $j$, scan the splitting points $s$ and solve
$$\min_{j,s}\left[\min_{c_1}\sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2}\sum_{x_i \in R_2(j,s)} (y_i - c_2)^2\right]$$
This yields the optimal splitting variable $j$ and the splitting point $s$ with the smallest overall squared error loss.
Step 3: After the split at the value $s$ of attribute $j$ is obtained, divide the region and calculate the output of the two sub-regions:
$$R_1(j,s) = \{x \mid x^{(j)} \le s\}, \quad R_2(j,s) = \{x \mid x^{(j)} > s\}$$
$$c_m = \frac{1}{N_m}\sum_{x_i \in R_m(j,s)} y_i, \quad m = 1, 2$$
where $N_m$ is the number of samples in $R_m$.
Step 4: Repeat steps 2 and 3 for the two sub-regions to find the optimal splitting variable of each branch node. The growth of the regression tree ends when all regions meet the threshold or all attributes have been exhausted.
Step 5: The input space is divided into M regions, $R_1, R_2, \cdots, R_M$, and there is a fixed output value $c_m$ in each region. The final decision tree is generated as follows:
$$f(x) = \sum_{m=1}^{M} c_m I(x \in R_m)$$
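The exhaustive search in Steps 2 and 3 can be sketched for a single feature as follows. This is an illustrative Python sketch under simplifying assumptions (one feature, thresholds taken at observed values); `best_split` and the toy data are hypothetical names and values.

```python
def best_split(xs, ys):
    """Search one feature for the threshold s that minimizes the summed squared error."""
    def sse(group):
        mean = sum(group) / len(group)
        return sum((y - mean) ** 2 for y in group)

    best = None
    # Splitting at the largest value would leave the right region empty, so skip it.
    for s in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= s]    # region R1: x <= s
        right = [y for x, y in zip(xs, ys) if x > s]    # region R2: x > s
        loss = sse(left) + sse(right)
        if best is None or loss < best[1]:
            # c1 and c2 are the region means, which minimize the squared error.
            best = (s, loss, sum(left) / len(left), sum(right) / len(right))
    return best  # (s, loss, c1, c2), or None if the feature is constant


# Toy data with an obvious jump between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 5.0, 5.2, 4.8]
s, loss, c1, c2 = best_split(xs, ys)
```

A full regression tree would apply this search to every feature and recurse into the two sub-regions, as described in Step 4.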

Gradient Boosting Decision Algorithm
The gradient boosting decision algorithm is a representative algorithm in the boosting family. It consists of multiple decision trees, and the conclusions of all trees are accumulated as the final answer [25]. The main idea of the gradient boosting decision tree (GBDT) is to use the squared error as the loss function: each regression tree learns from the conclusions and residuals of all previous trees and fits a regression tree to the current residual, where the residual is the difference between the true value and the predicted value. The boosted tree is the accumulation of the regression trees generated over the entire iterative process. GBDT requires the weak learner to be a CART regression tree, and training seeks to make the sample loss of the model's predictions as small as possible. The process of using GBDT as a regression algorithm to predict the power demand is as follows.
Assume that the training set is $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, the maximum number of iterations is $T$, the loss function is the commonly used squared error $L(y, f(x)) = (y - f(x))^2$, and the output is the strong learner $f(x)$. The regression algorithm proceeds as follows:
Step 1: Initialize the weak learner. For the squared error loss, the constant $c$ can be set to the mean of the sample $y$.
$$f_0(x) = \arg\min_{c} \sum_{i=1}^{m} L(y_i, c)$$
Step 2: For each iteration $t = 1, 2, \ldots, T$, calculate the negative gradient for the samples $i = 1, 2, \ldots, m$:
$$r_{ti} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_{t-1}(x)}$$
Step 3: Use $(x_i, r_{ti})$, $i = 1, 2, 3, \ldots, m$, to fit a CART regression tree to get the tth regression tree. Its corresponding leaf node regions are $R_{tj}$, $j = 1, 2, 3, \ldots, J$, where $J$ is the number of leaf nodes of regression tree $t$.
Step 4: For each leaf region $j = 1, 2, 3, \ldots, J$, calculate the best fitting value:
$$c_{tj} = \arg\min_{c} \sum_{x_i \in R_{tj}} L(y_i, f_{t-1}(x_i) + c)$$
Step 5: Update the strong learner.
$$f_t(x) = f_{t-1}(x) + \sum_{j=1}^{J} c_{tj} I(x \in R_{tj})$$
Finally, the expression of the strong learner $f(x)$ is obtained:
$$f(x) = f_T(x) = f_0(x) + \sum_{t=1}^{T}\sum_{j=1}^{J} c_{tj} I(x \in R_{tj})$$
GBDT can be applied to most regression problems [26,27]. For dense data such as electricity demand, a variety of distinguishing features and feature combinations can be found through this model, which has strong generalization and expressive ability and can therefore achieve a better fit.
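The iteration above can be illustrated with a small, self-contained Python sketch. It is not the configuration used in this study: the weak learner is simplified to a depth-1 tree on one feature, and the `learning_rate` shrinkage factor is an added assumption that does not appear in the steps above.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree (one threshold, two leaf means) to the residuals."""
    best = None
    for s in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= s]
        right = [r for x, r in zip(xs, residuals) if x > s]
        c1, c2 = sum(left) / len(left), sum(right) / len(right)
        loss = sum((r - c1) ** 2 for r in left) + sum((r - c2) ** 2 for r in right)
        if best is None or loss < best[0]:
            best = (loss, s, c1, c2)
    _, s, c1, c2 = best
    return lambda x: c1 if x <= s else c2


def fit_gbdt(xs, ys, n_trees=20, learning_rate=0.5):
    f0 = sum(ys) / len(ys)               # Step 1: constant initial learner (mean of y)
    trees = []
    preds = [f0] * len(xs)
    for _ in range(n_trees):
        # Step 2: for squared error the negative gradient is proportional to the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)  # Steps 3-4: leaf means are the best fitting values
        trees.append(tree)
        # Step 5: update the strong learner (with the assumed shrinkage factor).
        preds = [p + learning_rate * tree(x) for x, p in zip(xs, preds)]
    return lambda x: f0 + learning_rate * sum(t(x) for t in trees)


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 5.0, 5.2, 4.8]
model = fit_gbdt(xs, ys)
```

Each round fits the residuals left by the previous rounds, so the training error cannot increase from one round to the next in this sketch.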

LightGBM Model
To improve training efficiency and reduce memory consumption, the Light Gradient Boosting Machine (LightGBM) algorithm was proposed on the basis of the traditional GBDT algorithm [28]. The pre-sorting algorithm commonly used in boosting performs feature selection and splitting. It can find the splitting point accurately, but its memory usage and computational cost are high. LightGBM therefore uses a histogram algorithm to speed up the processing of training samples. Before training, the histogram algorithm constructs a piecewise function that converts continuous feature values into K discrete bin values, and then builds a histogram with K bins. As the training samples are traversed, LightGBM accumulates statistics in the histogram according to the K discrete values and finally finds the best split point among the discrete values. This significantly reduces memory use and computational cost and improves computational speed.
In addition, the GBDT algorithm grows trees level-wise, which treats all leaves of the same layer alike. In practice, however, splitting many of these leaves brings little gain, which wastes computing and memory resources [29]. To address this problem, LightGBM adopts a more efficient leaf-wise growth strategy. At each step it finds the leaf with the largest splitting gain among the current leaves, splits it, and repeats, which allows the algorithm to achieve higher accuracy for the same number of splits. Meanwhile, overfitting can be avoided by limiting the depth of the tree when the sample size is small.
In summary, the LightGBM algorithm builds on the core idea of GBDT and improves both the feature splitting process and the tree growth method, which lowers the computational cost and yields more accurate predictions.
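The histogram idea can be sketched in a few lines of Python. This is only an illustration of bin-based split search under simplifying assumptions (equal-width bins and a variance-reduction gain for squared error); it is not LightGBM's actual binning or gain computation, and `histogram_split` is a hypothetical name.

```python
def histogram_split(xs, grads, n_bins=4):
    """Bucket a continuous feature into n_bins and pick the best bin boundary."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return None  # constant feature: nothing to split on
    width = (hi - lo) / n_bins
    # Map each value to a bin index in [0, n_bins - 1].
    bins = [min(int((x - lo) / width), n_bins - 1) for x in xs]

    # Accumulate per-bin statistics once; the split search then touches only the bins.
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    for b, g in zip(bins, grads):
        grad_sum[b] += g
        count[b] += 1

    total_g, total_n = sum(grad_sum), len(xs)
    best_gain, best_bin = float("-inf"), None
    left_g, left_n = 0.0, 0
    for b in range(n_bins - 1):
        left_g += grad_sum[b]
        left_n += count[b]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        # Reduction in squared error obtained by giving each side its own mean.
        gain = left_g ** 2 / left_n + right_g ** 2 / right_n - total_g ** 2 / total_n
        if gain > best_gain:
            best_gain, best_bin = gain, b
    if best_bin is None:
        return None
    return lo + (best_bin + 1) * width, best_gain


xs = [0.1, 0.4, 0.9, 1.3, 2.2, 2.8, 3.1, 3.9]
residuals = [-2.0, -2.0, -2.0, -2.0, 2.0, 2.0, 2.0, 2.0]
threshold, gain = histogram_split(xs, residuals)
```

The search loop visits only the bin boundaries rather than every distinct feature value, which is where the savings described above come from.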

XGBoost Algorithm
Based on the decision tree boosting optimization model, the XGBoost algorithm converts weak learners into strong learners through iteration [29]. In the XGBoost algorithm, the CART regression tree is used as a weak learner to first determine the optimal structure of the tree, such as the number of leaf nodes and the depth of the tree. Next, the distributed forward additive model is adopted. Each time a single tree is generated, the weight of the last misclassified data is increased and used for the current tree, and the overall error of the model is gradually reduced by continuously adding trees until the end of training [30].
When the XGBoost algorithm is adopted to train samples, the model for each tree is as follows:
$$f(x) = w_{q(x)}, \quad w \in \mathbb{R}^{M}$$
In the equation, $w$ is the vector of leaf node scores, $x$ represents the input sample, $q(x)$ denotes the leaf node to which the sample $x$ is assigned, and $M$ is the number of leaf nodes of the tree. The equation for adding the mth tree to the model is as follows:
$$\hat{y}_i^{(m)} = \hat{y}_i^{(m-1)} + f_m(x_i)$$
To train a single CART tree [31], the objective function needs to be determined first:
$$Obj = \sum_{i} L(y_i, \hat{y}_i) + \sum_{m} \Omega(f_m)$$
The objective function is divided into two parts, the loss function $L$ and the regularization term $\Omega$. For regression, the squared residual between the predicted value and the true value, that is, the L2 loss, is generally used to evaluate the degree of model fitting, and the regularization term acts as a penalty on the model to prevent overfitting. The regularization term is defined as:
$$\Omega(f) = \gamma M + \frac{1}{2}\lambda \sum_{j=1}^{M} w_j^2$$
In the equation, $M$ refers to the number of leaf nodes and $\sum_j w_j^2$ is the squared L2 norm of the leaf node scores. $\gamma$ and $\lambda$ are used to control the complexity of the tree. From this, the regularization term can be calculated. Equations (12), (13), and (15) are substituted into the objective function, and the second-order Taylor expansion is used to obtain the objective of the mth tree in terms of its leaf nodes:
$$Obj^{(m)} \approx \sum_{j=1}^{M}\left[\left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right) w_j^2\right] + \gamma M \qquad (16)$$
where $I_j$ is the set of samples assigned to leaf $j$, and $g_i$ and $h_i$ are the first and second derivatives of the loss with respect to the previous prediction $\hat{y}_i^{(m-1)}$. Let $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$. Substitute them into Equation (16), take the partial derivative of the objective function with respect to $w_j$, and set it to 0 to obtain:
$$w_j^{*} = -\frac{G_j}{H_j + \lambda}$$
Substituting this back into the objective function gives:
$$Obj^{*} = -\frac{1}{2}\sum_{j=1}^{M} \frac{G_j^2}{H_j + \lambda} + \gamma M$$
This paper used $Obj^{*}$ to evaluate the quality of a single CART regression tree structure. XGBoost enumerated the splitting schemes of all features starting from a tree of depth 0 and calculated the objective function value of each to determine the optimal structure of the tree. When the tree reached the maximum depth or the sum of the sample weights fell below the set threshold, the construction of the decision tree was stopped. The sampling ratio of each tree was controlled by the set parameters, and the structure training process of a tree was finally optimized through parameter adjustment.
XGBoost applied boosting to carry out the next round of training after training one tree, obtaining the optimized training model structure through continuous iteration. After one iteration, XGBoost multiplied the weight of the leaf node and the learning rate, thereby weakening the influence of each tree and providing a larger learning space for subsequent trees. Finally, the optimal number of iterations of the model was determined, and the training of the model was completed.
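To make the leaf score and structure score concrete, the following Python sketch evaluates them for a toy split. The helper names and the numbers are illustrative assumptions, and the code only mirrors the standard formulas; it does not call the XGBoost library.

```python
def leaf_weight(g, h, lam):
    """Optimal leaf score w* = -G / (H + lambda) for one leaf."""
    return -sum(g) / (sum(h) + lam)


def structure_score(leaves, lam, gamma):
    """Obj* = -1/2 * sum_j G_j^2 / (H_j + lambda) + gamma * M over a list of leaves."""
    total = sum(sum(g) ** 2 / (sum(h) + lam) for g, h in leaves)
    return -0.5 * total + gamma * len(leaves)


def split_gain(left, right, lam, gamma):
    """Decrease in Obj* when one leaf is replaced by its two children."""
    parent = (left[0] + right[0], left[1] + right[1])
    return structure_score([parent], lam, gamma) - structure_score([left, right], lam, gamma)


# Toy case: targets [1, 1, 5, 5], previous prediction 3 for every sample.
# With L = (y - y_hat)^2, g_i = 2 * (y_hat - y_i) and h_i = 2.
left = ([4.0, 4.0], [2.0, 2.0])      # samples with y = 1
right = ([-4.0, -4.0], [2.0, 2.0])   # samples with y = 5
w_left = leaf_weight(*left, lam=1.0)
w_right = leaf_weight(*right, lam=1.0)
gain = split_gain(left, right, lam=1.0, gamma=1.0)
```

A positive gain means the split lowers the regularized objective even after paying the per-leaf penalty, so a larger penalty makes the tree more conservative about splitting.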

LR Model
The LR model is represented by a conditional probability distribution $P(Y|X)$ in the form of a parameterized logistic distribution. The random variable $X$ takes real values, and the random variable $Y$ takes the value 1 or 0. The conditional distribution of the LR model is as follows:
$$P(Y = 1 \mid x) = \frac{\exp(w \cdot x + b)}{1 + \exp(w \cdot x + b)} \qquad (19)$$
$$P(Y = 0 \mid x) = \frac{1}{1 + \exp(w \cdot x + b)} \qquad (20)$$
In the equations, $x \in \mathbb{R}^n$ is the input, $Y \in \{0, 1\}$ is the output, $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$ are the parameters, $w$ is the weight vector, $b$ is the bias, and $w \cdot x$ is the inner product of $w$ and $x$. For a given input $x$, $P(Y = 1 \mid x)$ and $P(Y = 0 \mid x)$ can be computed from Equations (19) and (20). Logistic regression compares the two conditional probabilities and assigns the input $x$ to the class with the larger probability.
The weight vector $w$ and the input vector $x$ are extended to get
$$w = (w^{(1)}, w^{(2)}, \ldots, w^{(n)}, b)^{T}, \quad x = (x^{(1)}, x^{(2)}, \ldots, x^{(n)}, 1)^{T}$$
The LR model then becomes:
$$P(Y = 1 \mid x) = \frac{\exp(w \cdot x)}{1 + \exp(w \cdot x)} \qquad (21)$$
$$P(Y = 0 \mid x) = \frac{1}{1 + \exp(w \cdot x)} \qquad (22)$$
The odds of an event are the probability of the event occurring divided by the probability of it not occurring. If the probability of the event occurring is $p$, the probability of it not occurring is $1 - p$, and the odds of the event are $\frac{p}{1-p}$. The log odds of the event, also called the logit function, are as follows:
$$\operatorname{logit}(p) = \log\frac{p}{1 - p}$$
For logistic regression, the following equation can be obtained from Equations (21) and (22):
$$\log\frac{P(Y = 1 \mid x)}{1 - P(Y = 1 \mid x)} = w \cdot x$$
It can be seen from the above equation that in the LR model, the log odds of the output $Y = 1$ are a linear function of the input $x$. The range of the linear function $w \cdot x$ is the real numbers, and the inputs can be separated by a linear function.
Since $x \in \mathbb{R}^{n+1}$ and $w \in \mathbb{R}^{n+1}$, the linear function $w \cdot x$ can be converted into a probability by means of Equation (19):
$$P(Y = 1 \mid x) = \frac{\exp(w \cdot x)}{1 + \exp(w \cdot x)}$$
When the linear function $w \cdot x$ approaches positive infinity, the conditional probability approaches 1; when $w \cdot x$ approaches negative infinity, the conditional probability approaches 0.
Suppose a training set $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where each $x_i$ is an extended input vector and $y_i \in \{0, 1\}$, is given. The maximum likelihood estimation method is used here to estimate the LR model parameters. Let $P(Y = 1 \mid x) = \pi(x)$ and $P(Y = 0 \mid x) = 1 - \pi(x)$. The likelihood function is then
$$\prod_{i=1}^{N} [\pi(x_i)]^{y_i} [1 - \pi(x_i)]^{1 - y_i}$$
and the log-likelihood function is
$$L(w) = \sum_{i=1}^{N} \left[ y_i (w \cdot x_i) - \log\left(1 + \exp(w \cdot x_i)\right) \right] \qquad (28)$$
The estimated value of $w$ can be obtained by maximizing Equation (28). The objective function to be optimized is therefore the log-likelihood function. In logistic regression, gradient descent and quasi-Newton methods are often used. Assume that $\hat{w}$ is the maximum likelihood estimate of $w$; the resulting LR model is
$$P(Y = 1 \mid x) = \frac{\exp(\hat{w} \cdot x)}{1 + \exp(\hat{w} \cdot x)}, \quad P(Y = 0 \mid x) = \frac{1}{1 + \exp(\hat{w} \cdot x)}$$
Due to the limited learning ability of the LR model, it is often necessary to combine it with other models [32]. Corresponding feature combinations are obtained by other models through training, and then the LR model gives the corresponding predicted values.
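The maximum likelihood estimation described above can be sketched with plain gradient ascent on the log-likelihood. This Python example is a minimal illustration under assumed settings (step size, number of epochs, toy data); it is not the solver used in this study.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Gradient ascent on the log-likelihood; each x already ends with 1 for the bias."""
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))  # P(Y = 1 | x)
            for k, xk in enumerate(x):
                # Derivative of y * (w.x) - log(1 + exp(w.x)) with respect to w_k.
                grad[k] += (y - p) * xk
        w = [wi + lr * gk for wi, gk in zip(w, grad)]
    return w


# Toy data in extended form (x, 1); the class changes between x = 1 and x = 2.
xs = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
ys = [0, 0, 1, 1]
w_hat = fit_logistic(xs, ys)
p_low = sigmoid(w_hat[0] * 0.0 + w_hat[1])
p_high = sigmoid(w_hat[0] * 3.0 + w_hat[1])
```

Because the toy data is linearly separable, the weights keep growing slowly with more epochs; real data would usually call for a regularization term or a stopping criterion.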

Power Demand Forecasting Model Based on Stacking
Because no single model meets the requirements of training performance and stability well, this paper uses stacking to combine the advantages of various boosting models [33]. Moreover, combining them with the LR regression model gives the fusion model strong discrimination and stability, and it does not require too many iterations to achieve good results.
The overall design of model training and testing in this study is shown in Figure 4. First, the original data is cleaned and normalized, and then the power demand forecasting model based on stacking is trained to obtain the corresponding forecasting model. Next, the test data is used for prediction.
The process of model training is described in detail below. The previous analysis shows that, after the data for special holidays is removed, the power demand data in this study has strong regularity in the time series when divided by day, month and season. Meanwhile, the amount of data is limited, so models based on decision trees are better suited to this kind of problem.
The three models, GBDT, XGBoost and LightGBM, each have their own advantages and disadvantages in different prediction scenarios, and fusing them can achieve a joint gain. Stacking is an ensemble framework for hierarchical models [34]. The first layer is composed of a number of different base learners; this paper selected GBDT, XGBoost and LightGBM. After each model was tuned to achieve good results, they were integrated for prediction, thereby reducing the bias of the model. The LR regression model was selected for the second layer, which further reduced the risk of overfitting, reduced the variance of the model, and made the model more stable. The specific steps of the power demand forecasting model based on stacking are as follows:
Step 1: The overall data set consisting of meteorological factors and power demand was divided into training data (training set) and prediction data (test set). The training samples were then divided into k groups of equal size.
Step 2: Each base learner was trained k times on the training data. Each run used k − 1 groups as training samples and the remaining group as a validation set. The meteorological data in the validation set was used to predict power demand, which gave k sets of validation predictions in total. In addition, the prediction samples were predicted in each run, which gave k sets of predictions for the test data. Note that only the training set is split and refit in this way; the validation set and test set are used only for prediction.
Step 3: Combine the k sets of predictions obtained on the validation sets to get the new training sample data. Average the k sets of predictions obtained on the prediction data to get the new prediction data. The specific process is shown in Figure 5.
Step 2: The training data set was trained multiple times with each base learner. Each training utilized − 1 pieces of data as training samples, and the remaining one was used as a validation set. The data of meteorological factor in the validation set was utilized to predict power demand, so as to obtain copies of the prediction data through the validation set. In addition, the prediction samples would be predicted during each training process to obtain copies of prediction data. It should be noted that only the training set needs to do this step. The validation set and test set do not need it.
Step 3: Combine the pieces of prediction data obtained through the validation set to get new training sample data. The obtained pieces of prediction data were averaged to obtain new prediction data. The specific process is shown in Figure 5. Step 4: Input the data obtained in Step 3 into the second layer, and finally get the final prediction result. The process is shown in Figure 6.
The power demand model constructed in this paper used GBDT, XGBoost and LightGBM, the three boosting models in the first layer of the stacking framework. The second layer of the stacking framework adopted the LR model to directly output the prediction results. The overall framework of the model is shown in Figure 6.  Step 4: Input the data obtained in Step 3 into the second layer, and finally get the final prediction result. The process is shown in Figure 6.
In this study, several key hyperparameters of the GBDT, XGBoost and LightGBM algorithms were adjusted. Table 2 lists these hyperparameters, explains their specific meaning, and summarizes the optimal value of each one for every base model. The optimal values were selected according to the maximum average precision.
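One way to read the selection rule above is as an exhaustive search that keeps the hyperparameter combination with the highest average validation score. The sketch below illustrates only that selection logic. The function `grid_search`, the toy scoring function `toy_cv_scores`, and the hyperparameter names and values are hypothetical and are not taken from Table 2.

```python
from itertools import product

def grid_search(grid, cv_scores):
    """Return the parameter combination with the highest mean score.

    grid      -- dict mapping a hyperparameter name to its candidate values
    cv_scores -- callable(params) returning a list of per-fold scores
                 (higher is better)
    """
    names = sorted(grid)
    best_params, best_mean = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = cv_scores(params)
        mean = sum(scores) / len(scores)
        if mean > best_mean:
            best_params, best_mean = params, mean
    return best_params, best_mean

def toy_cv_scores(params):
    """Hypothetical stand-in for training and validating a model on 3 folds."""
    base = 1.0 - abs(params["learning_rate"] - 0.1) - 0.01 * abs(params["max_depth"] - 5)
    return [base - 0.01, base, base + 0.01]

best_params, best_mean = grid_search(
    {"learning_rate": [0.01, 0.1, 0.3], "max_depth": [3, 5, 7]},
    toy_cv_scores,
)
```

With a real model, `cv_scores` would train the base learner on each training fold and score it on the corresponding validation fold.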

Evaluation Indicators
Power demand forecasting calculates the power consumption demand for a period of time in the future based on the internal relationship between the historically recorded power consumption data and the corresponding meteorological information. The estimated power consumption demand often has some errors compared with the actual power demand. The smaller the error, the higher the accuracy of the model, and the closer the fit between estimated and actual power consumption demand curve, which means the better performance of the model. Therefore, the objective evaluation of the model is of great significance for analyzing the quality of the model.
In this paper, four commonly used evaluation indicators were used to evaluate each deterministic model: mean absolute error (MAE), root mean square error (RMSE), mean absolute percentage error (MAPE), and the goodness-of-fit coefficient ($R^2$). They are defined in Formulas (31)-(34), where $Q_{\mathrm{sim},n}$ represents the predicted value, $Q_{\mathrm{obs},n}$ represents the true value, $\bar{Q}_{\mathrm{sim}}$ represents the mean of the predicted values, and $\bar{Q}_{\mathrm{obs}}$ represents the mean of the actual values. The smaller the index value in Formulas (31)-(33), the smaller the error of the forecast model. The closer the index value in Formula (34) is to 1, the higher the accuracy of the forecast model [35] and the better the fit between the measured values and the predicted values.
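As a reference, the four indicators can be computed as in the following sketch, which uses their standard definitions. Two points are assumptions rather than statements from the paper: MAPE is expressed as a percentage, and $R^2$ is taken in its squared-correlation form, because the text defines the means of both the predicted and the actual values. Another common definition of $R^2$ is one minus the ratio of the residual to the total sum of squares. The function names and the sample values are hypothetical.

```python
import math

def mae(obs, sim):
    """Mean absolute error: mean of |sim - obs|."""
    return sum(abs(s - o) for o, s in zip(obs, sim)) / len(obs)

def rmse(obs, sim):
    """Root mean square error: sqrt of the mean of (sim - obs)^2."""
    return math.sqrt(sum((s - o) ** 2 for o, s in zip(obs, sim)) / len(obs))

def mape(obs, sim):
    """Mean absolute percentage error, in percent (obs must be non-zero)."""
    return 100.0 * sum(abs((o - s) / o) for o, s in zip(obs, sim)) / len(obs)

def r2(obs, sim):
    """Goodness of fit as the squared correlation of obs and sim (assumed form)."""
    mo, ms = sum(obs) / len(obs), sum(sim) / len(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_s = sum((s - ms) ** 2 for s in sim)
    return cov ** 2 / (var_o * var_s)

# Illustrative values only, not data from the paper.
observed = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
scores = {
    "MAE": mae(observed, predicted),
    "RMSE": rmse(observed, predicted),
    "MAPE": mape(observed, predicted),
    "R2": r2(observed, predicted),
}
```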

Forecast Results of Short-Term Power Demand in Different Seasons
To verify the seasonal forecasting performance, one day of data was randomly selected from the test set, and the accuracy of the power demand forecast at different times of that day in a given season was examined. The results of the multi-model fusion model XLG-LR constructed in this paper for forecasting power demand in different seasons are shown in Figure 7. The predictions of the XLG-LR model were not as good as those of the other models in some periods, but in most periods they were the closest to the true values.
The results of this method and the other methods are shown in Table 3. Compared with the XGB, LGB and GBDT models, the XLG-LR model achieved the best results in predicting power demand in different seasons under the four evaluation indicators. Across seasons, the XLG-LR model achieved its best prediction results in summer, while its predictions in winter were worse than in the other three seasons. An $R^2$ of 0.9901 was obtained for the summer electricity demand forecast, which shows that the predicted and real electricity demand curves were close to a complete fit.

Power Demand Forecasting Results on a Weekly Basis
In actual production, the power sector usually needs to plan the production and scheduling of the next week at the end of each week. Forecasting power demand on a weekly basis therefore has practical significance for guiding the power sector in arranging production scheduling. To evaluate weekly power demand forecasting, the data of seven consecutive days were randomly selected from the test set, and the accuracy of the forecast at different periods in those seven days was examined. Figure 8 shows the weekly power demand forecast by the XLG-LR model. Except for the XGB model, the LGB, GBDT and XLG-LR models all forecast the trend of electricity demand over the week well, and the forecast of the XLG-LR model was the closest to the true values.
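The random selection of consecutive days described above could be sketched as follows. The sketch assumes that the records are in chronological order, start at the beginning of a day, and have a fixed number of points per day; the value of 24 points per day and the function name `sample_consecutive_days` are hypothetical.

```python
import random

def sample_consecutive_days(records, n_days, points_per_day, rng):
    """Pick a random block of n_days consecutive whole days from records."""
    window = n_days * points_per_day
    if window > len(records):
        raise ValueError("not enough records for the requested window")
    # Choose the starting day so that the whole window fits in the data.
    last_start_day = (len(records) - window) // points_per_day
    start = rng.randrange(last_start_day + 1) * points_per_day
    return records[start:start + window]

# 30 days of placeholder hourly records (indices stand in for real rows).
records = list(range(24 * 30))
week = sample_consecutive_days(records, n_days=7, points_per_day=24,
                               rng=random.Random(1))
```

The same helper would cover the one-day and 30-day selections by changing `n_days`.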
The results of the XLG-LR model in this paper and the other methods are shown in Table 4. All four models were close to 1 in terms of the $R^2$ indicator, suggesting that the predictions of each model approach the true values well. However, in terms of MAE the XLG-LR model improved by 35.42%, 2.97% and 4.03% compared with the XGB, LGB and GBDT models, respectively. This shows that the XLG-LR model proposed in this paper achieved more accurate results in weekly power demand forecasting.
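The improvement percentages above are presumably relative reductions in MAE. A minimal sketch of that calculation follows; the formula is an assumption about how the percentages were derived, and the numbers used are illustrative rather than values from Table 4.

```python
def relative_improvement(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error."""
    return (baseline_error - new_error) / baseline_error * 100.0

# Illustrative values only: a baseline MAE of 2.0 reduced to 1.5.
improvement = relative_improvement(2.0, 1.5)
```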

Power Demand Forecasting Results on a Monthly Basis
Power demand forecasting on a monthly basis can support power companies in making monthly plans and arranging production scheduling reasonably. To evaluate monthly power demand forecasting, the data of 30 consecutive days were randomly selected from the test set, and the accuracy of the forecast at different periods during the 30 days was examined. Figure 9 shows the monthly power demand forecast by the XLG-LR model. As with the weekly forecasts, except for the XGB model, the LGB, GBDT and XLG-LR models predicted the trend of electricity demand over one month well, and the prediction of the XLG-LR model was the closest to the true values.
The results of the XLG-LR model and the other methods are shown in Table 5. All four models were close to 1 in terms of the $R^2$ indicator, suggesting that the predictions of each model approach the true values well. However, in terms of MAE, RMSE and MAPE, the XLG-LR model achieved the minimum value among the four models, indicating that it obtained more accurate results when forecasting electricity demand on a monthly basis.

Discussion
The above experiments show that although power demand could be predicted well using the GBDT, XGBoost and LightGBM models, the results of the individual algorithms were not stable across scenarios. Two reasons may account for this. One is that the data characteristics differed between scenarios, which affects the model training and learning process. The other is that the data set used in this paper contains a limited amount of data, which affects the quality of the data to a certain extent. As data-driven methods, the GBDT, XGBoost and LightGBM models are strongly affected by the quantity and quality of the training data. To address these problems, this paper proposes an XLG-LR model for power demand forecasting based on stacking, which mitigates the problems that arise when GBDT, XGBoost or LightGBM is used alone. The experiments suggest that the XLG-LR model achieved high accuracy in the different forecasting scenarios and improved the accuracy of power demand forecasting.
In recent years, with the continuous development of neural networks, a growing number of scholars have begun to apply neural networks to power demand forecasting [36]. Frequently used models include the gated recurrent unit (GRU) [37], long short-term memory networks (LSTM) [38], and the temporal convolutional network (TCN) [39]. To verify the advancement and effectiveness of the XLG-LR model, the power demand data of this paper were used to train the GRU, LSTM and TCN models as well as the XLG-LR model, and the test set was used to evaluate the trained models.
As a long short-term memory neural network, LSTM is widely used for correlation learning and prediction on sequence data. Since the vanishing gradient of the recurrent neural network (RNN) hinders the network from learning long-term dependencies, LSTM reduces the occurrence of this problem by introducing the forget gate, input gate and output gate, which can achieve better results. Wang et al. [40] used this method to forecast short-term photovoltaic power, and this study conducts comparative experiments on the basis of that method. The temporal convolutional network (TCN) is a simple one-dimensional convolutional network that can be applied to time series data. The layers in the network have temporal properties and are used to learn global and local features of the data. Convolutional layers also help reduce model latency, because predictions can be processed in parallel. Wang et al. [41] used this method to predict the short-term electricity consumption of industrial users, and this study carries out comparative experiments on the basis of that method. The GRU model pays more attention to the role of gate control, especially the feature weight introduced into its formula to enhance the ability to extract data features. Gao et al. [42] used this method for short-term power load forecasting, predicting the power load over the next 48 h in units of one hour. In this study, a comparative experiment is conducted on the basis of that method.
For the comparison, the relevant parameters of the GRU, LSTM and TCN models need to be set. The parameter settings of each model during the comparative experiments are shown in Table 6. When the stacking-based power demand forecasting model is trained and tested, the input data have the form (number of samples × number of features). In contrast, when GRU, LSTM and TCN are trained and tested, the input data have the form (number of samples × number of features × time window length). The size of the time window needs to be adjusted according to the forecast horizon.
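The difference between the two input forms can be illustrated with a sliding-window transformation. This is a minimal sketch, not the authors' preprocessing: the function name `make_windows` is hypothetical, and the axis order (samples, time window, features) is an assumption, chosen because it is the layout that Keras recurrent layers expect.

```python
def make_windows(features, targets, window):
    """Turn a 2-D table (samples x features) into windowed sequence samples.

    Each output sample holds the `window` rows that precede a target, so the
    result has the layout (samples, window, features).
    """
    X, y = [], []
    for end in range(window, len(features)):
        X.append([row[:] for row in features[end - window:end]])
        y.append(targets[end])
    return X, y

# Six time steps with two features each (placeholder values).
features = [[i, i * 10] for i in range(6)]
targets = list(range(6))
X_seq, y_seq = make_windows(features, targets, window=3)
```

A longer forecast horizon would typically use a larger `window`, which is the adjustment mentioned above.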
First, the four methods were compared on seasonal power demand forecasting, using the same training set and test set as in Section 5.2, to investigate the accuracy of the forecasts in different periods of a day in a given season. The prediction results of the four models for electricity demand in different seasons are shown in Figure 10. The XLG-LR model was the closest to the true values in most time periods. The comparison between the XLG-LR model and the three neural network methods is shown in Table 7. Compared with the GRU, LSTM and TCN models, the XLG-LR model has significant advantages in forecasting power demand in different seasons under the four evaluation indicators.
Secondly, the four methods were compared on weekly power demand forecasting, using the same training set and test set as in Section 5.3, to examine the accuracy of the forecasts at different time periods in a week. The prediction results of the four models for the trend of electricity demand in one week are shown in Figure 11. The GRU, LSTM and TCN models had obvious prediction deviations in the periods of high and low electricity demand, while the XLG-LR model accurately predicted the change trend of power demand in most time periods.
The comparison between the XLG-LR model and the three neural network methods is shown in Table 8. Compared with the GRU, LSTM and TCN models, the XLG-LR model had obvious advantages in weekly power demand forecasting under the four evaluation indicators, and it was ahead of the other models on all indicators.
Then the four methods were compared on monthly power demand forecasting, using the same training set and test set as Section 5.3, to examine the accuracy of the forecasts at different time periods during the 30 days. Figure 12 shows the forecast results of the four models for the trend of electricity demand in one month. The GRU, LSTM and TCN models had obvious forecast deviations in the period of low electricity demand, while the XLG-LR model basically matched the real demand in most time periods.
The comparison between the XLG-LR model and the three neural network methods is shown in Table 9. Compared with the GRU, LSTM and TCN models, the XLG-LR model had significant advantages in forecasting electricity demand on a monthly basis under the four evaluation indicators: its curve fitting was the best and its forecast error was the smallest. The prediction time of a model also affects how convenient the model is to use in practice.
This paper used the same training data to compare the time consumed in the prediction stage by the XLG-LR model and the three neural network methods. The results are shown in Table 10. The XLG-LR model completed the prediction in the shortest time in every forecasting scenario, and the time it required was at least one order of magnitude shorter than that of the three neural network methods, which shows that the XLG-LR model had a clear advantage in prediction time.
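Prediction time of the kind compared above can be measured with a simple wall-clock timer. The sketch below is one possible way to do so and is not the authors' measurement procedure; the helper `time_prediction`, the number of repeats, and the toy predictor are hypothetical.

```python
import time

def time_prediction(predict, X, repeats=5):
    """Return the shortest wall-clock time (seconds) of `predict(X)` over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        predict(X)
        best = min(best, time.perf_counter() - start)
    return best

def toy_predict(X):
    """Placeholder predictor standing in for a trained model's predict method."""
    return [2.0 * row[0] + 1.0 for row in X]

samples = [[float(i)] for i in range(10_000)]
elapsed = time_prediction(toy_predict, samples)
```

Taking the minimum over several runs reduces the influence of unrelated system activity on the measurement.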
The above comparative experiments indicate that the XLG-LR model had obvious advantages over the classical neural network algorithms in terms of both prediction accuracy and prediction time. The XLG-LR model is built mainly on the principle of the decision tree, and its final solution is obtained by continuously improving local optimal solutions during the solving process. The neural network, in contrast, needs to compare the data features extracted from the test data with the trained model to give its solution. Because the data used to train the model have numerous features with a certain similarity, the neural networks did not perform as well as the XLG-LR model in terms of accuracy and time consumption. Therefore, it can be considered that the XLG-LR model in this study achieved good prediction results for power demand forecasting in the different scenarios.
Although the method proposed in this paper achieved good power demand forecasting results, some problems remain to be solved in the future.
(1) The dataset is relatively small and contains a limited amount of information. The dataset used in this paper covers only 13 months, which reduces the generalization and reliability of the model to a certain extent. The GBDT, XGBoost and LightGBM algorithms used in this paper can achieve good prediction results on small data sets, but more abundant data should allow better results. Therefore, in future research, the current dataset can be supplemented with more months of data to build a larger and more informative dataset for electricity demand forecasting.
(2) Indicators other than meteorological factors may also influence the forecast results. Electricity demand can be affected by many factors, including the level of local economic development and the industrial structure. Although the indicators used in this study capture part of the influences on electricity demand, other indicators may also affect it. Therefore, in the future, researchers can consult experts in related fields about other factors that affect electricity demand, and collect more indicator data to supplement the current data set.

Conclusions
Power demand forecasting is an important task in the power industry: it supports economic development, sustains people's daily life, and guides electric power production. This study used 13 months of electricity and meteorological data and adopted three models, GBDT, XGBoost and LightGBM, to build an XLG-LR power demand forecasting model based on stacking fusion. After the data were divided into a training set and a test set, the four models were trained, and the test set was used to verify the feasibility of the models. The experiments in this study were carried out under the following software and hardware conditions. The software environment was Python 3.7 and TensorFlow 2.8.0, with the sklearn, seaborn, numpy, matplotlib and pandas packages installed. The hardware environment was an AMD Radeon(TM) Vega 8 Graphics card and 8 GB of memory.
Verification covered different time lengths: seasonal, weekly and monthly forecasting. Under these time lengths, except for the XGBoost model, the GBDT, LightGBM and XLG-LR models all achieved relatively satisfactory results, among which the XLG-LR model proposed in this paper worked best. In terms of overall prediction accuracy, the models ranked as XLG-LR > GBDT > LightGBM > XGBoost. In addition, this paper compared the power demand prediction results of the XLG-LR model with those of three mainstream neural network models, TCN, GRU and LSTM. The results showed that the XLG-LR model also achieved the best experimental results on this dataset compared with the neural network models. Through the above discussion, the reliability and validity of the XLG-LR model for power demand forecasting were verified. When the power demand data or the meteorological data change, only a new data set is needed to train a new prediction model that can cope with the changes. The method can also be applied to power demand forecasting in other regions, where a new data set is likewise needed to train a new forecasting model. In addition, the method has been encapsulated into corresponding software with good interoperability, so that it can be used in a wider range of practical applications in the future.
In the future, more power demand data can be collected to build a larger power demand database, so as to further verify the accuracy and advancement of the algorithm in this paper. With a sufficient amount of data, this method could also be used for long-term electricity demand forecasting, such as forecasting the electricity demand of the next year. In addition, electricity demand is closely related to factors other than meteorological ones, such as the level of economic development and the regional industrial layout. These data can be added in the future to improve the prediction accuracy of this method. Furthermore, the method can be applied to other fields, including the prediction of water demand and coal resource demand.