Short-Term Load Forecasting of Distributed Energy System Based on Kernel Principal Component Analysis and KELM Optimized by Fireworks Algorithm

Accurate and stable load forecasting is of great significance for ensuring the safe operation of distributed energy systems. To improve the accuracy and stability of distributed energy system load forecasting, a forecasting model based on kernel principal component analysis (KPCA), the kernel extreme learning machine (KELM) and the fireworks algorithm (FWA) is proposed. First, the KPCA model is used to reduce the dimension of the feature set, so that redundant input samples are merged. Next, FWA is employed to optimize the parameters C and σ of KELM. Lastly, the KPCA-FWA-KELM load forecasting model is established. The relevant data of a distributed energy system in Beijing, China, are selected for training and testing to verify the effectiveness of the proposed method. The results show that the new hybrid KPCA-FWA-KELM method has superior performance, robustness and versatility in load prediction for distributed energy systems.


Introduction
The concept of the energy internet will effectively promote world energy production, consumption and system reform, and drive energy transformation in all countries so as to achieve clean, efficient, safe, convenient and sustainable energy use [1]. As an important component of the energy internet, the distributed energy system (DES) has attracted widespread attention due to its outstanding features such as safety and reliability, high energy efficiency, environmental friendliness, and sustainability [2]. However, distributed energy (DE) has drawbacks such as the randomness of its output. Therefore, precise prediction of the space-time load of a DES is of great significance, as it helps to improve the overall stability of system operation and to promote the effective use of distributed energy.
For years, with the advancement of mathematical theory and the rapid development of modern computer technology, load forecasting techniques have been developing continuously; they mainly include classical prediction methods and modern intelligent prediction methods. Traditional prediction methods, which include the time series method [3], the regression analysis method [4], the grey theory method [5] and so on, are simple and mature. However, due to certain defects, their prediction accuracy is often unsatisfactory. The time series method takes time as the only variable and thus ignores the influence of other external factors; consequently, serious errors appear in the prediction results when the external environment changes greatly [6]. The regression analysis method, in turn, struggles to capture strongly nonlinear relationships. In contrast, KPCA can reduce the dimensionality of the input variables while preserving the nonlinear information between them.
In summary, this study preliminarily established the feature set of candidate influence factors of distributed energy system load and built a distributed energy system load prediction model based on KPCA and the FWA-optimized KELM. The rest of this article is arranged as follows. The next part introduces the algorithms used in this article, including kernel principal component analysis, the FWA algorithm and the KELM model, and constructs a complete prediction framework. The third part selects practical cases to verify the effectiveness and stability of the proposed method. The fourth part is the conclusion.

KPCA
As a nonlinear principal component model, KPCA can obtain more reasonable dimension-reduction results than PCA, can effectively handle nonlinear relationships among variables, and is used effectively in multi-index comprehensive analysis [25]. This method compresses the information contained in a large number of index variables into a few comprehensive variables that reflect the features of the original information. In addition, by analyzing the indexes of the included variables, it can process the nonlinear relationships between variables and minimize the loss of original data information [26]. The basic steps are as follows [27,28].
Consider an input data set containing m samples x_k ∈ R^N (k = 1, 2, . . . , m), that is, the initial input sample data set is M = [x_1, x_2, . . . , x_m]. Through a nonlinear mapping Φ, the data set is projected into the feature space:

Φ(M) = [Φ(x_1), Φ(x_2), . . . , Φ(x_m)].   (1)

Get the covariance matrix according to the definition of covariance:

C_F = (1/m) Σ_{k=1}^{m} Φ(x_k) Φ(x_k)^T.   (2)

Solve the eigenvalue problem

λ_F V = C_F V,   (3)

wherein λ_F is an eigenvalue and V the corresponding eigenvector. Introducing the kernel matrix K with entries K_ij = Φ(x_i) · Φ(x_j), the eigenvalue problem can be expressed entirely in terms of K. Since the mapped data are generally not centered, K is obtained by matrix centralization:

K̃ = K − I_m K − K I_m + I_m K I_m,   (4)

wherein I_m is the m × m matrix whose entries all equal 1/m. After the above calculation, the method of traditional PCA can be applied to K̃ to calculate the projection of a data point on each eigenvector, and finally the kernel principal components of the point are obtained.
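The kernel-matrix centering and projection steps above can be sketched in Python with NumPy. This is a minimal illustration, assuming an RBF kernel whose width sigma is a free parameter (not specified at this point in the text):

```python
import numpy as np

def rbf_kernel_matrix(X, sigma):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kpca(X, n_components, sigma):
    """Kernel principal components of the training points themselves."""
    m = X.shape[0]
    K = rbf_kernel_matrix(X, sigma)
    I_m = np.full((m, m), 1.0 / m)                  # matrix with all entries 1/m
    K_c = K - I_m @ K - K @ I_m + I_m @ K @ I_m     # centered kernel matrix
    eigvals, eigvecs = np.linalg.eigh(K_c)          # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:n_components]  # keep the largest ones
    lam, alpha = eigvals[idx], eigvecs[:, idx]
    alpha = alpha / np.sqrt(np.maximum(lam, 1e-12)) # normalize expansion coefficients
    return K_c @ alpha                              # projections onto the components
```

The centering step reproduces Formula (4); the projection is the standard PCA step applied to the centered kernel matrix.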

Improved Fireworks Model
FWA is a simulation of the whole fireworks explosion process [29]. Fireworks explode into sparks, and the sparks in turn produce more new sparks, forming rich patterns. Translating the explosion process into a computational procedure, the fireworks are seen as feasible solutions of the problem, and the generation of sparks is understood as the process of searching for the optimal result. In the process of solving for the optimal solution, the influencing factors of FWA include the number of sparks, the explosion radius, and the group of fireworks and sparks selected for the next explosion.
FWA has a self-regulation mechanism with good local and global search capabilities. In FWA, the explosion radius and the number of explosion sparks differ for each firework. A firework with a poor fitness value receives a larger explosion radius, giving it more exploration ability, while a firework with a good fitness value receives a smaller explosion radius, enabling stronger exploitation around its location. Furthermore, the introduction of Gaussian mutation sparks further enriches the diversity of the population.
Therefore, it can be seen that the three main components of FWA are explosion operator, mutation operator and selection strategy.
(1) Explosion operator. According to the fitness value of each firework, the number of sparks and the explosion radius produced by its explosion can be obtained. For a firework x_i (i = 1, 2, . . . , N), the spark number S_i and the explosion radius A_i are calculated as follows:

S_i = M · (y_max − f(x_i) + ε) / (Σ_{j=1}^{N} (y_max − f(x_j)) + ε),   (7)

A_i = R · (f(x_i) − y_min + ε) / (Σ_{j=1}^{N} (f(x_j) − y_min) + ε).   (8)
In Formulas (7) and (8), y_max and y_min denote the maximum and minimum fitness values of the current population, respectively; f(x_i) is the fitness value of firework x_i; M is a constant that adjusts the number of explosion sparks; R is a constant that sets the size of the fireworks explosion radius; and ε is the machine minimum value, used to avoid division by zero.
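As an illustration of Formulas (7) and (8), the spark counts and radii can be computed from a vector of fitness values. This is a minimal sketch for a minimization problem, using the notation M, R and ε from above:

```python
import numpy as np

def explosion_params(fitness, M=100.0, R=150.0, eps=1e-12):
    """Spark counts S_i and explosion radii A_i per firework (Formulas (7) and (8))."""
    f = np.asarray(fitness, dtype=float)
    y_max, y_min = f.max(), f.min()
    # Better (smaller) fitness -> more sparks
    S = M * (y_max - f + eps) / (np.sum(y_max - f) + eps)
    # Better fitness -> smaller explosion radius
    A = R * (f - y_min + eps) / (np.sum(f - y_min) + eps)
    return S, A
```

With fitness values [1, 2, 3], the best firework (fitness 1) gets the most sparks and the smallest radius, matching the exploitation/exploration roles described above.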
(2) Mutation operator. The mutation operator increases the diversity of the spark population. The mutation sparks in FWA are Gaussian mutation sparks created from the explosion sparks through Gaussian mutation. When a firework x_i is selected, the Gaussian mutation operation on its k-th dimension is

x̂_ik = x_ik · e,   (9)

wherein x̂_ik represents the k-th dimension of the mutation spark and e obeys the Gaussian distribution N(1, 1).
In FWA, when an explosion spark or mutation spark generated by the operators leaves the search space, it must be mapped back to a new location. The calculation method is as follows:

x̂_k = x_LB,k + |x̂_k| mod (x_UB,k − x_LB,k),   (10)

wherein x_UB,k and x_LB,k denote the upper and lower bounds of the search space in the k-th dimension.
(3) Selection strategy. A certain number of individuals need to be selected from the explosion fireworks and mutation sparks as the next generation of fireworks, so as to pass information on to future generations of fireworks.
When the candidate set contains K individuals and the population size is N, the candidate with the best fitness value directly becomes a firework of the next generation. The remaining N − 1 fireworks are chosen in a probabilistic way. For a candidate x_i, the selection probability is calculated as follows:

p(x_i) = R(x_i) / Σ_{x_j ∈ K} R(x_j),   (11)

wherein R(x_i) = Σ_{x_j ∈ K} d(x_i, x_j) represents the sum of the distances between x_i and all other candidates in the current set. Thus, the probability that an individual is selected decreases when its local density is high, that is, when there are many other candidates around it. Based on the previous description, the specific process of the FWA method is as follows [30]:

Step 1. Initialize parameters, randomly select N fireworks in the solution space, and initialize their coordinates.
Step 2. Calculate the fitness value f(x_i) of each firework and then calculate its explosion radius A_i and spark number S_i. Randomly select z of the k dimensions to update the coordinates.
Step 3. Generate M̂ Gaussian mutation sparks: for each one, select a firework x_i, calculate the mutated result x̂_ik using the Gaussian mutation formula, and save it to the population of Gaussian mutation sparks.
Step 4. According to the selection strategy, select the best N individuals from the fireworks, explosion sparks and Gaussian mutation sparks as the fireworks of the next iteration.
Step 5. Check the termination condition. If it is not met, continue the loop; if it is met, end the cycle and output the best result.
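Steps 1-5 can be sketched as a compact loop. This is an illustrative minimization example, not the authors' implementation: simple clipping is used in place of the modular mapping rule of Formula (10), and the random dimension-selection scheme is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def fwa_minimize(f, dim, bounds, n_fireworks=5, m_sparks=30, n_gauss=5,
                 radius=10.0, max_iter=100, eps=1e-12):
    """Minimal fireworks algorithm minimizing f over [lb, ub]^dim."""
    lb, ub = bounds
    X = rng.uniform(lb, ub, size=(n_fireworks, dim))
    best_x, best_y = None, np.inf
    for _ in range(max_iter):
        y = np.array([f(x) for x in X])
        y_max, y_min = y.max(), y.min()
        # Explosion operator: spark counts and radii (cf. Formulas (7) and (8))
        S = m_sparks * (y_max - y + eps) / (np.sum(y_max - y) + eps)
        A = radius * (y - y_min + eps) / (np.sum(y - y_min) + eps)
        sparks = []
        for xi, si, ai in zip(X, np.clip(S.round().astype(int), 1, m_sparks), A):
            for _ in range(si):
                z = xi.copy()
                mask = rng.random(dim) < 0.5            # random subset of dimensions
                z[mask] += ai * rng.uniform(-1.0, 1.0, mask.sum())
                sparks.append(z)
        # Gaussian mutation sparks: x_ik <- x_ik * e with e ~ N(1, 1)
        for _ in range(n_gauss):
            z = X[rng.integers(n_fireworks)].copy()
            mask = rng.random(dim) < 0.5
            z[mask] *= rng.normal(1.0, 1.0, mask.sum())
            sparks.append(z)
        # Keep candidates inside the search space (clipping, for brevity,
        # instead of the modular mapping rule)
        P = np.clip(np.array(sparks + list(X)), lb, ub)
        fy = np.array([f(x) for x in P])
        if fy.min() < best_y:
            best_y, best_x = float(fy.min()), P[fy.argmin()].copy()
        # Selection: keep the best candidate, choose the rest with
        # distance-based probability so sparse regions are favored
        keep = [int(fy.argmin())]
        d = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2).sum(axis=1)
        keep += list(rng.choice(len(P), size=n_fireworks - 1, p=d / d.sum()))
        X = P[keep]
    return best_x, best_y
```

For example, minimizing the sphere function np.sum(x**2) in two dimensions drives the best solution toward the origin.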

KELM
Huang et al. put forward the theory of the extreme learning machine (ELM) in 2006. On the basis of this theory, many scholars have developed a variety of improved models, such as the online sequential extreme learning machine and KELM [31,32]. KELM is a single-hidden-layer feedforward neural network algorithm that is more accurate than the basic ELM algorithm. Compared with BPNN and SVM, KELM computes its results in a shorter time and greatly improves the adaptability of the model to the samples [33]. The KELM algorithm has been applied effectively in many fields, which proves its effectiveness in prediction.
The following introduces the construction principle of the general ELM model. The network function is as follows:

g(x) = Σ_{i=1}^{L} β_i h_i(x),   (12)

where g(x) denotes the output value of the network, h_i(x) represents the output of the i-th hidden layer neuron corresponding to the input x, and β_i represents the connection weight between the i-th hidden layer neuron and the output neuron.
The regression accuracy of ELM is measured by the error: the smaller the error, the higher the accuracy. Therefore, the optimal result is obtained by minimizing the output error:

|| Σ_{i=1}^{L} β_i h_i(x) − g_0(x) || = min,   (13)

where L represents the number of hidden layer neurons and g_0(x) represents the target function to be forecasted.
The generalization ability of the neural network is ensured by minimizing the output weights β. Usually β takes its least-squares solution:

β = H^+ O,   (14)

where H represents the hidden layer output matrix of the neural network, H^+ represents the generalized (Moore-Penrose) inverse of H, and O represents the vector of prediction target values. On the basis of ridge regression theory, introducing the regularization term 1/C makes the results more stable and improves generalization.
Introducing a kernel function into the ELM algorithm yields better regression prediction accuracy. This paper uses Mercer's condition to define the kernel matrix:

Ω = H H^T, Ω_{i,j} = h(x_i) · h(x_j) = K(x_i, x_j).   (15)

The kernel matrix Ω replaces the random matrix H H^T in ELM and uses the kernel function to map all input samples from the n-dimensional space to a high-dimensional hidden feature space. Once the kernel parameters are set, the mapping value of the kernel matrix Ω is fixed [33]. Common kernel functions include the RBF kernel, the linear kernel and the polynomial kernel; the RBF kernel is usually chosen:

K(x, y) = exp(−||x − y||² / (2σ²)).   (16)

The parameter 1/C is added to the main diagonal of H H^T through the unit matrix, so that no characteristic root equals zero; the weight vector β* is then obtained, yielding a more stable ELM model with better generalization. The output weights of the ELM method become [34]:

β* = H^T (I/C + H H^T)^{−1} O,   (17)

where I denotes the unit matrix, C denotes the penalty coefficient weighing the proportion between structural risk and empirical risk, and H H^T is generated by mapping the input samples through the kernel function.
From the above formulas, the output of the KELM model is described as follows:

f(x) = h(x) H^T (I/C + Ω)^{−1} O = [K(x, x_1), . . . , K(x, x_m)] (I/C + Ω)^{−1} O.   (18)

In the KELM algorithm, it is not necessary to give the specific form of the feature mapping function h(x) of the hidden layer nodes; the output value can be obtained by knowing only the specific form of the kernel function [35]. Furthermore, because the kernel function is used directly in the form of an inner product, there is no need to set the number of hidden layer nodes, nor to set the initial weights and biases of the hidden layer.
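The KELM training and prediction equations above amount to one regularized linear solve against the kernel matrix. A minimal sketch, using the RBF kernel form from this section (in the full model the parameters C and σ would come from FWA rather than being set by hand):

```python
import numpy as np

def rbf_kernel(X, Y, sigma):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-d2 / (2.0 * sigma**2))

class KELM:
    """Kernel extreme learning machine with an RBF kernel."""
    def __init__(self, C=10.0, sigma=1.0):
        self.C, self.sigma = C, sigma

    def fit(self, X, y):
        self.X_train = X
        Omega = rbf_kernel(X, X, self.sigma)     # kernel matrix replacing H H^T
        n = X.shape[0]
        # beta* = (I/C + Omega)^(-1) O  -- the output-weight solve
        self.beta = np.linalg.solve(np.eye(n) / self.C + Omega, y)
        return self

    def predict(self, X):
        # f(x) = [K(x, x_1), ..., K(x, x_m)] beta*
        return rbf_kernel(X, self.X_train, self.sigma) @ self.beta
```

Note that training requires no iterative weight updates, which is the source of KELM's speed advantage mentioned above.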

Model Construction
This paper first determines the feature set of candidate influence factors of the distributed energy system load and uses the KPCA method to perform feature dimensionality reduction; then the fireworks algorithm is used to optimize the KELM, so that the optimal values of the penalty coefficient C and the kernel parameter σ are obtained. Finally, the feature data after KPCA reduction are input to obtain the prediction result. The proposed composite prediction framework is shown in Figure 1.

Figure 1. The proposed composite prediction framework.
The following are the specific steps of the prediction model.
(1) Initial input variable selection and data processing. The influence factors of the distributed energy system load are determined from the literature, the candidate input variables C = {C_i, i = 1, 2, . . . , n} are formed, and the input data C_i are quantified and normalized. (2) KPCA feature reduction. After step (1), a matrix is formed from the input data; the nonlinear mapping selects the Gaussian kernel function k(x, y) = exp(−||x − y||² / (2σ²)). After the KPCA nonlinear transformation in Section 2.1, the kernel principal components are retained until the cumulative variance contribution rate τ exceeds 90%, and a new input variable matrix is formed.
(3) Initialize the FWA parameters. After many tests, the maximum number of iterations is Maxgen = 500, the population size is PopNum = 30, the constant determining the number of sparks is M = 100, and the constant determining the explosion radius is R = 150. (4) Obtain the best values of C and σ in KELM. First, C and σ are randomly assigned; then the fitness of each generation is compared to select the best parameters. Each iteration checks whether the stop condition of the algorithm is satisfied. If yes, the parameters are the global optimal parameters; if not, a new cycle starts until the global optimal parameters are found.
(5) Simulation prediction. According to the prediction model above, the short-term load of distributed energy system is forecasted, and the results of load forecasting are analyzed and evaluated.

Error Measures
It is very important to find the model with the best prediction effect among many models. The indexes used to evaluate prediction models usually include the relative error (RE), the root mean square error (RMSE), the mean absolute percentage error (MAPE) and the average absolute error (AAE). The smaller the error value, the better the prediction accuracy and the more effective the method. The four indicators are calculated as follows:

RE_t = (y*_t − y_t) / y_t × 100%,   (19)

RMSE = sqrt( (1/N) Σ_{t=1}^{N} ((y*_t − y_t) / y_t)² ) × 100%,   (20)

MAPE = (1/N) Σ_{t=1}^{N} |(y*_t − y_t) / y_t| × 100%,   (21)

AAE = ( (1/N) Σ_{t=1}^{N} |y*_t − y_t| ) / ( (1/N) Σ_{t=1}^{N} y_t ) × 100%,   (22)

where y_t represents the actual load at time t, y*_t denotes the predicted load at time t, and N represents the number of data points.
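The four indexes can be implemented directly. Since the paper reports RMSE, MAPE and AAE as percentages, this sketch assumes relative (percentage) forms; the paper's exact normalization for RMSE and AAE may differ:

```python
import numpy as np

def re(y, y_hat):
    # Pointwise relative error, in percent
    return 100.0 * (y_hat - y) / y

def rmse(y, y_hat):
    # Root mean square of the relative errors, in percent (assumed form)
    return 100.0 * np.sqrt(np.mean(((y_hat - y) / y) ** 2))

def mape(y, y_hat):
    # Mean absolute percentage error
    return 100.0 * np.mean(np.abs((y_hat - y) / y))

def aae(y, y_hat):
    # Average absolute error relative to the mean load, in percent (assumed form)
    return 100.0 * np.mean(np.abs(y_hat - y)) / np.mean(y)
```

For example, with actual loads [100, 200] and predictions [110, 190], the pointwise relative errors are 10% and −5%, and MAPE is 7.5%.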

Data Selection and Pretreatment
To verify the forecasting accuracy of the method, this paper selects the load data and meteorological data of the distributed energy system in China from 0:00 on 18 June 2018 to 24:00 on 18 June 2019. The load data from 0:00 on 14 June 2018 to 24:00 on 14 June 2019 are selected as training samples to establish a univariate time series, and the load data from 0:00 on 18 June 2019 to 24:00 on 18 June 2019 are used as the testing sample, with a data collection interval of 15 min. Meanwhile, the maximum temperature, average temperature, minimum temperature, season type, month, precipitation, day type, wind speed, humidity, and the load values at the same time in the previous five days are considered as the candidate set for feature selection, totaling 30 candidate features, as shown in Table 1. Among them:

F24: Sea_t, the season of day t (1 spring, 2 summer, 3 autumn, 4 winter);
F25: M_t, the month of day t;
F26: P_t, the precipitation on day t;
F27: Hol_t, whether day t is a holiday (0 holiday, 1 not);
F28: Wk_t, whether day t is a weekend (0 weekend, 1 not);
F29: W_t, the wind speed on day t;
F30: H_t, the humidity on day t.

Since the collected data are not public, the main statistical indicators are listed in Table 2. For better training and learning of the proposed model, all input data are normalized:

y_i = (x_i − x_min) / (x_max − x_min),   (23)

where x_i represents the actual value, x_min and x_max denote the minimum and maximum values of the sample, and y_i denotes the normalized value.
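The min-max normalization above can be applied per feature, for example:

```python
import numpy as np

def min_max_normalize(x):
    # y_i = (x_i - x_min) / (x_max - x_min), mapping each feature column to [0, 1]
    x = np.asarray(x, dtype=float)
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    return (x - x_min) / (x_max - x_min)
```

In practice the training-set minima and maxima should also be reused to scale the test set, so that no information from the test period leaks into training.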
Furthermore, KPCA is adopted to perform kernel principal component analysis on the 30 candidate feature vectors, and 4 component vectors are extracted as the input vector of the proposed model.

KELM for Load Forecasting
After the dimensionality reduction of the input features, the input parameters are fed into the KELM model for learning. In this article, a self-written program in Matlab is used for the computation. It is noteworthy that the RBF kernel function is selected as the kernel function of the KELM method. To ensure precision and accuracy, the model's important parameters are obtained and optimized by the FWA algorithm; the FWA parameters are set in Section 2.4 and are not repeated here. Through calculation, the KELM model parameters are C = 8.325 and σ = 0.0031.
To prove the forecasting accuracy of the proposed DES load forecasting method, this paper also selects the KELM, ELM and BPNN models, which are not optimized by FWA, to forecast the load data of this sample, and then evaluates and analyzes the prediction results of the four methods. The structure of the BPNN model is 6-3-1; its hidden layer transfer function is the "tansig" function and its output layer transfer function is the "purelin" function, with a maximum of 300 training iterations, a minimum training-objective error of 0.0001, a learning rate of 0.1, and the initial weights and thresholds derived from the model's own learning. In the ELM method, the penalty parameter C obtained from training is 10.276 and the kernel parameter σ is 0.0013. In the KELM model, C is 10.108 and σ is 0.0026. Table 3 shows the DES load prediction results of BPNN, ELM, KELM, and the method proposed in this article on the test set.
To analyze the results intuitively, the predicted results in Table 3 are drawn as curves, as shown in Figure 2. It can be found from Table 3 and Figure 2 that the predicted results of the four methods do not differ greatly from the real data, and the overall trends are consistent. Among them, the predictions of the KPCA-FWA-KELM method are closest to the actual values, while the other models show relatively large errors. Furthermore, the prediction curve of the KELM method is more accurate than that of the single prediction curve, indicating that the introduction of the kernel function increases the accuracy of the model to a certain extent. For the proposed KPCA-FWA-KELM method, the relative errors of the listed sample points are 2.25%, −2.29%, 2.23%, 2.06%, 2.24%, 2.31%, 2.19% and −2.28%, respectively. In the KELM prediction method, the relative errors of 11 prediction results are controlled within [−3%, 3%]. The errors of three prediction results lie within [−1.5%, 1.5%], with serial numbers 13, 31 and 45; their relative errors are −1.22%, 1.15% and 1.01%, respectively. The minimum absolute value of the relative error is 1.01% and the maximum is 5.89%.
In the ELM model, the errors of four sample points are controlled within [−3%, 3%], including the samples with serial numbers 7, 68 and 84, whose relative errors are 2.35%, −2.26% and 1.14%, respectively; all of them fall outside the [−1%, 1%] range. The minimum absolute value of the relative error is 1.14% and the maximum is 7.52%. For the BPNN method, the minimum absolute relative error is 1.16% and the maximum is 10.12%; the errors of most prediction results lie in [−9%, −7%] and [7%, 9%], and the fluctuation range is large. From this point of view, the KPCA-FWA-KELM method has the best prediction effect, followed by the KELM and ELM methods, while the BPNN method performs worst. It can be seen that the KPCA-FWA-KELM model has the best prediction accuracy and stability, which shows that the FWA algorithm improves the learning ability of the model, effectively avoids falling into local optima, and improves the global search ability of KELM. It also shows that the KPCA preprocessing can achieve satisfactory prediction results and effectively eliminate the interference of redundant data.
The RMSE, MAPE and AAE of BPNN, ELM, KELM and KPCA-FWA-KELM are shown in Figure 4. We can find that the RMSE, MAPE and AAE of the proposed method are 1.6158%, 1.6079% and 1.6017%, respectively, which are the smallest errors among the four methods. Furthermore, the RMSE, MAPE and AAE of the KELM method are 4.0873%, 4.0713% and 4.0649%, respectively; those of the ELM method are 5.4899%, 5.2693% and 5.0553%, respectively; and those of the BPNN method are 7.9917%, 7.8731% and 7.8878%, respectively. These indexes reflect the overall error of model forecasting and the degree of error dispersion. Therefore, it can be further seen that the overall forecasting effect of the KELM method is superior to those of the ELM and BPNN methods, while the ELM method is superior to the BPNN method, indicating that the overall forecasting performance of ELM is significantly improved after the introduction of the kernel function. The forecasting effect of the KPCA-FWA-KELM method is better than that of the KELM method. Practice has proved that using the FWA algorithm to select C and σ for the KELM method yields a better optimization effect, while the KPCA method reduces redundant data while ensuring the integrity of the input information, thus achieving the ideal forecasting effect.
In summary, the method proposed in this article optimizes the KELM method by the FWA algorithm and obtains the appropriate parameters C and σ in the KELM method, which can effectively decrease the load forecasting error.
On the one hand, KPCA method can ensure the integrity of input information, on the other hand, it can decrease the noise in the input variables to enhance the effectiveness of input variables, so as to enhance the accuracy and stability of distributed energy system load prediction. The data calculation results prove the effectiveness and stability of the load prediction method proposed in this article.

Conclusions
This paper puts forward a hybrid load prediction method that combines KPCA with a KELM optimized by FWA. First, the KPCA method is used to select input features for forecasting the power load of the DES. In addition, the FWA method is employed to optimize the parameters of KELM. Lastly, after the optimized input subset and the best values of C and σ are obtained, the method is used for load prediction of the DES. The following results are obtained through this study.
(1) KPCA can effectively decrease the influence of uncorrelated noise and improve the prediction performance. (2) The introduction of the FWA optimization algorithm can enhance the global search ability, and the KELM method optimized by FWA shows good results. (3) On the basis of the error indexes, KELM achieves better prediction results than ELM (RMSE, MAPE and AAE are 4.0873%, 4.0713% and 4.0649%, respectively), indicating that improving ELM by introducing the kernel function is effective. Therefore, the KPCA-FWA-KELM load prediction model proposed in this paper is effective and feasible and is expected to become an effective alternative for load prediction in the power industry.