1. Introduction
According to the latest UN report “World Population Prospects 2022”, the Earth’s population reached 8 billion people as of 15 November 2022. Forecasts of the UN Department of Economic and Social Affairs project that the world population will increase to 8.5 billion by 2030 and to 9.7 billion by 2050 [1]. Population growth, continued industrialization, and urbanization will be key drivers of energy demand in the coming decades. The development of any country’s economy relies heavily on its energy resources and their management [2], since energy is necessary for every type of industry and at every stage of development.
Given this context, it becomes crucial to create accurate models for forecasting the power load, since such models form the basis for making adequate decisions within the power management domain [3]. Predicting electricity demand is a critical element of power sector planning and development, as it helps to match the future power demands of the various sectors consuming electricity. Power load forecasting is critical in the capacity planning, scheduling, and maintenance of power systems, and it also supports end-consumer awareness by allowing consumers to observe their consumption patterns and bills in real time [4]. Because of the growing deregulation of the energy market, it is more important than ever for utility providers to produce stronger load forecasts. Electric energy storage is expensive, inefficient, or impracticable, and the demand and supply of power must always be balanced [3]. Consequently, good forecasting approaches for energy demand are indispensable for power system governance, which in turn involves efficient resource management [2,5].
Various factors such as weather (e.g., temperature, wind, rain), consumer behavior, plugged load, and social and geographical factors more or less directly influence the quantity of energy consumed in different areas [2]. Accordingly, since several factors affect the amount of used energy, and due to the nonlinearity and uncertainty of these different predictors, forecasting energy consumption represents a rather complex task [6,7]. While economic factors affect the consumption trend directly, weather variations induce a cyclical behavior in the time series. Identifying sufficient and adequate information for a good time-series dataset for power load or consumption prediction is a challenge in the development of methods that achieve credible forecasts. Forecasting will be poor if there is insufficient information; similarly, modeling will be difficult or even misleading if the information in the dataset is irrelevant or redundant [8].
Most of the research in the literature is devoted to developing prediction approaches for short-term forecasting (up to 1 day or at most 1 week), and fewer models are dedicated to the medium term (several weeks and up to a few months) and the long term (from a year to 10–20 years ahead) [9]. Conversely, many methods that do not perform suitably for mid-to-long-term forecasting can provide good results for short-term forecasting. Before using predictive analysis, it is important to understand the limitations of each method [10].
The purpose of this study is manifold. It proposes an overview of the most recent employments of DL for energy forecasting in various settings (e.g., individual household power consumption, building consumption, consumption of the economic sector, electricity load for an energy market operator, etc.), which subsequently enables effective oversight of the produced and consumed energy. The techniques are varied, ranging from convolutional to recurrent architectures; however, they do not frequently use hyperparameter tuning. The focus is then set on the latest entries for power load forecasting, a task that is also addressed in the current work. In this context, the second part of the study is devoted to an exemplification of a real-world scenario of power load forecasting using an optimally parameterized LSTM network. We do not aim to propose a state-of-the-art DL model for this task but, rather, to point out that the quality of the results clearly improves when efforts are invested in fine-tuning the hyperparameters of the LSTM via metaheuristics.
The article is structured as follows. The next section starts with an overview of recent studies in energy forecasting, where all the presented methods are DL-based and are examined under similar circumstances. The section continues with a subsection dedicated to recent models for power load estimation, the task treated in the current work. Finally, the section closes with a short description of the dataset that is used.
Section 3 describes the methods that are later exemplified in the experiments section. It presents the LSTM used, as well as the various metaheuristics utilized for hyperparameter tuning.
Section 4 presents the results obtained from the implementation and discusses their meaning. The final section concludes the study and suggests ideas for future work.
3. Baseline Methods Used within Experiments
This section briefly introduces the baseline methods that were implemented and utilized in the experiments. First, the description of the LSTM model is provided, followed by explanations of the six metaheuristics that were employed to tune the LSTM architecture.
3.1. Long Short-Term Memory
ANNs are the main focus of recent AI applications and provide the foundation for DL. These networks try to mimic the biological structure of the human brain, with neurons and connections between them. During the training process, the neurons learn correlations with their neighbors, allowing the network to solve different prediction problems. Depending on the type of problem to be solved, there are various sorts of ANNs, including shallow, deep, convolutional, and recurrent neural networks [33].
Traditional neural networks determine the output based on the current input alone, without considering previous inputs, which renders them unsuitable for time-series predictions. RNNs, on the other hand, can remember previous input data, and the LSTM is additionally capable of retaining long-term input data. The LSTM consists of a repeating memory cell with three interacting layers, known as the forget gate, the input gate, and the output gate. The gates are the mechanism utilized to select which data will be kept and which will be forgotten.
Data entering the LSTM model first pass through the forget gate, which decides whether these data should be released from the current state. The function of the forget gate $f_t$ is obtained by Equation (1):

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \quad (1)$$

where $f_t$ denotes the forget gate, which is in the range $[0, 1]$, as the function is defined by the sigmoid expression $\sigma(x) = 1/(1 + e^{-x})$. $W_f$ and $U_f$ are variable weight matrices and $b_f$ denotes the bias. $x_t$ denotes the input data and $h_{t-1}$ is the previous output.
During the next phase, data are passed to the input gate, which is described by Equations (2) and (3):

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \quad (2)$$

$$\tilde{C}_t = \tanh(W_C x_t + U_C h_{t-1} + b_C) \quad (3)$$

In Equation (2), $i_t$ represents the output of the sigmoid function, which specifies which data are to be stored within the memory cell. $W_i$, $U_i$, and $b_i$ are parameters to be optimized. To obtain the entire result from the input gate, it is required to establish the potential update vector $\tilde{C}_t$, which is defined by Equation (3). This vector lies within the boundaries $[-1, 1]$, as it represents the result of the $\tanh$ function.
To determine the new cell state, the previous state and the potential update values are combined according to Equation (4):

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \quad (4)$$

where $f_t \odot C_{t-1}$ represents the values that are discarded from the memory, while the novel data that will be stored in the cell are given by $i_t \odot \tilde{C}_t$.
The final output gate can be expressed as in Equation (5):

$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \quad (5)$$

The gate $o_t$ denotes the output of the sigmoid function, which is multiplied element-wise with the $\tanh$ of the cell state $C_t$ to produce the output $h_t$, as described by Equation (6):

$$h_t = o_t \odot \tanh(C_t) \quad (6)$$
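To make Equations (1)–(6) concrete, the following minimal NumPy sketch performs a single LSTM cell step; the dimensions, random initialization, and parameter layout are illustrative assumptions for exposition, not the tuned model used later in the experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM cell step implementing Equations (1)-(6)."""
    W, U, b = params["W"], params["U"], params["b"]
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])    # forget gate, Eq. (1)
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])    # input gate, Eq. (2)
    c_hat = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate update, Eq. (3)
    c_t = f_t * c_prev + i_t * c_hat                          # new cell state, Eq. (4)
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])    # output gate, Eq. (5)
    h_t = o_t * np.tanh(c_t)                                  # hidden output, Eq. (6)
    return h_t, c_t

# Toy dimensions, for illustration only.
rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
params = {
    "W": {g: rng.standard_normal((n_hid, n_in)) * 0.1 for g in "fico"},
    "U": {g: rng.standard_normal((n_hid, n_hid)) * 0.1 for g in "fico"},
    "b": {g: np.zeros(n_hid) for g in "fico"},
}
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.standard_normal(n_in), h, c, params)
```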
LSTM models are famous for their superior performance level when it comes to time-series forecasting, as can be seen from recent applications including stock prices [
34,
35], petroleum production [
36], medical diagnosis [
37,
38], and COVID-19 cases prediction [
39,
40], to name only a few.
3.2. Metaheuristics
This subsection introduces the six renowned metaheuristic algorithms that are utilized throughout the experiments to optimize the LSTM model. The selected methods are well-known optimizers, frequently and successfully used for solving a wide range of NP-hard assignments over the last two decades. All observed metaheuristics are implemented in their original variants, with the default control parameter values proposed by their respective authors in the initial papers.
Among recent successful applications of the population-based methods, the most prominent include COVID-19 cases forecasting [
41,
42], cloud computing challenges [
43,
44,
45], cloud-edge computing [
46], wireless sensor network optimization [
47,
48,
49,
50], feature selection [
51,
52,
53], classification of MRI images and other medical solutions [
54,
55,
56], optimization problems [
57,
58], credit card fraud detection [
59,
60], pollution estimation [
61], network security [
62,
63], and also the general optimization of the different machine learning models, including LSTM [
64,
65,
66,
67].
3.2.1. Genetic Algorithm
The genetic algorithm (GA) is an evolutionary metaheuristic inspired by natural selection. The algorithm simulates the processes of selection, inheritance, crossover, and mutation at the genetic level. A recent overview of the GA can be found in [68].
The individuals in the initial population have a set of properties which are alterable and mutative. Based on their individual fitness, parents are selected to produce the individuals of the next generation.
Additionally, individuals can be crossed over in pairs, combining their advantages to create better offspring. Finally, mutation can be applied to a single individual to alter its properties in pursuit of better fitness in the next generation.
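To make these operators concrete, the sketch below implements one GA generation in Python. The specific choices (roulette-wheel selection, single-point crossover, Gaussian mutation, and a non-negative fitness that is maximized) are common textbook assumptions rather than the exact GA variant used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def ga_generation(pop, fitness, p_mut=0.1):
    """One GA generation: selection, crossover, and mutation (illustrative sketch)."""
    n, d = pop.shape
    # Fitness-proportional (roulette-wheel) selection of parents.
    probs = fitness / fitness.sum()
    parents = pop[rng.choice(n, size=n, p=probs)]
    # Single-point crossover of consecutive parent pairs.
    children = parents.copy()
    for i in range(0, n - 1, 2):
        cut = rng.integers(1, d)
        children[i, cut:] = parents[i + 1, cut:]
        children[i + 1, cut:] = parents[i, cut:]
    # Gaussian mutation applied to each gene with probability p_mut.
    mask = rng.random((n, d)) < p_mut
    children[mask] += 0.1 * rng.standard_normal(mask.sum())
    return children
```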
3.2.2. Particle Swarm Optimization
Kennedy et al. suggested a metaheuristic optimization method named particle swarm optimization (PSO) in 1995, inspired by the flocking habits of birds and fish [69]. The particles, which are the individuals in the population, act as search agents. Their goal is to provide satisfactory solutions for discrete and continuous optimization problems.
The collective experience is shared while seeking the best solution, and it consists of the individual best experience and those of the neighboring solutions. After evaluating the gathered experiences, the next move is decided.
Initially, each particle in the generated population is assigned a random position and velocity. The particles move over the iterations, and the best position of each one is stored.
The velocity with which a particle moves is a weighted sum of three components: the old velocity, the velocity leading in the direction of the best position found by the particle itself, and the velocity towards the best position attained by the neighboring particles, as given by Equation (7):

$$v_i^{t+1} = v_i^t + U(0, \phi_1) \otimes (p_i - x_i^t) + U(0, \phi_2) \otimes (p_g - x_i^t) \quad (7)$$

where $U(0, \phi_i)$ denotes a vector of uniformly distributed random values within the limits of 0 to $\phi_i$, randomly produced in every round for all agents, so that each of its components lies inside the range $[0, \phi_i]$. The $\otimes$ operator represents component-wise multiplication, $p_i$ is the particle's own best position, and $p_g$ is the best position attained in its neighborhood. The new position is then obtained as $x_i^{t+1} = x_i^t + v_i^{t+1}$.
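The following self-contained Python sketch wraps Equation (7) in a complete PSO loop; the swarm size, iteration count, $\phi_1 = \phi_2 = 2$, and the bound-clipping step are illustrative assumptions.

```python
import numpy as np

def pso_minimize(f, lb, ub, n_particles=20, n_iters=100, phi1=2.0, phi2=2.0, seed=0):
    """Minimal PSO following Equation (7), minimizing objective f within [lb, ub]."""
    rng = np.random.default_rng(seed)
    dim = len(lb)
    x = rng.uniform(lb, ub, (n_particles, dim))        # initial positions
    v = np.zeros((n_particles, dim))                   # initial velocities
    pbest, pbest_f = x.copy(), np.array([f(p) for p in x])
    g = pbest[pbest_f.argmin()].copy()                 # neighborhood = whole swarm here
    for _ in range(n_iters):
        r1 = rng.uniform(0, phi1, (n_particles, dim))  # U(0, phi1), fresh every round
        r2 = rng.uniform(0, phi2, (n_particles, dim))  # U(0, phi2)
        v = v + r1 * (pbest - x) + r2 * (g - x)        # velocity update, Eq. (7)
        x = np.clip(x + v, lb, ub)                     # move, staying inside the bounds
        fx = np.array([f(p) for p in x])
        better = fx < pbest_f
        pbest[better], pbest_f[better] = x[better], fx[better]
        g = pbest[pbest_f.argmin()].copy()
    return g, float(pbest_f.min())

# Example: minimize the sphere function in three dimensions.
best, val = pso_minimize(lambda p: float(np.sum(p**2)),
                         np.full(3, -5.0), np.full(3, 5.0))
```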
3.2.3. Artificial Bee Colony
Karaboga devised the artificial bee colony (ABC) algorithm [
70], modeled after the food-collecting behavior shown by bees in the colony. The ABC differentiates three varieties of bees in the colony: workers, observers, and scouts, being utilized for guidance in respect to exploration and exploitation. The colony is split into two sections, the first consisting of the worker bees and the second formed by the observers. Workers are assigned to execute the exploitation procedure of nourishment sources converted to candidate solutions. Simultaneously, the observers identify the nourishment sources that are the most promising for exploitation, based on the feedback received by the workers. If the individual sticks to the food source that is not possible to enhance, that individual switches their role to scout and begins the exploration procedure. The arbitrary starting set of individuals is produced by utilizing Equation (
8):
where the
j-th component belonging to the
i-th individual is marked with
, and lower and upper limits are given by
and
for component
j.
Equation (9) models the process whereby every worker bee discovers a novel food source within its proximity in every round of execution:

$$v_{i,j} = \begin{cases} x_{i,j} + \phi \cdot (x_{i,j} - x_{k,j}), & \text{if } rand(0, 1) < MR \\ x_{i,j}, & \text{otherwise} \end{cases} \quad (9)$$

where $x_{i,j}$ gives the $j$-th component of the former solution $i$, $x_{k,j}$ describes the $j$-th component that belongs to the neighboring solution $k$, and the parameter $\phi$ denotes an arbitrary number inside the interval $[-1, 1]$, while the modification rate $MR$ is a control value that helps avoid convergence to suboptimal solutions.
When a novel solution is discovered in the proximity, its fitness is compared with that of the former solution. If the fitness of the novel solution is better, it replaces the former one in the population.
After completing the intensification procedure, workers give feedback to the observers about the quality of the food sources. The observers then decide over a source $i$ with a probability correlated to its fitness, as defined by Equation (10):

$$p_i = \frac{fit_i}{\sum_{j=1}^{m} fit_j} \quad (10)$$

where $p_i$ represents the likelihood that food source $i$ will be selected, the total count of food sources is given by $m$, and the fitness value of solution $i$ is denoted by $fit_i$.
Equation (10) ensures that a higher number of observers is attracted to the high-quality food sources. After determining the latest best source of food, observers continue to seek other good food sources in its proximity, as described by Equation (9). If a worker abandons a source that cannot be enhanced and changes into a scout unit, that source is removed and a novel source is produced. The control variable that determines whether a source should be deserted is the $limit$ parameter.
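A compact Python sketch of the three ABC building blocks from Equations (8)–(10) follows; the MR-gated neighbor update mirrors the modified formulation above, and all concrete values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_sources(m, lb, ub):
    """Equation (8): random initial food sources within [lb, ub]."""
    return lb + rng.random((m, len(lb))) * (ub - lb)

def neighbor_source(X, i, mr=0.8):
    """Equation (9): perturb components of source i towards a random neighbor k."""
    m, d = X.shape
    k = rng.choice([j for j in range(m) if j != i])
    phi = rng.uniform(-1.0, 1.0, d)                  # arbitrary number in [-1, 1]
    gate = rng.random(d) < mr                        # modification rate MR
    return np.where(gate, X[i] + phi * (X[i] - X[k]), X[i])

def selection_probabilities(fitness):
    """Equation (10): probability that an observer selects each source."""
    return fitness / fitness.sum()
```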
3.2.4. Firefly Algorithm
Yang proposed the firefly algorithm (FA) [71], inspired by the swarming behavior of fireflies, which they use to communicate among themselves. The fireflies communicate by utilizing the bioluminescence trait, referring to their natural property to radiate light. The specimens communicate with respect to the light’s magnitude, as individuals that emit less intense light tend to move in the direction of the brighter fireflies. The fireflies are regarded as unisex organisms; consequently, gender does not affect communication. The attractiveness property is used to quantify the brightness of each unit. In case several fireflies have the same attractiveness, a random movement is chosen. The objective function that is tuned determines the amount of light radiated by each insect.
As stated above, the fitness function is modeled by manipulating the brightness and attractiveness of the fireflies. The majority of FA variants use the brightness values provided by the fitness function. For a minimization task, Equation (11) is utilized:

$$I(x) \propto \frac{1}{f(x)} \quad (11)$$

where $I(x)$ determines the brightness, and hence the attractiveness, while $f(x)$ denotes the value of the fitness function at position $x$. To reflect the physical properties of light, where the intensity fades as the distance from the source increases, the attractiveness also decreases at larger distances, as shown by Equation (12):

$$I(r) = \frac{I_0}{r^2} \quad (12)$$

where $I(r)$ gives the light intensity at range $r$, assuming that the intensity at the origin is represented by $I_0$. Additionally, to reflect the effect of the absorption of light by the surrounding medium, the parameter $\gamma$ is introduced to model the absorption coefficient. This property is mathematically modeled by using the Gaussian form outlined in Equation (13):

$$I(r) = I_0 e^{-\gamma r^2} \quad (13)$$
The attractiveness $\beta$ of the fireflies in the flock changes with respect to the light intensity emitted by the individual, taking into account also the range between the insects, as defined by Equation (14):

$$\beta(r) = \beta_0 e^{-\gamma r^2} \quad (14)$$

where the $\beta_0$ value denominates the attractiveness level of the firefly at range $r = 0$. It should be noted that the majority of FA applications do not use Equation (14) directly, but recommend using Equation (15) instead:

$$\beta(r) = \frac{\beta_0}{1 + \gamma r^2} \quad (15)$$
Equation (16) describes the move of the $i$-th random firefly in the direction of firefly $j$, which emits more intense light, in every round:

$$x_i^{t+1} = x_i^t + \beta_0 e^{-\gamma r_{i,j}^2}(x_j^t - x_i^t) + \alpha^t \kappa_i^t \quad (16)$$

where the randomization parameter is given by $\alpha$, $\kappa$ is an arbitrary value drawn from the Gaussian distribution, and the range between two insects $i$ and $j$ is given by $r_{i,j}$. Exhaustive simulations have shown that the FA attains the best level of performance with values of 0.2 and 1 for the $\alpha$ and $\beta_0$ parameters, respectively.
Finally, Equation (17) is used to calculate the Cartesian distance $r_{i,j}$:

$$r_{i,j} = \|x_i - x_j\| = \sqrt{\sum_{k=1}^{D}(x_{i,k} - x_{j,k})^2} \quad (17)$$

where the number of particular problem dimensions is represented by $D$.
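The firefly move defined by Equations (15)–(17) can be sketched in a few lines of Python; this helper covers a single move of one firefly towards a brighter one, not a full FA loop, and the default parameter values follow the recommendations above.

```python
import numpy as np

def fa_move(x_i, x_j, alpha=0.2, beta0=1.0, gamma=1.0, rng=None):
    """Move firefly x_i towards the brighter firefly x_j, Equations (15)-(17)."""
    rng = rng or np.random.default_rng()
    r = np.linalg.norm(x_i - x_j)             # Cartesian distance, Eq. (17)
    beta = beta0 / (1.0 + gamma * r**2)       # attractiveness at range r, Eq. (15)
    kappa = rng.standard_normal(x_i.shape)    # Gaussian random component
    return x_i + beta * (x_j - x_i) + alpha * kappa   # movement step, cf. Eq. (16)
```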
3.2.5. Bat Algorithm
Bats are very interesting animals: they are the only mammals capable of sustained flight and they also possess an advanced echolocation ability. The echolocation behavior of bats can be correlated to an objective function that is to be optimized, which makes it a suitable basis for a metaheuristic called the bat algorithm (BA) [72]. The algorithm is formulated around multiple factors that influence the way bats move. Firstly, every unit of the population flies with a random velocity $v_i$ at a position $x_i$, which represents a solution. During its hunt, a bat varies its frequency, loudness, and pulse-emission rate. The intensification of the search is performed by a random walk, and termination criteria decide when the best attainable results have been achieved. The balance between the exploitation and exploration phases is controlled through the parameters of the algorithm.
To keep the algorithm simple, the following approximations or idealized guidelines were employed:
All units utilize echolocation to sense distance, and they can also differentiate between the target prey and the surrounding structures.
Even though the loudness can vary in numerous ways, it is assumed that it decreases from a large (positive) initial value $A_0$ to a fixed minimum value $A_{min}$.
The current solution is marked by $x_i^{t-1}$, and the novel, adjusted location during round $t$ of the $i$-th solution is represented by $x_i^t$. It is computed as in Equation (18):

$$x_i^t = x_i^{t-1} + v_i^t \quad (18)$$

where the velocity is denoted by $v_i^t$.
The velocity of the bat at iteration $t$ can be attained by Equation (19):

$$v_i^t = v_i^{t-1} + (x_i^{t-1} - x^*) f_i \quad (19)$$

where the most recent global best location is marked by $x^*$, while $f_i$ describes the frequency level utilized by the $i$-th individual.
The frequency used by the individual is uniformly drawn from the specified interval bounded by the minimum and maximum frequencies, and it can be obtained in the following way:

$$f_i = f_{min} + (f_{max} - f_{min}) \beta \quad (20)$$

where $f_{min}$ and $f_{max}$ represent the minimum and maximum frequencies, while the $\beta$ value is an arbitrary number, $\beta \in [0, 1]$.
A random walk is used to modify the most recent best individual, directing the algorithm’s exploitation procedure, as formulated in Equation (21):

$$x_{new} = x_{old} + \epsilon A^t \quad (21)$$

where the mean loudness value of the entire population is represented by $A^t$, while $\epsilon$ denotes a scaling parameter produced as an arbitrary number in the range $[-1, 1]$.
When the target prey has been discovered by the bats, they update the loudness and the pulse-emission rate with respect to Equation (22):

$$A_i^{t+1} = \alpha A_i^t, \quad r_i^{t+1} = r_i^0 \left[1 - e^{-\gamma t}\right] \quad (22)$$

where $A_i^t$ denotes the loudness level of the $i$-th individual during round $t$, while $r_i$ represents the pulse-emitting rate. The parameters $\alpha$ and $\gamma$ are fixed values.
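A minimal Python sketch of the BA update rules in Equations (18)–(22) is given below; the separation into three helpers and the array shapes are illustrative assumptions.

```python
import numpy as np

def bat_move(x, v, x_best, f_min, f_max, rng):
    """Equations (18)-(20): draw frequencies, update velocities and positions."""
    beta = rng.random(len(x))                       # arbitrary number in [0, 1]
    freq = f_min + (f_max - f_min) * beta           # Eq. (20)
    v = v + (x - x_best) * freq[:, None]            # Eq. (19)
    return x + v, v                                 # Eq. (18)

def local_walk(x_best, mean_loudness, rng):
    """Equation (21): random walk around the current best solution."""
    eps = rng.uniform(-1.0, 1.0, x_best.shape)      # scaling factor in [-1, 1]
    return x_best + eps * mean_loudness

def update_loudness_and_rate(A, r0, alpha, gamma, t):
    """Equation (22): loudness decreases while the pulse rate increases."""
    return alpha * A, r0 * (1.0 - np.exp(-gamma * t))
```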
3.2.6. Sine Cosine Algorithm
The sine cosine algorithm (SCA) was suggested by Mirjalili [73]. As is standard, population-based optimization methods begin the tuning procedure with a collection of random solutions. This random set is assessed repeatedly by the objective function and improved by a set of rules that represent the kernel of the tuning method.
Regardless of the differences among the formulas utilized in stochastic population-based algorithms, the tuning task is separated into two procedures: exploration and exploitation. During exploration, the metaheuristic combines the random solutions in the population abruptly, applying a high level of randomness, aiming to determine the promising areas of the entire search realm. During the exploitation procedure, in contrast, the random solutions undergo steady modifications, and the random perturbations are significantly smaller than in the exploration stage. The search is defined by Equation (23):

$$X_i^{t+1} = \begin{cases} X_i^t + r_1 \sin(r_2) \left| r_3 P_i^t - X_i^t \right|, & r_4 < 0.5 \\ X_i^t + r_1 \cos(r_2) \left| r_3 P_i^t - X_i^t \right|, & r_4 \geq 0.5 \end{cases} \quad (23)$$

where $X_i^t$ represents the position of the observed individual in the $i$-th dimension at the $t$-th iteration, $r_1$, $r_2$, and $r_3$ are random control variables, $P_i^t$ is the position of the destination point in the $i$-th dimension, and $|\cdot|$ denotes the absolute value.
As the above equation reveals, there are four primary parameters in SCA: $r_1$, $r_2$, $r_3$, and $r_4$. The parameter $r_1$ determines the next position's region (or direction of motion), which can lie either in the space between the individual and the destination or outside it. The parameter $r_2$ specifies how far the movement should be towards or away from the destination. The parameter $r_3$ provides a random weight for the destination, aiming to stochastically emphasize ($r_3 > 1$) or de-emphasize ($r_3 < 1$) the effect of the destination in defining the distance. Ultimately, the parameter $r_4$ switches equally between the two components of Equation (23). As a result of using these basic trigonometric functions, the metaheuristic was named the sine cosine algorithm (SCA).
The cyclic patterns of the trigonometric functions permit a solution to be repositioned around another solution, which ensures the exploitation of the space located between the two. For exploring the search area, the individuals must also be capable of searching outside the region between their corresponding positions, which is attained by changing the range of the sine and cosine functions.
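The position update of Equation (23) translates directly into vectorized Python; the linearly decreasing schedule for $r_1$ is the one commonly used with SCA and is an assumption here, as the text above does not fix it.

```python
import numpy as np

def sca_step(X, P, t, max_iters, a=2.0, rng=None):
    """Equation (23): move all solutions X around the destination point P."""
    rng = rng or np.random.default_rng()
    n, d = X.shape
    r1 = a - t * (a / max_iters)               # shrinks over iterations (assumed schedule)
    r2 = rng.uniform(0.0, 2.0 * np.pi, (n, d))
    r3 = rng.uniform(0.0, 2.0, (n, d))         # random destination weight
    r4 = rng.random((n, d))                    # switch between sine and cosine
    step = np.abs(r3 * P - X)
    return np.where(r4 < 0.5,
                    X + r1 * np.sin(r2) * step,
                    X + r1 * np.cos(r2) * step)
```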
4. Results
The observed dataset described previously was divided into training, validation, and test subsets (70%, 10%, and 20%), and multivariate time-series forecasting was performed by utilizing the tuned LSTM network. The dataset visualization is shown in
Figure 2.
All six observed metaheuristic algorithms were assigned to determine the optimal set of hyperparameters for the LSTM model inside a collection of empirically established boundaries. The LSTM models were developed in Python by utilizing the Keras and TensorFlow 2.0 libraries. The hyperparameters that were subjected to the tuning process, together with their respective search boundaries, are as follows:
the count of neurons;
the learning rate;
the count of training epochs;
the dropout rate.
The recurrent dropout was fixed to a constant value. Additionally, the early stopping criterion was utilized in the following way: if the results did not improve for a given number of rounds, training would halt to avoid overfitting. The metaheuristics were initialized with a starting population size of four units, and the tuning process was executed in five rounds (iterations) across five independent runs. Finally, each metaheuristic-tuned LSTM is referred to by appending the metaheuristic's abbreviation to the LSTM name, to help with the clarity of the presented experimental outcomes (for example, LSTM-FA denotes the LSTM model tuned by the FA algorithm). It is noted that a relatively low number of individuals in the population, iterations, and runs was used because the experiments require substantial computational power.
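For orientation, the sketch below shows the kind of objective function that each metaheuristic minimizes: it builds and trains one single-LSTM-layer Keras model from a candidate hyperparameter vector and returns the validation MSE. The literals (patience, recurrent dropout) are placeholders, since the exact experimental values are not reproduced here.

```python
from tensorflow import keras

def evaluate_candidate(hp, X_tr, y_tr, X_val, y_val):
    """Objective for the metaheuristics: candidate hp = (neurons, lr, epochs, dropout)."""
    n_units, lr, n_epochs, dropout = hp
    model = keras.Sequential([
        keras.layers.LSTM(int(n_units),
                          dropout=float(dropout),
                          recurrent_dropout=0.0,        # fixed value (placeholder here)
                          input_shape=X_tr.shape[1:]),  # (timesteps, features)
        keras.layers.Dense(1),                          # one-step-ahead forecast
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=float(lr)), loss="mse")
    stop = keras.callbacks.EarlyStopping(monitor="val_loss",
                                         patience=5,    # illustrative patience
                                         restore_best_weights=True)
    model.fit(X_tr, y_tr, validation_data=(X_val, y_val),
              epochs=int(n_epochs), callbacks=[stop], verbose=0)
    return float(model.evaluate(X_val, y_val, verbose=0))  # MSE to be minimized
```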
The training of an LSTM is very resource-intensive, and a graphics processing unit (GPU) that supports the CUDA technology is needed in order to finish training in a reasonable amount of time. Each added LSTM layer makes the training process substantially more demanding. Therefore, it should be noted that this research employs a relatively simple network structure with only one LSTM layer, which can be trained relatively inexpensively while still obtaining satisfying performance.
The results of each LSTM network are evaluated by utilizing a standard set of metrics (which are largely used in similar works, as seen in Table 1) that includes the mean squared error (MSE), Equation (24); the root mean squared error (RMSE), Equation (25); the mean absolute error (MAE), Equation (26); and the coefficient of determination ($R^2$), Equation (27):

$$MSE = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2 \quad (24)$$

$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2} \quad (25)$$

$$MAE = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right| \quad (26)$$

$$R^2 = 1 - \frac{\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2} \quad (27)$$

where $y$ and $\hat{y}$ represent the vectors containing the observed values and the predicted ones, respectively, both having size $N$, and $\bar{y}$ is the mean of the observed values. This research utilizes MSE as the objective function, with the goal of minimizing it.
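For completeness, Equations (24)–(27) translate directly into the following NumPy helper.

```python
import numpy as np

def regression_metrics(y, y_hat):
    """Compute the metrics of Equations (24)-(27) for observed y and predicted y_hat."""
    err = y - y_hat
    mse = np.mean(err**2)                                    # Eq. (24)
    rmse = np.sqrt(mse)                                      # Eq. (25)
    mae = np.mean(np.abs(err))                               # Eq. (26)
    r2 = 1.0 - np.sum(err**2) / np.sum((y - y.mean())**2)    # Eq. (27)
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}
```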
In all tables that contain the results, the best outcome for each category is marked in bold.
Table 2 presents the experimental outcomes in terms of objective function indicators for the best, worst, mean, and median run, as well as the standard deviation and variance throughout five independent runs. As can be observed from
Table 2, the LSTM-FA achieved the best result, while the LSTM-SCA scored the best value for the metric denoting the worst run. The best mean and median values were achieved by the LSTM-GA algorithm.
Table 3 and
Table 4 show the normalized and denormalized metrics of one-step-ahead estimations, for the best runs of all six observed models. The LSTM-FA was superior in terms of all observed metrics (MSE—the objective function,
, MAE, and RMSE). Lastly,
Table 5 shows the best hyperparameter values established by each one of the six observed metaheuristic algorithms.
The visualizations of the executed experiments are shown in
Figure 3 and
Figure 4, and outline the following for the objective function (MSE) and the $R^2$ indicator: the convergence graphs for the best run, box and violin plots for the objective function distribution over five runs, and swarm plots for population diversity in the last iteration of the best run. From the swarm plots, it is interesting to note that the ABC exhibits the highest diversity in the last round, while all solutions of the FA metaheuristic are concentrated around the best subset of the search space. This behavior is expected because the ABC uses the $limit$ parameter, which ensures high population diversity overall, while, at the other end, the FA exhibits strong exploitation towards the current best solution.
Figure 5 shows the kernel density estimation (KDE) plots for both the objective function (MSE) and $R^2$, depicting the probability density function. These plots present the distribution of the results over the runs, and it can be noted that all results follow a normal distribution, which indicates that all metaheuristics exhibit relatively stable behavior across different runs.
Finally, the forecasts produced by the best LSTM network generated by each of the six metaheuristics are visualized in Figure 6. It can be noted that the LSTM model produced by the FA algorithm achieved the best predictions of the observed power load time series.
Additionally, in order to verify that the very resource-intensive LSTM tuning by metaheuristics is worthwhile, an experiment that tunes the same LSTM hyperparameters by a simple grid search was also conducted. The search space boundaries for each hyperparameter (shown at the beginning of Section 4) were divided into 10 values, e.g., the grid for the number of neurons consists of 10 evenly spaced values across its search range. The best-performing LSTM structure generated by the grid search obtained an MSE value of 0.002665 and an $R^2$ of 0.913574, which are substantially worse than even the worst-run results of any metaheuristic included in the analysis.
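A sketch of this comparison experiment is shown below. The numeric boundaries are placeholders (the actual ranges are those defined at the beginning of Section 4), and `evaluate_candidate` refers to the illustrative objective sketched earlier.

```python
import itertools
import numpy as np

def grid_search(evaluate, X_tr, y_tr, X_val, y_val):
    """Exhaustive search over 10 values per hyperparameter (placeholder bounds)."""
    grids = {
        "neurons": np.linspace(10, 100, 10, dtype=int),   # placeholder range
        "learning_rate": np.linspace(1e-4, 1e-2, 10),     # placeholder range
        "epochs": np.linspace(10, 100, 10, dtype=int),    # placeholder range
        "dropout": np.linspace(0.05, 0.5, 10),            # placeholder range
    }
    best_hp, best_mse = None, np.inf
    for hp in itertools.product(*grids.values()):         # 10**4 candidates in total
        mse = evaluate(hp, X_tr, y_tr, X_val, y_val)
        if mse < best_mse:
            best_hp, best_mse = hp, mse
    return best_hp, best_mse
```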
Best Model Interpretation Using Explainable AI
The ability to explain the behavior exhibited by an ML model is vital for comprehending the process that is modeled and the obtained results. In order to explain the LSTM model that achieved the best level of performance on the power load dataset, the advanced explainable AI method Shapley additive explanations (SHAP) was employed. The SHAP procedure successfully avoids the trade-off between accuracy and interpretability by providing a straightforward and relevant interpretation of the obtained LSTM predictions. SHAP relies on Shapley values, inspired by game theory, that represent a feature importance metric, providing an insight into which attributes have the largest influence on the predictions [
74].
Simply put, Shapley values denote a set of payouts distributed between players that work together (representing the features) with respect to their contributions to the joint payout (denoting the prediction). Consequently, the SHAP method assigns each feature an importance value that measures the contribution of that feature to a specific prediction, by comparing the model's prediction with the prediction obtained when that particular feature is set to a baseline value.
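As a rough sketch of how such SHAP values can be obtained for a Keras LSTM, the fragment below uses shap's GradientExplainer; the background-sample size, the aggregation over time steps, and the variable names are illustrative assumptions rather than the exact analysis pipeline used here.

```python
import numpy as np
import shap

def explain_lstm(model, X_train, X_test, feature_names, n_background=100):
    """Compute SHAP values for a trained Keras LSTM and draw the summary plots."""
    rng = np.random.default_rng(0)
    background = X_train[rng.choice(len(X_train), n_background, replace=False)]
    explainer = shap.GradientExplainer(model, background)
    sv = explainer.shap_values(X_test)                 # (samples, timesteps, features)
    sv = np.asarray(sv[0]) if isinstance(sv, list) else np.asarray(sv)
    sv = sv.mean(axis=1)                               # aggregate over time steps
    # Bar plot of mean |SHAP| (relative importance) and per-observation beeswarm.
    shap.summary_plot(sv, feature_names=feature_names, plot_type="bar")
    shap.summary_plot(sv, X_test.mean(axis=1), feature_names=feature_names)
```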
The best-performing LSTM model obtained from experiments (LSTM-FA) was taken and Shapley values were calculated. To observe the influence of each predictor on the final output in the testing set, SHAP summary plots for all features with and without the target variable, which was also used as predictor, were created and are shown in
Figure 7. Summary plots in the form of bars (first row in the figure) represent the relative importance of each feature, calculated by taking the average absolute value of the SHAP values. Plots shown in the second row of the figure depict the influence that each observation has on the target variable (power-grid load).
From Figure 7, some important conclusions can be drawn. First, the feature with the highest relative importance for the target is the power-grid load in the past period, i.e., the preceding hours. Secondly, the second most influential predictor is the temperature, and from the summary plots per observation it can be clearly noted that, for most observations, the power-grid load increases as the temperature increases. This implication is logical because Spain has a generally hot climate and people use air conditioning systems for cooling, which consume a lot of electricity. Similar conclusions can be drawn for the influence of the humidity and wind speed predictors on the total power-grid load (energy consumption).
5. Conclusions
Some of the most recent DL architectures for energy modeling are reviewed, with an emphasis on energy load prediction. Although recent shallow ML approaches for energy forecasting exist, the supremacy of the DL-based models is generally acknowledged. The models are assessed from several facets, one of which is whether hyperparameter tuning was used in previous works. When a work did not specifically mention whether the hyperparameters were tuned via a specific method and simply provided their values, it is assumed that manual tuning was performed.
As many of the recent research works in this area do not use any dedicated tool or method for parameter tuning, a straightforward experiment was assembled herein that evaluates the necessity and adequacy of employing a specific tool for this task, e.g., a metaheuristic. As a counterpart, a grid search is used for tuning the same hyperparameters. A simple LSTM architecture was used to allow numerous simulations in a relatively short time. The overall results indicate that the FA led to the best results out of the entire set of used metaheuristics, although the differences in the outputs are very small. Nevertheless, the results obtained when using any of the six metaheuristics for parameter tuning were notably better than when grid search was used for the same purpose. While it is true that DL is rarely a fast procedure, it is shown that even with an economical metaheuristic (e.g., a population of only four individuals evolved over five iterations), the results are improved to a great extent; hence, when possible, the integration of such a supplementary mechanism pays off.
The goal of the current experiment was not to propose the most appropriate model for the problem at hand, but rather to demonstrate that hyperparameter tuning via metaheuristics leads to considerably better results. More complex DL approaches, such as bidirectional LSTM (BiLSTM) or gated recurrent units (GRUs), would probably improve the results, especially if properly tuned. Still, the results presented in the current work, as obtained from baseline approaches, can be used for comparative studies in the future. We conclude by recommending the use of metaheuristics for tuning the hyperparameters of DL models, since these generally lead to better-tailored models and, consequently, to improved results.