An Insight of Deep Learning Based Demand Forecasting in Smart Grids

Smart grids are able to forecast customers’ consumption patterns, i.e., their energy demand, and consequently electricity can be transmitted after taking into account the expected demand. To face today’s demand forecasting challenges, where the data generated by smart grids is huge, modern data-driven techniques need to be used. In this scenario, Deep Learning models are a good alternative to learn patterns from customer data and then forecast demand for different forecasting horizons. Among the commonly used Artificial Neural Networks, Long Short-Term Memory networks—based on Recurrent Neural Networks—are playing a prominent role. This paper provides an insight into the importance of the demand forecasting issue, and other related factors, in the context of smart grids, and collects some experiences of the use of Deep Learning techniques, for demand forecasting purposes. To have an efficient power system, a balance between supply and demand is necessary. Therefore, industry stakeholders and researchers should make a special effort in load forecasting, especially in the short term, which is critical for demand response.


Introduction
Electricity cannot be easily stored for future supply, unlike other commodities such as oil. This means that electricity must be distributed to the consumers immediately after its production. The distribution of electricity to final users has been done with the help of the traditional electrical grid (see definition in Table 1) which allows the delivery of electricity from producers to consumers. To achieve that goal, it connects the electricity generating stations and the transmission lines that deliver the electricity to the final users. Traditional electrical grids vary in size. When these grids started to expand, controlling them became a very complex and difficult task. Additionally, demand forecasting (see definition in Table 1) has not traditionally been considered.
In this context, the concept of the smart grid (see definition in Table 1) arises and starts to play an important role. This concept has been exhaustively reviewed in the literature (e.g., [1][2][3][4]). Smart grids provide two-way communication between consumers and suppliers. They add hardware and software to the traditional electrical grid to give it an autonomous response capacity to the different events that can affect it. The final objective is to achieve optimal daily operational efficiency in electrical power delivery. In [4], the authors define "smart grid" as a new form of electricity network that offers self-healing, power-flow control, energy security and energy reliability using digital technology. In [2], the authors highlight that the concept of the smart grid is transforming the traditional electrical grid by using different types of advanced technology. According to these authors, this concept integrates all the elements that are necessary to generate, distribute, and consume energy efficiently and effectively. In [5], the authors emphasize that the smart grid concept emerged to make the traditional electrical grid more [...]
The smart grid paradigm allows consumers to find out their energy usage patterns. Consequently, consumers can control their consumption and use energy more efficiently. In the implementation of the smart grid concept, demand response, for both household and industrial purposes, plays an important role. Another useful tool is load forecasting (see definition in Table 1). In [6], the authors stress the importance of this concept in the context of smart grids, as forecasting the electricity needed to meet demand allows power companies to better balance demand and supply. Power companies are especially interested in achieving accurate load forecasts for the next 24 h, which is called the load profile (see definition in Table 1).
In addition, in recent years, the increased demand for electricity at certain times of the day has created several problems. Load forecasting is especially important during peak hours. Demand response encourages customers to offload non-essential energy consumption during these peak hours.
To face load forecasting challenges, it is necessary to use modern data-driven techniques. Indeed, the incorporation of new technologies, such as Big Data, Machine Learning, Deep Learning, and the Internet of Things (IoT), has upgraded the smart grid concept to another level, as these technologies allow for improved demand forecasting and automated demand response. This paper provides an insight into the importance of demand forecasting and important related factors in the context of smart grids, as well as the possibility of using data-driven techniques for this purpose. More specifically, the authors focus on Deep Learning techniques, as they have emerged as a good option for the implementation of demand forecasting in the context of smart grids. The paper collects some experiences of using different Deep Learning techniques in the energy domain for forecasting purposes. An efficient power system must take demand response into account. Additionally, accurate load forecasting, especially in the short term, is essential, which is why industry stakeholders and researchers are putting special effort into it. Table 1 defines some keywords related to the topic of this paper.
The remainder of this paper is organized as follows. Section 2 presents the reasons why demand forecasting is important in the context of smart grids. Section 3 describes the most important factors in relation to demand forecasting. Section 4 presents the different possible classifications of demand forecasting techniques. Section 5 provides some fundamentals and concepts useful to understand the Deep Learning models commonly used in the energy domain. Section 6 collects different experiences of using these models in the context of smart grids for forecasting purposes. Finally, Section 7 summarizes the main conclusions of this work.

The Importance of Demand Forecasting
In [7] the authors summarize the main requirements of smart grids as follows: flexible enough to meet users' needs, able to manage uncertain events, accessible for all users, reliable enough to guarantee high-quality energy delivery to consumers, and innovative enough to manage energy efficiently.
With these requirements in mind, smart grids should aim to develop low-cost, easy-to-deploy technical solutions with distributed intelligence to operate efficiently in today's increasingly complex scenarios. To upgrade a traditional electrical grid into a smart grid, intelligent and secure communication infrastructures are necessary [4].
According to the study presented in [8], forecasting can be applied in two main areas: grid control and demand response. In [9], the authors highlight that forecasting models are essential to provide optimal quality of the energy supply at the lowest cost. In addition, real-time information on users' energy consumption patterns will enable more sophisticated and efficient forecasting models to be applied. Forecasting must also consider the need to manage constantly changing information. In [10], the authors highlight that, with the smart grid, demand response programs can make the grid more cost efficient and resilient.
The authors in [11] remark that there are important challenges in demand forecasting due to the uncertainties in the generation profile of distributed and renewable energy generation resources. In fact, increasing attention is being paid to load forecasting models, especially dealing with renewable energy sources (solar radiation, wind, etc.) [9].
The distributed generation paradigm facilitates the use of renewable energy sources that can be placed near consumption points. When using this paradigm, smart grids have multiple small plants that supply energy to their surroundings. Consequently, the dependence on the distribution and transmission grid is smaller [9]. However, this paradigm makes grid control even more uncertain, especially when the distributed generation sources are renewable and consequently have a random nature. Despite this difficulty, the share in energy production of variable renewable energy sources is expected to increase in the coming years [12].
Another key element is the microgrid (see definition in Table 1) [13,14]. Based on this concept, and taking into consideration the intelligence deployed in buildings, new concepts have emerged, including smart homes (see definition in Table 1) and smart buildings (see definition in Table 1). Buildings today are complex combinations of structures, systems, and technology. Technology is a great ally in optimizing resources and improving safety. Advances in building technologies are combining networked sensors and data recording in innovative ways [15]. Modern facilities can adjust heating, cooling, and lighting to maximize energy efficiency, while also providing detailed reports of energy consumption. In these new smart environments (see definition in Table 1), sensors and smart devices are deployed to obtain enough information about the users' energy consumption patterns. Once again, this requires forecasting models that must be applied to the specific variables of the scenario to be controlled.
Forecasting models make it possible to consider variables (climatic, social, economic, habit-related, etc.) that can influence the accuracy of forecasts [9]. These authors remark that energy demand estimates in disaggregated scenarios, such as residential users in smart buildings, are more complex than energy demand estimates for an aggregated scenario, such as a country. Disaggregating the demand also facilitates the implementation of demand response, as different prices can be offered based on the criteria set by the power company.
The gradual integration of intelligence at the transmission, distribution and end-user levels of the electricity system aims to optimize energy production and distribution to adjust producers' supply to consumers' demand. Moreover, smart grids seek to improve fault detection algorithms [16]. Accurate demand forecasts are very useful for energy suppliers and other stakeholders in the energy market [17]. In fact, load forecasting has been one of the main problems faced by the power industry since the introduction of electric power [18].

Important Factors in Demand Forecasting
Electricity demand is affected by different variables or determinants. These variables include forecasting horizons, the level of load aggregation, weather conditions (humidity, temperature, wind speed, and cloudiness), socio-economic factors (industrial development, population growth, cost of electricity, etc.), customer type (residential, commercial, and industrial), and customer factors in relation to electricity consumption (characteristics of the consumer's electrical equipment) (e.g., [19][20][21][22][23]).
To fully understand demand forecasting techniques and objectives, it is necessary to examine these determinants. In this section, the authors will focus on (1) period, (2) economic issues, (3) weather conditions, and (4) customer-related factors.

Period or Forecasting Horizon
The period, commonly referred to as the forecasting horizon, is probably one of the factors with the greatest impact.
According to different authors (e.g., [17,24]), demand forecasting can be classified into three categories with respect to the forecasting horizon:
• Short-term (typically one hour to one week).
• Medium-term (typically one week to one year).
• Long-term (typically more than one year).
Factors affecting short-term demand forecasting usually do not last long, such as sudden changes of weather [22]. The quality of short-term demand forecasting is critical for electricity market players [20]. On the other hand, the influencing factors of medium-term demand forecasting often have a certain time duration, such as seasonal weather changes. Finally, the factors influencing long-term demand forecasting last for a long time, typically several forecast periods, e.g., changes in Gross Domestic Product (GDP) [22]. Indeed, economic factors have an important impact on long-term demand forecasting, but also on medium and short-term forecasting [25].
The authors of [26] identify the following categories in relation to the forecasting horizon:
• Very short-term (typically seconds or minutes to several hours).
• Short-term (typically hours to weeks).
• Medium-term and long-term (typically months to years).
According to these authors, very short-term demand forecasting models are generally used to control the flow, short-term models are commonly used to match supply and demand, and medium-term and long-term models are typically used to plan asset utilities.
The authors in [27] showed that the load curve of grid stations is periodic, not only in the daily load curve, but also in the weekly, monthly, seasonal, and annual load curves. This periodicity makes it possible to forecast the load quite effectively.
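The periodicity described above can be illustrated with a very simple baseline. The following is a minimal sketch, not the method of [27]: it assumes that the next cycle can be approximated by averaging the same position in the last few complete cycles. The function name, the `period` and `cycles` parameters, and the data are all hypothetical.

```python
from statistics import mean


def seasonal_profile_forecast(load, period=24, cycles=3):
    """Forecast the next `period` values by averaging the value found at
    the same position in each of the last `cycles` complete cycles."""
    needed = period * cycles
    if len(load) < needed:
        raise ValueError("not enough history for the requested number of cycles")
    recent = load[-needed:]
    return [
        mean(recent[c * period + h] for c in range(cycles))
        for h in range(period)
    ]


# Synthetic history (assumption): a repeating 4-step "day" with a slight drift.
history = [10, 20, 30, 20,
           12, 22, 32, 22,
           14, 24, 34, 24]
forecast = seasonal_profile_forecast(history, period=4, cycles=3)
print(forecast)
```

With hourly data, `period=24` would capture the daily cycle and `period=168` the weekly one. A baseline of this kind ignores weather and socio-economic drivers, so it is mainly useful as a reference point for more elaborate models.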
Demand also reflects the daily lifestyle of the consumer [28]. Consumers' daily demand patterns are based on their daily activities, including work, leisure and sleep hours. In addition, there are other patterns of demand variation over time. For example, during holidays and weekends, demand in industries and offices is significantly lower than during weekdays due to a drastic decrease in activity. Finally, power demand also varies cyclically depending on the time of the year, day of the week, and time of day [22].

Socio-Economic Factors
Socio-economic factors, including industrial development, GDP, and the cost of electricity, also significantly affect the evolution of demand. Indeed, as mentioned in the previous section, economic factors considerably affect long-term demand forecasts, and also have an important impact on medium- and short-term forecasts.
For example, industrial development will undoubtedly increase energy consumption. The same will be true for population growth. This means that there is a positive correlation between industrial development, or population growth, and energy consumption.
GDP is an indicator that captures a country's economic output. Countries with a higher GDP generate a greater quantity of goods and services and will consequently have a higher standard of living and lifestyle habits, which will stimulate energy demand.
Another economic factor to consider is cost, as it also affects demand. For example, when the price of electricity decreases, wasteful electricity consumption tends to increase [22].
The cost of electricity depends on different factors and is shaped in different ways. For example, in some countries such as Spain, there are two markets (regulated and free) for electricity. In the free market, the cost of electricity is established in the contract signed by the consumer. In contrast, in the regulated market, the price of electricity depends on supply and demand. The price is updated hourly and fluctuates. From the demand side, the more electricity is demanded, the more expensive it is, and the less electricity is demanded, the cheaper it is. Normally, it is cheap to use electricity at dawn and expensive to use it when everyone else is using it (e.g., at dinner time).
However, it is not only demand that influences prices, but also the supply of energy. The reason is that variations in the price of electricity on the regulated market are caused by differences between demand and supply. Consequently, supply must consider the different ways of generating electricity, which have different costs. The cheapest is electricity generated from renewable sources such as solar, wind and hydroelectric. The price of nuclear energy is also low; however, in many countries (e.g., Spain), nuclear energy does not cover all energy needs. Thermal (coal), cogeneration, or combined cycle generation (whose main fuel is gas) tends to be more expensive. It is also important to remember that the main sources of renewable energy, such as hydroelectric or wind, depend on uncontrollable external factors. For example, sufficient rainfall is essential to produce hydroelectric power, yet there is no way to control the weather to make it favorable for producing electrical energy. Given the above, the price is determined by a mix of different sources of power generation, taken from cheapest to most expensive, until the entire energy demand is met.
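The idea of stacking generation sources from cheapest to most expensive until demand is met can be sketched as follows. This is an illustrative simplification, not a description of any real market: it assumes that the last source needed to cover demand sets the price (marginal pricing), and all capacities and prices are invented.

```python
def clear_market(offers, demand_mwh):
    """Dispatch offers from cheapest to most expensive until demand is met.

    offers: list of (source, capacity_mwh, price_per_mwh) tuples.
    Returns (dispatch, clearing_price), where the clearing price is the
    price of the most expensive source that had to be used (assumption).
    """
    dispatch = []
    remaining = demand_mwh
    clearing_price = None
    for source, capacity, price in sorted(offers, key=lambda o: o[2]):
        if remaining <= 0:
            break
        used = min(capacity, remaining)
        dispatch.append((source, used))
        remaining -= used
        clearing_price = price
    if remaining > 0:
        raise ValueError("offers cannot cover the requested demand")
    return dispatch, clearing_price


# Hypothetical offers: (source, capacity in MWh, price per MWh).
offers = [
    ("combined_cycle", 500, 90.0),
    ("wind", 300, 5.0),
    ("nuclear", 400, 20.0),
]
low_demand = clear_market(offers, 600)
high_demand = clear_market(offers, 900)
print(low_demand)
print(high_demand)
```

In this toy example, the lower demand level is covered by wind and nuclear alone, whereas the higher one requires combined cycle generation, which raises the price. This mirrors the effect described above: more demand, or less renewable output, pushes the price upwards.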

Weather Condition
There are different weather variables relevant for demand forecasting, such as temperature, humidity, and wind speed. The influence of weather conditions on demand forecasting has attracted the interest of many researchers. As an example, the authors in [29] proposed different models to forecast the next day's aggregated load using Artificial Neural Networks (ANNs), considering the most relevant weather variables (more specifically, mean temperature, relative humidity, and aggregate solar radiation) to analyze the influence of weather.
Some authors have studied the relationship between temperature and electricity consumption and claim that the correlation between temperature and the electricity load curve is positive, especially in summer (e.g., [25]).
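As a small illustration of how such a relationship can be quantified, the sketch below computes Pearson's correlation coefficient between temperature and load. The data are synthetic values chosen only for the example and are not taken from [25].

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Synthetic summer afternoons (assumption): temperature in °C, load in MW.
temperature = [24, 27, 30, 33, 36]
load = [410, 450, 500, 560, 630]
r = pearson(temperature, load)
print(round(r, 3))
```

A coefficient close to +1 indicates that load rises with temperature, as reported for summer periods. In winter, or in regions with electric heating, the sign of the relationship may differ, so the coefficient is best computed per season.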
Currently, heat waves have become more common around the world, as well as the possibility of extreme temperatures. In addition, heat waves are not only more frequent, but also more intense and longer lasting. Moreover, the nights are getting warmer, which is an added problem. The main effect of a heat wave is an increase in energy consumption as the consumer turns on the air conditioning more and for longer periods of time. Additionally, cooling systems must work harder as they must cope with higher temperatures.
During the summer, heat waves force the grid to be at maximum capacity. In fact, one of the ways in which a heat wave affects consumption is through the increased saturation of the electrical grid. While cold waves are counteracted with electricity, gas, wood, etc., heat waves can only be fought with electricity. In other words, the devices that consumers use for cooling are mainly powered by electricity. For this reason, heat waves generate more stress on power lines, as well as higher consumption.
It should be noted that, in colder countries, the increase in consumption during a heat wave is usually lower. This is because the installation of air conditioning systems is not as common as in warmer countries. However, these colder countries are facing heat waves that did not occur in previous years (before climate change), and this is causing them all types of problems, as they are less prepared. This situation is forcing these countries to make changes such as increasing the use of cooling systems.
On the other hand, the perceived harshness of temperature increases with humidity, especially during the rainy season and summer. For this reason, electricity consumption increases during humid summer days. It is also important to note that in coastal areas, such as the Mediterranean area in Spain, electricity consumption tends to be higher. This is both because houses tend to have more electrical equipment than in other areas, and because of the high degree of humidity due to the proximity of the sea.
Wind speed also affects electricity consumption. When it is windy, the human body feels that the temperature is much lower and more heating is needed, which increases electricity consumption. However, it should also be noted that wind energy is one of the main renewable energies. In other words, when there is wind, electricity consumption increases, but at the same time its price decreases. This is because, as explained in the previous section, the price of the electricity is usually determined as a mix of the different energy sources, from cheapest energies (renewables, including wind, and nuclear) to the most expensive generation sources (thermal, combined cycle).
Temperature, humidity, and wind affect the use of electricity. Humidity and temperature are also the main weather variables used in electricity demand prediction systems to minimize operating costs. However, other factors, such as clouds, also play a role. For example, during the day, when clouds disrupt sunlight there is usually a drop in temperature and, consequently, higher electricity consumption.

Customer Factors
The type of customer (residential, commercial, and industrial), as well as other customer factors related to electricity consumption (characteristics of the consumer's electrical equipment) can also affect demand. This is important because most energy companies have different types of customers (residential, commercial, and industrial consumers), who have equipment that varies in type and size. These different types of customers have different load curves, although there are some similarities between commercial and industrial customers. Table 2 summarizes the main determinants affecting electricity demand described in this section.

| Determinant | Description |
| --- | --- |
| Forecasting horizon | The time horizon for which demand forecasts are prepared. |
| Socio-economic factors | Industrial development, population growth, cost of electricity, and any other socio-economic factors that may influence end-users' demand. |
| Weather conditions | Temperature, humidity, wind speed, and any other weather conditions that may influence end-users' demand. |
| Customer factors | Type of customer (residential, commercial, and industrial), characteristics of the consumer's equipment, and any other customer factors that may affect the end-users' demand. |

Classification of Demand Forecasting Techniques
This section classifies demand forecasting models according to three different criteria: (1) period, (2) forecasting objective, and (3) type of model used.
The first classification focuses on the point of view of the period to be forecasted, i.e., the forecasting horizon. To select this criterion, the electricity demand determinants presented in the previous section have been considered. The second classification focuses on the point of view of the forecasting objective, differentiating between forecasting techniques that produce a single value and those that produce multiple values. Finally, the third classification focuses on the point of view of the model used.

Classification of Demand Forecasting Techniques according to the Forecasting Horizon
As explained in the previous section, the main forecasting horizons that can be identified are the following:
• Very short-term: typically from seconds or minutes to several hours.
• Short-term: typically from hours to weeks.
• Medium-term: typically from a week to a year.
• Long-term: typically more than a year.
The main difference is the scope of the variables used in each case. Very short-term forecasting models use recent inputs (typically minutes or hours), short-term forecasting models use inputs typically in the range of days, and medium and long-term forecasting models use inputs typically in the range of weeks or even months.
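The horizon categories listed above can be turned into a small helper. Since the ranges given in the literature overlap, the cut-off values used below (one hour, one week, and one year) are assumptions made only for this sketch, and the function name is hypothetical.

```python
from datetime import timedelta


def classify_horizon(horizon: timedelta) -> str:
    """Map a forecasting horizon to one of the four categories.

    The boundaries are illustrative assumptions, not normative values.
    """
    if horizon <= timedelta(0):
        raise ValueError("the horizon must be positive")
    if horizon < timedelta(hours=1):
        return "very short-term"
    if horizon <= timedelta(weeks=1):
        return "short-term"
    if horizon <= timedelta(days=365):
        return "medium-term"
    return "long-term"


for example in (timedelta(minutes=15), timedelta(hours=24),
                timedelta(days=90), timedelta(days=730)):
    print(example, "->", classify_horizon(example))
```

Under these assumed boundaries, a forecast of the load profile (the next 24 h) falls in the short-term category.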
Power companies are particularly interested in producing accurate forecasts for the load profile (e.g., [9,30,31]). This is because it can directly affect the optimal scheduling of power generation units. However, due to the non-linear and stochastic behavior of consumers, the load profile is complex, and although research has been done in this area, accurate forecasting models are still needed [32].

Classification of Demand Forecasting Techniques by Forecasting Objective
Forecasting models can also be classified according to the number of values to be forecasted. In this case, two main categories can be considered.
The first category refers to forecasting techniques that produce only one value (e.g., next day's total load, next day's peak load, next hour's load, etc.). Examples are found in [33,34].
The second category refers to forecasting techniques that produce multiple values, e.g., the next hours' peak load plus another parameter (e.g., the aggregate load) or the load profile. Examples are found in [35][36][37]. Generally speaking, one-value forecasts are useful for optimizing the performance of load flows. On the other hand, multiple-value forecasts are mainly used for energy generation scheduling [9].
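The difference between the two categories can be made concrete by looking at the forecasting targets. In the sketch below, a multiple-value target (a load profile) is reduced to typical one-value targets (total load and peak load). The function name and the profile values are hypothetical.

```python
def single_value_targets(load_profile):
    """Derive one-value targets from a multiple-value load profile."""
    # Index of the largest value in the profile (first one if tied).
    peak_index = max(range(len(load_profile)), key=load_profile.__getitem__)
    return {
        "total_load": sum(load_profile),
        "peak_load": load_profile[peak_index],
        "peak_index": peak_index,
    }


# Hypothetical profile with one value per 6-hour block (arbitrary units).
profile = [3, 5, 9, 4]
targets = single_value_targets(profile)
print(targets)
```

A model of the first category would be trained to predict one of these scalars directly, whereas a model of the second category would predict the whole profile, from which the scalars can be derived afterwards.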

Classification of Demand Forecasting Techniques according to the Model Used
The model to be used is usually decided by the practitioner. In terms of models, the main groups are linear and non-linear approaches.
Linear techniques have progressively lost importance and interest in favor of non-linear techniques based on ANNs. Deep Learning models use ANNs, inspired by the human nervous system. These models can learn patterns from the data generated and forecast peak demand in the context of today's complex smart scenarios, where a large amount of data is continuously generated from different sources [7]. Table 3 summarizes the criteria commonly used to classify demand forecasting models.

Table 3. Main criteria commonly used to classify demand forecasting models.

| Criterion | Description |
| --- | --- |
| Forecasting horizon | The time horizon for which electricity demand forecasts are prepared. The main forecasting horizons are: very short-term (from seconds or minutes to several hours); short-term (from hours to weeks); medium-term (from a week to one year); long-term (more than one year). |
| Aim of prediction | The number of values to be forecasted, mainly one value (e.g., next day's total load, next day's peak load, next hour's load, etc.) or multiple values (e.g., the load profile). |

Fundamentals and Concepts of Machine Learning and Deep Learning Systems
Artificial Intelligence is a complex concept that, in a nutshell, refers to machine intelligence [38]. Unlike humans, Artificial Intelligence can identify patterns within a large amount of data using a quite limited amount of time and resources. Furthermore, the computational capacity of machines does not decrease with time and/or fatigue [39].
Artificial Intelligence systems use different types of learning methods, such as Machine Learning and Deep Learning.

Machine Learning
Machine Learning algorithms are pre-trained to produce an outcome when confronted with a never-before-seen dataset or situation [40]. However, the computer needs more examples to learn than humans do [41]. Machine Learning allows the introduction of intelligent decision-making in many areas and applications where developing algorithms would be complex and excellent results are needed [42].
There are different categories of Machine Learning algorithms including supervised, semisupervised, unsupervised, and reinforcement learning. These different categories of algorithms are briefly described below.

Supervised Learning
After being trained with a set of labelled data examples, these algorithms can predict label values when the input has unlabeled data. The problems typically associated with this type of learning are (1) regression and (2) classification [43].
In regression the algorithm focuses on understanding the relationship between dependent and independent variables. In classification, the algorithm is used to predict the class label of the data. Common classification problems include (1) binary classification, between two class labels; (2) multi-class classification, between more than two class labels; and (3) multi-label classification where one piece of data is associated with several classes or labels, as opposed to traditional classification problems with mutually exclusive class labels [44].
Some interesting practical applications are text classification, predicting the sentiment of a text (such as a Tweet or other social media), assessing the environmental sustainability of clothing products [47], characterizing, predicting, and treating mental disorders [48], and estimating peak energy demand.

Unsupervised Learning
This type of learning uses unlabeled data. In this case, the system explores the unlabeled data to find hidden structures, rather than predicting the correct output. This type of learning is not directly applicable to regression or classification problems, as the possible values of the output are unknown [49]. Instead, it is often used for (1) clustering, (2) association, and (3) dimensionality reduction [43].
Clustering allows unlabeled data to be grouped based on their similarities or differences [49,50]. Association uses different rules to identify new and relevant insights between the objects of a set. Finally, dimensionality reduction reduces the number of features (or dimensions) of a dataset to eliminate irrelevant or less important features and thus reduce the complexity of the model [44]. This reduction in the number of features can be done by keeping a subset of the original features (feature selection) or by creating completely new features (feature extraction).
The most popular clustering algorithm is probably K-means clustering, where the value k represents the number of clusters [44,45,51]. Association algorithms include Apriori, Equivalence Class Transformation (ECLAT), and Frequent Pattern (F-P) Growth algorithms. Finally, dimensionality reduction typically uses the Chi-squared test, the Analysis of Variance (ANOVA) test, Pearson's correlation coefficient, and Recursive Feature Elimination (RFE) for feature selection, and Principal Components Analysis (PCA) for feature extraction.
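To make the K-means procedure more tangible, the following is a minimal sketch that alternates between assigning points to their nearest centroid and recomputing the centroids. To keep the example deterministic, the initial centroids are passed in explicitly, whereas practical implementations usually choose them at random or with a seeding heuristic. The customer data are invented.

```python
from math import dist


def k_means(points, centroids, iterations=10):
    """Plain K-means; k is the number of initial centroids supplied."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster where it was
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


# Hypothetical customers: (average daily consumption, peak demand), scaled.
customers = [(1.0, 1.2), (1.1, 0.9), (0.9, 1.0),
             (5.0, 5.2), (5.1, 4.8), (4.9, 5.0)]
centers, labels = k_means(customers, centroids=[customers[0], customers[3]])
print(labels)
print(centers)
```

Here k = 2 because two initial centroids are supplied. In a demand forecasting setting, clusters of this kind could be used to group customers with similar load curves before fitting a separate forecasting model per group.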
According to [46], the most commonly used unsupervised learners are K-means, hierarchical clustering, and PCA.
These unsupervised learners can have many practical applications, such as facial recognition, customer classification, patient classification, detecting cyber-attacks or intrusions [52], and data analysis in the astronomical field [53].

Semisupervised Learning
Conceptually situated between supervised and unsupervised learning, this type of learning makes it possible to take advantage of the large unlabeled datasets that are available in some cases, combined with (usually smaller) amounts of labelled data [54,55]. This opens up interesting possibilities, as labelled data are often scarce while unlabeled data are more frequent, and a semisupervised learner can obtain better predictions than those produced using only labelled data [44].
Candidate applications are those where there is only a small set of labelled examples and many more unlabeled ones, or where the labelling effort is too high. An example is medical imaging, where a small amount of training data can provide a large improvement in accuracy [43,56]. Table 4 compares Supervised and Unsupervised learning, focusing on the type of input data used in each case (labeled versus unlabeled data), and the main tasks for which both types of learning are used (classification and regression versus clustering, association, and dimensionality reduction).

Reinforcement Learning
This learning technique depends on the relationship between an agent performing an activity and its environment, which provides positive or negative feedback [57,58]. The agent must choose actions that maximize the reward in that environment. Popular methods include Monte Carlo, Q-learning, and Deep Q-learning [44].
Common applications include strategy games such as chess, autonomous driving, supply chain logistics and manufacturing, genetic algorithms [57], 5G mobility management [59], and personalized care delivery [60].
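The interaction between an agent and its environment can be illustrated with tabular Q-learning on a toy problem. The sketch below is only an assumption-laden example: the five-state corridor, the reward of 1 at the right end, the purely random behaviour policy, and the hyperparameters are all invented for illustration.

```python
import random

N_STATES = 5      # states 0..4; state 4 is the terminal goal state
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Toy environment: returns (next_state, reward, done)."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Q-learning is off-policy, so the agent may explore at random.
            action = rng.choice(ACTIONS)
            nxt, reward, done = step(state, action)
            # Bootstrapped target: reward plus discounted best next value.
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q


q_table = q_learning()
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

After training, the greedy policy read from the table selects the action with the highest estimated value in each state. Deep Q-learning follows the same update rule but replaces the table with a neural network.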

Deep Learning
Machine Learning can be classified into shallow and deep, considering the complexity and structure of the algorithm [41]. Deep Learning uses multiple layers of neurons composed of complex structures to model high-level data abstractions [61]. The type of output and the characteristics of the data determine the algorithm to be used for a particular use case [62].
Deep Learning uses ANNs inspired by the human nervous system [63]. This type of network typically has an input layer and an output layer of nodes, connected to each other by one or more layers of hidden nodes. Possible deep ANN architectures include Multilayer Perceptron (MLP), Long Short-Term Memory Recurrent Neural Network (LSTM-RNN), Generative Adversarial Network (GAN), and Convolutional Neural Network (CNN or ConvNet).
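The layered structure described above can be sketched as the forward pass of a small Multilayer Perceptron. This is an illustration only: the layer sizes, weights, input features, and the choice of a sigmoid activation are assumptions, and no training step is shown.

```python
from math import exp


def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))


def identity(x):
    return x


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W · inputs + b)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def mlp_forward(inputs, layers):
    """Propagate the inputs through each (weights, biases, activation) layer."""
    out = inputs
    for weights, biases, activation in layers:
        out = dense(out, weights, biases, activation)
    return out


# Hypothetical network: 2 inputs -> 3 hidden nodes -> 1 output node.
layers = [
    ([[0.4, -0.2], [0.1, 0.7], [-0.5, 0.3]], [0.0, 0.1, -0.1], sigmoid),
    ([[0.6, -0.4, 0.9]], [0.05], identity),
]
# Inputs could be, for instance, scaled temperature and scaled hour of day.
prediction = mlp_forward([0.8, 0.25], layers)
print(prediction)
```

Adding more hidden layers to the `layers` list makes the network deeper without changing the forward-pass logic.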
According to our literature review, the most widely used models in the energy domain are Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM), Deep Q-Networks (DQNs) and Conditional Restricted Boltzmann Machine (CRBM) and a variation of any of them, a combination of two or more of them, or the combination of any of them with other techniques. These models are briefly described below.

Convolutional Neural Networks
Like ordinary neural networks, these networks are biologically inspired. However, in this type of network the inputs are assumed to have a specific structure, such as that of images [64]. CNNs are among the most widely used and effective Deep Learning models, and they usually include two types of layers: convolution and pooling layers. A typical CNN architecture consists of an input layer, a convolutional layer, a max pooling layer, and a final fully connected layer, as shown in Figure 1 [65].
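The sequence of a convolutional layer, a max pooling layer, and a fully connected layer can be sketched on a one-dimensional load series, which is the typical input in demand forecasting. The kernel, the ReLU activation, the weights, and the load values below are illustrative assumptions, not the configuration of any reviewed model.

```python
def conv1d(signal, kernel, bias=0.0):
    """Valid 1D convolution (implemented as cross-correlation, as is usual
    in Deep Learning libraries) followed by a ReLU activation."""
    k = len(kernel)
    out = []
    for start in range(len(signal) - k + 1):
        z = sum(w * signal[start + i] for i, w in enumerate(kernel)) + bias
        out.append(max(0.0, z))
    return out

def max_pool1d(features, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(features[i:i + size])
            for i in range(0, len(features) - size + 1, size)]

def fully_connected(features, weights, bias):
    """Single linear output node."""
    return sum(w * f for w, f in zip(weights, features)) + bias

load = [1.0, 2.0, 4.0, 3.0, 1.0, 0.0, 2.0]           # hypothetical hourly load
feature_map = conv1d(load, kernel=[-1.0, 0.0, 1.0])  # responds to rising load
pooled = max_pool1d(feature_map, size=2)
forecast = fully_connected(pooled, weights=[0.5, 0.5], bias=1.0)
```

In this toy example the kernel reacts only to increases in load, and pooling keeps the strongest response in each window before the final layer combines them.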
The total input to the jth feature map of layer l at position (x,y), the convolutional layer output, and the pooling layer output can be expressed as in [66].

Recurrent Neural Networks
In this type of network, the connections between nodes form a directed or undirected graph along a time sequence. Figure 2 shows a typical RNN structure [65]. This network can use a gating mechanism called the Gated Recurrent Unit (GRU), introduced in 2014 by the authors in [67]. GRUs are like LSTM networks with a forget gate, but they have fewer parameters because they lack an output gate.
Another variation of this type of network, proposed by Elman [68], is the Elman RNN that includes modifiable feedforward connections and fixed recurrent connections. It uses a set of context nodes to store internal states, which gives it certain unique dynamic characteristics over static ones [69].
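The role of the context nodes in an Elman RNN can be sketched as follows. The context units hold an unweighted copy of the previous hidden state (the fixed recurrent connection), while the input-to-hidden and context-to-hidden weights are the modifiable ones. The tanh activation, the weights, and the function names are assumptions made for illustration.

```python
import math

def elman_step(x, context, w_in, w_ctx, b):
    """One time step: hidden_j = tanh(b_j + sum_i w_in[j][i] x_i
    + sum_k w_ctx[j][k] context_k)."""
    hidden = []
    for j in range(len(b)):
        z = b[j]
        z += sum(w_in[j][i] * x[i] for i in range(len(x)))
        z += sum(w_ctx[j][k] * context[k] for k in range(len(context)))
        hidden.append(math.tanh(z))
    return hidden

def run_elman(sequence, w_in, w_ctx, b):
    """Process a sequence; the context starts at zero and is overwritten
    with a copy of the hidden state after every step."""
    context = [0.0] * len(b)
    states = []
    for x in sequence:
        context = elman_step(x, context, w_in, w_ctx, b)
        states.append(context)
    return states

# Hypothetical single-unit network fed one impulse followed by zeros.
states = run_elman([[1.0], [0.0], [0.0]],
                   w_in=[[1.0]], w_ctx=[[0.5]], b=[0.0])
```

The hidden state stays non-zero after the input has returned to zero, which is the internal memory that distinguishes this network from a static feedforward one.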

Long Short-Term Memory
These networks are a special kind of RNN. Unlike standard feedforward neural networks, these networks have feedback connections, and can even process entire sequences of data (such as speech or video), in addition to individual data points (such as images).
This type of RNN contains an input layer, a recurrent hidden layer, and an output layer, with a memory block structure as shown in Figure 3 [70]. The LSTM memory block is described by the equations given in [70], where $i_t$, $f_t$, and $o_t$ are, respectively, the activations of the input, forget, and output gates at time t; $c_t$ is the state of the memory cell at time t; $h_t$ is the output of the memory block at time t; $\odot$ represents the element-wise product of two vectors; $\sigma(x)$ is the gate activation function; $g(x)$ is the cell input activation function; and $h(x)$ is the cell output activation function.
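As a hedged illustration, one time step of an LSTM memory block can be sketched as follows. The sketch assumes the widely used formulation without peephole connections, with the logistic sigmoid as the gate activation σ and tanh as both the cell input activation g and the cell output activation h; the exact equations in [70] may differ. The `params` layout and the numeric values are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(w_x, w_h, b, x, h_prev):
    """Return W_x x + W_h h_prev + b for every unit of the block."""
    return [sum(wx * xi for wx, xi in zip(w_x[j], x))
            + sum(wh * hi for wh, hi in zip(w_h[j], h_prev))
            + b[j]
            for j in range(len(b))]

def lstm_step(x, h_prev, c_prev, params):
    """`params` maps 'i', 'f', 'o', 'g' to (W_x, W_h, b) tuples (assumed layout)."""
    i_t = [sigmoid(z) for z in affine(*params["i"], x, h_prev)]    # input gate
    f_t = [sigmoid(z) for z in affine(*params["f"], x, h_prev)]    # forget gate
    o_t = [sigmoid(z) for z in affine(*params["o"], x, h_prev)]    # output gate
    g_t = [math.tanh(z) for z in affine(*params["g"], x, h_prev)]  # cell input
    # Cell state: keep part of the old state, add part of the new candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    # Block output: gated, squashed cell state.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Hypothetical one-unit block with all weights and biases set to zero.
zero = ([[0.0]], [[0.0]], [0.0])
params = {name: zero for name in "ifog"}
h_t, c_t = lstm_step(x=[1.0], h_prev=[0.0], c_prev=[2.0], params=params)
```

With zero weights every gate equals 0.5 and the candidate is 0, so the cell state is simply halved; this makes the gating arithmetic easy to follow by hand.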
A possible extension of this model is the Bidirectional LSTM (B-LSTM). The aim of this type of LSTM network is to analyze sequences from both front-to-back and back-to-front, i.e., the sequence information flows in both directions backwards and forwards, unlike in a normal LSTM.

Deep Q Network and Dueling Deep-Q Network
Deep Q Network (DQN) and Dueling Deep-Q Network (DDQN) are types of ANN that use the Deep Q-learning algorithm, which is popular in reinforcement learning. In a dueling network there are two streams that separately estimate the state value and the advantage of each action. The main objective of a Deep Q Network is to choose the best action in a given state. Considering that π is the policy followed by an agent in a given environment, the function $Q^{\pi}$ can be defined as follows [71]:

$$Q^{\pi}(s,a) = \mathbb{E}\left[\sum_{i \ge 0} \gamma^{i} r_{i} \,\middle|\, s_0 = s,\ a_0 = a,\ \pi\right]$$

where s is a state; a is an action; $r_i$ is the potential reward at step i; and $\gamma \in [0,1]$ is a discount factor that makes the immediate reward more important than future ones. Therefore, the objective of Q-learning is to obtain the optimal value function $Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a)$. Figure 4 shows the scheme of a typical DQN architecture [71].
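Two ingredients of this description can be sketched numerically: the discounted sum of rewards that Q estimates, and the way a dueling network combines its state-value stream with its advantage stream. The aggregation used here, which subtracts the mean advantage, is one common choice and is an assumption, since the text does not specify it; the reward and value numbers are also illustrative.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**i * r_i, accumulated from the last reward backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

def dueling_q_values(state_value, advantages):
    """Combine the two streams: Q(s, a) = V(s) + A(s, a) - mean(A(s, .)).
    Subtracting the mean is an assumed, commonly used aggregation."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]

def greedy_action(q_values):
    """Index of the action with the highest estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
q = dueling_q_values(state_value=2.0, advantages=[1.0, 3.0, -1.0])
best = greedy_action(q)
```

With a discount factor of 0.5, the third reward contributes only a quarter of its value, which shows how γ favours immediate rewards.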

Conditional Restricted Boltzmann Machine
In a Restricted Boltzmann Machine (RBM), nodes from the two groups (visible and hidden) can have a symmetric connection between them, but there are no connections between nodes in the same group, which allows for more efficient training. In contrast, unrestricted Boltzmann Machines (BMs) can have connections between hidden units. An RBM consists of m visible units $V = (v_1, \ldots, v_m)$ representing observable data and n hidden units $H = (h_1, \ldots, h_n)$ capturing dependencies between observable variables, with the conditional layer units $F = (f_1, \ldots, f_p)$, as shown in Figure 5 [72].
The energy function of the CRBM is [73]:

$$E(V,H,F) = -\sum_{i=1}^{m}\sum_{j=1}^{H}\sum_{k=1}^{K} W_{ij}^{k} h_j v_i^{k} - \sum_{i=1}^{m}\sum_{k=1}^{K} v_i^{k} b_i^{k} - \sum_{j=1}^{H} h_j b_j - \sum_{q=1}^{F}\sum_{j=1}^{H} D_{qj} f_q h_j$$

where m represents the number of items the user rated; H is the number of hidden layer units; F is the number of conditional layer units; K is the highest rating; $v_i^k$ is the binary value of visible layer unit i and rating k; $h_j$ is the binary value of hidden unit j; $f_q$ is the binary value of conditional layer unit q; $b_i^k$ is the bias of rating k with visible layer unit i; $b_j$ is the bias of feature j; $W_{ij}^{k}$ is the connection weight between the hidden layer H and the visible layer V; D is the connection weight between the hidden layer H and the conditional layer F; and $D_{qj}$ is the connection weight between hidden feature j and conditional layer unit q.
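A direct evaluation of such an energy function can be sketched as follows. The sketch assumes the usual conditional RBM energy built from the symbols defined above: a visible-hidden weight term, the two bias terms, and a conditional-hidden weight term, all with negative sign. The exact expression in [73] may differ, and the array layout and numeric values are illustrative assumptions.

```python
def crbm_energy(v, h, f, w, d, b_vis, b_hid):
    """Energy of one configuration (assumed standard form).
    v[i][k], h[j], f[q] are binary values; w[i][j][k] and d[q][j] are
    connection weights; b_vis[i][k] and b_hid[j] are biases."""
    energy = 0.0
    for i, v_i in enumerate(v):
        for k, v_ik in enumerate(v_i):
            energy -= v_ik * b_vis[i][k]              # visible bias term
            for j, h_j in enumerate(h):
                energy -= w[i][j][k] * h_j * v_ik     # visible-hidden term
    for j, h_j in enumerate(h):
        energy -= h_j * b_hid[j]                      # hidden bias term
        for q, f_q in enumerate(f):
            energy -= d[q][j] * f_q * h_j             # conditional-hidden term
    return energy

# Hypothetical configuration: one rated item with K = 2 possible ratings,
# one hidden unit and one conditional unit.
energy = crbm_energy(
    v=[[0, 1]], h=[1], f=[1],
    w=[[[0.5, 2.0]]], d=[[0.25]],
    b_vis=[[0.1, 1.0]], b_hid=[0.5],
)
```

Lower energy corresponds to a more probable joint configuration, so training adjusts the weights and biases to lower the energy of observed data.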
In [73], the authors introduced the Factored Conditioned Restricted Boltzmann Machines (FCRBMs) by adding the concept of factored, multiplicative, and tridirectional interactions to predict multiple human movement styles.
Finally, Deep Belief Networks (DBNs) are formed by several RBMs stacked on top of one another [74].
Due to the growing demand for energy from different sectors, supply and demand must be balanced in the electrical grid. In this scenario, smart grids can play an important role by providing a bidirectional flow of energy between consumers and utilities. Unlike traditional electrical grids, smart grids have sophisticated sensing devices that generate data from which energy patterns can be derived. These patterns are extremely useful for load forecasting, peak shaving, and demand response management.
As the amount of data generated by a smart grid is huge and constantly increasing, Deep Learning based models are a good option to understand consumption patterns and make forecasts. Researchers have studied the possibilities of using Deep Learning models, with LSTM networks playing a leading role (e.g., [32,57,80]).
The literature search included terms related to the energy field, more specifically, "energy demand forecasting", "electricity demand forecasting", "load forecasting", "demand response", "demand-side response" and variations of these expressions. The search was limited to the last 6 years. The decision as to which articles were finally included in Table 5 was made by the authors after reviewing the search results and ensuring that the work involved the use of a Deep Learning model for demand or load forecasting purposes.

Escobar et al. [91], 2020
Model: LSTM, CNN, GRU, and hybrid models (CNN-LSTM and CNN-GRU)
Horizon: 3 days
Data: 4 years of hourly data from Madrid, including energy consumption, energy generation, pricing data, and meteorological information: temperature (K), humidity (percentage of water in the air), wind direction in sexagesimal degrees, wind speed in meters per second (m/s), onshore wind and solar energy, and total load, the latter in MW.
Results: Comparative analysis of energy demand, and solar and onshore wind generation forecasting, for LSTM, CNN, GRU, and hybrid models merging CNN with LSTM and GRU, based on MAE, MAPE and RMSE. The combination of the best CNN and GRU models obtained better prediction results.

Qi et al. [95], 2020
Model: CNN + LSTM
Horizon: 1 day
Data: Data from the integrated energy system of an industrial area in China, which is a combined electric, cooling, and heating system.
Results: Experimental results showed that the CNN-LSTM composite forecasting model for short-term demand of individual household customers has a higher prediction accuracy than the CNN and LSTM models.

Horizon: Hourly
Data: Data from the IEEE 33-node extension system (selected as a typical model of a medium voltage distribution system).
Results: DDQN improves the noise and instability in traditional DQN, and reduces operation costs and peak load demand while regulating voltage to the safe limit.

Results: Deep learning model to forecast and fill in missing data on residential buildings energy demand.

Wen et al. [97], 2020
Model: Modified RNN
Horizon: Hourly
Data: Dataport, Pecan Street Inc. Residential buildings, Europe.
Results: Experimental results on residential buildings demand showed that peak demand can be reduced by 17%.

Yang et al. [98], 2020
Model: Multitask Bayesian Neural Network (MT-BNN)
Horizon: Hourly
Data: Two public datasets on smart meters provided by the Irish Commission for Energy Regulation (CER) and the Australian Government's SGSC project, respectively. The CER dataset was collected between July 2009 and December 2010 with the participation of more than 4225 residential customers and 2210 Small and Medium Enterprises (SMEs). The SGSC dataset was collected for about 10,000 customers between 2010 and 2014 in New South Wales. Electricity consumption (kWh) was recorded every half hour at each meter in both datasets.
Results: Experimental results, based on MAE and RMSE, showed that the proposed load forecasting framework for residential demand response provided higher accuracy of individual electricity consumption than other methods such as SVR, Gradient Boosted Regression Trees (GBRT), RF, and Pooling-based Long Short-Term Memory (PLSTM).

Amin et al. [99], 2019
Model: LSTM
Horizon: Several time horizons
Data: Smart meter data collected over 2 years from 114 apartments, along with weather information for the same period.
Results: Comparison of three demand forecasting methods: a piecewise LR model, the univariate seasonal ARIMA model, and a multivariate LSTM model. The results showed that while the LR model could be used for long-term planning, the LSTM model significantly improved the accuracy of short-term (1-day) demand forecasting compared to the ARIMA and LR models.

Yang et al. [114], 2019
Model: GRU
Horizon: Hourly
Data: Real-world smart meter dataset provided by the Irish CER. The study used data from 800 residents and 400 SMEs with a sampling frequency of half an hour from 1 August 2010 to 31 October 2010.
Results: Experimental results for load forecasting, based on two typical probabilistic scoring methods (pinball loss score and Winkler score), showed better performance of the proposed model compared to other techniques such as Quantile Regression Forest (QRF), Quantile Regression Gradient Boosting (QRGB), and Quantile Long Short-Term Memory (QLSTM) Neural Network.

Zahid et al. [115], 2019
Model: CNN
Horizon: Hourly
Data: ISO-NE dataset with 2018 load data.
Results: Improved classifiers were used to forecast load and electricity prices.

Ouyang et al. [116], 2019
Model: DBN
Horizon: Hourly
Data: One year of grid load data collected in an urbanized area in Texas, United States.
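Several of the works collected in Table 5 report MAE, MAPE, and RMSE for point forecasts, and the pinball loss for probabilistic forecasts. The following sketch shows how these metrics are commonly computed; the function names and the sample values are illustrative assumptions, and MAPE is only defined here for non-zero actual values.

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error."""
    return math.sqrt(
        sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual))

def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((y - p) / y)
                       for y, p in zip(actual, predicted)) / len(actual)

def pinball_loss(actual, predicted_quantile, q):
    """Average pinball (quantile) loss for a forecast of quantile q."""
    total = 0.0
    for y, p in zip(actual, predicted_quantile):
        diff = y - p
        # Under-forecasts are weighted by q, over-forecasts by (1 - q).
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(actual)

actual = [100.0, 200.0, 400.0]      # hypothetical hourly load (kWh)
predicted = [110.0, 190.0, 400.0]   # hypothetical forecasts
scores = {
    "MAE": mae(actual, predicted),
    "RMSE": rmse(actual, predicted),
    "MAPE": mape(actual, predicted),
    "pinball_0.9": pinball_loss(actual, predicted, q=0.9),
}
```

Because MAPE is relative to the actual load, it allows comparisons across datasets of different scale, whereas MAE and RMSE remain in the units of the load itself.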

Conclusions
Increasing energy demand puts pressure on the power grid to balance supply and demand. Smart grids can play an important role. In these systems, data related to energy use are regularly collected and analyzed to obtain energy consumption patterns. The usage patterns obtained can be useful for demand and load forecasting. This is a challenging task in the context of smart environments, which is why researchers are putting special efforts into this.
To meet today's demand forecasting challenges, where smart grids generate large amounts of data, it is necessary to use modern data-driven techniques. Deep Learning based models are a good alternative. Traditionally, research has focused on forecasting customers' energy consumption using the small historical datasets available on their behavior. However, current research applying Deep Learning methods has demonstrated better performance than conventional forecasting methods. The use of Deep Learning models involves large amounts of data, such as those provided by the different datasets used by practitioners in the works collected in Table 5. Because smart grids generate such large amounts of data, Big Data is also a key technology to overcome the challenges of renewable energy integration, load fluctuation and sustainable development. With the introduction of renewables into the smart grid, an increasing number of variables are brought into the system and more data need to be processed. This situation is further aggravated by the gradual introduction of electric vehicles, so Big Data technologies are becoming increasingly necessary [124].
The study conducted has revealed that the most widely used Deep Learning models in the energy domain for demand forecasting purposes are CNNs, RNNs, LSTM networks, DQNs, and CRBMs, together with variations of any of them, combinations of two or more of them, and combinations of any of them with other techniques. Notable are CNN and its variations such as Pyramid-CNN [82,85,88,90,91,94,95,101,106,107,109,115,118,119,123], LSTM and its variations such as B-LSTM [80,82,86–88,91,93–95,99,100,103,104,106,107,109–113,118,119,122], and a combination of both [82,88,91,94,95,106,107,109,118,119]. Real testbeds with high-quality data are not common, but they are necessary to determine the performance of Deep Learning models. It is important to continue testing future Deep Learning models, including potential variations and/or combinations of two or more models, for forecasting purposes in the context of smart grids. It is also important that these tests are carried out for different scenarios. Deep Learning models capable of automatically forecasting load for different types of customers, premises/buildings, and different weather conditions are still needed. It is important to test the performance of Deep Learning, but also to determine which model is best for each scenario.
In terms of datasets, practitioners used different options, most notably the PJM electricity market [32,92,102,108,112], SGSC [85,90,98], CER [98,114,120], ISO-NE [105,109,115], Pecan Street Inc. [80,97], UCI [106,107], UKDALE [113] and REDD [21]. Many reviews on demand/load forecasting in the context of smart grids focus on the Deep Learning models used but forget about the data. However, for a Deep Learning implementation to be successful, the data are as valuable as the algorithms. In fact, it would be desirable for researchers to incorporate more information about the data used in their works, addressing for example the training/validation/testing data split, the sampling interval of the data, and the method used for data cleaning. One of the limitations of using Deep Learning models is the lack of high-quality real-world datasets. A future trend would probably be to shift the emphasis from the model to the data. Furthermore, the authors foresee an integration of IoT into Deep Learning models used for demand/load forecasting. IoT is enabling the democratization of sensing. This opens exciting opportunities in terms of high-quality data collection, which is critical in the context of demand/load forecasting. Related to this, another future trend would be the development of integrated systems that include the necessary data acquisition and pre-processing.
Finally, it is also remarkable that in most cases researchers focused on short-term forecasting.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: