Reinforcement Learning: Theory and Applications in HEMS

Abstract: The steep rise in reinforcement learning (RL) in various applications in energy, as well as the penetration of home automation in recent years, are the motivation for this article. It surveys the use of RL in various home energy management system (HEMS) applications, with a focus on deep neural network (DNN) models in RL. The article provides an overview of reinforcement learning. This is followed by discussions on state-of-the-art value-based, policy-based, and actor-critic methods in deep reinforcement learning (DRL). In order to make the published literature in reinforcement learning more accessible to the HEMS community, verbal descriptions are accompanied by explanatory figures as well as mathematical expressions using standard machine learning terminology. Next, a detailed survey of how reinforcement learning is used in different HEMS domains is presented. The survey also considers what kind of reinforcement learning algorithms are used in each HEMS application. It suggests that research in this direction is still in its infancy. Lastly, the article proposes four performance metrics to evaluate RL methods.


Introduction
The largest group of consumers of electricity in the US are residential units. In the year 2020, this sector alone accounted for approximately 40% of all electricity usage [1]. The average daily residential consumption of electricity is 12 kWh per person [2]. Therefore, effectively managing the usage of electricity in homes, while maintaining acceptable comfort levels, is vital to address the global challenges of dwindling natural resources and climate change. Rapid technological advances have now made home energy management systems (HEMS) an attainable goal that is worth pursuing. HEMS consist of automation technologies that can respond to continuously or periodically changing conditions inside the home, as well as relevant external conditions, without human intervention [3,4]. In this review, the term 'home' is taken in a broad context to also include all residential units, classrooms, apartments, office complexes, and other buildings in the smart grid [5][6][7][8].
Artificial Intelligence (AI), more specifically machine learning, is one of the key contributing factors that have helped realize HEMS today [9][10][11]. Reinforcement learning (RL) is a class of machine learning algorithms that is making deep inroads in various applications in HEMS. This learning paradigm incorporates the twin capabilities of learning from experience and learning at higher levels of abstraction. It allows algorithmic agents to replace human beings in the real world, including in homes and buildings, in applications that had hitherto been considered to be beyond today's capabilities.
RL allows an algorithmic entity to make sequences of decisions and implement actions from experience in the same manner as a human being [12][13][14][15][16][17]. Deep neural networks (DNNs) have proven to be a powerful tool in RL, for they endow the RL agent with the capability to adapt to a wide variety of complex real-world applications [18,19]. Moreover, it has been proposed in [20] that RL can attain the ultimate goal of artificial general intelligence [21].
Consequently, RL is making deep inroads into many application domains today. It has been applied extensively to robotics [22]. Specific applications in this area include robotic manipulation with many degrees of freedom [23,24] and the navigation and path planning of mobile robots and UAVs [25][26][27]. RL finds widespread applications in communications and networking [28][29][30]. It has been used in 5G-enabled UAVs with cognitive capabilities [31], cybersecurity [32][33][34], and edge computing [35]. In intelligent transportation systems, RL is used in a range of applications such as vehicle dispatching in online ride-hailing platforms [36].
Other domains where RL has been used include hospital decision making [37], precision agriculture [38], and fluid mechanics [39]. The financial industry is another important sector where RL has been adopted for several scenarios [40][41][42]. It is of little surprise that RL has been extensively used to solve various problems in energy systems [43][44][45][46][47]. Another review article on the use of RL [47] considers three application areas in frequency and voltage control as well as in energy management.
RL is increasingly being used in HEMS applications, and several review papers have already been published. The review article in [48] focuses on RL for HVAC and water heaters. The paper in [49] is based on research published between 1997 and 2019. The survey observes that only 11% of published research reports the deployment of RL in actual HEMS. The article in [50] specifically focuses on occupant comfort in residences and offices. A more recent review on building energy management [51] focuses on deep neural network-based RL. A recent article [52] considers RL along with model predictive control in smart building applications. The article in [53] is a survey of RL in demand response.
In contrast to the previous reviews, the scope of our review is broad enough to cover all areas of HEMS, including HEMS interfacing with the energy grid. More importantly, it provides a comprehensive overview of all major RL methods, with a sufficient level of explanation for readers' understanding. Therefore, this article would be of benefit to researchers and practitioners in other areas of energy systems, and beyond, who wish to acquire a theoretical understanding of basic RL techniques.
The rest of this article is organized in the following manner. Section 2 addresses the various elements of HEMS in greater detail. Section 3 introduces basic ideas on reinforcement learning. Further details of value-based RL and associated deep architectures are discussed in Section 4, while policy-based and actor-critic architectures, the other classes of RL algorithms, are described in Section 5. Sections 6 and 7 discuss the results of the research survey: while Section 6 focuses on the application of RL, Section 7 is a study on the classes of algorithms that were used. The article concludes in Section 8, where the authors propose four metrics to evaluate the performance of RL algorithms in HEMS.

Home Energy Management Systems
HEMS refers to a slew of automation techniques that can respond to continuously or periodically changing conditions inside the home or building, as well as relevant external conditions, without the need for human intervention. This section addresses the enabling technologies that make this an attainable goal.

Networking and Communication
All HEMS devices must have the ability to exchange data with each other using the same communication protocol. HEMS provides the occupants with the tools that allow them to monitor, manage, and control all the activities within the system. The advancements in technologies, and more specifically in IoT-enabled devices and wireless communication protocols such as ZigBee, Wi-Fi, and Z-Wave, have made HEMS feasible [54,55]. These smart devices are connected through a home area network (HAN) and/or to the internet, i.e., a wide area network (WAN).
The choice of communication protocol for home automation is an open question. To a large extent, it depends on the user's personal requirements. If it is desired to automate a smaller set of home appliances with ease of installation and operability in a plug-and-play manner, Wi-Fi is the appropriate one to use. However, with more extensive automation requirements, involving tens through to hundreds of smart devices, Wi-Fi is no longer the optimal choice. There are issues relating to scalability and signal interference in Wi-Fi. More importantly, due to its relatively high energy consumption, Wi-Fi is not appropriate for battery-powered devices.
Under these circumstances, ZigBee and Z-Wave are more appropriate [56]. These communication protocols dominate today's home automation market. There are many common features shared between the two protocols. Both protocols use RF communication and offer two-way communication. Both ZigBee and Z-Wave enjoy well-established commercial relationships with various companies, with tens of hundreds of smart devices using one of these protocols.
Z-Wave is superior to ZigBee in terms of the range of transmission (120 m with three devices as repeaters vs. 60 m with two devices working as repeaters). In terms of interbrand operability, Z-Wave again holds the advantage. However, ZigBee is more competitive in terms of data rate of transmission as well as in the number of connected devices. Z-Wave was specially created for home automation applications, while ZigBee is used in a wider range of places such as industry, research, health care, and home automation [57]. A study conducted by [58] foresees that ZigBee is most likely to be the standard communication protocol for HEMS. However, due to the presence of numerous factors, it is still difficult to tell with high certainty whether this forecast will be borne out. It is also possible that an alternative communication protocol will emerge in the future.
HEMS requires this level of connectivity to be able to access the electricity price from the smart grid through the smart meter and control all the system's elements accordingly (e.g., turn the TV on/off, control the thermostat settings, determine the battery charge/discharge timings, etc.). In some scenarios, HEMS uses the forecasted electricity prices to schedule shiftable loads (e.g., washing machine, dryer, electric vehicle charging) [54].

Sensors and Controller Platforms
HEMS consists of smart appliances with sensors. These IoT-enabled devices communicate with the controller by sending and receiving data. They collect information from the environment and/or about their electricity usage using built-in sensors. The smart meter gathers information regarding the consumer's total consumption from the appliances, the peak load period, and the electricity price from the smart grid.
The controller can be in the form of a physical computer located within the premises and equipped with the ability to run complex algorithms. An alternative approach is to leverage any of the cloud services that are available to consumers through cloud computing firms.
The controller gathers information from the following sources: (i) the energy grid through the smart meter, which includes the power supply status and electricity price, (ii) the status of renewable energy and the energy storage systems, (iii) the electricity usage of each smart device at home, and (iv) the outside environment. Then it processes all the data through a computational algorithm to take a specific action for each device in the whole system separately [5].

Control Algorithms
AI and machine learning methods are making deep inroads into HEMS [10,59]. HEMS algorithms incorporated into the controller might be in the form of simple knowledge-based systems. These approaches embody a set of if-then-else rules, which may be crisp or fuzzy. However, due to their reliance on a fixed set of rules, such methods may not be of much practical use with real-time controllers. Moreover, they cannot effectively leverage the large amount of data available today [5]. Although it is possible to impart a certain degree of trainability to fuzzy systems, the structural bottleneck of consolidating all inputs using only conjunctions (and) and disjunctions (or) still persists.
Numerical optimization constitutes another class of computational methods for the smart home controller. These methods entail an objective function that is to be either minimized (e.g., cost) or maximized (e.g., occupant comfort), as well as a set of constraints imposed by the underlying physical HEMS appliances and limitations. Due to its simplicity, linear programming is a popular choice for this class of algorithms. More recently, game theoretic approaches have emerged as an alternative for various HEMS optimization problems [5].
In recent years, artificial intelligence and machine learning, more specifically deep learning techniques, have become popular for HEMS applications. Deep learning takes advantage of all the available data for training the neural network to predict the output and control the connected devices. It is very helpful in forecasting the weather, load, and electricity price. Furthermore, it handles non-linearities without resorting to explicit mathematical models. Since 2013, there have been significant efforts directed at using deep neural networks within an RL framework [60,61], which have met with much success.

Deep Neural Networks
A deep neural network (DNN) is a trainable, highly nonlinear function approximator of the form $y(\cdot) : \mathbb{R}^M \to \mathbb{R}^N$, where $M$ and $N$ are the dimensionalities of the input and output spaces. Structurally, the DNN consists of an input layer, an output layer, and at least one hidden layer. The input layer receives the DNN input vector $x$. The neurons in any other layer receive, as their inputs, the weighted outputs of neurons in the preceding layer. The weights of the DNN make up its weight parameter, denoted $\theta$. For simplicity, we consider DNNs with scalar outputs so that $y(\cdot) : \mathbb{R}^M \to \mathbb{R}$. The actual output of the DNN is represented as $y(x|\theta)$, which is that of the sole neuron in the output layer.
In a typical regression application, the DNN's training set $S$ consists of pairs $(x(n), t(n)) \in S$, where $n = 1, \ldots, |S|$ is the sample index (for the sake of conciseness, this relationship is often denoted as $n \in S$ in this article). The quantity $t(n)$ is the target, or desired output. During training, $\theta$ is updated in steps so that for each input $x(n)$, the DNN's output $y(n) \equiv y(x(n)|\theta)$ is as close as possible to $t(n)$. Supervised learning algorithms aim to minimize the DNN's loss function $J_S(\theta)$. The subscript $S$ indicates that the loss is an empirical estimate over the samples in $S$. A popular choice of the latter is the averaged squared $L_2$ norm of the difference between the target and output over all samples in $S$,

$$J_S(\theta) = \frac{1}{2|S|} \sum_{n \in S} \left\| t(n) - y(x(n)|\theta) \right\|^2. \tag{1}$$

Training the DNN comprises multiple passes called epochs, with each epoch comprising one pass through all samples in $S$. In stochastic gradient descent (SGD), with $\eta \ll 1$ being the learning rate, the parameter $\theta$ is incremented once for every sample $n \in S$ as,

$$\theta \leftarrow \theta + \eta \left( t(n) - y(x(n)|\theta) \right) \nabla_\theta y(x(n)|\theta). \tag{2}$$

This increment is equivalent to a single gradient step with the single-sample loss,

$$\theta \leftarrow \theta - \eta \nabla_\theta J_{\{n\}}(\theta), \qquad J_{\{n\}}(\theta) = \tfrac{1}{2} \left( t(n) - y(x(n)|\theta) \right)^2.$$

While SGD is useful in many online applications, minibatch gradient descent is the most common training method. In each epoch, $S$ is divided into non-overlapping minibatches $B_k \subset S$ (i.e., $\cup_k B_k = S$, and $k \neq l \Rightarrow B_k \cap B_l = \emptyset$). The parameter $\theta$ is updated once for every minibatch $B$ as,

$$\theta \leftarrow \theta - \eta \nabla_\theta J_B(\theta) = \theta + \frac{\eta}{|B|} \sum_{n \in B} \left( t(n) - y(x(n)|\theta) \right) \nabla_\theta y(x(n)|\theta). \tag{3}$$

One of the advantages of training in minibatches is that the trajectory taken by the training algorithm is straightened out, thereby speeding up convergence. It can be seen that the loss function is $J_B(\theta)$, which is identical to that in Equation (1), with sample set $B$.
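The minibatch procedure described above can be illustrated with a minimal sketch. For brevity, a linear model $y(x|\theta) = wx + b$ stands in for the DNN, which is an assumption made purely for illustration; the function name `minibatch_gd`, the toy data, and the hyperparameter values are likewise hypothetical and are not taken from the surveyed literature.

```python
import random

def minibatch_gd(samples, eta=0.5, batch_size=4, epochs=500, seed=0):
    """Fit y(x | theta) = w * x + b by minibatch gradient descent.

    A linear model stands in for the DNN (an illustrative assumption).
    The loss on each minibatch B is J_B = 1/(2|B|) * sum (t - y)^2.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0                      # theta = (w, b)
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)                # new partition of S in every epoch
        for k in range(0, len(data), batch_size):
            batch = data[k:k + batch_size]
            # Averaged gradient of J_B with respect to w and b.
            grad_w = sum(-(t - (w * x + b)) * x for x, t in batch) / len(batch)
            grad_b = sum(-(t - (w * x + b)) for x, t in batch) / len(batch)
            w -= eta * grad_w            # theta <- theta - eta * grad J_B
            b -= eta * grad_b
    return w, b

# Noise-free toy targets generated by t = 2x + 1 for x in [0, 1].
train = [(i / 10, 2 * (i / 10) + 1) for i in range(11)]
w, b = minibatch_gd(train)
```

Setting `batch_size=1` in this sketch recovers per-sample SGD, while setting it to the size of the training set gives full-batch gradient descent.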
Typically, the loss function includes an additional regularization term designed to keep the weights in $\theta$ low in order to prevent overfitting; overfitting results in poorer performance after the DNN is deployed into the real world. Nowadays, faster training is accomplished by using extensions of gradient descent such as ADAM. These as well as many other important aspects of DNNs and training algorithms have not been addressed here; the above discussion minimally suffices to understand how DNNs are used in reinforcement learning (RL). A brief exposition of DNNs is available in [62]. For a rigorous treatment of DNNs, the interested reader is referred to [63].

Reinforcement Learning
An agent in RL is a learning entity, such as a deep neural network (DNN), that exerts control over a stochastic, external environment by means of a sequence of actions over time. The agent learns to improve the performance of its environment using reward signals that it receives from the environment.
Rewards are quantitative metrics that indicate the immediate performance of the environment (e.g., average instantaneous user comfort). The sets $S$ and $A$ are the state and action spaces and can be discrete or continuous. Everywhere in this article it is assumed that all temporal signals are sampled at discrete, regularly spaced intervals [62]. At each discrete time instance $t$, the current state $s_t \in S$ of the environment is known to the agent, which then implements an action $a_t \in A$. The environment transitions to the next state $s_{t+1}$ with a probability $p(s_{t+1}|s_t, a_t)$ while returning an immediate reward signal $r_t \equiv r(s_t, a_t, s_{t+1})$, where $r(\cdot) : S \times A \times S \to \mathbb{R}$ denotes the environment's reward function, which is unknown to the agent. The transition can be denoted concisely as $s_t \xrightarrow{a_t, r_t} s_{t+1}$. The overall schematic is shown in Figure 1. Although the agent is depicted as a neural network (cf. [62]), it may be in the form of a tabular structure.
Instead of greedily aiming to improve the immediate reward $r_t$ at every time instance $t$, the agent may be iteratively trained to maximize the sum of the immediate and the weighted future rewards, which is called the return,

$$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{T-t} r_T = \sum_{\tau = t}^{T} \gamma^{\tau - t} r_\tau. \tag{4}$$

The quantity $\gamma \in [0, 1]$ is called the discount factor. This lookahead feature prevents the agent from learning greedy actions that fetch large instantaneous rewards $r_t$ at each instant $t$ while adversely affecting the environment later on. The process begins at time $t = 0$ and terminates at time $t = T$, the time horizon. The environment's initial state at $t = 0$ is denoted as $s_0 \in S$. The initial state may be probabilistic, following a distribution $p_0$. It should be noted that if $T = \infty$, then the discount factor must be less than unity ($\gamma < 1$) so that the return $R_t$ stays finitely bounded at all times $t$.
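As a small illustration of the discounted return, the following sketch computes $R_t$ for every instant of a finite episode using the backward recursion $R_t = r_t + \gamma R_{t+1}$, which follows directly from the definition of the return as a discounted sum. The function name `discounted_returns` and the sample rewards are hypothetical.

```python
def discounted_returns(rewards, gamma):
    """Return [R_0, ..., R_T], where R_t = r_t + gamma * R_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0                        # return beyond the horizon is zero
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Three unit rewards with gamma = 0.5: R_2 = 1, R_1 = 1.5, R_0 = 1.75.
example = discounted_returns([1.0, 1.0, 1.0], 0.5)
```

With `gamma = 0` the return reduces to the immediate reward, which corresponds to the purely greedy agent mentioned above.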
The 5-tuple $(S, A, p, r, \gamma)$ defines a Markov decision process (MDP). The initial state distribution $p_0$ is assumed to be subsumed by the transition probabilities $p$. The MDP can be viewed as an extension of a discrete Markov model.
The entire sequence of states, actions, and rewards is an episode, denoted $\mathcal{E}$, so that,

$$\mathcal{E} = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T, a_T, r_T). \tag{5}$$

The policy can be deterministic or stochastic. A deterministic policy can be treated as a function $\pi : S \to A$ (see Figure 1) so that $a_t = \pi(s_t)$, whereas a stochastic policy $\pi$ represents a probability distribution over $A$ such that $a_t \sim \pi(s_t)$. In several domains, the probability distribution is determined from the nature of the application itself.
During an episode, the action taken by the agent is in accordance with a policy $\pi \in \Pi$, where $\Pi$ is the policy space. From the Markovian (memoryless) property of the MDP, it follows that the optimal action of an agent at each state, in terms of its stated goal of maximizing the total return $R_t$, is independent of all previous states of the environment. Therefore, the action $a_t$ taken by the agent at time $t$ under policy $\pi$ is based solely on the state $s_t$, and the prior history of states and actions need not be taken into account.
The overall aim of reinforcement learning is usually to maximize an objective function $J(\cdot)$. Let $J(\mathcal{E})$ denote the total return $R_0$ of a given episode $\mathcal{E}$. If the MDP is initialized to any state $s \in S$ at $t = 0$ (such that $s_0 = s$), the expected value of this return, which is dependent on policy $\pi$, may be expressed as,

$$J^\pi(s) = \mathbb{E}_\pi \left[ J(\mathcal{E}) \mid s_0 = s \right] = \mathbb{E}_\pi \left[ R_0 \mid s_0 = s \right]. \tag{6}$$

The operator $\mathbb{E}_\pi[\cdot]$ is the expectation when all episodes are generated by the MDP under policy $\pi$. When the initial state follows the MDP's initial state distribution, i.e., $s_0 \sim p_0$, the expectation may be denoted simply as $J^\pi$ without any argument. This informal convention of function overloading is adopted throughout this manuscript, as there are other ways to define the objective function. The policy that at each state $s$ implements the action that maximizes $J^\pi(s)$ is referred to as the optimal policy and represented as $\pi^*$.

Taxonomy of Algorithms
RL methods can be classified in several ways. In model-free training, RL takes place with the agent connected to the real-world environment, whereas in model-based RL, the agent is trained using a simulation platform to represent the environment.
In model-based RL, as the transition probabilities and the reward function are available through the environmental model platform, the algorithm must be implemented in an offline manner. Of more practical interest is online RL, where the agent can be trained in real time by interacting with the real physical environment as shown in Figure 1. Although online RL is considered to be model-free, practically all research papers report the use of HEMS models for training [49].
In on-policy RL, a referential policy $\pi^*$, such as that of a human, is considered to be the optimal policy and is known a priori. The goal of RL is to learn a policy $\pi \cong \pi^*$. In off-policy approaches, the goal is to obtain the optimal policy $\pi^*$, which maximizes $J^\pi(s)$ in Equation (6).
Another fundamental trichotomy of the plethora of RL approaches used today includes value-based RL, policy-based RL, and actor-critic RL, the latter having emerged more recently. Actor-critic methods are hybrid approaches that borrow features from value-based as well as policy-based RL [64]. The classification of various approaches used in HEMS applications is shown in Figure 2. These are also described at great length in this article, which may be used as a tutorial-style exposition of RL for the interested reader.

Value-Based Reinforcement Learning
Historically, value-based RL, first proposed in [65], heralds the advent of the broad area of reinforcement learning as a distinct branch of AI. These approaches are based on dynamic programming. The formal definition of an MDP was introduced shortly thereafter [66,67]. As noted earlier, an MDP is memoryless. An implication of this feature is that when the environment is in any given state $s_t = s$ at any instant $t$, the prior history $s_0 \xrightarrow{a_0, r_0} \cdots$ is not of any consequence in deciding the future course of actions [19]. Accordingly, one can define the state-action value, or Q-value, of the state $s \in S$ and for each action $a \in A$, as the expected return when taking $a$ from $s$ (cf. [19]),

$$q(s, a) = \mathbb{E} \left[ R_t \mid s_t = s, a_t = a \right]. \tag{7}$$

Referring to a specific policy $\pi$ may be achieved by using a superscript in the above equation, so that the left-hand side of Equation (7) is written as $q^\pi(s, a)$.
It must be noted that even under a deterministic policy, $q^\pi(s, a)$ can still be defined for any action $a \neq \pi(s)$ merely by treating $a$ as an evaluative action and following the policy at all future times. Whence (cf. [19]),

$$q^\pi(s, a) = \sum_{s' \in S} p(s'|s, a) \left[ r(s, a, s') + \gamma \sum_{a' \in A} \pi(a'|s') \, q^\pi(s', a') \right]. \tag{8}$$

The Q-value function $q^\pi : S \times A \to \mathbb{R}$ can be defined using Equation (7) irrespective of whether the policy is stochastic or deterministic. In case of a deterministic policy, Equation (8) can be applied by letting $\pi(a'|s') = 1$ when $a' = \pi(s')$, and $\pi(a'|s') = 0$ otherwise.
A stochastic policy is intrinsic to many real-world applications. For instance, in order to decrease the ambient temperature by manually lowering the thermostatic setting, the final setting involves a degree of randomness arising from human imprecision. In multiagent environments, the best course may often be to adopt a stochastic policy. As an example, in a repeated game of rock-paper-scissors, randomly selecting each action ('rock', 'paper', or 'scissors') with equal probabilities of 1/3 is the only policy that would ensure that the probability of losing a round of the game does not exceed that of winning.
From a machine learning standpoint, stochastic policies help explore and assess the effects of the entire repertoire of actions available in $A$. Such exploration is critical during the initial stages of the learning algorithm. The two most commonly used stochastic policies are the $\varepsilon$-greedy and the softmax policies. Under an $\varepsilon$-greedy policy $\pi$, the probability of picking an action $a$ when the environmental state is $s$ is given by,

$$\pi(a|s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|A|}, & a = \arg\max_{a' \in A} q(s, a'), \\ \dfrac{\varepsilon}{|A|}, & \text{otherwise.} \end{cases} \tag{9}$$

It is always a good idea to lower the parameter $\varepsilon$ steadily so that as learning progresses, the agent becomes greedier, that is, likelier to select actions with the highest Q-values, $\arg\max_{a'} q(s, a')$. The softmax policy is the other popular method to incorporate exploration into a policy. The probability of applying action $a$ under such a policy $\pi$ is,

$$\pi(a|s) = \frac{\exp\left( q(s, a) / \tau \right)}{\sum_{a' \in A} \exp\left( q(s, a') / \tau \right)}. \tag{10}$$

Initialized to a high value, the Gibbs-Boltzmann parameter $\tau$ may be steadily lowered as the learning algorithm progresses, so that the policy becomes increasingly exploitative, that is, taking the action with the highest Q-value more often. Unless specified otherwise, it shall be assumed hereafter that the policy space $\Pi$ is stochastic so that actions follow a probability distribution ($a \sim \pi(s)$).
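The two exploratory policies can be sketched in a few lines of code. The sketch below assumes a common form of the $\varepsilon$-greedy rule, in which the greedy action receives probability $1 - \varepsilon + \varepsilon/|A|$ and every other action $\varepsilon/|A|$, together with the Gibbs-Boltzmann softmax with temperature $\tau$. The function names and the sample Q-values are hypothetical.

```python
import math

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities under an epsilon-greedy policy for one state."""
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n            # uniform exploration mass
    probs[best] += 1.0 - epsilon         # remaining mass on the greedy action
    return probs

def softmax_probs(q_values, tau):
    """Action probabilities under a softmax (Gibbs-Boltzmann) policy."""
    m = max(q_values)                    # subtract the max for numerical stability
    exps = [math.exp((q - m) / tau) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

q_s = [1.0, 3.0, 2.0]                    # hypothetical Q-values of three actions
p_eps = epsilon_greedy_probs(q_s, epsilon=0.3)
p_hot = softmax_probs(q_s, tau=100.0)    # high temperature: nearly uniform
p_cold = softmax_probs(q_s, tau=0.01)    # low temperature: nearly greedy
```

Lowering `epsilon` or `tau` over the course of training shifts both policies from exploration towards exploitation.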
Exploration is applied to stochastically search and evaluate the available repertoire of actions at each state, before converging towards the optimal one. It is an essential component of value-based RL. Since exploitation is the strategy of picking the best known actions, it should not be applied until the algorithm has explored all actions sufficiently. However, endowing the learning algorithm with too much exploration slows down the learning. Identifying the right tradeoff between exploration and exploitation is a widely studied problem in machine learning [68]. It is for this reason that the parameters $\varepsilon$ in Equation (9) and $\tau$ in Equation (10) are steadily lowered as learning progresses.
Instead of an evaluative action $a$, suppose the policy $\pi$ is applied from state $s$ (so that either $a = \pi(s)$ or $a \sim \pi(s)$); then the expected return is called the state's value,

$$v(s) = \mathbb{E} \left[ R_t \mid s_t = s \right]. \tag{11}$$

As with the Q-value function, the policy $\pi$ becomes explicit if the value of $s$ is written as $v^\pi(s)$. The value of $s$ can be expressed in terms of Q-values as,

$$v^\pi(s) = \sum_{a \in A} \pi(a|s) \, q^\pi(s, a). \tag{12}$$

The difference between the Q-value of implementing an action $a$ from $s$ and the value of the state $s$ under the policy is the advantage function, so that,

$$A^\pi(s, a) = q^\pi(s, a) - v^\pi(s). \tag{13}$$

Although the preferred notation in this manuscript is to use lowercase letters to denote variables, the advantage function is represented using uppercase as the lowercase $a$ is reserved to denote an action.
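The relationship between Q-values, state values, and advantages can be checked numerically with a brief sketch. It assumes a single state with a finite action set, the standard sign convention in which the advantage is the Q-value minus the state value, and hypothetical function names and numbers.

```python
def state_value(policy_probs, q_values):
    """v(s) as the policy-weighted average of the Q-values at state s."""
    return sum(p * q for p, q in zip(policy_probs, q_values))

def advantages(policy_probs, q_values):
    """Advantage of each action: q(s, a) - v(s)."""
    v = state_value(policy_probs, q_values)
    return [q - v for q in q_values]

pi_s = [0.5, 0.5]                        # hypothetical stochastic policy at s
q_s = [1.0, 3.0]                         # hypothetical Q-values at s
v_s = state_value(pi_s, q_s)             # 0.5 * 1 + 0.5 * 3 = 2
adv_s = advantages(pi_s, q_s)            # [-1, 1]
```

A useful sanity check is that the advantages averaged under the policy sum to zero, since the state value is itself the policy-weighted mean of the Q-values.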
The value function $v : S \to \mathbb{R}$ allows the optimal policy $\pi^*$ to be defined in a formal manner. If the objective function with the MDP initialized to some $s_0 \in S$ is defined as in Equation (6), then it is evident from Equation (11) that $J^\pi(s_0) = v^\pi(s_0)$. Furthermore, if the MDP visits state $s_t = s$ at the instant $t$, then from Equation (4),

$$\mathbb{E}_\pi \left[ R_0 \mid s_t = s \right] = r_0 + \gamma r_1 + \cdots + \gamma^{t-1} r_{t-1} + \gamma^t v^\pi(s). \tag{14}$$

At this stage, we invoke the memoryless property of the underlying MDP. At instant $t$, the partial sum of the terms $r_0 + \gamma r_1 + \cdots + \gamma^{t-1} r_{t-1}$ on the right-hand side is part of the episode's history, while $\gamma^t v^\pi(s)$ is the expected future return. The optimal policy at every such state $s$ is to implement the action that maximizes $v^\pi(s)$, so that,

$$\pi^* = \arg\max_{\pi \in \Pi} v^\pi(s). \tag{15}$$

When the policy $\pi^*$ is deterministic, it can also be inferred that the optimal action from state $s$ is to select the action with the highest Q-value. From Equation (15) it follows that,

$$\pi^*(s) = \arg\max_{a \in A} q^*(s, a). \tag{16}$$

The Q-value $q^*(s, a)$ is equal to $q^{\pi^*}(s, a)$. It can be mathematically established that the Q-values corresponding to the optimal policy are at least as high as those associated with other policies, i.e., $q^*(s, a) \geq q^\pi(s, a)$ [69]. Bellman's equation for optimality follows from the above consideration,

$$q^*(s, a) = \sum_{s' \in S} p(s'|s, a) \left[ r(s, a, s') + \gamma \max_{a' \in A} q^*(s', a') \right]. \tag{17}$$

The difference between Equation (8) and Equation (17) is in the second term in each summand. The policy-based Q-value in Equation (8) is replaced with the maximum Q-value in Equation (17). A mathematically rigorous coverage of various RL methods can be found in the seminal book [70] that is available online.
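When the transition probabilities and the reward function are known, Bellman's optimality equation can be solved by repeatedly applying its right-hand side as an update (Q-value iteration). The sketch below assumes such a known model, which is a model-based setting; the function name, the data layout of `p` and `r`, and the two-state toy MDP are hypothetical illustrations.

```python
def q_value_iteration(p, r, gamma, sweeps=200):
    """Iterate the Bellman optimality backup on a small, fully known MDP.

    p[s][a] is a list of (next_state, probability) pairs and
    r[s][a][s2] is the reward for the transition s -> s2 under action a.
    """
    n_states, n_actions = len(p), len(p[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        # Synchronous sweep: the new table is built entirely from the old one.
        q = [[sum(prob * (r[s][a][s2] + gamma * max(q[s2]))
                  for s2, prob in p[s][a])
              for a in range(n_actions)]
             for s in range(n_states)]
    return q

# Toy MDP: action 0 stays put, action 1 moves to the other state;
# landing in state 1 earns a reward of 1, landing in state 0 earns nothing.
p = [[[(0, 1.0)], [(1, 1.0)]],
     [[(1, 1.0)], [(0, 1.0)]]]
r = [[[0.0, 1.0] for _ in range(2)] for _ in range(2)]
q_star = q_value_iteration(p, r, gamma=0.5)
greedy = [max(range(2), key=lambda a: q_star[s][a]) for s in range(2)]
```

For this toy problem the iteration converges to $q^* = [[1, 2], [2, 1]]$, so the greedy policy moves from state 0 to state 1 and then stays there.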

Tabular Q-Learning
The simplest possible implementation of the Q-learning algorithm is tabular Q-learning, where an $|S| \times |A|$ sized array is maintained to store $q(s, a)$ for every state-action pair [71]. Initialized to either zeros or small random values, the tabular entries are periodically updated. As it is an online approach, Q-learning cannot use the transition probabilities $p(s'|s, a)$. For each transition $s \xrightarrow{a, r} s'$, the tabular entry for $q(s, a)$ is incremented as,

$$q(s, a) \leftarrow q(s, a) + \eta \left( t - q(s, a) \right). \tag{18}$$

The quantity $\eta$ is the learning rate; usually $\eta \ll 1$. The quantity $t$ is the target,

$$t = r + \gamma \max_{a' \in A} q(s', a'). \tag{19}$$

In order to impart an exploratory component to Q-learning, the action $a$ must be selected probabilistically as in Equation (9) or Equation (10). In many cases, increments are applied in real time at the end of each time instance. It can be shown mathematically that the tabular entries converge towards the maximum values, $q^*(s, a)$ [69], implying that Q-learning is an off-policy approach. The fully trained agent can select actions as per Equation (16) during actual use.
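A minimal sketch of tabular Q-learning with $\varepsilon$-greedy exploration is given below. The environment `toy_step`, the function name `q_learning`, and all hyperparameter values are hypothetical and chosen only to keep the example small; they do not correspond to any HEMS model in the surveyed papers.

```python
import random

def q_learning(step, n_states, n_actions, episodes=500, horizon=20,
               eta=0.1, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning; step(s, a) must return (next_state, reward)."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = rng.randrange(n_states)              # random initial state
        for _ in range(horizon):
            if rng.random() < epsilon:           # explore
                a = rng.randrange(n_actions)
            else:                                # exploit the current table
                a = max(range(n_actions), key=lambda i: q[s][i])
            s2, reward = step(s, a)
            target = reward + gamma * max(q[s2])     # off-policy target
            q[s][a] += eta * (target - q[s][a])      # incremental update
            s = s2
    return q

def toy_step(s, a):
    """Hypothetical 2-state environment: action 1 toggles the state,
    action 0 keeps it; landing in state 1 pays a reward of 1."""
    s2 = 1 - s if a == 1 else s
    return s2, (1.0 if s2 == 1 else 0.0)

q_table = q_learning(toy_step, n_states=2, n_actions=2)
```

In this toy environment the learned table favors toggling out of state 0 and staying in state 1, and the entry for staying in state 1 approaches $1/(1 - \gamma) = 10$.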
SARSA (State-Action-Reward-State-Action) [70,72] is an on-policy RL algorithm that can be implemented in a tabular manner. The update rule for SARSA is identical to the earlier expression in Equation (18). However, since SARSA is an on-policy algorithm, the target is specific to the policy $\pi$ and is given by,

$$t = r + \gamma \, q(s', a'), \qquad a' = \pi(s'). \tag{20}$$

Both Q-learning and SARSA use the tabular entries $q(s', a')$ of the environment's new state $s'$ following the transition $s \xrightarrow{a, r} s'$. The difference is in how the entries are used. Whereas Q-learning uses the tabular entry corresponding to the action $a'$ with the highest $q(s', a')$, SARSA applies the specified policy $\pi$, using $q(s', a')$, the Q-value of the action $a' = \pi(s')$. This difference is analogous to that between Equation (17) and Equation (8).
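The contrast between the two targets can be made concrete with a short sketch. The table `q_demo` and the function names are hypothetical; the only difference between the two functions is whether the maximum entry of the next state's row or the entry of the action actually chosen by the policy is used.

```python
def q_learning_target(q, reward, s_next, gamma):
    """Off-policy target: bootstrap from the best action in s_next."""
    return reward + gamma * max(q[s_next])

def sarsa_target(q, reward, s_next, a_next, gamma):
    """On-policy target: bootstrap from the action a_next that the
    policy actually selects in s_next."""
    return reward + gamma * q[s_next][a_next]

q_demo = [[0.0, 0.0], [2.0, 5.0]]        # hypothetical 2 x 2 Q-table
t_q = q_learning_target(q_demo, 1.0, s_next=1, gamma=0.5)          # 1 + 0.5 * 5
t_sarsa = sarsa_target(q_demo, 1.0, s_next=1, a_next=0, gamma=0.5)  # 1 + 0.5 * 2
```

The two targets coincide whenever the policy happens to pick the greedy action in the next state.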
Tabular Q-learning and SARSA can handle continuous state as well as action spaces by discretizing them into a finite and tractable number of subdivisions. Unfortunately, such tabular learning methods cannot be applied in many large-scale domains. This is because too many discrete levels would make the algorithm computationally too intensive, if not outright intractable.
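As an illustration of such discretization, the sketch below maps a continuous quantity onto uniformly spaced bins and combines two binned quantities into a single table row. The choice of variables (an indoor temperature between 15 and 30 °C and an electricity price between 0 and 0.4 per kWh), the bin counts, and the function names are hypothetical.

```python
def discretize(value, low, high, n_bins):
    """Map a continuous value in [low, high] to a bin index 0..n_bins-1."""
    clipped = min(max(value, low), high)         # clamp out-of-range readings
    index = int((clipped - low) / (high - low) * n_bins)
    return min(index, n_bins - 1)                # upper edge joins the last bin

def table_row(temp_c, price, n_temp_bins=6, n_price_bins=4):
    """Flatten a (temperature, price) state into one row index of the Q-table."""
    temp_bin = discretize(temp_c, 15.0, 30.0, n_temp_bins)
    price_bin = discretize(price, 0.0, 0.4, n_price_bins)
    return temp_bin * n_price_bins + price_bin

row = table_row(21.0, 0.25)                      # temperature bin 2, price bin 2
n_rows = 6 * 4                                   # table rows grow multiplicatively
```

Because the number of rows is the product of the bin counts of all state variables, adding variables or refining the bins quickly leads to the intractability noted above.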
When the state space $S$ is too large (e.g., $|S| \approx 7.73 \times 10^{45}$ in chess), tabular learning becomes prohibitively expensive not only in terms of storage requirements but also in terms of the computational time needed by the RL algorithm. Deep Q-networks (DQNs) are well equipped to handle such large discrete as well as continuous state spaces [73]. However, $|A|$, the cardinality of the action space, must still be tractably small. In a DQN, the mapping from every state-action pair $(s, a)$ to its Q-value is carried out by means of a DNN.
In reality, the DNN input is some feature vector $\phi(s)$ of the state $s$, where $\phi : S \to \mathbb{R}^D$ and $D$ is the dimensionality of the feature space. In the same manner, the action $a$ can be represented in terms of its feature vectors. However, as $|A|$ is small, it is assumed that the action $a$ itself is the other input. In practice, a unary encoding scheme may be used to represent actions. For instance, if $|A| = 4$, the four discrete actions may be encoded as 0001, 0010, 0100, and 1000. Under these circumstances, the actual input to the DNN is $(\phi(s), a)$, and its actual output is $y(\phi(s), a|\theta)$. For simplicity we will treat the DNN input as $(s, a)$ and the output as $q(s, a|\theta)$, that is, $(s, a) \equiv (\phi(s), a)$, and $q(s, a|\theta) \equiv y(\phi(s), a|\theta)$. The Q-value $q(s, a|\theta)$ is conditioned in terms of the weight parameter $\theta$ in this manner so as to explicitly reflect its dependence on the latter.
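The unary (one-hot) encoding of actions and its concatenation with the state features can be sketched as follows. The function names and the sample feature vector are hypothetical, and the position of the set bit is an arbitrary indexing convention.

```python
def one_hot(action, n_actions):
    """Unary (one-hot) code of a discrete action index."""
    code = [0.0] * n_actions
    code[action] = 1.0
    return code

def dnn_input(state_features, action, n_actions):
    """Concatenate phi(s) with the one-hot action code to form the DNN input."""
    return list(state_features) + one_hot(action, n_actions)

x_in = dnn_input([0.5, -1.0], action=2, n_actions=4)
```

Here a two-dimensional feature vector and four actions yield a six-dimensional input to the network.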

Deep Q-Networks
There are two possible ways in which the mapping of a state-action pair $(s, a)$ to its Q-value $q(s, a|\theta)$ can be accomplished, which are as follows.
(i) A different DNN for each action is maintained, so that the total number of DNNs in this arrangement is $|A|$. The state $s$ (encoded appropriately using the state's features) serves as the common input to all the DNNs.
(ii) A single DNN with separate inputs for state $s$ and action $a$ is maintained, and its output is $q(s, a|\theta)$. While this manner of storing Q-values requires the use of only a single DNN, in order to obtain $\max_a q(s, a|\theta)$, the actions must be applied sequentially to it.
The two schemes are depicted in Figure 3. Stochastic gradient descent can be applied in a straightforward fashion to train the weight parameter $\theta$ as in Equation (2) for the squared error loss $\tfrac{1}{2} \left( t - q(s, a|\theta) \right)^2$, and with the DNN's output $y$ now being $q(s, a|\theta)$,

$$\theta \leftarrow \theta + \eta \left( t - q(s, a|\theta) \right) \nabla_\theta q(s, a|\theta). \tag{21}$$

This simple approach is the neural-fitted Q-iteration (NFQI) that was proposed in [74]. The target $t(n)$ is determined in accordance with Equation (19), with $q(s', a'|\theta)$ used to obtain the target $t$ so that $t = r + \gamma \max_{a' \in A} q(s', a'|\theta)$. When using tabular entries in place of $\theta$, it becomes the fitted Q-iteration (FQI). When the learning agent interacts with the environment, the actions are generally selected using the $\varepsilon$-greedy method shown in Equation (9).
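A single stochastic gradient step on the squared error between the target and the approximated Q-value can be sketched as below. To keep the example self-contained, a linear approximator with one weight vector per action replaces the DNN, which is an assumption made for illustration only; the target is treated as a constant when differentiating, and all names and numbers are hypothetical.

```python
def q_value(theta, phi, a):
    """Linear stand-in for the network: q(s, a | theta) = theta[a] . phi(s)."""
    return sum(w * x for w, x in zip(theta[a], phi))

def q_sgd_step(theta, phi, a, reward, phi_next, eta, gamma):
    """One SGD step on 0.5 * (t - q(s, a | theta))^2 with the target held fixed."""
    target = reward + gamma * max(q_value(theta, phi_next, b)
                                  for b in range(len(theta)))
    error = target - q_value(theta, phi, a)
    # For this linear model the gradient of q with respect to theta[a] is phi(s);
    # the weight vectors of the other actions are left unchanged.
    theta[a] = [w + eta * error * x for w, x in zip(theta[a], phi)]
    return error

theta = [[0.0, 0.0], [0.0, 0.0]]         # two actions, two features
err = q_sgd_step(theta, phi=[1.0, 0.0], a=0, reward=1.0,
                 phi_next=[0.0, 1.0], eta=0.5, gamma=0.9)
```

With a shared network instead of per-action weight vectors, the same step would also change the Q-values of other state-action pairs, which is the source of the correlation issue discussed next.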
Temporal correlation in real-time training samples is an unfortunate drawback when directly implementing stochastic gradient descent. Unlike in tabular learning, in DQN updating θ changes not only the output q(s, a|θ) for the relevant state-action pair (s, a) but also the Q-values q(s′, a′|θ) of every other pair (s′, a′) ≠ (s, a). The change may be barely noticeable when s′ is at a large distance from s within the feature space ϕ(S); unfortunately, this is rarely the case in real-world domains.
Consider two successive transitions s_t →(a_t, r_t) s_{t+1} →(a_{t+1}, r_{t+1}) ⋯. Because successive states are temporally correlated, it is reasonable to expect the distance ‖ϕ(s_t) − ϕ(s_{t+1})‖ to be very small. Therefore, applying Equation (21) to update q(s_t, a_t|θ) will have an undesirable yet pronounced effect on q(s_{t+1}, a_{t+1}|θ). A similar argument holds for time sequences of actions as well.
To address the ill effects of temporal correlation, DNN training is carried out only after the completion of one or more episodes, during which the DQN agent is allowed to exert control over the environment while θ remains unchanged. All training samples are stored in an experience replay buffer B [75], which plays the role of a mini-batch in DNN training. After enough training samples have been accumulated in B, it is shuffled randomly before θ is incremented. The increment may be implemented either as in Equation (21), or through minibatch gradient descent as indicated earlier in Equation (3) with q(s, a|θ) replacing y(x|θ) (see Figure 4). For convenience, this update is restated as Equation (22). The buffer B is flushed before the next cycle begins with the updated parameter θ. An improvement over this scheme is prioritized replay [76], where the probability of a sample being selected for a training step is proportional to (t − q(s, a|θ))² + ε. The small constant ε > 0 is added to the squared loss term to ensure that all samples have non-zero probabilities.
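The prioritized selection rule can be illustrated with a short sketch. It only normalizes the priorities (t − q)² + ε into probabilities and draws a minibatch; the function names and the example numbers are hypothetical, and practical implementations of prioritized replay [76] include further refinements that are not shown here.

```python
import random


def replay_probabilities(targets, q_values, eps=1e-3):
    """Selection probability proportional to (t - q)^2 + eps for each stored sample."""
    priorities = [(t - q) ** 2 + eps for t, q in zip(targets, q_values)]
    total = sum(priorities)
    return [p / total for p in priorities]


def sample_minibatch(buffer, probabilities, batch_size, rng):
    """Draw a minibatch (with replacement) according to the given probabilities."""
    return rng.choices(buffer, weights=probabilities, k=batch_size)


targets = [1.0, 0.0, 2.0]    # hypothetical targets t(n)
q_values = [1.0, 1.0, 0.0]   # hypothetical outputs q(s(n), a(n) | theta)
probs = replay_probabilities(targets, q_values)
batch = sample_minibatch(["n0", "n1", "n2"], probs, batch_size=2, rng=random.Random(0))
```

The first sample has zero error, yet it keeps a small non-zero probability because of ε.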
Target non-stationarity is another closely related problem that arises in DQNs, one that is not seen in tabular Q-learning. For any given sample transition s(n) → s′(n), an undesirable effect of incrementing the DNN weight parameter θ in accordance with Equation (21) is that the target t(n) also changes. This is because the target is determined as t(n) = r + γ max_{a′∈A} q(s′, a′|θ), and the same DNN is used to obtain q(s′(n), a′|θ). Target non-stationarity is handled by storing an older copy θ_target of the primary DNN's parameter θ in memory and using this stored copy to compute the target t(n). Effectively, the RL algorithm maintains a separate target DNN parametrized by θ_target. Thus, the target is,

$$ t(n) = r + \gamma \max_{a' \in \mathcal{A}} q\big(s'(n), a' \mid \theta_{\mathrm{target}}\big). \tag{23} $$

The target DNN's weight parameter is updated infrequently, and only after θ has undergone a significant amount of training. In this manner, the targets remain stationary while the primary DNN's parameter θ is trained, so that gradient descent steps can be implemented in a straightforward manner using the terms ½(q(s(n), a(n)|θ) − t(n|θ_target))² in the loss function J(θ). This scheme is shown in Figure 5.

Overestimation bias [77,78] is another problem frequently encountered in stochastic environments. It is an outcome of the maximization operation. As an example, consider an MDP with S = {A, B, C}, where C is the terminal state. This is shown in Figure 6. The action space A = {a_k | k = 1, 2, . . ., N}, where N is relatively large, is available to the agent. From state A, only action a_N leads to B, whereas the remaining ones, a_1 through a_{N−1}, lead to C. The reward received from state A is always zero (i.e., r(A, a_k) = 0). From state B all actions lead to C, with the reward being either −3 or +1 with equal probabilities of 1/2. In other words, the possible transitions are A → B under a_N with reward 0, A → C under a_1 through a_{N−1} with reward 0, and B → C under any action with reward −3 or +1. Since the rewards of −3 and +1 have the same probability when the environment transitions from B to C, the expected reward from B to C is −1, i.e., E[r(B, a_k, C)] = −1. For simplicity, let us assume that η = 1. The Q-values q(B, a_k) of some actions would then be updated to −3, whereas those of others would be updated to +1. Since N is large, it is very likely that at least one of them has the higher of the two values. Consider now the Q-values of actions from state A. It is clear that for k = 1 through N − 1, q(A, a_k) = 0. However, when the agent selects action a_N from A, thereby reaching B, the operation max_{a∈A} q(B, a) is likely to return +1, so that q(A, a_N) would be updated to γ max_{a∈A} q(B, a) = γ. This makes a_N appear to be the optimal action from state A, when in fact it is the worst choice in A.

Double Q-learning [79] is a popular approach to circumvent overestimation bias in off-policy RL (see Figure 7). Although first proposed in a tabular setting [77], more recent research implements double Q-learning in conjunction with DNNs, which is called the double deep Q-network (DDQN). It incorporates two DNNs with the parameters θ_1 and θ_2. Samples are collected by implementing actions using their mean Q-values, ½(q(s, a|θ_1) + q(s, a|θ_2)). For each sample transition s(n) → s′(n) during training, one of the two DNNs, say DNN j (j ∈ {1, 2}), is picked randomly and with equal probability to compute the target, and the other DNN is trained with it, as expressed in Equation (24). Each DNN therefore has a 0.5 probability of being trained with the transition sample. This is the manner of updating that was originally proposed in [77].
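The overestimation effect in the three-state example can be checked numerically. The sketch below assumes, as in the text, that η = 1 so each q(B, a_k) equals a single sampled reward of −3 or +1; the choice N = 20, the number of trials, and the random seed are arbitrary illustration values.

```python
import random


def max_of_noisy_estimates(num_actions, rng):
    """Each q(B, a_k) is one sampled reward: -3 or +1 with probability 1/2 each."""
    estimates = [rng.choice([-3.0, 1.0]) for _ in range(num_actions)]
    return max(estimates)


rng = random.Random(0)
N = 20
trials = [max_of_noisy_estimates(N, rng) for _ in range(1000)]
mean_max = sum(trials) / len(trials)   # empirical average of max_a q(B, a)
prob_plus_one = 1.0 - 0.5 ** N         # chance that at least one estimate is +1
true_value = 0.5 * (-3.0) + 0.5 * 1.0  # expected reward from B, equal to -1
print(mean_max, prob_plus_one, true_value)
```

The maximum over the noisy estimates is almost always +1, even though every action from B has an expected reward of −1, which is the bias that double Q-learning is meant to reduce.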
An extension of DDQN is clipped DDQN [80-82]. Instead of selecting the target randomly, it is obtained as the minimum of the Q-values q(s′(n), a′|θ_1) and q(s′(n), a′|θ_2), as given in Equation (25).

Dueling DQN architectures (dueling-DQN) [83] use a different scheme to avoid overestimation bias (see Figure 8). The scheme divides the state-action value q(s, a) into two parts, the state value v(s) and the state-action advantage A(s, a). As shown in Equation (13), the advantage is the difference q(s, a) − v(s), so that q(s, a) is the sum of the two quantities. The advantage A(s, a) of action a in state s is the expected gain in the return obtained by picking action a. The DNN layout consists of an input layer for the state s. After a few initial preprocessing layers, it splits into two separate pathways, each of which is a fully connected DNN. Letting the symbols θ_V and θ_A denote the weight parameters of the pathways, the scalar output of the value pathway is the state's value v(s|θ_V), and the output of the advantage pathway is an |A|-dimensional vector comprising the advantages A(s, a|θ_A) of all available actions in A.
The Q-value of the state-action pair (s, a) can be obtained in a straightforward manner as provided in Equation (26). The quantity θ denotes the set of all weight parameters of the dueling-DQN, including θ_V and θ_A as well as those present in the earlier preprocessing layers.
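A minimal sketch of how the two pathway outputs might be combined is given below. The plain sum follows the decomposition q = v + A described above; the optional mean-subtracted variant is a commonly used alternative that is included here as an assumption, since the exact form of Equation (26) is not reproduced in this section. The function name and example numbers are hypothetical.

```python
def dueling_q_values(state_value, advantages, subtract_mean=False):
    """Combine v(s) and the advantage vector A(s, .) into Q-values for all actions."""
    if subtract_mean:
        # Assumed variant: center the advantages so that v(s) is identifiable.
        mean_adv = sum(advantages) / len(advantages)
        return [state_value + a - mean_adv for a in advantages]
    return [state_value + a for a in advantages]


v_s = 2.0                   # hypothetical output of the value pathway
adv = [1.0, -1.0, 0.0]      # hypothetical outputs of the advantage pathway
q = dueling_q_values(v_s, adv)
greedy_action = max(range(len(q)), key=lambda a: q[a])
```

Either variant leaves the greedy action unchanged, because both add the same constant to every action's advantage.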

Policy-Based and Actor-Critic Reinforcement Learning
Like tabular Q-learning, tabular policy-based RL uses an array of Q-values. Initialized with an arbitrary policy π, the tabular policy RL algorithm is an iterative process comprising two steps [70,84]. Policy evaluation is carried out in the first step, where the Q-values q_π(s, a) are learned as shown in Equation (18) and Equation (20). In the second step, the policy is refined by defining the action for each state as shown in Equation (16). The two-step process is repeated until the policy can be refined no further.
Gradient descent policy learning methods do not directly draw upon tabular policy learning in the same way that value-based learning does. These methods are realized through DNNs as the agents. An attractive feature of deep policy RL is its intrinsic ability to handle continuous states as well as continuous actions.

Deep Policy Networks
Policy gradient uses an experience replay buffer B in the same manner as a DQN. The buffer stores full episodes of sequences. Instead of using Equation (6), it is convenient to directly express the loss function in terms of episodes E and the DNN's weight parameter θ, in the following manner,

$$ J(\theta) = \mathbb{E}_{\mathcal{E} \sim \pi_\theta}\big[R(\mathcal{E})\big]. \tag{27} $$

Policy gradient methods try to maximize this quantity, so that it plays the role of an objective rather than a loss to be minimized. The operator E_{E∼πθ}[·] is the expectation taken with the DNN agent operating under the probabilistic policy πθ. The initial state s_0 in the above expression is implicitly defined in E. Moreover, the distribution of s_0 within S is in accordance with the underlying MDP. The quantity R(E) is the total return R_0 of episode E starting from t = 0.
Note that for a transition s → s′ under action a, the reward r(s, a, s′) is a feedback signal determined by the environment (such as a home or residential complex), which is external to the agent. So is the discounted, aggregate return R(E), which is also equal to that in Equation (4). No function R : S^T × A^T → R that maps a sequence of states and actions of time horizon T to a return is available to the agent. Consequently, a straightforward gradient step in the direction of ∇θ R(E) cannot be applied. In an apparent paradox, it turns out that its expected value, E_{E∼πθ}[R(E)], can be differentiated by the agent, which is also the rationale behind expressing the loss as in Equation (27). This is due to a mathematical result known as the policy gradient theorem [14,85,86], which establishes the theoretical foundation for the majority of deep policy gradient methods. It can be stated mathematically as below,

$$ \nabla_\theta \mathbb{E}_{\mathcal{E} \sim \pi_\theta}\big[R(\mathcal{E})\big] = \mathbb{E}_{\mathcal{E} \sim \pi_\theta}\big[R(\mathcal{E}) \nabla_\theta \log p(\mathcal{E} \mid \theta)\big]. \tag{28} $$

The significance of the theorem is that the gradient of the expected return, E_{E∼πθ}[R(E)], does not require the gradient of the return R(E). Only the log probability of the episode E must be differentiated. Fortunately, this gradient can readily be computed by the DNN agent. The probability of a transition s_t → s_{t+1} in E (see Equation (5)) is the product πθ(a_t|s_t) p(s_{t+1}, r_t|s_t, a_t); its logarithm is log πθ(a_t|s_t) + log p(s_{t+1}, r_t|s_t, a_t). The second term is intrinsic to the environment and independent of the DNN, so that its derivative with respect to θ is zero. Since p(E|θ) is the product p(s_0) ∏_t πθ(a_t|s_t) p(s_{t+1}, r_t|s_t, a_t), we arrive at the following interesting result,

$$ \nabla_\theta \log p(\mathcal{E} \mid \theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t). \tag{29} $$

The left-hand side of Equation (28) can therefore be estimated rather easily using the expression in Equation (29). This is because the policy πθ is, in fact, based on the DPN output. Whereas [85] uses softmax policies as in Equation (10), it is quite usual in later research to adopt Gaussian policies (cf. [14]). Since πθ is the same policy that is used to obtain transition samples, Equation (28) pertains to on-policy learning.
The expected gradient E_{E∼πθ}[R(E) ∇θ log p(E|θ)] can be estimated as the average of several Monte Carlo samples of episodes (also called rollouts) E(n), n = 1, . . ., N, that are stored in B. This provides an estimate of the gradient of the loss function defined in Equation (27). An early policy gradient method, REINFORCE [73], uses Equation (29) to increment θ. The REINFORCE on-policy update rule is expressed as,

$$ \theta \leftarrow \theta + \eta \frac{1}{N} \sum_{n=1}^{N} \big(R(\mathcal{E}(n)) - b\big) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\big(a_t(n) \mid s_t(n)\big). \tag{30} $$

In the above expression, it is assumed for simplicity that the time horizon is fixed across all N samples. The quantity b is called the baseline [87]. It can be set to zero in the basic implementation of policy learning. Figure 9 shows a schematic of this approach.
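The Monte Carlo estimate used by REINFORCE can be sketched for a deliberately simple case. The code assumes a state-independent softmax policy whose parameter θ is the vector of action logits, so that the gradient of log π(a) with respect to θ is one_hot(a) − π; the two rollouts, the baseline value, and the learning rate are hypothetical numbers chosen for illustration.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(logits, action):
    """Gradient of log pi(action) with respect to the logits: one_hot(action) - pi."""
    pi = softmax(logits)
    return [(1.0 if k == action else 0.0) - pi[k] for k in range(len(logits))]


def reinforce_gradient(theta, episodes, baseline=0.0):
    """Average over episodes of (R - b) * sum_t grad log pi(a_t)."""
    grad = [0.0] * len(theta)
    for actions, episode_return in episodes:
        weight = episode_return - baseline
        for a in actions:
            g = grad_log_softmax(theta, a)
            grad = [gi + weight * gk for gi, gk in zip(grad, g)]
    return [gi / len(episodes) for gi in grad]


theta = [0.0, 0.0]
episodes = [([0, 0], 1.0), ([1, 1], 3.0)]  # hypothetical rollouts: (actions, return)
g = reinforce_gradient(theta, episodes, baseline=2.0)
theta = [t + 0.1 * gi for t, gi in zip(theta, g)]  # gradient ascent step, eta = 0.1
```

With the baseline set to the mean return of 2.0, the episode with the lower return pushes its actions down and the one with the higher return pushes its actions up, so the logit of action 1 increases.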
Unfortunately, when the baseline b = 0, the variance in the set of samples of the form R(E(n)) ∑_{t=0}^{T−1} ∇θ log πθ(a_t(n)|s_t(n)) becomes too large. This in turn requires a very large number of Monte Carlo episode samples to be collected. Including a baseline in Equation (30) that is close to E_{E∼πθ}[R(E)] helps reduce the variance to tractable limits. The theoretical optimal baseline estimate is given by Equation (31). There are ways to obtain reasonable baseline estimates in practice that reduce the variance without affecting the bias [87,88]. Actor-critic architectures, which will be described subsequently, are also designed to obtain reliable baseline estimates. Before proceeding further, we will make improvements to Equation (30) on the basis of the following two observations.
The first observation is that in Equation (30) the gradient ∑_t ∇θ log πθ(a_t(n)|s_t(n)) linked with E(n) is weighted by R(E(n)) − b in the outer summation. In this manner, the episode E(n) would receive a higher weight if it fetched a higher return. However, the weighting scheme is rather arbitrary. For instance, with b = 0, if all returns were nonnegative, then all gradients would receive nonnegative weights. On the other hand, if the baseline b were replaced with the expected return, then the gradients of the episodes with lower-than-expected returns would receive negative weights, whereas those with better-than-expected returns would be assigned positive weights. Using Equation (11), it is observed that this baseline b is also the value of the starting state, v(s_0). Our first improvement would be to replace the baseline with a value function.
The second observation is subtler, requiring the scrutiny of the weighting scheme at each time instant t. To simplify the discussion, it will be assumed that the discount γ = 1.
Consider the episode E(n) consisting of transitions of the form s_t(n) → s_{t+1}(n). Ignoring b, the corresponding term in the inner summation, which is ∇θ log πθ(a_t(n)|s_t(n)), is weighted by the return R(E(n)) = r_0(n) + . . . + r_{t−1}(n) + r_t(n) + . . . + r_{T−1}(n). At time instant t, the prior rewards r_0(n) through r_{t−1}(n) represent the past history of the episode E(n); they have no role in how good the action a_t(n) was from state s_t(n). Removing these terms, the weight becomes r_t(n) + . . . + r_{T−1}(n), which is a sample of the return that follows action a_t(n) in state s_t(n) and is estimated by the Q-value q(s_t(n), a_t(n)|θ). Replacing R(E(n)) in Equation (30) with q(s_t(n), a_t(n)|θ) is the other improvement. Once the prior history of the episode is removed, the baseline must be set to v(s_t(n)|θ) instead of v(s_0).
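From the above discussion, it is seen that each factor R(E(n)) − b in Equation (30) should be replaced with q(s_t(n), a_t(n)|θ) − v(s_t(n)|θ). From Equation (13), this is the advantage function A(s_t(n), a_t(n)|θ), so that we replace the original increment rule with the following update rule,

$$ \theta \leftarrow \theta + \eta \frac{1}{N} \sum_{n=1}^{N} \sum_{t=0}^{T-1} A\big(s_t(n), a_t(n) \mid \theta\big) \nabla_\theta \log \pi_\theta\big(a_t(n) \mid s_t(n)\big). \tag{32} $$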
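The two observations can be illustrated with a short sketch that drops past rewards and subtracts a per-state baseline. It uses the sampled reward-to-go as a stand-in for q(s_t, a_t|θ), which is an assumption made here for illustration; the value estimates and rewards are hypothetical numbers.

```python
def rewards_to_go(rewards, gamma=1.0):
    """For each t, the discounted sum r_t + gamma * r_{t+1} + ... to the episode end."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


def advantage_estimates(rewards, state_values, gamma=1.0):
    """Reward-to-go minus v(s_t): a sampled estimate of q(s_t, a_t) - v(s_t)."""
    return [g - v for g, v in zip(rewards_to_go(rewards, gamma), state_values)]


rewards = [1.0, 2.0, 3.0]        # hypothetical rewards r_0, r_1, r_2
state_values = [5.0, 5.0, 5.0]   # hypothetical estimates v(s_t)
adv = advantage_estimates(rewards, state_values)
```

Each gradient term ∇θ log πθ(a_t|s_t) would then be weighted by the corresponding entry of `adv` rather than by the full-episode return.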

Natural Gradient Methods
One of the problems associated with policy gradient learning approaches as in Equation (30) or Equation (32) is choosing an adequate learning rate η. Too small a value of η would necessitate a long training period, whereas a value that is too large would produce a 'jump' in θ large enough to yield a new policy that is too different from the previous one. Although there are several reliable methods to address this effect in gradient descent for supervised learning as in Equation (3), the effect is too pronounced in RL, which diminishes the efficacy of such methods. In extreme cases, a seemingly small increment along the direction of the gradient may lead to irretrievable distortion of the policy itself. The underlying reason behind this limitation is that unlike in Equation (1), the loss function in Equation (27) incorporates a probability distribution.
The change in a policy whenever a perturbation is applied to the parameter should not be quantified in terms of the norm ‖θ − θ_old‖, but using the Kullback-Leibler divergence between the distributions πθ and πθ_old [89]. The K-L divergence is denoted as D_KL(πθ ‖ πθ_old), where the new and previous values of the parameter are θ and θ_old. Figure 10 illustrates the relevance of the K-L divergence. The Hessian (second derivative) of D_KL(πθ ‖ πθ_old) is known as the Fisher information matrix F_θ. The increment ∆θ applied to θ should be in proportion to F_θ^{−1} ∇θ E_{E∼πθ}[R(E)], which is referred to as the natural gradient [90] of the expectation E_{E∼πθ}[R(E)]. The Fisher information matrix can be estimated as in [14] (Equation (33)). Recent policy gradient algorithms use concepts derived from natural gradients [14] to rectify this downside of 'vanilla' gradient descent and to avoid having to select an effective learning rate η. The use of the natural gradient greatly reduces the algorithm's dependence on how the policy is parametrized. Unfortunately, the gains of using the natural gradient come at the cost of increased computational overheads associated with matrix inversion. The overheads may outweigh the gains when the Fisher matrix F_θ is very large. When the policy is represented effectively through the parameter θ, natural gradient training may not provide enough speed-up over vanilla gradient descent.
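The natural gradient direction F_θ^{−1} g can be sketched for a two-parameter policy. The Fisher matrix is estimated here as the average outer product of score vectors ∇θ log πθ, which is a standard identity but is an assumption with respect to the estimator of [14]; the damping term, the restriction to two dimensions, and the helper names are likewise illustrative choices.

```python
def fisher_estimate(score_vectors, damping=1e-3):
    """Average outer product of score vectors, plus damping on the diagonal."""
    d = len(score_vectors[0])
    fisher = [[damping if i == j else 0.0 for j in range(d)] for i in range(d)]
    for g in score_vectors:
        for i in range(d):
            for j in range(d):
                fisher[i][j] += g[i] * g[j] / len(score_vectors)
    return fisher


def solve_2x2(matrix, rhs):
    """Solve matrix * x = rhs for a 2x2 matrix (closed-form inverse)."""
    det = matrix[0][0] * matrix[1][1] - matrix[0][1] * matrix[1][0]
    return [(matrix[1][1] * rhs[0] - matrix[0][1] * rhs[1]) / det,
            (-matrix[1][0] * rhs[0] + matrix[0][0] * rhs[1]) / det]


scores = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical score vectors
fisher = fisher_estimate(scores)
vanilla_gradient = [1.0, 1.0]                  # hypothetical policy gradient
natural_gradient = solve_2x2(fisher, vanilla_gradient)
```

For larger parameter vectors, solving the linear system F_θ x = g directly becomes the expensive step that the text refers to.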
Trust region policy optimization (TRPO) is a class of training algorithms that directly uses the Kullback-Leibler divergence [91]. In TRPO, a hard upper bound is imposed on the divergence produced when the increment ∆θ is applied to the DNN weights θ. Denoting this bound as ε, the constraint is,

$$ D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\theta_{\mathrm{old}}}\big) \le \varepsilon. \tag{34} $$

Under these circumstances it can be shown that the increment in TRPO is as given in Equation (35). The gradient of the expectation, ∇θ E_{E∼πθ}[R(E)], can be estimated in the same manner as in Equation (30) or Equation (32).
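Proximal policy optimization (PPO) [92] is another RL method that uses natural gradients. PPO replaces the bound in TRPO with a penalty term. An expression for PPO's objective function for a single episode is as shown below,

$$ J(\theta) = \sum_{t} \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} A(s_t, a_t \mid \theta) - \lambda D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\theta_{\mathrm{old}}}\big). \tag{36} $$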
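A penalized surrogate of this kind can be evaluated with a few lines of Python. The sketch assumes discrete action distributions, treats the K-L term as a single scalar supplied by the caller, and sums the probability-ratio-weighted advantages over the time steps of one episode; the function names and the numbers are hypothetical.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def ppo_penalty_objective(new_probs, old_probs, advantages, kl, lam):
    """Sum over t of (pi_new / pi_old) * A_t, minus lam times the K-L divergence."""
    surrogate = sum((pn / po) * adv
                    for pn, po, adv in zip(new_probs, old_probs, advantages))
    return surrogate - lam * kl


new_policy = [0.5, 0.5]   # hypothetical pi_theta(. | s)
old_policy = [0.25, 0.75]  # hypothetical pi_theta_old(. | s)
kl = kl_divergence(new_policy, old_policy)
objective = ppo_penalty_objective(
    new_probs=[0.5, 0.5], old_probs=[0.25, 0.75],
    advantages=[1.0, -2.0], kl=kl, lam=1.0)
```

Raising the coefficient λ lowers the objective for any policy that has moved away from the old one, which discourages large policy changes.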

Off-Policy Methods
Only on-policy algorithms have been discussed so far in this section, including TRPO and PPO. Nevertheless, policy gradient can also be applied for off-policy learning. Since such an algorithm would be trained for the optimal policy, the samples in the replay buffer (that are collected using earlier policies) can be recycled multiple times. This feature is a significant advantage of off-policy learning.
Policy gradient methods for off-policy learning can be implemented using importance sampling. Let f(x) be any function of the random variable x, which follows some distribution q(x). Importance sampling can be used to estimate the expectation E_{x∼q}[f(x)] as follows. Samples x are drawn from a more tractable distribution p(x). Using p(x)^{−1} q(x) as the weight for each sampled value of x, the weighted expectation E_{x∼p}[p(x)^{−1} q(x) f(x)] is computed from several such samples. This serves as the estimated value, i.e., E_{x∼q}[f(x)] = E_{x∼p}[p(x)^{−1} q(x) f(x)]. This approach is adopted in off-policy RL. Suppose samples in the replay buffer are based on the policy πθ. The gradient can be empirically estimated using some action distribution a ∼ p(a), as given in Equation (37).
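Importance sampling itself is easy to verify on a small discrete example. In the sketch below, the target distribution q, the sampling distribution p, the function f(x) = x, the sample count, and the seed are all hypothetical choices; each sample drawn from p is weighted by q(x)/p(x).

```python
import random


def importance_sampling_estimate(f, q, p, num_samples, rng):
    """Estimate E_{x~q}[f(x)] from samples x ~ p, weighting each by q(x) / p(x)."""
    support = list(range(len(p)))
    total = 0.0
    for _ in range(num_samples):
        x = rng.choices(support, weights=p)[0]
        total += (q[x] / p[x]) * f(x)
    return total / num_samples


q = [0.1, 0.2, 0.7]          # target distribution over x in {0, 1, 2}
p = [1 / 3, 1 / 3, 1 / 3]    # tractable sampling distribution
exact = sum(q[x] * x for x in range(3))
estimate = importance_sampling_estimate(lambda x: float(x), q, p, 20000,
                                        random.Random(0))
print(exact, estimate)
```

The estimate approaches the exact expectation as the number of samples grows, although its variance depends on how closely p resembles q.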

Actor-Critic Networks
Actor-critic methods combine policy gradient and value-based RL methods [93]. The actor-critic architecture consists of two learning agents, the actor and the critic (see Figure 11). From any environmental state, the actor is trained using policy gradient to respond with an action. The critic is trained with a value-based RL method to evaluate the effectiveness of the actor's output, which is then used to train the latter. Let us denote the actor's and the critic's parameters with the symbols θ_a and θ_c. The critic network can be modeled as a DQN, although it is trained with the advantage function as defined in Equation (13). When the environmental state is s_t(n), for every action a_t(n) the critic network provides as its output the value of the state-action pair, q(s_t(n), a_t(n)|θ_c). The actor is incremented using gradient ascent,

$$ \theta_a \leftarrow \theta_a + \eta \frac{1}{N} \sum_{n=1}^{N} \sum_{t=0}^{T-1} q\big(s_t(n), a_t(n) \mid \theta_c\big) \nabla_{\theta_a} \log \pi_{\theta_a}\big(a_t(n) \mid s_t(n)\big). \tag{38} $$

Equation (38) closely resembles the policy gradient increment in Equation (30) (with b = 0). The only difference is that the critic is used to compute the gradient's weight q(s_t(n), a_t(n)|θ_c).
The update rule for the critic network, Equation (39), is similar to Equation (22). The advantage actor-critic (A2C) algorithm [94] is very effective in reducing the variance in the policy gradient algorithm of the actor. The A2C architecture entails two improvements over the 'vanilla' actor-critic method, which are outlined below.
(i) The actor network uses an advantage function A(s_t, a_t), which is the difference between a return value R and the value of the state, v(s_t|θ_c). Accordingly, the critic is trained to approximate the value function.
(ii) The return R is computed using a τ-step lookahead feature, where the log-gradient is weighted using the sum of the next τ rewards.
To better understand how the τ-step lookahead works, let us turn our attention to Equation (30). In this expression the gradient ∇θ log πθ(a_t|s_t) at time instant t is weighted by the factor (R(E) − b), where R(E) is the return of an entire episode from t = 0 until T − 1, so that R(E) = r_0 + γr_1 + . . . + γ^t r_t + . . . + γ^{t+τ} r_{t+τ} + . . . + γ^{T−1} r_{T−1}. The baseline b is the value v(s_t|θ_c). It is reasoned that the sum of the past rewards r_0 + γr_1 + . . . + γ^{t−1} r_{t−1} does not have any bearing on the quality of the action a_t taken at the instant t. Hence all past rewards are dropped from R. Furthermore, rewards received in the distant future, i.e., more than τ instants after t, are also dropped. In other words, R consists of the sum of the discounted rewards between the instant t and the instant t + τ. The actor's update rule is then given by Equation (40), and the critic is updated using the same return R, in accordance with Equation (41).

The asynchronous advantage actor-critic (A3C) method [94] is an extension of A2C that can be applied in parallel processing environments. A global network and a set of 'workers' are maintained in A3C. Each worker receives the actor and critic parameters, applies them to its own independent environment, and collects reward signals. The rewards are then used to determine increments ∆θ_a and ∆θ_c, which are then used to asynchronously update the parameters in the global network. An advantage of A3C is that, due to the parallel action of multiple workers, an experience replay buffer does not have to be incorporated.
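The τ-step lookahead return can be sketched as follows. The code assumes that discounting is measured relative to the instant t and that the window is truncated at the end of the episode; it omits any bootstrapped value term, since the text describes R as a sum of rewards only. The function names and numbers are hypothetical.

```python
def tau_step_return(rewards, t, tau, gamma):
    """Discounted sum of rewards from instant t through t + tau (episode-truncated)."""
    end = min(t + tau, len(rewards) - 1)
    return sum(gamma ** (k - t) * rewards[k] for k in range(t, end + 1))


def tau_step_advantage(rewards, state_value, t, tau, gamma):
    """Advantage estimate R - v(s_t) used to weight the actor's log-gradient."""
    return tau_step_return(rewards, t, tau, gamma) - state_value


rewards = [1.0, 1.0, 1.0, 1.0]   # hypothetical rewards of one episode
R = tau_step_return(rewards, t=1, tau=2, gamma=0.5)
A = tau_step_advantage(rewards, state_value=1.0, t=1, tau=2, gamma=0.5)
```

A small τ reduces the variance of R at the price of ignoring rewards that arrive later in the episode.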
The deterministic policy gradient (DPG) algorithm was described in [85], and more recently in [SLH+14]. It was later extended to a deep framework in [95], known as the deep deterministic policy gradient (DDPG). DDPG is an off-policy actor-critic method that concurrently learns the optimal Q-function q*, as well as the optimal policy π*.
In any off-policy actor-critic model, the critic must be trained with respect to the optimal policy π*. Hence, the term q(s_{t+1}, a_{t+1}) in Equation (39) should be replaced with the maximum over all actions, q(s_{t+1}, a*), where the optimal action is a* = argmax_{a∈A} q(s_{t+1}, a) as in Equation (19). Unfortunately, when the action space A is continuous (|A| = ∞), an exhaustive search to find a* is impossible. Moreover, in a majority of applications, using numerical optimization to obtain a* is computationally too intensive to be used within the training algorithm.
To circumvent these difficulties in identifying the action that maximizes the Q-value, three options are available, which are outlined below.
(i) q(s_{t+1}, a) can be sampled for several different actions, and a* can be assigned the action corresponding to the sample maximum [96].
(ii) A convex approximation of q(s, a) around s_{t+1} can be devised and a* obtained over the approximate function [97].
(iii) A separate off-policy policy network can be used to learn the optimal policy π* [98].
Of these three options, the third has been adopted in DDPG. The critic parameter θ_c is updated in accordance with Equation (42). In Equation (42), π(s_t(n)|θ_a) is the output of DDPG's actor. DDPG uses a replay buffer B that includes samples from older policies. The actor's parameter θ_a is trained using any off-policy policy gradient as in Equation (37).
One of the drawbacks of DDPG is the problem of overestimation [99]. Suppose that during the course of training, the function q(s, a|θ_c) acquires a sharp local peak. Under these circumstances, further training would converge towards this local optimum, leading to undesirable results. This issue has been tackled by the twin delayed deterministic policy gradient (TD3) in [80]. TD3 maintains a pair of critics whose parameters we shall denote as θ_c1 and θ_c2, or more concisely as θ_ci, where i ∈ {1, 2}. In Equation (42), it can be seen that DDPG has a target r_t + q(s_{t+1}(n), a_{t+1}(n)|θ_c), where q(s_{t+1}(n), a_{t+1}(n)|θ_c) is obtained from the critic. TD3 has two targets, q(s_{t+1}(n), a_{t+1}(n)|θ_ci), i ∈ {1, 2}. The actions a_{t+1}(n) in TD3 are clipped to lie within the interval [a_min, a_max]. In order to increase exploration, Gaussian noise is added to this action. Finally, the target is obtained as r_t + min_{i∈{1,2}} q(s_{t+1}(n), a_{t+1}(n)|θ_ci), which is used for training.
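The target computation described for TD3 can be sketched as below. The order of operations (noise first, clipping second), the noise scale, and the default γ = 1, which matches the undiscounted form of the target written in the text, are assumptions for illustration; the function names and numbers are hypothetical.

```python
import random


def noisy_clipped_action(action, a_min, a_max, noise_std, rng):
    """Add Gaussian noise to the action, then clip it to [a_min, a_max]."""
    noisy = action + rng.gauss(0.0, noise_std)
    return max(a_min, min(a_max, noisy))


def td3_target(reward, q1_next, q2_next, gamma=1.0):
    """Target r + gamma * min(q1, q2) built from the two critics' outputs."""
    return reward + gamma * min(q1_next, q2_next)


rng = random.Random(0)
a_next = noisy_clipped_action(0.9, a_min=-1.0, a_max=1.0, noise_std=0.5, rng=rng)
# q1_next and q2_next stand for the two critics evaluated at (s_{t+1}, a_next).
target = td3_target(reward=1.0, q1_next=2.0, q2_next=3.0)
```

Taking the smaller of the two critic outputs counteracts the overestimation that a single critic with a sharp local peak would introduce.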
The soft actor-critic (SAC) method proposed recently in [81,100] is an off-policy RL approach. The striking feature of SAC is the presence of an entropy term in the objective function, Equation (43). Incorporating the entropy H(πθ|s_t) in Equation (43) increases the degree of randomness in the policy, which helps in exploration. As with TD3, SAC uses two critic networks.

Use of Reinforcement Learning in Home Energy Management Systems
This section addresses aspects of the survey on the use of RL approaches for various HEMS applications. All articles in this survey appeared in established technical journals and were published or made available online within the past five years.

Application Classes
In this study, all applications were divided into five classes as in Figure 12 below. [105,106]. Wherever applicable, EV and ES must be charged in coordination with renewable generation (RG) such as solar panels and wind turbines. The aim is to make decisions in order to save energy costs, while addressing comfort and other consumer requirements. Thus, EV, ES, and RG have been placed under a single class for the purpose of this survey.
(iii) Other Loads: Suitable scheduling of several home appliances such as dishwashers, washing machines, etc., can be achieved through HEMS to save energy usage or cost. Lighting schedules are important in buildings with large occupancy. These loads have been lumped into a single class.
(iv) Demand Response: With the rapid proliferation of green energy sources into homes and buildings, and with these sources merged into the grid, demand response (DR) has acquired much research significance in HEMS. DR programs help in load balancing by scheduling and/or controlling shiftable loads and by incentivizing participants [107,108] to do so through HEMS. RL for DR is one of the classes in this survey.
(v) Peer-to-Peer Trading: Home energy management has been used to maximize the profit of prosumers by trading electricity with each other, either directly in peer-to-peer (P2P) trading or indirectly through a third party as in [109]. Currently, theoretical research on automated trading is receiving significant attention. P2P trading is the fifth and final application category considered in this survey.
Each application class is associated with an objective function and a building type that are discussed in subsequent paragraphs. The schematic in Figure 13 shows all links that have been covered by the articles in this survey.
Figure 14 shows the number of research articles that applied RL to each class. Note that a significant proportion of these papers addressed more than one class. More than a third of the papers we reviewed focused only on HVAC, fans and water heaters. Just above 10% of the papers studied RL control for energy storage (ES) systems. Only 7% of the papers focused on energy trading. The largest share of the papers (46%), however, targeted more than one class.

Objectives and Building Types
Within these HEMS applications, RL has been applied in several ways. It has been used to reduce energy consumption within residential units and buildings [110]. It has also been used to achieve a higher comfort level for the occupants [111]. In operations at the interface between the residential units and the energy grid, RL has been applied to maximize prosumers' profit in energy trading as well as for load balancing.
For this purpose, we break down the objectives into three different types as listed below.
(i) Energy Cost: This is the cost incurred by the consumer in using any electrical device, which in most cases is proportional to the device's energy consumption. In this paper we use the terms 'cost' and 'consumption' interchangeably.
(ii) Occupant Comfort: The main factor that can affect the occupant's comfort is thermal comfort, which depends mainly on the room temperature and humidity.
(iii) Load Balance: Power supply companies try to achieve load balance by reducing the power consumption of consumers at peak periods to match the station power supply. The consumers are motivated to participate in such programs by price incentives.
Figure 13 illustrates the RL objectives that were used in each application class. Next, all buildings and complexes were categorized into the following three types.
(i) Residential: For the purpose of this survey, individual homes, residential communities, as well as apartment complexes fall under this type of building.
(ii) Commercial: These buildings include offices, office complexes, shops, malls, hotels, as well as industrial buildings.
(iii) Academic: Academic buildings range from schools, university classrooms, buildings, and research laboratories up to entire campuses.
The research literature in this survey revealed that for residential buildings, RL was applied in all five application classes. However, in the case of commercial and academic buildings, RL was typically applied to the first three categories, i.e., to HVAC, fans and WH, to EVs, ESs and RGs, as well as to other loads. This is shown in Figure 13.
Figure 15 illustrates the outcome of this survey. It may be noted that in the largest proportion of articles (42%) the RL algorithm took into account both cost and comfort. About 27% of all articles addressed cost as the only objective, thereby defining the second largest proportion.

Deployment, Multi-Agents, and Discretization
The proportion of research articles where RL was actually deployed in the real world was studied. It was found that only 12% of research articles report results where RL was used with real HEMS. The results are consistent with an earlier survey [49] where this proportion was 11%. The results are shown in Figure 16.

Reinforcement Learning Algorithms in Home Energy Management Systems
This section focuses on how the RL and DRL algorithms described in earlier sections were used in HEMS applications. The references have been categorized in terms of the application class, objective function, and building type that were described in the immediately preceding section. Table 1 provides a list of references that used tabular RL methods. About 28% of articles used tabular methods.

Table 1 entries (row and column layout not recoverable): EV, ES, and RG [122,123]; Mixed/NA [124]; Other; Residential [125,126]; Other/Mixed; Cost and Comfort; Commercial [127]; Academic [107,128-132]; Residential [133]; Other [134,135]; Cost [136]; Mixed/NA [137]; Cost and Comfort [138,139]; Cost and Load Balance [140]; Other.

In a similar manner, Table 2 considers references that used DQN. Most algorithms in the survey used DQN. However, DDQN was also popular in the HEMS research community. The survey found that dueling-DQN was applied in only one article. Table 3 categorizes references in the survey that used deep policy learning. PPO and TRPO are the only approaches that have been used so far in HEMS. The survey also indicates that actor-critic was the preferred approach in comparison with deep policy learning. Table 4 provides a list of references that applied actor-critic learning, which constituted 53% of all deep learning methods. The survey shows that PPO is more popular than TRPO. We believe that this observation is due to the closer recency of the former algorithm. References that used either a combination of two or more approaches, or any other approach not commonly used in RL literature, are shown in Table 5.

Remaining table entries (row and column layout not recoverable): Other/Mixed [182,183]; Cost and Load Balance [184]; Cost [185]; EV, ES, and RG [186]; Other/Mixed; Cost and Comfort; Academic [187]; Other [188,189]; EV, ES, and RG; Commercial [190-192]; HVAC, Fans, WH; Cost and Comfort; Mixed/NA [193-195]; EV, ES, and RG; Other [196,197]; Other/Mixed; Cost and Load Balance; Residential; SAC [198,199]; HVAC, Fans, WH; Cost; Commercial [103,200-202]; Cost and Comfort [203]; Other/Mixed [204]; Academic [205-207]; HVAC, Fans, WH.

Conclusions
This article surveys how effectively RL has been leveraged for various HEMS applications. The survey reveals the following:
(i) Although 66% of all articles used deep RL, many articles used tabular learning. This may indicate that only simplified applications were considered.
(ii) Around 53% of all articles used discrete states and actions. This is another indication that the HEMS scenarios may have been simplified.
(iii) Only around 12% of all approaches covered in this survey were deployed in the real world, the rest being limited to simulation platforms.
These observations strongly suggest that the use of RL in HEMS applications is at a research stage and is yet to gain maturity. More in-depth investigation is necessary, particularly on RL algorithms that use DNN agents. Nonetheless, it was seen that 36% of all articles made use of multiagent schemes, which is an encouraging sign.
The only truly viable alternative to RL is nonlinear control, more specifically model predictive control (MPC) [224]. MPC is widely used in various engineering applications (cf. [225]). The benefit of MPC lies in the explicit manner in which it handles physical constraints. At each iteration, MPC considers a receding time horizon into the future and applies a constrained optimization algorithm to determine the best control actions. However, in most cases, MPC uses linear or quadratic objective functions. This is a basic limitation that must be taken into account before applying MPC to large-scale problems, and it is in sharp contrast to RL, which does not place any restriction on the reward signal. Moreover, MPC is a model-based approach, whereas an overwhelming majority of references in this survey used model-free RL methods ([149] being the sole exception).
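The receding-horizon idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from any surveyed article: a one-room linear thermal model with hypothetical coefficients `LEAK` and `HEAT_GAIN`, an ON/OFF heater, a comfort band treated as a hard constraint, and exhaustive enumeration standing in for the constrained optimizer that a practical MPC would use.

```python
from itertools import product

LEAK = 0.1        # hypothetical heat-loss coefficient per step
HEAT_GAIN = 1.5   # hypothetical temperature rise per step with the heater ON
COMFORT = (20.0, 24.0)  # hypothetical comfort band (hard constraint)

def next_temp(temp, outdoor, u):
    """Assumed linear thermal model used by the controller as its plant model."""
    return temp + LEAK * (outdoor - temp) + HEAT_GAIN * u

def mpc_step(temp, outdoor, prices):
    """Return the first action of the cheapest feasible ON/OFF plan.

    The horizon equals len(prices); all 2**horizon plans are enumerated.
    """
    best_cost, best_plan = float("inf"), None
    for plan in product((0, 1), repeat=len(prices)):
        t, cost, feasible = temp, 0.0, True
        for price, u in zip(prices, plan):
            t = next_temp(t, outdoor, u)
            if not (COMFORT[0] <= t <= COMFORT[1]):
                feasible = False   # constraint violated: discard this plan
                break
            cost += price * u      # linear objective: energy cost only
        if feasible and cost < best_cost:
            best_cost, best_plan = cost, plan
    # Fallback chosen for this sketch: heat if no plan satisfies the constraints.
    return best_plan[0] if best_plan is not None else 1

def run_mpc(temp, outdoor, prices, steps, horizon=4):
    """Closed loop: re-plan at every step and apply only the first action."""
    history = []
    for k in range(steps):
        u = mpc_step(temp, outdoor, prices[k:k + horizon])
        temp = next_temp(temp, outdoor, u)
        history.append((u, temp))
    return history
```

With a price spike at the second step, for example `mpc_step(22.0, 10.0, [1, 5, 1, 1])`, the controller heats immediately so that it can stay OFF during the expensive step. Note that the controller needs the thermal model explicitly, which is the model-based property contrasted with model-free RL above.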
There is a diverse array of algorithms available in the RL literature. Since tabular methods require discrete state and action spaces, and furthermore require these spaces to have low cardinalities, they may not be of much use for most HEMS applications. Not surprisingly, this survey shows that tabular methods have been used less frequently than DNN methods. In the future, as the HEMS community investigates increasingly complex HEMS domains, tabular methods will become even less likely to be used. Consequently, the choice of algorithm will usually be confined to DNN methods.
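The cardinality argument can be made concrete with a short calculation. The discretizations below are hypothetical and chosen only to show how quickly a Q-table grows; they do not come from any surveyed article.

```python
from math import prod

def q_table_entries(state_bins, n_actions):
    """Number of Q-values a tabular method must store and visit.

    state_bins: number of discrete levels per state variable.
    n_actions:  number of discrete (joint) actions.
    """
    return prod(state_bins) * n_actions

# Hypothetical small problem: indoor temperature (10 bins), hour of day (24),
# one ON/OFF device (2 actions).
small = q_table_entries([10, 24], 2)

# Hypothetical larger problem: two temperatures (50 bins each), 15-minute time
# slots (96), battery and EV state of charge (20 bins each), and two devices
# with 5 power levels each (5 * 5 joint actions).
large = q_table_entries([50, 50, 96, 20, 20], 5 * 5)

print(small)  # 480
print(large)  # 2400000000
```

A table with a few hundred entries is easy to learn, whereas one with billions of entries cannot be visited often enough for the estimates to be meaningful, which is where function approximation with a DNN becomes the practical choice.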
Among the DNN methods, it must be noted that DQN and its derivatives can be used only when the action space is finite and small, such as in controlling ON-OFF switches. The survey reveals that actor-critic methods, which combine value learning and policy learning, are the most popular in HEMS applications. Another deciding factor is whether to use on-policy or off-policy RL. On-policy learning may be used in applications where abandoning the policy in the initial stages may occasionally have a severe negative impact on the environment. Thus, it may be used if the environment does not require too much exploration. On the other hand, off-policy RL can discover more novel policies.
Unlike in unsupervised and supervised learning, where simple performance metrics are readily available, performance evaluation in RL is an open problem [226]. A steadily increasing reward with iteration is the best indicator in any real application. The authors suggest that the following four criteria should be considered.
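The two distinctions in this paragraph can be sketched in a few lines. The sketch uses tabular-style temporal-difference targets for brevity. SARSA and Q-learning are chosen here as representative on-policy and off-policy methods, which is an assumption of this example, and all function names are hypothetical.

```python
def greedy_action(q_values):
    """Pick the action with the largest Q-value.

    This enumeration over actions is why DQN-style methods need a finite,
    small action set (e.g., index 0 = OFF, index 1 = ON).
    """
    return max(range(len(q_values)), key=lambda a: q_values[a])

def q_learning_target(reward, next_q, gamma=0.99):
    """Off-policy target: bootstrap from the best next action,
    regardless of which action the behavior policy will actually take."""
    return reward + gamma * max(next_q)

def sarsa_target(reward, next_q, next_action, gamma=0.99):
    """On-policy target: bootstrap from the action actually taken next."""
    return reward + gamma * next_q[next_action]

def td_update(q_value, target, alpha=0.1):
    """Move the current estimate a fraction alpha toward the target."""
    return q_value + alpha * (target - q_value)

# Hypothetical transition: reward -1.0, next-state Q-values for OFF and ON.
next_q = [1.0, 3.0]
off_policy = q_learning_target(-1.0, next_q, gamma=0.5)        # uses max -> 3.0
on_policy = sarsa_target(-1.0, next_q, next_action=0, gamma=0.5)  # uses taken action
```

When the next action taken is not the greedy one, the two targets differ: the on-policy target reflects the consequences of the exploratory behavior itself, whereas the off-policy target evaluates the greedy policy while another policy collects the data.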
(i) Saturation reward ($R_\infty$): the expected reward must be relatively high at saturation.
(ii) Variance at saturation ($\sigma_\infty$): the reward must not have excessive variance at saturation.
(iii) Exploration risk ($R_{\min}$): the minimum possible reward must not be so low that the environment is adversely affected. This is the risk associated with exploration and tends to occur during the initial exploratory stages of the RL training.
(iv) Convergence rate ($C$): the number of iterations before the reward starts to saturate should not be large.
Since the articles in this survey have always used some HEMS simulation platform, it is assumed that the RL algorithm can be run at least a few times. The above four performance metrics ($R_\infty$, $\sigma_\infty$, $R_{\min}$, $C$) proposed by the authors can be empirically estimated using Monte Carlo samples of such runs. Suppose the sequence of rewards obtained from the $i$th run is $R^i_0, R^i_1, \ldots, R^i_k, \ldots, R^i_{K^i_{\max}}$. Each $R^i_k$ is some reward and $k$ represents an iteration of the RL algorithm. The precise meanings of the terms (reward and iteration) are entirely dependent on the specific HEMS application, how the reward function is implemented, whether a replay buffer is used, and the RL algorithm.
A reward $R^i_k$ may be an aggregate return value, the instantaneous reward at the time horizon $T$, the reward at the last parameter update, etc. Likewise, the iteration index $k$ may be an instantaneous time step $t$ ($t \le T$). Alternatively, $k$ may refer to the number of times the training algorithm adjusts the model parameter $\theta$, flushes the replay buffer, etc. The exact meanings of the terms are left to the reader. However, it must be remembered that at the beginning of each run, all relevant model parameters should be reinitialized; that at the end of each run, after $K^i_{\max}$ iterations (indexed by $i$ since $K^i_{\max}$ may vary with the run), the RL training algorithm converges to a different final model parameter; and that $R^i_{K^i_{\max}}$ truly reflects the quality of the model. Moreover, it must be ensured that the algorithm terminates after $R^i_k$ attains saturation, i.e., when there is no perceptible gain from more iterations.
If the runs are indexed $i = 1, 2, \ldots, |I|$, where $I$ is the set of runs, the suggested performance metrics can be estimated empirically from these runs. In Equation (46), it is assumed that $\hat{R}_\infty$ is the estimated average value of $R_\infty$, determined in accordance with Equation (45). In some situations, it may be computationally too expensive to obtain multiple runs. In such cases, as well as when the RL is implemented on a real HEMS environment, $I$ may be a singleton set ($|I| = 1$). In this case, $\sigma_\infty$ in Equation (46) is meaningless. An alternate metric may be obtained by using the last few iterations before termination.
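A minimal sketch of how the four metrics might be estimated from logged runs is given below. The estimators are assumptions of this sketch and may differ from the article's Equations (45) and (46), which are not reproduced here: the saturation reward is taken as the mean of the final rewards, the spread as their sample standard deviation, the risk as the minimum reward over all runs and iterations, and the convergence rate as the mean run length (relying on each run terminating at saturation). For a single run, the spread of the last `tail` rewards is used as the alternate metric; `rl_metrics` and `tail` are hypothetical names.

```python
from statistics import mean, stdev

def rl_metrics(runs, tail=10):
    """Estimate (R_inf, sigma_inf, R_min, C) from reward sequences.

    runs: list of reward sequences, one per independent run; each run is
          assumed to have been terminated once its reward saturated.
    tail: number of final iterations used for the spread when only one run
          is available (that run must contain at least two such rewards).
    """
    finals = [run[-1] for run in runs]       # reward at the end of each run
    r_inf = mean(finals)                     # saturation reward
    if len(runs) > 1:
        sigma_inf = stdev(finals)            # spread across runs
    else:
        sigma_inf = stdev(runs[0][-tail:])   # single run: spread over the tail
    r_min = min(min(run) for run in runs)    # worst reward seen anywhere
    c = mean(len(run) for run in runs)       # average iterations to saturation
    return r_inf, sigma_inf, r_min, c

# Hypothetical logged rewards from three short runs.
runs = [[0, 1, 2, 4], [-1, 1, 3, 6], [0, 2, 5]]
r_inf, sigma_inf, r_min, c = rl_metrics(runs)
```

In a HEMS setting, the rewards would come from repeated training runs on the simulation platform, each started from freshly initialized model parameters.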

Figure 1. The quantities shown are associated with the transition from state $s_t$ with action $a_t$ and reward $r_t$.

Figure 2. Taxonomy of Deep Reinforcement Learning. The classification of all deep reinforcement learning methods described in this article is shown. Section 3.2 provides a description of each class. (See also [64].)

Figure 3. Deep Q-Network Layouts. One scheme uses a separate DNN for each action (top). The other scheme uses only one DNN that receives actions as another input (bottom).

Figure 4. Replay Buffer. Shown are the replay buffer, environment, and agent. The pathways involved during the agent's interaction with the environment (solid blue) and during training (dashed red) are shown.

Figure 5. Use of Target Network. The scheme used to correct temporal correlation is shown. Pathways for control (solid red), learning (dashed green), and intermittent copying (dashed, thick blue) are shown. The replay buffer has been omitted for simplicity.

Figure 6. Overestimation Bias. This example is used to illustrate the effect of overestimation bias (see text for complete explanation).

Figure 7. Double DQN. One DQN ($\theta_1$ or $\theta_2$) is picked at random and its Q value ($q_1$ or $q_2$) is used to obtain the target ($t$), which is used to train the other DQN. For simplicity, only the pathways involved in training are shown. The target pathways are depicted with dotted lines.

Figure 8. Dueling DQN. Shown is the dueling DQN architecture. The two outputs of the DNN are parametrized by $\theta_v$ and $\theta_A$. The target pathway (dotted green) is for training.

Figure 9. Policy Gradient with Baseline. Shown is the overall scheme used in REINFORCE with the baseline. There are different ways to implement the baseline.

Figure 10. K-L Divergence. Two Gaussian distributions (solid, blue) with low variance $\sigma$ (left) and high variance $\sigma$ (right) are shown, and $\theta = [\mu, \sigma]$. Incrementing $\mu$ by $\Delta\mu$ (dashed green) results in an equal change in the norm $\|\Delta\theta\|_1$, whereas $D_{KL}(\pi_\theta \,\|\, \pi_{\theta+\Delta\theta})$ is higher in the distribution appearing on the left. The smaller distance on the right is due to the greater overlapping region shown on top.

Figure 11. Actor-Critic Network. Shown is the overall schematic used in actor-critic learning, comprising an actor DNN and a critic DNN.

Figure 12. HEMS Applications. All applications of reinforcement learning in home energy management systems are classified into the five categories shown.
(i) Heating, Ventilation and Air Conditioning, Fans and Water Heaters: Heating, ventilation, and air conditioning (HVAC) systems alone are responsible for about half of the total electricity consumption [48,101-104]. In this survey, HVAC, fans and water heaters (WH) have been placed under a single category. Effective control of these loads is a major research topic in HEMS.
(ii) Electric Vehicles, Energy Storage, and Renewable Generation: The charging of electric vehicles (EVs) and energy storage (ES) devices, i.e., batteries are studied in the literature as in

Figure 13. Building Types and Objectives. The building type and the RL objective of each application class are shown. Note that the links are based on the existing literature covered in the survey. The absence of a link does not necessarily imply that the building type/objective cannot be used for the application class.

Figure 14. Application Classes. The total number of articles in each application class (left), as well as their corresponding proportions (right).

Figure 15. Objectives and Building Types. Proportions of articles in each objective (left) and building type (right).

Figure 16. Real-World, Multi-Agents, and Discretization. Proportions of articles deployed in real-world HEMS (left), using multi-agents (middle), and using discrete or continuous states/actions (right).

Figure 17. Proposed Performance Metrics. The four metrics across multiple runs for performance evaluation of an RL algorithm that have been suggested by the authors for HEMS and other practical applications. A typical trajectory obtained from a single run (dashed red), the average of multiple runs (solid green), and the variance (shaded light green) are shown. The quantity $R_{\min}$ is the minimum attained from all runs.

Table 1. References using Tabular Reinforcement Learning.

Table 2. References using Deep Q Networks.

Table 3. References using Deep Policy Networks.

Table 5. References using Combination of Methods and/or Miscellaneous Methods.