Real-Time Autonomous Residential Demand Response Management Based on Twin Delayed Deep Deterministic Policy Gradient Learning

With the roll-out of smart meters and the increasing prevalence of distributed energy resources (DERs) at the residential level, end-users rely on home energy management systems (HEMSs) that can harness real-time data and employ artificial intelligence techniques to optimally manage the operation of different DERs with the aim of minimizing the end-user's energy bill. In this respect, the performance of the conventional model-based demand response (DR) management approach may deteriorate due to the inaccuracy of the employed DER operating models and the probabilistic modeling of uncertain parameters. To overcome these drawbacks, this paper develops a novel real-time DR management strategy for a residential household based on the twin delayed deep deterministic policy gradient (TD3) learning approach. This approach is model-free, and thus does not rely on knowledge of the distribution of uncertainties or the operating models and parameters of the DERs. It also enables learning of neural-network-based, fine-grained DR management policies in a multi-dimensional action space by exploiting high-dimensional sensory data that encapsulate the uncertainties associated with the renewable generation, appliances' operating states, utility prices, and outdoor temperature. The proposed method is applied to the energy management problem for a household with a portfolio of the most prominent types of DERs. Case studies involving a real-world scenario, with comprehensive comparisons against state-of-the-art deep reinforcement learning (DRL) methods, validate the superior performance of the proposed method in reducing the household's energy costs while coping with the multi-source uncertainties.


Background and Motivation
The energy sector is currently undergoing a fundamental transition, with the major agenda being the building of a low-carbon future. To achieve this goal, a large-scale integration of renewable energy sources (RESs) at the generation side and electrification of transport and heat technologies at the demand side have been witnessed. However, significant challenges have emerged alongside this transition for electricity systems worldwide, since RES generation is inherently characterized by its intermittency and uncontrollability, while the introduction of electric vehicles (EVs) and heating loads leads to more variable electrical demand profiles and to demand peaks that grow disproportionately faster than the overall energy consumption [1]. To address these challenges, an urgent need to enhance the flexibility of electricity systems has arisen in order to achieve the required generation and demand balancing in a cost-effective manner.

Table 1. Summary of the existing methods used in optimal demand response (DR) management.

| Category | Key Features | Modeling Method | Advantages or Limitations |
| --- | --- | --- | --- |
| Model-based | Relies on full knowledge of the distributed energy resources' (DERs') operating models and parameters; relies on an accurate forecast of exogenous parameters; unable to deal with the multi-source uncertainties effectively and efficiently | Deterministic | Unable to deal with uncertainties |
| | | SP | Unable to accurately estimate the probability distribution of uncertain parameters; computationally inefficient |
| | | RO | Leads to overly conservative solutions |
| Model-free | Requires no full system identification and no a priori knowledge of the system; employs data-driven and machine learning approaches to learn a generalizable DR strategy; computationally efficient at deployment | RL | Unable to deal with problems with high-dimensional continuous states and/or action spaces |
| | | DRL | Capable of handling high-dimensional continuous states and action spaces; effective learning of fine-grained control policies |

The first category focuses on model-based DR management. In [3-7], a deterministic energy cost minimization problem is formulated and solved to determine the optimal day-ahead schedule of the different loads of the end-users. However, model-based management approaches require full knowledge of the appliances' operational models and parameters. Furthermore, such an optimization requires accurate forecasts of exogenous parameters, such as the utility price patterns and weather-related PV generation. As a result, the inevitable inaccuracy of the adopted operational model (due to the lack of expert domain knowledge) and of the exogenous forecasts degrades the quality (i.e., cost-effectiveness) of the obtained DR management strategy.
While the above deterministic optimization approach neglects the intrinsic uncertainties in the DR management problem, the scenario-based stochastic programming (SP) and robust optimization (RO) approaches have been widely adopted to deal with these uncertainties. SP generally employs statistical distributions to represent the uncertainties, whereas RO represents them as feasible sets. In [8], Monte-Carlo simulation was employed to generate scenarios for the uncertainty associated with the utility prices, and an SP problem was subsequently solved for optimal DR management. The authors of [9] took into account the uncertainties associated with an EV's scheduling availability and the solar photovoltaic (PV) production in an SP model to minimize the expected energy cost for the end-user. In [10], a chance-constrained optimization model was formulated to enforce the probabilistic satisfaction of the temporally coupled constraints of flexible loads. In [11], a Lyapunov optimization model was developed to minimize the energy and thermal discomfort costs of a smart home equipped with a smart heating, ventilation, and air conditioning (HVAC) system. An RO approach was adopted in [12] to minimize the worst-case cost, accounting for the uncertainties associated with the end-user's electricity usage behavior. However, the computational burden of SP increases drastically with the number of employed scenarios [13]. Though scenario reduction techniques have been commonly adopted to reduce the number of scenarios, significant challenges associated with identifying suitable statistical distributions and selecting a computationally manageable number of representative scenarios remain [13]. Furthermore, because RO hedges against the worst-case realization of the uncertain parameters, the obtained solution is often overly conservative [14].
In contrast, the second category focuses on model-free reinforcement-learning (RL)-based approaches, which have recently arisen as an attractive alternative to their model-based counterparts. In RL, an agent is trained to construct a near-optimal policy through repeated interaction with a black-box environment, i.e., without full system identification and without a priori knowledge of the environment. Furthermore, an RL agent can harness the increasing influx of data collected from Internet of Things sensors, and thus enables successive data processing and interpretation so as to train a representation of DR management strategies that generalize and cope with the environmental uncertainties. Finally, at deployment, a trained RL model is able to compute real-time DR management decisions within several milliseconds, constituting a computationally efficient tool for real-time energy management tasks.
Founded on these favorable properties, the application of different RL methods to residential DR management problems has recently been witnessed. Among them, the conventional Q-learning (QL) method constitutes the most popular approach, primarily owing to its simplicity. QL was employed in [15-17] for optimal appliance scheduling and in [18-21] for the management of an integrated PV and ES system. However, as a tabular method, QL is susceptible to the "curse of dimensionality". Concretely, it constructs a look-up table that discretizes both the state and action domains to estimate the Q-value function for every state-action pair. As a result, the feedback signal that the agent obtains regarding the influence of its actions on the environment is often distorted and may be uninformative. Moreover, the structure of the entire feasible action space may be adversely affected, which may contribute to sub-optimal policies. Furthermore, this dimensionality challenge is aggravated in the setting of the DR management problem, as both the state of the environment (e.g., the state of charge of the ES) and the agent's actions (e.g., charging/discharging power of the ES) are continuous and multi-dimensional. In light of these limitations, the fitted Q-iteration (FQI) method was applied to schedule thermostatically controlled loads [22,23], an electric water heater [24], and EVs and ES [25]. FQI employs a regression model (based on handcrafted features) to approximate the Q-value function. However, FQI involves training the regression model over hundreds of iterations, and is thus inefficient for use in synergy with a complex regressor, such as a deep neural network [26].
More recently, there has been a growing interest in combining RL and deep learning. Deep RL (DRL) techniques promise effective learning of more sophisticated and fine-grained control policies than those achieved by traditional RL methods founded upon look-up tables or shallow regression models [26]. In this regard, the deep Q network (DQN) method constitutes the most popular approach. The DQN has been applied to perform DR management for shiftable loads [27,28], EVs [25,29], ES [30-32], and HVAC systems [33]. Rather than using a look-up table, the DQN relies on a deep neural network (DNN) to approximate the Q-value function. As such, the DQN promises effective learning in multi-dimensional continuous state spaces. Nevertheless, it performs poorly in problems with continuous action spaces because the DNN can only output discrete Q-value estimates rather than the continuous action itself [34]. For instance, the management actions for ES in [31] were assumed to be fully charging, fully discharging, or staying idle. This design significantly restrains the flexibility potential of ES and hampers the application of the DQN in addressing the investigated problem.
Going further, research efforts have been devoted to developing DRL methods for continuous control. The deep policy gradient (DPG) method was introduced in [27,28]. The DPG employs a DNN to directly estimate the action selection probability at a given state, rather than estimating the Q-value function of taking an action at a given state. However, the actions considered in [27,28] are restricted to the on/off status of different flexible-load devices, whereas their load schedules are actually optimized through the solution of a cost minimization problem at the learned on/off status. In addition, the DPG is often criticized for its low sampling efficiency and the high variance in its gradient estimates, which lead to slow convergence [35]. To overcome this drawback, Refs. [36,37] applied the deep deterministic policy gradient (DDPG) method in order to optimize the schedules of different appliances. The DDPG is an actor-critic DRL method, which estimates both the policy and its associated Q-value during training. As a result, it substantially reduces the variance in the gradient estimates and contributes to better convergence performance. However, a common limitation of the DDPG is that the learned Q-value function may overestimate the true Q-value, leading to sub-optimal policies [38].

Contributions
This paper addresses the bottlenecks of previously employed model-free approaches by proposing a novel real-time DR management system based on the twin delayed deep deterministic policy gradient (TD3) method, which builds on and improves the performance of the DDPG method. To the best of the authors' knowledge, this is the first application of TD3 to an optimal DR management problem. The value of the proposed DR management system is demonstrated through case studies using real-world system data. The novel contributions of this paper are threefold:
- A Markov decision process (MDP) is constructed to formulate the optimal DR management problem for a residential household operating with multiple and diverse DERs, including PV generators, ES units, and three types of flexible demand (FD) technologies, namely an EV with flexible charging and vehicle-to-grid (V2G)/vehicle-to-home (V2H) capabilities, wet appliances (WAs) with deferrable cycles, and HVAC with certain comfortable temperature margins.
- A model-free and data-driven approach based on TD3, which does not rely on any knowledge of the DERs' operational models and parameters, is proposed to optimize the real-time DR management strategy. In contrast to previous works where the DR management problem was addressed by employing discrete control RL methods, the TD3 method allows learning of neural-network-based, fine-grained DR management policies in a multi-dimensional action space by harnessing high-dimensional sensory data that also encapsulate the system uncertainties.
- Case studies on a real-world scenario substantiate the superior performance of the proposed method, which is more computationally efficient and achieves a significantly lower daily energy cost than the state-of-the-art DRL methods, while coping with the uncertainties stemming from both the electricity prices and the supply and demand sides of an end-user's DERs.

Paper Organization
The rest of the paper is structured as follows. Section 2 presents the system model and problem formulation. Section 3 details the proposed TD3-based DR management algorithm, and its effectiveness is verified with simulation results in Section 4. Finally, Section 5 concludes the paper.

System Model and Problem Formulation
The investigated residential household with an HEMS managing a portfolio of assorted DERs is shown in Figure 1. The installation of an on-site, non-dispatchable PV generator and an integrated ES unit can supply the household's electricity demand in addition to acquiring some revenue by selling surplus PV power to the grid. The appliances of a household can generally be separated into two categories: non-shiftable and shiftable. The shiftable appliances can be further sub-categorized as interruptible and non-interruptible. The power demand of the non-shiftable appliances (e.g., lighting loads) must be supplied by the HEMS without any delay when they are active. On the other hand, the HEMS can delay the consumption of the shiftable appliances. WAs (e.g., washing machine, dishwasher, tumble dryer), or deferrable appliances, constitute the most representative types of the non-interruptible appliances. Their load cycles are flexible in scheduling within a specific time window but cannot be interrupted or altered. In contrast, EVs and HVAC are characterized as interruptible appliances whose operation times and energy usage can be flexibly adjusted upon satisfying some specific operating constraints, e.g., the traveling energy requirement constraint for an EV and the comfortable indoor temperature range constraint for HVAC.
The HEMS is assumed to operate in slotted time steps, i.e., $t \in [1, T]$ with a temporal resolution $\Delta t = 0.5$ h, where $T = 48$ is the total number of time steps in an investigated day. At each time step, the HEMS manages the charging/discharging power of the EV and ES and the power demand of the WAs and HVAC based on high-dimensional sensory data comprising the non-shiftable load, PV generation, outdoor temperature, the state of charge of the ES and EV, and the utility buy/sell prices, with the aim of minimizing the daily energy cost of the household while maintaining a comfortable indoor temperature range. Next, we present the operating models of the EV, ES, WAs, and HVAC, as well as the model-based daily cost minimization problem for their management.

Electric Vehicle (EV) and Energy Storage (ES)
The EV is flexible in terms of the time periods in which it can acquire the energy needed for its operation, as long as this is completed within a scheduling interval allowed by its users. In addition, the EV can inject stored energy during this interval (i.e., it exhibits V2G/V2H capabilities). The charging/discharging power of the EV can be continuously regulated between 0 and a maximum level, and the EV needs to fulfill an energy requirement for the envisaged journeys within the scheduling interval (with grid connection). Each EV is assumed to depart from its grid connection point only once within the horizon of the coordination problem (at step $t^{dep}$) and subsequently arrive back at its grid connection point only once during the same horizon (at step $t^{arr}$). The operating model of the EV includes the following constraints:
Constraint (1) corresponds to the EV battery's energy balance, taking into account the energy needed for commuting purposes as well as the losses caused by charging and discharging efficiencies.
Constraint (2) expresses the lower and upper bounds of the battery's energy content.
Constraints (3)-(4) represent the limits of the battery's charging/discharging power, which depend on its power capacity $P^{ev,max}$ and on whether the EV is available for scheduling ($A^{ev}_t = 1$) or not ($A^{ev}_t = 0$), while the binary variable $V^{ev}_t$ is employed to avoid simultaneous charging and discharging.
Finally, constraint (5) ensures that the EV is sufficiently charged upon departure to satisfy the commuting requirements of its users.
The operating model of the ES [39] is similar to that of the EV, except that the traveling energy requirement $E^{tr}_t$ and the scheduling availability $A^{ev}_t$ are irrelevant and are thus removed.
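As a minimal sketch of how the energy balance, the energy and power limits, and the mutual exclusion of charging and discharging described above interact, the following Python function advances a battery by one time step. The function name `ev_step`, the signed-power convention (positive for charging, negative for discharging), and all numeric defaults are illustrative assumptions, not the paper's formulation.

```python
def ev_step(energy, p_request, available, e_travel=0.0,
            p_max=6.0, e_min=0.0, e_max=30.0,
            eta_c=0.95, eta_d=0.95, dt=0.5):
    """Advance a battery by one step (all parameter values are hypothetical).

    energy    : stored energy at the start of the step (kWh)
    p_request : requested power (kW); > 0 charges, < 0 discharges
    available : True if the EV is connected and can be scheduled
    e_travel  : energy consumed by driving during this step (kWh)
    Returns (new_energy, p_charge, p_discharge).
    """
    # No charging or discharging while the EV is away from the connection point.
    p = max(-p_max, min(p_max, p_request)) if available else 0.0

    # A single signed request rules out simultaneous charging and discharging.
    p_charge = min(max(0.0, p), max(0.0, e_max - energy) / (eta_c * dt))
    p_discharge = min(max(0.0, -p), max(0.0, energy - e_min) * eta_d / dt)

    # Energy balance with charging/discharging losses and travel consumption.
    new_energy = (energy + eta_c * p_charge * dt
                  - p_discharge * dt / eta_d - e_travel)
    return new_energy, p_charge, p_discharge
```

Dropping `available` and `e_travel` (i.e., always passing `True` and `0.0`) gives the corresponding sketch for the ES, in line with the remark that these two quantities are irrelevant for it.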

Wet Appliances (WAs)
The operation of WAs is based on the execution of user-prescribed cycles, which comprise a fixed sequence of sub-processes, each with a generally fixed duration and power demand that cannot be altered [40]. Their flexibility is measured by the ability to defer these cycles up to a maximum delay limit set by their users. Without loss of generality, each WA is assumed to be activated by its users for one operational cycle per day, within the temporal horizon between the cycle's earliest initiation time $t^{in}$ and latest termination time $t^{ter}$. The operating model of the WAs includes the following constraints:
Constraint (6) ensures that the demand activity of the WAs can be carried out once at most during the time window determined by $t^{in}$ and $t^{ter}$.
Constraint (7) expresses that the power demand of the WAs at each time step depends on the initiation time, $A^{wa}_t$, $T^{dur}$, and $P^{cyc}_{\tau}$, $\forall \tau \in [1, T^{dur}]$.
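To illustrate how a fixed, non-interruptible cycle translates into a per-step demand profile once an initiation time is chosen, the following Python sketch places a cycle inside the allowed window. The helper names `feasible_starts` and `wa_demand`, the 1-indexed step convention, and the example cycle are assumptions made for illustration only.

```python
def feasible_starts(t_in, t_ter, t_dur):
    """Initiation steps for which a cycle of t_dur steps ends by step t_ter."""
    return list(range(t_in, t_ter - t_dur + 2))


def wa_demand(start_step, cycle_power, horizon=48):
    """Per-step power demand (kW) of one cycle starting at start_step (1-indexed).

    cycle_power lists the fixed demand of each sub-process step; the cycle
    runs without interruption once it has started.
    """
    if start_step < 1 or start_step + len(cycle_power) - 1 > horizon:
        raise ValueError("cycle does not fit inside the horizon")
    demand = [0.0] * horizon
    for tau, power in enumerate(cycle_power):
        demand[start_step - 1 + tau] = power
    return demand
```

For example, a three-step cycle with a window from step 10 to step 15 can start at steps 10 through 13; starting it any later would push its final step beyond the latest termination time.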

Heating, Ventilation, and Air Conditioning (HVAC)
The flexibility of the examined HVAC systems lies in the fact that an indoor temperature range can be specified by the users so that their thermal comfort is preserved. The representation of thermal comfort is a non-trivial task, as it depends on many diverse factors (e.g., air temperature, mean radiant temperature, relative humidity, and air speed). Following the practice of [41], a comfortable temperature range is employed as the representation of thermal comfort:
Equation (9) represents the dynamic thermal behavior of the heated/cooled space, following the first-order model presented in [41]:
where $C^{hvac}$ and $R^{hvac}$ are, respectively, the thermal capacity and resistance of the heated/cooled space, and $\eta^{hvac}$ is the energy efficiency of the HVAC system; this value is positive for cooling and negative for heating.
Equation (10) expresses the electric power limits of the HVAC system:
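As a hedged illustration of first-order thermal dynamics, the sketch below uses one commonly used discrete-time form, $H^{in}_{t+1} = \varepsilon H^{in}_t + (1-\varepsilon)(H^{out}_t - \eta^{hvac} R^{hvac} P^{hvac}_t)$ with $\varepsilon = e^{-\Delta t/(R^{hvac} C^{hvac})}$. This specific discretization, the function name `hvac_step`, and every numeric default are assumptions for illustration and may differ from the model in [41].

```python
import math


def hvac_step(h_in, h_out, p_hvac, r=2.0, c=2.0, eta=2.5, dt=0.5):
    """Indoor temperature after one step of an assumed first-order model.

    h_in, h_out : indoor / outdoor temperature (deg C)
    p_hvac      : electric power drawn by the HVAC system (kW)
    r, c        : thermal resistance (deg C/kW) and capacity (kWh/deg C)
    eta         : efficiency; positive for cooling, negative for heating
    dt          : step length (h)
    """
    eps = math.exp(-dt / (r * c))  # thermal inertia factor of the space
    return eps * h_in + (1.0 - eps) * (h_out - eta * r * p_hvac)
```

With `eta > 0`, drawing power lowers the next indoor temperature relative to the idle case, and with `eta < 0` it raises it, which matches the sign convention stated above.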

Daily Energy Cost Minimization
The net demand (positive)/generation (negative) $l_t$ of the household at step $t$ can be expressed as:
Finally, the daily energy cost minimization problem for the household can be formulated as:
where the operators $[\cdot]^{+/-} = \max/\min\{\cdot, 0\}$ indicate taking the maximum/minimum value between $\cdot$ and 0. The first term in (12) represents the cost of purchasing electricity from the grid, while the second term represents the revenue from selling excess PV production and ES and EV discharge to the grid. Note that problem (12)-(14) is a mixed-integer linear program (MILP) that provides a model-based DR management strategy that aims to minimize the daily energy cost, assuming full knowledge of the operating models and parameters of all the DERs and a perfect prediction of all the uncertain parameters. As such, the optimal cost in (12) can be treated as a lower bound on the cost (since the uncertainties are completely neglected), which later provides a theoretical baseline for the model-free DRL-based DR management strategy. As discussed in Section 1.2, the SP approach is computationally inefficient in optimizing the DR management strategy while dealing with the multi-source system uncertainties. To address this, we propose an alternative approach to the real-time DR management problem.
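The role of the $[\cdot]^{+}$ and $[\cdot]^{-}$ operators in the cost can be made concrete with a short Python sketch. It assumes that the cost of a step is the buy price applied to the positive part of the net demand plus the sell price applied to its negative part, each multiplied by the step length; the function names and the example prices are hypothetical.

```python
def step_cost(net_demand_kw, buy_price, sell_price, dt=0.5):
    """Energy cost of one step; a negative value is net revenue.

    net_demand_kw > 0 means importing from the grid, < 0 means exporting.
    """
    bought = max(net_demand_kw, 0.0)  # [l_t]^+ : energy purchased
    sold = min(net_demand_kw, 0.0)    # [l_t]^- : energy sold (non-positive)
    return (buy_price * bought + sell_price * sold) * dt


def daily_cost(net_demand, buy_prices, sell_prices, dt=0.5):
    """Sum of the step costs over all steps of the day."""
    return sum(step_cost(l, b, s, dt)
               for l, b, s in zip(net_demand, buy_prices, sell_prices))
```

Because the sold quantity is non-positive, the second term automatically enters the cost as a revenue, so no separate sign handling is needed.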

DR Management as an MDP
A finite Markov decision process (MDP) with discrete time steps is applied to formulate the real-time DR management problem. The time interval between two adjacent time steps is 30 min (the proposed approach can be readily extended to employ a finer temporal resolution). The HEMS constitutes the agent, while the environment is composed of the objects outside the agent (e.g., utility company, PV generator, non-shiftable loads, ES, EV), as shown in Figure 2. In the context of RL, an agent acts in an environment by sequentially taking actions over a sequence of time steps to maximize a cumulative reward. In general, RL can be described as an MDP that includes: (1) a state space $S$; (2) an action space $A$; (3) a transition dynamics distribution with conditional transition probability $p(s_{t+1} \mid s_t, a_t)$, which models the uncertainty in the evolution of the states of the environment based on the executed actions of the agent; and (4) a reward $r: S \times A \to \mathbb{R}$. The MDP formulation for the DR management problem is detailed below.
(1) State: The state $s_t$ at step $t$ received by the HEMS agent reflects the influence of its action on the status of the environment. $s_t$ is identified as an 11-dimensional vector $s_t = [t, \lambda^{+}, \ldots]$.
Based on $P^{evc}_t$ and $P^{evd}_t$, the energy content of the EV battery at the next time step, $E^{ev}_{t+1}$, can be written as (17).
Quantities $P^{esc}_t$, $P^{esd}_t$, and $E^{es}_{t+1}$ can be obtained following the ES operating model (Section 2.1) and the same logic as the above derivation for (15)-(17), except that the charging availability is not considered.
Based on the WA operating model presented in Section 2.2, the power demand of the WAs $P^{wa}_t$ is managed by action $a^{wa}_t$, and is also affected by the WA parameters $T^{dur}$, $A^{wa}_t$, and $P^{cyc}_{\tau}$.
Finally, on the basis of the HVAC operating model in Section 2.3, the indoor temperature at the next time step $H^{in}_{t+1}$ based on $P^{hvac}_t$ can be expressed as:
(3) Reward: Since the objective of the HEMS agent is to minimize the total energy cost of the household while maintaining a comfortable indoor temperature as well as ensuring the satisfaction of all DERs' operating constraints, the reward function is designed to include the following three components:
1) $r^{cost}_t$, which is the negative of the total energy cost of the household:
2) $r^{com}_t$, which serves as a penalty for indoor temperature deviation from a desirable range, with $\kappa_1$ denoting a positive weighting factor:
3) $r^{pen}_t$, which serves as a penalty for the constraint violations of the DERs, with $\kappa_2$ denoting a positive weighting factor:
Note that in (15)-(16), the charging and discharging power of the EV only respects the minimum/maximum power and energy limits of the EV, but does not ensure that its state-of-charge level is sufficient to cover the energy requirements for traveling, i.e., constraint (5) may not be satisfied. Furthermore, constraint (6) should be satisfied at the last initiation step to ensure the daily activation frequency of the WAs. To adequately account for these inter-temporal constraints of the EV and WAs, we introduce the penalty term $r^{pen}_t$ in the reward function.
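A rough Python sketch of how the three reward components described above could be combined is given next. The exact expressions are assumptions: the comfort term is taken as the distance of the indoor temperature from the comfortable range, the penalty term as a generic non-negative violation magnitude, and the overall reward as the plain sum of the three components; all names and default weights are hypothetical.

```python
def comfort_penalty(h_in, h_min, h_max, kappa1):
    """Negative reward proportional to the deviation from [h_min, h_max]."""
    deviation = max(h_in - h_max, 0.0) + max(h_min - h_in, 0.0)
    return -kappa1 * deviation


def reward(step_cost, h_in, h_min, h_max, violation,
           kappa1=1.0, kappa2=1.0):
    """Assumed total reward: cost term + comfort term + constraint penalty.

    step_cost : energy cost of the step (negative if net revenue)
    violation : non-negative size of any DER constraint violation, e.g.,
                missing EV energy at departure (hypothetical measure)
    """
    r_cost = -step_cost
    r_com = comfort_penalty(h_in, h_min, h_max, kappa1)
    r_pen = -kappa2 * violation
    return r_cost + r_com + r_pen
```

The weights trade off cost against comfort and feasibility: a larger `kappa2` makes the agent more reluctant to leave the EV undercharged at departure, at the expense of a higher energy bill.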
The final reward function $r_t$ can be expressed as:
(4) Performance and value functions: The agent employs a policy $\pi$ to interact with the MDP and emit a trajectory of states, actions, and rewards: $s_1, a_1, r_1, s_2, a_2, r_2, \ldots$ over $S \times A \times \mathbb{R}$. The agent's objective is to learn a policy that maximizes the cumulative discounted reward from the start state $s_1$, which is termed the performance function $J(\pi) = \mathbb{E}[R_1 \mid \pi] = \mathbb{E}_{s \sim \rho^{\pi}, a \sim \pi}[r]$, where $\rho^{\pi}$ denotes the discounted state distribution, $R_t = \sum_{l=t}^{T} \gamma^{(l-t)} r_l$ is the discounted reward, and $\gamma \in [0, 1]$ is the discount factor. Furthermore, the Q-value function $Q^{\pi}(s, a) = \mathbb{E}[R_1 \mid s_1 = s, a_1 = a; \pi]$ forms an estimate of the discounted reward given an action $a$ at state $s$ and following the policy $\pi$ from the succeeding states onwards.
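The discounted reward $R_t = \sum_{l=t}^{T} \gamma^{(l-t)} r_l$ can be computed for every step of a trajectory with a single backward pass, as in the following minimal Python sketch (the function name is an illustrative choice).

```python
def discounted_returns(rewards, gamma):
    """Return [R_1, ..., R_T] for a list of per-step rewards [r_1, ..., r_T]."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards using R_t = r_t + gamma * R_{t+1}, with R_{T+1} = 0.
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns
```

For instance, three unit rewards with $\gamma = 0.5$ give $R_1 = 1 + 0.5 + 0.25 = 1.75$, $R_2 = 1.5$, and $R_3 = 1$.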

Proposed TD3-Based DR Management Strategy
As discussed in Section 1.2, despite the popularity of applying QL and DQN to DR management problems in the existing literature, they both suffer, to some extent, from the curse of dimensionality driven by their need to discretize the state and/or the action spaces. Furthermore, the discretization may hinder the decision-making process of the HEMS agent, leading to sub-optimal DR management policies. The DPG method, in turn, is criticized for its low sampling efficiency and the high variance of its gradient estimator. In order to address these challenges, the proposed DR management strategy is founded on the TD3 method [38], the overall workflow of which is presented in Figure 3. TD3 builds on and improves the performance of the state-of-the-art DRL method for continuous control, i.e., DDPG.
TD3 features an actor-critic architecture, which employs (a) a parameterized critic network $Q_\theta$ that takes a state $s_t$ and action $a_t$ as input and outputs an estimate of the Q-value function $Q_\theta(s_t, a_t)$ and (b) a parameterized actor network $\mu_\phi$ that takes a state $s_t$ as input, implements a policy improvement task that updates the policy with respect to the estimated Q-value function, and outputs a continuous action $\mu_\phi(s_t)$. QL and DQN both feature a greedy maximization of the Q-value function for policy improvement, i.e., $\mu(s_{t+1}) = \arg\max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$. However, such a greedy strategy is intractable in a high-dimensional continuous action domain, since the Q-value function needs to be globally maximized at every step. Instead, TD3 utilizes the actor $\mu$ to produce an action $\mu_\phi(s_{t+1})$ for the next state. The critic is responsible for policy evaluation, i.e., criticizing the policy obtained by the actor by generating a Q-value estimate with temporal difference (TD) learning. This is achieved by minimizing the following regression loss function:
$$L(\theta) = \mathbb{E}\Big[\big(r_t + \gamma Q_\theta(s_{t+1}, \mu(s_{t+1})) - Q_\theta(s_t, a_t)\big)^2\Big] \qquad (24)$$
where $r_t + \gamma Q_\theta(s_{t+1}, \mu(s_{t+1}))$ denotes the target Q-value at step $t$.
Instead of globally maximizing $Q_\theta(s_t, a_t)$, the critic evaluates the gradients $\nabla_a Q_\theta(s_t, a_t)$, which indicate the directions in which the action should change to pursue higher estimated Q-values. As a result, the weights of the actor are updated in the direction of the performance gradient $\nabla_\phi J(\mu_\phi)$, which is derived according to the deterministic policy gradient theorem [42]:
$$\nabla_\phi J(\mu_\phi) = \mathbb{E}\Big[\nabla_a Q_\theta(s_t, a)\big|_{a = \mu_\phi(s_t)} \nabla_\phi \mu_\phi(s_t)\Big] \qquad (25)$$
Exploration vs. exploitation: Maintaining an effective trade-off between exploration and exploitation plays a vital role in effective RL. To aid the agent in exploring the environment thoroughly, an exploration/behavior policy $\tilde{\mu}(s_t)$ is constructed, which imposes a random Gaussian noise $\mathcal{N}_t(0, \sigma_t^2 I)$ on the actor's output $\mu_\phi(s_t)$:
$$\tilde{\mu}(s_t) = \mu_\phi(s_t) + \mathcal{N}_t(0, \sigma_t^2 I) \qquad (26)$$
It is well recognized that the learning process of RL tends to exhibit instability or even divergence when a DNN is employed as a nonlinear regressor for the Q-value function. To tackle such instability, previous works have put forward two tailored mechanisms.
Target Networks: Observe in (24) that the online network $Q_\theta$ is utilized for both the current Q-value estimation $Q_\theta(s_t, a_t)$ and the target Q-value $r_t + \gamma Q_\theta(s_{t+1}, \mu_\phi(s_{t+1}))$. As a consequence, the Q-value update is prone to oscillations. To remedy this instability, a target network [26] can be introduced for the actor and the critic, denoted as $\mu_{\phi'}(s_t)$ and $Q_{\theta'}(s_t, a_t)$, respectively. They are only used for evaluating the target values. Furthermore, the weights of these target networks can be updated by having them gradually track the weights of the online networks as $\theta' \leftarrow \upsilon\theta + (1 - \upsilon)\theta'$ with $\upsilon \ll 1$. Similar to the idea of temporarily freezing the Q-target value during training (in DQN), but modified for the actor-critic RL method, the rationale behind the soft update is to restrict the target values (for both actor and critic) to change slowly for enhanced learning stability.
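The soft update rule can be sketched in a few lines of Python, with network weights represented as flat lists of floats purely for illustration (real implementations operate on the tensors of a deep learning framework).

```python
def soft_update(target_weights, online_weights, upsilon):
    """Move each target weight a fraction upsilon toward its online weight.

    Implements theta' <- upsilon * theta + (1 - upsilon) * theta'.
    """
    return [upsilon * w_online + (1.0 - upsilon) * w_target
            for w_target, w_online in zip(target_weights, online_weights)]
```

With a small `upsilon`, many updates are needed before the target weights catch up with the online weights, which is what keeps the target values slowly varying.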
Experience Replay: Since the experiences are sequentially generated through the agent's interaction with the environment, there exists temporal correlation in these experiences, which can degrade machine learning models substantially. The employment of an experience replay buffer B [26] resolves this challenge. It is a cache that pools the past experiences and uniformly samples a minibatch for training the actor and critic. Mixing recent with previous experiences alleviates the temporal correlations of the sampled experiences. Furthermore, it enables samples to be reused, which enhances the sampling efficiency.
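A minimal Python sketch of such a buffer with uniform sampling is shown below. The class name, the fixed capacity with first-in-first-out eviction, and the optional seed are assumptions made for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity cache of (s, a, r, s_next) tuples, sampled uniformly."""

    def __init__(self, capacity, seed=None):
        # A bounded deque silently drops the oldest experience when full.
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def store(self, state, action, reward, next_state):
        self._storage.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Draw batch_size distinct experiences uniformly at random."""
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)
```

Sampling across the whole buffer mixes experiences from different days and time steps in one minibatch, which is how the temporal correlation mentioned above is broken.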
Despite the remarkable success that DDPG has achieved in various power system and smart grid applications, it is often criticized for the overestimation bias of its Q-value function, which can result in sub-optimal policies [38]. TD3 is tailored to address this challenge by concurrently learning two Q-value functions instead of one.
Double Critic Networks: In RL methods focusing on learning the Q-value, such as QL and DQN, function approximation errors may arise, which can result in an overestimated Q-value and, consequently, sub-optimal policies [43,44]. Concretely, the target Q-values used by QL and DQN can be written as:
$$r_t + \gamma\, Q\big(s_{t+1}, \arg\max_{a_{t+1}} Q(s_{t+1}, a_{t+1})\big) \qquad (27)$$
$$r_t + \gamma\, Q_\theta\big(s_{t+1}, \arg\max_{a_{t+1}} Q_\theta(s_{t+1}, a_{t+1})\big) \qquad (28)$$
It can be observed that QL uses the same Q-table both to select (in the arg max operator) and to evaluate (calculate the target Q-value, which is subsequently used in the Q-value update) an action. Analogously, DQN uses the same set of neural network weights $\theta$ to both select and evaluate an action. This renders it more likely to select overestimated Q-values, leading to overoptimistic value estimates. Furthermore, such an overestimation bias can be propagated in time through the Bellman equation and can develop into a more significant bias after many updates if left unchecked. In the case of DDPG, since the policy $\mu_\phi$ is optimized with respect to the critic $Q_\theta$, using the same estimate in the target update of $Q_\theta$ can create a similar overestimation of $Q_\theta$. This may adversely affect the policy quality, since sub-optimal actions may be highly rated by a sub-optimal critic, reinforcing the selection of these actions in the subsequent policy updates.
To address the drawback of using a single Q-value estimator, we propose a variant of the Double QL method [43] and adopt it in the actor-critic setting in order to mitigate the risk of having an overestimated Q-value. To achieve this, we introduce two independently trained online critic networks $(Q_{\theta_1}, Q_{\theta_2})$ and their corresponding target networks $(Q_{\theta'_1}, Q_{\theta'_2})$. It is assumed that $Q_{\theta_1}$ and $Q_{\theta_2}$ are the potentially biased and the less biased Q-value estimates, respectively. Since a Q-value estimate that suffers from overestimation bias can be used as an approximate upper bound for the true value estimate, we use $Q_{\theta_1}$ as the upper bound of $Q_{\theta_2}$. Equivalently, this results in taking the minimum of these two estimates to obtain the target Q-value:
$$r_t + \gamma \min_{i=1,2} Q_{\theta'_i}\big(s_{t+1}, \mu_{\phi'}(s_{t+1})\big) \qquad (29)$$
Another potential failure in DDPG is that if the Q-function approximator develops an incorrect sharp peak for some actions, the policy will quickly overfit to such narrow peaks, leading to incorrect behavior. This can be averted by smoothing out the Q-function over similar actions, which is what target policy smoothing is designed to do. To this end, the actions used to form the critic target are based on the target policy $\mu_{\phi'}$, but with clipped noise added to each action dimension. After adding the clipped noise, the target action is clipped again to lie in the valid action range $[a_{Low}, a_{High}]$. The target actions can be expressed as:
$$\mu_{clip}(s_{t+1}) = \mathrm{clip}\big(\mu_{\phi'}(s_{t+1}) + \mathrm{clip}(\epsilon, -c, c),\; a_{Low},\; a_{High}\big), \quad \epsilon \sim \mathcal{N}(0, \tilde{\sigma}^2) \qquad (30)$$
As discussed previously, target networks can be used to reduce the error over multiple updates, while policy updates on states corresponding to a high TD error may lead to divergent learning behavior. As a result, TD3 updates the policy network at a lower frequency than the critic networks in order to sufficiently minimize the error before introducing a policy update. To this end, the policy and target networks are updated only once after every $Z$ updates of the critics.
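The two target-side mechanisms described above, taking the minimum of the two target critics and smoothing the target action with clipped noise, can be sketched in plain Python as follows. The function names, the representation of an action as a list of floats, and the noise parameters `sigma` and `noise_clip` are illustrative assumptions; the critic outputs are passed in as plain numbers rather than computed by networks.

```python
import random


def clip(x, low, high):
    """Limit x to the interval [low, high]."""
    return max(low, min(high, x))


def smoothed_target_action(target_action, sigma, noise_clip,
                           a_low, a_high, rng=random):
    """Add clipped Gaussian noise per dimension, then clip to the action range."""
    return [clip(a + clip(rng.gauss(0.0, sigma), -noise_clip, noise_clip),
                 a_low, a_high)
            for a in target_action]


def td3_target(reward, gamma, q1_next, q2_next):
    """Target Q-value that bootstraps from the smaller target-critic estimate."""
    return reward + gamma * min(q1_next, q2_next)
```

In a full implementation, `q1_next` and `q2_next` would be the two target critics evaluated at the next state and the smoothed target action, so that an overly optimistic estimate from either critic cannot inflate the target.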
By incorporating the target network, experience replay, and the above-mentioned modifications, the critic loss in (24) can be stated as the weighted mean-squared TD error calculated on the training data, i.e., a minibatch of $K$ experiences drawn by prioritized sampling. The TD error for each critic is the difference between the target Q-value and that critic's current estimate, and the policy gradient for the actor update in (25) can be restated in a similar fashion over the same minibatch. The weights of the online critic networks are updated by gradient descent on the critic loss at every learning step, and after $Z$ learning steps, the online actor and the target networks are updated, where $\alpha_\theta$ and $\alpha_\phi$ are the learning rates of the gradient descent algorithm and $\upsilon$ is the soft update rate.
Algorithm 1 details the training of the DNNs employed by TD3, and the proposed TD3-based DR management strategy is outlined in Algorithm 2. After the training phase, we first load the weights of the online actor network trained by Algorithm 1. For a specific test day, at each time step $t$, the agent observes the current environment state $s_t$ and determines its DR management action according to the policy learned by TD3. The requested DR actions are then mapped to the input/output of the different DERs of the household (Section 3).
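The update rules summarized above (weighted mean-squared TD error, soft target updates at rate υ, and actor/target updates every Z critic updates) can be sketched in a few lines. The sketch makes assumptions the text does not state: the loss is normalized by the minibatch size K, the per-sample weights come from the prioritized replay, weights are flat lists of floats, and all function names are hypothetical.

```python
def weighted_critic_loss(td_errors, weights):
    """Weighted mean-squared TD error over a minibatch of K samples.
    Normalizing by K is an assumption of this sketch."""
    k = len(td_errors)
    return sum(w * d * d for w, d in zip(weights, td_errors)) / k


def soft_update(target, online, rate):
    """Move every target weight a fraction `rate` of the way to its online
    counterpart: target <- rate * online + (1 - rate) * target."""
    return [rate * w + (1.0 - rate) * w_t for w, w_t in zip(online, target)]


def delayed_update_steps(num_steps, z):
    """1-based learning steps at which the actor and the target networks are
    updated when they follow the critics with a delay of z steps."""
    return [step for step in range(1, num_steps + 1) if step % z == 0]


# Example with the soft update rate reported in the simulation setup (1e-3);
# z = 2 is a hypothetical delay, since the text does not report Z.
new_target = soft_update(target=[0.0, 1.0], online=[1.0, 1.0], rate=1e-3)
schedule = delayed_update_steps(num_steps=6, z=2)
```

A small soft update rate keeps the target networks changing slowly, which is what makes the critic targets stable between the less frequent actor updates.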

5:    Initialize a random Gaussian exploration noise $\mathcal{N}_t$.
6:    for time step (i.e., 30 min) $t = 1 : T$ do
7:       The HEMS agent selects action $a_t$ using (26).
8:       Execute $a_t$ in the environment, observe $r_t$ using (23), and transition to the new state $s_{t+1}$.
9:       Store $(s_t, a_t, r_t, s_{t+1})$ in the experience replay buffer.
10:      Sample a minibatch of $K$ experiences from the replay buffer.
11:      Compute target actions $\mu^{clip}(s_{\tau+1})$ using (30).
12:      Update the online critics using (34).
13:      if $\tau \bmod Z = 0$ then
14:         Update the online actor using (35).
15:         Update the target networks using (36).
16:      end if
17:   end for
18: end for

Algorithm 2 TD3-based DR management strategy
1: Load the DNN parameter $\phi^*$ of the online actor network $\mu_{\phi^*}$ trained by Algorithm 1.
2: for test day $= 1 : E^{test}$ do
3:    Obtain the initial state $s_1$ of the test day.
4:    for time step $= 1 : T$ do
5:       Set the DR management action as $a_t = \mu_{\phi^*}(s_t)$.
6:       Execute action $a_t$ in the environment, calculate reward $r_t$, and transition to the new state $s_{t+1}$.
7:    end for
8: end for
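The deployment loop of Algorithm 2 amounts to rolling out a fixed, already trained policy with no exploration noise and no learning. A minimal sketch is given below, assuming hypothetical `policy` and `env_step` callables that stand in for the trained actor and the household environment; the default horizon of 48 corresponds to one day at half-hourly resolution.

```python
def run_test_day(policy, env_step, initial_state, horizon=48):
    """Roll out a fixed policy for one test day and return the per-step rewards.

    policy(state) -> action
    env_step(state, action) -> (reward, next_state)
    Both callables are placeholders for the trained actor and the environment.
    """
    state = initial_state
    rewards = []
    for _ in range(horizon):
        action = policy(state)                   # a_t = mu_{phi*}(s_t)
        reward, state = env_step(state, action)  # r_t and s_{t+1}
        rewards.append(reward)
    return rewards


# Toy usage: the state is a step counter and every step costs one unit.
toy_rewards = run_test_day(policy=lambda s: [0.0],
                           env_step=lambda s, a: (-1.0, s + 1),
                           initial_state=0)
```

Summing the returned rewards over a day gives the quantity that is averaged over test days in the evaluation, provided the reward is defined as the negative energy cost (an assumption of this note).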

Simulation Setup and Implementation
The proposed TD3-based DR management strategy was trained and tested on a real-world scenario using household solar PV and non-shiftable load data published by Ausgrid, Australia. The employed data were collected from 1 June 2012 to 31 May 2013 (53 weeks) at a half-hourly resolution. The household outdoor temperature data were collected from an open Australian government database [45].
The assumed operating parameters of the EV, ES, WAs, and HVAC were derived from [41,46] and are provided in Table 2. Concretely, it was assumed that an EV user makes two trips per day, each defined by a departure time, an arrival time, and an energy requirement. The grid connection period of the EV was assumed to lie between the end of its second trip and the start of its first trip. In order to capture the inherent uncertainty residing in the DERs' operating models, the following parameters were modeled as random variables: the EV departure and arrival times, the energy requirements, the initial energy levels of the EV and ES batteries, the earliest initiation and latest termination times of the WAs, and the initial indoor temperature of the HVAC. To this end, we employed a truncated normal distribution $\mathcal{TN}$ for the parameters related to temperature and energy and a discrete uniform distribution for the parameters related to time, as detailed in Table 3.
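The two sampling schemes named above can be illustrated with the standard library alone. In the sketch below, every numeric value (means, standard deviations, bounds, and time-step ranges) is a placeholder, since the contents of Table 3 are not reproduced here; the function names are likewise hypothetical, and rejection sampling is only one of several ways to draw from a truncated normal distribution.

```python
import random


def truncated_normal(mean, std, low, high, rng):
    """Draw from N(mean, std^2) restricted to [low, high] by rejection sampling."""
    while True:
        x = rng.gauss(mean, std)
        if low <= x <= high:
            return x


def sample_ev_day(rng):
    """Sample one day's uncertain EV parameters (all values are placeholders)."""
    return {
        # Discrete uniform over half-hourly time-step indices.
        "t_dep": rng.randint(14, 18),
        "t_arr": rng.randint(34, 38),
        # Truncated normal for an energy-related quantity (kWh).
        "e_init_kwh": truncated_normal(10.0, 2.0, 6.0, 14.0, rng),
    }


rng = random.Random(0)  # fixed seed so the draw is reproducible
day = sample_ev_day(rng)
```

Drawing a fresh set of parameters for every training day exposes the agent to the same kind of variability it later faces on the test days.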
In the simulations, we picked one day uniformly at random from each of the 53 weeks to form the test set and used the remaining days as the training set. The utility buy price follows the time-of-use structure provided in [45], partitioned into summer and winter periods, while the utility sell price is fixed at 0.04 AUD/kWh [47] throughout the year.
The TD3 algorithm employed two DNNs (i.e., online and target) for the actor and for each of the two critics. The Adam optimizer [48] was used to learn the neural network weights, with learning rates of $\alpha_\phi = 10^{-4}$ and $\alpha_\theta = 10^{-3}$ for the actor and critics, respectively. A soft update rate of $\upsilon = 10^{-3}$ was used, and a discount factor of $\gamma = 0.99$ was used for the critics. As shown in Figure 3, the actor and the critics each had two hidden layers with 128 and 64 neurons, respectively. Both the actor and the critics employed rectified linear units (ReLU) [49] in all hidden layers. The output layer of the actor was a softsign layer [50] to bound the continuous actions. The minibatch size and the replay buffer size were set to 128 and $10^5$, respectively. All investigated coordination methods were implemented in Python. The training of the examined learning algorithms was carried out on a computer with a four-core 2.80 GHz Intel(R) Core(TM) i7-7700HQ CPU and 16 GB of RAM, and the total training time for TD3 was 949 s.
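The weekly train/test split described above can be sketched as follows. The sketch assumes that a "week" is a consecutive 7-day block counted from the first day of the dataset (so the 53rd block holds the single remaining day); the text does not state how weeks were delimited, and the function name and seed are hypothetical.

```python
import random
from datetime import date, timedelta


def weekly_split(start, end, seed=0):
    """Pick one day uniformly at random from each 7-day block as a test day;
    all remaining days form the training set."""
    rng = random.Random(seed)
    days = [start + timedelta(days=d) for d in range((end - start).days + 1)]
    weeks = [days[i:i + 7] for i in range(0, len(days), 7)]
    test = [rng.choice(week) for week in weeks]
    test_set = set(test)
    train = [d for d in days if d not in test_set]
    return train, test


train_days, test_days = weekly_split(date(2012, 6, 1), date(2013, 5, 31))
```

Under this assumption the 365 days of data yield 53 test days and 312 training days, which matches the 53 test days used in the evaluation.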

Performance Evaluation
We benchmarked TD3 against DQN and DPG (which are widely adopted in the existing literature on DR management problems, as discussed in Section 1.2) in order to validate its superior performance. Furthermore, we solved the daily cost minimization problem (MILP) presented in Section 2.4 and calculated the average daily energy cost over the 53 test days (depicted by the black horizontal line in Figure 4). The resulting value, $C^* = 368$ cents, can be regarded as the theoretical optimum of the investigated DR management problem. In other words, it represents a lower bound on the average daily cost and indicates how far from the optimum a DRL-based DR management strategy is.
To assess the average performance as well as the variability of the examined DRL-based methods, 10 different random seeds were generated, and each DRL method was trained for 20,000 epochs for each seed. Each epoch corresponds to one day, consisting of 48 time steps, selected at random from the training dataset. During training, the cost effectiveness of the learned DR strategy was evaluated on the test dataset every 200 epochs. Figure 4 depicts the average daily cost $C$ (over the 53 test days) for the investigated DRL methods with 10 random seeds. The mean and the standard deviation of the average daily cost over the 10 seeds are displayed by solid curves and shaded areas, respectively, in Figure 4. The cumulative daily energy costs of the 53 test days under TD3 and all examined baseline methods are presented in Figure 5.
As observed in Figure 4, TD3 gradually improves the cost effectiveness of its DR management policy, with a declining standard deviation. Ultimately, only TD3 converges to a near-optimal solution. TD3 outperforms the two baseline DRL methods, attaining the lowest average daily cost of 374 cents (only 1.85% above the theoretical optimum $C^*$) and the lowest standard deviation of 4 cents at convergence. In relative terms, TD3 outperforms DQN and DPG with 12.45%/5.93% lower average daily cost and 29.35%/44.50% lower standard deviation, respectively. In addition, it is evident that the continuous DR management strategy (employing TD3 or DPG) is more cost effective than the discrete one (employing DQN), since the former enables the agent to discover a more fine-grained DR management strategy in a multi-dimensional continuous action space. A more comprehensive illustration of the value of the continuous DR management strategy is presented in Section 5.3. Going further, TD3 exhibits superior convergence performance with respect to DPG in terms of both the obtained average daily cost and learning stability.
This superior performance is attributed to TD3's higher sample efficiency in computing the policy gradient, as well as to the policy evaluation enabled by learning the critics jointly with the policy. In contrast, DPG features no policy evaluation, which contributes to high variance in its policy gradient estimates. Furthermore, TD3 updates the actor and critics in an online manner (i.e., the updates are performed at every time step), whereas DPG must collect a full trajectory of experiences before the policy network can be updated. Finally, TD3 incorporates tailored mechanisms to mitigate the overestimation of the Q-value function, avoiding erroneous convergence to sub-optimal policies and thereby improving the convergence performance.
As depicted in Figure 5, the cumulative costs obtained by the two benchmark approaches, DQN (green solid curve) and DPG (blue solid curve), are 14.22% and 6.30% higher than the theoretical optimum, respectively. In comparison, the cumulative cost under TD3 (red solid curve) is only 1.88% higher than the theoretical optimum (black dashed curve).
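The summary statistics used in this comparison (mean and standard deviation over seeds, and the percentage gap to the optimum) can be computed as sketched below. The per-seed costs in the usage example are made-up numbers chosen only to exercise the function, and the use of the sample standard deviation is an assumption; the text does not say whether the sample or population form was reported.

```python
from statistics import mean, stdev


def summarize(costs_per_seed, optimum):
    """Mean and sample standard deviation of the average daily cost across
    seeds, plus the percentage gap of the mean to a reference optimum."""
    avg = mean(costs_per_seed)
    return {
        "mean": avg,
        "std": stdev(costs_per_seed),
        "gap_pct": 100.0 * (avg - optimum) / optimum,
    }


# Hypothetical per-seed average daily costs (cents) and a hypothetical optimum.
summary = summarize([400.0, 410.0, 420.0], optimum=400.0)
```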
To further elaborate on the generalization capability of the DR management policies learned by TD3 with respect to the system uncertainties, we investigated the obtained DR schedules of the household for a representative summer day and a representative winter day selected from the 53 test days (reflecting the seasonal variations in the utility price, PV generation, and outdoor temperature), as displayed in Figures 6 and 7, respectively.
The summer day (Figure 6) features ample PV generation and high outdoor temperature. At the beginning of the day, the HEMS learns not to operate the HVAC system in order to conserve energy and reduce cost, since the outdoor temperature is relatively low but still well above the minimum comfortable temperature (19 °C). A surge in the outdoor temperature is observed at around 5:00. As a result, the indoor temperature also increases (with a time lag), and the HVAC system is only scheduled to be on when the indoor temperature reaches the maximum comfortable temperature (24 °C) at around 8:00. During the mid-day periods (9:00-16:00), the operation of the HVAC system is optimized such that it absorbs a significant portion of the plentiful PV generation while maintaining the indoor temperature marginally below 24 °C in order to minimize cost. Furthermore, the HEMS also learns to absorb the PV generation by charging the ES instead of selling it to the grid, because the utility buy price during the mid-day periods is still higher than the unfavorable sell price. During the peak periods (17:00-22:00), when PV generation is absent, the peak demand is substantially flattened by discharging the ES and EV (both of which are scheduled to charge during the cheapest off-peak periods). As observed in Figure 6, the learned DR management policy yields a total of 13 h (9:00-22:00) with net zero cost by optimally scheduling the complementary DERs and harnessing their flexibility potential.
The winter day (Figure 7) is distinguished by scarce PV generation and low outdoor temperature. At the beginning of the day, the outdoor temperature is significantly lower than the minimum comfortable temperature; the HEMS adapts to this exogenous condition by turning on the HVAC system for heating and conservatively scheduling it to sustain the indoor temperature marginally above 19 °C. After 8:00, as the utility buy price increases (Figure 7a), the HEMS learns to turn off the HVAC system to save cost. Subsequently, the indoor temperature follows the outdoor temperature (with a time delay) until the end of the day without the HVAC system being operated. Similarly to the trend observed in Figure 6, the HEMS learns to charge the ES and EV sufficiently during the off-peak periods and to discharge them in the morning (8:00-10:00) and afternoon/evening (14:00-22:00) peak demand periods, leading to a total of 14 h with net zero cost.
It can be concluded that the learned DR management policy exhibits excellent generalization performance with respect to the seasonal and daily variations associated with the utility prices, PV generation, residential demand, and outdoor temperature. Furthermore, the obtained DR management policies enable comprehensive harnessing of the flexibility value of complementary DERs, thus promising efficient utilization of RESs and substantial cost savings for the end-user.

Benefits of Continuous DR Management
This section explores in more depth the physical significance of the continuous DR management strategy enabled by TD3 by comparing it to DQN (a discrete DRL method commonly employed in this research topic). For DQN's implementation, actions $a_t^{ev}$ and $a_t^{es}$ are each discretized into five integer values representing charging or discharging at 0%, 50%, or 100% of the maximum power limits of the EV and ES. Action $a_t^{hvac}$ is also discretized into five integer values representing a power demand of 0%, 25%, 50%, 75%, and 100% of the maximum power input of the HVAC. Figure 8 illustrates the DR schedules of the household obtained using DQN for the same summer day as that examined in the previous section.
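The discretization described above can be made concrete with a small lookup. In this sketch, the sign convention (negative for discharging, positive for charging), the ordering of the levels, the function names, and the 6 kW power limit are all assumptions made for illustration; only the level percentages come from the text.

```python
# Fractions of the maximum power for each of the five discrete action indices.
EV_ES_LEVELS = (-1.0, -0.5, 0.0, 0.5, 1.0)   # discharge 100%/50%, idle, charge 50%/100%
HVAC_LEVELS = (0.0, 0.25, 0.5, 0.75, 1.0)    # share of the maximum HVAC power input


def decode_action(index, levels, p_max):
    """Map a discrete action index to a power set-point in kW."""
    return levels[index] * p_max


def nearest_level(fraction, levels):
    """Closest discrete level to a desired continuous fraction."""
    return min(levels, key=lambda level: abs(level - fraction))


# A continuous controller could request 60% HVAC power; the discrete one cannot.
requested = 0.6
available = nearest_level(requested, HVAC_LEVELS)
ev_power_kw = decode_action(3, EV_ES_LEVELS, p_max=6.0)  # hypothetical 6 kW limit
```

The gap between `requested` and `available` is the quantization error that a discrete-action agent has to absorb elsewhere in the schedule.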
Driven by the employed discretization of actions, the power input and/or output of the HVAC, EV, and ES can only be adjusted in discrete blocks, as mentioned above. In the case of the HVAC system, its demand profile is lumpy and exhibits significant fluctuations compared with the one observed in Figure 6b, since the HEMS can now only adjust the power input of the HVAC in five discrete blocks. This, in turn, leads to fluctuations in the indoor temperature. In the case of the EV, since its charging power can no longer be continuously regulated, the HEMS charges the EV more during the off-peak period in order to guarantee the fulfillment of its traveling energy requirement. In the case of the ES, owing to the lumpiness of the HVAC demand during the mid-day periods, the HEMS charges the ES more during these periods in order to fully consume the PV generation. However, since the power output of the PV generator is not controllable, this inevitably leads to the purchase of superfluous electricity (i.e., overcharging of the ES) at high utility buy prices (Figure 8a). As a consequence of the charging activities of the EV and ES, significant reverse power flow (from selling excess EV and ES discharge to the grid) is observed during the peak demand periods. In effect, the EV and ES are scheduled to charge at the shoulder/peak utility buy price and to discharge at the unfavorable utility sell price (0.04 AUD/kWh), resulting in uneconomical operation. Overall, the daily energy cost under DQN (465 cents) is approximately 24.33% higher than the one under TD3 (374 cents). It can therefore be concluded that discrete control DRL methods hinder the comprehensive exploitation of the flexibility potential offered by DERs as well as the coordinated scheduling of complementary DERs.
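As a quick check of the relative cost difference quoted above, using the two rounded daily costs stated in the text (the superscripted symbols are introduced here only for this check):

```latex
\[
\frac{C^{\mathrm{DQN}} - C^{\mathrm{TD3}}}{C^{\mathrm{TD3}}}
= \frac{465 - 374}{374}
= \frac{91}{374}
\approx 0.2433 = 24.33\%.
\]
```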

Conclusions
In this paper, we formulated a real-time demand response management problem for a residential household as a Markov decision process. The problem formulation takes into account the uncertainties stemming from the supply (photovoltaic generation), demand (non-shiftable load, electric vehicle, wet appliances, and heating, ventilation, and air conditioning), and storage (electric storage and electric vehicle) sides of the end-user. A model-free and data-driven demand response management strategy based on deep reinforcement learning, whose performance does not rely on accurate mathematical modeling of the distributed energy resources' operating models or of the uncertainties, was developed to determine the real-time control strategies for the household. The proposed approach extends the state-of-the-art deep deterministic policy gradient learning algorithm by addressing its overestimation error in the Q-value function, thus avoiding sub-optimal policies and promising better convergence properties. In comparison to the commonly employed Q-learning and deep Q network discrete control reinforcement learning methods, the proposed approach enables the agent to learn more fine-grained demand response management policies from high-dimensional sensory data. Case studies employing a large-scale real-world dataset offered valuable insights into the significance of the proposed demand response management strategy. The simulation results demonstrated that the twin delayed deep deterministic policy gradient converges to a near-optimal solution and reduces the energy cost by approximately 12.45% and 5.93% compared to the costs obtained using the deep Q network and deep policy gradient, respectively.
Furthermore, the proposed method yields real-time and cost-effective demand response management strategies, and these are shown to generalize despite the variability in multiple uncertain parameters of the problem.

Conflicts of Interest:
The authors declare no conflict of interest.

Nomenclature
$t$    Index of time steps
$T$, $\Delta t$    Horizon and resolution of DR management problem
$\lambda_t^{+}$, $\lambda_t^{-}$    Utility buy and sell prices at $t$ (AUD/kWh)
$P_t^{d}$    Power demand of non-shiftable loads at $t$ (kW)
$P_t^{pv}$    Power generation of PV at $t$ (kW)
$V_t^{ev}$    Binary indicator of whether EV charges ($V_t^{ev} = 1$) or discharges ($V_t^{ev} = 0$) at $t$
$P_t^{evc}$, $P_t^{evd}$    Charging and discharging power of EV at $t$ (kW)
$P^{ev,max}$    Maximum charging/discharging rate of EV (kW)
$E_t^{ev}$    Energy level of EV at $t$ (kWh)
$E^{ev,max}$, $E^{ev,min}$    Maximum and minimum energy limits of EV (kWh)
$E_t^{tr}$    Energy requirement for traveling purposes of EV at $t$ (kWh)
$\eta^{evc}$, $\eta^{evd}$    Charging and discharging efficiencies of EV
$t^{dep}$, $t^{arr}$    Departure and arrival times of EV
$A_t^{ev}$    Binary indicator of EV scheduling availability at $t$ (set as $A_t^{ev} = 1$ for the EV scheduling step $t \in [0, t^{dep}) \cup (t^{arr}, T]$ and $A_t^{ev} = 0$ otherwise)
$\tau$    Index of sub-processes of the WA cycle
$P_\tau^{cyc}$    Power demand at sub-process $\tau$ of the WA cycle (kW)
$P_t^{wa}$    Power demand of WAs at $t$ (kW)
$T^{dur}$    Duration of WA cycle
$t^{in}$, $t^{ter}$    Earliest initiation and latest termination times of WA cycle
$A_t^{wa}$    Binary indicator of WA scheduling availability at $t$ (set as $A_t^{wa} = 1$ for the WA scheduling step $t \in [t^{in}, t^{ter}]$ and $A_t^{wa} = 0$ otherwise)