2. System Model and Problem Formulation
The investigated residential household, in which an HEMS manages a portfolio of assorted DERs, is shown in Figure 1. The on-site, non-dispatchable PV generator and the integrated ES unit can supply the household's electricity demand and also generate revenue by selling surplus PV power to the grid. The appliances of a household can generally be separated into two categories: non-shiftable and shiftable. The shiftable appliances can be further sub-categorized as interruptible and non-interruptible. The power demand of the non-shiftable appliances (e.g., lighting loads) must be supplied by the HEMS without any delay when they are active. On the other hand, the HEMS can delay the consumption of the shiftable appliances. WAs (e.g., washing machine, dishwasher, tumble dryer), also known as deferrable appliances, constitute the most representative type of non-interruptible appliances: their load cycles can be flexibly scheduled within a specific time window but cannot be interrupted or altered once started. In contrast, EVs and HVAC are characterized as interruptible appliances whose operation times and energy usage can be flexibly adjusted, provided that specific operating constraints are satisfied, e.g., the traveling energy requirement of an EV and the comfortable indoor temperature range of the HVAC.
The HEMS is assumed to operate in slotted time steps with a temporal resolution of 0.5 h, where T = 48 is the total number of time steps in an investigated day. At each time step, the HEMS manages the charging and/or discharging power of the EV and ES and the power consumption of the WAs and HVAC based on high-dimensional sensory data comprising the non-shiftable load, the PV generation, the outdoor temperature, the state of charge of the ES and EV, and the utility buy/sell prices, aiming to minimize the daily energy cost of the household while maintaining a comfortable indoor temperature range. Next, we present the operating models of the EV, ES, WAs, and HVAC, as well as the model-based daily cost minimization problem for their management.
2.1. Electric Vehicle (EV) and Energy Storage (ES)
The EV is flexible in terms of the time periods in which it can acquire the energy needed for its operation, as long as this is completed within a scheduling interval allowed by its users. In addition, the EV can inject stored energy during this interval (i.e., it exhibits V2G/V2H capabilities). The charging/discharging power of the EV can be continuously regulated between 0 and a maximum level, and it needs to fulfill an energy requirement for the envisaged journeys within the (grid-connected) scheduling interval. Each EV is assumed to depart from its grid connection point only once within the horizon of the coordination problem and to subsequently arrive back at its grid connection point only once during the same horizon. The operating model of the EV includes the following constraints:
Constraint (1) corresponds to the EV battery's energy balance, taking into account the energy needed for commuting purposes as well as the losses caused by the charging and discharging efficiencies. Constraint (2) expresses the lower and upper bounds of the battery's energy content. Constraints (3)–(4) represent the limits of the battery's charging/discharging power, which depend on its power capacity and on whether the EV is available for scheduling, while a binary variable is employed to avoid simultaneous charging and discharging. Finally, constraint (5) ensures that the EV is sufficiently charged upon departure to satisfy the commuting requirements of its users.
The operating model of the ES [39] is similar to that of the EV, except that the traveling energy requirement and the scheduling availability are irrelevant and are thus removed.
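To make the energy-balance bookkeeping of constraints (1)–(4) concrete, the following minimal Python sketch steps an EV battery forward by one slot (with the trip term set to zero, the same function describes the ES); all numeric parameter values are illustrative assumptions, not the paper's data.

```python
def ev_energy_update(E, p_ch, p_dis, e_trip=0.0, eta_ch=0.95, eta_dis=0.95,
                     dt=0.5, E_min=4.0, E_max=40.0):
    """One step of the EV battery energy balance in the spirit of constraint (1):
    charging losses, discharging losses, and the traveling energy demand.
    All numeric values here are illustrative assumptions."""
    assert p_ch == 0.0 or p_dis == 0.0, "(3)-(4): no simultaneous charge/discharge"
    E_next = E + eta_ch * p_ch * dt - p_dis * dt / eta_dis - e_trip
    assert E_min <= E_next <= E_max, "(2): energy bounds violated"
    return E_next

print(ev_energy_update(E=10.0, p_ch=7.0, p_dis=0.0))  # -> 13.325 kWh
```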
2.2. Wet Appliances (WAs)
The operation of WAs is based on the execution of user-prescribed cycles, each comprising a fixed sequence of sub-processes with generally fixed durations and power demands [40]. Their flexibility is measured by the ability to defer these cycles up to a maximum delay limit set by their users. Without loss of generality, each WA is assumed to be activated by its users for a single operational cycle per day, within the temporal horizon between the cycle's earliest initiation time and latest termination time. The operating model of the WAs includes the following constraints:
Constraint (6) ensures that the demand activity of the WAs is carried out at most once during the time window determined by the earliest initiation and latest termination times. Constraint (7) expresses that the power demand of the WAs at each time step depends on the chosen initiation time together with the cycle's fixed duration and power profile.
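A minimal sketch of this initiation-to-demand mapping, assuming a hypothetical fixed per-step power sequence for the cycle:

```python
import numpy as np

def wa_demand_profile(start_step, cycle_power, t_es, t_lf, T=48):
    """Place a fixed, non-interruptible WA cycle (constraint (7)) so that it
    starts no earlier than t_es and finishes no later than t_lf.
    cycle_power is the cycle's fixed per-step power sequence (kW)."""
    d = len(cycle_power)
    if not (t_es <= start_step and start_step + d <= t_lf + 1):
        raise ValueError("cycle does not fit in the allowed window")
    demand = np.zeros(T)
    demand[start_step:start_step + d] = cycle_power
    return demand

# Example: a 3-step washing-machine cycle allowed between steps 20 and 30.
print(wa_demand_profile(22, [2.0, 0.5, 1.5], t_es=20, t_lf=30).nonzero()[0])
```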
2.3. Heating, Ventilation, and Air Conditioning (HVAC)
The flexibility of the examined HVAC systems lies in the fact that users can specify an indoor temperature range within which their thermal comfort is preserved. Representing thermal comfort is a non-trivial task, as it depends on many diverse factors (e.g., air temperature, mean radiant temperature, relative humidity, air speed). Following the practice of [41], a comfortable temperature range is employed as the representation of thermal comfort.
Equation (9) represents the dynamic thermal behavior of the heated/cooled space, following the first-order model presented in [41], where $C$ and $R$ are, respectively, the thermal capacity and thermal resistance of the heated/cooled space, and $\eta$ is the energy efficiency of the HVAC, which is positive for cooling and negative for heating.
Equation (10) expresses the electric power limits of the HVAC system, bounding its input power between zero and its rated capacity.
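A minimal discrete-time sketch of the thermal model, assuming one common forward-Euler discretization of the first-order dynamics and illustrative parameter values:

```python
def indoor_temp_next(T_in, T_out, p_hvac, C=10.0, R=2.0, eta=2.5, dt=0.5):
    """One step of a first-order thermal model in the spirit of Equation (9):
    heat exchange with the outdoors through thermal resistance R, plus the
    HVAC contribution scaled by its energy efficiency eta (positive for
    cooling, negative for heating, per the paper's sign convention).
    C [kWh/degC], R [degC/kW], eta, and dt are illustrative assumptions."""
    return T_in + dt * (T_out - T_in) / (C * R) - dt * eta * p_hvac / C

print(indoor_temp_next(T_in=25.0, T_out=32.0, p_hvac=2.0))  # cooling example
```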
2.4. Daily Energy Cost Minimization
The net demand (positive)/generation (negative) of the household at step $t$ is expressed in (11) as the aggregate power of the non-shiftable load, the EV, the ES, the WAs, and the HVAC, minus the PV production.
Finally, the daily energy cost minimization problem for the household is formulated in (12)–(14), where the operators $[\cdot]^{+}$ and $[\cdot]^{-}$ denote, respectively, the maximum and the minimum between the argument and 0. The first term in (12) represents the cost of purchasing electricity from the grid, while the second term represents the revenue from selling excess PV production and discharged ES and EV energy to the grid.
Note that problem (12)–(14) is a mixed-integer linear program (MILP) that provides a model-based DR management strategy minimizing the daily energy cost under the assumption of full knowledge of the operating models and parameters of all the DERs and perfect prediction of all the uncertain parameters. As such, the optimal cost in (12) can be treated as a lower bound on the achievable cost (since the uncertainties are completely neglected), and it later serves as a theoretical baseline for the model-free DRL DR management strategy. As discussed in Section 1.2, the SP approach is computationally inefficient in optimizing the DR management strategy under multi-source system uncertainties. We therefore propose an alternative approach to the real-time DR management problem.
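For concreteness, the following sketch sets up a reduced version of the MILP (12)–(14) in PuLP, keeping only a single EV battery and linearizing the $[\cdot]^{+}/[\cdot]^{-}$ operators by splitting the net demand into non-negative buy and sell parts (valid whenever the buy price is at least the sell price). All numeric data, variable names, and the end-of-day energy condition are illustrative assumptions.

```python
import numpy as np
import pulp

T, dt = 48, 0.5                                       # 48 half-hour steps
lam_b, lam_s = np.full(T, 0.30), np.full(T, 0.10)     # buy/sell prices (buy >= sell)
load, pv = np.full(T, 1.0), np.full(T, 0.8)           # non-shiftable load, PV (kW)
P_max, E_min, E_max, E0, eta = 7.0, 4.0, 40.0, 10.0, 0.95

prob = pulp.LpProblem("daily_cost", pulp.LpMinimize)
p_ch = [pulp.LpVariable(f"p_ch_{t}", 0, P_max) for t in range(T)]
p_dis = [pulp.LpVariable(f"p_dis_{t}", 0, P_max) for t in range(T)]
u = [pulp.LpVariable(f"u_{t}", cat="Binary") for t in range(T)]  # charge flag
g_buy = [pulp.LpVariable(f"g_buy_{t}", 0) for t in range(T)]     # [net]^+
g_sell = [pulp.LpVariable(f"g_sell_{t}", 0) for t in range(T)]   # -[net]^-
E = [pulp.LpVariable(f"E_{t}", E_min, E_max) for t in range(T + 1)]

prob += E[0] == E0
for t in range(T):
    prob += p_ch[t] <= P_max * u[t]           # (3)-(4): no simultaneous
    prob += p_dis[t] <= P_max * (1 - u[t])    # charging and discharging
    prob += E[t + 1] == E[t] + eta * p_ch[t] * dt - p_dis[t] * dt / eta  # (1)
    prob += g_buy[t] - g_sell[t] == float(load[t] - pv[t]) + p_ch[t] - p_dis[t]
prob += E[T] >= E0  # (5)-like terminal condition (assumption for this sketch)
prob += pulp.lpSum(dt * (float(lam_b[t]) * g_buy[t] - float(lam_s[t]) * g_sell[t])
                   for t in range(T))
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[prob.status], pulp.value(prob.objective))
```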
3. DR Management as an MDP
A finite Markov decision process (MDP) with discrete time steps is applied to formulate the real-time DR management problem. The time interval between two adjacent time steps is 30 min (the proposed approach can be readily extended to a finer temporal resolution). The HEMS constitutes the agent, while the environment is composed of the objects outside the agent (e.g., utility company, PV generator, non-shiftable loads, ES, EV), as shown in Figure 2. In the context of RL, an agent acts in an environment by sequentially taking actions over a sequence of time steps so as to maximize a cumulative reward. In general, RL can be described by an MDP that includes: (1) a state space $\mathcal{S}$; (2) an action space $\mathcal{A}$; (3) a transition dynamics distribution with conditional transition probability $p(s_{t+1} \mid s_t, a_t)$, which models the uncertainty in the evolution of the environment's states given the agent's executed actions; and (4) a reward function $r\colon \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. The MDP formulation of the DR management problem is detailed below.
(1) State: The state $s_t$ received by the HEMS agent at step $t$ describes the status of the environment on which its action is based. It is an 11-dimensional vector comprising the following sensory information: the time step identifier $t$; the utility buy and sell prices; the outdoor and indoor temperatures; the non-shiftable demand; the PV production; the energy content of the EV and of the ES; and the EV's and WAs' scheduling availability indicators.
(2) Action: The action $a_t$ at step $t$ encompasses the management decisions for the controllable DERs (the EV, ES, WAs, and HVAC). Its EV and ES components set the charging (positive) or discharging (negative) power as a fraction of the respective maximum power, its WA component indicates whether the cycle of the WAs is initiated at step $t$, and its HVAC component sets the magnitude of the HVAC input power as a ratio of its maximum electric power.
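As a concrete illustration of these definitions, a state and an action could be laid out as plain vectors as follows (all numeric values and the exact ordering are assumptions for illustration):

```python
import numpy as np

# Illustrative 11-dimensional state vector for one time step:
state = np.array([
    17,     # time step identifier t
    0.30,   # utility buy price
    0.10,   # utility sell price
    28.0,   # outdoor temperature (degC)
    24.5,   # indoor temperature (degC)
    0.9,    # non-shiftable demand (kW)
    1.4,    # PV production (kW)
    12.0,   # EV energy content (kWh)
    6.5,    # ES energy content (kWh)
    1.0,    # EV scheduling availability indicator
    1.0,    # WA scheduling availability indicator
])

# Action: EV and ES power ratios in [-1, 1] (charge positive, discharge
# negative), WA start flag in {0, 1}, HVAC power ratio in [0, 1].
action = np.array([0.5, -0.2, 0.0, 0.6])
```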
After the execution of action $a_t$, the environment maps $a_t$ to the respective power output/input of each DER and subsequently determines the next state $s_{t+1}$ and reward $r_t$. Based on the EV operating model presented in Section 2.1, the mutually exclusive charging and discharging powers (as an EV cannot charge and discharge at the same time step) are managed by the EV component of $a_t$ and are further limited by the EV's power capacity, energy bounds, charging/discharging efficiencies, and scheduling availability.
Based on the resulting charging and discharging powers, the energy content of the EV battery at the next time step can be written as (17).
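A minimal sketch of this mapping from the EV action component to a feasible charging or discharging power, clipping against the energy bounds in the spirit of (15)–(17); parameter values are illustrative assumptions:

```python
def ev_power_from_action(a_ev, E, avail, p_max=7.0, E_min=4.0, E_max=40.0,
                         eta_ch=0.95, eta_dis=0.95, dt=0.5):
    """Map the scalar EV action a_ev in [-1, 1] to mutually exclusive
    charging/discharging powers, clipped so that the resulting energy
    content stays within [E_min, E_max]. Values are illustrative."""
    if not avail:                 # EV away from its grid connection point
        return 0.0, 0.0
    if a_ev >= 0.0:               # charging branch
        p_ch = min(a_ev * p_max, (E_max - E) / (eta_ch * dt))
        return p_ch, 0.0
    p_dis = min(-a_ev * p_max, (E - E_min) * eta_dis / dt)
    return 0.0, p_dis
```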
The corresponding quantities for the ES can be obtained by following the ES operating model (Section 2.1) and the same logic as the above derivation of (15)–(17), with the scheduling availability neglected.
Based on the WA operating model presented in Section 2.2, the power demand of the WAs is managed by the WA component of $a_t$ and is further determined by the WA parameters, i.e., the cycle's earliest initiation time, latest termination time, and fixed power profile.
Finally, on the basis of the HVAC operating model in Section 2.3, the indoor temperature at the next time step follows the first-order thermal model of Equation (9), with the HVAC electric power set by the HVAC component of $a_t$.
(3) Reward: Since the objective of the HEMS agent is to minimize the total energy cost of the household while maintaining a comfortable indoor temperature and ensuring the satisfaction of all DERs' operating constraints, the reward function is designed to include the following three components: (1) the negative total energy cost of the household; (2) a penalty for indoor temperature deviations from the desirable range, scaled by a positive weighting factor; and (3) a penalty for the constraint violations of the DERs, scaled by a positive weighting factor.
Note that in (15)–(16), the charging and discharging power of the EV respects only the minimum/maximum power and energy limits of the EV; it does not ensure that the state-of-charge level is sufficient to cover the energy requirements for traveling, i.e., constraint (5) may not be satisfied. Furthermore, constraint (6) should be satisfied at the last possible initiation step to guarantee the daily activation of the WAs. To adequately account for these inter-temporal constraints of the EV and WAs, we introduce a penalty term in the reward function.
The final reward function, given in (23), sums these three components.
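A compact sketch of such a composite reward, with hypothetical weighting factors and a single scalar capturing the magnitude of the inter-temporal constraint violations:

```python
def reward(cost, T_in, T_min=20.0, T_max=24.0, violation=0.0,
           rho1=1.0, rho2=1.0):
    """Composite reward: negative energy cost, a comfort-band penalty, and a
    penalty for violated inter-temporal constraints (e.g., unmet EV traveling
    energy). Weights, band, and names are illustrative assumptions."""
    r_cost = -cost
    r_comfort = -rho1 * max(T_min - T_in, 0.0, T_in - T_max)
    r_violation = -rho2 * violation
    return r_cost + r_comfort + r_violation
```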
(4) Performance and value functions: The agent employs a policy $\pi$ to interact with the MDP, emitting a trajectory of states, actions, and rewards over the horizon. The agent's objective is to learn a policy that maximizes the expected cumulative discounted reward from the start state, termed the performance function; the expectation is taken over the discounted state distribution, and the discount factor $\gamma \in [0, 1)$ weighs future rewards. Furthermore, the Q-value function $Q^{\pi}(s, a)$ estimates the discounted return obtained by taking action $a$ in state $s$ and following policy $\pi$ from the succeeding states onwards.
4. Proposed TD3-Based DR Management Strategy
As discussed in Section 1.2, despite the popularity of QL and DQN for DR management problems in the existing literature, both suffer, to some extent, from the curse of dimensionality driven by their need to discretize the state and/or action spaces. Furthermore, the discretization may hinder the decision-making of the HEMS agent, leading to sub-optimal DR management policies. The DPG method, in turn, is criticized for its low sampling efficiency and the high variance of its gradient estimator. To address these challenges, the proposed DR management strategy is founded on the TD3 method [38], the overall workflow of which is presented in Figure 3. TD3 builds upon the state-of-the-art DRL method for continuous control, namely DDPG.
TD3 features an actor–critic architecture, which employs (a) a parameterized critic network that takes a state and an action as input and outputs an estimate of the Q-value function and (b) a parameterized actor network that takes a state as input, performs the policy improvement task of updating the policy with respect to the estimated Q-value function, and outputs a continuous action. QL and DQN both perform policy improvement through a greedy maximization of the Q-value function, i.e., selecting the action with the maximal estimated Q-value. However, such a greedy strategy is intractable in high-dimensional continuous action domains, since the Q-value function would need to be globally maximized at every step. Instead, TD3 utilizes the actor $\mu_{\phi}$ to produce an action for the next state. The critic $Q_{\theta}$ is responsible for policy evaluation, i.e., for criticizing the policy obtained by the actor by generating a Q-value estimate with temporal difference (TD) learning. This is achieved by minimizing the regression loss function

$L(\theta) = \mathbb{E}\big[(y_t - Q_{\theta}(s_t, a_t))^2\big], \qquad (24)$
where $y_t$ denotes the target Q-value at step $t$. Instead of globally maximizing $Q_{\theta}$, the critic evaluates the gradients $\nabla_a Q_{\theta}(s, a)$, which indicate the directions in which the action should change to pursue higher estimated Q-values. As a result, the weights $\phi$ of the actor are updated in the direction of the performance gradient, derived according to the deterministic policy gradient theorem [42]:

$\nabla_{\phi} J = \mathbb{E}\big[\nabla_a Q_{\theta}(s, a)\big|_{a = \mu_{\phi}(s)} \, \nabla_{\phi}\, \mu_{\phi}(s)\big]. \qquad (25)$
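The two objectives translate almost line for line into code. The sketch below, assuming small fully connected networks and the 11-dimensional state and 4-dimensional action of Section 3, shows the TD regression loss of (24) and the actor objective whose gradient is (25):

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q_theta(s, a): state-action value estimator."""
    def __init__(self, s_dim=11, a_dim=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

class Actor(nn.Module):
    """mu_phi(s): deterministic policy with tanh-bounded actions."""
    def __init__(self, s_dim=11, a_dim=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim, 64), nn.ReLU(),
                                 nn.Linear(64, a_dim), nn.Tanh())
    def forward(self, s):
        return self.net(s)

critic, actor = Critic(), Actor()
s = torch.randn(32, 11)   # minibatch of states
a = torch.randn(32, 4)    # executed actions
y = torch.randn(32, 1)    # fixed target Q-values y_t
critic_loss = ((critic(s, a) - y) ** 2).mean()   # TD regression loss, Eq. (24)
actor_loss = -critic(s, actor(s)).mean()         # minimizing this follows Eq. (25)
```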
Exploration vs. exploitation: Maintaining an effective trade-off between exploration and exploitation plays a vital role in RL. To aid the agent in exploring the environment thoroughly, an exploration/behavior policy is constructed by imposing a random Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$ on the actor's output:

$a_t = \mu_{\phi}(s_t) + \epsilon. \qquad (26)$
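A sketch of this behavior policy, with clipping added so that the noisy action stays in the valid range; the noise scale and range are assumptions:

```python
import torch

def behavior_action(actor, s, sigma=0.1, low=-1.0, high=1.0):
    """Exploration policy in the spirit of (26): Gaussian noise on the
    deterministic actor output, then clipped to the valid action range."""
    with torch.no_grad():
        a = actor(s)
        a = a + sigma * torch.randn_like(a)
    return a.clamp(low, high)
```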
It is well recognized that RL’s learning process tends to exhibit instability or even divergence when a DNN is employed as a nonlinear regressor for the Q-value function. To tackle such instability, previous works have put forward two tailored mechanisms.
Target Networks: Observe in (24) that the online network is utilized both for the current Q-value estimate and for the target Q-value. As a consequence, the Q-value update is prone to oscillations. To remedy this instability, target networks [26] can be introduced for the actor and the critic; they are used only for evaluating the target values. The weights of these target networks are updated by having them gradually track the weights of the online networks, $\theta' \leftarrow \tau \theta + (1 - \tau)\theta'$ with $\tau \ll 1$. Similar to the idea of temporarily freezing the Q-target value during training in DQN, but adapted to the actor–critic setting, the rationale behind this soft update is to constrain the target values (for both actor and critic) to change slowly, enhancing learning stability.
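The soft update is one line of Polyak averaging per parameter tensor; a sketch with a typical (assumed) rate:

```python
import torch

@torch.no_grad()
def soft_update(online, target, tau=0.005):
    """Polyak averaging: theta' <- tau * theta + (1 - tau) * theta',
    keeping the target network slowly tracking the online one."""
    for p, p_targ in zip(online.parameters(), target.parameters()):
        p_targ.mul_(1.0 - tau).add_(tau * p)
```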
Experience Replay: Since experiences are generated sequentially through the agent's interaction with the environment, they are temporally correlated, which can substantially degrade machine learning models. The employment of an experience replay buffer [26] resolves this challenge: the buffer is a cache that pools past experiences and from which minibatches are uniformly sampled for training the actor and critic. Mixing recent with older experiences alleviates the temporal correlation of the sampled experiences; furthermore, it enables samples to be reused, which enhances sampling efficiency.
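A minimal fixed-capacity buffer with uniform sampling might look as follows:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size experience cache with uniform minibatch sampling."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences drop out

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, k):
        batch = random.sample(self.buffer, k)
        return list(zip(*batch))  # tuples of states, actions, rewards, next states
```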
Despite the remarkable success that DDPG has achieved in various power system and smart grid applications, it is often criticized for the overestimation bias of its Q-value function, which can result in sub-optimal policies [38]. TD3 is tailored to address this challenge by concurrently learning two Q-value functions instead of one.
Double Critic Networks: In RL methods focused on learning the Q-value, such as QL and DQN, function approximation errors may arise, which can result in overestimated Q-values and, consequently, sub-optimal policies [43,44]. Concretely, the target Q-values used by QL and DQN take the form

$y = r + \gamma \max_{a'} Q(s', a').$

It can be observed that QL uses the same Q-table both to select an action (in the $\max$ operator) and to evaluate it (i.e., to calculate the target Q-value that is subsequently used in the Q-value update). Analogously, DQN uses the same set of neural network weights to both select and evaluate an action. This makes overestimated Q-values more likely to be selected, leading to overoptimistic value estimates. Furthermore, such an overestimation bias propagates in time through the Bellman equation and can grow into a significant bias after many updates if left unchecked. In the case of DDPG, since the policy is optimized with respect to the critic, using the same critic in the target update creates a similar overestimation. This may adversely affect the policy quality, since sub-optimal actions may be highly rated by a sub-optimal critic, reinforcing the selection of these actions in subsequent policy updates.
To address the drawback of using a single Q-value estimator, a variant of the Double Q-learning method [43] is adopted in the actor–critic setting to mitigate the risk of an overestimated Q-value. To this end, we introduce two independently trained online critic networks and their corresponding target networks, one of which yields a potentially more biased Q-value estimate than the other. Since a Q-value estimate that suffers from overestimation bias can be used as an approximate upper bound for the true value estimate, the less biased estimate bounds the more biased one. Equivalently, this results in taking the minimum between the two estimates to obtain the target Q-value:

$y = r + \gamma \min_{i=1,2} Q_{\theta_i'}(s', \tilde{a}).$
Another potential failure mode of DDPG is that, if the Q-function approximator develops an incorrect sharp peak for some actions, the policy quickly overfits to that narrow peak, leading to incorrect behavior. This can be averted by smoothing the Q-function over similar actions, which is exactly what target policy smoothing is designed to do. To this end, the actions used to form the critic target are based on the target policy $\mu_{\phi'}$, but with clipped noise added on each action dimension; after adding the clipped noise, the target action is clipped again to lie in the valid action range. The target actions can be expressed as

$\tilde{a} = \mathrm{clip}\big(\mu_{\phi'}(s') + \mathrm{clip}(\epsilon, -c, c),\; a_{\min},\; a_{\max}\big), \quad \epsilon \sim \mathcal{N}(0, \tilde{\sigma}^2). \qquad (30)$
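Both safeguards combine into a single target computation; a sketch with commonly used (assumed) TD3 hyperparameters:

```python
import torch

def td3_target(r, s_next, actor_targ, critic1_targ, critic2_targ,
               gamma=0.99, sigma=0.2, c=0.5, low=-1.0, high=1.0):
    """Clipped double-Q target with target policy smoothing, in the spirit
    of (30) and the min-operator target above."""
    with torch.no_grad():
        a_next = actor_targ(s_next)
        noise = (sigma * torch.randn_like(a_next)).clamp(-c, c)
        a_next = (a_next + noise).clamp(low, high)
        q_min = torch.min(critic1_targ(s_next, a_next),
                          critic2_targ(s_next, a_next))
        return r + gamma * q_min
```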
As discussed previously, target networks can be used to reduce the error accumulated over multiple updates, while policy updates on states with a high TD error may lead to divergent learning behavior. As a result, TD3 updates the policy network at a lower frequency than the critic networks in order to sufficiently reduce the value error before each policy update. Concretely, the policy and target networks are updated only after every Z updates of the critics.
By incorporating the target networks, experience replay, and the above-mentioned modifications, the critic loss in (24) can be restated as the weighted mean-squared TD error calculated over the training data, i.e., a minibatch of $K$ experiences drawn via prioritized sampling,

$L(\theta_i) = \frac{1}{K} \sum_{k=1}^{K} w_k\, \delta_{i,k}^2,$

where the TD error of each critic is $\delta_{i,k} = y_k - Q_{\theta_i}(s_k, a_k)$.
The policy gradient for the actor update in (25) can be restated in a similar minibatch fashion. Finally, the following updates are applied to the weights of the online critic networks,

$\theta_i \leftarrow \theta_i - \alpha_c \nabla_{\theta_i} L(\theta_i), \quad i \in \{1, 2\}, \qquad (34)$

and, after every $Z$ learning steps, the online actor and the target networks are updated according to

$\phi \leftarrow \phi + \alpha_a \nabla_{\phi} J, \qquad (35)$

$\theta_i' \leftarrow \tau \theta_i + (1 - \tau)\theta_i', \quad \phi' \leftarrow \tau \phi + (1 - \tau)\phi', \qquad (36)$

where $\alpha_c$ and $\alpha_a$ are the learning rates of the gradient descent algorithm and $\tau$ is the soft update rate.
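Putting the delayed updates together, a sketch of one learning iteration, reusing td3_target() and soft_update() from the sketches above; the optimizer pairing and the value of Z are assumptions:

```python
def td3_update(batch, actor, critic1, critic2, actor_targ, c1_targ, c2_targ,
               actor_opt, critic_opt, step, Z=2):
    """One TD3 iteration: always regress both critics toward the shared
    target (34); update the actor (35) and the target networks (36) only
    every Z critic updates. critic_opt is assumed to own the parameters
    of both critics."""
    s, a, r, s_next = batch
    y = td3_target(r, s_next, actor_targ, c1_targ, c2_targ)
    critic_loss = (((critic1(s, a) - y) ** 2).mean()
                   + ((critic2(s, a) - y) ** 2).mean())
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    if step % Z == 0:
        actor_loss = -critic1(s, actor(s)).mean()  # DPG objective on Q1
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
        for online, target in ((actor, actor_targ), (critic1, c1_targ),
                               (critic2, c2_targ)):
            soft_update(online, target)
```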
Algorithm 1 details the training of the DNNs employed by TD3, and the proposed TD3-based DR management strategy is outlined in Algorithm 2. After the training phase, we first load the weights of the online actor network trained by Algorithm 1. For a specific test day, at each time step $t$, the agent observes the current environment state and determines its DR management action according to the policy learned by TD3. The requested DR actions are then mapped to the inputs/outputs of the different DERs of the household (Section 3).
Algorithm 1 Training procedure of TD3
1: Initialize the critic networks $Q_{\theta_1}$ and $Q_{\theta_2}$ and the actor network $\mu_{\phi}$ with random weights $\theta_1$, $\theta_2$, and $\phi$.
2: Initialize the target networks with weights $\theta_1' \leftarrow \theta_1$, $\theta_2' \leftarrow \theta_2$, and $\phi' \leftarrow \phi$.
3: for each episode (i.e., day) do
4:   Obtain the initial state $s_1$ from the training set.
5:   Initialize a random Gaussian exploration noise $\epsilon$.
6:   for each time step $t$ (i.e., 30 min) do
7:     The HEMS agent selects action $a_t$ using (26).
8:     Execute $a_t$ in the environment, observe reward $r_t$ using (23), and transit to the new state $s_{t+1}$.
9:     Store $(s_t, a_t, r_t, s_{t+1})$ in the experience replay buffer.
10:    Sample a minibatch of $K$ experiences from the replay buffer.
11:    Compute the target actions using (30).
12:    Update the online critics using (34).
13:    if $t \bmod Z = 0$ then
14:      Update the online actor using (35).
15:      Update the target networks using (36).
16:    end if
17:  end for
18: end for
Algorithm 2 TD3-based DR management strategy
1: Load the DNN parameters of the online actor network trained by Algorithm 1.
2: for each test day do
3:   Obtain the initial state $s_1$ of the test day.
4:   for each time step $t$ do
5:     Set the DR management action as $a_t = \mu_{\phi}(s_t)$.
6:     Execute action $a_t$ in the environment, calculate the reward $r_t$, and transit to the new state $s_{t+1}$.
7:   end for
8: end for