3.3. RL Model Development Based on DQN
In this work, we plan to deploy the DRL model for HVAC control during the summer 2020 season, i.e., the HVAC system will operate in cooling mode only. The objective of the RL-based HVAC control is to learn an optimal cooling policy (policy ${\pi}^{*}$) that controls the variations in the indoor temperature while maintaining the homeowner's comfort and lowering operational cost. We assume that the indoor temperature (${T}_{t}^{in}$) at any time instant $t$ depends on the indoor temperature (${T}_{t-1}^{in}$), the outdoor temperature (${T}_{t-1}^{out}$), and the action taken (${a}_{t-1}$) (i.e., the set point in this case) at time instant $t-1$, and we formulate the HVAC control problem as a Markov Decision Process (MDP).
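The assumed transition dependence can be illustrated with a toy first-order thermal model in Python. This is only an illustrative stand-in for the building simulation model of Section 3.1; the function name and the constants `alpha` (envelope heat gain) and `beta` (cooling effect of the AC when ON) are made-up assumptions, not taken from this work.

```python
def next_indoor_temp(temp_in, temp_out, action_on, alpha=0.05, beta=1.5):
    """Toy transition T_t^in = f(T_{t-1}^in, T_{t-1}^out, a_{t-1}).

    Illustrative only: alpha and beta are invented constants; the paper's
    actual dynamics come from the building simulation model (Section 3.1).
    """
    drift = alpha * (temp_out - temp_in)   # heat exchange with outdoors
    cooling = beta if action_on else 0.0   # AC removes heat when ON
    return temp_in + drift - cooling

print(next_indoor_temp(74.0, 94.0, action_on=False))  # 75.0: indoor temp drifts upward
print(next_indoor_temp(74.0, 94.0, action_on=True))   # 73.5: AC pulls it down
```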
During offline training, i.e., in phase 2 (refer to Figure 1), the RL agent interacts with a building simulation model, whose development we described in Section 3.1. We define two control steps (also described in [18]): (1) the simulation step ($\Delta {t}_{s}$) and (2) the control step ($\Delta {t}_{c}$). The building simulation model simulates the building's thermal behavior every minute, i.e., $\Delta {t}_{s}=1$ min. The RL agent interacts with the building simulation every $\Delta {t}_{c}$ time steps ($\Delta {t}_{c}=k\Delta {t}_{s}$), observes the indoor environment, and takes an appropriate action. The observations of the building's indoor environment are modeled as a state ($S$) in the MDP formulation. For instance, a set of observations of the building's indoor environment at a time instant $t$, such as indoor temperatures and other factors such as electricity prices, forms the current state of the building's environment, i.e., ${S}_{t}$. In our case, it includes time-of-day information ($t$), the indoor temperature (${T}_{t}^{in}$), the outdoor temperature (${T}_{t}^{out}$), and a lookahead of electricity pricing (${P}_{t}=\{{p}_{t},{p}_{t+1},\dots ,{p}_{t+k},\dots \}$). Thus, the current state ${S}_{t}$ is defined as ${S}_{t}=\{t,{T}_{t}^{in},{T}_{t}^{out},{P}_{t}\}$. Similarly, the state at the previous control step is defined as ${S}_{t-\Delta {t}_{c}}=\{t-\Delta {t}_{c},{T}_{t-\Delta {t}_{c}}^{in},{T}_{t-\Delta {t}_{c}}^{out},{P}_{t-\Delta {t}_{c}}\}$.
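As an illustration, the state ${S}_{t}$ can be assembled as in the following minimal Python sketch. The field names, the lookahead length, and the flattening into a feature vector are our illustrative assumptions, not the implementation used in this work.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class BuildingState:
    """State S_t = {t, T_in, T_out, P_t} as defined in the text.

    Field names and the lookahead length are illustrative assumptions.
    """
    t: int                     # time-of-day information (e.g., minute index)
    temp_in: float             # indoor temperature T_t^in (deg F)
    temp_out: float            # outdoor temperature T_t^out (deg F)
    prices: List[float] = field(default_factory=list)  # price lookahead P_t

    def to_vector(self) -> List[float]:
        # Flatten into the feature vector a DQN would consume.
        return [float(self.t), self.temp_in, self.temp_out] + self.prices

s = BuildingState(t=600, temp_in=72.5, temp_out=88.0, prices=[0.10, 0.12, 0.15])
print(s.to_vector())  # [600.0, 72.5, 88.0, 0.1, 0.12, 0.15]
```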
In this work, we used the deep Q-learning approach described in [18,24] and utilized a deep Q-network (DQN), a neural network, as a function approximator. We built upon our previous work on DQN-based HVAC control for a single-zone house [30] and developed a DQN architecture for a two-zone house. There are two thermostats in the house, one for each zone, each capable of accepting set point values through APIs. Since Q-learning works well for a finite and discrete action space, we train the DQN on ON = 1/OFF = 0 actions for both zones and then translate the ON/OFF actions into the appropriate set point values using Equation (1), where $\Delta T$ is a small decrement factor (in case the action is ON) or increment factor (in case the action is OFF). The action ON ('1') signifies ON status for the AC, which will reduce the indoor temperature. To achieve this, we decrease the set point by $\Delta T$ relative to the current indoor temperature. In a similar fashion, the action OFF ('0') signifies the RL agent's intention to increase the indoor temperature by a small amount (i.e., $\Delta T$) by turning the AC OFF. As there are two zones and two possible AC states (ON/OFF), there are a total of four possible actions $A=\{{a}_{0},{a}_{1},{a}_{2},{a}_{3}\}$, where ${a}_{0}$ indicates OFF status for the AC in both zones, which is translated by setting a set point higher than the current indoor temperature in each zone. The other actions can be explained in the same way, as shown in Table 1.
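The action-to-set-point translation can be sketched as follows. The bit encoding of the four actions and the value of $\Delta T$ are illustrative assumptions here; the actual mapping and factor are given in Table 1 and Equation (1), respectively.

```python
DELTA_T = 2.0  # illustrative increment/decrement factor (deg F); see Equation (1)

def action_to_setpoints(action, temp_in_zone1, temp_in_zone2, delta_t=DELTA_T):
    """Translate a discrete action a_0..a_3 into per-zone set points.

    Encoding assumption (the actual mapping is defined in Table 1):
    the two bits of the action index give the ON/OFF status of zone 1
    and zone 2, so a_0 = both OFF and a_3 = both ON.
    """
    zone1_on = (action >> 1) & 1
    zone2_on = action & 1
    # ON  -> set point below current indoor temperature (AC cools)
    # OFF -> set point above current indoor temperature (AC idles)
    sp1 = temp_in_zone1 - delta_t if zone1_on else temp_in_zone1 + delta_t
    sp2 = temp_in_zone2 - delta_t if zone2_on else temp_in_zone2 + delta_t
    return sp1, sp2

print(action_to_setpoints(0, 72.0, 73.0))  # (74.0, 75.0): both zones OFF
print(action_to_setpoints(3, 72.0, 73.0))  # (70.0, 71.0): both zones ON
```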
As discussed in the earlier paragraph, during offline training, the RL agent interacts with the building simulation at every $\Delta {t}_{c}$ control interval, observes the indoor environment, and receives a reward (${r}_{t}$) as an incentive for taking the previous action, i.e., ${a}_{t-\Delta {t}_{c}}$. This reward is designed as a function consisting of two parts: (1) the cost of performing the action ${a}_{t-\Delta {t}_{c}}$ and (2) a comfort penalty. Equation (2) shows the reward function used in this work; it is inspired by the reward function used in the work by Wei et al. [18]. This is a negative reward function in which a reward value close to zero signifies better performance, i.e., less penalty, and more negative values signify poorer actions, i.e., more penalty. The first term in Equation (2) is the cost incurred due to the action taken by the RL agent at time $t-\Delta {t}_{c}$. This cost is calculated as an average over the time duration $[t-\Delta {t}_{c},t]$. The second term of the equation represents the average comfort violation over the same duration. Here, $UT$ represents the cooling set point and $LT$ is a lower temperature value such that $[LT,UT]$ represents a flexibility range of temperature below the cooling set point. The RL agent makes use of this flexible temperature band to save on costs by taking optimal actions such that the indoor temperature always remains below the cooling set point, i.e., within the homeowner's comfort. Similar arguments apply when the RL agent operates in heating mode; in that case, the heating set point is used to define the $[LT,UT]$ flexibility band. In this work, we used 68 °F and 74 °F as the heating and cooling set points, respectively:
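The two-part structure of the reward can be sketched in Python as below. The per-sample inputs, the out-of-band violation measure, and the `comfort_weight` parameter are illustrative assumptions; the exact form is given in Equation (2).

```python
def reward(energy_costs, indoor_temps, lt=71.0, ut=74.0, comfort_weight=1.0):
    """Sketch of the negative reward of Equation (2): action cost plus
    comfort penalty, both averaged over the interval [t - dt_c, t].

    The inputs are per-minute samples over the last control interval.
    The band [lt, ut] is illustrative (the paper uses UT = 74 F as the
    cooling set point); comfort_weight is an assumed trade-off factor.
    """
    avg_cost = sum(energy_costs) / len(energy_costs)
    # Comfort violation: degrees outside [lt, ut], averaged over the interval.
    violations = [max(0.0, T - ut) + max(0.0, lt - T) for T in indoor_temps]
    avg_violation = sum(violations) / len(violations)
    return -(avg_cost + comfort_weight * avg_violation)

print(reward([0.02, 0.02], [73.0, 73.5]))  # -0.02: in band, only energy cost
print(reward([0.02, 0.02], [75.0, 76.0]))  # more negative: band exceeded
```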
The goal of our RL agent is to learn a policy which minimizes the cost of operation while incurring little or no comfort violation. More specifically, under the MDP formulation, the RL agent should learn the policy (i.e., actions) which maximizes not only the current reward but also future rewards. This is represented as the total discounted accumulated reward in Equation (3) [24]. Here, $\gamma$ is a discount factor in the range $[0,1)$ that controls the weights of the future rewards, and $T$ represents the time step at which the episode ends. Setting $\gamma =0$ considers only the current reward and does not take future rewards into account; this is characterized as the RL agent being "short-sighted" by not accounting for future benefits (rewards), and it may not achieve the optimum solution. The rationale behind using future rewards is that sometimes a state with a low current reward can lead the RL agent to highly rewarded states in the future. In this work, we chose $\gamma =0.9$, which exponentially decreases the weights of the future rewards. In other words, the current reward still gets the highest weight while exponentially discounted future rewards are taken into account as well. The higher the $\gamma$ value, the better the RL agent's ability to account for future rewards and obtain the optimal solution; a lower $\gamma$ value reduces this ability and may not lead to the optimal solution.
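The effect of the discount factor can be seen in a few lines of Python. The function is a direct sketch of the discounted sum in Equation (3); the sample reward sequence is made up for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Total discounted accumulated reward of Equation (3):
    R_t = sum over k of gamma^k * r_{t+k}, up to the end of the episode."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [-1.0, -1.0, -1.0, -1.0]       # illustrative per-step rewards
print(discounted_return(rewards, gamma=0.9))  # weights 1, 0.9, 0.81, 0.729
print(discounted_return(rewards, gamma=0.0))  # -1.0: the "short-sighted" case
```

With $\gamma = 0.9$ every future reward contributes, but with exponentially shrinking weight; with $\gamma = 0$ only the immediate reward survives.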
In the case of Q-learning, the RL agent tries to learn an optimal action-value function ${Q}^{*}({s}_{t},{a}_{t})$ which maximizes the expected future rewards (represented by Equation (3)) for the action ${a}_{t}$ taken in the state ${s}_{t}$ by following policy $\pi$, i.e., ${max}_{\pi }E\left[{R}_{t}\mid {s}_{t},{a}_{t}\right]$ [24]. This can be defined as a recursive function using Bellman's equation, as shown in Equation (4) [18]. The Q-learning RL algorithm estimates this optimal action-value function via iterative updates over a large number of iterations using Equation (5), where $\eta$ represents a learning rate in the range $(0,1]$. For a large number of iterations and under an MDP environment, Equation (5) converges to the optimal ${Q}^{*}$ over time [18,24]:
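The iterative update of Equation (5) can be sketched with a small tabular implementation. The dictionary-based table, state labels, and default value of zero for unseen entries are our illustrative choices.

```python
def q_update(Q, s, a, r, s_next, actions, eta=0.1, gamma=0.9):
    """One iterative update of Equation (5):
    Q(s,a) <- Q(s,a) + eta * (r + gamma * max_a' Q(s',a') - Q(s,a)).

    Q is a dict keyed by (state, action); unseen entries default to 0.
    This is the tabular method that becomes intractable for continuous
    states, motivating the DQN function approximator.
    """
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + eta * (r + gamma * best_next - old)

Q = {}
actions = [0, 1]
q_update(Q, s=0, a=1, r=-1.0, s_next=1, actions=actions)
print(Q[(0, 1)])  # -0.1: one step of size eta toward the TD target
```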
The iterative Q-learning strategy of Equation (5) uses a tabular method to store state-action Q values. This method works well with a discrete state-action space but quickly becomes intractable for a large state-action space. For instance, in our case, the indoor and outdoor temperatures are continuous quantities that make the state-action space infinite. This is where the DQN is useful: it uses a neural network as a function approximator to approximate the state-action value $Q(\cdot)$. Algorithm 1 presents the deep Q-learning procedure used in this work, which is inspired by Wei et al. [18] and Mnih et al. [24]. This approach uses the DQN neural network structure shown in Figure 5. Mnih et al. [24] implemented a Q-learning strategy using deep Q-networks (DQN) that uses two neural networks and experience replay. The first network $Q$ is called the "evaluation network", and the second network $\widehat{Q}$ is called the "target network" (refer to Figure 5). Equation (5) requires two $Q$ values to be computed. The first is ${Q}_{t}({s}_{t},{a}_{t})$, the Q-value of the action ${a}_{t}$ in the current state ${s}_{t}$; the second is ${Q}_{t}({s}_{t+1},{a}_{t+1})$, used to calculate the maximum of the Q-values of the next state ${s}_{t+1}$. The evaluation network ($Q$) is used to evaluate ${Q}_{t}$ using one forward pass with ${s}_{t}$ as the input. The term "${r}_{t+1}+\gamma \underset{{a}_{t+1}}{max}{Q}_{t}({s}_{t+1},{a}_{t+1})$" represents a "target" Q-value that the RL agent should learn. Using the same network to compute both Q-values, i.e., ${Q}_{t}({s}_{t},{a}_{t})$ and ${Q}_{t}({s}_{t+1},{a}_{t+1})$, would make the target unstable due to the frequent updates made to the evaluation network. To avoid this, Mnih et al. suggested using two separate neural networks such that the evaluation network performs frequent updates and copies its updated weights to the target network after a certain number of updates. Thus, the target network updates itself slowly compared to the evaluation network, which improves the stability of the learning process.
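The two-network idea can be sketched with tiny linear stand-ins for $Q$ and $\widehat{Q}$. The linear form, dimensions, and function names are illustrative simplifications of the DQN in Figure 5, not the architecture used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny linear "networks" standing in for the evaluation net Q and the
# target net Q-hat; 4 state features and 4 actions follow the text,
# while the linear form is an illustrative simplification.
STATE_DIM, N_ACTIONS = 4, 4
theta = rng.normal(size=(STATE_DIM, N_ACTIONS))   # evaluation weights
theta_hat = theta.copy()                          # target weights

def q_values(weights, s):
    # One forward pass: Q-values of all actions for state s.
    return s @ weights

def td_target(r, s_next, gamma=0.9):
    # Target "r + gamma * max_a' Q-hat(s', a')" is computed with the
    # slowly updated target network, which keeps the target stable.
    return r + gamma * q_values(theta_hat, s_next).max()

def sync_target():
    # Periodic copy of evaluation weights into the target network.
    global theta_hat
    theta_hat = theta.copy()

print(td_target(-1.0, np.ones(STATE_DIM)))
```

Between synchronizations, gradient updates change `theta` but leave `theta_hat` (and hence the target) fixed.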
The training starts by initializing the weights of both networks (i.e., $Q$ and $\widehat{Q}$) randomly, reserving the experience replay buffer ($MB$), and initializing other necessary variables (lines 1–12 in Algorithm 1). We train the RL agent for multiple episodes, where each episode consists of multiple training days (e.g., two months of summer). At the beginning of each episode, we reset the building environment of the simulator, get the initial observation (${S}_{pre}$) from the building environment, and randomly choose some initial action (lines 15–18 in Algorithm 1). Lines 19–38 of the algorithm represent a loop in which the building simulation model simulates the thermal behavior of the building at each ${t}_{s}$ time step. If the current time step ${t}_{s}$ is a control step, i.e., ${t}_{s}\ \mathrm{mod}\ \Delta {t}_{c}=0$, then we fetch the current state of the building's indoor environment (${S}_{curr}$) and calculate the reward ($r$) as a response to the action taken at the previous control step (lines 21–22 in Algorithm 1). Furthermore, we store the tuple $<{S}_{pre},a,r,{S}_{curr}>$ in the experience replay memory ($MB$) (line 23). Next, a minibatch of tuples is randomly chosen from this replay memory and used for the minibatch update (lines 24–25 of Algorithm 1), and if it is time to update the target network, the weights of the evaluation network are copied to the target network (line 26). Lines 27–32 choose the action using a decaying $\epsilon$-greedy strategy: in the beginning, the RL agent explores random actions with high probability (line 28), whereas, as training progresses, the probability of exploitation increases and the RL agent uses the learned actions more than random actions (line 30). Line 36 continues executing the earlier action if the current time step is not a control step.
Algorithm 1 Neural network-based Q-learning
1: Env ← Setup simulated building environment
2: MB ← Experience replay buffer
3: Δt_s ← Simulation time step (here, 1 min)
4: Δt_c ← k × Δt_s, RL's control step interval (here, k = 15)
5: N_Days ← Number of operating days for training
6: N_episodes ← Number of training episodes
7: TS_max ← N_Days × 24 × (60/Δt_s)
8: nbatch ← mini-batch size
9: ϵ ← initial exploration rate
10: Δϵ ← exploration rate decay
11: Q(s, a; θ) ← Initialize(Q, θ)
12: $\widehat{Q}(s,a;\widehat{\theta})$ ← Initialize($\widehat{Q},\widehat{\theta}$)
13:
14: for episode = 1 to N_episodes do
15:   Env ← ResetBuildingSimulation(Env)
16:   S_pre ← ObserveBuildingState(Env, 0)
17:   a ← GetInitialAction(Env)
18:   Setpoints ← ConvertActionToSetpoints(S_pre, a)
19:   for t_s = 1 to TS_max do
20:     if t_s mod Δt_c = 0 then
21:       S_curr ← ObserveBuildingState(Env, t_s)
22:       r ← CalReward(Env, S_pre, a, S_curr)
23:       AppendTransition(MB, 〈S_pre, a, r, S_curr〉)
24:       [T] ← DrawRandomMiniBatch(MB, nbatch)
25:       TrainQNetwork(Q(·; θ), [T])
26:       if time to update target: $\widehat{Q}(\cdot;\widehat{\theta})\leftarrow Q(\cdot;\theta )$
27:       if GenerateRandomNumber([0, 1]) < ϵ then
28:         a ← ChooseRandomAction([a_0, a_1, a_2, a_3])
29:       else
30:         a ← $arg\underset{{a}^{\prime}}{max}Q({S}_{curr},{a}^{\prime};\theta )$
31:       end if
32:       ϵ ← ϵ − Δϵ
33:       Setpoints ← ConvertActionToSetpoints(S_curr, a)
34:       S_pre ← S_curr
35:     end if
36:     SimulateBuilding(Env, t_s, Setpoints)
37:   end for
38: end for
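The decaying $\epsilon$-greedy selection in Algorithm 1 can be sketched in a few lines of Python. The Q-values, decay step, and variable names below are illustrative assumptions, not the values used in this work.

```python
import random

def select_action(q_values, epsilon):
    """Epsilon-greedy choice: explore a random action with probability
    epsilon, otherwise exploit the action with the highest Q-value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

# The exploration rate decays by a fixed step after each control step,
# shifting the agent from exploration toward exploitation over training.
epsilon, d_epsilon = 1.0, 0.001
q = [-0.3, -0.1, -0.5, -0.2]   # illustrative Q-values for a_0..a_3
for _ in range(5):
    a = select_action(q, epsilon)
    epsilon = max(0.0, epsilon - d_epsilon)

print(select_action(q, 0.0))  # 1: with epsilon = 0, the argmax action is chosen
```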
