Reinforcement Learning-Based School Energy Management System

Abstract: Energy efficiency is key to a reduced carbon footprint, savings on energy bills, and sustainability for future generations. In hot-climate countries such as Qatar, for instance, buildings are high energy consumers because of the air conditioning required by high temperatures and humidity. Optimizing the building energy management system reduces unnecessary energy consumption, improves indoor environmental conditions, maximizes building occupants' comfort, and limits building greenhouse gas emissions. However, energy consumption cannot be lowered at the expense of the occupants' comfort, and solutions must take this tradeoff into account. Conventional building energy management methods struggle with the high-dimensional and complex control environment. In recent years, deep reinforcement learning, which applies neural networks for function approximation, has shown promising results in handling such complex problems. In this work, a deep reinforcement learning agent is proposed for controlling and optimizing a school building's energy consumption. It is designed to search for optimal policies that minimize energy consumption, maintain thermal comfort, and reduce indoor contaminant levels in a challenging 21-zone environment. First, the agent is trained on the baseline strategy in a supervised learning framework. After cloning the baseline strategy, the agent learns with proximal policy optimization in an actor-critic framework. The performance is evaluated on a simulated school model considering thermal comfort, CO2 levels, and energy consumption. The proposed methodology achieves a 21% reduction in energy consumption, 44% better thermal comfort, and healthier CO2 concentrations over a one-year simulation, with reduced training time thanks to the integration of behavior cloning.


Introduction
An arid climate prevails in the Arabian Gulf region, characterized by mild, pleasant winters; hot, humid summers; and sparse rainfall. In Qatar, summer temperatures exceed 45 °C and, on average, the high temperature exceeds 27 °C in the rest of the seasons. Therefore, air conditioning (AC) in Qatar is more of a necessity than a luxury and accounts for about 80% (the highest in the world) of building energy consumption. The AC systems run nonstop throughout the year to maintain thermal comfort. To achieve sustainable development and a greener environment, energy efficiency plays a crucial role, and heating, ventilation, and air conditioning (HVAC) control paves the way forward [1]. However, drastic reductions in energy consumption deteriorate the indoor comfort quality, which poses a comfort versus consumption dilemma. The main goal and challenge of any energy management system (EMS) is to achieve the right balance between occupant comfort and building energy requirements.
Though the occupants' comfort depends on various factors, it is commonly reduced to thermal comfort, for which HVAC controllers are usually optimized, while air quality is seldom considered [2]. In practice, carbon dioxide (CO2) levels are commonly used as indoor air quality (IAQ) indicators. CO2 levels relate to human health and productivity. A school study in [3] concluded that signs of headache and dizziness were prevalent in classrooms with high CO2 levels. Additionally, student performance was better in environments with lower CO2 levels.
To address these problems, building control algorithms were the subject of extensive research. The methods include classical control, predictive control, and intelligent control [4]. Despite the recent advances in intelligent and predictive controls, the classical on/off and PID control methods [5,6] are the most implemented in the field, due to their simplistic nature. However, these procedures cannot account for the system complexity, stochastic nature, and nonlinear dynamics. Predictive methods try to solve these inherent issues but require complex building modeling and rely on experts' knowledge [7,8]. Therefore, they are hard to generalize over various building environments. Intelligent control methods, instead, are learning-based and model-free, and hence they do not assume complex models underlying the building systems. In these control methods, optimal policies are derived based on collected data, thus alleviating the daunting process of designing a complex mathematically accurate model, which makes them less affected by modeling inaccuracies.
One such learning-based methodology is reinforcement learning (RL), a model-free framework for solving optimal control problems stated as Markov Decision Processes (MDPs) [9]. RL is considered the most suitable machine learning paradigm for this task. Building control matches RL well, since there is an environment to control, hidden dynamics to learn, and sequential decisions to make. RL has gained a lot of attention in the past few years due to successes in playing Atari games and then beating the world Go champion. The combination of neural networks as function approximators with the RL paradigm was the key. Since then, RL has been considered a viable solution for diverse control problems, in particular building energy management systems. The RL discipline is not new, and neither are its applications in building control. Previously, however, RL was constrained to computationally cheap algorithms, such as tabular Q-learning and linear function approximators, and was forced to consider small state/action spaces. This paper aims to deliver an optimum solution to a multi-objective and multi-zone building energy management (BEM) problem that provides a comfortable indoor environment in a school building while reducing its energy consumption and, hence, lowering its operational costs. This study leverages a deep reinforcement learning (DRL) framework to develop an artificially intelligent agent capable of handling the tradeoffs between building indoor comfort and energy consumption. To the best of the authors' knowledge, this study is the first to apply a DRL-based, behavioral-cloning-enhanced technique to resolve the interactions between thermal comfort, energy consumption, and indoor air quality in a complex multi-zone environment (21 zones). The experiments were conducted in a school environment in Qatar.
The proposed solution handles the tradeoff between energy consumption and occupants' comfort well. The proposed intelligent control can generalize over different weather conditions throughout the year while maintaining good thermal comfort, excellent indoor air quality, and saving more than 20% of the school's energy consumption, compared to a rule-based baseline control strategy.
The main contributions of this work are summarized as follows:
• Propose a proximal policy optimization (PPO) algorithm for energy optimization and comfort control that maintains occupants' comfort while reducing energy consumption.
• Use behavioral cloning to instill the baseline behavior so that the proposed algorithm converges faster than when starting from random decisions.
• Develop a complex 21-zone school simulation system with EnergyPlus and thoroughly investigate the performance through meticulously designed experiments.
This paper is organized as follows. Apart from the introduction (Section 1), a literature review is presented in Section 2. Section 3 describes the RL approach and its application to the school case study environment. Section 4 presents and discusses the obtained results. The concluding remarks are given in Section 5.

Literature Review
The particularity of the Arabian Gulf region climate led to several studies on improving energy efficiency in a desert climate. Buildings account for the majority of energy consumption due mostly to the cooling needs. Building retrofitting was proposed to reduce old building energy demand [10], optimal control of AC was investigated [11], and a multi-objective genetic algorithm was investigated in a Qatari house setup [12]. Until recently, energy efficiency was not considered in the region. With the fall of oil prices, the local governments started raising awareness and designing efficiency programs. The arid environment is indeed challenging, but the highly subsidized electricity tariffs and the limited financial incentives hinder the efforts. Authors in [13] analyzed electricity load profiles in Qatar. They found that approximately 50% of energy demand is attributed to cooling in summer. In [14], the impact of retrofitting and behavioral changes on energy consumption in Abu Dhabi was discussed. It is crucial to raise awareness among the community, since most present buildings do not conform with the efficiency guidelines, and citizens use cooling 24/7 even when the building is empty. Thus, there is a crucial need for strategies to decrease buildings' electricity consumption while maintaining their residents' comfort.
The recent breakthroughs of RL [15][16][17], due to the powerful combination of deep learning and RL algorithms in game playing, attracted attention and spurred multiple research efforts [18,19]. Before that, tabular Q-learning and its variants were widely applied as RL-based controllers for energy optimization [20][21][22][23][24][25][26][27]. For instance, in [20], Q-learning was employed to lower energy consumption by 10% compared to programmable control. Authors in [25] coupled an autoencoder with Q-learning to reduce consumption by 4-9% in winter and 9-11% in summer compared to a constant setpoint policy. Authors in [26] applied State-Action-Reward-State-Action (SARSA) to control the environment based on fixed setpoints and reduce energy consumption, while [27] relied on linear approximation of the state-action value. These methods cannot handle large state/action spaces. DRL handles the curse of dimensionality better by replacing tabular search and simple function approximators with neural networks. DRL processes high-dimensional raw data without the need for preprocessing and feature engineering, and hence can accomplish end-to-end control [28][29][30][31][32][33][34].
In [28], the authors applied both tabular and batch Q-learning with a neural network to realize 10% lower energy consumption compared to rule-based control. In [31], a combination of a Long Short-Term Memory (LSTM) neural network and an actor-critic architecture achieved around 15% thermal comfort improvement and a 2.5% energy efficiency improvement when compared to fixed strategies. Predicted mean vote (PMV) was used as the thermal comfort indicator, and the testbed was a one-zone office space with two days of simulation for training and five for validation. Authors in [30] compared DQN, regular Q-learning, and rule-based on/off control in reducing HVAC consumption while maintaining a prespecified comfortable temperature (24 °C). The algorithms were evaluated on three buildings simulated with EnergyPlus [35] (one-zone, four-zone, and five-zone models). The DQN bested the other methods with over 20% energy reduction. Authors in [36] resorted to the Asynchronous Advantage Actor-Critic (A3C) algorithm, which was trained to reduce energy and ensure good thermal comfort, measured by the predicted percentage of dissatisfied (PPD) index, in a simulation of a workplace building in Pittsburgh. Energy consumption was reduced by 15% compared to their base case. Similar to our work, the authors in [33] used double Q-learning to optimize for IAQ and thermal comfort while reducing energy consumption by 4-5% in a laboratory and classroom simulation setup. To the best of the authors' knowledge, and based on the literature review, the environments considered in the reviewed studies do not exceed five zones [30]. These papers also focused on thermal comfort and usually defined it as fixed temperature preferences.
In this work, the building's energy consumption, thermal comfort, and indoor air quality are optimized in a more complex environment comprising 21 zones, where the agent selects the optimal decisions from 72 possible action combinations at each time step.
Previous attempts applied DQN [15] variants and its continuous extension, deep deterministic policy gradient (DDPG) [37]. However, PPO algorithms [38], the leading policy search algorithms, have not been studied in the context of energy efficiency. PPO has shown promising results in physical control problems, providing more stable learning and simpler hyperparameter tuning than previous policy gradient algorithms. PPO achieved close to or above state-of-the-art performance on a wide range of tasks, becoming the default RL algorithm at OpenAI. In this paper, we apply PPO to control a school building simulation, achieving excellent indoor comfort and significantly reducing energy consumption.

Proposed RL Methodology
In contrast to the known machine learning paradigms, RL deals with sequential decision making under uncertainty. In supervised learning, the data is labeled, and thus the right decisions are previously known, whereas in RL setup, the artificial agent learns from experience. Based on scalar feedback, it updates its behavior through trial and error. Different from unsupervised learning, the agent has the reward feedback. Furthermore, the RL agent generates data and experience while understanding the environment: in this work, the simulated school via EnergyPlus. The goal in RL is to maximize future returns. The agent searches for the optimal sequence of decisions. When judging a situation, the agent takes into account the possible future effects of the current decision.
To develop the right strategy, the agent explores the environment depending on these essential components:
• States describing agent and/or environment position.
• Actions affecting the environment.
• Rewards as feedback from the environment on the chosen action.
MDPs are the mathematical framework for RL. An MDP is a tuple of states s, actions a, transition probabilities p, reward function r, and discount factor γ: <s, a, p, r, γ>. Usually, due to the lack of knowledge of the environment dynamics (p), model-free methods are used to estimate the value and policy functions. Value functions assess how good the current state (V(s)) or state/action pair (Q(s,a)) is. The policy can be derived by selecting the actions that maximize the Q value (acting greedily). Alternatively, via policy gradient algorithms, the policy can be optimized directly. Value-based methods learn the optimal policy by deriving the value function, as in Q-learning. In contrast, policy-based methods estimate the optimal strategy directly by optimizing the policy parameters, as in REINFORCE [39]. Actor-critic is a combination of both methods.
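The two notions used above, the discounted return that the agent maximizes and greedy action selection from Q values, can be illustrated with a minimal Python sketch. The state and action names and the Q values are hypothetical and chosen only for illustration; they are not taken from the school environment.

```python
from typing import Dict, List, Tuple


def discounted_return(rewards: List[float], gamma: float) -> float:
    """Return sum_k gamma^k * rewards[k], accumulated from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def greedy_action(q: Dict[Tuple[str, str], float], state: str, actions: List[str]) -> str:
    """Act greedily: pick the action with the largest Q(state, action)."""
    return max(actions, key=lambda a: q[(state, a)])


# Hypothetical Q table for a single state with two candidate actions.
q_table = {("warm", "cool_more"): 1.0, ("warm", "hold"): 0.2}
best = greedy_action(q_table, "warm", ["cool_more", "hold"])
```

With a discount factor below one, rewards further in the future contribute less to the return, which is how the agent weighs the future effects of the current decision.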
As illustrated by Figure 2, in actor-critic methods, both the policy and value functions are estimated in order to learn a good policy. The actor represents the policy, while the critic represents the value function. The critic's estimates guide the learning through the temporal difference (TD) error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. The general update rule is given by Equation (1):
$$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_t + \gamma V(s_{t+1}) - V(s_t) \right] \tag{1}$$
where $V$ is the value function, $s_t$ is the state at time $t$, $\alpha$ is the step size, $\gamma$ is the discount factor, and $r_t$ is the reward at time $t$. If the error is positive, the current behavior is encouraged, and the probability of selecting the recent action increases by means of the policy gradient theorem. DDPG achieved good results in continuous control tasks. However, selecting the right hyperparameters is tricky, which is common in policy gradient methods. Trust region policy optimization (TRPO) algorithms iteratively optimize policies while guaranteeing improvement over the old policy [40]. TRPO algorithms are on-policy algorithms, where the agent's behavior is updated according to its current behavior. They are more stable than DDPG, and they relax the difficulty of choosing a precise step size, with fewer hyperparameters to tune. Constrained to a certain degree of change from the old policy to the new one, the policy is updated modestly, with small changes at a time, by maximizing a surrogate objective, as shown in Equation (2):
$$\max_{\theta} \; \mathbb{E}_t \left[ \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} A_t \right] \quad \text{subject to} \quad \mathbb{E}_t \left[ \mathrm{KL}\left[ \pi_{\theta_{old}}(\cdot \mid s_t), \pi_{\theta}(\cdot \mid s_t) \right] \right] \le \delta_{\mathrm{KL}} \tag{2}$$
where $\pi_{\theta}$ is the policy (actor) function, which gives the probability of selecting $a_t$ given $s_t$; $A_t = Q(s_t, a_t) - V(s_t)$ is the advantage, which helps reduce variance; and KL is the Kullback-Leibler divergence. Policy changes are constrained by $\delta_{\mathrm{KL}}$, and the difference between the old and new policies is measured in terms of the Kullback-Leibler divergence.
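The critic update described above can be sketched in a few lines of Python. This is a tabular illustration only: the paper's critic is a neural network, and the state names, step size, and discount factor below are assumptions made for the example.

```python
from typing import Dict


def td_error(reward: float, value_s: float, value_next: float, gamma: float) -> float:
    """TD error: reward + gamma * V(s') - V(s)."""
    return reward + gamma * value_next - value_s


def td_update(values: Dict[str, float], s: str, s_next: str,
              reward: float, alpha: float, gamma: float) -> float:
    """Move V(s) a step of size alpha along the TD error and return that error."""
    delta = td_error(reward, values[s], values[s_next], gamma)
    values[s] += alpha * delta
    return delta


# Hypothetical two-state value table.
values = {"a": 0.0, "b": 1.0}
delta = td_update(values, "a", "b", reward=1.0, alpha=0.5, gamma=0.9)
```

A positive `delta` means the outcome was better than the critic expected, which is the signal the actor uses to make the chosen action more likely.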
TRPO has its disadvantages too. The monotonic improvement guarantee requires heavy computations involving the Fisher information matrix and conjugate gradient steps derived from the KL divergence. TRPO's lead author later proposed proximal policy optimization (PPO) algorithms to alleviate the computation while conserving TRPO's stability. Since then, PPO has become OpenAI's default algorithm. Despite its simplicity, it achieves performance comparable to, and sometimes even better than, state-of-the-art approaches. The most interesting feature of PPO is the ease of tuning, a characteristic rarely seen in RL research. In PPO, the surrogate objective is clipped. The policy ratio $r_t(\theta)$ is constrained to the range $[1 - \varepsilon, 1 + \varepsilon]$ to limit fluctuations between the old and new strategies ($\varepsilon$ is a hyperparameter). The objective function then selects the lower bound, or pessimistic estimate, as shown in Equation (3) and Equation (4):
$$L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) A_t, \; \mathrm{clip}\left( r_t(\theta), 1 - \varepsilon, 1 + \varepsilon \right) A_t \right) \right] \tag{3}$$
$$r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \tag{4}$$
Table 1 shows the PPO algorithm in actor-critic style.
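The clipping step can be illustrated with a small, dependency-free Python sketch of the clipped surrogate objective. The value of 0.2 for the clipping hyperparameter is an assumption (a commonly used default), not a value reported in this paper, and the function works on plain lists rather than on the PyTorch tensors used in the actual agent.

```python
import math
from typing import Sequence


def clipped_surrogate(logp_new: Sequence[float], logp_old: Sequence[float],
                      advantages: Sequence[float], eps: float = 0.2) -> float:
    """Average of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) over a batch."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                      # new-to-old policy probability ratio
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * adv, clipped * adv)       # pessimistic (lower) estimate
    return total / len(advantages)


# Two hypothetical samples: ratio 1.5 with A = +1, and ratio 0.5 with A = -1.
objective = clipped_surrogate([math.log(1.5), math.log(0.5)], [0.0, 0.0], [1.0, -1.0])
```

In the first sample the gain is capped at 1.2 instead of 1.5, and in the second the penalty is taken at the clipped ratio 0.8, so large policy changes are never rewarded beyond the clipping range.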

Behavior Cloning
Behavior cloning (BC) is a form of imitation learning in which the agent learns a policy through supervised learning. The algorithm collects an expert's knowledge or behavior, usually as a set of state-action pairs. The data is then fed to the agent to instill the expert's behavior. This is a supervised learning task: the agent is trained to match states with actions. In the proposed methodology, demonstrations are gathered from the simulation following the baseline strategy. As in RL, a decision is made, and the environment's reaction is recorded. The difference here is that decisions are based on the baseline behavior rather than drawn from the agent's policy. The same interaction pipeline is run, and the recorded data is stored. The resulting state-action pairs are used to train the agent to mimic the baseline strategy. This forces the agent to follow the baseline decisions and use them later on as a benchmark. Hence, when selecting a new action, it evaluates its potential compared to the baseline. We gather data from a year of simulation, then the baseline behavior is cloned before the agent is trained in the usual trial-and-error framework of RL. The advantage of BC is that the agent learns the desired behavior without interacting with the environment. Subsequently, the agent interacts with the environment as previously described and searches for better policies. The difference is that, instead of starting with random, unreliable actions in the exploration phase, the agent has the baseline behavior to build upon. Thus, erratic behaviors are avoided and training time is reduced.
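The cloning pipeline (run the baseline, record state-action pairs, fit a policy to them) can be sketched with a deliberately simple tabular stand-in. In the paper the cloned policy is a neural network trained by supervised learning; here a lookup table of the most frequent action per state plays that role. The occupancy hours (7:00 to 15:00) and the use of the hour of day as the only state are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Tuple


def baseline_policy(hour: int) -> int:
    """Hypothetical rule-based baseline: 21 °C setpoint when occupied, 28 °C otherwise."""
    return 21 if 7 <= hour < 15 else 28


def collect_demonstrations(policy: Callable[[int], int],
                           states: List[int]) -> List[Tuple[int, int]]:
    """Record (state, action) pairs while the baseline makes the decisions."""
    return [(s, policy(s)) for s in states]


def clone_policy(demos: List[Tuple[int, int]]) -> Dict[int, int]:
    """Supervised fit in tabular form: keep the most frequent action for each state."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for state, action in demos:
        counts[state][action] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}


# One week of hourly demonstrations from the baseline.
demos = collect_demonstrations(baseline_policy, [h % 24 for h in range(24 * 7)])
cloned = clone_policy(demos)
```

The cloned table reproduces the baseline on every state it has seen, which is the starting point from which the RL phase then searches for better policies.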

School Testbed Control Framework
A simulated environment was developed based on a real school in Qatar, which is considered the case study testbed. The school architecture, a typical Qatari school, was organized into 21 zones, which were selected based on their common air conditioning configuration and control. The zones correspond to classrooms, offices, laboratories, and other facilities. The school layout is presented in appendix 0, specifying the zones. For instance, the air handling unit (AHU) 17 controls the gym, AHUs 8 to 15 control classrooms, and AHU 4 controls the hall. The simulation embeds the school's orientation and exposure to the sun and also the weather of the region. The RL agent is trained and evaluated using this testbed with typical weather conditions covering a whole year. The simulation sampling time is 15 minutes. EnergyPlus is used for this task.
EnergyPlus is a fully integrated building and HVAC simulation program developed by the U.S. Department of Energy. It models buildings, heating, cooling, lighting, ventilation, and other energy flows. It is also used for load calculations from energy use and for modeling natural ventilation, photovoltaic systems, thermal comfort, water use, etc. Besides energy consumption, the simulation software can also be used to calculate the following variables:
• Indoor temperatures
• Needs for heating and cooling
• Consumption needs of HVAC systems
• Natural lighting needs of the occupants
• Interior comfort of the inhabitants
• Levels of ventilation
As shown in Figure 3, the first step is to construct the 3D model of the building with the SketchUp software. Then the various zones are defined with their loads and their controls using the OpenStudio software. Finally, the model is exported to an ".idf" file, the file format EnergyPlus uses for the building model under study.
Once the modeling is finished, a Python program is developed for co-simulation. The proposed framework is developed in Python, and the communication between EnergyPlus and the agent is provided by the PyEp library [41]. The intelligent controller is composed of two multilayer perceptron (MLP) networks: one for the actor and the other for the critic. The neural networks are developed in PyTorch. Each network comprises only two hidden layers of size 256 with ReLU activations, defined in Equation (5):
$$\mathrm{ReLU}(x) = \max(0, x) \tag{5}$$
Adam optimizer [42] is applied with a learning rate of 3e-4. As shown in Figure 4, at every time step, based on the environment state, the agent estimates the state-value function (critic) on one hand. On the other hand, it decides the optimal course of action (actor). Then, it receives a feedback signal and adjusts its behavior accordingly.

Baseline
In the baseline strategy, the temperature setpoint is 21 °C during working hours and 28 °C when the building is unoccupied. The CO2 levels are maintained under 1000 ppm at night and under 700 ppm during the day.

States
At every time step, the agent observes the environment to construct the state and act upon it. The state comprises the temperature and relative humidity of each zone, the outdoor temperature and relative humidity, and the time step information. We opted for minimal information to ensure ease of implementation in the real world. The state $s_t$, at time $t$, is then determined using (6).
All these variables are normalized to the range of [0,1].
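A minimal sketch of how such a normalized state vector could be assembled is given below. The ordering of the entries, the normalization bounds, and the encoding of the time step as a fraction of the day are assumptions made for illustration; only the list of observed variables, the 21 zones, and the 15-minute sampling time (96 steps per day) come from the text.

```python
from typing import List, Sequence


def min_max(x: float, lo: float, hi: float) -> float:
    """Scale x from [lo, hi] to [0, 1], clamping values that fall outside the bounds."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return min(max((x - lo) / (hi - lo), 0.0), 1.0)


def build_state(zone_temps: Sequence[float], zone_rh: Sequence[float],
                out_temp: float, out_rh: float, step: int,
                steps_per_day: int = 96) -> List[float]:
    """Concatenate normalized zone and outdoor measurements with a time-of-day feature."""
    state = [min_max(t, 10.0, 40.0) for t in zone_temps]    # assumed indoor bounds (°C)
    state += [min_max(h, 0.0, 100.0) for h in zone_rh]      # relative humidity (%)
    state.append(min_max(out_temp, 0.0, 50.0))              # assumed outdoor bounds (°C)
    state.append(min_max(out_rh, 0.0, 100.0))
    state.append((step % steps_per_day) / steps_per_day)    # fraction of the day elapsed
    return state


# Hypothetical observation at midday (step 48 of 96) for 21 identical zones.
s = build_state([25.0] * 21, [50.0] * 21, out_temp=45.0, out_rh=60.0, step=48)
```

With 21 zones, this layout gives a 45-dimensional input (21 temperatures, 21 humidities, two outdoor readings, and one time feature).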

Actions
The action $a_t$ decides, for each zone, the setpoints of the temperature (°C) and CO2 (ppm), as shown in (7).

Reward
The reward at any time $t$ is a scalar value, $r_t$, designed to motivate the optimal behavior. The objective is to reduce energy consumption and maintain good thermal comfort and indoor air quality. Therefore, the reward is composed of two terms, an energy-related term and a comfort-related term, as in (8) and (9), where the two weighting coefficients are both taken equal to 0.5.
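The weighting of the two terms can be illustrated with a short Python sketch. The exact expressions of (8) and (9) are not reproduced here: the sketch assumes that both terms are non-negative costs already scaled to comparable ranges and that the reward is their negated weighted sum, so that lower consumption and lower discomfort both increase the reward. Only the equal weights of 0.5 come from the text.

```python
def reward(energy_cost: float, discomfort: float,
           w_energy: float = 0.5, w_comfort: float = 0.5) -> float:
    """Negated weighted sum of an energy term and a comfort term (assumed form)."""
    return -(w_energy * energy_cost + w_comfort * discomfort)


# Hypothetical normalized values for one time step.
r_good = reward(energy_cost=0.4, discomfort=0.0)
r_bad = reward(energy_cost=0.4, discomfort=0.6)
```

With equal weights, a unit of discomfort costs the agent as much as a unit of (normalized) energy, which is the tradeoff the agent has to balance.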

Comfort
Comfort is divided into two categories: thermal and hygienic comforts.

Thermal Comfort
Thermal comfort is defined here by means of the predicted mean vote (PMV). PMV is an index, developed by Fanger, that aims to predict the mean value of the votes of a group of occupants on a seven-point thermal sensation scale, as shown in Figure 5. PMV is based on heat-balance equations and empirical studies of skin temperature to define comfort. Thermal equilibrium is obtained when an occupant's internal heat production equals their heat loss. A PMV equal to zero represents thermal neutrality. Fanger's equations are used to calculate the PMV of a group of subjects for a particular combination of air temperature, mean radiant temperature, relative humidity, airspeed, metabolic rate, and clothing insulation. PMV is a rigorous index for comparing the performance of different approaches. Since the PMV is a robust measure and its values are easily understandable, we chose it with later model deployment and tuning in mind. In a real-world implementation, it will be replaced by the occupants' feedback: the occupant will select a value from the seven-point PMV scale. We hypothesize that the PMV reflects occupants' comfort well enough. Since the reward is a scalar feedback signal, we reduce the comfort to the average over the zones. Values closer to zero suggest good comfort, and thus we evaluate discomfort as the absolute value of the average. The thermal comfort interval of [−0.5,0.5] is considered optimal; therefore, no penalties are incurred by the agent within it.
In the present study, the thermal discomfort is calculated using Equation (10), based on the PMV averaged over the 21 zones, as given by Equation (11):
$$\overline{PMV} = \frac{1}{21} \sum_{i=1}^{21} PMV_i \tag{11}$$
where $PMV_i$ is the PMV of zone $i$. Figure 6 illustrates the relationship between the discomfort and the PMV average value as defined by Equation (10) and Equation (11).
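The zone averaging and the penalty-free comfort band can be sketched as follows. The exact form of Equation (10) is not reproduced here; the sketch assumes the discomfort is zero inside [−0.5, 0.5] and equal to the absolute value of the zone-averaged PMV outside it, following the description in the text.

```python
from typing import Sequence


def mean_pmv(pmv_per_zone: Sequence[float]) -> float:
    """Average PMV over all zones."""
    return sum(pmv_per_zone) / len(pmv_per_zone)


def thermal_discomfort(pmv_per_zone: Sequence[float], comfort_band: float = 0.5) -> float:
    """Zero inside the comfort band, |average PMV| outside it (assumed form)."""
    avg = mean_pmv(pmv_per_zone)
    return 0.0 if abs(avg) <= comfort_band else abs(avg)


# Hypothetical readings for the 21 zones.
comfortable = thermal_discomfort([0.2] * 21)
too_cold = thermal_discomfort([-0.8] * 21)
```

Because the absolute value is used, a building that is too cold is penalized in the same way as one that is too warm.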

Hygienic Comfort
The hygienic comfort or discomfort is measured in terms of the indoor CO2 levels using Equation (12).
The optimal (good, healthy) CO2 concentrations are usually within the range of [400 ppm, 600 ppm]. For this range, no discomfort is recorded. Above it, the CO2 concentrations become mediocre and even bad for health (see Table 2). For the range [600 ppm, 1000 ppm], we opted for a quadratic discomfort, which increases faster than a linear one, to emphasize the danger of escalating CO2 concentrations. When levels surpass 1000 ppm, the situation becomes dangerous for human health, and thus we raise the discomfort dramatically to restrain the agent from reaching those conditions. The hygienic discomfort versus CO2 concentration, according to Equation (12), is illustrated in Figure 7.
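The three regimes described above (no discomfort, quadratic growth, steep penalty) can be sketched as a piecewise function. The thresholds of 600 ppm and 1000 ppm come from the text, but the scaling of the quadratic segment and the slope of the penalty above 1000 ppm are illustrative assumptions, not the coefficients of Equation (12).

```python
def hygienic_discomfort(co2_ppm: float, good_max: float = 600.0,
                        danger: float = 1000.0, steep: float = 10.0) -> float:
    """Piecewise CO2 discomfort: zero, then quadratic, then a steep linear penalty."""
    span = danger - good_max
    if co2_ppm <= good_max:
        return 0.0                                   # healthy range: no discomfort
    if co2_ppm <= danger:
        return ((co2_ppm - good_max) / span) ** 2    # grows from 0 to 1 quadratically
    return 1.0 + steep * (co2_ppm - danger) / span   # assumed steep penalty beyond 1000 ppm


# Hypothetical readings across the three regimes.
levels = [hygienic_discomfort(c) for c in (500.0, 800.0, 1000.0, 1200.0)]
```

The function is continuous at both thresholds, and the jump in slope at 1000 ppm is what discourages the agent from entering the dangerous regime.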

Results
During the simulation, each zone has its own characteristics, which increases the complexity of the optimization task. As shown in Figure 8, the 21 zones differ in volume. They also differ in their exposure to direct solar radiation and in their sun-facing angle. Therefore, the optimal temperature setting is expected to change more significantly in some zones than in others. The agent must navigate these variations to find optimal solutions. The dissimilarity is mostly noticed in energy consumption, because the comfort component has the same value ranges across zones, whereas reaching the same comfort level in two zones requires different amounts of energy. Our agent is trained throughout the year, and its performance is evaluated against the baseline. We stop training when cumulative returns stabilize. An episode is a year of simulation. The agent was trained on an Intel Core i7-5600U CPU, and each episode took around 10 minutes. Our results are very promising, taking into account the simplicity of the shallow neural network architectures (only two hidden layers). Additionally, only temperature and humidity variables were needed as state information. At first, the baseline behavior is cloned to reduce computation time and achieve better results than training with zero knowledge, as shown in Figure 9. With BC, we start with higher rewards at the beginning of learning and achieve better results in the long term. This is due to the exploration/exploitation tradeoff. The raw agent tries many variants of decisions until it reaches a good strategy. With BC, however, it learns to perform better than a good baseline from the start.
Energy consumption and thermal comfort improvements in different weather conditions are investigated. Energy consumption reduction varies from month to month, but the agent is always capable of decreasing energy consumption, as illustrated in Figure 10. For PMV comfort, in some cases, the proposed agent strategy allows less comfort compared to the baseline. This allows for lower energy consumption, while the PMV levels remain mostly inside the [−0.5,0.5] range. Overall, the proposed methodology achieves a 21% reduction in energy consumption and 44% better thermal comfort. Figure 11 summarizes the monthly gains in comfort and energy consumption for the whole year. The optimized strategy's results are compared with those of the baseline in terms of energy and comfort, and the percentage of improvement is reported. It is clear that the improvement follows the outdoor temperature profile well. In August, for instance, a 28% reduction in energy consumption is achieved. During the cold months, energy reduction is less significant, since the outdoor weather is pleasant. The lowest gain is obtained in January, during which the energy consumption is reduced by only 6%. Also during the cold months, less improvement in thermal comfort is recorded. For example, in February, the thermal comfort is worse by 10% in terms of PMV. This might sound poor, but the PMV this month is still within the desired range of [−0.5,0.5]. The values outside the range correspond to points at the start or end of the working day, and they do not stretch over significant periods. Notice that the mean PMV is inside the admissible range for all the zones. Notice also that in the baseline, a cold sensation is present in some zones, even though the overall PMV values are better than those of the optimized strategy. Arguably, the agent offers better comfort, because the baseline reaches bad comfort values, as displayed in Figure 12.
Though indoor air quality differs from season to season and from zone to zone, good CO2 levels are consistently maintained in our experiment. CO2 concentrations are always under 1000 ppm, and they are the highest in July, because the agent automatically prioritizes the energy consumption and thermal comfort. During this period, maintaining good thermal comfort with reduced energy is challenging due to the high temperatures. The plots of CO2 levels per zone for four months, a month per season, can be found in appendix 0.
In appendix 0, the PMV values are presented per zone for four months (a month per season). The thermal comfort is maintained in the desired range. PMV varies from zone to zone due to their different characteristics. It also varies from season to season. The tendency toward a warmer environment due to hot weather is noticeable. Values are within the desired [−0.5,0.5] interval overall. Boxplots depict the data quartiles, and some values outside the optimal range are present. These values are common and do not reach uncomfortable levels. Their span is brief, accommodating the start or end of the day to optimize energy consumption. The values also follow the weather. For instance, in January, these PMV values are lower overall than in the other months, since in Qatar, the weather is hot throughout the year, and temperatures drop only during the winter. The comfort variables for the 21 zones and the four seasons (summer, autumn, winter, and spring) are summarized in Figures 13-16 (one figure per variable: PMV, zone temperature, CO2, and relative humidity). Notice that, for all the 21 zones and all the seasons, the mean CO2 levels do not exceed 800 ppm. The maximum value is 1000 ppm, reached in summer. Therefore, the contaminant concentration is always in the healthy range. The zone temperatures in summer are lower than in the other seasons. The change in mean temperature from one season to another does not exceed 2 °C, because the weather does not vary drastically over the year. Though the comfort levels are limited to the [−0.5,0.5] range, the mean values per zone are successfully maintained in the range of [−0.35,0.35]. PMV values rarely exceed the desired range, and when they do, they remain below the slightly uncomfortable thermal comfort values (+1 or −1). These values correspond to the start and end of the working day, and they do not harm the overall comfort, since they span short periods.

Conclusions
In this work, deep reinforcement learning is applied to control a school building's indoor environmental conditions. The school building is a 21-zone environment that is modeled and simulated using EnergyPlus. The proximal policy optimization is used to train the intelligent agent. The learning process is sped up by cloning the baseline strategy at the first step before learning new policies. None of the previous studies of DRL control for building energy management applied PPO or behavioral cloning. Additionally, compared to other works, the proposed testbed is the most complex with 72 possible actions at every timestep. The agent successfully learns the optimal control decisions for different weather conditions throughout the year. The performance is then evaluated over one year of simulation, achieving a 21% reduction in energy consumption while preserving a very good indoor comfort. More interestingly, the agent achieves such results with shallow neural networks as function approximators.
In the next step, the focus will be on deploying the agent into a real school environment and investigating its performance. In addition, the behavioral cloning effect on learning should be studied in more detail and the transferability of the learned strategy to other environments should be evaluated.