Article

An End-to-End Relearning Framework for Building Energy Optimization

by Avisek Naug 1,†, Marcos Quinones-Grueiro 2,† and Gautam Biswas 2,*,†
1 Hewlett Packard Labs (HPE), Milpitas, CA 95035, USA
2 Institute for Software Integrated Systems, Vanderbilt University, Nashville, TN 37235, USA
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Energies 2025, 18(6), 1408; https://doi.org/10.3390/en18061408
Submission received: 7 February 2025 / Revised: 2 March 2025 / Accepted: 7 March 2025 / Published: 12 March 2025

Abstract:
Building HVAC systems face significant challenges in energy optimization due to changing building characteristics and the need to balance multiple efficiency objectives. Current approaches are limited: physics-based models are expensive and inflexible, while data-driven methods require extensive data collection and ongoing maintenance. This paper introduces a systematic relearning framework for HVAC supervisory control that improves adaptability while reducing operational costs. Our approach features a Reinforcement Learning controller with self-monitoring and adaptation capabilities that responds effectively to changes in building operations and environmental conditions. We simplify the complex hyperparameter optimization process through a structured decomposition method and implement a relearning strategy to handle operational changes over time. We demonstrate our framework’s effectiveness through comprehensive testing on a building testbed, comparing performance against established control methods.

1. Introduction

Autonomous control of non-stationary cyber–physical systems (CPSs) that combine sensing, computation, and actuation has been an area of active research in recent years [1,2]. It involves the development of control strategies without human-in-the-loop that can adapt to dynamic changes in plant models either due to external factors or internal model reconfiguration. For a non-stationary CPS, time-dependent characteristics may drive the system to unexpected states for the same actuation and sensing conditions [3]. Therefore, self-adaptation, i.e., making decisions depending on the context and coping with the inherent uncertainty of the real world, becomes a crucial aspect of autonomous control. In this work, we focus on buildings as a case study of autonomous control for non-stationary CPS.
Optimal control methods for dynamical systems focus on developing policies through minimizing user-defined cost functions that encapsulate specific design objectives [4]. Classical optimal control approaches generally define policies before deployment and necessitate complete knowledge of the system dynamics. Consequently, they cannot handle uncertainties and changes in the system dynamics not considered at the design stage, limiting their applicability for non-stationary CPSs. Adaptive control techniques, which can adjust their actuations in response to these changes, are required to overcome this limitation. Data-driven control techniques, which do not rely on analytical models of a system, like model-free reinforcement learning (RL) [5], have emerged as powerful tools for adaptive control. However, classic RL methods do not guarantee adequate performance for non-stationary CPSs because of the time-varying nature of the system dynamics [3].
Traditionally, non-stationary behaviors have been modeled as a hybrid system with well-defined transitions between the operating modes [6]. However, in real-world systems, where the energy behavior of the building depends on internal operational models and environmental parameters, it is difficult to establish operating modes and the transitions between them in advance. Therefore, we adopt a data-driven modeling approach that continually adjusts to the non-stationary changes in the system, providing accurate predictions and forward simulation to support controller relearning, i.e., completely reformulating the control strategy in response to changes in the plant model dynamics. However, data-driven model learning may suffer from low performance because of slow adaptation: many samples are needed to relearn data-driven models, but sufficient data may not be available for timely relearning to support decision-making without significant degradation in performance. To address these problems, we develop an autonomous relearning framework [7] augmented with a systematic hyperparameter selection approach and analyze the robustness of the proposed approach over long periods of operation.
The rest of the paper is organized as follows. Section 2 reviews the relevant literature. Section 3 outlines the problem statement and the assumptions governing our relearning approach. Section 4 presents the complete framework and a detailed discussion of its individual components. Section 5 describes our systematic hyperparameter selection approach for the data-driven learning and relearning steps. Sections 6 and 7 present the experimental settings and the results of evaluating our approach on a standard ASHRAE five-zone testbed. Finally, we discuss the conclusions, limitations, and future work in Section 8.

2. Related Work

Researchers have explored traditional control methods, MPC [8,9], deep RL [7,10,11,12], and hybrid approaches [13] for climate control in buildings; these methods form the basis for benchmarking in our experiments. Beyond testbed experiments with these approaches, building practitioners frequently refer to ASHRAE Guideline 36 [14,15] for implementing building control, which we also consider in our benchmarking experiments. These methods have achieved partial success, but end-to-end data-driven methodologies for non-stationary operating conditions have not been developed.
Traditional control algorithms such as PID (Proportional-Integral-Derivative) controllers are not capable of handling changes in the system dynamics [16]. Researchers have explored advanced control methods such as Model Predictive Control (MPC) with model adaptation to account for dynamic changes in building conditions and optimize the control strategy [8,9,17,18]. However, most authors acknowledged that their approaches rely on accurate models of the building’s dynamics, which can be challenging to obtain for non-stationary systems, especially for online deployment.
RL has been applied to HVAC control in buildings and tested with simulated physics-based models, with varying degrees of success. For example, [19] employed a Demand Response-aware RL controller during demand response periods, reducing power consumption by 50% on a weekly basis compared to the default controllers. Several works have pointed out that taking actions at every instant may be detrimental and instead suggested event-driven decision-making using RL [20], showing improved performance in terms of energy savings and thermal comfort violations compared to earlier works. However, these approaches tend to be fine-tuned or ad hoc for the particular application; a universal approach that works across different weather conditions and internal building behaviors is missing, primarily because of the difficulty of modeling the non-stationary behavior of these systems. Moreover, none of these approaches is end-to-end in the sense of being readily applicable to any building HVAC system, and they do not discuss how to address the sim-to-real gap. Some work toward a generalized end-to-end approach has recently been reported in [21]; however, it focuses on transfer learning without emphasizing widespread application under varying building conditions.
Recent work has used data-driven models to train RL agents for HVAC control. Kontes et al. [22] used Gaussian process models to simulate building energy consumption and trained a control strategy using Dyna-Q, achieving energy savings comparable to an MPC-based approach. Costanzo et al. [23] applied Q-learning to a data-driven model of a building created using a combination of neural networks and an ensemble of extreme learning machine (ELM) models, achieving results within 10% of the true optimum. However, like physics-based models, these methods require building-specific data for training and are not end-to-end approaches that can be readily applied to any building HVAC system. Overall, while RL has shown promise for HVAC control in buildings, there is a lack of end-to-end data-driven methodologies that can be easily applied across the varying operating conditions of buildings.
Creating end-to-end data-driven relearning approaches for climate control in non-stationary building environments has been challenging for several reasons. First, building HVAC systems are complex and highly nonlinear [24], and, according to [25], their behavior is influenced by several factors such as weather conditions, occupancy patterns, and equipment performance. The difficulty of developing a unified model, or a cascade of models, that completely captures these factors and their interactions in data-driven thermal modeling of buildings has been discussed in [26]. Second, building conditions are highly variable and can change rapidly over time; therefore, any model or control strategy must adapt and be responsive to these changes. Although certain building-specific transfer learning approaches may be used to address these changes [27], deploying them for real-time applications requires sufficient real-time data and feedback, and it may be challenging to recondition the model and adapt the control during operations. Third, deploying end-to-end approaches in real systems requires hyperparameter tuning of large models, a topic rarely discussed in the current literature. A cascaded computational architecture for monitoring, modeling, and control can make the joint hyperparameter optimization problem intractable [28,29]. Furthermore, most of these works lack a principled approach to real-world deployment, which requires an end-to-end pipeline that can handle non-stationary exogenous variables, a highly efficient adaptive control approach, and dynamic model updates. Most importantly, these updates need to occur efficiently in real time so that the building can operate as optimally as possible while it transitions to a new operating regime.

3. Problem Formulation

This section presents a formal problem statement for the end-to-end relearning problem for energy optimization in dynamic environments. The problem is modeled as a Non-stationary Markov Decision Process, encapsulating the complexities of building energy optimization.
Definition 1 (Non-Stationary Markov Decision Process [30] (NS-MDP)). A non-stationary Markov decision process is defined as a 5-tuple $M = \{S, A, \mathcal{T}, (p_t)_{t \in \mathcal{T}}, (r_t)_{t \in \mathcal{T}}\}$. $S$ represents the set of possible states the environment can reach at decision epoch $t$. $A$ is the action space. $\mathcal{T} = \{1, 2, \ldots, N\}$ is the set of decision epochs with $N \in \mathbb{N}^+$. $p_t(s' \mid s, a)$ and $r_t(s, a, s')$ represent the transition function and the reward function at decision epoch $t \in \mathcal{T}$, respectively.
The time-varying nature of the control problem requires the optimal solution to be a set of policies $\pi_t^*$ whose selection depends on the transition function at each decision epoch. Our proposed framework is designed to update the control policy efficiently as the system dynamics change. This involves two steps: (1) detecting the onset of a non-stationary behavior change and (2) updating the policy. The non-stationary behavior is assumed to be driven by changes in the system parameters and/or exogenous variables. Exogenous state variables and rewards may impede the reinforcement learning (RL) controller by introducing unregulated fluctuations into the reward function [31]. We assume that the non-stationarity, defined by changes in the transition function, is bounded in time, which makes the adaptation process feasible. This is a natural assumption for most physical systems, as abrupt changes do not occur frequently; therefore, the transitions satisfy the Lipschitz Continuity (LC) condition on the NS-MDP [30], i.e., there is an upper bound on the rate of change of the system dynamics.
Although the agent does not know the true NS-MDP model, it can learn a quasi-optimal policy by interacting with temporal slices of the NS-MDP, assuming the LC property. In other words, the agent learns a stationary MDP from the environment data at epoch $t$, implying that the trajectory $\{s_0, r_0, \ldots, s_k\}$ generated by an LC-NS-MDP is assumed to be generated by a sequence of stationary MDPs $\{MDP_{t_0}, \ldots, MDP_{t_0+k-1}\}$, each of which can be considered an individual learning task [30].

4. End-to-End Relearning Framework

Using the LC-NS-MDP problem formulation from the previous section, we develop a detailed description of the proposed end-to-end relearning framework [32,33]. (The associated hyperparameter tuning methods are described in the next section.) We start with a high-level overview of the two-step process shown in Figure 1, which includes self-monitoring and relearning. Then, we discuss the components of the approach in more detail. Finally, we summarize how these components work in tandem.
The proposed framework, as depicted in Figure 1, consists of two iterative processes: (i) an Outer Loop, where a Performance Monitor assesses the reward per interaction during the deployment phase; and (ii) an Inner Loop that is activated when performance degradation attributed to non-stationary changes in the system triggers relearning of the dynamic system model for the new system configuration.

4.1. Outer Loop: Online Operation and Performance Monitoring

The purpose of the outer loop is to generate supervisory control actions given the current system state, self-monitor performance, and detect non-stationary alterations in the system and its environment. Detection of such a change triggers data-driven relearning of a new system model and a corresponding control policy. The monitoring focuses on the reward signal during the interaction of the Supervisory RL Controller with the actual system, as illustrated in Figure 2. For simplicity, this reward formulation can be identical to, or an augmented form of, the reward function used in the Lipschitz Continuous NS-MDP formulation. We assume that whenever the controller performance deteriorates, it is reflected as an overall negative trend in the reward signal. Since the reward is a simple scalar time-series signal, we estimate the fit of the negative trend using the technique described in [34]. The trend is considered statistically significant if the $R^2$ coefficient of determination is greater than 0.65 and the p-value is less than 0.05. Although more intricate non-linear trend detection techniques are available and have been applied in recent studies [35], the efficacy of this module is contingent upon the window size $W_{pm_i}$, $i \geq 1$, used to evaluate the reward. The window size is generally correlated with the temporal dynamics of the non-stationary changes.
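A minimal sketch of this trend test, assuming a simple least-squares linear fit over the reward window (the helper name and window handling are illustrative, not the authors' code):

```python
import numpy as np
from scipy import stats

def negative_trend_detected(rewards, r2_threshold=0.65, p_threshold=0.05):
    """Return True if the reward window shows a statistically significant negative trend."""
    t = np.arange(len(rewards))
    fit = stats.linregress(t, rewards)      # least-squares linear fit of reward vs. time
    r_squared = fit.rvalue ** 2
    return fit.slope < 0 and r_squared > r2_threshold and fit.pvalue < p_threshold

# Usage: evaluate the most recent window W_pm of per-interaction rewards.
# if negative_trend_detected(reward_log[-W_pm:]): trigger_inner_loop()
```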

4.2. Inner Loop: Update and Relearning

Our solution architecture for the inner loop includes simulating the real system using multiple data-driven models that can predict relevant state variables in the LC-NSMDP.

4.2.1. Modeling of the Dynamic Systems

Due to the complexity and associated costs of developing precise physics-based models for large systems [36], our approach utilizes a set of data-driven models. These models predict the next state $\bar{s}_{t+1}$ based on the current state $\bar{s}_t$, exogenous variables $\bar{d}_t$, and control inputs $\bar{u}_t$ given by the actions of the RL controller $\bar{a}_t$. Figure 3 shows the framework of our data-driven strategy for constructing a state space model of the dynamic system. To simulate the state variables accurately, our architecture integrates two sequential components: (1) Long Short-Term Memory (LSTM) networks to encapsulate temporal dependencies between inputs and outputs, and (2) a Fully Connected Neural Network (FCNN) to model the non-linear interactions among variables. The precise architecture of each deep learning model is derived through a hyperparameter optimization process, which we discuss in a later section.
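A minimal PyTorch sketch of this LSTM-to-FCNN state predictor follows; the layer sizes and input composition are illustrative placeholders, since the actual architecture is selected by the hyperparameter optimization described later.

```python
import torch
import torch.nn as nn

class DynamicSystemModel(nn.Module):
    """One member of the ensemble: predicts part of s_{t+1} from (s_t, d_t, u_t) sequences."""
    def __init__(self, n_inputs, hidden_size=64, n_outputs=1):
        super().__init__()
        self.lstm = nn.LSTM(n_inputs, hidden_size, batch_first=True)   # temporal dependencies
        self.fcnn = nn.Sequential(                                      # non-linear interactions
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, n_outputs),
        )

    def forward(self, x):
        # x: (batch, time, features) built by concatenating s_t, d_t, and u_t over a window
        out, _ = self.lstm(x)
        return self.fcnn(out[:, -1, :])   # predict the next-step value from the last hidden state
```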
Furthermore, given the non-stationary behaviors in the operating regimes of our system, it is necessary to retrain/relearn these models. Retraining with the limited amounts of data available after the non-stationary change may result in overfitting and, therefore, cause catastrophic forgetting. To avoid this, we adopt a regularization process termed Elastic Weight Consolidation (EWC) proposed in [37]. This process effectively retains previously learned behavior from older data while avoiding long learning times.
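As a sketch of how the EWC penalty [37] enters retraining, the loss below adds a quadratic term that anchors parameters deemed important for previously learned behavior; the stored Fisher information estimates and old parameters (`fisher`, `old_params`) and the weighting `lam` are assumptions for illustration.

```python
def ewc_regularized_loss(model, task_loss, fisher, old_params, lam=1.0):
    """task_loss: prediction loss on the new (post-change) data batch."""
    penalty = 0.0
    for name, p in model.named_parameters():
        # penalize movement away from the old parameters, weighted by their estimated importance
        penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return task_loss + 0.5 * lam * penalty
```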

4.2.2. Exogenous Variable Prediction Models

In addition to the models required to simulate the system transition, we propose to develop a set of data-driven models to simulate the future behavior of exogenous variables. We name these the Exogenous Variable Predictors. These models simulate the system behavior into the future, and they help us address the problem of potentially limited data available after a non-stationary change is detected. The derived models are used to relearn the RL controller to make it optimal for the system’s future behavior.
The schematic of the generic exogenous variable predictor module is shown in Figure 4. We develop Exogenous Variable Predictors to predict the weather, thermal load, and occupancy schedules for our building application. Our models use the current sequence of past inputs of length $K$ for the exogenous variables, $\bar{d}_{t-K:t}$, and predict the value of the exogenous variable at the next time instant as $\hat{d}_{t+1}$. For forecasting over a horizon of length $N$, we use the output at the $(t+1)$-th instant and append it to the input sequence. Both $K$ and $N$ are hyperparameters that are fine-tuned for the specific application. The LSTM models capture the temporal sequence relations in these variables, and the FCNN captures the non-linearity of the processes. The specific network architecture is also obtained through hyperparameter search. Further, depending on the application domain, the training can be continual, as newer batches of data are acquired, or conditional, i.e., triggered when certain non-stationary system behavior is detected.
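A minimal sketch of this recursive N-step rollout, assuming `predictor` is any fitted one-step model mapping the last K values to $\hat{d}_{t+1}$ (names are illustrative):

```python
import numpy as np

def forecast_exogenous(predictor, history_K, horizon_N):
    """history_K: the last K observed values of the exogenous variable."""
    window = list(history_K)
    forecast = []
    for _ in range(horizon_N):
        d_next = predictor(np.asarray(window))   # one-step-ahead prediction d_hat
        forecast.append(d_next)
        window = window[1:] + [d_next]           # append the prediction, drop the oldest value
    return forecast
```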
For data-driven modeling, the training data needs to include the input variables required for accurate modeling of system behavior. This includes variables such as $\bar{a}_t$, $\bar{s}_t$, and $\bar{d}_t$, as well as the data needed to estimate the next state $\bar{s}_{t+1}$ in the Exogenous Variable Prediction module. In our approach, we collect and use real data of system operations based on the actual pre-existing controller-system interactions. These data are collected at a pre-determined rate and stored in an Experience Buffer (Figure 3) modeled as a FIFO queue with queue length $M_e$. The Experience Buffer helps us retain the latest data from the real system to adapt the models to the latest system behavior. Choosing an optimal value for $M_e$ is an important step in the overall approach.
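A minimal sketch of the Experience Buffer as a fixed-length FIFO queue; the buffer length shown is an illustrative placeholder, since $M_e$ itself is tuned by the hyperparameter optimization in Section 5.

```python
from collections import deque

M_e = 36                                   # buffer length (illustrative; selected by HPO)
experience_buffer = deque(maxlen=M_e)      # oldest samples are discarded automatically

def record_interaction(s_t, a_t, d_t, s_next):
    """Store one controller-system interaction for later model (re)training."""
    experience_buffer.append((s_t, a_t, d_t, s_next))
```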

4.2.3. Deep Reinforcement Learning Controller

We develop a policy gradient-based RL algorithm to design and implement the controller $\pi_t^*$ that interacts with the real system [38]. The Policy Network, parameterized by the vector $\theta$, takes as input the current state of the NS-MDP, $\bar{s}_t$, which comprises observations $\bar{o}_t$ from the system and certain exogenous variables in $\bar{d}_t$ that support the decision-making process. The network outputs an action $\bar{a}_t$ in response. In other words, the agent collects a set of transition tuples $(\bar{s}_t, \bar{a}_t, \bar{s}_{t+1}, r(\bar{s}_t, \bar{a}_t))$, which is then used to optimize a loss function $J_\theta$. Depending on the algorithm used to update the loss function, the obtained DRL agent might be the result of a classic policy gradient network like REINFORCE, an actor-critic network like A2C, or an advanced actor-critic algorithm like Proximal Policy Optimization (PPO) [39] or Deep Deterministic Policy Gradient (DDPG) [40].
We summarize our relearning procedure in Algorithm 1. Given the inputs, the online deployment loop with performance monitoring runs indefinitely, and the Offline Relearn Phase is triggered conditionally when a non-stationary change is detected.
Algorithm 1: End-to-End Relearning algorithm
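(Algorithm 1 appears as a figure in the published article. The sketch below restates the dual-loop procedure described in the text, with hypothetical helper names; it is an illustration, not the authors' implementation.)

```python
def end_to_end_relearning(system, controller, models, exo_predictors, monitor, buffer):
    while True:                                         # Outer Loop: online deployment
        s = system.observe()
        a = controller.act(s)                           # supervisory control action
        r = system.step(a)                              # apply action, receive reward
        buffer.append((s, a, r))                        # FIFO Experience Buffer
        monitor.record(r)
        if monitor.negative_trend():                    # non-stationary change detected
            # Inner Loop: Offline Relearn Phase
            models.retrain(buffer, regularizer="EWC")   # update Dynamic System Models
            exo_predictors.retrain(buffer)              # update Exogenous Variable Predictors
            controller.relearn(models, exo_predictors)  # retrain the RL policy in simulation
```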

5. Hyperparameter Selection

The end-to-end relearning framework discussed in the previous section has a large number of parameters across multiple systems. The joint optimization of these hyperparameters is computationally intractable. Hence, we develop a set of methods to decompose the hyperparameter space to reduce the complexity of the tuning process.

5.1. HPO: Problem Formulation

Let us assume that an ensemble $A$ of models (not limited to machine learning models) is used to solve a problem associated with a complex system, $D$. The ensemble has an associated set of hyperparameters $H = (H_1, H_2, \ldots, H_j)$. The performance of the ensemble associated with $D$ is defined by multiple metrics, $L = (L_1, L_2, \ldots, L_k)$. Our task is to identify the values for the set of hyperparameters, $H^*$, that allow us to achieve the best performance given the metrics, $L$. This is mathematically formulated as hyperparameter optimization of the ensemble model $A$ with respect to the metrics in $L$:
$H^* = \arg\min_{h \in H} \mathcal{L}(A, D; h)$,  (1)
where $\mathcal{L}$ can be a single scalar evaluated as a weighted sum of the metrics in $L$, or a set of individual equations if the elements in $L$ depend on different subsets of $H$. If the cardinality of $H$ is large, the joint hyperparameter optimization problem becomes computationally complex and sometimes intractable. We propose an approach for systematically decomposing the hyperparameter space into independent subsets of hyperparameters so that each subset can be optimized individually.

5.2. Decomposition of Hyperparameter Space

We adopt a Bayes Net approach for characterizing the hyperparameter space as a graph, where the links in the graph capture the directed relations between hyperparameters and the evaluation metrics they influence. The crux of our approach then uses the concept of d-separation to determine the conditionally independent subsets of hyperparameters that can be optimized independently.
Using knowledge of the ensemble of models and the relations among performance metrics in the application domain, we derive the Bayes Net structure that captures the relations between the model hyperparameters and their corresponding evaluation metrics. Adopting the method discussed in [41], we construct the Bayes Net using a topological order (causes precede effects) to link all $H_i \in H$ to the corresponding performance variables $L_k \in L$, such that the order matches the directed graph structure. This helps establish the Markov blanket, i.e., $P(X_i \mid X_1, X_2, \ldots, X_n) = P(X_i \mid \mathrm{Parents}(X_i))$, where $X = \{X_1, X_2, \ldots, X_n\} = H \cup L$. Below, we describe the guiding principles used to generate this graph structure.

5.2.1. Metric-Based Decomposition

We first simplify the HPO problem by classifying the hyperparameters associated with the metrics of the ensemble models into two primary groups: (1) local and (2) global.
Definition 2 
(Local and Global Metrics). For an ensemble of data-driven models $A$ associated with a system $D$, the evaluation metrics in $L$ may be decomposed into two classes. The metrics whose computation depends on an individual data-driven model are called local metrics, $L_l$. The metrics whose computation depends on multiple models are called global metrics, $L_h$.
Next, the set of hyperparameters is divided according to the specific metrics each one influences, as detailed below.
Definition 3 
(Decomposition of Hyperparameter Space). For an ensemble of data-driven models $A$ associated with a system $D$, the existing hyperparameter set $H$ can be decomposed into two main subsets, $H_h$ and $H_l$, using the causal structure of a derived Bayes net [42]. Hyperparameters linked to at least one global metric in $L_h$ are termed global and associated with the set $H_h$. Hyperparameters that are only linked to the local metrics $L_l$ are termed local and associated with $H_l$.

5.2.2. Separation of Hyperparameters: Connected Components and D-Separation

Given a Bayes Net that captures the relations between hyperparameters and metrics, we apply the concepts of connected components and d-separation [43] to identify the independent hyperparameter groupings. A graph $G$ is said to be connected if every vertex of the graph is reachable from every other vertex. We can use a simple Breadth-First Search to identify the connected components of the directed acyclic graph (DAG) representation of the Bayes Net. Once we identify the connected components, we apply d-separation, considering the undirected paths between variables, to establish whether a set of variables $X$ in the graph is independent of a second set of variables $Y$ given a third set $Z$ (conditional independence). There are four primary rules that can be used to test the independence relation.
  • Indirect Causal Effect ($X \to Z \to Y$): X can influence Y if Z has not been observed.
  • Indirect Evidential Effect ($Y \to Z \to X$): evidence on X can influence Y if Z has not been observed.
  • Common Cause ($X \leftarrow Z \to Y$): X can influence Y if Z has not been observed.
  • Common Effect ($X \to Z \leftarrow Y$): X can influence Y only if Z has been observed; otherwise, they are independent.
In our formulation of the ensemble models, we encounter the following cases based on the above rules.
  • The evaluation of local metrics blocks the effect of the local hyperparameters on global metrics (Rule 1). Therefore, any two models affecting the same global metric will not have their hyperparameters related via a common effect (Rule 4).
  • Assume that Z is the set of global hyperparameters, while X and Y represent local hyperparameter sets associated with individual models. If they belong to the same connected component (sub-graph), we will have to jointly optimize the hyperparameters of the individual models, thus increasing the computational complexity of the optimization process.
    Instead, we can consider a two-level optimization process: first, we choose candidate values for the global hyperparameters in Z. Given the values of the variables in Z, we can independently optimize the hyperparameters connected to components X and Y, thus decomposing the hyperparameter space further by applying Rule 3.
  • In many cases, all global hyperparameters are linked to the global metrics, and then they have to be jointly optimized following Rule 4.
Subsequently, during each trial of the hyperparameter optimization process, we consider the following steps (a minimal code sketch of this two-level loop follows the list):
  • Choose candidate values of the global hyperparameters $H_h$.
  • With the chosen values of $H_h$, optimize the hyperparameters in $H_l$ associated with each model in the ensemble by using the corresponding model performance as the objective function.
  • Evaluate the performance of the subsets in $H_h$ on $L_h$ using Equation (1).
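A minimal sketch of this two-level loop, with hypothetical helper functions (`optimize_local`, `evaluate_global_metrics`) standing in for the per-model tuning and the Equation (1) evaluation:

```python
def two_level_hpo(global_candidates, local_subsets):
    """global_candidates: iterable of H_h settings; local_subsets: {model name: H_l search space}."""
    best = None
    for h_global in global_candidates:                             # step 1: pick an H_h candidate
        local_choices = {}
        for name, subset in local_subsets.items():                 # step 2: tune each H_l subset
            local_choices[name] = optimize_local(subset, h_global)     # uses the local metric L_l
        score = evaluate_global_metrics(h_global, local_choices)   # step 3: evaluate L_h, Equation (1)
        if best is None or score < best[0]:
            best = (score, h_global, local_choices)
    return best
```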

5.2.3. Bayesian Optimization for Hyperparameter Tuning

We adopt a model-based black box optimization method called Sequential Model-Based Optimization (SMBO) [44]. This method offers several advantages over gradient-based or derivative-free optimization techniques. Unlike gradient-based optimization, SMBO can handle non-convex surfaces, making it preferable when the hyperparameter space includes discrete variables. Unlike other derivative-free methods, SMBO also maintains a history of past evaluations of candidate hyperparameter settings. It uses this information to form and update a prior, thereby selecting hyperparameter values that potentially improve upon the best choices identified thus far.
The optimization process itself includes three iterative steps that we run until a termination criterion is satisfied: (1) using the surrogate probability model and the selection function, choose the most promising candidate hyperparameter values; (2) apply the hyperparameters to the fitness function $\mathcal{L}$; and (3) update the surrogate model with the new trial results. The termination criterion is based on the improvement of the metrics of interest. For example, in the Bayesian optimization for the HVAC problem, we consider the time to detect non-stationarity, the total relearning time, the model errors, and the agent reward; the details are provided in Section 6.4. We terminate the optimization when the expected improvement from the best hyperparameter candidate is below a threshold $\epsilon$. We applied Bayes' rule to the surrogate probability model and used the Tree-structured Parzen Estimator (TPE) [44] to obtain a generative model of the hyperparameter distribution.
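A minimal TPE example using the hyperopt library is sketched below; this is one possible SMBO implementation, not necessarily the authors' (the paper's tuning is distributed with Ray Tune, see Section 7), and the search-space values and the objective are placeholders.

```python
from hyperopt import fmin, tpe, hp, Trials

# Placeholder search space over two global hyperparameters (illustrative values).
space = {
    "buffer_len_Me": hp.choice("buffer_len_Me", [18, 36, 72]),
    "horizon_N": hp.choice("horizon_N", [24, 48, 96]),
}

def objective(h):
    # Return the scalarized fitness L, e.g., a weighted sum of relearning time and model error.
    return run_relearning_trial(**h)   # hypothetical trial runner

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=trials)
```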

6. Experimental Settings

This section describes the settings we adopted for our relearning algorithm and the associated hyperparameter tuning approach we applied to a 5-zone building testbed.

6.1. System Description

The five-zone building testbed shown in Figure 5, developed by a Lawrence Berkeley National Labs team, is a frequently used physics-based dynamic simulation model. For our work, we exported the compiled testbed from Modelica to a Functional Mockup Unit (FMU) and used the PyFMI library to interact with the system model in Python 3.9. The FMU comprises an HVAC system, a building envelope model that includes air flow and leakage through open doors, and other building components. For a further detailed description involving the equations for modeling this building testbed, we refer the reader to [45]. Exogenous factors such as weather, occupancy, and human-induced local setpoint changes introduce non-stationary building behavior. Our approach aims to maintain operational efficiency in the face of these changes.

6.2. Reinforcement Learning Definitions

Table 1 outlines the various components of the testbed as a non-stationary MDP. The selection of variables is informed by recommendations from building managers and is partially influenced by pertinent studies discussed in the literature review.
As components of the reward function, $r_{energy}$ optimizes energy efficiency, $r_{cmft}$ penalizes zone comfort violations, and $r_{vav}$ penalizes frequent changes in the controller action $T_{dsp}$, as the zone VAV dampers tend to actuate aggressively in response. Here, $ub_{z,t}$ and $lb_{z,t}$ represent the upper and lower comfort bounds for the temperature in each zone $z$ at time $t$. $v_{\%,z,t}$ denotes the VAV damper valve percentage in zone $z$ at time $t$. The transition and reward functions are assumed to be locally Lipschitz continuous due to the large time constants in building thermodynamics.

6.3. Implementation of the Solution

We discuss the different components of our approach specific to our testbed. The Results section provides details regarding the architecture of the individual models, hyperparameter values, performance evaluation procedures, and the corresponding results.

6.3.1. Dynamic System Model

The transition model $p(\bar{s}_{t+1} \mid \bar{s}_t, \bar{a}_t)$ predicts values for the following observations $\bar{o}_t$ at the next time step: Total Energy Consumption ($E_{tot}$), the zone temperatures ($T_z$, $z = 1 \ldots 5$), and the VAV damper opening percentages ($v_{\%,z}$, $z = 1 \ldots 5$). Accordingly, we create a model $M_{energy}$ for predicting energy consumption, a model $M_{T.zone}$ for predicting all the zone temperatures, and five individual models ($M_{vav\%,z}$, $z = 1 \ldots 5$) that predict the VAV damper opening percentages for the five zones. The ensemble of these models constitutes the Dynamic System Model. The discharge temperature model for predicting $T_{d,t}$ is modeled as
$T_{d,t} = T_{dsp,t-1} + \mathcal{N}(0, \sigma_{disch})$.
We observed that the discharge temperature was always close to the setpoint specified by the action $a = T_{dsp}$. $\sigma_{disch}$ is chosen to be 0.1 °F to account for non-deterministic effects.

6.3.2. Experience Buffer

The Buffer's optimal memory size $M_e$ is determined by hyperparameter optimization.

6.3.3. Supervisory Controller

The DRL-based Supervisory Controller was implemented as a simple Multi-Actor-Critic framework [46]. This formulation helped us determine whether our framework provided improvements compared to a standalone state-of-the-art online algorithm like PPO [39]. Given the training resources, we trained in $n = 10$ parallel environments to generate diverse samples (similar to [47]), accounting for uncertainties in measurement and the resulting model inaccuracies in the Dynamic System Models. We did not specify a maximum number of training steps; instead, we trained incrementally and used callbacks to stop training when the reward no longer improved, assuming convergence had been achieved.

6.3.4. Exogenous Variable Predictors

For the testbed, we needed to predict ambient dry bulb temperature (oat) and relative humidity (orh). We used Long Short-Term Memory Neural Network (LSTM) models (adapted from [48]) with fine-tuning to fit our specific location’s data. The Exogenous variable predictor models were learned continuously in a batched online fashion. For zone setpoint schedules and thermal load-based exogenous variables, models used simple rules to look up current schedules or values in the system and set them for the required prediction horizon N.

6.3.5. Performance Monitor Module

For the Performance Monitor, we tracked the presence of a negative trend during the deployment phase using $i = 2$ windows, $W_{pm_1}$ and $W_{pm_2}$, in parallel. This helped us track slow-moving non-stationarities due to weather using $W_{pm_1}$, and fast-moving non-stationarities due to zone temperature or load changes using $W_{pm_2}$.

6.4. Hyperparameter Selection

The framing of the hyperparameter optimization problem is presented below and the results are discussed in the next section.

6.4.1. Bayes Net

The set of hyperparameters for the ensemble models is broadly summarized in the first column of Table 2. The set of performance metrics included in the Bayes Net are: (1) the time to detect non-stationarity after occurrence ($L_{\Delta t_d}$), (2) the total relearning time ($L_{\Delta t_r}$), (3) the individual data-driven model errors ($L_{me}$), and (4) the RL agent reward ($L_{rwd}$). The resulting Bayes Net is depicted in Figure 6. The links, representing the cause-effect relations of the ensemble structure, are derived from the local versus global labeling of the variables.
To identify the set of global and local hyperparameters, we first classified the metrics as global or local. For the ensemble approach and the five-zone testbed, the time to detect a non-stationary change after occurrence ($L_{\Delta t_d}$) and the total relearning time ($L_{\Delta t_r}$) were labeled as global metrics, as they were influenced by multiple models and represented overall ensemble performance. The individual data-driven model errors ($L_{me}$) and the RL agent reward ($L_{rwd}$) were labeled as local performance metrics. Accordingly, Table 2 lists the metrics affected by the hyperparameters and shows their classification as global or local in the last column.

6.4.2. Separation of Hyperparameters

Given the Bayes Net structure in Figure 6, we identified $W_{pm_1}$ and $W_{pm_2}$ to be related via common effect (Rule 4). However, not being connected to other nodes, they were independent of the other variables. The nodes $M_e$ and $N$ were connected via the common effect rule (Rule 4) through the node $L_{\Delta t_r}$. Given the two-level hyperparameter optimization, the individual model hyperparameters in the dynamic system, the exogenous variable predictors, and the agent actor and critic networks were independent of each other via common cause (Rule 3). The hyperparameters inside individual models were dependent on each other due to the common effect rule (Rule 4). The nodes $l_e$ and $\gamma$ were jointly optimized with the agent hyperparameters because of common effect (Rule 4), since they affected $L_{rwd}$. Finally, the individual model hyperparameters did not affect the global metrics due to the blocking effect of the local metrics $L_{me}$. Hence, the hyperparameter space could be decomposed into two subsets: (1) $(W_{pm_1}, W_{pm_2})$ and (2) $(M_e, N, H_l)$, where the subsets of $H_l$ could each be optimized independently of one another. Here, $H_l$ comprises all the hyperparameters of the individual data-driven models.

6.4.3. Two-Step Hyperparameter Optimization

Given the separation of hyperparameters, we independently optimized the values of $(W_{pm_1}, W_{pm_2})$ using the metric $L_{\Delta t_d}$. For the other subset, $(M_e, N, H_l)$, we performed the two-level optimization, choosing candidate values of $(M_e, N)$ and then optimizing the subsets of $H_l$ independently. The independent subsets included the hyperparameters of the Exogenous Variable Predictors, the Dynamic System Model(s), and the RL Agent Policy Network, Value Network, $l_e$, and $\gamma$. We considered four types of building non-stationary changes: (1) weather based: ambient temperature and ambient relative humidity; (2) changes in zone set-points; (3) changes in thermal load due to changes in occupancy; and (4) a combination of the first three cases. For cases 2 and 3, weather-related non-stationary changes were also present. This two-step optimization approach was performed across all these non-stationary changes.

6.4.4. Global and Local Hyperparameter Choices

The candidate choices for the global and local hyperparameters are described in Table 3 and Table 4, respectively. These ranges were selected based on building energy optimization problems reported in the literature. While we primarily used discrete choices for the hyperparameter values, separating the hyperparameters helped simplify the search space. For the dynamic system neural network-based models, the exogenous variable predictor models, and the actor-critic networks, we restricted the number of hidden layers to four to avoid computational intractability.
The selection and separation of the hyperparameters emphasize that we do not rely on physics-based knowledge of the system; instead, we allow the system's data to dictate the hyperparameter values through tuning.

7. Results and Discussion

We present the results of the hyperparameter tuning followed by the relearning algorithm application on the five-zone testbed. We discuss the optimized model architectures and the evaluation metrics. The distributed two-step hyperparameter optimization utilized Ray-Tune [49], which was applied across four types of building non-stationarities. Specifically, our hyperparameters are optimized to enhance the online performance of our deep relearning RL algorithm. No additional training is conducted during online operations.

7.1. Results on Tuned Model Architecture and Model Evaluation

In this subsection, we provide the details of the tuned model architectures and the results of evaluating the models on the standard five-zone testbed.

7.1.1. Dynamic System Model

Our model evaluation used the predicted output from timestep $t-1$ as the input for the prediction at time step $t$ over a prediction horizon $N$. We assessed model performance across 100 different $N = 48$ h intervals throughout the year. We calculated the Coefficient of Variation of the Root Mean Square Error (CVRMSE) and evaluated the temperature and zone VAV predictors with the Mean Absolute Error (MAE).
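A minimal sketch of these two metrics using their standard definitions (not taken from the authors' code):

```python
import numpy as np

def cvrmse(y_true, y_pred):
    """Coefficient of Variation of the RMSE: RMSE normalized by the mean of the measurements."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / np.mean(y_true)

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))
```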

7.1.2. Experience Buffer

The following values (in hours) were obtained for the size of the Experience Buffer $M_e$: (1) Weather: 36; (2) Zone Set Point: 18; (3) Thermal Load: 18; and (4) Combined: 18.

7.1.3. Supervisory Controller

The controller is modeled as an Actor-Critic framework [46]. The inputs to the actor network are the state variables in $\bar{s}_t$ described in Table 1, and the outputs are the mean $a_\mu$ and standard deviation $a_\sigma$ from which an action $\bar{a}_t$ is sampled. The critic network is a Q-network that takes as input the current state $\bar{s}_t$ and action $\bar{a}_t$. It is then used to estimate the advantage using Equation (2):
$A(s_t, a_t) = Q(s_t, a_t) - \sum_{a_t \in A} \pi(a_t \mid s_t) \, Q(s_t, a_t)$.  (2)
Typically, advantages are normalized for better convergence during agent training. For slowly varying non-stationary parameters with high uncertainty, such as weather, there is less confidence in future estimated returns; hence, we used low values of $\gamma$ for weather-related changes. In contrast, non-stationarities related to zone setpoint and thermal load changes tended to sustain their values or maintain the same schedule in the future and did not change very often, so higher values of $\gamma$ were appropriate.
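A minimal sketch of Equation (2) and the usual advantage normalization for a discrete set of candidate actions; `q_values` and `action_probs` are assumed outputs of the critic and actor networks (illustrative names):

```python
import numpy as np

def advantage(q_values, action_probs, action_index):
    """Equation (2): Q(s, a) minus the policy-weighted average of Q over all actions."""
    baseline = np.sum(action_probs * q_values)
    return q_values[action_index] - baseline

def normalize_advantages(advantages, eps=1e-8):
    """Standard zero-mean, unit-variance normalization applied before policy updates."""
    a = np.asarray(advantages)
    return (a - a.mean()) / (a.std() + eps)
```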

7.1.4. Exogenous Variable Predictor Models

We developed models for predicting outside air temperature and relative humidity as a sequence prediction problem. Given $K$ past inputs, we predicted future values over a horizon $N$, adapting the approach of [48] to Nashville's weather. Lacking future labeled data for sequence-to-sequence prediction, we used a loop during prediction, feeding the previous output ($\bar{o}_t$), cell state ($\bar{c}_t$), and hidden state ($\bar{h}_t$) from the last LSTM step into the next time-step prediction, following recurrent neural network translation techniques [50]. The resulting encoder-decoder network was pre-trained on one year of Nashville data, predicting over a horizon of length $N$ using the past $K = 12$ h of inputs at half-hour intervals. The test dataset consisted of 100 samples with $N = 48$ h horizons at similar intervals. The optimal $K$ and $N$ values for the various non-stationarities are listed in Table 5.
The tuning of the output horizon length $N$ uses a weighted metric that includes model error, relearning time, energy consumption, actuation frequency, and comfort performance. Our approach strikes a balance between producing enough future samples for expedited reinforcement learning agent training and convergence, and minimizing the impact of modeling errors stemming from uncertainties in the future return estimates $G_t$.

7.1.5. Performance Monitor Module

We tuned the values of the two windows for detecting negative trends over (1) small intervals for relatively fast-moving non-stationarities and (2) large intervals for slow-moving non-stationarities. The tuned values for $W_{pm_1}$ and $W_{pm_2}$ are shown in Table 6.

7.2. Benchmark Experiments

We assessed the performance of our relearning algorithm by implementing it online in the five-zone testbed and comparing it against benchmark algorithms using established building literature metrics.
Energy Consumption $E_{tot}$: Lower energy consumption under similar conditions indicates better energy efficiency.
Average Zone Temperature Deviation: $TD = \sum_{z \in Z} \max(T_z - ub_z, 0, lb_z - T_z)$ or $TD = \sum_{z \in Z} |T_z - T_{z,sp}|$. Lower deviations from the indicated setpoints imply better thermal comfort. $Z$ represents the set of zones, $T_z$ is the temperature in zone $z$, $T_{z,sp}$ is the set-point for zone $z$, and $lb_z$ and $ub_z$ are the lower and upper comfort bounds, respectively. One of the two forms is selected as a proxy for comfort, depending on whether the bounds are predefined for each zone of the building.
Actuation Rate: The actuation rate is given by $A_s = \sum_{t=0}^{T} \sum_{z \in Z} |A_{t+1,z} - A_{t,z}|$, where $A$ is the VAV damper opening state (percentage). Less aggregate actuation implies that the controller enforces smoother control, reducing wear on the actuator in the long run.
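A minimal sketch of these three metrics (illustrative NumPy implementations under the assumption that zone temperatures and damper positions are logged as arrays):

```python
import numpy as np

def total_energy(E_t):
    """E_tot: total energy consumption over the evaluation window."""
    return float(np.sum(E_t))

def temperature_deviation(T_z, lb_z, ub_z):
    """TD with predefined comfort bounds (per-timestep zone temperatures vs. bounds)."""
    T_z = np.asarray(T_z)
    return float(np.sum(np.maximum(T_z - ub_z, 0.0) + np.maximum(lb_z - T_z, 0.0)))

def actuation_rate(damper_pct):
    """A_s: total change in VAV damper position; damper_pct has shape (time, zones)."""
    return float(np.sum(np.abs(np.diff(np.asarray(damper_pct), axis=0))))
```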

7.3. Results

We conducted experiments to assess the impact of varying weather conditions, zone set-point values, and thermal loads, utilizing three months of data for pre-training to learn the building dynamics and derive the initial supervisory controller. Subsequently, we tested our framework for a year with Nashville weather data. Weather-related changes include sudden increases and decreases in the outside air temperature and relative humidity, e.g., sudden short warm spells during spring, mostly occurring in February and March 2021, and sudden cold spells in August and October 2021. Zone set-point adjustments altered the temperature bounds by 4 °F to 6 °F, and increased thermal loads mimicked high-occupancy events.
We detected 24 instances of weather changes, 14 set-point modifications, 11 thermal load increases, and 26 combined scenarios. Our framework’s performance was benchmarked against Guideline-36 [14] and compared with PPO [39], DDPG [40], and a data-driven MPC [51] using MPPI with periodic model updates. All algorithms shared the same reward function, state space, and control actions. The DRL approaches ran online. PPO updated policies every 12 h, DDPG sampled from its Replay Buffer weekly, and MPPI maintained consistent prediction horizons across non-stationarities, with weekly model refreshes as in [47].
To compare our approach’s performance after non-stationary changes occurred, we examined metrics over a week following detection and considered the maximum interval for building response alteration. Figure 7 shows these results as bar plots, demonstrating our method’s superior performance to rule-based control and PPO and DDPG’s online deployment. Our results are also comparable or better than MPPI. Next, we analyze individual metric performance.
Our approach yields significantly better energy savings $E_{tot}$ compared to the other methods during non-stationary changes. This is attributed to the accelerated adaptation process: multiple actors generate experiences and the predictive models facilitate quick simulation of future behavior, which outpaces fully online methods that rely on real-time data from the testbed. The rule-based (RBC), on-policy (PPO), and off-policy (DDPG) approaches perform a significant amount of exploratory behavior under non-stationary system behavior, leading to suboptimal performance for a significant duration of time. In contrast, our approach converges faster, which we discuss later in the context of adaptation time. MPPI also performs well, using the testbed models for offline planning and online execution. However, its performance suffers from less reliable state estimates, as MPPI with updated models performs better than MPPI without model updates.
Regarding comfort, measured by temperature deviation T D , our method outperformed others, with PPO and DDPG performing worse than RBC due to their exploratory behavior in non-stationary conditions. PPO’s impact was less severe due to conservative updates. Guideline 36 also underperformed, as its abrupt operational changes increased energy use and compromised temperature control. MPPI’s frequent setpoint adjustments, termed “hunting” behavior, caused discomfort in spite of making optimal choices. Our RL agent in the relearning loop used the models to generate and sample many experiences over a short time period. Sudden changes were penalized by the reward formulation, leading to superior performance compared to Guideline-36, PPO, and DDPG. The relearning agent’s performance was on par with MPPI.
The VAV damper actuation rate of the relearning framework was also better than that of the other approaches for similar reasons. This led us to investigate how long each of these methods operated suboptimally, which provides further insight into the differences in their behavior. Hence, we compared the time taken to relearn, $\Delta T_{relearn}$, against the online algorithms PPO and DDPG in Figure 8. $\Delta T_{relearn}$ indicates the time from the start of the relearning trigger until the controller training converged. We measured this for our approach as well as for PPO and DDPG using a callback during the training process that checked whether the episode reward over the last 10 episodes was within 2 standard deviations of the average of the past 100 episode rewards.
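A minimal sketch of such a convergence check (a hypothetical helper, not the authors' exact callback):

```python
import numpy as np

def training_converged(episode_rewards, short=10, long=100, n_std=2.0):
    """True when the mean of the last `short` episode rewards lies within
    `n_std` standard deviations of the mean of the last `long` episode rewards."""
    if len(episode_rewards) < long:
        return False
    recent_mean = np.mean(episode_rewards[-short:])
    history = np.asarray(episode_rewards[-long:])
    return abs(recent_mean - history.mean()) <= n_std * history.std()
```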
Specifically, we observe that the dual-loop controller approach takes 60% to 80% less time than the online approaches proposed in [52,53,54] using PPO and DDPG, while using many more samples (about 10 times as many) across the parallel environments. This speed-up is a major advantage in addressing the challenge of operating with a suboptimal policy under non-stationary behavior when the system dynamics are slow.
Next, the seasonal reward analysis, shown in Figure 9, revealed optimal spring performance due to stable temperature and humidity conditions, which were close to comfortable values. Autumn and winter faced challenges due to frequent weather shifts, with autumn being particularly problematic in Nashville due to frequent changes in its weather. This required frequent sampling of the real data used as input for the simulation. Figure 10 shows the number of relearning phases in a year, highlighting the true causes of non-stationarity. Weekly activations were reasonable; they ranged from 8 to 16, and were predominantly triggered by user preference changes because weather and occupancy patterns were more predictable. The consistent average number of relearning phases suggests the framework’s limited retention capacity, pointing to future research on hybrid automata to reduce relearning frequency.

8. Conclusions and Future Work

This study introduced a relearning framework tailored for non-stationary system control, exemplified by a building energy optimization case. The framework enables end-to-end data-driven control suitable for practical applications. We also established a systematic hyperparameter optimization strategy that efficiently manages complex hyperparameter spaces and avoids excessive computational burden. We demonstrated the relearning framework's performance on a five-zone building testbed and benchmarked it against the Guideline-36, PPO, DDPG, and MPPI algorithms under similar operating conditions. Our relearning algorithm is stable across seasons and consistent in its relearning triggers, which underscores its robustness. Future efforts will aim to enhance model retention post-convergence to reduce the relearning frequency. Another important aspect of this work is extending the framework to a real building. This requires adding contingency plans that address multiple robustness considerations on top of the performance challenges we addressed here, including sensor information loss, default safety logic, constant monitoring to ensure occupant comfort, and actual energy comparisons under nominal conditions. We have therefore separately published a real-world implementation of this approach in [55]. Our goal is to ensure that the framework can be deployed on any HVAC system configuration; hence, in the future, we plan to study the generalizability of our approach across different manufacturer specifications for HVAC components as well as different weather conditions. We also plan to consider cost optimization as one of the objectives in our framework.

Author Contributions

Conceptualization, M.Q.-G. and G.B.; Methodology, A.N., M.Q.-G. and G.B.; Software, A.N.; Formal analysis, M.Q.-G. and G.B.; Investigation, A.N.; Data curation, A.N.; Writing—original draft, A.N.; Writing—review & editing, M.Q.-G. and G.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are openly available in GitHub: https://github.com/AvisekNaug/ah_deployment.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
HVAC: Heating, Ventilation, and Air Conditioning
CPS: Cyber-Physical Systems
MPC: Model Predictive Control
ML: Machine Learning
RL: Reinforcement Learning
EWC: Elastic Weight Consolidation
ASHRAE: American Society of Heating, Refrigerating and Air-Conditioning Engineers
MDP: Markov Decision Process
LC-NSMDP: Lipschitz Continuous Non-Stationary Markov Decision Process
LSTM: Long Short-Term Memory
FCNN: Fully Connected Neural Network
FIFO: First In, First Out Queue
OOD: Out of Distribution Error
HPO: Hyperparameter Optimization
DAG: Directed Acyclic Graph
SMBO: Sequential Model-Based Optimization
FMU: Functional Mockup Unit
VAV: Variable Air Volume Unit
MPPI: Model Predictive Path Integral Control
PPO: Proximal Policy Optimization
DDPG: Deep Deterministic Policy Gradient

References

  1. Cassandras, C. Chapter 3—Online Control and Optimization for Cyber-Physical Systems. In Cyber-Physical Systems; Song, H., Rawat, D.B., Jeschke, S., Brecher, C., Eds.; Intelligent Data-Centric Systems; Academic Press: Boston, MA, USA, 2017; pp. 31–54.
  2. Al-Ali, R.; Bulej, L.; Kofroň, J.; Bureš, T. A guide to design uncertainty-aware self-adaptive components in Cyber–Physical Systems. Future Gener. Comput. Syst. 2022, 128, 466–489.
  3. Padakandla, S.; J, P.K.; Bhatnagar, S. Reinforcement learning algorithm for non-stationary environments. Appl. Intell. 2020, 50, 3590–3606.
  4. Lewis, F.L.; Vrabie, D.L.; Syrmos, V.L. Optimal Control; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2012.
  5. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018.
  6. Branicky, M.S. Introduction to hybrid systems. In Handbook of Networked and Embedded Control Systems; Springer: Berlin/Heidelberg, Germany, 2005; pp. 91–116.
  7. Naug, A.; Quinones-Grueiro, M.; Biswas, G. Deep reinforcement learning control for non-stationary building energy management. Energy Build. 2022, 277, 112584.
  8. Kim, D.; Lee, J.; Do, S.; Mago, P.J.; Lee, K.H.; Cho, H. Energy modeling and model predictive control for HVAC in buildings: A review of current research trends. Energies 2022, 15, 7231.
  9. Zhang, H.; Seal, S.; Wu, D.; Bouffard, F.; Boulet, B. Building energy management with reinforcement learning and model predictive control: A survey. IEEE Access 2022, 10, 27853–27862.
  10. Fu, Q.; Han, Z.; Chen, J.; Lu, Y.; Wu, H.; Wang, Y. Applications of reinforcement learning for building energy efficiency control: A review. J. Build. Eng. 2022, 50, 104165.
  11. Fang, X.; Gong, G.; Li, G.; Chun, L.; Peng, P.; Li, W.; Shi, X.; Chen, X. Deep reinforcement learning optimal control strategy for temperature setpoint real-time reset in multi-zone building HVAC system. Appl. Therm. Eng. 2022, 212, 118552.
  12. Li, F.; Du, Y. Intelligent multi-zone residential HVAC control strategy based on deep reinforcement learning. In Deep Learning for Power System Applications: Case Studies Linking Artificial Intelligence and Power Systems; Springer: Berlin/Heidelberg, Germany, 2023; pp. 71–96.
  13. Arroyo, J.; Manna, C.; Spiessens, F.; Helsen, L. Reinforced model predictive control (RL-MPC) for building energy management. Appl. Energy 2022, 309, 118346.
  14. Guideline 36: Best in Class HVAC Control Sequences. 2025. Available online: https://www.ashrae.org/professional-development/all-instructor-led-training/catalog-of-instructor-led-training/guideline-36-best-in-class-hvac-control-sequences (accessed on 26 February 2025).
  15. Yoon, Y.; Amasyali, K.; Li, Y.; Im, P.; Bae, Y.; Liu, Y.; Zandi, H. Energy performance evaluation of the ASHRAE Guideline 36 control and reinforcement learning–based control using field measurements. Energy Build. 2024, 325, 115005.
  16. Lee, Y.M.; Horesh, R.; Liberti, L. Optimal HVAC Control as Demand Response with On-site Energy Storage and Generation System. Energy Procedia 2015, 78, 2106–2111.
  17. D’Ettorre, F.; Conti, P.; Schito, E.; Testi, D. Model predictive control of a hybrid heat pump system and impact of the prediction horizon on cost-saving potential and optimal storage capacity. Appl. Therm. Eng. 2019, 148, 524–535.
  18. Luzi, M.; Vaccarini, M.; Lemma, M. A tuning methodology of Model Predictive Control design for energy efficient building thermal control. J. Build. Eng. 2019, 21, 28–36.
  19. Azuatalam, D.; Lee, W.L.; de Nijs, F.; Liebman, A. Reinforcement learning for whole-building HVAC control and demand response. Energy AI 2020, 2, 100020.
  20. Fu, Q.; Li, Z.; Ding, Z.; Chen, J.; Luo, J.; Wang, Y.; Lu, Y. ED-DQN: An event-driven deep reinforcement learning control method for multi-zone residential buildings. Build. Environ. 2023, 242, 110546.
  21. Fang, X.; Gong, G.; Li, G.; Chun, L.; Peng, P.; Li, W.; Shi, X. Cross temporal-spatial transferability investigation of deep reinforcement learning control strategy in the building HVAC system level. Energy 2023, 263, 125679.
  22. Kontes, G.D.; Giannakis, G.I.; Sánchez, V.; Agustin-Camacho, D.; Romero-Amorrortu, A.; Panagiotidou, N.; Rovas, D.V.; Steiger, S.; Mutschler, C.; Gruen, G.; et al. Simulation-based evaluation and optimization of control strategies in buildings. Energies 2018, 11, 3376.
  23. Costanzo, G.T.; Iacovella, S.; Ruelens, F.; Leurs, T.; Claessens, B.J. Experimental analysis of data-driven control for a building heating system. Sustain. Energy Grids Netw. 2016, 6, 81–90.
  24. Di Natale, L.; Svetozarevic, B.; Heer, P.; Jones, C.N. Physically Consistent Neural Networks for building thermal modeling: Theory and analysis. Appl. Energy 2022, 325, 119806.
  25. Balali, Y.; Chong, A.; Busch, A.; O’Keefe, S. Energy modelling and control of building heating and cooling systems with data-driven and hybrid models—A review. Renew. Sustain. Energy Rev. 2023, 183, 113496.
  26. Shah, S.F.A.; Iqbal, M.; Aziz, Z.; Rana, T.A.; Khalid, A.; Cheah, Y.N.; Arif, M. The Role of Machine Learning and the Internet of Things in Smart Buildings for Energy Efficiency. Appl. Sci. 2022, 12, 7882.
  27. Pinto, G.; Wang, Z.; Roy, A.; Hong, T.; Capozzoli, A. Transfer learning for smart buildings: A critical review of algorithms, applications, and future perspectives. Adv. Appl. Energy 2022, 5, 100084.
  28. Morteza, A.; Yahyaeian, A.A.; Mirzaeibonehkhater, M.; Sadeghi, S.; Mohaimeni, A.; Taheri, S. Deep learning hyperparameter optimization: Application to electricity and heat demand prediction for buildings. Energy Build. 2023, 289, 113036.
  29. Boutahri, Y.; Tilioua, A. Machine learning-based predictive model for thermal comfort and energy optimization in smart buildings. Results Eng. 2024, 22, 102148.
  30. Lecarpentier, E.; Rachelson, E. Non-stationary Markov decision processes, a worst-case approach using model-based reinforcement learning. Adv. Neural Inf. Process. Syst. 2019, 32, 7214–7223.
  31. Trimponias, G.; Dietterich, T.G. Reinforcement Learning with Exogenous States and Rewards. arXiv 2023, arXiv:2303.12957.
  32. Glasmachers, T. Limits of end-to-end learning. In Proceedings of the Asian Conference on Machine Learning, PMLR, Nagoya, Japan, 15–17 November 2017; pp. 17–32.
  33. Ring, M.B. Continual Learning in Reinforcement Environments. Ph.D. Thesis, University of Texas at Austin, Austin, TX, USA, 1994.
  34. Bryhn, A.C.; Dimberg, P.H. An operational definition of a statistically meaningful trend. PLoS ONE 2011, 6, e19241.
  35. Deng, X.; Zhang, Y.; Qi, H. Towards optimal HVAC control in non-stationary building environments combining active change detection and deep reinforcement learning. Build. Environ. 2022, 211, 108680.
  36. Stripping off the Implementation Complexity of Physics-Based Model Predictive Control for Buildings via Deep Learning. Online. 2019. Available online: https://s3.us-east-1.amazonaws.com/climate-change-ai/papers/neurips2019/34/paper.pdf (accessed on 22 September 2021).
  37. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526. [Google Scholar] [CrossRef]
  38. Sutton, R.S.; McAllester, D.A.; Singh, S.P.; Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Adv. Neural Inf. Process. Syst. 2000, 12, 1057–1063. [Google Scholar]
  39. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  40. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2016, arXiv:1509.02971. [Google Scholar]
  41. Russell, S.; Norvig, P. Artificial Intelligence: A Modern Approach, 4th ed.; Pearson Education, Inc.: Hoboken, NJ, USA, 2021. [Google Scholar]
  42. Bowers, R.I.; Salmon, C. Causal Reasoning. In Encyclopedia of Evolutionary Psychological Science; Springer: Cham, Switzerland, 2017; pp. 1–17. [Google Scholar] [CrossRef]
  43. Geiger, D.; Verma, T.; Pearl, J. d-separation: From theorems to algorithms. In Machine Intelligence and Pattern Recognition; Elsevier: Amsterdam, The Netherlands, 1990; Volume 10, pp. 139–148. [Google Scholar]
  44. Bergstra, J.; Bardenet, R.; Bengio, Y.; Kégl, B. Algorithms for hyperparameter optimization. Adv. Neural Inf. Process. Syst. 2011, 24, 2546–2554. [Google Scholar]
  45. Buildings.Examples.VAVReheat. 2021. Available online: https://simulationresearch.lbl.gov/modelica/releases/v4.0.0/help/Buildings_Examples_VAVReheat.html (accessed on 15 July 2021).
  46. Zaytar, A.; Amrani, C.E. Sequence to Sequence Weather Forecasting with Long Short-Term Memory Recurrent Neural Networks. Int. J. Comput. Appl. 2016, 143, 7–11. [Google Scholar]
  47. Ding, X.; Du, W.; Cerpa, A.E. MB2C: Model-Based Deep Reinforcement Learning for Multi-zone Building Control. In BuildSys ’20: Proceedings of the 7th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation; Association for Computing Machinery: New York, NY, USA, 2020; pp. 50–59. [Google Scholar] [CrossRef]
  48. Karevan, Z.; Suykens, J.A.K. Transductive LSTM for time-series prediction: An application to weather forecasting. Neural Netw. 2020, 125, 1–9. [Google Scholar] [CrossRef]
  49. Liaw, R.; Liang, E.; Nishihara, R.; Moritz, P.; Gonzalez, J.E.; Stoica, I. Tune: A Research Platform for Distributed Model Selection and Training. arXiv 2018, arXiv:1807.05118. [Google Scholar]
  50. A Ten-Minute Introduction to Sequence-to-Sequence Learning in Keras. 2020. Available online: https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html (accessed on 27 February 2022).
  51. Nagabandi, A.; Finn, C.; Levine, S. Deep Online Learning via Meta-Learning: Continual Adaptation for Model-Based RL. arXiv 2018, arXiv:1812.07671. [Google Scholar]
  52. Wei, T.; Wang, Y.; Zhu, Q. Deep reinforcement learning for building HVAC control. In Proceedings of the 54th Annual Design Automation Conference 2017, Austin, TX, USA, 18–22 June 2017; pp. 1–6. [Google Scholar]
  53. Barrett, E.; Linder, S. Autonomous hvac control, a reinforcement learning approach. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer: Berlin/Heidelberg, Germany, 2015; pp. 3–19. [Google Scholar]
  54. Zhang, Z.; Chong, A.; Pan, Y.; Zhang, C.; Lam, K.P. Whole building energy model for HVAC optimal control: A practical framework based on deep reinforcement learning. Energy Build. 2019, 199, 472–490. [Google Scholar] [CrossRef]
  55. Naug, A.; Quinones–Grueiro, M.; Biswas, G. Reinforcement learning-based HVAC supervisory control of a multi-zone building- A real case study. In Proceedings of the 2022 IEEE Conference on Control Technology and Applications (CCTA), Trieste, Italy, 23–25 August 2022; pp. 1172–1177. [Google Scholar] [CrossRef]
Figure 1. Overview of the solution using an inner and outer loop schema.
Figure 2. Performance monitor module.
Figure 3. Dynamic system models. The dashed lines represent any interfacing operation between the component itself and external components.
Figure 4. Exogenous variable prediction module. The dashed lines represent any interfacing operation between the component itself and external components. K* and N* represent the ideal values for the past and future horizons, respectively.
Figure 5. Schematic of the five-zone testbed. Source: [45].
Figure 6. Bayes net for the relearning approach applied to the five-zone testbed.
Figure 7. Performance of the Rule-Based, PPO, DDPG, MPPI, and Relearning approaches deployed on the testbed and simulated over a period of 1 year. Benchmarking covers all types of non-stationary changes, and metrics are recorded for one week after the Performance Monitor Module detects non-stationarity. Energy performance (a) is aggregated over the week; temperature deviation (b) and actuation rates (c) are aggregated on a per-hour basis.
Figure 8. Time to adapt under each DRL-based approach across four different types of non-stationary behavior.
Figure 9. Total reward obtained for our approach depending on the week and season of the year.
Figure 10. Total number of relearning phases triggered for our approach depending on the week of the year.
Table 1. Description of the testbed Dynamic System Model as a non-stationary MDP.
Component | Variables
State $\bar{s}_t$ | (1) Outside Air Temperature (oat); (2) Outside Air Relative Humidity (orh); (3) Five Zone Temperatures ($T_z$, $z = 1, \ldots, 5$); (4) Total Energy Consumption ($E_{tot}$); (5) AHU Discharge Air Temperature ($T_{disch}$)
Action $\bar{a}_t$ | AHU Discharge Temperature Setpoint ($T_{dsp}$)
Reward $r(s_t, a_t)$ | $r_{energy} + r_{cmft} + r_{vav}$, where $r_{energy} = -E_{tot}$, $r_{cmft} = -\sum_{z \in zones} \max(T_z - ub_{z,t},\, 0,\, lb_{z,t} - T_z)$, and $r_{vav} = -\sum_{z \in zones} |v_{\%,z,t} - v_{\%,z,t-1}|$
Non-stationary Transition Model $p_t(s_{t+1} \mid s_t, a_t)$ | (1) Total Energy Consumption Model ($M_{energy}$); (2) Zone Temperature Model ($M_{T,zone}$); (3) VAV Damper Percentage Model ($M_{vav\%}$)
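To make the reward in Table 1 concrete, the short Python sketch below evaluates the three terms for one time step, reading each as a penalty (negated), which is consistent with an energy-minimizing objective. The argument names (total_energy, zone_temps, upper_bounds, lower_bounds, vav_pct, vav_pct_prev) are hypothetical placeholders for $E_{tot}$, $T_z$, $ub_{z,t}$, $lb_{z,t}$, and $v_{\%,z,t}$; it illustrates the formula, not the exact code used on the testbed.

# Minimal sketch of the Table 1 reward; names are illustrative placeholders.
def reward(total_energy, zone_temps, upper_bounds, lower_bounds,
           vav_pct, vav_pct_prev):
    # Energy penalty: negative of total consumption E_tot.
    r_energy = -total_energy
    # Comfort penalty: deviation of each zone temperature from its band [lb, ub].
    r_cmft = -sum(max(t - ub, 0.0, lb - t)
                  for t, ub, lb in zip(zone_temps, upper_bounds, lower_bounds))
    # Actuation penalty: change in VAV damper percentage between consecutive steps.
    r_vav = -sum(abs(v - v_prev) for v, v_prev in zip(vav_pct, vav_pct_prev))
    return r_energy + r_cmft + r_vav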
Table 2. Hyperparameters of the ensemble–CPS pair, the metrics they affect, and their classification as global or local.
Hyperparameter(s) | Metric Affected | Global/Local
Window sizes $W_{pm1}$, $W_{pm2}$ (Performance Monitor Module) | $L_{\Delta t_d}$ | Global
Memory Buffer Size $M_e$ | $L_{me}$, $L_{\Delta t_r}$ | Global
Forecast Horizon Length $N$ | $L_{me}$, $L_{rwd}$, $L_{\Delta t_r}$ | Global
Exogenous Variable Predictor: nodes, layers, activations | $L_{me}$ | Local
Dynamic System Model(s): nodes, layers, activations | $L_{me}$ | Local
RL Agent Policy Network: nodes, layers, activations | $L_{rwd}$ | Local
RL Agent Value Network: nodes, layers, activations | $L_{rwd}$ | Local
Discount Factor $\gamma$ | $L_{rwd}$ | Local
Episode Length $l_e$ | $L_{rwd}$ | Local
Table 3. Ranges of the global hyperparameters.
Hyperparameter | Search Space (Range)
$W_{pm1}$ (Performance Monitor Module) | $\{1, 2, \ldots, 12\}$ h
$W_{pm2}$ (Performance Monitor Module) | $\{15, 30, \ldots, 165, 180\}$ min
$M_e$ (Experience Buffer) | $\{24, 48, \ldots, 144, 168\}$ h
$N$ (Exogenous Variable Predictor and Dynamic System Model) | $\{24, 48, 72, 96\}$ h
Table 4. Ranges of the local hyperparameters.
Hyperparameter | Search Space (Range)
Dynamic System Model weights ($M_{energy}$, $M_{T,zone}$, $M_{vav\%,z}$) | $W \in \{8, 16, 32, \ldots, 256\}$
Dynamic System Model activation ($M_{energy}$, $M_{T,zone}$, $M_{vav\%,z}$) | $a \in$ {linear, relu, sigmoid, tanh}
Exogenous Variable Model weights | $W \in \{8, 16, 32, \ldots, 256\}$
Exogenous Variable Model activation | $a \in$ {linear, relu, tanh}
Actor ($\pi_\theta$)/Critic ($V_\phi$) network weights | $W \in \{64, 128, 256, 512\}$
Actor ($\pi_\theta$)/Critic ($V_\phi$) network activation | $a \in$ {linear, relu, tanh}
Discount factor | $\gamma \in \{0.99, 0.95, \ldots, 0.80\}$
Episode length | $l_e \in \{1, 2, 3\}$ days
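If the hyperparameter search were run with Ray Tune (cited as [49]), the ranges in Tables 3 and 4 could be encoded as a categorical search space such as the sketch below. The dictionary keys are placeholders rather than identifiers from the paper, and the elided $\gamma$ range is read here as steps of 0.05; this is one possible encoding, not the authors' exact configuration.

# Illustrative Ray Tune search space for the ranges in Tables 3 and 4.
# Key names are placeholders, not identifiers from the paper.
from ray import tune

search_space = {
    # Global hyperparameters (Table 3)
    "w_pm1_hours": tune.choice(list(range(1, 13))),                 # {1, ..., 12} h
    "w_pm2_minutes": tune.choice(list(range(15, 181, 15))),         # {15, ..., 180} min
    "memory_buffer_hours": tune.choice(list(range(24, 169, 24))),   # {24, ..., 168} h
    "forecast_horizon_hours": tune.choice([24, 48, 72, 96]),
    # Local hyperparameters (Table 4)
    "dyn_model_width": tune.choice([8, 16, 32, 64, 128, 256]),
    "dyn_model_activation": tune.choice(["linear", "relu", "sigmoid", "tanh"]),
    "exo_model_width": tune.choice([8, 16, 32, 64, 128, 256]),
    "exo_model_activation": tune.choice(["linear", "relu", "tanh"]),
    "actor_critic_width": tune.choice([64, 128, 256, 512]),
    "actor_critic_activation": tune.choice(["linear", "relu", "tanh"]),
    "gamma": tune.choice([0.99, 0.95, 0.90, 0.85, 0.80]),           # assumed 0.05 steps
    "episode_length_days": tune.choice([1, 2, 3]),
}

A tuner would then sample configurations from this dictionary; the global/local split in Table 2 suggests fixing the global keys first and tuning the local keys per component.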
Table 5. Tuned values of the input sequence length K and the prediction horizon N under different conditions of non-stationarity.
Non-Stationarity | K (h) | N (h)
Weather | 18 | 72
Zone Set Point | 6 | 72
Thermal Load | 6 | 72
Combined | 12 | 48
Table 6. Tuned values of $W_{pm1}$ and $W_{pm2}$.
Non-Stationarity | $W_{pm1}$ (h) | $W_{pm2}$ (h)
Weather | 6 | 2
Zone Setpoint | 3 | 2
Thermal Load | 3 | 2
Combined | 6 | 3
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
