1. Introduction
Buildings within the European Union account for nearly 40% of total energy consumption [1]. According to the International Energy Agency (IEA), buildings and the construction sector are responsible for approximately 36% of total emissions [2], with a significant portion stemming from residential, non-residential, and construction operations [3]. The sector therefore holds great potential for cost-effective efficiency improvements and substantial reductions in greenhouse gas emissions [4]. Traditionally, Rule-Based Control (RBC) and Proportional-Integral-Derivative (PID) controllers have dominated the industry [5]. However, the advent of machine learning and increased data availability has enabled new capabilities [6,7]. These data-driven methods benefit the modeling, forecasting, and optimal control of building energy systems [8]. Despite the promise of sizable energy demand reductions, scalable solutions remain elusive due to a lack of trust, poor interpretability, and the need for heavy, constant computation loops [9].
Within data-driven control, Reinforcement Learning (RL) has shown great promise, though its extreme sample inefficiency makes learning directly on physical equipment impractical. Early model-free deep RL deployments in commercial facilities and data centers [10] and follow-up studies [11] demonstrated cooling-energy reductions, but they relied heavily on expert supervision and strictly limited decision spaces. Similarly, offline training via detailed fluid dynamics emulators [12] reached convergence but addressed only a single airflow lever. To alleviate the risks of online exploration, several groups shifted the training burden to simulators. Zhong et al. trained an RL policy using an OpenAI Gym interface wrapped around EnergyPlus [13]. However, relying strictly on EnergyPlus is computationally expensive and abstracts away fast control dynamics, resulting in policies that risk severe overfitting and transfer poorly when confronted with real controller logic. Recognizing the need for physics constraints, Drgoňa et al. [14] proposed a hybrid framework, though it deliberately excluded dynamic interactions with HVAC equipment.
To mitigate extreme training times, Imitation Learning (IL) has emerged as a powerful technique to bootstrap the learning process by mimicking expert demonstrations [15]. Among IL techniques, Behavioral Cloning (BC) maps states directly to actions based on expert trajectories [16]. While this approach stabilizes initial policy optimization and provides a “warm start” for RL fine-tuning, cutting training time significantly, its widespread adoption remains low [15]. Importantly, these approaches still rely heavily on white box models based on EnergyPlus and/or Modelica, which are computationally expensive and very hard to produce. Silvestri et al. [8] successfully merged IL and RL with an important real-world validation, achieving sizable energy savings. However, despite using a model-free implementation, they still rely on a Modelica-based digital twin for validation. Furthermore, their controller has a very narrow scope, managing only one room and one main HVAC valve, thereby ignoring total building energy consumption. A critical barrier is the fragmentation of the development ecosystem, which often requires complex “glue code” to bridge simulators with machine learning libraries. Addressing the need for self-contained frameworks, Goldfeder et al. developed an all-Python simulation suite using gray box models based on energy equations [17]. However, the suite lacks validation of the real-world effectiveness of the resulting RL policies and does not test their closed-loop efficacy when transferred to high-fidelity building emulators.
To the best of the authors’ knowledge, no publications in the literature demonstrate a holistic, fully data-driven supervisory controller capable of coordinating envelope thermodynamics with air-side and water-side HVAC equipment using a transparent, self-contained Python framework. Existing studies typically (i) optimize a single chiller or airflow set-point [8,10,11,12], (ii) validate only in simulation without hardware-in-the-loop testing [13,17], or (iii) decouple envelope from plant optimization [14].
These factors highlight an open research gap and the clear opportunity to develop a modeling and RL control development suite entirely in Python (version 3.12). This work addresses these gaps by developing a unified pipeline using Twin4Build that produces gray box, physics-based building models calibrated directly from data. Embedded within a Gym environment, the framework utilizes Imitation Learning to efficiently produce RL control policies from surrogate models. We then analyze the behavior of these policies when transferred to a high-fidelity emulator, mimicking deployment to a real building, and conduct further online fine-tuning to evaluate performance gains across large action spaces and total building energy consumption.
The novelty of this study lies in the development of a unified and fully Python-based framework that combines digital twin modeling with Imitation Learning and Deep Reinforcement Learning to support holistic building energy control. Unlike many previous studies that focus on optimizing a single component or building section, or rely on computationally intensive white box simulators, the proposed approach uses gray box surrogate models generated through the Twin4Build platform to efficiently train supervisory control policies that coordinate both building envelope dynamics and HVAC system operation. An additional contribution is the structured transfer of trained policies from the surrogate environment to a high-fidelity BOPTEST emulator, enabling realistic closed-loop evaluation before potential real-world deployment. By bringing these elements together in a single pipeline, the study demonstrates a practical and scalable pathway for developing data-driven control strategies capable of improving whole-building energy performance while significantly reducing training complexity.
2. Materials and Methods
The proposed methodology for developing a deep reinforcement learning (DRL)-based supervisory controller for building HVAC systems is depicted in Figure 1. These controllers act at the supervisory level, monitoring the global building state to generate near-optimal setpoints for the HVAC systems, rather than replacing low-level internal control loops like PID controllers.
To address the fragmentation and computational bottlenecks identified previously, this pipeline is unified within a Python ecosystem and divided into three primary stages: creating a Twin4Build [18] building model (Figure 2), pre-training a control policy via Imitation Learning (Figure 3), and performing RL fine-tuning (Figure 4).
In the first stage, a gray box building model is created using Twin4Build. Leveraging semantic ontologies, the framework constructs graph-based representations of building systems that seamlessly support both simulation and parameter estimation. It provides a library of modular, physics-based elements and gray box component models, allowing users to tailor model fidelity and complexity. Crucially, these surrogate models perform simulations an order of magnitude faster than conventional high-fidelity white box engines. As new component models are contributed, the framework supports varying levels of modeling accuracy. For a comprehensive overview of constructing a Twin4Build model, the reader is referred to the framework’s main repository (https://github.com/JBjoernskov/Twin4Build (accessed on 10 March 2026)) [18,19].
In the second stage (Figure 3), the objective is to establish a stable, baseline control policy capable of satisfying standard building setpoints by imitating existing controllers. Drawing on proven practices from robotics and addressing the sample inefficiency of pure RL, this study constructs a dataset of expert trajectories to pre-train the policy via supervised learning (Behavioral Cloning). This provides a robust “warm start,” significantly reducing the training time required for the subsequent RL fine-tuning phase. The following subsections describe this progression in detail.
2.1. Data Acquisition
Using the Twin4Build model described previously, we construct a dataset of expert control trajectories across various episodic scenarios. These trajectories capture the sequential interaction between the building system’s state and the baseline controller’s inputs over time. To make this data suitable for Imitation Learning, the continuous trajectories are broken down into individual transition vectors. Each transition is stored as a discrete tuple containing the current observation, the applied control action, the subsequent observation, and an episode termination flag. Aggregating these step-by-step transitions yields a comprehensive supervised dataset that demonstrates the expert’s behavior, providing the necessary examples for the neural network to learn the direct mapping from states to actions. The formal mathematical definitions of these trajectories and system dynamics are provided in Appendix A.
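As a concrete illustration, the decomposition of an episodic trajectory into transition tuples might look as follows (a minimal sketch; the `Transition` container and `trajectory_to_transitions` helper are hypothetical names, not part of Twin4Build):

```python
from dataclasses import dataclass


@dataclass
class Transition:
    obs: list        # observation before the action
    action: list     # expert (baseline controller) action
    next_obs: list   # observation after the action
    done: bool       # episode termination flag


def trajectory_to_transitions(observations, actions):
    """Split one episodic trajectory into per-step transition tuples.

    `observations` holds one more entry than `actions`, since the final
    observation closes the episode.
    """
    transitions = []
    for t, action in enumerate(actions):
        transitions.append(Transition(
            obs=observations[t],
            action=action,
            next_obs=observations[t + 1],
            done=(t == len(actions) - 1),
        ))
    return transitions
```

Aggregating the outputs of this helper over many episodes yields the supervised dataset used for Behavioral Cloning.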
2.2. Policy Pre-Training and Fine-Tuning
Behavioral Cloning (BC) frames imitation learning as a supervised task, providing a computationally efficient method to initialize a control policy, $\pi_\theta$. Using the transition dataset $\mathcal{D} = \{(o_i, a_i)\}$ acquired from the expert, the policy learns to directly map observations ($o_i$) to control inputs ($a_i$) by minimizing a loss function $\ell$ (e.g., mean squared error for continuous actions):

$$\theta^{*} = \arg\min_{\theta} \sum_{(o_i,\, a_i) \in \mathcal{D}} \ell\big(\pi_\theta(o_i),\, a_i\big)$$
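The supervised fit described above can be sketched with a toy linear policy trained by gradient descent on the mean squared error (illustrative only; the study uses a neural-network policy, and `behavioral_cloning_fit` is a hypothetical helper):

```python
import numpy as np


def behavioral_cloning_fit(obs, actions, lr=0.1, epochs=500):
    """Fit a linear policy a = W @ o + b by minimizing the mean squared
    error against expert actions (a toy stand-in for the neural-network
    policy described in the text)."""
    rng = np.random.default_rng(0)
    n_obs, n_act = obs.shape[1], actions.shape[1]
    W = rng.normal(scale=0.1, size=(n_act, n_obs))
    b = np.zeros(n_act)
    for _ in range(epochs):
        pred = obs @ W.T + b                  # policy output for each state
        err = pred - actions                  # residual against the expert
        W -= lr * (err.T @ obs) / len(obs)    # MSE gradient step for weights
        b -= lr * err.mean(axis=0)            # MSE gradient step for bias
    return W, b
```

After fitting, the policy reproduces the expert mapping on the demonstration data, providing the “warm start” for RL fine-tuning.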
While BC yields a stable baseline that acts autonomously, further performance gains are achieved through reinforcement learning fine-tuning within the Twin4Build environment (Figure 4). To optimize the policy while preventing it from catastrophically forgetting the expert behavior during early exploration, we employ Proximal Policy Optimization (PPO) augmented with a Kullback–Leibler (KL) divergence penalty. This constraint regulates the divergence between the actively exploring policy ($\pi_\theta$) and the pretrained baseline ($\pi_{\mathrm{BC}}$):

$$L(\theta) = L^{\mathrm{PPO}}(\theta) - \beta\, \mathbb{E}_{o_t}\!\left[ D_{\mathrm{KL}}\!\left(\pi_{\mathrm{BC}}(\cdot \mid o_t) \,\|\, \pi_\theta(\cdot \mid o_t)\right)\right]$$
The weighting coefficient $\beta$ is annealed over time, keeping the policy anchored to the expert initially but progressively granting it the freedom to discover highly efficient control strategies.
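A minimal sketch of the KL-penalized objective and its annealing schedule, assuming diagonal Gaussian action distributions (the `beta0` value and the linear schedule are illustrative choices, not parameters reported in this study):

```python
import numpy as np


def kl_diag_gaussians(mu_p, sigma_p, mu_q, sigma_q):
    """KL(p || q) between diagonal Gaussian action distributions."""
    return np.sum(
        np.log(sigma_q / sigma_p)
        + (sigma_p**2 + (mu_p - mu_q)**2) / (2.0 * sigma_q**2)
        - 0.5
    )


def penalized_objective(ppo_surrogate, mu, sigma, mu_bc, sigma_bc,
                        step, total_steps, beta0=1.0):
    """PPO surrogate objective minus an annealed KL anchor to the
    behavior-cloned policy. The penalty weight decays linearly, so the
    policy is tied to the expert early on and free to explore later."""
    beta = beta0 * (1.0 - step / total_steps)  # linear annealing schedule
    return ppo_surrogate - beta * kl_diag_gaussians(mu_bc, sigma_bc, mu, sigma)
```

As training progresses, the penalty term vanishes and the objective reduces to the plain PPO surrogate.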
Finally, deploying this simulation-trained policy to a real building or a high-fidelity emulator inevitably introduces a performance drop known as the sim2real gap [20]. To mitigate this discrepancy, the framework utilizes further online fine-tuning within the target environment, allowing the policy to adapt to unmodeled, real-world thermodynamic complexities.
2.3. BOPTEST Multi-Zone Office Test Case and Twin4Build Surrogate
To benchmark the proposed control strategy, this study utilizes the Building Optimization Testing Framework (BOPTEST) [21]. BOPTEST provides standardized, dynamic building models (exported as Functional Mock-up Units) that interface seamlessly with reinforcement learning workflows via the BOPTEST-Gym environment [22] and standard libraries like Stable Baselines3 [23].
We selected the multi-zone office air system emulator [24], a five-zone, single-duct variable air volume (VAV) building equipped with reheat capabilities, a central air handling unit (AHU), a variable-efficiency heat pump, and a chiller. All heating and cooling demands are satisfied through this central ventilation system.
To facilitate the rapid simulation required for RL training, we developed a gray box surrogate model of this test case using Twin4Build. The surrogate is structured into three primary subsystems: the AHU model, the building envelope (using thermal RC dynamics), and the VAV system models. The parameters for the dampers, heating coils, and RC room models were calibrated using time series data extracted from the original high-fidelity BOPTEST model. A key computational simplification in the surrogate is the treatment of the AHU fan as a passive component whose power consumption is estimated from the aggregated airflow demands inferred from individual room damper positions.
Detailed schematic diagrams of the BOPTEST building layout, the HVAC configuration, and the specific Twin4Build component topologies are provided in Appendix B.
2.4. Control Strategy
Both the BOPTEST [22] and Twin4Build models were implemented as Gym-compatible environments [25] to support reinforcement learning (RL)-based supervisory control using a Proximal Policy Optimization (PPO) agent [26]. These environments share a common observation and action space, facilitating direct policy transfer between the high-fidelity emulator and the surrogate model. Key performance indicators (KPIs), including energy consumption and setpoint deviations, are used to evaluate controller performance across both environments. The following subsections describe the KPIs and the observation and action spaces for this use case.
2.4.1. Key Performance Indicators (KPIs)
To evaluate the performance of the control policies, this study adopts the standardized Key Performance Indicators (KPIs) defined by the BOPTEST framework. The primary objectives are:
Total Energy Use (ener_tot): Measures the site HVAC energy consumption (in kWh/m²), accounting for all energy vectors (heating, cooling, fans, pumps) normalized by the total floor area.
Thermal Discomfort (tdis_tot): Quantifies the integral of temperature deviations outside a predefined comfort range over time, averaged across all zones (in K · h/zone).
Indoor Air Quality Violation (idis_tot): Measures the integral of CO2 concentration deviations above a predefined safety threshold, averaged across all zones (in ppm · h/zone).
The exact mathematical formulations for these KPIs are detailed in Appendix C.
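For instance, the thermal discomfort KPI can be approximated by a discrete-time integral of out-of-band temperature deviations averaged over zones (a sketch consistent with the tdis_tot description above, not the exact BOPTEST formulation; `thermal_discomfort` is a hypothetical helper):

```python
def thermal_discomfort(temps, lower, upper, dt_hours, n_zones):
    """Integrate temperature deviations outside the comfort band,
    returning K*h averaged over zones. `temps` is a list of per-step
    lists of zone temperatures (in deg C or K, consistently)."""
    total = 0.0
    for step_temps in temps:
        for T in step_temps:
            if T < lower:
                total += (lower - T) * dt_hours   # under-heating deviation
            elif T > upper:
                total += (T - upper) * dt_hours   # over-heating deviation
    return total / n_zones
```

The energy and IAQ KPIs follow the same pattern, integrating power and above-threshold CO2 concentration respectively.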
2.4.2. Observation and Action Spaces
The RL agent interacts with the Twin4Build surrogate and the BOPTEST emulator through a strictly defined interface. All inputs are min–max-scaled to the interval [0, 1] before being passed to the Stable Baselines3 PPO policy.
Observation Space: The state vector supplied to the agent comprises three components:
Sensor observations: Real-time physical measurements read directly from the simulator at each step (e.g., zone temperatures, damper positions).
Time embeddings: Temporal variables (hour of day, day of week, month) encoded as continuous circular features (using sine and cosine transformations) to preserve periodicity at boundaries.
Forecasts: A rolling horizon of upcoming samples for weather, occupancy, and boundary set-points, refreshed at every simulation step.
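The min–max scaling and circular time encoding described above can be sketched as follows (function names are illustrative):

```python
import math


def min_max_scale(x, lo, hi):
    """Min-max scale a raw signal into [0, 1] before it reaches the policy."""
    return (x - lo) / (hi - lo)


def encode_hour(hour):
    """Encode hour-of-day as a point on the unit circle, so that 23:00
    and 00:00 are adjacent in feature space despite the numeric wrap."""
    angle = 2.0 * math.pi * hour / 24.0
    return math.sin(angle), math.cos(angle)
```

Day-of-week and month are encoded the same way, with periods of 7 and 12 respectively.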
Action Space: The controller acts at a supervisory level. It outputs continuous control signals mapped to the AHU supply-air temperature set-point, as well as the local heating/cooling set-points and direct supply damper overrides for each of the five zones.
The exhaustive lookup tables defining the signal keys, ranges, and descriptions for both the observation and action spaces are provided in Appendix D.
2.4.3. Reward Function
In order to guide the control policy towards reducing both energy use and discomfort, the reward function is defined as a weighted sum of the three key performance indicators (KPIs) introduced above: total energy use ($\mathit{ener\_tot}$), thermal discomfort ($\mathit{tdis\_tot}$), and indoor air quality violation ($\mathit{idis\_tot}$). Each KPI is scaled by a corresponding weighting factor $w_{\mathrm{ener}}$, $w_{\mathrm{tdis}}$, and $w_{\mathrm{idis}}$ to balance their relative importance. To ensure that the agent is penalized for high values of energy consumption or discomfort, the reward is taken as the negative of this weighted sum:

$$r = -\left(w_{\mathrm{ener}}\,\mathit{ener\_tot} + w_{\mathrm{tdis}}\,\mathit{tdis\_tot} + w_{\mathrm{idis}}\,\mathit{idis\_tot}\right)$$
With this definition, the learning agent maximizes long-term reward by minimizing energy consumption and maintaining thermal comfort and indoor air quality within acceptable limits. The weighting factors can be tuned to reflect application-specific trade-offs, such as prioritizing occupant comfort over energy savings or vice versa.
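A minimal sketch of this reward, with placeholder weights rather than the values used in the study:

```python
def reward(ener_tot, tdis_tot, idis_tot, w_ener=1.0, w_tdis=1.0, w_idis=1.0):
    """Negative weighted sum of the three KPIs. Default weights are
    illustrative placeholders; tuning them shifts the trade-off between
    energy savings, thermal comfort, and indoor air quality."""
    return -(w_ener * ener_tot + w_tdis * tdis_tot + w_idis * idis_tot)
```

Raising `w_tdis`, for example, would make comfort violations more costly relative to energy use.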
4. Discussion
The results of this study highlight the viability of a fully Python-based, data-driven pipeline for HVAC supervisory control. By bridging Imitation Learning (IL), Reinforcement Learning (RL), and gray-box surrogate modeling, this approach effectively navigates the bottlenecks of sample inefficiency and computational overhead typical of building control research. As demonstrated in the model validation phase, the Twin4Build surrogate model exhibits varying levels of predictive precision compared to the high-fidelity BOPTEST reference. However, this reduced accuracy is the price paid for faster simulation and lower computational overhead. The surrogate model still captures the core thermodynamic behavior and temporal inertia of the building, providing a fidelity that is entirely sufficient for the RL agent to internalize fundamental energy-saving strategies such as thermal load-shifting and peak shaving.
The primary advantage of embracing this lower-accuracy surrogate lies in the gains in training velocity. As shown in the training speed comparison, the surrogate environment operates over an order of magnitude faster than the high-fidelity emulator—completing 300,000 steps in 1.13 h compared to 12.64 h. The gains in speed allow not only for faster deployment but also for rapid iteration and broader experimentation that would be computationally prohibitive using traditional white box simulators.
Furthermore, these computational gains facilitate an effective sim-to-real transfer strategy. Transitioning an RL policy directly from simulation to a physical building often risks erratic or unsafe behavior. However, if the pre-training is done in the surrogate environment using Behavioral Cloning while actively monitoring the physical plausibility of the RL actions, the resulting models serve as a near-optimal “warm start” for RL policies. When deployed to real buildings, or to high-fidelity emulators acting as proxies, these pre-trained policies operate safely from day one. They can then continue to adapt and improve over time through a restricted version of online RL fine-tuning, successfully navigating the sim2real gap without catastrophic exploration.
The value of this methodology extends beyond immediate operational improvements. Combining Imitation Learning, reinforcement learning, and fine-tuning within a unified Python-based approach yields not only substantial energy savings but also a calibrated surrogate digital twin of the building. This lightweight, transparent model can run in parallel with real-time operations, providing a foundational asset for advanced building services such as anomaly detection and space management [18]. Ultimately, this dual output, a robust control policy and a highly functional digital twin, demonstrates the comprehensive utility and scalability of the proposed framework for modern smart buildings.
Several limitations must be acknowledged. First, the substantial energy reductions achieved by the RL agent incur some violations of the indoor comfort set-points, inherently trading minor increases in thermal discomfort and indoor air quality (IAQ) violations for peak-load shaving. While CO2 levels remained safely within acceptable limits, prioritizing energy efficiency over rigid comfort adherence may require careful parameter tuning depending on specific building occupancy requirements. Second, while the gray box surrogate model successfully captures overarching thermal inertia, it struggles with highly dynamic or non-linear signals, such as instantaneous supply fan power and abrupt airflow step changes. Additionally, the current evaluation was restricted to a five-zone office emulator operating strictly under winter heating scenarios. Validating this framework’s scalability to larger, more topologically complex commercial facilities, as well as its performance during cooling-dominated summer months, remains a necessary next step. Lastly, although the methodology provides a safe “warm start”, overcoming the sim-to-real gap still necessitates a period of online fine-tuning within the target environment, which inherently demands active operational time in a real-world deployment.
5. Conclusions
This work presents a fully data-driven modeling and supervisory control framework developed entirely within a Python ecosystem. By utilizing Twin4Build to generate gray box surrogate models, the methodology bridges rapid simulation and practical reinforcement learning (RL) deployment. The approach leverages imitation learning to establish a safe baseline control behavior, followed by proximal policy optimization (PPO) fine-tuning to coordinate multi-zone HVAC operations. A primary advantage of this method is the substantial reduction in computational overhead, achieving training speeds more than an order of magnitude faster than high-fidelity simulators. Consequently, the developed controllers effectively mitigated power spikes during heating start-ups, yielding energy savings of over 24% upon initial deployment and nearly 29% following online fine-tuning. These efficiency gains require a tradeoff; the RL agent learns to prioritize energy reduction by relaxing strict setpoint tracking, exchanging minor thermal comfort and indoor air quality margins for peak-load shaving.
To bridge the gap toward broader commercial adoption, future research must address several key operational constraints. It is necessary to evaluate the statistical reliability of these controllers across diverse stochastic conditions and to implement hard safety constraints, such as fallback rule-based overrides, prior to direct deployment in occupied buildings. Furthermore, subsequent work should analyze the scalability and convergence stability of these algorithms when applied to the highly complex action spaces of larger commercial facilities. Finally, assessing the zero-shot or few-shot transferability of the learned policies to buildings with distinct geometric layouts, physical parameters, and climate zones remains critical for validating the generalizability of the framework.