1. Introduction
Buildings within the European Union account for nearly 40% of total energy consumption [1]. According to the International Energy Agency (IEA), buildings and the construction sector are responsible for approximately 36% of total emissions [2], with a significant portion stemming from residential, non-residential, and construction operations [3]. The sector therefore holds great potential for cost-effective efficiency improvements and substantial reductions in greenhouse gas emissions [4]. Traditionally, Rule-Based Control (RBC) and Proportional-Integral-Derivative (PID) controllers have dominated the industry [5]. However, the advent of machine learning and increased data availability has enabled new capabilities [6,7]. These data-driven methods benefit the modeling, forecasting, and optimal control of building energy systems [8]. Despite the promise of sizable energy demand reductions, scalable solutions remain elusive due to a lack of trust, poor interpretability, and the need for heavy, constant computation loops [9].
Within data-driven control, Reinforcement Learning (RL) has shown great promise, though its extreme sample inefficiency makes learning directly on physical equipment impractical. Early model-free deep RL deployments in commercial facilities and data centers [10] and follow-up studies [11] demonstrated cooling-energy reductions, but they relied heavily on expert supervision and strictly limited decision spaces. Similarly, offline training via detailed fluid dynamics emulators [12] reached convergence but addressed only a single airflow lever. To alleviate the risks of online exploration, several groups shifted the training burden to simulators. Zhong et al. trained an RL policy using an OpenAI Gym interface wrapped around EnergyPlus [13]. However, relying strictly on EnergyPlus is computationally expensive and abstracts away fast control dynamics, resulting in policies that risk severe overfitting and transfer poorly when confronted with real controller logic. Recognizing the need for physics constraints, Drgoňa et al. [14] proposed a hybrid framework, though it deliberately excluded dynamic interactions with HVAC equipment.
To mitigate extreme training times, Imitation Learning (IL) has emerged as a powerful technique to bootstrap the learning process by mimicking expert demonstrations [15]. Among IL techniques, Behavioral Cloning (BC) maps states directly to actions based on expert trajectories [16]. While this approach stabilizes initial policy optimization and provides a “warm start” for RL fine-tuning, cutting training time significantly, its widespread adoption remains low [15]. Importantly, these approaches still rely heavily on white box models based on EnergyPlus and/or Modelica, which are computationally expensive and very hard to produce. Silvestri et al. [8] successfully merged IL and RL with an important real-world validation, achieving sizable energy savings. However, despite using a model-free implementation, they still rely on a Modelica-based digital twin for validation. Furthermore, their controller has a very narrow scope, managing only one room and one main HVAC valve, thereby ignoring total building energy consumption. A critical barrier is the fragmentation of the development ecosystem, which often requires complex “glue code” to bridge simulators with machine learning libraries. Addressing the need for self-contained frameworks, Goldfeder et al. developed an all-Python simulation suite using gray box models based on energy equations [17]. However, the suite lacks validation of the real-world effectiveness of the resulting RL policies and does not test their closed-loop efficacy when transferred to high-fidelity building emulators.
To the best of the authors’ knowledge, no publications in the literature demonstrate a holistic, fully data-driven supervisory controller capable of coordinating envelope thermodynamics with air-side and water-side HVAC equipment using a transparent, self-contained Python framework. Existing studies typically (i) optimize a single chiller or airflow set-point [8,10,11,12], (ii) validate only in simulation without hardware-in-the-loop testing [13,17], or (iii) decouple envelope from plant optimization [14].
These factors highlight an open research gap and the clear opportunity to develop a modeling and RL control development suite entirely in Python (version 3.12). This work addresses these gaps by developing a unified pipeline using Twin4Build that produces gray box, physics-based building models calibrated directly from data. Embedded within a Gym environment, the framework utilizes Imitation Learning to efficiently produce RL control policies from surrogate models. We then analyze the behavior of these policies when transferred to a high-fidelity emulator, mimicking deployment to a real building, and conduct further online fine-tuning to evaluate performance gains across large action spaces and total building energy consumption.
The novelty of this study lies in the development of a unified and fully Python-based framework that combines digital twin modeling with Imitation Learning and Deep Reinforcement Learning to support holistic building energy control. Unlike many previous studies that focus on optimizing a single component or building section, or rely on computationally intensive white box simulators, the proposed approach uses gray box surrogate models generated through the Twin4Build platform to efficiently train supervisory control policies that coordinate both building envelope dynamics and HVAC system operation. An additional contribution is the structured transfer of trained policies from the surrogate environment to a high-fidelity BOPTEST emulator, enabling realistic closed-loop evaluation before potential real-world deployment. By bringing these elements together in a single pipeline, the study demonstrates a practical and scalable pathway for developing data-driven control strategies capable of improving whole-building energy performance while significantly reducing training complexity.
2. Materials and Methods
The proposed methodology for developing a deep reinforcement learning (DRL)-based supervisory controller for building HVAC systems is depicted in Figure 1. These controllers act at the supervisory level, monitoring the global building state to generate near-optimal setpoints for the HVAC systems, rather than replacing low-level internal control loops like PID controllers.
To address the fragmentation and computational bottlenecks identified previously, this pipeline is unified within a Python ecosystem and divided into three primary stages: creating a Twin4Build [18] building model (Figure 2), pre-training a control policy via Imitation Learning (Figure 3), and performing RL fine-tuning (Figure 4).
In the first stage, a gray box building model is created using Twin4Build. Leveraging semantic ontologies, the framework constructs graph-based representations of building systems that seamlessly support both simulation and parameter estimation. It provides a library of modular, physics-based elements and gray box component models, allowing users to tailor model fidelity and complexity. Crucially, these surrogate models perform simulations an order of magnitude faster than conventional high-fidelity white box engines. As new component models are contributed, the framework supports varying levels of modeling accuracy. For a comprehensive overview of constructing a Twin4Build model, the reader is referred to the framework’s main repository (https://github.com/JBjoernskov/Twin4Build (accessed on 10 March 2026)) [18,19].
In the second stage (Figure 3), the objective is to establish a stable, baseline control policy capable of satisfying standard building setpoints by imitating existing controllers. Drawing on proven practices from robotics and addressing the sample inefficiency of pure RL, this study constructs a dataset of expert trajectories to pre-train the policy via supervised learning (Behavioral Cloning). This provides a robust “warm start,” significantly reducing the training time required for the subsequent RL fine-tuning phase. The following subsections describe this progression in detail.
2.1. Data Acquisition
Using the Twin4Build model described previously, we construct a dataset of expert control trajectories across various episodic scenarios. These trajectories capture the sequential interaction between the building system’s state and the baseline controller’s inputs over time. To make this data suitable for Imitation Learning, the continuous trajectories are broken down into individual transition vectors. Each transition is stored as a discrete tuple containing the current observation, the applied control action, the subsequent observation, and an episode termination flag. Aggregating these step-by-step transitions yields a comprehensive supervised dataset that demonstrates the expert’s behavior, providing the necessary examples for the neural network to learn the direct mapping from states to actions. The formal mathematical definitions of these trajectories and system dynamics are provided in Appendix A.
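As a concrete illustration, the decomposition of an episodic trajectory into transition tuples might look as follows (a minimal sketch; the `Transition` container and `trajectory_to_transitions` helper are hypothetical names, not part of Twin4Build):

```python
from dataclasses import dataclass


@dataclass
class Transition:
    obs: list        # observation before the action
    action: list     # expert (baseline controller) action
    next_obs: list   # observation after the action
    done: bool       # episode termination flag


def trajectory_to_transitions(observations, actions):
    """Split one episodic trajectory into per-step transition tuples.

    `observations` holds one more entry than `actions`, since the final
    observation closes the episode.
    """
    transitions = []
    for t, action in enumerate(actions):
        transitions.append(Transition(
            obs=observations[t],
            action=action,
            next_obs=observations[t + 1],
            done=(t == len(actions) - 1),
        ))
    return transitions
```

Aggregating the outputs of this helper over many episodes yields the supervised dataset used for Behavioral Cloning.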
2.2. Policy Pre-Training and Fine-Tuning
Behavioral Cloning (BC) frames imitation learning as a supervised task, providing a computationally efficient method to initialize a control policy, $\pi_\theta$. Using the transition dataset $\mathcal{D} = \{(o_i, a_i)\}$ acquired from the expert, the policy learns to directly map observations ($o_i$) to control inputs ($a_i$) by minimizing a loss function $\ell$ (e.g., mean squared error for continuous actions):

$$\theta^{*} = \arg\min_{\theta} \sum_{(o_i,\, a_i) \in \mathcal{D}} \ell\big(\pi_\theta(o_i),\, a_i\big)$$
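The supervised fit described above can be sketched with a toy linear policy trained by gradient descent on the mean squared error (illustrative only; the study uses a neural-network policy, and `behavioral_cloning_fit` is a hypothetical helper):

```python
import numpy as np


def behavioral_cloning_fit(obs, actions, lr=0.1, epochs=500):
    """Fit a linear policy a = W @ o + b by minimizing the mean squared
    error against expert actions (a toy stand-in for the neural-network
    policy described in the text)."""
    rng = np.random.default_rng(0)
    n_obs, n_act = obs.shape[1], actions.shape[1]
    W = rng.normal(scale=0.1, size=(n_act, n_obs))
    b = np.zeros(n_act)
    for _ in range(epochs):
        pred = obs @ W.T + b                  # policy output for each state
        err = pred - actions                  # residual against the expert
        W -= lr * (err.T @ obs) / len(obs)    # MSE gradient step for weights
        b -= lr * err.mean(axis=0)            # MSE gradient step for bias
    return W, b
```

After fitting, the policy reproduces the expert mapping on the demonstration data, providing the “warm start” for RL fine-tuning.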
While BC yields a stable baseline that acts autonomously, further performance gains are achieved through reinforcement learning fine-tuning within the Twin4Build environment (Figure 4). To optimize the policy while preventing it from catastrophically forgetting the expert behavior during early exploration, we employ Proximal Policy Optimization (PPO) augmented with a Kullback–Leibler (KL) divergence penalty. This constraint regulates the divergence between the actively exploring policy ($\pi_\theta$) and the pretrained baseline ($\pi_{\mathrm{BC}}$):

$$L(\theta) = L^{\mathrm{PPO}}(\theta) - \beta\, \mathbb{E}_{o_t}\!\left[ D_{\mathrm{KL}}\!\left(\pi_{\mathrm{BC}}(\cdot \mid o_t) \,\|\, \pi_\theta(\cdot \mid o_t)\right)\right]$$
The weighting coefficient $\beta$ is annealed over time, keeping the policy anchored to the expert initially but progressively granting it the freedom to discover highly efficient control strategies.
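A minimal sketch of the KL-penalized objective and its annealing schedule, assuming diagonal Gaussian action distributions (the `beta0` value and the linear schedule are illustrative choices, not parameters reported in this study):

```python
import numpy as np


def kl_diag_gaussians(mu_p, sigma_p, mu_q, sigma_q):
    """KL(p || q) between diagonal Gaussian action distributions."""
    return np.sum(
        np.log(sigma_q / sigma_p)
        + (sigma_p**2 + (mu_p - mu_q)**2) / (2.0 * sigma_q**2)
        - 0.5
    )


def penalized_objective(ppo_surrogate, mu, sigma, mu_bc, sigma_bc,
                        step, total_steps, beta0=1.0):
    """PPO surrogate objective minus an annealed KL anchor to the
    behavior-cloned policy. The penalty weight decays linearly, so the
    policy is tied to the expert early on and free to explore later."""
    beta = beta0 * (1.0 - step / total_steps)  # linear annealing schedule
    return ppo_surrogate - beta * kl_diag_gaussians(mu_bc, sigma_bc, mu, sigma)
```

As training progresses, the penalty term vanishes and the objective reduces to the plain PPO surrogate.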
Finally, deploying this simulation-trained policy to a real building or a high-fidelity emulator inevitably introduces a performance drop known as the sim2real gap [20]. To mitigate this discrepancy, the framework utilizes further online fine-tuning within the target environment, allowing the policy to adapt to unmodeled, real-world thermodynamic complexities.
2.3. BOPTEST Multi-Zone Office Test Case and Twin4Build Surrogate
To benchmark the proposed control strategy, this study utilizes the Building Optimization Testing Framework (BOPTEST) [21]. BOPTEST provides standardized, dynamic building models (exported as Functional Mock-up Units) that interface seamlessly with reinforcement learning workflows via the BOPTEST-Gym environment [22] and standard libraries like Stable Baselines3 [23].
We selected the multi-zone office air system emulator [24], a five-zone, single-duct variable air volume (VAV) building equipped with reheat capabilities, a central air handling unit (AHU), a variable-efficiency heat pump, and a chiller. All heating and cooling demands are satisfied through this central ventilation system.
To facilitate the rapid simulation required for RL training, we developed a gray box surrogate model of this test case using Twin4Build. The surrogate is structured into three primary subsystems: the AHU model, the building envelope (using thermal RC dynamics), and the VAV system models. The parameters for the dampers, heating coils, and RC room models were calibrated using time series data extracted from the original high-fidelity BOPTEST model. A key computational simplification in the surrogate is the treatment of the AHU fan as a passive component whose power consumption is estimated from the aggregated airflow demands inferred from individual room damper positions.
Detailed schematic diagrams of the BOPTEST building layout, the HVAC configuration, and the specific Twin4Build component topologies are provided in Appendix B.
2.4. Control Strategy
Both the BOPTEST [22] and Twin4Build models were implemented as Gym-compatible environments [25] to support reinforcement learning (RL)-based supervisory control using a Proximal Policy Optimization (PPO) agent [26]. These environments share a common observation and action space, facilitating direct policy transfer between the high-fidelity emulator and the surrogate model. Key performance indicators (KPIs), including energy consumption and setpoint deviations, are used to evaluate controller performance across both environments. The following subsections describe the KPIs and the observation and action spaces for this use case.
2.4.1. Key Performance Indicators (KPIs)
To evaluate the performance of the control policies, this study adopts the standardized Key Performance Indicators (KPIs) defined by the BOPTEST framework. The primary objectives are:
Total Energy Use (ener_tot): Measures the site HVAC energy consumption (in kWh/m²), accounting for all energy vectors (heating, cooling, fans, pumps) normalized by the total floor area.
Thermal Discomfort (tdis_tot): Quantifies the integral of temperature deviations outside a predefined comfort range over time, averaged across all zones (in K · h/zone).
Indoor Air Quality Violation (idis_tot): Measures the integral of CO2 concentration deviations above a predefined safety threshold, averaged across all zones (in ppm · h/zone).
The exact mathematical formulations for these KPIs are detailed in Appendix C.
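For instance, the thermal discomfort KPI can be approximated by a discrete-time integral of out-of-band temperature deviations averaged over zones (a sketch consistent with the tdis_tot description above, not the exact BOPTEST formulation; `thermal_discomfort` is a hypothetical helper):

```python
def thermal_discomfort(temps, lower, upper, dt_hours, n_zones):
    """Integrate temperature deviations outside the comfort band,
    returning K*h averaged over zones. `temps` is a list of per-step
    lists of zone temperatures (in deg C or K, consistently)."""
    total = 0.0
    for step_temps in temps:
        for T in step_temps:
            if T < lower:
                total += (lower - T) * dt_hours   # under-heating deviation
            elif T > upper:
                total += (T - upper) * dt_hours   # over-heating deviation
    return total / n_zones
```

The energy and IAQ KPIs follow the same pattern, integrating power and above-threshold CO2 concentration respectively.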
2.4.2. Observation and Action Spaces
The RL agent interacts with the Twin4Build surrogate and the BOPTEST emulator through a strictly defined interface. All inputs are min–max-scaled to the interval [0, 1] before being passed to the Stable Baselines3 PPO policy.
Observation Space: The state vector supplied to the agent comprises three components:
Sensor observations: Real-time physical measurements read directly from the simulator at each step (e.g., zone temperatures, damper positions).
Time embeddings: Temporal variables (hour of day, day of week, month) encoded as continuous circular features (using sine and cosine transformations) to preserve periodicity at boundaries.
Forecasts: A rolling horizon of upcoming samples for weather, occupancy, and boundary set-points, refreshed at every simulation step.
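The min–max scaling and circular time encoding described above can be sketched as follows (function names are illustrative):

```python
import math


def min_max_scale(x, lo, hi):
    """Min-max scale a raw signal into [0, 1] before it reaches the policy."""
    return (x - lo) / (hi - lo)


def encode_hour(hour):
    """Encode hour-of-day as a point on the unit circle, so that 23:00
    and 00:00 are adjacent in feature space despite the numeric wrap."""
    angle = 2.0 * math.pi * hour / 24.0
    return math.sin(angle), math.cos(angle)
```

Day-of-week and month are encoded the same way, with periods of 7 and 12 respectively.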
Action Space: The controller acts at a supervisory level. It outputs continuous control signals mapped to the AHU supply-air temperature set-point, as well as the local heating/cooling set-points and direct supply damper overrides for each of the five zones.
The exhaustive lookup tables defining the signal keys, ranges, and descriptions for both the observation and action spaces are provided in Appendix D.
2.4.3. Reward Function
In order to guide the control policy towards reducing both energy use and discomfort, the reward function is defined as a weighted sum of the three key performance indicators (KPIs) introduced above: total energy use ($\mathit{ener\_tot}$), thermal discomfort ($\mathit{tdis\_tot}$), and indoor air quality violation ($\mathit{idis\_tot}$). Each KPI is scaled by a corresponding weighting factor $w_{\mathrm{ener}}$, $w_{\mathrm{tdis}}$, and $w_{\mathrm{idis}}$ to balance their relative importance. To ensure that the agent is penalized for high values of energy consumption or discomfort, the reward is taken as the negative of this weighted sum:

$$r = -\left(w_{\mathrm{ener}}\,\mathit{ener\_tot} + w_{\mathrm{tdis}}\,\mathit{tdis\_tot} + w_{\mathrm{idis}}\,\mathit{idis\_tot}\right)$$
With this definition, the learning agent maximizes long-term reward by minimizing energy consumption and maintaining thermal comfort and indoor air quality within acceptable limits. The weighting factors can be tuned to reflect application-specific trade-offs, such as prioritizing occupant comfort over energy savings or vice versa.
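A minimal sketch of this reward, with placeholder weights rather than the values used in the study:

```python
def reward(ener_tot, tdis_tot, idis_tot, w_ener=1.0, w_tdis=1.0, w_idis=1.0):
    """Negative weighted sum of the three KPIs. Default weights are
    illustrative placeholders; tuning them shifts the trade-off between
    energy savings, thermal comfort, and indoor air quality."""
    return -(w_ener * ener_tot + w_tdis * tdis_tot + w_idis * idis_tot)
```

Raising `w_tdis`, for example, would make comfort violations more costly relative to energy use.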
4. Discussion
The results of this study highlight the viability of a fully Python-based, data-driven pipeline for HVAC supervisory control. By bridging Imitation Learning (IL), Reinforcement Learning (RL), and gray-box surrogate modeling, this approach effectively navigates the bottlenecks of sample inefficiency and computational overhead typical of building control research. As demonstrated in the model validation phase, the Twin4Build surrogate model exhibits varying levels of predictive precision compared to the high-fidelity BOPTEST reference. However, this reduced accuracy is the price paid for faster simulation and lower computational overhead. The surrogate model still captures the core thermodynamic behavior and temporal inertia of the building, providing a fidelity that is entirely sufficient for the RL agent to internalize fundamental energy-saving strategies such as thermal load-shifting and peak shaving.
The primary advantage of embracing this lower-accuracy surrogate lies in the gains in training velocity. As shown in the training speed comparison, the surrogate environment operates over an order of magnitude faster than the high-fidelity emulator—completing 300,000 steps in 1.13 h compared to 12.64 h. The gains in speed allow not only for faster deployment but also for rapid iteration and broader experimentation that would be computationally prohibitive using traditional white box simulators.
Furthermore, these computational gains facilitate an effective sim-to-real transfer strategy. Transitioning an RL policy directly from simulation to a physical building often risks erratic or unsafe behavior. However, if the pre-training is done in the surrogate environment using Behavioral Cloning while actively monitoring the physical plausibility of the RL actions, the resulting models serve as a near-optimal “warm start” for RL policies. When deployed to real buildings, or to high-fidelity emulators acting as proxies, these pre-trained policies operate safely from day one. They can then continue to adapt and improve over time through a restricted version of online RL fine-tuning, successfully navigating the sim2real gap without catastrophic exploration.
The value of this methodology extends beyond immediate operational improvements. Combining Imitation Learning, reinforcement learning, and fine-tuning within a unified Python-based approach yields not only substantial energy savings but also a calibrated surrogate digital twin of the building. This lightweight, transparent model can run in parallel with real-time operations, providing a foundational asset for advanced building services such as anomaly detection and space management [18]. Ultimately, this dual output, a robust control policy and a highly functional digital twin, demonstrates the comprehensive utility and scalability of the proposed framework for modern smart buildings.
Several limitations must be acknowledged. First, the substantial energy reductions achieved by the RL agent incur some violations of the indoor comfort set-points, inherently trading minor increases in thermal discomfort and indoor air quality (IAQ) violations for peak-load shaving. While CO2 levels remained safely within acceptable limits, prioritizing energy efficiency over rigid comfort adherence may require careful parameter tuning depending on specific building occupancy requirements. Second, while the gray box surrogate model successfully captures overarching thermal inertia, it struggles with highly dynamic or non-linear signals, such as instantaneous supply fan power and abrupt airflow step changes. Additionally, the current evaluation was restricted to a five-zone office emulator operating strictly under winter heating scenarios. Validating this framework’s scalability to larger, more topologically complex commercial facilities, as well as its performance during cooling-dominated summer months, remains a necessary next step. Lastly, although the methodology provides a safe “warm start”, overcoming the sim-to-real gap still necessitates a period of online fine-tuning within the target environment, which inherently demands active operational time in a real-world deployment.
5. Conclusions
This work presents a fully data-driven modeling and supervisory control framework developed entirely within a Python ecosystem. By utilizing Twin4Build to generate gray box surrogate models, the methodology bridges rapid simulation and practical reinforcement learning (RL) deployment. The approach leverages imitation learning to establish a safe baseline control behavior, followed by proximal policy optimization (PPO) fine-tuning to coordinate multi-zone HVAC operations. A primary advantage of this method is the substantial reduction in computational overhead, achieving training speeds more than an order of magnitude faster than high-fidelity simulators. Consequently, the developed controllers effectively mitigated power spikes during heating start-ups, yielding energy savings of over 24% upon initial deployment and nearly 29% following online fine-tuning. These efficiency gains require a tradeoff; the RL agent learns to prioritize energy reduction by relaxing strict setpoint tracking, exchanging minor thermal comfort and indoor air quality margins for peak-load shaving.
To bridge the gap toward broader commercial adoption, future research must address several key operational constraints. It is necessary to evaluate the statistical reliability of these controllers across diverse stochastic conditions and to implement hard safety constraints, such as fallback rule-based overrides, prior to direct deployment in occupied buildings. Furthermore, subsequent work should analyze the scalability and convergence stability of these algorithms when applied to the highly complex action spaces of larger commercial facilities. Finally, assessing the zero-shot or few-shot transferability of the learned policies to buildings with distinct geometric layouts, physical parameters, and climate zones remains critical for validating the generalizability of the framework.