1. Introduction
The increasing demand for energy-efficient and resilient buildings has elevated the importance of intelligent Building Energy Systems (BES). Contemporary BES must manage energy consumption while balancing user comfort, grid interaction, and dynamic external conditions such as weather and electricity prices. However, the inherent complexity of building thermal dynamics, coupled with diverse and often conflicting operational objectives, makes traditional rule-based or static control strategies insufficient [1,2]. Reinforcement learning (RL) has emerged as a promising alternative for adaptive control in BES. By learning policies from environmental interactions, RL enables systems to autonomously optimize decisions in response to changing conditions without requiring an explicit building model [3]. In particular, deep reinforcement learning (DRL) methods have shown potential in managing HVAC systems and distributed energy resources, thanks to their scalability and function approximation capabilities [4,5].
To address the complexity of dynamic energy environments, recent research has increasingly focused on integrating hierarchical and multi-agent reinforcement learning (MARL) into energy management systems. Hierarchical frameworks have demonstrated the ability to decompose long-horizon control problems into more tractable sub-problems. For instance, Cui et al. [6] proposed a hierarchical reinforcement learning method for dispatching a hybrid compressed air and battery energy storage system, achieving flexibility across multiple operation modes. Similarly, Zhou et al. [7] employed enhanced hierarchical RL to optimize HVAC operations, showing improved energy efficiency and comfort constraint handling. In multi-agent settings, Jendoubi and Bouffard [8] developed a multi-agent hierarchical reinforcement learning architecture for distributed energy management, where decentralized agents collaborate while maintaining a layered control strategy. Yao et al. [9] introduced a hierarchical MARL framework with adjustable agent hierarchies for home energy systems, showing the potential of agent collaboration in maintaining energy balance and user comfort. Zhang et al. [10] extended these ideas to community-scale coordination, enabling multiple buildings to manage grid-aware energy loads using deep MARL. In parallel, RL-based control for building HVAC systems continues to evolve. Yang et al. [11] demonstrated that reinforcement learning with built-in comfort objectives can dynamically adapt HVAC operations in response to occupant needs and grid constraints. Other works have extended RL to broader intelligent environments, such as Azizi et al. [12], who integrated multi-agent RL for resource management in vehicular communication systems, indicating the cross-domain robustness of these methods.
Recent advances in reinforcement learning (RL) for building energy management have underscored the importance of incorporating safety, robustness, and constraint-awareness into control strategies. Traditional RL methods, while powerful, often fail to guarantee adherence to critical operational constraints such as thermal comfort, equipment limitations, and energy cost ceilings. To address these concerns, researchers have explored a range of constraint-aware enhancements. For instance, Jiang et al. [13] and Alotaibi [14] emphasized integrating real-time feedback on thermal comfort and energy pricing to drive context-aware and user-centric policies. Similarly, Kumaresan and Jeyaraj [15] proposed a reward shaping mechanism to embed demand-side management constraints directly into the learning process, improving transferability and policy stability across different building contexts. The fusion of RL with model-based approaches, such as Reinforced Model Predictive Control (RL-MPC), has also shown promise in navigating hard constraints while retaining the adaptability of learning-based control. Arroyo et al. [16] demonstrated that RL-MPC could outperform both standalone MPC and flat RL in multi-zone HVAC coordination. Robustness under environmental uncertainty has further been a key challenge, particularly in non-stationary climates or when forecasts are unreliable. Naug et al. [17] addressed this by adapting RL to non-stationary settings using online learning and model adaptation. Meanwhile, Oh [18] proposed a reinforcement-learning-based virtual energy storage operation strategy to manage wind power uncertainties, suggesting a potential synergy between storage models like VESS and RL under uncertainty. Beyond algorithmic enhancements, practical deployment remains a core focus. Silvestri et al. [19] introduced imitation learning to accelerate RL training while ensuring safety during exploration, offering a viable path for real-world implementation. Finally, Woo and Junghans [20] explored MPC-based control for thermo-active building systems (TABS), with a focus on surface condensation prevention, showing how safety constraints can be embedded into hybrid predictive frameworks. These efforts collectively highlight a trend toward hybrid and constraint-aware RL systems that can reliably operate under real-world uncertainties, an essential motivation for this work's integration of hierarchical control to enforce safety and flexibility in building energy systems.
The concept of Virtual Energy Storage Systems (VESS) has gained considerable traction as a viable alternative or supplement to physical batteries in building energy management, especially in light of cost, lifecycle, and sustainability considerations. VESS leverages the thermal inertia of buildings, that is, the natural ability of materials and structures to store and release heat over time, as a flexible and low-cost energy buffer. As highlighted by Fambri et al. [21], this approach provides a comparable level of flexibility to traditional electric batteries within renewable energy communities, while offering significant economic and environmental advantages. Similarly, Jin and Liu [22] demonstrated that integrating VESS into grid-connected distributed energy systems, especially through coupling with air conditioning systems, can enhance peak shaving performance and grid responsiveness. In building microgrid contexts, VESS enables the coordination of demand and supply without requiring physical storage units. Lv et al. [23] explored this by optimizing microgrid operations using thermal-based VESS models, showing improvements in operational efficiency and renewable integration. At a broader system level, Jani et al. [24] incorporated both physical and virtual storage in a multi-objective optimization framework for multi-microgrid systems, suggesting that VESS can effectively complement existing storage assets over different time scales. More recently, Mu et al. [25] proposed a data-driven rolling optimization approach to control BES equipped with VESS, emphasizing real-time adaptability to dynamic load and weather conditions. Their method significantly enhanced system stability while reducing operational costs. Moreover, Alhammad et al. [26] emphasized the role of digital technologies like Building Information Modeling (BIM) and Building Energy Modeling (BEM) in enabling accurate VESS characterization, simulation, and control. This integration is essential for scalable deployment, as accurate thermal models underpin the effectiveness of VESS-based strategies. Collectively, these studies establish VESS as a critical enabler for flexible, sustainable, and cost-effective building energy management, aligning directly with the proposed framework in this paper, which utilizes VESS to model latent thermal flexibility as a form of virtual storage.
In the domain of building and energy system optimization, the integration of policy-aware reinforcement learning has emerged as a crucial advancement, enabling intelligent control systems to not only learn effective strategies but also comply with operational, safety, and regulatory constraints. Chen et al. [27] proposed a differentiable projection method to enforce policy feasibility in energy optimization tasks, ensuring that learned policies remain within allowable action boundaries, an essential capability in constrained environments like smart buildings and energy grids. Reinforcement learning's adaptability has also been demonstrated in broader energy systems, such as microgrids and transport, where Delavari and Naderian [28] developed a robust RL-based control framework for hybrid energy storage systems, and Jung [29] applied RL to optimize storage scheduling in urban railway systems, showing versatility in energy-aware policy learning. Several studies have focused on real-time control under uncertainty, where policy-based learning outperforms static optimization. Kolodziejczyk et al. [30] showcased the use of deep reinforcement learning to optimize real-time energy purchases in PV-storage systems, achieving enhanced economic performance through dynamic adaptation. Meanwhile, safety-critical applications have benefited from chance-constrained RL approaches, as illustrated by Mowbray et al. [31], who embedded probabilistic safety constraints into reinforcement learning to control chemical batch processes, a technique readily applicable to building systems where thermal comfort and equipment safety must be preserved. Furthermore, Zhang et al. [32] introduced a hybrid model-free DRL strategy for energy-efficient HVAC operation, balancing exploration and exploitation while satisfying comfort policies. These advancements underscore the growing importance of policy-constrained reinforcement learning in intelligent energy control applications. They affirm the relevance of embedding safe, adaptive, and constraint-aware policies within reinforcement learning frameworks, such as the one proposed in this study, where constraint-aware deep reinforcement learning is employed to manage building energy systems with virtual energy storage under real-time operational limits.
Traditional centralized or flat deep reinforcement learning (DRL) methods often fail to effectively handle the multi-timescale dynamics, large state–action spaces, and strict constraint satisfaction required for real-world energy systems (Korkas et al. [33]). These limitations are particularly evident in building energy management, where diverse objectives, such as energy cost reduction, thermal comfort, and responsiveness to external signals, must be balanced in real time. For instance, Mason and Grijalva [3] highlight that flat DRL architectures struggle with scalability and convergence in high-dimensional environments. Similarly, Naug et al. [17] demonstrate that DRL performance deteriorates in non-stationary energy environments without structured learning. Korkas et al. [34] further emphasize the need for hierarchical control frameworks to manage heterogeneous occupancy patterns and thermal zones effectively. These insights motivate the adoption of hierarchical and constraint-aware reinforcement learning techniques, such as the method proposed in this study, which leverages thermal inertia modeling and policy structure to overcome the aforementioned limitations.
Despite significant advancements in reinforcement learning and virtual energy storage modeling, several critical limitations remain unresolved in current Building Energy System (BES) control approaches. Most existing deep reinforcement learning (DRL) methods are either flat or centralized in structure, making them poorly suited for environments characterized by hierarchical decision-making, temporal abstraction, and system-wide constraints. These models often struggle to coordinate decisions across multiple timescales, fail to incorporate prior knowledge of building thermal dynamics, and suffer from instability or infeasibility under strict comfort and safety constraints. Moreover, traditional policy learning algorithms, such as Deep Q-Networks [35] and Deep Deterministic Policy Gradient [36], often overlook the potential of structured policy design and lack mechanisms to explicitly enforce constraint satisfaction during training and execution.
To address the limitations of existing building energy control approaches, this paper proposes a novel control framework that integrates constraint-aware deep reinforcement learning with virtual energy storage modeling. The framework is structured hierarchically, separating long-term strategic decisions, such as VESS charge/discharge planning and HVAC mode scheduling, from short-term operational control such as real-time temperature regulation and load balancing. At its core is the Dynamic Constraint-Aware Policy Optimization (DCPO) algorithm, an actor–critic method enhanced with policy constraints, entropy regularization, and adaptive clipping to ensure feasible, stable learning under energy and comfort constraints.
This work makes two key contributions. Theoretically, it advances the reinforcement learning domain by modeling thermal inertia as a Virtual Energy Storage System (VESS) and embedding it into a hierarchical control architecture that supports safe, adaptive, and interpretable decision-making. Practically, it demonstrates how the proposed DCPO-based control strategy can enable real-time, energy-efficient building operations, achieving over 30% cost savings and significant improvements in comfort compliance under realistic weather, occupancy, and pricing scenarios. The approach is designed for deployment in smart grids, net-zero buildings, and responsive energy systems, bridging the gap between academic RL research and real-world building energy management.
2. Building Energy System Modeling
Figure 1 illustrates the architecture of a Building Energy System (BES) enhanced with a Virtual Energy Storage System (VESS) and integrated Electric Vehicle (EV) subsystems [35,36,37,38]. The overall system comprises photovoltaic (PV) generation, inverter air conditioners (IACs), electric vehicles with bidirectional chargers, other electrical loads, smart meters, edge sensors, a weather data interface, and a centralized intelligent control center. These components work cohesively to ensure energy efficiency, comfort, and flexibility under varying operating conditions. In daily operation, electricity for the building is primarily supplied by PV panels and supplemented by the external distribution network (DN). The IAC regulates indoor temperature to maintain occupant comfort, while various other devices operate as passive loads. The EV system introduces a dynamic load element that also serves as a potential energy storage unit via vehicle-to-building (V2B) or vehicle-to-grid (V2G) capabilities. Smart meters retrieve real-time electricity pricing data from the DN and send it to the control center, while edge sensors continuously monitor environmental parameters and device states. Meteorological stations provide forecasts for outdoor temperature, solar radiation, and other relevant inputs.
The intelligent control center forms the decision-making core of the BES. It integrates data from all subsystems and applies the proposed Dynamic Constraint-aware Policy Optimization (DCPO) algorithm to optimize control decisions, including HVAC dispatch, PV utilization, EV charging/discharging, and power purchases from the grid. By combining the VESS model with EV behavior, the system achieves dynamic optimization across the thermal, electrical, and storage domains.
The VESS model captures the building envelope's thermal inertia as a form of virtual energy storage. It includes three key parameters: virtual power, representing deviation from baseline heating needs; virtual capacity, quantifying the maximum thermal energy that can be stored or released; and virtual state of charge (SOC), indicating the current level of "stored" thermal energy. This model allows the BES to shift heating loads in time without physical batteries, effectively enhancing energy flexibility.
The Electric Vehicle subsystem adds another layer of flexibility. EVs act as mobile, controllable storage units. When parked and connected to the BES, they can be charged during low electricity price periods and, if allowed by user constraints, discharge power back to the building during peak demand periods. The control center uses predicted vehicle arrival/departure times, user preferences, and battery state to determine optimal charging/discharging strategies, coordinated with the thermal management strategies via the hierarchical control structure.
By combining VESS-based thermal storage with EV electrical storage, the proposed BES architecture significantly enhances energy arbitrage opportunities, load-shifting capabilities, and grid interaction strategies. The integration ensures that both comfort and economic objectives are achieved while increasing the system's resilience and sustainability.
Inspired by the performance characteristics of physical battery systems, the VESS model introduces three core time-varying parameters. Virtual power, $P_{\mathrm{VESS}}(t)$, reflects the power variation in the VESS before and after participating in BES optimization control; it serves as a key indicator of the VESS's dynamic response capability to control signals. Virtual capacity, $E_{\mathrm{VESS}}(t)$, represents the maximum energy storage potential that the VESS can offer during the optimization process, which is critical for evaluating its contribution to system flexibility. Virtual state of charge, $SOC_{\mathrm{VESS}}(t)$, describes the real-time energy status of the VESS, analogous to the SOC of a conventional battery.
These parameters are interrelated by the following expression:

where $t$ is the current control time step, $\Delta t$ is the control time interval, $SOC_{\mathrm{VESS}}(t)$ is the virtual state of charge at time $t$, $E_{\mathrm{VESS}}(t)$ and $E_{\mathrm{VESS}}(t+1)$ are the virtual capacities at time steps $t$ and $t+1$ (in kWh), and $P_{\mathrm{VESS}}(t)$ and $P_{\mathrm{VESS}}(t+1)$ are the virtual powers at time steps $t$ and $t+1$ (in kW).
This formulation enables the BES control system to continuously assess and leverage the thermal storage flexibility of building envelopes as a virtual energy buffer, supporting dynamic optimization under fluctuating conditions.
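To make the bookkeeping concrete, the following Python sketch tracks the three VESS parameters over one control interval. The update rule shown (energy-proportional SOC scaling clipped to [0, 1]) is an illustrative assumption rather than the paper's exact expression, and the names `VessState` and `step_vess` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class VessState:
    virtual_power_kw: float      # P_VESS(t): deviation from baseline IAC power
    virtual_capacity_kwh: float  # E_VESS(t): max thermal energy storable in the envelope
    virtual_soc: float           # SOC_VESS(t): fraction of capacity currently "stored"

def step_vess(state: VessState, next_capacity_kwh: float,
              next_power_kw: float, dt_h: float = 1.0) -> VessState:
    """Advance the VESS bookkeeping by one control interval (assumed update rule)."""
    # Energy currently stored in the envelope, plus the energy exchanged this step.
    stored_kwh = state.virtual_soc * state.virtual_capacity_kwh
    stored_kwh += state.virtual_power_kw * dt_h  # charging if P_VESS > 0
    # Re-normalize against the next interval's capacity and keep the SOC feasible.
    soc = min(max(stored_kwh / max(next_capacity_kwh, 1e-6), 0.0), 1.0)
    return VessState(next_power_kw, next_capacity_kwh, soc)

# Example: one hour of charging at 1.5 kW with a 6 kWh envelope capacity.
s0 = VessState(virtual_power_kw=1.5, virtual_capacity_kwh=6.0, virtual_soc=0.3)
s1 = step_vess(s0, next_capacity_kwh=6.0, next_power_kw=0.0)
print(round(s1.virtual_soc, 3))  # 0.55
```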
To characterize the dynamic energy flexibility of a Virtual Energy Storage System (VESS), we first define the concept of virtual power. This is derived from the difference between actual air-conditioning power consumption and a theoretical baseline heating power required to maintain indoor comfort under standard conditions.
The baseline heating power, denoted as $Q_{\mathrm{base}}(t)$, is calculated by considering two critical environmental factors: outdoor temperature and solar radiation. Outdoor temperature influences the heat transfer intensity between the indoor and outdoor environments, while solar radiation contributes to heat gain through the building envelope. The calculation of $Q_{\mathrm{base}}(t)$ is given by:

$$Q_{\mathrm{base}}(t) = \sum_{i} K_i F_i \left(T_{\mathrm{in}}(t) - T_{\mathrm{out}}(t)\right) - Q_{\mathrm{sol}}(t)$$

where $i$ indexes the building envelope elements (walls, windows, roofs, and doors), $F_i$ is the inner surface area of the $i$-th component (m²), $K_i$ is the internal surface heat transfer coefficient (W/m²·°C), $T_{\mathrm{in}}(t)$ and $T_{\mathrm{out}}(t)$ are the indoor and outdoor temperatures at time $t$, respectively, and $Q_{\mathrm{sol}}(t)$ is the solar heat gain at time $t$ (kW).
Since the primary electricity consumption within the heating system is attributed to the inverter air conditioner (IAC), the auxiliary loads of the system (e.g., fans, controls) are neglected for modeling simplicity. The IAC's electrical power consumption $P_{\mathrm{IAC}}(t)$ is assumed to be linearly related to its thermal output $Q_{\mathrm{IAC}}(t)$, represented as:

$$P_{\mathrm{IAC}}(t) = a\,Q_{\mathrm{IAC}}(t) + b$$

where $a$ and $b$ are system-specific calibration constants derived from empirical data.
By substituting Equation (2) into Equation (3), the baseline electrical power consumption $P_{\mathrm{base}}(t)$ associated with the theoretical heating load needed to maintain indoor comfort can be calculated as a function of the environmental parameters and system coefficients:

$$P_{\mathrm{base}}(t) = a\,Q_{\mathrm{base}}(t) + b = a\left[\sum_{i} K_i F_i \left(T_{\mathrm{in}}(t) - T_{\mathrm{out}}(t)\right) - Q_{\mathrm{sol}}(t)\right] + b$$
Finally, the virtual power $P_{\mathrm{VESS}}(t)$ of the VESS is defined as the power deviation between the actual and baseline IAC energy consumption. This deviation reflects whether the system is effectively storing or releasing thermal energy:

$$P_{\mathrm{VESS}}(t) = P_{\mathrm{IAC}}(t) - P_{\mathrm{base}}(t)$$

A positive virtual power indicates that the VESS is in a charging state, storing excess thermal energy by increasing the indoor temperature beyond the baseline requirement. Conversely, a negative virtual power implies a discharging process, where previously stored heat is utilized to meet the thermal demand with reduced active power input. This modeling framework enables BES controllers to evaluate and manage building thermal inertia as a virtual energy buffer, providing flexibility for load shifting and energy cost optimization.
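The calculation chain above (envelope heat balance, linear IAC model, actual-minus-baseline deviation) can be sketched in a few lines of Python. The helper names and the coefficient values are illustrative assumptions; only the structure follows the text.

```python
def baseline_heating_power(areas_m2, k_coeffs, t_in, t_out, q_sol_kw):
    """Baseline thermal power Q_base(t) in kW from the envelope heat balance."""
    # Envelope transmission losses (W converted to kW), reduced by solar heat gain.
    transmission_kw = sum(k * f for k, f in zip(k_coeffs, areas_m2)) * (t_in - t_out) / 1000.0
    return transmission_kw - q_sol_kw

def iac_electric_power(q_thermal_kw, a=0.31, b=0.12):
    """Linear IAC model: electrical input as a function of thermal output (assumed a, b)."""
    return a * q_thermal_kw + b

def virtual_power(p_iac_actual_kw, q_base_kw, a=0.31, b=0.12):
    """P_VESS(t): actual IAC draw minus the baseline draw needed for comfort."""
    p_base_kw = iac_electric_power(q_base_kw, a, b)
    return p_iac_actual_kw - p_base_kw

# Example: four envelope elements, 21 degC indoors, 2 degC outdoors, 0.8 kW solar gain.
q_base = baseline_heating_power([120, 18, 90, 4], [2.5, 3.0, 1.8, 2.2], 21.0, 2.0, 0.8)
print(round(virtual_power(p_iac_actual_kw=3.0, q_base_kw=q_base), 2))  # > 0 means charging
```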
The virtual capacity of a Virtual Energy Storage System (VESS) at time step $t$, denoted as $E_{\mathrm{VESS}}(t)$, represents the total potential for thermal energy storage in the building envelope. It is quantified as the cumulative power that would theoretically be consumed to maintain indoor comfort if the air-conditioning system remained off while the indoor temperature passively decreased from the upper comfort limit $T_{\max}$ to the lower limit $T_{\min}$. The virtual capacity is therefore defined by the integral:

$$E_{\mathrm{VESS}}(t) = \int_{t_0}^{t_0 + \tau_{\max}(t)} P_{\mathrm{base}}(\tau)\,\mathrm{d}\tau$$
Here, $t_0$ is the start of the control interval at time $t$ (in hours), and $\tau_{\max}(t)$ is the duration (in hours) required for the indoor temperature to passively decrease from $T_{\max}$ to $T_{\min}$ when the IAC is turned off. The integrand $P_{\mathrm{base}}(\tau)$ represents the baseline power required to maintain thermal equilibrium. This parameter captures the maximum thermal energy that can be "stored" in the envelope through building thermal inertia over the specified temperature drift range.
To dynamically assess the real-time energy state of the VESS, we introduce the concept of virtual state of charge (SOC). This is based on the actual indoor temperature at the beginning of time interval $t$, denoted as $T_{\mathrm{in}}(t)$. When the IAC is switched off, the indoor temperature begins to decline from $T_{\mathrm{in}}(t)$ toward the lower comfort threshold $T_{\min}$. The actual stored virtual energy, denoted as $E_{\mathrm{act}}(t)$, is defined as the accumulated baseline power over this drift duration:

$$E_{\mathrm{act}}(t) = \int_{t_0}^{t_0 + \tau(t)} P_{\mathrm{base}}(\tau)\,\mathrm{d}\tau$$
In this case, $\tau(t)$ is the time needed for the indoor temperature to decrease from the current temperature $T_{\mathrm{in}}(t)$ to $T_{\min}$ with the IAC off. $E_{\mathrm{act}}(t)$ reflects the usable thermal energy stored at time $t$, given the indoor temperature conditions and the building's thermal properties.
The virtual state of charge of the VESS, denoted $SOC_{\mathrm{VESS}}(t)$, is then calculated as the ratio of the actual stored virtual energy $E_{\mathrm{act}}(t)$ to the maximum virtual capacity $E_{\mathrm{VESS}}(t)$:

$$SOC_{\mathrm{VESS}}(t) = \frac{E_{\mathrm{act}}(t)}{E_{\mathrm{VESS}}(t)}$$
This dimensionless indicator, ranging between 0 and 1, provides a real-time measure of the VESS's thermal energy status, analogous to the state of charge of an electrochemical battery. A higher $SOC_{\mathrm{VESS}}(t)$ indicates that the building has more "stored heat" that can be released (discharged), while a lower value suggests a greater potential to "absorb heat" (charge) through increased IAC operation. Together, the virtual power, capacity, and state of charge form a foundational model for integrating thermal inertia into real-time building energy control.
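Both drift integrals can be approximated numerically once a baseline power profile is available. The sketch below uses a simple rectangle-rule discretization; the function name, arguments, and example numbers are assumptions made for illustration.

```python
import numpy as np

def virtual_capacity_and_soc(p_base_kw, dt_h, tau_max_h, tau_now_h):
    """Discretize the two drift integrals and return (E_VESS, E_act, SOC_VESS).

    p_base_kw : baseline power profile over the drift window (kW per step)
    dt_h      : step length in hours
    tau_max_h : drift time from T_max down to T_min with the IAC off
    tau_now_h : drift time from the current indoor temperature down to T_min
    """
    p = np.asarray(p_base_kw)
    n_max = int(round(tau_max_h / dt_h))
    n_now = int(round(tau_now_h / dt_h))
    e_vess = float(np.sum(p[:n_max]) * dt_h)   # virtual capacity (kWh)
    e_act = float(np.sum(p[:n_now]) * dt_h)    # currently stored virtual energy (kWh)
    soc = e_act / e_vess if e_vess > 0 else 0.0
    return e_vess, e_act, min(max(soc, 0.0), 1.0)

# Example: a flat 2 kW baseline, 15-minute steps, 3 h full drift, 1.8 h remaining drift.
profile = [2.0] * 16
print(virtual_capacity_and_soc(profile, dt_h=0.25, tau_max_h=3.0, tau_now_h=1.8))
# -> (6.0, 3.5, 0.583...)
```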
To ensure the feasibility and stability of the building energy optimization control process, several operational constraints must be incorporated into the control model. These constraints govern the behavior of the Virtual Energy Storage System (VESS), the photovoltaic (PV) system output, and the overall power balance of the Building Energy System (BES).
The operational flexibility of the VESS is bounded by two primary constraints: the state of charge (SOC) and the power flow limits for charging and discharging. The state of charge constraint ensures that the virtual energy stored within the thermal envelope remains within physically interpretable bounds:

$$0 \le E_{\mathrm{act}}(t) \le E_{\mathrm{VESS}}(t), \quad \text{equivalently} \quad 0 \le SOC_{\mathrm{VESS}}(t) \le 1$$

This guarantees that the virtual energy level does not exceed the virtual capacity nor drop below zero, analogous to traditional battery SOC constraints. Additionally, the charging/discharging power constraint ensures that the power exchanged through the VESS at time $t$, denoted $P_{\mathrm{VESS}}(t)$, remains within the maximum allowable charging and discharging rates:

$$-P_{\mathrm{dis}}^{\max} \le P_{\mathrm{VESS}}(t) \le P_{\mathrm{ch}}^{\max}$$

Here, $P_{\mathrm{dis}}^{\max}$ and $P_{\mathrm{ch}}^{\max}$ represent the maximum discharge and charge power limits, respectively, determined by the building's thermal dynamics and the VESS control design.
The second group of constraints applies to the output of the photovoltaic (PV) system. The actual PV power delivered at time $t$, $P_{\mathrm{PV}}(t)$, must lie between zero and the forecasted maximum output based on irradiance predictions:

$$0 \le P_{\mathrm{PV}}(t) \le P_{\mathrm{PV}}^{\max}(t)$$

where $P_{\mathrm{PV}}^{\max}(t)$ is the maximum available PV power at time $t$, forecasted using meteorological and solar irradiance data. This ensures that the optimization does not overestimate the available renewable energy input.
The optimization model must satisfy the fundamental electrical power balance constraint to ensure supply-demand equilibrium at each time step. The sum of the power supplied from the grid, $P_{\mathrm{grid}}(t)$, and the PV generation, $P_{\mathrm{PV}}(t)$, must equal the total load demand of the BES, which includes the baseline power for thermal conditioning $P_{\mathrm{base}}(t)$, the VESS virtual power exchange $P_{\mathrm{VESS}}(t)$, the net EV charging power $P_{\mathrm{EV}}(t)$ (positive when charging, negative when discharging), and the power consumed by other devices $P_{\mathrm{other}}(t)$:

$$P_{\mathrm{grid}}(t) + P_{\mathrm{PV}}(t) = P_{\mathrm{base}}(t) + P_{\mathrm{VESS}}(t) + P_{\mathrm{EV}}(t) + P_{\mathrm{other}}(t)$$

This equality ensures that at every moment the BES remains in a state of power balance, thereby maintaining stable and reliable operation. These constraints collectively enable the DCPO-based control model to operate within realistic technical boundaries while pursuing cost-efficient and comfort-aware decision-making.
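A minimal feasibility check over the constraint groups described above might look as follows; the function and variable names mirror the notation in the text but are otherwise hypothetical, and the check itself is not part of the DCPO algorithm.

```python
def check_bes_constraints(p_grid, p_pv, p_pv_max, p_base, p_vess, p_ev, p_other,
                          e_act, e_vess, p_dis_max, p_ch_max, tol=1e-6):
    """Return one boolean per constraint group (illustrative feasibility check)."""
    return {
        # Virtual storage must stay between empty and its virtual capacity.
        "vess_soc": 0.0 - tol <= e_act <= e_vess + tol,
        # VESS exchange limited by the maximum discharge/charge rates.
        "vess_power": -p_dis_max - tol <= p_vess <= p_ch_max + tol,
        # PV dispatch cannot exceed the forecasted availability.
        "pv_output": 0.0 - tol <= p_pv <= p_pv_max + tol,
        # Grid plus PV supply must exactly cover all loads.
        "power_balance": abs((p_grid + p_pv) - (p_base + p_vess + p_ev + p_other)) <= tol,
    }

print(check_bes_constraints(p_grid=4.0, p_pv=1.5, p_pv_max=2.0, p_base=3.0,
                            p_vess=0.5, p_ev=1.0, p_other=1.0,
                            e_act=3.5, e_vess=6.0, p_dis_max=2.0, p_ch_max=2.0))
```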
In addition to the VESS, this paper incorporates Electric Vehicles (EVs) as flexible energy storage components within the BES framework. EVs are capable of both charging from and discharging to the building, thereby contributing to demand flexibility and system optimization. To accurately model this interaction, the state of charge (SOC) dynamics of the EV batteries must be considered. The SOC of the EV at time $t+1$ is given by:

$$SOC_{\mathrm{EV}}(t+1) = SOC_{\mathrm{EV}}(t) + \frac{\eta_{\mathrm{EV}}\,P_{\mathrm{EV}}(t)\,\Delta t}{E_{\mathrm{EV}}}$$

where $\eta_{\mathrm{EV}}$ is the charging/discharging efficiency, $P_{\mathrm{EV}}(t)$ is the net EV power (positive for charging, negative for discharging), $\Delta t$ is the control timestep, and $E_{\mathrm{EV}}$ is the EV battery capacity. This equation allows the BES to manage EVs as both dynamic loads and distributed storage elements, enhancing the system's responsiveness to pricing and comfort constraints.
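A one-step implementation of this SOC update, under the simplifying assumption stated above that a single efficiency applies to both charging and discharging, could look like the following sketch.

```python
def ev_soc_step(soc, p_ev_kw, dt_h, capacity_kwh, eta=0.95):
    """One-step EV state-of-charge update (positive power charges, negative discharges).

    A single efficiency eta is applied to the net power, following the simplified
    description in the text; bounds keep the SOC physically valid.
    """
    soc_next = soc + (eta * p_ev_kw * dt_h) / capacity_kwh
    return min(max(soc_next, 0.0), 1.0)

# Example: a 60 kWh battery charging at 7 kW for one hour from 40% SOC.
print(round(ev_soc_step(0.40, 7.0, 1.0, 60.0), 3))  # ~0.511
```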
3. Proposed Method
Figure 2 illustrates the hierarchical control architecture of the proposed BES under the DCPO framework. The control strategy is organized into two main layers: the constraint policy layer and the operation layer. At the top, the constraint policy layer consists of a high-level controller that interacts with a constraint policy module to ensure that strategic decisions comply with thermal comfort and energy limitations. The high-level controller issues abstract actions, which are then interpreted by the low-level controller in the operation layer. This controller translates strategic policies into actionable device-level commands while receiving continuous observations from the system and feeding them back to the upper layers.
At the system level, the BES integrates multiple components, including PV generation, EV systems, IAC units, the VESS, and other loads, all of which interact with dynamic electricity price signals and weather data inputs. The VESS model is incorporated into the simulation environment to represent the building's thermal flexibility and is used consistently across all reinforcement learning methods to ensure a fair and unified comparison framework. The DCPO agent uses these data streams to make adaptive decisions in real time, optimizing energy cost, comfort, and grid interaction, and the diagram highlights the coordination between strategic planning and operational execution. Once training converges, the DCPO model is deployed within the BES. At each time step, the control strategy receives the current state vector and generates the appropriate control actions via the trained Actor network, ensuring adaptive and energy-efficient operation of the BES while respecting VESS dynamics and user comfort constraints.
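The deployed decision loop described above can be summarized in a short sketch. The environment interface (`read_bes_state`, `apply_actions`) and the `actor` callable are hypothetical placeholders; only the observe-infer-actuate structure follows the text.

```python
import time

def run_control_loop(actor, read_bes_state, apply_actions, dt_seconds=3600):
    """Deployed DCPO control loop: observe, infer, actuate, once per control interval."""
    while True:
        state = read_bes_state()   # SOC_VESS, temperatures, solar gain, price, time of day
        actions = actor(state)     # trained Actor maps the state to [P_VESS, P_PV] setpoints
        apply_actions(actions)     # dispatch setpoints to the IAC, PV inverter, and EV charger
        time.sleep(dt_seconds)     # wait for the next control interval
```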
3.1. Objective Function
In the context of integrating a VESS into a BES, the primary control objective is to minimize the total electricity cost incurred during system operation. This total cost $F$ is formulated as the sum of two components: the actual energy cost $F_1$ and a penalty term $F_2$ that accounts for violations of the indoor thermal comfort constraints associated with the VESS's state of charge. The total cost function is expressed as:

$$F = F_1 + F_2$$
The first component, $F_1$, represents the accumulated electricity cost over a 24 h optimization horizon. It accounts for the base thermal demand of the building, the power consumption of other electrical devices, the net purchased power from the grid, and the onsite photovoltaic (PV) power generation. This is computed as:

where $P_{\mathrm{base}}(t)$ is the baseline power demand related to thermal comfort (kW), $P_{\mathrm{other}}(t)$ denotes the power consumption of other electrical appliances (kW), $P_{\mathrm{grid}}(t)$ is the power purchased from the grid (kW), $P_{\mathrm{PV}}(t)$ is the on-site PV power generation (kW), $c(t)$ is the electricity price at time $t$ (in cents/kWh), and $\Delta t$ is the duration of each control interval, typically set to one hour.
The second component, $F_2$, introduces a penalty for indoor temperature violations that reflect breaches of the desired thermal comfort state. If the actual indoor temperature $T_{\mathrm{in}}(t)$ falls outside the acceptable range defined by a minimum threshold $T_{\min}$ and a maximum threshold $T_{\max}$, a penalty is incurred. The penalty function is defined as:

$$F_2 = \sigma \sum_{t} \left[\max\left(0,\ T_{\min} - T_{\mathrm{in}}(t)\right) + \max\left(0,\ T_{\mathrm{in}}(t) - T_{\max}\right)\right]$$

where $\sigma$ is a weighting factor (penalty coefficient) used to assign a cost to deviations outside the thermal comfort range. This formulation ensures that the control strategy not only minimizes energy costs but also maintains thermal comfort by constraining the VESS operation to realistic indoor temperature boundaries.
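The objective can be evaluated with a short routine such as the one below. Because the exact expression for $F_1$ is not reproduced here, the sketch assumes $F_1$ is the priced net grid purchase; the comfort penalty mirrors the structure described for $F_2$, and all numerical values are illustrative.

```python
def total_cost(prices, p_grid, t_in, t_min=20.0, t_max=24.0, sigma=5.0, dt_h=1.0):
    """Total objective F = F1 + F2 over the horizon (illustrative assumptions).

    F1 is taken here as the priced net grid purchase; F2 penalizes indoor
    temperatures outside [t_min, t_max] with coefficient sigma.
    """
    f1 = sum(c * max(p, 0.0) * dt_h for c, p in zip(prices, p_grid))
    f2 = sigma * sum(max(0.0, t_min - t) + max(0.0, t - t_max) for t in t_in)
    return f1 + f2

# Example: three intervals with one comfort violation of 0.5 degC.
print(total_cost(prices=[0.12, 0.30, 0.18], p_grid=[2.0, 1.0, 3.0],
                 t_in=[21.0, 23.5, 24.5]))
```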
In addition to the objective function, the optimization model is subject to several constraints that ensure physical feasibility. These include the VESS operational constraints, such as limits on virtual power, capacity, and state of charge; the PV output constraints, which keep generation within realistic solar irradiance and system capacity; and the power balance constraint, which ensures that at each time step the sum of PV generation and grid purchases meets the total load demand of the BES.
3.2. Hierarchical Policy-Based Control Modeling
To apply deep reinforcement learning to the optimization of a BES enhanced with a VESS, the control problem must first be framed as a hierarchical Markov decision process (HMDP). An HMDP is defined by four key components: the state space S, the action space A, the reward function R, and the state transition probability D. In this formulation, S includes all observable information available to the agent; A defines the set of controllable variables; R provides scalar feedback indicating the quality of actions taken; and D captures the probabilistic evolution of the environment in response to agent actions. For any given time step t, the variables s(t), a(t), and r(t) represent the specific realizations of the state, action, and reward, respectively.
To formulate the VESS-integrated BES optimization model as an HMDP, each of the HMDP components must be clearly defined. The state vector s(t) comprises the following elements: the VESS's virtual state of charge $SOC_{\mathrm{VESS}}(t)$, the indoor temperature $T_{\mathrm{in}}(t)$, the outdoor temperature $T_{\mathrm{out}}(t)$, the solar heat gain $Q_{\mathrm{sol}}(t)$, time information, and the real-time electricity price $c(t)$. Together, these parameters form a rich state representation, which is split between the two layers of the hierarchy:

High-Level State (Strategic Layer):

Low-Level State (Operational Layer):
Unlike conventional DRL formulations of BES control, this state vector explicitly includes the virtual SOC, allowing the agent to learn from the building's thermal inertia and to incorporate prior knowledge of energy flexibility into the policy network.
The action space a(t) is defined by two continuous control variables: the VESS power dispatch $P_{\mathrm{VESS}}(t)$ and the photovoltaic power output $P_{\mathrm{PV}}(t)$. These form the action vector at each time step as follows:

$$a(t) = \left[P_{\mathrm{VESS}}(t),\ P_{\mathrm{PV}}(t)\right]$$
As DCPO is a model-free reinforcement learning algorithm, it does not require an explicit expression of the BES objective function. Instead, the optimization goal, previously defined in Equation (9), is embedded in the reward function to guide the learning process. The total reward $r(t)$ consists of two components: an electricity cost-related reward $r_1(t)$ and a penalty/reward associated with the VESS's SOC, $r_2(t)$. These are linearly combined using pre-defined weight coefficients $\omega_1$ and $\omega_2$:

$$r(t) = \omega_1 r_1(t) + \omega_2 r_2(t)$$
This composite reward structure balances the dual goals of minimizing energy costs and maintaining thermal comfort, allowing the DCPO agent to explore optimal control strategies while adhering to real-world constraints.
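A compact sketch of how the state vector, action vector, and composite reward could be assembled is given below; the feature ordering, layer split, and weight values are assumptions for illustration only.

```python
import numpy as np

def build_state(soc_vess, t_in, t_out, q_sol, hour, price):
    """Assemble the HMDP state vector s(t) described in the text (feature order assumed)."""
    return np.array([soc_vess, t_in, t_out, q_sol, hour / 24.0, price], dtype=np.float32)

def build_action(p_vess_kw, p_pv_kw):
    """Continuous action vector a(t) = [P_VESS, P_PV]."""
    return np.array([p_vess_kw, p_pv_kw], dtype=np.float32)

def reward(r_cost, r_soc, w1=1.0, w2=0.5):
    """Composite reward r(t) = w1*r1 + w2*r2 (weights are illustrative)."""
    return w1 * r_cost + w2 * r_soc

s = build_state(soc_vess=0.55, t_in=21.3, t_out=2.0, q_sol=0.8, hour=14, price=0.18)
a = build_action(p_vess_kw=0.5, p_pv_kw=1.5)
print(s.shape, a.shape, reward(r_cost=-0.32, r_soc=0.1))
```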
With these definitions in place, the control problem for a BES integrated with a VESS is successfully formulated as an HMDP. This framework is well-suited for application of the DCPO algorithm, which iteratively improves the control policy by maximizing expected cumulative rewards while enforcing stability and constraint satisfaction through adaptive updates and variance reduction techniques. This formulation allows the BES to achieve real-time, robust, and cost-effective energy management under uncertain and dynamic conditions.
3.3. DCPO-Based Optimization of the HMDP Control Model
With the BES and VESS modeled as an HMDP, the next step is to apply the DCPO algorithm to derive real-time, adaptive control strategies. DCPO is a policy gradient method that seeks to maximize the expected long-term return of the policy. The expected return under policy $\pi$, denoted $J(\pi)$, is defined as:

$$J(\pi) = \sum_{s} d^{\pi}(s) \sum_{a} \pi(a \mid s)\, r(s, a)$$

Here, $\pi(a \mid s)$ is the probability of selecting action $a$ in state $s$ under policy $\pi$, $d^{\pi}(s)$ is the stationary distribution of states when the HMDP stabilizes under policy $\pi$, and $r(s, a)$ is the immediate reward obtained by taking action $a$ in state $s$. The goal is to find an optimal policy that selects the best action for every state so as to maximize $J(\pi)$.
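For intuition, the expected return under a fixed policy can be evaluated directly on a toy discrete problem; all numbers below are made up.

```python
import numpy as np

# Toy illustration of J(pi) = sum_s d(s) sum_a pi(a|s) r(s,a)
# for a two-state, two-action problem.
d = np.array([0.6, 0.4])                   # stationary state distribution d^pi(s)
pi = np.array([[0.7, 0.3],                 # pi(a|s): rows are states, columns actions
               [0.2, 0.8]])
r = np.array([[1.0, -0.5],                 # r(s, a): immediate reward
              [0.0,  2.0]])

j_pi = float(np.sum(d[:, None] * pi * r))  # expected return under the stationary distribution
print(round(j_pi, 3))                      # 0.97
```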
To do so, the policy $\pi$ is parameterized by $\theta$, such that $\pi_{\theta}(a \mid s)$. The optimization process involves computing the gradient of $J(\theta)$ with respect to $\theta$ using the policy gradient theorem:

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi}(s, a)\right]$$

where $Q^{\pi}(s, a)$ is the action-value function representing the expected cumulative reward obtained by taking action $a$ in state $s$ under policy $\pi$. DCPO adopts an Actor-Critic architecture, wherein the Actor is responsible for policy generation and the Critic estimates the value function. To ensure policy stability, DCPO employs a clipping mechanism that restricts large updates by bounding the policy ratio during training.
The loss function for the Actor network is expressed as:

$$L_{\mathrm{actor}}(\theta) = -\mathbb{E}_t\left[\min\left(\rho(t)\,I(t),\ \mathrm{clip}\left(\rho(t),\ 1-\varepsilon,\ 1+\varepsilon\right) I(t)\right) + \beta\, H\!\left(\pi_{\theta}(\cdot \mid s(t))\right)\right]$$

where $\rho(t)$ is the importance sampling ratio between the new and old policies at time $t$, $I(t)$ is the advantage function estimating how much better an action is than the average, $\varepsilon$ is the clipping threshold (a hyperparameter), $H(\pi_{\theta})$ is the entropy of the policy, and $\beta$ is the entropy coefficient, which encourages exploration.
The Critic network minimizes the value function estimation error, with its loss function defined as:

$$L_{\mathrm{critic}}(\phi) = \mathbb{E}_t\left[\left(B(t) - V_{\phi}(s(t))\right)^2\right]$$

where $V_{\phi}(s(t))$ is the value estimate of state $s(t)$, parameterized by $\phi$, and $B(t)$ is the actual cumulative return from time $t$ onward.
The importance sampling ratio $\rho(t)$ is calculated as:

$$\rho(t) = \frac{\pi_{\theta}\left(a(t) \mid s(t)\right)}{\pi_{\theta_{\mathrm{old}}}\left(a(t) \mid s(t)\right)}$$

This measures the relative likelihood of the current policy versus the previous policy in choosing action $a(t)$ at state $s(t)$. The advantage function $I(t)$ is defined as the difference between the actual return and the estimated value:

$$I(t) = B(t) - V_{\phi}(s(t))$$
To encourage exploration and avoid premature convergence, DCPO incorporates the entropy of the policy distribution:

$$H\!\left(\pi_{\theta}(\cdot \mid s(t))\right) = -\sum_{a} \pi_{\theta}\left(a \mid s(t)\right) \log \pi_{\theta}\left(a \mid s(t)\right)$$
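Putting the preceding elements together, the loss computation described above (ratio clipping plus an entropy bonus for the Actor, a squared-error loss for the Critic) can be sketched with PyTorch tensors. The full DCPO algorithm's adaptive clipping and explicit constraint handling are not modeled here; this is only the PPO-style core that the description matches, and all names and values are illustrative.

```python
import torch

def dcpo_style_losses(log_prob_new, log_prob_old, returns, values, entropy,
                      clip_eps=0.2, entropy_coef=0.01):
    """Clipped-surrogate actor loss and squared-error critic loss (sketch)."""
    ratio = torch.exp(log_prob_new - log_prob_old)        # rho(t)
    advantage = (returns - values).detach()               # I(t) = B(t) - V_phi(s(t))
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    actor_loss = -(torch.min(unclipped, clipped) + entropy_coef * entropy).mean()
    critic_loss = ((returns - values) ** 2).mean()        # L_critic(phi)
    return actor_loss, critic_loss

# Tiny example with a batch of three transitions (arbitrary numbers).
lp_new = torch.tensor([-1.0, -0.8, -1.2])
lp_old = torch.tensor([-1.1, -0.9, -1.0])
rets = torch.tensor([1.0, 0.5, -0.2])
vals = torch.tensor([0.8, 0.6, 0.0], requires_grad=True)
ent = torch.tensor([1.4, 1.3, 1.5])
print(dcpo_style_losses(lp_new, lp_old, rets, vals, ent))
```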
In the reinforcement learning-based control framework, the agent is trained to maximize a cumulative reward function that reflects the control objectives of the Building Energy System (BES). The immediate reward at time step $t$ is defined as:

where $C_E(t)$ is the operational energy cost at time $t$, $\beta$ is the comfort penalty exponent ($\beta > 1$), $C_{\mathrm{sw}}(t)$ is the energy switching or transition cost (e.g., equipment wear), and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are positive weighting factors balancing the economic, comfort, and operational efficiency objectives.
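Since the exact reward expression is not reproduced here, the following sketch merely illustrates one plausible combination of the three named terms, with the comfort violation raised to the exponent $\beta$ so that larger violations are penalized disproportionately; all weights and the functional form are assumptions.

```python
def immediate_reward(energy_cost, comfort_violation, switching_cost,
                     lam1=1.0, lam2=2.0, lam3=0.1, beta=1.5):
    """Illustrative reward combining the three terms named in the text.

    Assumed form: a weighted negative sum in which the comfort violation
    (degrees outside the comfort band) is raised to beta > 1.
    """
    return -(lam1 * energy_cost
             + lam2 * comfort_violation ** beta
             + lam3 * switching_cost)

# Example: 0.42 currency units of energy cost, 0.5 degC violation, small switching cost.
print(round(immediate_reward(0.42, 0.5, 0.2), 3))
```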