1. Introduction
The increasing demand for energy-efficient and resilient buildings has elevated the importance of intelligent Building Energy Systems (BES). Contemporary BES must manage energy consumption while balancing user comfort, grid interaction, and dynamic external conditions such as weather and electricity prices. However, the inherent complexity of building thermal dynamics, coupled with diverse and often conflicting operational objectives, makes traditional rule-based or static control strategies insufficient [1,2]. Reinforcement learning (RL) has emerged as a promising alternative for adaptive control in BES. By learning policies from environmental interactions, RL enables systems to autonomously optimize decisions in response to changing conditions without requiring an explicit building model [3]. In particular, deep reinforcement learning (DRL) methods have shown potential in managing HVAC systems and distributed energy resources, thanks to their scalability and function approximation capabilities [4,5].
To address the complexity of dynamic energy environments, recent research has increasingly focused on integrating hierarchical and multi-agent reinforcement learning (MARL) into energy management systems. Hierarchical frameworks have demonstrated the ability to decompose long-horizon control problems into more tractable sub-problems. For instance, Cui et al. [6] proposed a hierarchical reinforcement learning method for dispatching a hybrid compressed air and battery energy storage system, achieving flexibility across multiple operation modes. Similarly, Zhou et al. [7] employed enhanced hierarchical RL to optimize HVAC operations, showing improved energy efficiency and comfort constraint handling. In multi-agent settings, Jendoubi and Bouffard [8] developed a multi-agent hierarchical reinforcement learning architecture for distributed energy management, where decentralized agents collaborate while maintaining a layered control strategy. Yao et al. [9] introduced a hierarchical MARL framework with adjustable agent hierarchies for home energy systems, showing the potential of agent collaboration in maintaining energy balance and user comfort. Zhang et al. [10] extended these ideas to community-scale coordination, enabling multiple buildings to manage grid-aware energy loads using deep MARL. In parallel, RL-based control for building HVAC systems continues to evolve. Yang et al. [11] demonstrated that reinforcement learning with built-in comfort objectives can dynamically adapt HVAC operations in response to occupant needs and grid constraints. Other works have extended RL to broader intelligent environments, such as Azizi et al. [12], who integrated multi-agent RL for resource management in vehicular communication systems, indicating the cross-domain robustness of these methods.
Recent advances in reinforcement learning (RL) for building energy management have underscored the importance of incorporating safety, robustness, and constraint-awareness into control strategies. Traditional RL methods, while powerful, often fail to guarantee adherence to critical operational constraints such as thermal comfort, equipment limitations, and energy cost ceilings. To address these concerns, researchers have explored a range of constraint-aware enhancements. For instance, Jiang et al. [13] and Alotaibi [14] emphasized integrating real-time feedback on thermal comfort and energy pricing to drive context-aware and user-centric policies. Similarly, Kumaresan and Jeyaraj [15] proposed a reward shaping mechanism to embed demand-side management constraints directly into the learning process, improving transferability and policy stability across different building contexts. The fusion of RL with model-based approaches, such as Reinforced Model Predictive Control (RL-MPC), has also shown promise in navigating hard constraints while retaining the adaptability of learning-based control. Arroyo et al. [16] demonstrated that RL-MPC could outperform both standalone MPC and flat RL in multi-zone HVAC coordination. Robustness under environmental uncertainty has further been a key challenge, particularly in non-stationary climates or when forecasts are unreliable. Naug et al. [17] addressed this by adapting RL to non-stationary settings using online learning and model adaptation. Meanwhile, Oh [18] proposed a reinforcement-learning-based virtual energy storage operation strategy to manage wind power uncertainties, suggesting a potential synergy between storage models like VESS and RL under uncertainty. Beyond algorithmic enhancements, practical deployment remains a core focus. Silvestri et al. [19] introduced imitation learning to accelerate RL training while ensuring safety during exploration, offering a viable path for real-world implementation. Finally, Woo and Junghans [20] explored MPC-based control for thermo-active building systems (TABS), with a focus on surface condensation prevention, showing how safety constraints can be embedded into hybrid predictive frameworks. These efforts collectively highlight a trend toward hybrid and constraint-aware RL systems that can reliably operate under real-world uncertainties, an essential motivation for this work's integration of hierarchical control to enforce safety and flexibility in building energy systems.
The concept of Virtual Energy Storage Systems (VESS) has gained considerable traction as a viable alternative or supplement to physical batteries in building energy management, especially in light of cost, lifecycle, and sustainability considerations. VESS leverages the thermal inertia of buildings, that is, the natural ability of materials and structures to store and release heat over time, as a flexible and low-cost energy buffer. As highlighted by Fambri et al. [21], this approach provides a comparable level of flexibility to traditional electric batteries within renewable energy communities, while offering significant economic and environmental advantages. Similarly, Jin and Liu [22] demonstrated that integrating VESS into grid-connected distributed energy systems, especially through coupling with air conditioning systems, can enhance peak shaving performance and grid responsiveness. In building microgrid contexts, VESS enables the coordination of demand and supply without requiring physical storage units. Lv et al. [23] explored this by optimizing microgrid operations using thermal-based VESS models, showing improvements in operational efficiency and renewable integration. At a broader system level, Jani et al. [24] incorporated both physical and virtual storage in a multi-objective optimization framework for multi-microgrid systems, suggesting that VESS can effectively complement existing storage assets over different time scales. More recently, Mu et al. [25] proposed a data-driven rolling optimization approach to control BES equipped with VESS, emphasizing real-time adaptability to dynamic load and weather conditions. Their method significantly enhanced system stability while reducing operational costs. Moreover, Alhammad et al. [26] emphasized the role of digital technologies like Building Information Modeling (BIM) and Building Energy Modeling (BEM) in enabling accurate VESS characterization, simulation, and control. This integration is essential for scalable deployment, as accurate thermal models underpin the effectiveness of VESS-based strategies. Collectively, these studies establish VESS as a critical enabler for flexible, sustainable, and cost-effective building energy management, aligning directly with the proposed framework in this paper, which utilizes VESS to model latent thermal flexibility as a form of virtual storage.
In the domain of building and energy system optimization, the integration of policy-aware reinforcement learning has emerged as a crucial advancement, enabling intelligent control systems to not only learn effective strategies but also comply with operational, safety, and regulatory constraints. Chen et al. [27] proposed a differentiable projection method to enforce policy feasibility in energy optimization tasks, ensuring that learned policies remain within allowable action boundaries, an essential capability in constrained environments like smart buildings and energy grids. Reinforcement learning's adaptability has also been demonstrated in broader energy systems, such as microgrids and transport, where Delavari and Naderian [28] developed a robust RL-based control framework for hybrid energy storage systems, and Jung [29] applied RL to optimize storage scheduling in urban railway systems, showing versatility in energy-aware policy learning. Several studies have focused on real-time control under uncertainty, where policy-based learning outperforms static optimization. Kolodziejczyk et al. [30] showcased the use of deep reinforcement learning to optimize real-time energy purchases in PV-storage systems, achieving enhanced economic performance through dynamic adaptation. Meanwhile, safety-critical applications have benefited from chance-constrained RL approaches, as illustrated by Mowbray et al. [31], who embedded probabilistic safety constraints into reinforcement learning to control chemical batch processes, a technique readily applicable to building systems where thermal comfort and equipment safety must be preserved. Furthermore, Zhang et al. [32] introduced a hybrid model-free DRL strategy for energy-efficient HVAC operation, balancing exploration and exploitation while satisfying comfort policies. These advancements underscore the growing importance of policy-constrained reinforcement learning in intelligent energy control applications. They affirm the relevance of embedding safe, adaptive, and constraint-aware policies within reinforcement learning frameworks, such as the one proposed in this study, where constraint-aware deep reinforcement learning is employed to manage building energy systems with virtual energy storage under real-time operational limits.
Traditional centralized or flat deep reinforcement learning (DRL) methods often fail to effectively handle the multi-timescale dynamics, large state–action spaces, and strict constraint satisfaction required for real-world energy systems (Korkas et al. [33]). These limitations are particularly evident in building energy management, where diverse objectives, such as energy cost reduction, thermal comfort, and responsiveness to external signals, must be balanced in real time. For instance, Mason and Grijalva [3] highlight that flat DRL architectures struggle with scalability and convergence in high-dimensional environments. Similarly, Naug et al. [17] demonstrate that DRL performance deteriorates in non-stationary energy environments without structured learning. Korkas et al. [34] further emphasize the need for hierarchical control frameworks to manage heterogeneous occupancy patterns and thermal zones effectively. These insights motivate the adoption of hierarchical and constraint-aware reinforcement learning techniques, such as the method proposed in this study, which leverages thermal inertia modeling and policy structure to overcome the aforementioned limitations.
Despite significant advancements in reinforcement learning and virtual energy storage modeling, several critical limitations remain unresolved in current Building Energy System (BES) control approaches. Most existing deep reinforcement learning (DRL) methods are either flat or centralized in structure, making them poorly suited for environments characterized by hierarchical decision-making, temporal abstraction, and system-wide constraints. These models often struggle to coordinate decisions across multiple timescales, fail to incorporate prior knowledge of building thermal dynamics, and suffer from instability or infeasibility under strict comfort and safety constraints. Moreover, traditional policy learning algorithms, such as Deep Q-Networks [35] and Deep Deterministic Policy Gradient [36], often overlook the potential of structured policy design and lack mechanisms to explicitly enforce constraint satisfaction during training and execution.
To address the limitations of existing building energy control approaches, this paper proposes a novel control framework that integrates constraint-aware deep reinforcement learning with virtual energy storage modeling. The framework is structured hierarchically, separating long-term strategic decisions, such as VESS charge/discharge planning and HVAC mode scheduling, from short-term operational control such as real-time temperature regulation and load balancing. At its core is the Dynamic Constraint-Aware Policy Optimization (DCPO) algorithm, an actor–critic method enhanced with policy constraints, entropy regularization, and adaptive clipping to ensure feasible, stable learning under energy and comfort constraints.
This work makes two key contributions. Theoretically, it advances the reinforcement learning domain by modeling thermal inertia as a Virtual Energy Storage System (VESS) and embedding it into a hierarchical control architecture that supports safe, adaptive, and interpretable decision-making. Practically, it demonstrates how the proposed DCPO-based control strategy can enable real-time, energy-efficient building operations, achieving over 30% cost savings and significant improvements in comfort compliance under realistic weather, occupancy, and pricing scenarios. The approach is designed for deployment in smart grids, net-zero buildings, and responsive energy systems, bridging the gap between academic RL research and real-world building energy management.
2. Building Energy System Modeling
Figure 1 illustrates the architecture of a Building Energy System (BES) enhanced with a Virtual Energy Storage System (VESS) and integrated Electric Vehicle (EV) subsystems [35,36,37,38]. The overall system comprises photovoltaic (PV) generation, inverter air conditioners (IACs), electric vehicles with bidirectional chargers, other electrical loads, smart meters, edge sensors, a weather data interface, and a centralized intelligent control center. These components work cohesively to ensure energy efficiency, comfort, and flexibility under varying operating conditions. In daily operation, electricity for the building is primarily supplied by PV panels and supplemented by the external distribution network (DN). The IAC regulates indoor temperature to maintain occupant comfort, while various other devices operate as passive loads. The EV system introduces a dynamic load element that also serves as a potential energy storage unit via vehicle-to-building (V2B) or vehicle-to-grid (V2G) capabilities. Smart meters retrieve real-time electricity pricing data from the DN and send it to the control center, while edge sensors continuously monitor environmental parameters and device states. Meteorological stations provide forecasts for outdoor temperature, solar radiation, and other relevant inputs.
The intelligent control center forms the decision-making core of the BES. It integrates data from all subsystems and applies the proposed Dynamic Constraint-aware Policy Optimization (DCPO) algorithm to optimize control decisions, including HVAC dispatch, PV utilization, EV charging/discharging, and power purchases from the grid. By combining the VESS model with EV behavior, the system achieves dynamic optimization across the thermal, electrical, and storage domains.
The VESS model captures the building envelope's thermal inertia as a form of virtual energy storage. It includes three key parameters: virtual power, representing deviation from baseline heating needs; virtual capacity, quantifying the maximum thermal energy that can be stored or released; and virtual state of charge (SOC), indicating the current level of "stored" thermal energy. This model allows the BES to shift heating loads in time without physical batteries, effectively enhancing energy flexibility.
The Electric Vehicle subsystem adds another layer of flexibility. EVs act as mobile, controllable storage units. When parked and connected to the BES, they can be charged during low electricity price periods and, if allowed by user constraints, discharge power back to the building during peak demand periods. The control center uses predicted vehicle arrival/departure times, user preferences, and battery state to determine optimal charging/discharging strategies, coordinated with the thermal management strategies via the hierarchical control structure.
By combining VESS-based thermal storage with EV electrical storage, the proposed BES architecture significantly enhances energy arbitrage opportunities, load-shifting capabilities, and grid interaction strategies. The integration ensures that both comfort and economic objectives are achieved while increasing the system's resilience and sustainability.
Inspired by the performance characteristics of physical battery systems, the VESS model introduces three core time-varying parameters. Virtual power, $P_{\mathrm{VESS}}(t)$, reflects the power variation in the VESS before and after participating in BES optimization control; it serves as a key indicator of the VESS's dynamic response capability to control signals. Virtual capacity, $E_{\mathrm{VESS}}(t)$, represents the maximum energy storage potential that the VESS can offer during the optimization process, which is critical for evaluating its contribution to system flexibility. Virtual state of charge, $SOC_{\mathrm{VESS}}(t)$, describes the real-time energy status of the VESS, analogous to the SOC of a conventional battery.
These parameters are interrelated by the following expression:

where $t$ is the current control time step, $\Delta t$ is the control time interval, $SOC_{\mathrm{VESS}}(t)$ is the virtual state of charge at time $t$, $E_{\mathrm{VESS}}(t)$ and $E_{\mathrm{VESS}}(t+1)$ are the virtual capacities at time steps $t$ and $t+1$ (in kWh), and $P_{\mathrm{VESS}}(t)$ and $P_{\mathrm{VESS}}(t+1)$ are the virtual powers at time steps $t$ and $t+1$ (in kW).
This formulation enables the BES control system to continuously assess and leverage the thermal storage flexibility of building envelopes as a virtual energy buffer, supporting dynamic optimization under fluctuating conditions.
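To make the bookkeeping concrete, the following Python sketch tracks the three VESS parameters over one control interval. The update rule shown (energy-proportional SOC scaling clipped to [0, 1]) is an illustrative assumption rather than the paper's exact expression, and the names `VessState` and `step_vess` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class VessState:
    virtual_power_kw: float      # P_VESS(t): deviation from baseline IAC power
    virtual_capacity_kwh: float  # E_VESS(t): max thermal energy storable in the envelope
    virtual_soc: float           # SOC_VESS(t): fraction of capacity currently "stored"

def step_vess(state: VessState, next_capacity_kwh: float,
              next_power_kw: float, dt_h: float = 1.0) -> VessState:
    """Advance the VESS bookkeeping by one control interval (assumed update rule)."""
    # Energy currently stored in the envelope, plus the energy exchanged this step.
    stored_kwh = state.virtual_soc * state.virtual_capacity_kwh
    stored_kwh += state.virtual_power_kw * dt_h  # charging if P_VESS > 0
    # Re-normalize against the next interval's capacity and keep the SOC feasible.
    soc = min(max(stored_kwh / max(next_capacity_kwh, 1e-6), 0.0), 1.0)
    return VessState(next_power_kw, next_capacity_kwh, soc)

# Example: one hour of charging at 1.5 kW with a 6 kWh envelope capacity.
s0 = VessState(virtual_power_kw=1.5, virtual_capacity_kwh=6.0, virtual_soc=0.3)
s1 = step_vess(s0, next_capacity_kwh=6.0, next_power_kw=0.0)
print(round(s1.virtual_soc, 3))  # 0.55
```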
To characterize the dynamic energy flexibility of a Virtual Energy Storage System (VESS), we first define the concept of virtual power. This is derived from the difference between actual air-conditioning power consumption and a theoretical baseline heating power required to maintain indoor comfort under standard conditions.
The baseline heating power, denoted as $Q_{\mathrm{base}}(t)$, is calculated by considering two critical environmental factors: outdoor temperature and solar radiation. Outdoor temperature influences the heat transfer intensity between the indoor and outdoor environments, while solar radiation contributes to heat gain through the building envelope. The calculation of $Q_{\mathrm{base}}(t)$ is given by:

$$Q_{\mathrm{base}}(t) = \sum_{i} K_i F_i \left(T_{\mathrm{in}}(t) - T_{\mathrm{out}}(t)\right) - Q_{\mathrm{sol}}(t)$$

where $i$ indexes the building envelope elements (walls, windows, roofs, and doors), $F_i$ is the inner surface area of the $i$-th component (m²), $K_i$ is the internal surface heat transfer coefficient (W/m²·°C), $T_{\mathrm{in}}(t)$ and $T_{\mathrm{out}}(t)$ are the indoor and outdoor temperatures at time $t$, respectively, and $Q_{\mathrm{sol}}(t)$ is the solar heat gain at time $t$ (kW).
Since the primary electricity consumption within the heating system is attributed to the inverter air conditioner (IAC), the auxiliary loads of the system (e.g., fans, controls) are neglected for modeling simplicity. The IAC's electrical power consumption $P_{\mathrm{IAC}}(t)$ is assumed to be linearly related to its thermal output $Q_{\mathrm{IAC}}(t)$, represented as:

$$P_{\mathrm{IAC}}(t) = a\,Q_{\mathrm{IAC}}(t) + b$$

where $a$ and $b$ are system-specific calibration constants derived from empirical data.
By substituting Equation (2) into Equation (3), the baseline electrical power consumption $P_{\mathrm{base}}(t)$ associated with the theoretical heating load needed to maintain indoor comfort can be calculated as a function of the environmental parameters and system coefficients:

$$P_{\mathrm{base}}(t) = a\,Q_{\mathrm{base}}(t) + b = a\left[\sum_{i} K_i F_i \left(T_{\mathrm{in}}(t) - T_{\mathrm{out}}(t)\right) - Q_{\mathrm{sol}}(t)\right] + b$$
Finally, the virtual power $P_{\mathrm{VESS}}(t)$ of the VESS is defined as the power deviation between the actual and baseline IAC energy consumption. This deviation reflects whether the system is effectively storing or releasing thermal energy:

$$P_{\mathrm{VESS}}(t) = P_{\mathrm{IAC}}(t) - P_{\mathrm{base}}(t)$$

A positive virtual power indicates that the VESS is in a charging state, storing excess thermal energy by increasing the indoor temperature beyond the baseline requirement. Conversely, a negative virtual power implies a discharging process, where previously stored heat is utilized to meet the thermal demand with reduced active power input. This modeling framework enables BES controllers to evaluate and manage building thermal inertia as a virtual energy buffer, providing flexibility for load shifting and energy cost optimization.
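The calculation chain above (envelope heat balance, linear IAC model, actual-minus-baseline deviation) can be sketched in a few lines of Python. The helper names and the coefficient values are illustrative assumptions; only the structure follows the text.

```python
def baseline_heating_power(areas_m2, k_coeffs, t_in, t_out, q_sol_kw):
    """Baseline thermal power Q_base(t) in kW from the envelope heat balance."""
    # Envelope transmission losses (W converted to kW), reduced by solar heat gain.
    transmission_kw = sum(k * f for k, f in zip(k_coeffs, areas_m2)) * (t_in - t_out) / 1000.0
    return transmission_kw - q_sol_kw

def iac_electric_power(q_thermal_kw, a=0.31, b=0.12):
    """Linear IAC model: electrical input as a function of thermal output (assumed a, b)."""
    return a * q_thermal_kw + b

def virtual_power(p_iac_actual_kw, q_base_kw, a=0.31, b=0.12):
    """P_VESS(t): actual IAC draw minus the baseline draw needed for comfort."""
    p_base_kw = iac_electric_power(q_base_kw, a, b)
    return p_iac_actual_kw - p_base_kw

# Example: four envelope elements, 21 degC indoors, 2 degC outdoors, 0.8 kW solar gain.
q_base = baseline_heating_power([120, 18, 90, 4], [2.5, 3.0, 1.8, 2.2], 21.0, 2.0, 0.8)
print(round(virtual_power(p_iac_actual_kw=3.0, q_base_kw=q_base), 2))  # > 0 means charging
```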
The virtual capacity of a Virtual Energy Storage System (VESS) at time step $t$, denoted as $E_{\mathrm{VESS}}(t)$, represents the total potential for thermal energy storage in the building envelope. It is quantified as the cumulative power that would theoretically be consumed to maintain indoor comfort if the air-conditioning system remained off while the indoor temperature passively decreased from the upper comfort limit $T_{\max}$ to the lower limit $T_{\min}$. The virtual capacity is therefore defined by the integral:

$$E_{\mathrm{VESS}}(t) = \int_{t_0}^{t_0 + \tau_{\max}(t)} P_{\mathrm{base}}(\tau)\,\mathrm{d}\tau$$
Here, $t_0$ is the start of the control interval at time $t$ (in hours), and $\tau_{\max}(t)$ is the duration (in hours) required for the indoor temperature to passively decrease from $T_{\max}$ to $T_{\min}$ when the IAC is turned off. The integrand $P_{\mathrm{base}}(\tau)$ represents the baseline power required to maintain thermal equilibrium. This parameter captures the maximum thermal energy that can be "stored" in the envelope through building thermal inertia over the specified temperature drift range.
To dynamically assess the real-time energy state of the VESS, we introduce the concept of virtual state of charge (SOC). This is based on the actual indoor temperature at the beginning of time interval $t$, denoted as $T_{\mathrm{in}}(t)$. When the IAC is switched off, the indoor temperature begins to decline from $T_{\mathrm{in}}(t)$ toward the lower comfort threshold $T_{\min}$. The actual stored virtual energy, denoted as $E_{\mathrm{act}}(t)$, is defined as the accumulated baseline power over this drift duration:

$$E_{\mathrm{act}}(t) = \int_{t_0}^{t_0 + \tau(t)} P_{\mathrm{base}}(\tau)\,\mathrm{d}\tau$$
In this case, $\tau(t)$ is the time needed for the indoor temperature to decrease from the current temperature $T_{\mathrm{in}}(t)$ to $T_{\min}$ with the IAC off. $E_{\mathrm{act}}(t)$ reflects the usable thermal energy stored at time $t$, given the indoor temperature conditions and the building's thermal properties.
The virtual state of charge of the VESS, denoted $SOC_{\mathrm{VESS}}(t)$, is then calculated as the ratio of the actual stored virtual energy $E_{\mathrm{act}}(t)$ to the maximum virtual capacity $E_{\mathrm{VESS}}(t)$:

$$SOC_{\mathrm{VESS}}(t) = \frac{E_{\mathrm{act}}(t)}{E_{\mathrm{VESS}}(t)}$$
This dimensionless indicator, ranging between 0 and 1, provides a real-time measure of the VESS's thermal energy status, analogous to the state of charge of an electrochemical battery. A higher $SOC_{\mathrm{VESS}}(t)$ indicates that the building has more "stored heat" that can be released (discharged), while a lower value suggests a greater potential to "absorb heat" (charge) through increased IAC operation. Together, the virtual power, capacity, and state of charge form a foundational model for integrating thermal inertia into real-time building energy control.
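Both drift integrals can be approximated numerically once a baseline power profile is available. The sketch below uses a simple rectangle-rule discretization; the function name, arguments, and example numbers are assumptions made for illustration.

```python
import numpy as np

def virtual_capacity_and_soc(p_base_kw, dt_h, tau_max_h, tau_now_h):
    """Discretize the two drift integrals and return (E_VESS, E_act, SOC_VESS).

    p_base_kw : baseline power profile over the drift window (kW per step)
    dt_h      : step length in hours
    tau_max_h : drift time from T_max down to T_min with the IAC off
    tau_now_h : drift time from the current indoor temperature down to T_min
    """
    p = np.asarray(p_base_kw)
    n_max = int(round(tau_max_h / dt_h))
    n_now = int(round(tau_now_h / dt_h))
    e_vess = float(np.sum(p[:n_max]) * dt_h)   # virtual capacity (kWh)
    e_act = float(np.sum(p[:n_now]) * dt_h)    # currently stored virtual energy (kWh)
    soc = e_act / e_vess if e_vess > 0 else 0.0
    return e_vess, e_act, min(max(soc, 0.0), 1.0)

# Example: a flat 2 kW baseline, 15-minute steps, 3 h full drift, 1.8 h remaining drift.
profile = [2.0] * 16
print(virtual_capacity_and_soc(profile, dt_h=0.25, tau_max_h=3.0, tau_now_h=1.8))
# -> (6.0, 3.5, 0.583...)
```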
To ensure the feasibility and stability of the building energy optimization control process, several operational constraints must be incorporated into the control model. These constraints govern the behavior of the Virtual Energy Storage System (VESS), the photovoltaic (PV) system output, and the overall power balance of the Building Energy System (BES).
The operational flexibility of the VESS is bounded by two primary constraints: the state of charge (SOC) and the power flow limits for charging and discharging. The state of charge constraint ensures that the virtual energy stored within the thermal envelope remains within physically interpretable bounds:

$$0 \le E_{\mathrm{act}}(t) \le E_{\mathrm{VESS}}(t), \quad \text{equivalently} \quad 0 \le SOC_{\mathrm{VESS}}(t) \le 1$$

This guarantees that the virtual energy level does not exceed the virtual capacity nor drop below zero, analogous to traditional battery SOC constraints. Additionally, the charging/discharging power constraint ensures that the power exchanged through the VESS at time $t$, denoted $P_{\mathrm{VESS}}(t)$, remains within the maximum allowable charging and discharging rates:

$$-P_{\mathrm{dis}}^{\max} \le P_{\mathrm{VESS}}(t) \le P_{\mathrm{ch}}^{\max}$$

Here, $P_{\mathrm{dis}}^{\max}$ and $P_{\mathrm{ch}}^{\max}$ represent the maximum discharge and charge power limits, respectively, determined by the building's thermal dynamics and the VESS control design.
The second group of constraints applies to the output of the photovoltaic (PV) system. The actual PV power delivered at time $t$, $P_{\mathrm{PV}}(t)$, must lie between zero and the forecasted maximum output based on irradiance predictions:

$$0 \le P_{\mathrm{PV}}(t) \le P_{\mathrm{PV}}^{\max}(t)$$

where $P_{\mathrm{PV}}^{\max}(t)$ is the maximum available PV power at time $t$, forecasted using meteorological and solar irradiance data. This ensures that the optimization does not overestimate the available renewable energy input.
The optimization model must satisfy the fundamental electrical power balance constraint to ensure supply-demand equilibrium at each time step. The sum of the power supplied from the grid, $P_{\mathrm{grid}}(t)$, and the PV generation, $P_{\mathrm{PV}}(t)$, must equal the total load demand of the BES, which includes the baseline power for thermal conditioning $P_{\mathrm{base}}(t)$, the VESS virtual power exchange $P_{\mathrm{VESS}}(t)$, the net EV charging power $P_{\mathrm{EV}}(t)$ (positive when charging, negative when discharging), and the power consumed by other devices $P_{\mathrm{other}}(t)$:

$$P_{\mathrm{grid}}(t) + P_{\mathrm{PV}}(t) = P_{\mathrm{base}}(t) + P_{\mathrm{VESS}}(t) + P_{\mathrm{EV}}(t) + P_{\mathrm{other}}(t)$$

This equality ensures that at every moment the BES remains in a state of power balance, thereby maintaining stable and reliable operation. These constraints collectively enable the DCPO-based control model to operate within realistic technical boundaries while pursuing cost-efficient and comfort-aware decision-making.
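A minimal feasibility check over the constraint groups described above might look as follows; the function and variable names mirror the notation in the text but are otherwise hypothetical, and the check itself is not part of the DCPO algorithm.

```python
def check_bes_constraints(p_grid, p_pv, p_pv_max, p_base, p_vess, p_ev, p_other,
                          e_act, e_vess, p_dis_max, p_ch_max, tol=1e-6):
    """Return one boolean per constraint group (illustrative feasibility check)."""
    return {
        # Virtual storage must stay between empty and its virtual capacity.
        "vess_soc": 0.0 - tol <= e_act <= e_vess + tol,
        # VESS exchange limited by the maximum discharge/charge rates.
        "vess_power": -p_dis_max - tol <= p_vess <= p_ch_max + tol,
        # PV dispatch cannot exceed the forecasted availability.
        "pv_output": 0.0 - tol <= p_pv <= p_pv_max + tol,
        # Grid plus PV supply must exactly cover all loads.
        "power_balance": abs((p_grid + p_pv) - (p_base + p_vess + p_ev + p_other)) <= tol,
    }

print(check_bes_constraints(p_grid=4.0, p_pv=1.5, p_pv_max=2.0, p_base=3.0,
                            p_vess=0.5, p_ev=1.0, p_other=1.0,
                            e_act=3.5, e_vess=6.0, p_dis_max=2.0, p_ch_max=2.0))
```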
In addition to the VESS, this paper incorporates Electric Vehicles (EVs) as flexible energy storage components within the BES framework. EVs are capable of both charging from and discharging to the building, thereby contributing to demand flexibility and system optimization. To accurately model this interaction, the state of charge (SOC) dynamics of the EV batteries must be considered. The SOC of the EV at time $t+1$ is given by:

$$SOC_{\mathrm{EV}}(t+1) = SOC_{\mathrm{EV}}(t) + \frac{\eta_{\mathrm{EV}}\,P_{\mathrm{EV}}(t)\,\Delta t}{E_{\mathrm{EV}}}$$

where $\eta_{\mathrm{EV}}$ is the charging/discharging efficiency, $P_{\mathrm{EV}}(t)$ is the net EV power (positive for charging, negative for discharging), $\Delta t$ is the control timestep, and $E_{\mathrm{EV}}$ is the EV battery capacity. This equation allows the BES to manage EVs as both dynamic loads and distributed storage elements, enhancing the system's responsiveness to pricing and comfort constraints.
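A one-step implementation of this SOC update, under the simplifying assumption stated above that a single efficiency applies to both charging and discharging, could look like the following sketch.

```python
def ev_soc_step(soc, p_ev_kw, dt_h, capacity_kwh, eta=0.95):
    """One-step EV state-of-charge update (positive power charges, negative discharges).

    A single efficiency eta is applied to the net power, following the simplified
    description in the text; bounds keep the SOC physically valid.
    """
    soc_next = soc + (eta * p_ev_kw * dt_h) / capacity_kwh
    return min(max(soc_next, 0.0), 1.0)

# Example: a 60 kWh battery charging at 7 kW for one hour from 40% SOC.
print(round(ev_soc_step(0.40, 7.0, 1.0, 60.0), 3))  # ~0.511
```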
3. Proposed Method
Figure 2 illustrates the hierarchical control architecture of the proposed BES under the DCPO framework. The control strategy is organized into two main layers: the constraint policy layer and the operation layer. At the top, the constraint policy layer consists of a high-level controller that interacts with a constraint policy module to ensure that strategic decisions comply with thermal comfort and energy limitations. The high-level controller issues abstract actions, which are then interpreted by the low-level controller in the operation layer. This controller translates strategic policies into actionable device-level commands while receiving continuous observations from the system and feeding them back to the upper layers.
At the system level, the BES integrates multiple components, including PV generation, EV systems, IAC units, the VESS, and other loads, all of which interact with dynamic electricity price signals and weather data inputs. The VESS model is incorporated into the simulation environment to represent the building's thermal flexibility and is used consistently across all reinforcement learning methods to ensure a fair and unified comparison framework. The DCPO agent uses these data streams to make adaptive decisions in real time, optimizing energy cost, comfort, and grid interaction, and the diagram highlights the coordination between strategic planning and operational execution. Once training converges, the DCPO model is deployed within the BES. At each time step, the control strategy receives the current state vector and generates the appropriate control actions via the trained Actor network, ensuring adaptive and energy-efficient operation of the BES while respecting VESS dynamics and user comfort constraints.
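The deployed decision loop described above can be summarized in a short sketch. The environment interface (`read_bes_state`, `apply_actions`) and the `actor` callable are hypothetical placeholders; only the observe-infer-actuate structure follows the text.

```python
import time

def run_control_loop(actor, read_bes_state, apply_actions, dt_seconds=3600):
    """Deployed DCPO control loop: observe, infer, actuate, once per control interval."""
    while True:
        state = read_bes_state()   # SOC_VESS, temperatures, solar gain, price, time of day
        actions = actor(state)     # trained Actor maps the state to [P_VESS, P_PV] setpoints
        apply_actions(actions)     # dispatch setpoints to the IAC, PV inverter, and EV charger
        time.sleep(dt_seconds)     # wait for the next control interval
```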
3.1. Objective Function
In the context of integrating a VESS into a BES, the primary control objective is to minimize the total electricity cost incurred during system operation. This total cost $F$ is formulated as the sum of two components: the actual energy cost $F_1$ and a penalty term $F_2$ that accounts for violations of the indoor thermal comfort constraints associated with the VESS's state of charge. The total cost function is expressed as:

$$F = F_1 + F_2$$
The first component, $F_1$, represents the accumulated electricity cost over a 24 h optimization horizon. It accounts for the base thermal demand of the building, the power consumption of other electrical devices, the net purchased power from the grid, and the onsite photovoltaic (PV) power generation. This is computed as:

where $P_{\mathrm{base}}(t)$ is the baseline power demand related to thermal comfort (kW), $P_{\mathrm{other}}(t)$ denotes the power consumption of other electrical appliances (kW), $P_{\mathrm{grid}}(t)$ is the power purchased from the grid (kW), $P_{\mathrm{PV}}(t)$ is the on-site PV power generation (kW), $c(t)$ is the electricity price at time $t$ (in cents/kWh), and $\Delta t$ is the duration of each control interval, typically set to one hour.
The second component, $F_2$, introduces a penalty for indoor temperature violations that reflect breaches of the desired thermal comfort state. If the actual indoor temperature $T_{\mathrm{in}}(t)$ falls outside the acceptable range defined by a minimum threshold $T_{\min}$ and a maximum threshold $T_{\max}$, a penalty is incurred. The penalty function is defined as:

$$F_2 = \sigma \sum_{t} \left[\max\left(0,\ T_{\min} - T_{\mathrm{in}}(t)\right) + \max\left(0,\ T_{\mathrm{in}}(t) - T_{\max}\right)\right]$$

where $\sigma$ is a weighting factor (penalty coefficient) used to assign a cost to deviations outside the thermal comfort range. This formulation ensures that the control strategy not only minimizes energy costs but also maintains thermal comfort by constraining the VESS operation to realistic indoor temperature boundaries.
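The objective can be evaluated with a short routine such as the one below. Because the exact expression for $F_1$ is not reproduced here, the sketch assumes $F_1$ is the priced net grid purchase; the comfort penalty mirrors the structure described for $F_2$, and all numerical values are illustrative.

```python
def total_cost(prices, p_grid, t_in, t_min=20.0, t_max=24.0, sigma=5.0, dt_h=1.0):
    """Total objective F = F1 + F2 over the horizon (illustrative assumptions).

    F1 is taken here as the priced net grid purchase; F2 penalizes indoor
    temperatures outside [t_min, t_max] with coefficient sigma.
    """
    f1 = sum(c * max(p, 0.0) * dt_h for c, p in zip(prices, p_grid))
    f2 = sigma * sum(max(0.0, t_min - t) + max(0.0, t - t_max) for t in t_in)
    return f1 + f2

# Example: three intervals with one comfort violation of 0.5 degC.
print(total_cost(prices=[0.12, 0.30, 0.18], p_grid=[2.0, 1.0, 3.0],
                 t_in=[21.0, 23.5, 24.5]))
```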
In addition to the objective function, the optimization model is subject to several constraints that ensure physical feasibility. These include the VESS operational constraints, such as limits on virtual power, capacity, and state of charge; the PV output constraints, which keep generation within realistic solar irradiance and system capacity; and the power balance constraint, which ensures that at each time step the sum of PV generation and grid purchases meets the total load demand of the BES.
3.2. Hierarchical Policy-Based Control Modeling
To apply deep reinforcement learning to the optimization of a BES enhanced with a VESS, the control problem must first be framed as a hierarchical Markov decision process (HMDP). An HMDP is defined by four key components: the state space S, the action space A, the reward function R, and the state transition probability D. In this formulation, S includes all observable information available to the agent; A defines the set of controllable variables; R provides scalar feedback indicating the quality of actions taken; and D captures the probabilistic evolution of the environment in response to agent actions. For any given time step t, the variables s(t), a(t), and r(t) represent the specific realizations of the state, action, and reward, respectively.
To formulate the VESS-integrated BES optimization model as an HMDP, each of the HMDP components must be clearly defined. The state vector s(t) comprises the following elements: the VESS's virtual state of charge $SOC_{\mathrm{VESS}}(t)$, the indoor temperature $T_{\mathrm{in}}(t)$, the outdoor temperature $T_{\mathrm{out}}(t)$, the solar heat gain $Q_{\mathrm{sol}}(t)$, time information, and the real-time electricity price $c(t)$. Together, these parameters form a rich state representation, which is split between the two layers of the hierarchy:

High-Level State (Strategic Layer):

Low-Level State (Operational Layer):
Unlike conventional DRL formulations of BES control, this state vector explicitly includes the virtual SOC, allowing the agent to learn from the building's thermal inertia and to incorporate prior knowledge of energy flexibility into the policy network.
The action space a(t) is defined by two continuous control variables: the VESS power dispatch $P_{\mathrm{VESS}}(t)$ and the photovoltaic power output $P_{\mathrm{PV}}(t)$. These form the action vector at each time step as follows:

$$a(t) = \left[P_{\mathrm{VESS}}(t),\ P_{\mathrm{PV}}(t)\right]$$
As DCPO is a model-free reinforcement learning algorithm, it does not require an explicit expression of the BES objective function. Instead, the optimization goal, previously defined in Equation (9), is embedded in the reward function to guide the learning process. The total reward $r(t)$ consists of two components: an electricity cost-related reward $r_1(t)$ and a penalty/reward associated with the VESS's SOC, $r_2(t)$. These are linearly combined using pre-defined weight coefficients $\omega_1$ and $\omega_2$:

$$r(t) = \omega_1 r_1(t) + \omega_2 r_2(t)$$
This composite reward structure balances the dual goals of minimizing energy costs and maintaining thermal comfort, allowing the DCPO agent to explore optimal control strategies while adhering to real-world constraints.
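A compact sketch of how the state vector, action vector, and composite reward could be assembled is given below; the feature ordering, layer split, and weight values are assumptions for illustration only.

```python
import numpy as np

def build_state(soc_vess, t_in, t_out, q_sol, hour, price):
    """Assemble the HMDP state vector s(t) described in the text (feature order assumed)."""
    return np.array([soc_vess, t_in, t_out, q_sol, hour / 24.0, price], dtype=np.float32)

def build_action(p_vess_kw, p_pv_kw):
    """Continuous action vector a(t) = [P_VESS, P_PV]."""
    return np.array([p_vess_kw, p_pv_kw], dtype=np.float32)

def reward(r_cost, r_soc, w1=1.0, w2=0.5):
    """Composite reward r(t) = w1*r1 + w2*r2 (weights are illustrative)."""
    return w1 * r_cost + w2 * r_soc

s = build_state(soc_vess=0.55, t_in=21.3, t_out=2.0, q_sol=0.8, hour=14, price=0.18)
a = build_action(p_vess_kw=0.5, p_pv_kw=1.5)
print(s.shape, a.shape, reward(r_cost=-0.32, r_soc=0.1))
```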
With these definitions in place, the control problem for a BES integrated with a VESS is successfully formulated as an HMDP. This framework is well-suited for application of the DCPO algorithm, which iteratively improves the control policy by maximizing expected cumulative rewards while enforcing stability and constraint satisfaction through adaptive updates and variance reduction techniques. This formulation allows the BES to achieve real-time, robust, and cost-effective energy management under uncertain and dynamic conditions.
3.3. DCPO-Based Optimization of the HMDP Control Model
With the BES and VESS modeled as an HMDP, the next step is to apply the DCPO algorithm to derive real-time, adaptive control strategies. DCPO is a policy gradient method that seeks to maximize the expected long-term return of the policy. The expected return under policy $\pi$, denoted $J(\pi)$, is defined as:

$$J(\pi) = \sum_{s} d^{\pi}(s) \sum_{a} \pi(a \mid s)\, r(s, a)$$

Here, $\pi(a \mid s)$ is the probability of selecting action $a$ in state $s$ under policy $\pi$, $d^{\pi}(s)$ is the stationary distribution of states when the HMDP stabilizes under policy $\pi$, and $r(s, a)$ is the immediate reward obtained by taking action $a$ in state $s$. The goal is to find an optimal policy that selects the best action for every state so as to maximize $J(\pi)$.
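For intuition, the expected return under a fixed policy can be evaluated directly on a toy discrete problem; all numbers below are made up.

```python
import numpy as np

# Toy illustration of J(pi) = sum_s d(s) sum_a pi(a|s) r(s,a)
# for a two-state, two-action problem.
d = np.array([0.6, 0.4])                   # stationary state distribution d^pi(s)
pi = np.array([[0.7, 0.3],                 # pi(a|s): rows are states, columns actions
               [0.2, 0.8]])
r = np.array([[1.0, -0.5],                 # r(s, a): immediate reward
              [0.0,  2.0]])

j_pi = float(np.sum(d[:, None] * pi * r))  # expected return under the stationary distribution
print(round(j_pi, 3))                      # 0.97
```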
To do so, the policy $\pi$ is parameterized by $\theta$, such that $\pi_{\theta}(a \mid s)$. The optimization process involves computing the gradient of $J(\theta)$ with respect to $\theta$ using the policy gradient theorem:

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi}(s, a)\right]$$

where $Q^{\pi}(s, a)$ is the action-value function representing the expected cumulative reward obtained by taking action $a$ in state $s$ under policy $\pi$. DCPO adopts an Actor-Critic architecture, wherein the Actor is responsible for policy generation and the Critic estimates the value function. To ensure policy stability, DCPO employs a clipping mechanism that restricts large updates by bounding the policy ratio during training.
The loss function for the Actor network is expressed as:

$$L_{\mathrm{actor}}(\theta) = -\mathbb{E}_t\left[\min\left(\rho(t)\,I(t),\ \mathrm{clip}\left(\rho(t),\ 1-\varepsilon,\ 1+\varepsilon\right) I(t)\right) + \beta\, H\!\left(\pi_{\theta}(\cdot \mid s(t))\right)\right]$$

where $\rho(t)$ is the importance sampling ratio between the new and old policies at time $t$, $I(t)$ is the advantage function estimating how much better an action is than the average, $\varepsilon$ is the clipping threshold (a hyperparameter), $H(\pi_{\theta})$ is the entropy of the policy, and $\beta$ is the entropy coefficient, which encourages exploration.
The Critic network minimizes the value function estimation error, with its loss function defined as:

$$L_{\mathrm{critic}}(\phi) = \mathbb{E}_t\left[\left(B(t) - V_{\phi}(s(t))\right)^2\right]$$

where $V_{\phi}(s(t))$ is the value estimate of state $s(t)$, parameterized by $\phi$, and $B(t)$ is the actual cumulative return from time $t$ onward.
The importance sampling ratio $\rho(t)$ is calculated as:

$$\rho(t) = \frac{\pi_{\theta}\left(a(t) \mid s(t)\right)}{\pi_{\theta_{\mathrm{old}}}\left(a(t) \mid s(t)\right)}$$

This measures the relative likelihood of the current policy versus the previous policy in choosing action $a(t)$ at state $s(t)$. The advantage function $I(t)$ is defined as the difference between the actual return and the estimated value:

$$I(t) = B(t) - V_{\phi}(s(t))$$
To encourage exploration and avoid premature convergence, DCPO incorporates the entropy of the policy distribution:

$$H\!\left(\pi_{\theta}(\cdot \mid s(t))\right) = -\sum_{a} \pi_{\theta}\left(a \mid s(t)\right) \log \pi_{\theta}\left(a \mid s(t)\right)$$
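Putting the preceding elements together, the loss computation described above (ratio clipping plus an entropy bonus for the Actor, a squared-error loss for the Critic) can be sketched with PyTorch tensors. The full DCPO algorithm's adaptive clipping and explicit constraint handling are not modeled here; this is only the PPO-style core that the description matches, and all names and values are illustrative.

```python
import torch

def dcpo_style_losses(log_prob_new, log_prob_old, returns, values, entropy,
                      clip_eps=0.2, entropy_coef=0.01):
    """Clipped-surrogate actor loss and squared-error critic loss (sketch)."""
    ratio = torch.exp(log_prob_new - log_prob_old)        # rho(t)
    advantage = (returns - values).detach()               # I(t) = B(t) - V_phi(s(t))
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    actor_loss = -(torch.min(unclipped, clipped) + entropy_coef * entropy).mean()
    critic_loss = ((returns - values) ** 2).mean()        # L_critic(phi)
    return actor_loss, critic_loss

# Tiny example with a batch of three transitions (arbitrary numbers).
lp_new = torch.tensor([-1.0, -0.8, -1.2])
lp_old = torch.tensor([-1.1, -0.9, -1.0])
rets = torch.tensor([1.0, 0.5, -0.2])
vals = torch.tensor([0.8, 0.6, 0.0], requires_grad=True)
ent = torch.tensor([1.4, 1.3, 1.5])
print(dcpo_style_losses(lp_new, lp_old, rets, vals, ent))
```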
In the reinforcement learning-based control framework, the agent is trained to maximize a cumulative reward function that reflects the control objectives of the Building Energy System (BES). The immediate reward at time step $t$ is defined as:

where $C_E(t)$ is the operational energy cost at time $t$, $\beta$ is the comfort penalty exponent ($\beta > 1$), $C_{\mathrm{sw}}(t)$ is the energy switching or transition cost (e.g., equipment wear), and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are positive weighting factors balancing the economic, comfort, and operational efficiency objectives.
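Since the exact reward expression is not reproduced here, the following sketch merely illustrates one plausible combination of the three named terms, with the comfort violation raised to the exponent $\beta$ so that larger violations are penalized disproportionately; all weights and the functional form are assumptions.

```python
def immediate_reward(energy_cost, comfort_violation, switching_cost,
                     lam1=1.0, lam2=2.0, lam3=0.1, beta=1.5):
    """Illustrative reward combining the three terms named in the text.

    Assumed form: a weighted negative sum in which the comfort violation
    (degrees outside the comfort band) is raised to beta > 1.
    """
    return -(lam1 * energy_cost
             + lam2 * comfort_violation ** beta
             + lam3 * switching_cost)

# Example: 0.42 currency units of energy cost, 0.5 degC violation, small switching cost.
print(round(immediate_reward(0.42, 0.5, 0.2), 3))
```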