Article

Heterogeneous Exploration and Double-Critic Transfer Reinforcement Learning for Sustainable Cross-Domain Energy Management in Smart Buildings

1
School of Information Science and Engineering, Shenyang University of Technology, Shenyang 110870, China
2
School of Artificial Intelligence, Shenyang University of Technology, Shenyang 110870, China
*
Author to whom correspondence should be addressed.
Sustainability 2026, 18(11), 5685; https://doi.org/10.3390/su18115685
Submission received: 30 April 2026 / Revised: 30 May 2026 / Accepted: 1 June 2026 / Published: 3 June 2026

Abstract

The integration of distributed energy resources (DERs) has enhanced the operational flexibility and complexity of smart building energy management, which is crucial to urban sustainable development. However, the limitations of strategy applicability across different environments and lengthy development cycles pose significant challenges for energy management. To address this, this paper proposes a transferred multi-thread deep reinforcement learning (TMDRL) framework for the cross-domain energy management of smart buildings. Firstly, a source-domain heterogeneous exploration architecture based on multi-thread deep reinforcement learning (DRL) is proposed. A transferable source-domain knowledge base is constructed to enhance the generalization ability of pre-trained strategies. Secondly, a decoupled double-critic optimization mechanism is designed to mitigate policy evaluation bias during cross-domain transfer. Finally, simulations using real-world datasets from different times and areas are conducted. The results show that compared to A3C, DDPG, and SAC, the proposed TMDRL framework reduces total costs by 32.77%, 18.14%, and 37.24%, while improving convergence efficiency by 29.55%, 22.89%, and 32.84%, respectively. The reduction in total cost and improvement in convergence efficiency demonstrate that the proposed TMDRL framework effectively saves energy and enhances the utilization of renewable energy, proving the sustainable benefits of smart building energy management across domains.

1. Introduction

In recent decades, the rapid development of DERs in distribution networks, such as rooftop solar photovoltaic (PV) systems, controllable loads, electric vehicles (EVs), and energy storage (ES) systems, has significantly transformed the landscape of traditional energy systems and promoted global sustainability [1]. Against this background, the energy management of smart buildings has attracted great attention. Smart buildings utilize DERs and smart technologies to provide load flexibility while co-optimizing energy cost and efficiency [2]. However, the uncertain power behavior of DERs across different times and areas increases the difficulty of energy management [3]. Moreover, changes in environmental factors such as weather patterns and electricity prices pose serious challenges, and energy management strategies developed for one environment often cannot be extended even to similar environments. Therefore, it is imperative to develop efficient methods that can quickly adapt to new environments without requiring extensive retraining from scratch.
The energy management methods of smart buildings are usually classified into two types: model-based methods and learning-based methods. Model-based methods include rule-based control [4], distributed coordinated control [5], and model predictive control [6]. However, the performance of these methods is limited in dynamic environments, and the computational complexity of the underlying optimization grows with the scale of the system. To overcome these limitations, DRL has emerged as a promising learning-based alternative [7]. By directly interacting with the dynamic environment, DRL agents learn optimal policies without requiring explicit system models [8,9]. Ref. [10] models the multi-stage uncertainty of buildings as a multi-agent interaction process and generates the globally optimal local policy under uncertainty through an iterative algorithm. Ref. [11] proposes an attention-based fault diagnosis model combined with multi-parameter joint optimization to detect HVAC anomalies and reduce unnecessary energy consumption. Ref. [12] designs a two-tier DRL-based scheduling framework that optimizes DERs under demand response and improves operational efficiency through the hierarchical interaction of two DRL models.
Although the above DRL methods reduce the computational burden compared to traditional model-based methods, they exhibit fundamental limitations in practical smart building energy management. Specifically, these methods require extensive interaction with the environment to converge on an optimal strategy, leading to prohibitively long training times. Furthermore, a strategy trained in a source domain often performs poorly when deployed to a target domain, even if the two domains share similarities. Consequently, retraining DRL agents from scratch for each new environment becomes necessary, which is both time-consuming and data-intensive, thereby limiting the actual deployment of DRL-based energy management systems. To address this, Ref. [13] proposes a distributed framework to accelerate data collection and training. To improve the performance of multi-agent training, Ref. [14] proposes an imitation mechanism. The effectiveness of the strategy is improved by clone pretraining and expert knowledge replay. The above methods improve exploration efficiency but do not solve the cross-domain adaptation problem.
Transfer learning (TL) utilizes the neural network parameters of a source domain as the initialization point for a target domain, enabling non-zero exploration in a target domain [15]. The model quickly adapts to the differences in the target domain through progressive fine-tuning, thereby reducing training costs [16]. Recently, many studies have explored the application of TL in energy management. Ref. [17] designs a deep transfer RL method for the energy management of multiple residences, which effectively reduces both energy consumption and user discomfort. Ref. [18] investigates the transferability of DRL-based control strategies and designs an integrated framework of DRL and TL to optimize building heating, ventilation and air conditioning (HVAC) systems. Based on DRL and TL, Ref. [19] transfers the energy management strategy based on an enhanced SAC algorithm to different types of EVs, which improves the efficiency of model training.
To address the cross-domain adaptation challenge, TL methods usually transfer the entire neural network containing actor and critic from the source-domain to the target domain. However, this transfer leads to bias in the value estimation in the target domain due to the source-domain assumptions. Because the pre-trained critic network is optimized entirely on source-domain transition dynamics and reward structures, it embeds source domain assumptions into its value estimations. Consequently, when deployed in target domains with differing characteristics, this biased critic consistently produces inaccurate value estimates, which inevitably misguides the actor network during policy updates and leads to suboptimal convergence. The bias is particularly problematic in building energy management, where user behavior, weather conditions, and electricity price may vary across areas and times. Furthermore, existing methods insufficiently exploit the synergy between parallel exploration and knowledge transfer. Although multi-threaded DRL accelerates training in a single domain and TL promotes cross-domain strategy transfer, their combination has not been fully explored, especially in the mechanism of reducing the bias of critics during the transfer process.
In summary, while DRL algorithms have shown promise in building energy management, they struggle with rapid adaptation to environmental changes. Although TL leverages pre-trained networks to accelerate cross-domain adaptation, directly transferring the critic network embeds source-domain assumptions into the target domain, introducing severe policy evaluation bias. Therefore, a novel TMDRL framework is proposed to overcome the limitations of strategy transfer. Unlike existing approaches that simply combine these techniques, the proposed framework integrates a multi-thread DRL architecture for accelerated exploration, a TL mechanism for cross-domain knowledge reuse, and a decoupled double-critic optimization mechanism to mitigate policy evaluation bias during transfer. The results show that the well-trained source-domain management strategy can effectively adapt to the target domain, improve the utilization of renewable energy and the flexibility of the system, and prove the sustainable benefits of cross-domain smart building energy management. The main contributions of this paper are as follows:
(1)
A heterogeneous exploration architecture for the source-domain is introduced, utilizing multi-thread DRL to construct a transferable knowledge base. Unlike conventional distributed methods and standard prioritized experience replay that focus on training speed, the heterogeneous exploration architecture employs a quantitative novelty-based filtering mechanism to selectively store transitions. This maximizes the coverage of different source domain environments and enhances the generality and transferability of the learned knowledge.
(2)
A decoupled double-critic optimization mechanism is designed to mitigate policy evaluation bias during cross-domain transfer. Unlike symmetrically initialized double-critic methods, the proposed mechanism introduces a randomly initialized local-critic network to learn the target-domain dynamics independently, while the transfer-critic network provides shared source-domain prior knowledge. The two critic networks jointly guide the policy update, which enhances the robustness and adaptability of the management strategy in new environments.
(3)
A TMDRL framework combining TL and multi-threaded DRL is proposed for cross-domain smart building energy management. By integrating the source-domain heterogeneous exploration architecture and the decoupled double-critic optimization mechanism, the framework achieves robust knowledge construction and strategy transfer. The case study verifies that the proposed framework significantly shortens the development time, reduces the total cost and improves the sustainability of the smart building energy management.
The rest of this paper is organized as follows. Section 2 introduces the mathematical model and problem formulation of smart building energy management. Section 3 develops the TMDRL-based energy management framework. Section 4 presents the simulation setup and results evaluation. Section 5 draws the conclusions.

2. System Models and Problem Formulation

This paper considers the energy management of a smart building with DERs based on knowledge transfer, as shown in Figure 1. The smart building trades energy with the external grid at market electricity price. The smart building model is introduced in this section. Then, the MDP formulations for smart building energy management are developed.

2.1. Smart Building Model

Each smart building is equipped with rooftop solar systems, smart electrical appliances and smart meters to participate in energy management. The smart electrical appliances include HVAC systems, EVs, ES, deferrable loads and non-shiftable loads. The detailed models are as follows.

2.1.1. HVAC Systems

The HVAC system influences the power of the building to participate in energy optimization by maintaining thermal comfort inside the building [20]. The temperature range for thermal comfort is given by:
$$T_{\min} \le T_t^{\mathrm{BI}} \le T_{\max}$$
The maintenance of indoor temperature directly determines the power consumption of the HVAC system. For the convenience of calculation, the smart building is modeled as an isothermal air conditioning area [21]. The temperature dynamics of the air conditioning area are modeled by an R-C network as follows [22]:
$$C_{\mathrm{tc}} \frac{\mathrm{d}T_t^{\mathrm{BI}}}{\mathrm{d}t} = P_t^{\mathrm{HVAC}} - \frac{T_t^{\mathrm{BI}} - T_t^{\mathrm{BO}}}{R_{\mathrm{tr}}}$$
In addition, the power of the HVAC system is constrained as:
$$0 \le P_t^{\mathrm{HVAC}} \le P_{\max}^{\mathrm{HVAC}}$$
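As a rough illustration of how the R-C model can be simulated, the following sketch applies one explicit-Euler step to the indoor temperature and evaluates the comfort band. It assumes that the HVAC power enters as a heating term and that heat is lost to the outdoors through the thermal resistance; the function names, units (kW, °C, hours), and numerical values are hypothetical and are not taken from the paper.

```python
def step_indoor_temperature(t_in, t_out, p_hvac, c_tc, r_tr, dt):
    """Advance the indoor temperature by one explicit-Euler step.

    Discretizes C_tc * dT/dt = P_hvac - (T_in - T_out) / R_tr.
    """
    d_temp_dt = (p_hvac - (t_in - t_out) / r_tr) / c_tc
    return t_in + dt * d_temp_dt


def clip_hvac_power(p_hvac, p_max):
    """Enforce the HVAC power limits 0 <= P_hvac <= P_max."""
    return min(max(p_hvac, 0.0), p_max)


def comfort_violation(t_in, t_min, t_max):
    """Distance of the indoor temperature from the band [t_min, t_max]."""
    return max(t_min - t_in, 0.0) + max(t_in - t_max, 0.0)


# Hypothetical parameters: thermal capacitance 2 kWh/degC,
# thermal resistance 10 degC/kW, 15-min step (0.25 h).
balanced = step_indoor_temperature(20.0, 0.0, 2.0, c_tc=2.0, r_tr=10.0, dt=0.25)
unheated = step_indoor_temperature(20.0, 0.0, 0.0, c_tc=2.0, r_tr=10.0, dt=0.25)
```

With these illustrative numbers, 2 kW of HVAC power exactly offsets the heat loss, so the temperature holds; with the HVAC off, the zone cools.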

2.1.2. Electric Vehicles

EVs are temporary ES systems with flexible charging and discharging capabilities. They charge and discharge at dedicated charging piles in the building to satisfy travel demands [23]. The charging and discharging process of the EV battery from arrival time $t_a$ to departure time $t_d$ is formulated as:
$$E_{t+1}^{\mathrm{EV}} = \eta_c^{\mathrm{EV}} P_{c,t}^{\mathrm{EV}} \Delta t - P_{d,t}^{\mathrm{EV}} \Delta t / \eta_d^{\mathrm{EV}} + E_t^{\mathrm{EV}}$$
where the charging and discharging efficiencies $\eta_c^{\mathrm{EV}}$ and $\eta_d^{\mathrm{EV}}$ take values between 0 and 1.
Furthermore, EVs cannot be charged and discharged simultaneously. The operational constraints are as follows:
$$P_{c,t}^{\mathrm{EV}} \times P_{d,t}^{\mathrm{EV}} = 0$$
$$0 \le P_{c,t}^{\mathrm{EV}} \le P_{c,\max}^{\mathrm{EV}}$$
$$0 \le P_{d,t}^{\mathrm{EV}} \le P_{d,\max}^{\mathrm{EV}}$$
$$0 \le E_t^{\mathrm{EV}} \le E_{\max}^{\mathrm{EV}}$$
Moreover, the EV battery should be sufficiently charged before departure to support the next travel plan. The required stored energy at departure is set as:
$$E_d^{\mathrm{EV}} \ge E_e^{\mathrm{EV}}$$
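The battery update and its constraints can be checked with a few lines of code. The sketch below is only illustrative: the function names and the numerical values are hypothetical, and the efficiencies are assumed to lie strictly above zero so that the division is well defined.

```python
def next_stored_energy(energy, p_charge, p_discharge, eta_c, eta_d, dt):
    """Stored energy after one step of charging or discharging."""
    return energy + eta_c * p_charge * dt - p_discharge * dt / eta_d


def is_feasible(energy_next, p_charge, p_discharge, p_c_max, p_d_max, e_max):
    """Check mutual exclusivity, power limits, and the energy bounds."""
    exclusive = p_charge * p_discharge == 0.0
    within_power = 0.0 <= p_charge <= p_c_max and 0.0 <= p_discharge <= p_d_max
    within_energy = 0.0 <= energy_next <= e_max
    return exclusive and within_power and within_energy


def meets_departure_target(energy_at_departure, energy_expected):
    """Departure requirement: stored energy is at least the expected energy."""
    return energy_at_departure >= energy_expected


# Hypothetical values: 10 kWh stored, 15-min step, 90% efficiencies.
after_charge = next_stored_energy(10.0, 4.0, 0.0, eta_c=0.9, eta_d=0.9, dt=0.25)
after_discharge = next_stored_energy(10.0, 0.0, 4.5, eta_c=0.9, eta_d=0.9, dt=0.25)
```

The same update and feasibility check would apply to a stationary storage unit, with the departure requirement simply left out.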

2.1.3. ES System

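A dynamic model similar to that of the EVs is employed to describe the output characteristics of the ES as follows:
$$E_{t+1}^{\mathrm{ES}} = \eta_c^{\mathrm{ES}} P_{c,t}^{\mathrm{ES}} \Delta t - P_{d,t}^{\mathrm{ES}} \Delta t / \eta_d^{\mathrm{ES}} + E_t^{\mathrm{ES}}$$
where the charging and discharging efficiencies $\eta_c^{\mathrm{ES}}$ and $\eta_d^{\mathrm{ES}}$ take values between 0 and 1.
Similarly, the ES cannot be charged and discharged at the same time. The constraints are as follows:
$$P_{c,t}^{\mathrm{ES}} \times P_{d,t}^{\mathrm{ES}} = 0$$
$$0 \le P_{c,t}^{\mathrm{ES}} \le P_{c,\max}^{\mathrm{ES}}$$
$$0 \le P_{d,t}^{\mathrm{ES}} \le P_{d,\max}^{\mathrm{ES}}$$
Unlike EVs, the ES operates continuously to participate in energy management and has no expected stored energy target at fixed times. The stored energy is bounded as:
$$0 \le E_t^{\mathrm{ES}} \le E_{\max}^{\mathrm{ES}}$$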

2.1.4. Deferrable Appliances

Deferrable appliances require continuous operation over multiple time slots. Usually, once the appliances are started, their operation cannot be interrupted. Based on typical user behavior in smart buildings, we assume that each deferrable appliance operates once per day [24]. The constraints are as follows:
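$$\sum_{t = t_s^{\mathrm{DL}}}^{t_e^{\mathrm{DL}} - t_{dur}^{\mathrm{DL}}} B_t^{\mathrm{DL}} = 1, \quad t_{dur}^{\mathrm{DL}} \le t_e^{\mathrm{DL}} - t_s^{\mathrm{DL}}$$
$$0 \le P_t^{\mathrm{DL}} \Delta t \le B_t^{\mathrm{DL}} P_{\max}^{\mathrm{DL}}$$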

2.1.5. Power Balance Constraint

To ensure the safe and sustainable operation of smart buildings, it is necessary to meet the operating power balance of the system. The power balance constraints are as follows:
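$$P_t^{\mathrm{EG}} + P_t^{\mathrm{PV}} + P_{d,t}^{\mathrm{EV}} + P_{d,t}^{\mathrm{ES}} = P_t^{\mathrm{HVAC}} + P_t^{\mathrm{BL}} + P_t^{\mathrm{DL}} + P_{c,t}^{\mathrm{EV}} + P_{c,t}^{\mathrm{ES}}$$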
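To make the balance concrete, the power exchanged with the external grid can be computed as the residual between demand and local supply. This is a minimal sketch with hypothetical function names; the sign convention (positive for import, negative for export) and the split into purchased and sold power are assumptions made for illustration.

```python
def grid_power(p_pv, p_hvac, p_base, p_dl, p_ev_c, p_ev_d, p_es_c, p_es_d):
    """Grid exchange that closes the power balance.

    Assumed sign convention: positive means import, negative means export.
    """
    demand = p_hvac + p_base + p_dl + p_ev_c + p_es_c
    local_supply = p_pv + p_ev_d + p_es_d
    return demand - local_supply


def split_grid_power(p_grid):
    """Split the signed grid exchange into (purchased, sold) power."""
    return max(p_grid, 0.0), max(-p_grid, 0.0)


# Hypothetical operating point in kW.
p_grid = grid_power(p_pv=3.0, p_hvac=2.0, p_base=1.5, p_dl=0.5,
                    p_ev_c=2.0, p_ev_d=0.0, p_es_c=0.0, p_es_d=1.0)
purchased, sold = split_grid_power(p_grid)
```

Here the building demands 6 kW and supplies 4 kW locally, so 2 kW is imported and nothing is sold.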

2.2. MDP Formulation for Energy Management

2.2.1. Environment State Set

The agent observes the environment state at time t, selects an action according to the policy $\pi$, and applies the action to the environment. The environment transitions to the state of the next time slot and provides a reward to the agent at time t + 1. The environment state set is defined as:
$$S_t^{B} = \{ \underbrace{P_t^{\mathrm{PV}}, T_t^{\mathrm{BO}}, \lambda_t^{\mathrm{EP}}, \lambda_t^{\mathrm{ES}}}_{\text{initial states}}, \underbrace{B_t^{\mathrm{DL}}}_{\text{indicating state}}, \underbrace{T_t^{\mathrm{BI}}, P_t^{\mathrm{HVAC}}, P_t^{\mathrm{EV}}, E_t^{\mathrm{EV}}, P_t^{\mathrm{ES}}, E_t^{\mathrm{ES}}, P_t^{\mathrm{EG}}, P_t^{\mathrm{DL}}}_{\text{controllable states}} \}$$
The state set contains initial states, an indicating state, and controllable states. The initial states, namely PV power generation, outdoor temperature, and electricity price, are uncontrollable variables. The indicating state is a binary variable that indicates whether the deferrable load is activated. When $B_t^{\mathrm{DL}} = 1$, a signal is sent to the agent indicating that the deferrable load is activated and is currently in an uninterruptible operation cycle; otherwise, it is not activated. This signal affects the decision-making process by enforcing continuous operation in subsequent time steps until the task duration is completed, thereby ensuring the operational integrity of the device. The controllable states are the decision variables of the agent's optimization and serve as feedback signals that guide the agent to balance energy consumption among economy, thermal comfort, and range anxiety. Because the environment depends on exogenous time-series characteristics, it is partially observable, and the current observation alone is not sufficient to capture the underlying system dynamics. Considering that the time resolution of system operation in this paper is 15 min, we set the history window length to four time steps. This window size is chosen to capture the short-term trends of highly dynamic exogenous variables without excessively expanding the input dimensionality. The historical values of the exogenous variables from the past four time steps are flattened in chronological order and concatenated with the features of the current time step along the feature dimension, resulting in a unified one-dimensional augmented state vector of length 25. All continuous state features are normalized to the [0, 1] range using Min–Max scaling before being fed into the neural network.
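The construction of the augmented state can be sketched as follows. The sketch assumes that three exogenous variables (PV power, outdoor temperature, and electricity price) are tracked over the four-step window, which together with the 13 current features gives 4 × 3 + 13 = 25 entries. The class name, the padding of the window at the start of an episode, the placement of the history before the current features, and the clamping in the scaler are illustrative assumptions.

```python
from collections import deque

WINDOW = 4  # four 15-min steps of history


def min_max(value, low, high):
    """Scale value into [0, 1]; values outside [low, high] are clamped."""
    if high <= low:
        raise ValueError("high must exceed low")
    return min(max((value - low) / (high - low), 0.0), 1.0)


class HistoryAugmenter:
    """Keeps the last WINDOW exogenous observations and builds the flat state."""

    def __init__(self, window=WINDOW):
        self.history = deque(maxlen=window)

    def reset(self, exogenous):
        """Pad the window with the first observation (an assumption)."""
        self.history.clear()
        self.history.extend([tuple(exogenous)] * self.history.maxlen)

    def augment(self, current_features, exogenous):
        """Return past exogenous values (oldest first) plus current features."""
        flat_history = [v for step in self.history for v in step]
        state = flat_history + list(current_features)
        self.history.append(tuple(exogenous))  # history for the next step
        return state


# Hypothetical, already normalized inputs.
augmenter = HistoryAugmenter()
augmenter.reset((0.0, 0.5, 0.3))
state_1 = augmenter.augment([0.1] * 13, (0.2, 0.5, 0.4))
state_2 = augmenter.augment([0.1] * 13, (0.3, 0.6, 0.4))
```

Because the window is a fixed-length queue, the oldest observation drops out automatically each time a new one is appended.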

2.2.2. Control Action Set

Based on the state set of the environment, all appliances collectively determine optimal actions. The action set is a four-tuple, as follows:
$$A_t^{B} = (A_t^{\mathrm{HVAC}}, A_t^{\mathrm{EV}}, A_t^{\mathrm{ES}}, A_t^{\mathrm{DL}})$$
where $A_t^{\mathrm{HVAC}} \in [0, 1]$, $A_t^{\mathrm{EV}} \in [-1, 1]$, and $A_t^{\mathrm{ES}} \in [-1, 1]$ are continuous variables, and $A_t^{\mathrm{DL}} \in \{0, 1\}$ is a binary variable.
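One possible way to turn such an action tuple into physical set-points is sketched below. It assumes that the HVAC action lies in [0, 1], that the EV and ES actions lie in [-1, 1] with positive values meaning charging and negative values meaning discharging, and that the deferrable-load action is thresholded at 0.5. These conventions, the function names, and the limits are hypothetical; the paper itself states only that hard power limits are enforced by clipping the actor output.

```python
def clip(value, low, high):
    """Clamp value to the closed interval [low, high]."""
    return min(max(value, low), high)


def split_bidirectional(action, p_c_max, p_d_max):
    """Map a signed action to (charging power, discharging power).

    At most one of the two outputs is non-zero, so charging and
    discharging can never occur in the same time slot.
    """
    action = clip(action, -1.0, 1.0)
    if action >= 0.0:
        return action * p_c_max, 0.0
    return 0.0, -action * p_d_max


def map_action(raw_action, limits):
    """Convert a raw (hvac, ev, es, dl) action into physical set-points."""
    a_hvac, a_ev, a_es, a_dl = raw_action
    p_ev_c, p_ev_d = split_bidirectional(a_ev, limits["ev_c"], limits["ev_d"])
    p_es_c, p_es_d = split_bidirectional(a_es, limits["es_c"], limits["es_d"])
    return {
        "p_hvac": clip(a_hvac, 0.0, 1.0) * limits["hvac"],
        "p_ev_c": p_ev_c,
        "p_ev_d": p_ev_d,
        "p_es_c": p_es_c,
        "p_es_d": p_es_d,
        "dl_on": 1 if a_dl >= 0.5 else 0,
    }


# Hypothetical power limits in kW.
LIMITS = {"hvac": 5.0, "ev_c": 7.0, "ev_d": 7.0, "es_c": 3.0, "es_d": 3.0}
setpoints = map_action((1.4, 0.5, -1.0, 0.8), LIMITS)
```

Encoding charging and discharging in a single signed variable is a convenient design choice because the mutual-exclusivity constraint then holds by construction.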

2.2.3. State Transition Probability

The environment receives the control action set and transitions probabilistically from $S_t^{B}$ to $S_{t+1}^{B}$. The state transition is expressed as:
$$S_{t+1}^{B} = P(S_t^{B}, A_t^{B}, \varphi_t)$$
The state transition probability is stochastic due to environmental factors, making an accurate analytical formula difficult to obtain. Consequently, a model-free DRL-based algorithm is adopted to generate the optimal policy within this MDP model.

2.2.4. The Reward of MDP

The reward incorporates three components: (a) daily operating cost, (b) thermal comfort, and (c) EV range anxiety.
The daily operating cost component is designed as:
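$$R_t^{\mathrm{OC}} = \lambda_t^{\mathrm{EP}} P_t^{\mathrm{PE}} - \lambda_t^{\mathrm{ES}} P_t^{\mathrm{SE}} + \sigma_t^{\mathrm{DL}} P_t^{\mathrm{DL}} + \sigma_t^{\mathrm{ES}} P_{c,t}^{\mathrm{ES}} - \sigma_t^{\mathrm{ES}} P_{d,t}^{\mathrm{ES}}$$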
To ensure thermal comfort and consider the computational complexity of the model, this paper takes indoor temperature as the main index of thermal comfort. The deviation of indoor temperature from the expected temperature range is punished, as shown below:
$$R_t^{\mathrm{TC}} = \lambda^{\mathrm{TC}} \left( [T_{\min} - T_t^{\mathrm{BI}}]^{+} + [T_t^{\mathrm{BI}} - T_{\max}]^{+} \right)$$
Range anxiety is the concern that the EV battery energy is not enough to reach the destination. Therefore, a high battery energy is maintained before departure to relieve range anxiety. The range anxiety term penalizes the shortfall of the battery energy at departure relative to the expected energy, as follows:
$$R_t^{\mathrm{RA}} = \lambda^{\mathrm{EV}} [E_e^{\mathrm{EV}} - E_t^{\mathrm{EV}}]^{+}$$
Considering these three components, the overall reward function at time t is expressed as:
$$R_t^{B} = \alpha_1 R_t^{\mathrm{OC}} + \alpha_2 R_t^{\mathrm{TC}} + \alpha_3 R_t^{\mathrm{RA}}$$
where $\alpha_1 = 0.6$, $\alpha_2 = 0.3$, and $\alpha_3 = 0.1$. The selection of these weights reflects the relative importance of each optimization goal. Specifically, minimizing the operating cost is the primary objective due to fluctuating electricity prices; thus, it is assigned the largest weight of 0.6. Setting the thermal comfort weight to 0.3 provides a sufficient penalty to maintain the indoor temperature without rendering the policy overly conservative. Finally, range anxiety is assigned a weight of 0.1. Since charging is a time-constrained task, a small weight is sufficient to ensure the target state of charge (SOC) at departure. In addition, this setting prevents extreme operating behavior. If the operating cost weight were too high, the agent would aggressively minimize expenditure at the cost of temperature violations. If the thermal comfort weight were too high, the agent would converge to a fixed temperature, which would impair economic operation. If the range anxiety weight were too high, the agent would charge at maximum power regardless of peak price periods. To handle operational constraints, we employ a hybrid constraint-handling approach. For the power limits, hard constraints are enforced by clipping the output of the actor network. Soft constraints are included in the reward function as penalty terms for violations. This ensures that the strategy remains physically feasible while learning to minimize constraint violations.
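The weighted combination of the three components can be illustrated with a short sketch. The weights 0.6, 0.3, and 0.1 are taken from the text; the function names, the penalty coefficients, and the numerical inputs are hypothetical. The sketch treats each component as a non-negative magnitude and returns their weighted sum; whether this sum is negated before being passed to the agent is a sign convention that the sketch deliberately leaves to the caller.

```python
WEIGHTS = (0.6, 0.3, 0.1)  # operating cost, thermal comfort, range anxiety


def positive_part(x):
    """The [x]^+ operator: x if positive, otherwise zero."""
    return max(x, 0.0)


def thermal_term(t_in, t_min, t_max, lam_tc):
    """Penalty for leaving the comfort band [t_min, t_max]."""
    return lam_tc * (positive_part(t_min - t_in) + positive_part(t_in - t_max))


def range_anxiety_term(e_ev, e_expected, lam_ev):
    """Penalty for the shortfall of EV energy relative to the expected energy."""
    return lam_ev * positive_part(e_expected - e_ev)


def weighted_sum(r_oc, r_tc, r_ra, weights=WEIGHTS):
    """Combine the three components with the weights alpha_1..alpha_3."""
    a1, a2, a3 = weights
    return a1 * r_oc + a2 * r_tc + a3 * r_ra


# Hypothetical step: operating cost 1.0, indoor temperature 1 degC above
# the band, EV battery 10 kWh short of the expected energy.
r_tc = thermal_term(27.0, 20.0, 26.0, lam_tc=1.0)
r_ra = range_anxiety_term(30.0, 40.0, lam_ev=0.1)
total = weighted_sum(1.0, r_tc, r_ra)
```

When the indoor temperature is inside the band and the EV energy meets the expected level, both penalty terms vanish and only the operating cost contributes.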

3. The TMDRL-Based Energy Management Framework

The variability in scheduling environments, such as outdoor temperature, PV generation, and electricity consumption behaviors, brings challenges to smart building energy management. These environmental changes render previously trained optimal strategies inapplicable. To address this, this section proposes a TMDRL framework combining TL and multi-thread DRL to develop optimal energy management strategies for new environments.

3.1. Source-Domain Heterogeneous Exploration

Conventional distributed DRL architectures are designed to accelerate the convergence of a single task within a fixed environment [13]. To address the cross-domain adaptation challenge in smart building energy management, a transferable knowledge base is constructed in the source domain as prior knowledge for subsequent transfer. The proposed multi-thread DRL architecture is shown in Figure 2.
To construct a representative transferable knowledge base in multi-threaded DRL, a novelty-based experience replay buffer with a novelty filtering mechanism is introduced. Traditional first-in-first-out buffers may suffer from redundancy because frequent steady states dominate the stored samples, whereas the novelty-based experience replay buffer preferentially stores samples that differ from existing knowledge. Eight parallel threads interact with the environment independently and asynchronously to maximize exploration diversity, and each state transition tuple generated by a thread is evaluated before being stored. The novelty score is calculated as the minimum Euclidean distance between the current state vector and the existing samples in the buffer, as follows:
$$N(s_{i,t}) = \min_{s_j \in D_{rb}} \left\| s_{i,t} - s_j \right\|_2$$
To justify the use of the Euclidean distance in a high-dimensional and heterogeneous state space, all state variables are normalized to a common range before any distance is computed. This normalization eliminates the scale discrepancies among diverse physical quantities and prevents variables with larger numerical magnitudes from dominating the distance metric. A sample is stored only when its novelty score exceeds a dynamic threshold. To limit the computational cost of this continuous filtering, the distance calculations are executed as vectorized matrix operations locally on each asynchronous thread. This decentralized process introduces negligible computational overhead and avoids bottlenecking the global network updates, which asynchronously aggregate the local gradients from the eight threads every 100 interaction steps. In practice, the novelty threshold is dynamically adjusted to prevent buffer starvation or redundancy. Specifically, the novelty threshold is set to the median of the minimum Euclidean distances computed for each incoming batch of transitions, so that only samples with above-median novelty are stored. This dynamic update continuously adapts to the newly collected data and tracks the progressively expanding exploration boundary. During the initial warm-up phase, the threshold is relaxed to zero to accelerate knowledge accumulation until the buffer reaches capacity. Thereafter, a replacement policy is enforced whereby a new sample displaces the oldest buffered sample only if its novelty score exceeds that of the transition to be evicted. This adaptive scheme maintains a fixed buffer size while maximizing coverage of the source-domain state distribution.
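The filtering rules described above can be sketched as a small buffer class. This is a simplified single-thread illustration rather than the authors' implementation: the class and method names are hypothetical, states are assumed to be already normalized, the threshold is computed once per batch against the buffer contents at arrival, and the very first sample (which has nothing to be compared with) is recorded with a score of zero so that it becomes the first eviction candidate.

```python
import math
import statistics
from collections import deque


class NoveltyReplayBuffer:
    """Stores (score, transition) pairs; transition[0] is the normalized state."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def novelty(self, state):
        """Minimum Euclidean distance from state to every buffered state."""
        if not self.items:
            return math.inf  # nothing to compare against yet
        return min(math.dist(state, tr[0]) for _, tr in self.items)

    def add_batch(self, batch):
        """Filter a batch of transitions and return how many were stored."""
        if not batch:
            return 0
        if len(self.items) < self.capacity:
            threshold = 0.0  # warm-up phase: threshold relaxed to zero
        else:
            threshold = statistics.median(
                [self.novelty(tr[0]) for tr in batch])
        stored = 0
        for tr in batch:
            score = self.novelty(tr[0])
            if score <= threshold:
                continue  # not novel enough
            if len(self.items) >= self.capacity:
                if score <= self.items[0][0]:
                    continue  # does not beat the oldest sample's score
                self.items.popleft()  # evict the oldest sample
            recorded = 0.0 if math.isinf(score) else score
            self.items.append((recorded, tr))
            stored += 1
        return stored

    def states(self):
        return [tr[0] for _, tr in self.items]


def transition(x, y):
    """Hypothetical (state, action, reward, next_state) tuple."""
    return ((x, y), 0, 0.0, (x, y))


buffer = NoveltyReplayBuffer(capacity=3)
first = buffer.add_batch([transition(0.0, 0.0), transition(0.0, 0.0),
                          transition(1.0, 0.0), transition(0.0, 2.0)])
second = buffer.add_batch([transition(0.1, 0.0), transition(5.0, 5.0)])
```

In this toy run the duplicate state is rejected during warm-up, and once the buffer is full only the distant state (5, 5) clears the median threshold and replaces the oldest entry.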
Unlike techniques such as prioritized experience replay [25], which retroactively reweight stored transitions based on temporal-difference errors, the proposed heterogeneous exploration architecture employs a proactive selection mechanism. By evaluating the Euclidean distance between incoming states and existing buffer contents prior to storage, this filtering process discards redundant transitions and retains diverse and underrepresented operational regimes. This difference ensures that the constructed knowledge base avoids the severe sample redundancy typical of periodic building energy systems. The global layer updates the neural network parameters by sampling experience from the novelty-based experience replay buffer rather than directly using the raw data. With this mechanism, the buffer maximizes the coverage of the source-domain state-action space and encourages the policy and value functions to learn general features from representative scenarios, thereby avoiding overfitting to specific scenarios. The policy parameters are updated by the gradient ascent method, as follows:
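$$\mathrm{d}\theta \leftarrow \mathrm{d}\theta + \nabla_{\theta} \log \pi(a_t \mid s_t; \theta) \left( R - V(s_t; \theta_v) \right)$$
$$R - V(s_t; \theta_v) = \sum_{i=0}^{k-1} \gamma^{i} r_{t+i} + \gamma^{k} V(s_{t+k}; \theta_v) - V(s_t; \theta_v)$$
The parameters of the critic network are updated by the gradient descent method, as follows:
$$\mathrm{d}\theta_v \leftarrow \mathrm{d}\theta_v + \partial \left( R - V(s_t; \theta_v) \right)^{2} / \partial \theta_v$$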
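The k-step return and the resulting advantage can be computed with a simple backward recursion, as in the following sketch. The function names and the numerical values are hypothetical, and gradient computation is omitted; only the scalar quantities that weight the policy gradient and define the critic loss are shown.

```python
def n_step_returns(rewards, bootstrap_value, gamma):
    """Discounted return for every step of a k-step rollout.

    The return at step j is the discounted sum of the remaining rewards
    plus the discounted critic estimate V(s_{t+k}) of the final state.
    """
    returns = []
    running = bootstrap_value
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(returns, values):
    """Advantage R - V(s) for every step of the rollout."""
    return [r - v for r, v in zip(returns, values)]


def critic_loss(returns, values):
    """Sum of squared errors (R - V(s))^2 minimized by the critic."""
    return sum((r - v) ** 2 for r, v in zip(returns, values))


# Hypothetical 3-step rollout.
returns = n_step_returns([1.0, 1.0, 1.0], bootstrap_value=2.0, gamma=0.5)
adv = advantages(returns, [1.5, 2.0, 2.5])
loss = critic_loss(returns, [1.5, 2.0, 2.5])
```

A positive advantage increases the log-probability of the action taken, while a negative one decreases it.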
Based on the distributed structure, multiple threads perform heterogeneous exploration concurrently, generating a diverse and high-quality set of samples that populate the buffer. This process accelerates the model's learning of source-domain variability. Furthermore, because the knowledge base is constructed with transfer in mind rather than only fast convergence, the pre-trained strategy gains robustness and transferability.

3.2. Cross-Domain Knowledge Transfer Protocol

Even for a new yet similar task in a different environment, DRL needs to train the model from scratch. The proposed method combines TL with DRL, uses neural networks to extract features from sufficient source-domain data, and applies them to similar target domains through multi-layer linear transformation and nonlinear activations. The goal of TL is to utilize the common knowledge learned in the source domain to accelerate policy learning in the target domain, thereby reducing the need for extensive retraining.
In this paper, to formally characterize the cross-domain energy management problem, we define the environment as an MDP tuple $M = (S, A, P, R)$ [11]. The source domain and target domain are denoted as $M_S = (S_S, A_S, P_S, R_S)$ and $M_T = (S_T, A_T, P_T, R_T)$, respectively. A domain shift occurs when a strategy pre-trained in $M_S$ is deployed to $M_T$, where the state space, which uses a unified representation of historical observations and controllable device states, and the action space remain structurally identical, i.e., $S_S = S_T$ and $A_S = A_T$, but the underlying environmental dynamics and objectives diverge. Specifically, the domain shift is characterized by two principal discrepancies: (1) Transition probability shift, $P_S \neq P_T$: this shift arises from the differing stochasticity in user behaviors, such as variations in EV arrival and departure time distributions across different residential areas, as well as the distinct temporal evolution patterns of exogenous variables, such as seasonal changes in outdoor temperature dynamics. (2) Reward function shift, $R_S \neq R_T$: this shift is caused by distinct spatiotemporal exogenous variables that directly enter the reward calculation, including different time-of-use (TOU) electricity pricing schemes, region-specific PV generation profiles, and seasonal outdoor temperature fluctuations that influence thermal comfort penalties. A schematic of network-based TL is shown in Figure 3.
The back propagation method is used to optimize weights and biases during training. The actor and critic networks are expressed as:
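$$\text{Actor:} \quad \begin{cases} X_1 = \mathrm{ReLU}(\omega_{a1} S + b_{a1}) \\ X_2 = \mathrm{ReLU}(\omega_{a2} X_1 + b_{a2}) \\ \quad \vdots \\ X_n = \mathrm{ReLU}(\omega_{an} X_{n-1} + b_{an}) \\ \mathrm{Action} = \mathrm{Tanh}(\omega_{a(n+1)} X_n + b_{a(n+1)}) \end{cases}$$
$$\text{Critic:} \quad \begin{cases} X_1 = \mathrm{ReLU}(\omega_{c1}^{s} S + \omega_{c1}^{a} A + b_{c1}) \\ X_2 = \mathrm{ReLU}(\omega_{c2} X_1 + b_{c2}) \\ \quad \vdots \\ X_n = \mathrm{ReLU}(\omega_{cn} X_{n-1} + b_{cn}) \\ R_{value} = \omega_{c(n+1)} X_n + b_{c(n+1)} \end{cases}$$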
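The layer-by-layer structure of the two networks can be illustrated with a dependency-free forward pass. This is a minimal sketch with hypothetical names and hand-picked weights, not the trained networks of the paper. In the critic, the state and action are concatenated and multiplied by a single first-layer matrix, which is mathematically equivalent to applying separate state and action weight matrices and summing the results.

```python
import math


def dense(weights, bias, x):
    """Affine layer: one output per row of the weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(vector):
    return [max(0.0, v) for v in vector]


def actor_forward(layers, state):
    """ReLU hidden layers followed by a Tanh output layer."""
    x = list(state)
    for weights, bias in layers[:-1]:
        x = relu(dense(weights, bias, x))
    weights, bias = layers[-1]
    return [math.tanh(v) for v in dense(weights, bias, x)]


def critic_forward(layers, state, action):
    """ReLU hidden layers on [state, action] followed by a linear output."""
    x = list(state) + list(action)
    for weights, bias in layers[:-1]:
        x = relu(dense(weights, bias, x))
    weights, bias = layers[-1]
    return dense(weights, bias, x)[0]


# Hypothetical tiny networks: 2-dim state, 1-dim action, one hidden unit.
actor_layers = [([[1.0, -1.0]], [0.0]), ([[2.0]], [0.0])]
critic_layers = [([[1.0, 1.0, 1.0]], [0.0]), ([[0.5]], [1.0])]

action = actor_forward(actor_layers, [3.0, 1.0])
value = critic_forward(critic_layers, [1.0, 2.0], [3.0])
```

In a network-based transfer, lists of weight and bias pairs such as `actor_layers` would be copied from the source domain to initialize the target-domain networks.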

3.3. Target-Domain Adaptation

Following the cross-domain knowledge transfer, the framework executes the target-domain adaptation. The proposed TMDRL framework is a hybrid actor–critic algorithm, which combines asynchronous parallel training structures to accelerate data collection in different environments, and applies a clipping probability ratio mechanism to maintain training stability. The proposed TMDRL pseudocode is shown in Appendix A. The framework of TMDRL is shown in Figure 4. Firstly, the neural network is trained in the source-domain until convergence, and the structure and parameters of the neural network are obtained. Conventional transfer RL methods typically initialize the target-domain networks using parameters directly copied from the source domain. However, this direct transfer approach introduces policy evaluation bias. Because the pre-trained critic network inherently evaluates actions based on source-domain dynamics and reward structures, this biased critic misguides the actor network policy updates during target-domain adaptation. To resolve this limitation of transfer RL, we design a decoupled double-critic optimization mechanism. A local-critic network with randomly initialized parameters is added, operating alongside the transferred critic network. In this decoupled design, the local-critic network learns exclusively from target-domain interactions, evaluating the policy free from source-domain bias, while the transfer-critic network provides shared knowledge. Their outputs are combined to guide the actor network update, which mitigates the policy evaluation bias. Traditional Twin Delayed DDPG employs two symmetrically initialized critics trained within the same target domain to mitigate the overestimation bias in single-domain reinforcement learning. In contrast, the proposed decoupled mechanism uses two asymmetric critic networks to mitigate the policy evaluation bias between the prior knowledge of the source domain and the dynamics of the actual target domain. 
One critic network is pre-trained in the source domain and transferred to the target domain, providing prior knowledge. The second critic network is randomly initialized and trained locally in the target domain, learning the new dynamics. The memory store passes the next-state data to the two critic networks, which estimate the corresponding next-state V-values. The state, action, and reward data are stacked and passed to the two critic networks to compute the R-values, as shown below:
$$R_{cri} = \sum_{i=0}^{k-1} \gamma^{i} r_{t+i} + \gamma^{k} V_{cri}(s_{t+k}; \theta_{v,cri})$$
$$R_{Lcri} = \sum_{i=0}^{k-1} \gamma^{i} r_{t+i} + \gamma^{k} V_{Lcri}(s_{t+k}; \theta_{v,Lcri})$$
In the proposed decoupled double-critic optimization mechanism, the transfer-critic network and the local-critic network are updated independently. Each critic network maintains its own parameters and is trained to minimize its own temporal-difference error. The loss functions of the two critic networks are defined as:
$$L_{cri} = \mathbb{E}\left[ \left( R_{cri} - V_{cri}(s_t; \theta_{v,cri}) \right)^{2} \right]$$
$$L_{Lcri} = \mathbb{E}\left[ \left( R_{Lcri} - V_{Lcri}(s_t; \theta_{v,Lcri}) \right)^{2} \right]$$
In the decoupled double-critic optimization mechanism, the two critic networks are updated independently with distinct learning dynamics. The transfer-critic network, while initialized with source-domain weights, is not frozen at any point during the target-domain adaptation phase. Instead, gradients flow continuously through this network during backpropagation. However, its parameter updates are constrained by a reduced learning rate to preserve the foundational value assessment from the source domain and avoid catastrophic forgetting. In contrast, the local-critic network is trained from random initialization with a standard learning rate, enabling it to rapidly capture the residual dynamics unique to the target domain. The actor network is updated using the average advantage of the two critic networks to mitigate policy evaluation bias. In this design, the local-critic network independently learns the specific dynamics of the target domain, while the transfer-critic network provides shared knowledge from the source domain, which enhances robustness and accelerates adaptation in the new environment. The updates of the actor and critic networks are expressed as:
$$d\theta \leftarrow d\theta + \nabla_{\theta} \log \pi(a_t \mid s_t; \theta) \left[ \frac{R_{cri} + R_{Lcri}}{2} - \frac{V_{cri}(s_t; \theta_{v,cri}) + V_{Lcri}(s_t; \theta_{v,Lcri})}{2} \right]$$
$$d\theta_v \leftarrow d\theta_v + \partial \left( \frac{R_{cri} + R_{Lcri}}{2} - \frac{V_{cri}(s_t; \theta_{v,cri}) + V_{Lcri}(s_t; \theta_{v,Lcri})}{2} \right)^{2} \Big/ \partial \theta_v$$
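The bracketed term in the actor update is the average of the two critics' advantage estimates. The short Python sketch below illustrates that quantity with scalar stand-ins; the function name `averaged_advantage` and the numbers are hypothetical, and a real implementation would operate on network outputs and gradients rather than plain floats.

```python
def averaged_advantage(r_cri: float, r_lcri: float, v_cri: float, v_lcri: float) -> float:
    """Mean k-step return of both critics minus the mean of their value estimates."""
    return (r_cri + r_lcri) / 2.0 - (v_cri + v_lcri) / 2.0


# Hypothetical scalars for illustration only.
r_cri, r_lcri = -11.85375, -6.7095   # k-step returns from the two critics
v_cri, v_lcri = -12.0, -7.0          # current value estimates V_cri(s_t), V_Lcri(s_t)

advantage = averaged_advantage(r_cri, r_lcri, v_cri, v_lcri)

# Equivalent view: the mean of the per-critic advantages.
per_critic_mean = ((r_cri - v_cri) + (r_lcri - v_lcri)) / 2.0
```

Because the combination is linear, averaging the returns and values first gives the same result as averaging the two per-critic advantages.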

3.4. Metrics of TL

To evaluate the transfer performance of the proposed TMDRL framework, evaluation metrics are defined for the energy management of smart buildings, as shown in Figure 5. Four metrics are used in [26] to evaluate the benefits of a transfer strategy: total reward, jumpstart, convergence efficiency, and number of outliers. The performance of the TMDRL framework is compared with the baseline to verify whether TL is beneficial for the energy management of smart buildings in the target domain. For energy management based on TL, economy and robustness need to be further considered. Therefore, the evaluation metrics are redefined as follows.

3.4.1. Total Reward

The total reward aggregates the operating cost, thermal comfort, and range anxiety over the energy management period.

3.4.2. Jumpstart

The initial performance of the agent before learning, that is, the total reward of the agent before target-domain training.

3.4.3. Convergence Efficiency

Convergence efficiency is defined as the number of training episodes required for total reward convergence. In this paper, the convergence is judged by the distance between two adjacent iteration points, as follows:
$$\left| \bar{R}_{i+1} - \bar{R}_{i} \right| \leq \chi$$
The threshold χ is set to 1.0. Convergence is acknowledged only when the absolute difference between the average total rewards of two consecutive episodes satisfies this condition for 50 consecutive episodes.
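The convergence rule can be sketched in a few lines of Python. This is one possible reading of the criterion, not the authors' implementation: the function name `convergence_episode` is hypothetical, the comparison is assumed to be non-strict, and the function is assumed to report the first episode of the stable run rather than the episode at which the run is confirmed.

```python
from typing import Optional, Sequence


def convergence_episode(avg_rewards: Sequence[float], chi: float = 1.0,
                        patience: int = 50) -> Optional[int]:
    """Return the 0-based index of the first episode of a stable run, or None.

    A run is stable once `patience` consecutive differences between adjacent
    average rewards are all within `chi` (assumed non-strict comparison).
    """
    streak = 0
    for i in range(1, len(avg_rewards)):
        if abs(avg_rewards[i] - avg_rewards[i - 1]) <= chi:
            streak += 1
            if streak >= patience:
                return i - patience  # first episode of the stable run
        else:
            streak = 0  # a single large jump resets the count
    return None


# Synthetic reward curve for illustration: a ramp followed by a plateau.
curve = [-100.0 + 10.0 * j for j in range(10)] + [-10.0] * 60
converged_at = convergence_episode(curve)
too_short = convergence_episode([0.0] * 10)
```

Requiring 50 consecutive small differences guards against declaring convergence on a brief flat stretch in an otherwise noisy reward curve.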

3.4.4. Number of Outliers

The robustness of a transfer strategy is quantified by the number of outliers. Specifically, an outlier is defined as an evaluation episode where the cumulative reward deviates from the mean by more than two standard deviations.
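A minimal sketch of this outlier count is given below. The function name `count_outliers` is hypothetical, the population standard deviation is an assumption (the text does not say whether the sample or population form is used), and the reward list is synthetic.

```python
import statistics
from typing import Sequence


def count_outliers(episode_rewards: Sequence[float], n_std: float = 2.0) -> int:
    """Count episodes whose reward is more than n_std standard deviations from the mean."""
    mean = statistics.fmean(episode_rewards)
    std = statistics.pstdev(episode_rewards)  # population std: an assumption
    return sum(1 for r in episode_rewards if abs(r - mean) > n_std * std)


# Synthetic evaluation rewards for illustration: 19 typical episodes and one bad one.
rewards = [-20.0] * 19 + [-60.0]
n_out = count_outliers(rewards)
outlier_rate = 100.0 * n_out / len(rewards)  # expressed as a percentage of episodes
```

Reporting the count as a percentage of evaluation episodes matches how the number of outliers is quoted in the ablation results.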

4. Case Study

4.1. Simulation Setup

Real-world datasets are used to construct the energy management environment, including electricity prices, temperature, PV generation and energy demand. Electricity prices, including time-of-use (TOU) prices and feed-in tariffs (FiT), are obtained from [27,28], with data available at the Energy Australia website. PV generation and smart building energy demand are sourced from the Ausgrid solar home electricity data [29]. Outdoor temperature data is obtained from the Australian government’s Bureau of Meteorology dataset [30]. To ensure data integrity and algorithmic stability, missing values caused by sensor anomalies are linearly interpolated, and all continuous environmental state variables are subsequently normalized using Min–Max scaling. The energy demand, PV generation and outdoor temperature data cover 1 year, and the data of the previous day is used as the state information for the interaction between the DRL agents and the environment. For the source domain, data from 1 January 2020 to 29 February 2020 is used. For the transfer to different time scenarios, the target domain uses data from 1 March 2020 to 31 May 2020. For the transfer to different area scenarios, the data period remains as 1 January 2020 to 31 May 2020, but EV arrival and departure times are sampled from different distributions to simulate different areas. In the scenario of transfer to different areas, the departure time of EVs is sampled from N(16,12,15,17), and the arrival time is sampled from N(7,12,5,9). The time-series data is split in chronological order rather than randomly. For each month, the first 70% of the data in chronological order is used as the training set, and the remaining 30% is used as the testing set. The source domain and the target domain are isolated in the time domain, preventing any data leakage. In the scenario evaluating the transfer to different times, the KL divergence is 0.38, and the average correlation shift is 0.15. 
In the scenario evaluating the transfer to different areas, the KL divergence and the average correlation shift increase to 0.76 and 0.42, respectively. The initial observations of smart electrical appliances are randomly sampled 4000 times and 1000 times to form a training set and a test set, respectively. The operational parameters of electrical appliances are shown in Table 1. The parameters of the HVAC, ES systems and deferrable appliances are taken from [31,32]. EV driving data is selected from the 2017 National Household Survey [33], which includes travel areas, arrival and departure times. The arrival time, departure time and state of charge (SOC) of EVs are sampled from normal distributions.
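The preprocessing steps described above (linear interpolation of missing values, Min–Max scaling, and a chronological 70/30 split) can be sketched as follows. All function names and the sample series are hypothetical. Two details are assumptions not stated in the text: gaps at the edges of a series copy the nearest observed value, and the scaling bounds are computed over the whole series before splitting.

```python
from typing import List, Optional, Sequence, Tuple


def interpolate_missing(values: Sequence[Optional[float]]) -> List[float]:
    """Linearly interpolate None entries between their nearest observed neighbours."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("need at least one observed value")
    filled: List[float] = []
    for i, v in enumerate(values):
        if v is not None:
            filled.append(v)
            continue
        left = max((j for j in known if j < i), default=None)
        right = min((j for j in known if j > i), default=None)
        if left is None:        # leading gap: copy the next observed value (assumption)
            filled.append(values[right])
        elif right is None:     # trailing gap: copy the last observed value (assumption)
            filled.append(values[left])
        else:
            weight = (i - left) / (right - left)
            filled.append(values[left] + weight * (values[right] - values[left]))
    return filled


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Scale values to [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def chronological_split(series: Sequence[float],
                        train_fraction: float = 0.7) -> Tuple[List[float], List[float]]:
    """Keep time order: the earliest part trains, the latest part is held out."""
    cut = int(round(len(series) * train_fraction))
    return list(series[:cut]), list(series[cut:])


# Hypothetical hourly readings with sensor dropouts marked as None.
raw = [1.0, None, 3.0, None, None, 6.0]
filled = interpolate_missing(raw)
scaled = min_max_scale(filled)
train, held_out = chronological_split(scaled)
```

In the paper the split is applied per month; the sketch shows a single series, so a per-month version would call `chronological_split` once for each month's data.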
The gradients of the neural networks in TMDRL are computed by backpropagation, and the network weights are updated by the Adam optimizer. The actor and critic networks comprise three fully connected layers with 128 neural units per layer. The activation function is ReLU. The actor network uses a learning rate of 0.00001. For the decoupled critic networks in the target domain, the transfer-critic network is fine-tuned with a reduced learning rate of 0.00001, while the local-critic network is trained with a learning rate of 0.0001. This 1 to 10 learning rate ratio is selected based on empirical validation to resolve the trade-off between catastrophic forgetting and slow adaptation during the transfer process. The reduced learning rate of the transfer-critic network prevents the catastrophic forgetting of the generalized source-domain features, whereas the standard learning rate of the local-critic network ensures rapid adaptation to the unique temporal or spatial dynamics of the target domain. A lower learning rate ratio restricts the exploration capability of the local-critic network, leading to slow adaptation, while a higher ratio causes excessive gradient updates to the transfer-critic network, destroying the pre-trained prior knowledge. The update factor and discount factor are 0.001 and 0.95, respectively. The sizes of the mini-batch and the experience replay buffer are 256 and 10,000, respectively. The simulation environment is built in Python 3.9 with TensorFlow 2.10. The clip parameter is 0.2. In this paper, eight parallel threads are used for environment interaction. The selection of eight threads is based on the simulation setup of the experimental platform. This configuration allows full utilization of the CPU resources while minimizing the overhead associated with inter-thread communication and context switching, thereby balancing sampling efficiency and training stability. Each thread collects trajectories independently and computes gradients locally. 
Every 100 steps, gradients from all threads are aggregated asynchronously to update the global actor and critic networks. This asynchronous update mechanism reduces idle time and enhances sampling efficiency, while periodic synchronization ensures policy consistency across threads. The computer is configured with an Intel Core i5-3470 3.20 GHz CPU and 16 GB RAM. In addition, four benchmark methods are compared against TMDRL in the target domain.
  • Asynchronous Advantage Actor–Critic (A3C)
The A3C algorithm divides the training process into multiple parallel threads. The Actor–Critic method and the asynchronous update mechanism are combined to train the agent [34].
  • Deep Deterministic Policy Gradient (DDPG)
Deep Q-learning and deterministic policy methods are combined to deal with continuous action space control tasks [35].
  • Soft Actor–Critic (SAC)
The SAC algorithm is an off-policy algorithm based on the maximum entropy RL method, enhancing stability in high-dimensional tasks [36].
  • Distributed Proximal Policy Optimization (DPPO)
The DPPO algorithm extends PPO with a distributed architecture, leveraging multiple parallel explorations to collect trajectories and update the policy [37].
To ensure a rigorous and fair comparative evaluation, the baseline settings are explicitly defined. All benchmark methods are implemented with internal network scales identical to the proposed framework. Specifically, the actor and critic networks across all evaluated methods uniformly comprise three fully connected hidden layers with 128 neurons per layer. However, their network configurations adhere to their respective standard algorithmic designs. Explicitly, A3C and DDPG are implemented utilizing standard single-critic architectures, whereas SAC employs a symmetric twin-critic architecture to mitigate overestimation bias. These baseline structures differ from our proposed framework, which employs an asymmetric decoupled double-critic mechanism to mitigate policy evaluation bias during cross-domain transfer. To ensure the statistical rigor of the comparative evaluation, all benchmark methods are independently executed across 10 different random seeds.
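For reference, the TMDRL hyperparameters reported in this section can be gathered into a single configuration object. The sketch below is only a convenient summary: the class name `TMDRLConfig` and all field names are hypothetical, while the values are the ones stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TMDRLConfig:
    """Hyperparameters as reported in Section 4.1 (field names are illustrative)."""
    hidden_layers: int = 3
    hidden_units: int = 128
    actor_lr: float = 1e-5
    transfer_critic_lr: float = 1e-5   # reduced rate to limit catastrophic forgetting
    local_critic_lr: float = 1e-4      # standard rate for fast target-domain adaptation
    update_factor: float = 0.001
    discount_factor: float = 0.95
    batch_size: int = 256
    replay_buffer_size: int = 10_000
    clip_parameter: float = 0.2
    num_threads: int = 8
    sync_interval_steps: int = 100

    @property
    def critic_lr_ratio(self) -> float:
        """Transfer-critic rate divided by local-critic rate (1 to 10 in the paper)."""
        return self.transfer_critic_lr / self.local_critic_lr


config = TMDRLConfig()
```

Exposing the learning rate ratio as a derived property makes it easy to check that a modified configuration still respects the intended 1 to 10 relationship.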

4.2. Heterogeneous Exploration Architecture Training Performance

To validate the transferability of the proposed method, we set up a knowledge transfer experiment for the energy management strategy, including different times and areas. The management strategy is built on the multi-layer neural network, and different network layers learn different features. In the multi-layer neural network, the general features in the source domain and the target domain are captured by shallow layers, and the specific features in the target domain are captured by deep layers. Therefore, the amount of captured features in different layers is significant for the transferred management strategy. In the experiments, four transfer layer tests are evaluated, including transferring the first layer, transferring the first two layers, transferring the first three layers, and transferring all layers. Subsequently, the performance of the transferred management strategy is compared with the baseline method. Moreover, the four defined evaluation metrics are used for further comparison in the simulations.
In the training stage, the performance of the proposed heterogeneous exploration architecture and single-thread DRL is compared, as shown in Figure 6. The management strategy is generated by sufficient data training in the source domain, and the end of the training process depends on the convergence of the episode reward. Compared with single-thread DRL, multiple threads explore different regions of the state-action space simultaneously, which increases the diversity of collected experiences and reduces the variance of gradient estimates during policy updates. In addition, the asynchronous update mechanism allows the global network to be updated without waiting for all threads to complete their local trajectories, thereby accelerating the policy iteration process. The aggregated experiences from different exploration trajectories help the agent escape local optima more efficiently than a single exploration trajectory. Compared with single-thread DRL, the convergence efficiency of multi-thread DRL is improved by 34.89%.

4.3. Decoupled Double-Critic Mechanism Performance

To verify the effectiveness of the proposed decoupled double-critic mechanism in reducing the policy evaluation bias, the loss curves of the double-critic network and the single-critic network during target-domain training are compared, as shown in Figure 7. The fast learning ability of the local-critic network for the unique features of the target domain complements the general source-domain knowledge provided by the transfer-critic network. Compared with a single-critic network, the convergence speed of the double-critic network is improved by 45.11%. The decoupled double-critic mechanism mitigates the instability caused by the mismatch between the assumed source-domain environment and the actual target-domain dynamics by dispersing the risk of evaluation bias. The loss of the double-critic network stays below that of the single-critic network throughout the entire training cycle. The final loss value of the double-critic network remains stable at around 0.10, which is about 33% lower than the 0.15 of the single-critic network. This indicates that it evaluates the long-term value of the policy in the target domain more accurately, thereby guiding the policy to update in a better direction. In addition, in the early stage of training, the loss of the double-critic network decreases faster than that of the single-critic network. This is because the local-critic network learns directly from the target-domain interaction data, avoiding the initial bias that may exist in the transfer-critic network, thereby accelerating the early exploration and feature extraction process.

4.4. Novelty-Based Filtering Mechanism Performance

To verify the effectiveness of the proposed novelty-based filtering mechanism, ablation experiments are performed and the results are shown in Table 2. Removing the novelty-based filtering mechanism from the TMDRL framework reduces the jumpstart by 13.64%, which indicates that the pre-training knowledge base constructed by novelty filtering provides more effective initialization for the target domain. Compared to the TMDRL framework with a mean threshold, the proposed dynamic median threshold improves the jumpstart to −225.14 and reduces the number of outliers from 2.35% to 1.84%. This demonstrates that the median threshold is more robust against outlier states in the source domain, preventing them from skewing the filtering criteria. Without the filtering mechanism, the number of outliers and the convergence efficiency (the number of episodes required to converge) increase by 1.97% and 34.88%, respectively. These results show that the proposed novelty-based filtering mechanism broadens the coverage of the state space and enhances the transferability of the learned knowledge by selectively storing transitions with high information content, thereby improving the overall performance of the TMDRL framework.
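To make the filtering idea concrete, the sketch below stores a state only when its novelty score exceeds a dynamic median threshold. It is an illustration under stated assumptions rather than the paper's mechanism: the class name `NoveltyFilteredBuffer` is hypothetical, the novelty score is assumed to be the Euclidean distance to the nearest stored state (the paper's own score is defined in its Equation (25), which is not reproduced here), and only states are stored instead of full transitions.

```python
import math
import statistics
from typing import List, Sequence


class NoveltyFilteredBuffer:
    """Keep a state only if it is more novel than the median of past novelty scores."""

    def __init__(self, capacity: int = 10_000) -> None:
        self.capacity = capacity
        self.states: List[Sequence[float]] = []
        self.scores: List[float] = []  # novelty scores of every state offered so far

    def novelty(self, state: Sequence[float]) -> float:
        # Assumed score: distance to the nearest stored state (infinite if empty).
        if not self.states:
            return math.inf
        return min(math.dist(state, s) for s in self.states)

    def threshold(self) -> float:
        # Dynamic median threshold over the finite scores seen so far.
        finite = [x for x in self.scores if math.isfinite(x)]
        return statistics.median(finite) if finite else 0.0

    def offer(self, state: Sequence[float]) -> bool:
        score = self.novelty(state)
        tau = self.threshold()
        self.scores.append(score)
        if score > tau:
            if len(self.states) >= self.capacity:
                self.states.pop(0)  # drop the oldest entry when full
            self.states.append(state)
            return True
        return False


buffer = NoveltyFilteredBuffer()
decisions = [buffer.offer(s) for s in [(0.0, 0.0), (0.0, 0.0), (3.0, 4.0), (0.1, 0.0)]]
```

A median is less sensitive than a mean to a few extreme novelty scores, which is the robustness argument made for the dynamic median threshold in the ablation.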
To verify the effectiveness of the proposed decoupled double-critic mechanism, ablation experiments are performed and the results are shown in Table 3. The results show that using the minimum selection strategy in cross-domain transfer causes the randomly initialized local-critic network to drag down the robust value estimates of the pre-trained transfer-critic network. The value is consequently underestimated: the jumpstart drops to −278.55, the converged total reward drops to −21.31, and the convergence efficiency slows to 845 episodes. The maximum selection strategy leads to rapid overestimation and policy instability, which increases the number of outliers to a peak of 3.94%. Compared with these two critic combination strategies, the proposed method has the highest total reward of −18.47 and the lowest number of outliers of 1.84%. This shows that the proposed strategy achieves the best balance among the tested strategies between the established source-domain prior knowledge and the dynamic target-domain exploration, and effectively overcomes the limitations of the critic combination methods in the transfer process.
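The three critic combination strategies compared in this ablation differ only in how the two value estimates are merged. The sketch below shows the three rules side by side; the function name `combine_values` and the example estimates are hypothetical and are not taken from Table 3.

```python
def combine_values(v_transfer: float, v_local: float, strategy: str = "mean") -> float:
    """Merge the transfer-critic and local-critic estimates with one of three rules."""
    if strategy == "mean":   # averaging, as in the proposed mechanism
        return 0.5 * (v_transfer + v_local)
    if strategy == "min":    # pessimistic selection
        return min(v_transfer, v_local)
    if strategy == "max":    # optimistic selection
        return max(v_transfer, v_local)
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical early-training estimates: the randomly initialized local critic is far off.
v_transfer, v_local = -12.0, -30.0
combined = {s: combine_values(v_transfer, v_local, s) for s in ("mean", "min", "max")}
```

With a poorly calibrated local critic, the minimum rule follows the worse estimate entirely, whereas the mean only moves halfway toward it.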

4.5. Transfer to Different Times

To evaluate the impact of layer-wise transfer depth and identify the optimal transfer architecture, ablation experiments are performed on different transfer scenarios. Figure 8 shows the comparison of the total rewards of transferring layers and the baseline in the target domain. The baseline uses the same actor–critic network architecture as the proposed TMDRL, which consists of three fully connected layers with 128 neurons per layer and ReLU activation, but with a single-critic network. This ensures that the comparison isolates the effect of transfer learning itself. The baseline is trained under the identical environmental conditions as the proposed TMDRL, including the same dataset splits and the same number of training episodes. Its hyperparameters are tuned using the same grid search process, and the search range of the learning rate, batch size and discount factor is the same as that of TMDRL. The energy management baseline of transferring to different times is trained under the target dataset. Compared with the baseline, the proposed method utilizes the pre-trained energy management knowledge in the source domain to accelerate the management strategy learning process in the target domain. Although reducing environmental exploration, the proposed method uses fine-tuning techniques to maintain the same level of total rewards as the baseline.
The performance evaluation of transferring to different times is shown in Table 4. The transfer layer methods use pre-trained parameters as the starting point instead of random initialization, and the jumpstart is improved by 32.70%, 33.24%, 49.92%, and 41.63%. The baseline converges at the 2671st episode and remains stable thereafter. Compared with the baseline, the transfer layer methods learn the feature knowledge common to the source domain and the target domain, and the convergence efficiency is improved by 53.91%, 60.99%, 78.21% and 74.43%. Lower layers in DRL usually learn cross-domain general features. Higher layers tend to encode domain-specific strategies and complex logical abstractions. These strategies are highly sensitive to specific constraints, so transferring higher layers to target domains with different distributions may introduce negative transfer. The method of transferring the first three layers encapsulates sufficient shared features between the source domain and the target domain, provides effective initialization, and improves initial performance and convergence efficiency. Compared with the other transfer layer methods, its convergence efficiency is improved by 52.72%, 44.15% and 14.79%.
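The last set of percentages follows from the earlier ones, which the short check below illustrates. The episode counts are inferred here from the baseline's 2671 episodes and the reported improvements; they are not read from Table 4, and the variable names are hypothetical.

```python
BASELINE_EPISODES = 2671

# Convergence-efficiency improvements over the baseline, as reported in the text (%).
improvement_vs_baseline = {
    "first layer": 53.91,
    "first two layers": 60.99,
    "first three layers": 78.21,
    "all layers": 74.43,
}

# Inferred episodes to convergence for each method (rounded to whole episodes).
episodes = {
    name: round(BASELINE_EPISODES * (1.0 - pct / 100.0))
    for name, pct in improvement_vs_baseline.items()
}

# Improvement of the three-layer method relative to each other transfer method (%).
best = episodes["first three layers"]
gain_vs_others = {
    name: 100.0 * (1.0 - best / count)
    for name, count in episodes.items()
    if name != "first three layers"
}
```

Under this reading, the inferred counts are 1231, 1042, 582 and 683 episodes, and the resulting relative gains agree with the 52.72%, 44.15% and 14.79% quoted above.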
To further analyze the impact of seasonal changes on the performance of the proposed method, we set up a scenario that transfers from winter to summer, and the performance evaluation results are shown in Table 5. When the strategy is transferred to a different season, the transfer layer methods perform better than the baseline. As the number of transferred layers increases from one to three, the performance gradually improves. The results show that all the transfer layer methods are superior to the baseline in most metrics. Among the four transfer layer methods, the method of transferring the first three layers achieves the best overall performance. Compared to the baseline, the method of transferring the first three layers improves jumpstart by 49.9%, reduces the number of outliers by 28.7%, improves convergence efficiency by 78.2% and improves total reward by 30.9%. However, when all layers are transferred, the performance declines. This indicates the occurrence of negative transfer, as the deep neural network layers become overfitted to the specific seasonal operation patterns and hinder exploration in the changing seasonal environment. These results demonstrate that the proposed TMDRL framework remains robust and adaptable under seasonal changes.

4.6. Transfer to Different Areas

To further explore the transferability of the common knowledge, energy management is studied in different areas. The variables of the source domain and the target domain differ more than in the tasks of Section 4.5. The simulation results in Figure 9 show that the transfer layer method is a feasible solution for transferring the energy management strategy between different areas. Compared with the baseline, the neural network obtained through knowledge transfer has a higher jumpstart, and the convergence efficiency is improved by 50.70%, 51.83%, 60.88%, and 53.64%.
The simulation results show that the proposed method transfers the neural network trained on the source-domain dataset to the target domain and achieves higher performance in the target domain. Table 6 further shows that the neural network in the method of transferring the first three layers retains enough flexibility to learn the features of the target domain, while avoiding the lack of prior knowledge caused by transferring only lower layers and the overfitting to source-domain features caused by transferring all layers. This means that transferring the trained knowledge of the energy management strategy effectively avoids the repetitive model training otherwise caused by differences in electricity consumption habits across areas.

4.7. Energy Management Results

In this case, the energy management performance of the smart building based on the TMDRL method is tested. As shown in Figure 10a, the charging and discharging of the EV is guided by the electricity price. The EV is charged when the electricity price is low from 0:00 to 06:00, and releases energy to other appliances when the electricity price is high from 18:00 to 21:00. The SOC histogram of the EV in Figure 10d shows that the EV has a high SOC level before departure to meet travel demand, which indicates that the driver’s range anxiety is alleviated. The power and SOC of the ES are shown in Figure 10b,e. The ES is charged at 10:00–16:00 when the PV generation is high, and discharged at 24:00–06:00 when there is almost no PV generation. The ES thus follows PV generation, storing energy and releasing it to other appliances. The EV and ES management results show that the proposed method stores energy when the electricity price is low and the PV generation is high, and releases energy when the electricity price is high and the PV generation is low, which reduces the operating cost. Figure 10c compares the indoor and outdoor temperatures. The HVAC system is optimized to increase the indoor temperature when the outdoor temperature is below the minimum comfort temperature to maintain thermal comfort. The results of smart building energy trading are shown in Figure 10f. The smart building buys energy at night and sells energy during the day to maximize operational economy.

4.8. Performance Comparison with Benchmarks

In this case, the performance of the proposed method is compared with that of the benchmarks. To ensure experimental fairness during the comparison process, all benchmarks are optimized using a grid search on a validation set comprising 10% of the training data. The search range includes learning rate from 0.00001 to 0.001, batch size from 64 to 256 and discount factor from 0.90 to 0.99. The final hyperparameters are selected based on the highest total reward achieved on the validation set. The total rewards of the five algorithms are shown in Figure 11. In the initial stage of training, the proposed method uses the strategy trained in the source domain to accelerate the learning process. Compared with the benchmarks, the jumpstart of the proposed method is higher. SAC and A3C are based on a stochastic policy exploration mechanism, and their jumpstarts are lower than that of DDPG, which is based on a deterministic policy. Although DPPO utilizes distributed parallel exploration to enhance training stability, its jumpstart remains lower than that of the proposed method due to the lack of transferred prior knowledge. The proposed method has the highest total reward and the fastest convergence speed among all compared methods, which indicates that the TL-based method utilizes common knowledge to optimize the energy of the smart building.
To further compare the performance of the proposed method and the benchmarks, a test set containing 100 days is evaluated. For each method, 10 independent runs are performed using different random seeds to ensure the reliability of the statistics. A Wilcoxon signed-rank test is employed to compute the statistical significance between the proposed TMDRL framework and the benchmarks. To control the family-wise error rate due to multiple comparisons, we applied the Bonferroni correction with a significance threshold of 0.0125. The 95% confidence intervals (CIs) are reported in Table 7. Among the five methods, the TMDRL method has the highest total reward and lowest daily total cost, which indicates that it has a better optimization strategy. The decoupled double-critic optimization mechanism requires an additional local-critic network. The number of parameters of the proposed TMDRL method is 144 K, which is 42.17% lower than that of SAC. Based on the hybrid constraint approach, the constraint violation rate of TMDRL is 1.13%, enhancing the model’s ability to comply with constraints. In terms of computational efficiency, the convergence time of TMDRL is 41.2 min, shorter than that of the other methods. Compared with A3C, DDPG, SAC and DPPO, the convergence efficiency of TMDRL is improved by 29.55%, 22.89%, 32.84% and 21.16%, respectively, verifying the effectiveness and applicability of the proposed method in smart building energy management.
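The statistical procedure can be illustrated with a small, dependency-free sketch: a Bonferroni-corrected threshold and an exact two-sided Wilcoxon signed-rank test computed by enumerating all sign assignments, which is feasible for 10 paired runs. The function names and the paired rewards are hypothetical, and the family-wise level of 0.05 is an assumption inferred from the reported threshold of 0.0125 with four benchmark comparisons. The paper does not state which implementation was used.

```python
from itertools import product
from typing import List, Sequence


def bonferroni_threshold(alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance level under the Bonferroni correction."""
    return alpha / n_comparisons


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for j in range(pos, end + 1):
            ranks[order[j]] = (pos + end) / 2.0 + 1.0
        pos = end + 1
    return ranks


def wilcoxon_signed_rank_exact(x: Sequence[float], y: Sequence[float]) -> float:
    """Exact two-sided p-value; zero differences are dropped before ranking."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    if not diffs:
        return 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    w_obs = sum(r for r, d in zip(ranks, diffs) if d > 0)
    total = lower = upper = 0
    for signs in product((0, 1), repeat=len(ranks)):  # every possible sign pattern
        w = sum(r for r, s in zip(ranks, signs) if s)
        total += 1
        lower += w <= w_obs + 1e-12
        upper += w >= w_obs - 1e-12
    return min(1.0, 2.0 * min(lower, upper) / total)


# Hypothetical paired total rewards over 10 seeds (not the paper's results).
tmdrl_rewards = [-18.0 - 0.5 * i for i in range(10)]
baseline_rewards = [-25.0 - 1.0 * i for i in range(10)]

threshold = bonferroni_threshold(0.05, 4)  # assumed family-wise level of 0.05
p_value = wilcoxon_signed_rank_exact(tmdrl_rewards, baseline_rewards)
significant = p_value < threshold
```

With 10 paired runs, the smallest attainable two-sided p-value is 2/1024, so a corrected threshold of 0.0125 can still be met when one method wins on every seed.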

5. Conclusions

This paper addresses the sustainable energy management problem for smart buildings across different environments, aiming to overcome the challenges of spatiotemporal applicability limitations and prolonged strategy development cycles. To this end, a TMDRL framework integrating heterogeneous exploration architecture and a decoupled double-critic optimization mechanism is proposed. Through a novelty-based filtering mechanism, the heterogeneous exploration architecture constructs a novelty-based experience replay buffer that suppresses overfitting to specific operational patterns in the source-domain, strengthening the generalization robustness of the learned representations. The decoupled double-critic optimization mechanism mitigates the policy evaluation bias in the cross-domain transfer by introducing a local-critic network that independently learns the target domain and a transfer-critic network that provides generalized source-domain prior knowledge. The results show that the proposed framework effectively learns the sustainable energy management strategy through knowledge transfer. Compared with the baseline, the proposed method shortens the development cycle of the management strategy in different times, seasons and areas, and the convergence efficiency is improved by 78.21%, 77.15% and 60.88%, respectively. By enabling rapid adaptation to new environments without extensive retraining, the proposed method accelerates the deployment of sustainable energy management solutions and improves the utilization of renewable energy and the flexibility of the system.
In the future, we will study the knowledge transfer between different tasks under privacy protection to enhance the security of energy management schemes. In addition, while this study relies on real-world datasets to validate the proposed framework, physical deployment introduces unmodeled dynamics such as hardware execution latency, sensor noise, and unpredictable human interactions. To address this limitation and track the real behavior of the intelligent system, we plan to build a real smart building energy management experimental platform. The test bench will feature integrated rooftop photovoltaics, a battery energy storage system, and an operating electric vehicle charging station. By evaluating the real-time system dynamics of the proposed framework under actual physical constraints such as HVAC thermal response delay, we will analyze its robustness to physical sensor noise and verify whether the decoupled double-critic optimization mechanism maintains its performance advantages under real-world conditions.

Author Contributions

Conceptualization, J.F. and J.H.; methodology, J.F., J.H. and Q.S.; software, J.F. and J.H.; validation, J.F., Q.S. and J.H.; formal analysis, J.F., Q.S. and J.H.; investigation, J.H. and Q.S.; resources, J.H. and Q.S.; data curation, J.F. and J.H.; writing—original draft preparation, J.F. and J.H.; writing—review and editing, J.H. and Q.S.; visualization, J.F., J.H. and Q.S.; supervision, Q.S.; project administration, J.F., J.H. and Q.S.; funding acquisition, J.F., J.H. and Q.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China under Grant 62503340, 62433013 and Grant U25B20190, the Liaoning Young Elite Scientists Sponsorship Program, the Liaoning Provincial Department of Education Basic Research Project under Grant LJ212510142036, the Liaoning Province Science and Technology Plan Joint Program Project under Grant 2025-BSLH-279 and Liaoning Provincial Department of Science and Technology Basic Research Project under Grant 2024-MSLH-356.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to express their sincere gratitude to the editor and the anonymous reviewers for their invaluable time, insightful comments, and constructive suggestions. Their efforts have significantly improved the quality of this paper. We also extend our heartfelt thanks to all the co-authors for their collaborative efforts, intellectual contributions, and dedicated work throughout the development of this research.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Nomenclature

The following abbreviations are used in this manuscript:
TMDRL: Transferred multi-thread deep reinforcement learning
DRL: Deep reinforcement learning
TL: Transfer learning
DERs: Distributed energy resources
HVAC: Heating, ventilation, and air conditioning
EV: Electric vehicle
ES: Energy storage
PV: Photovoltaic
MDP: Markov decision process
SOC: State of charge
TOU: Time-of-use
A3C: Asynchronous advantage actor–critic
DDPG: Deep deterministic policy gradient
SAC: Soft actor–critic
$t$, $\Delta t$: Time slot index and duration
$T^{BI}$, $T^{BO}$: Indoor and outdoor temperature
$T_{min}$, $T_{max}$: Minimum and maximum comfort temperature
$R_{tr}$, $C_{tc}$: Equivalent thermal resistance and capacitance
$P^{PV}$: Power generation of PV
$\varphi_t$: Inherent randomness of the environment
$P^{HVAC}$: Power consumption of HVAC system
$P_{max}^{HVAC}$: Rated power of HVAC system
$E^{EV}$, $E^{ES}$: Energy stored in EV and ES
$P_{c}^{EV}$, $P_{d}^{EV}$: Charging and discharging power of EV
$P_{c}^{ES}$, $P_{d}^{ES}$: Charging and discharging power of ES
$\eta_{c}^{EV}$, $\eta_{d}^{EV}$: Charging and discharging efficiency of EV
$\eta_{c}^{ES}$, $\eta_{d}^{ES}$: Charging and discharging efficiency of ES
$P_{c,max}^{EV}$, $P_{d,max}^{EV}$: Maximum charging and discharging power of EV
$P_{c,max}^{ES}$, $P_{d,max}^{ES}$: Maximum charging and discharging power of ES
$E_{max}^{EV}$, $E_{max}^{ES}$: Maximum battery capacity of EV and ES
$E_{d}^{EV}$, $E_{e}^{EV}$: Required and expected energy of EV at departure
$P^{DL}$, $B^{DL}$: Power consumption and binary activation status of deferrable load
$t_{s}^{DL}$, $t_{e}^{DL}$, $t_{dur}^{DL}$: Start, end, and duration time of deferrable load
$\lambda^{EP}$, $\lambda^{ES}$: Electricity purchasing price and selling price
$\sigma^{DL}$, $\sigma^{ES}$: Cost coefficients for deferrable loads and ES
$P^{EG}$: Traded electrical power with the grid
$P^{PE}$, $P^{SE}$: Power purchased from and sold to the grid
$S^{B}$, $A^{B}$: State set and action set
$A^{HVAC}$: Power regulation action of HVAC system
$A^{EV}$, $A^{ES}$: Actions of EV and ES
$A^{DL}$: Activation status action of deferrable load
$R^{B}$: Total reward function
$R^{OC}$: Operating cost reward function
$R^{TC}$: Thermal comfort reward function
$R^{RA}$: Range anxiety reward function
$\lambda^{TC}$, $\lambda^{EV}$: Penalty coefficients for thermal comfort and range anxiety
$\alpha_1$, $\alpha_2$, $\alpha_3$: Weight coefficients
$\pi$, $V$: Policy function and value function
$D_{rb}$: Novelty-based experience replay buffer
$N(\cdot)$: Novelty score function of the state
$\theta$, $\theta_v$: Parameters of the policy and critic networks
$\gamma$, $k$: Discount factor and look-ahead steps
$\bar{R}$: Mean total reward
$[\cdot]^{+}$: The projection operator $\max(\cdot, 0)$

Appendix A

Table A1. Pseudocode of the TMDRL framework.
Source-Domain Pre-Training
1: Initialize global actor $\pi_\theta$ and critic $V_\phi$
2: Operate 8 parallel threads asynchronously:
3:     For each step do:
4:         Interact with the source-domain environment, sample an action, and observe $r_t$, $s_{t+1}$
5:         Calculate the novelty score according to Equation (25)
6:         If $N(s_t) > \tau$ then:
7:             Store $(s_t, a_t, r_t, s_{t+1})$ in the novelty-based experience replay buffer
8:         End If
9:         If the current step is a multiple of 100 then:
10:             Sample a mini-batch from the experience replay buffer and update global $\theta$ and $\phi$
11:             Synchronize local threads with global networks
12:         End If
13:     End For
14: Save pre-trained actor and critic network parameters
Target-Domain Strategy Transfer
15: Initialize the actor and transfer critic network with pre-trained parameters
16: Initialize the local critic network with random parameters
17: Initialize the novelty-based experience replay buffer
18: Operate 8 parallel threads asynchronously:
19:     For each step do:
20:         Interact with the target-domain environment, sample an action, and observe $r_t$, $s_{t+1}$
21:         Calculate the novelty score according to Equation (25)
22:         If $N(s_t) > \tau$ then:
23:             Store $(s_t, a_t, r_t, s_{t+1})$ in $D_{rb}$
24:         End If
25:         If the current step is a multiple of 100 then:
26:             Sample a mini-batch from the experience replay buffer
27:             Calculate target R-values according to Equations (31) and (32)
28:             Calculate independent critic losses according to Equations (33) and (34)
29:             Update critics independently
30:             Update the actor network using the clipping probability ratio mechanism
31:             Synchronize local threads with global networks
32:         End If
33:     End For
34: Return optimized target strategy
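
The novelty-gated storage step in Table A1 (steps 5 to 8 and 21 to 24) can be illustrated with a small sketch. Equation (25) is not reproduced in this appendix, so the `novelty_score` function below is a hypothetical stand-in (mean Euclidean distance to the k nearest stored states), and the class name `NoveltyReplayBuffer`, the capacity, and the threshold value are likewise assumptions chosen for illustration rather than the authors' implementation.

```python
import math
import random
from collections import deque


def novelty_score(state, stored_states, k=3):
    """Mean Euclidean distance from `state` to its k nearest stored states.

    Hypothetical stand-in for Equation (25). An empty buffer makes every
    state maximally novel, so the first transition is always accepted.
    """
    if not stored_states:
        return math.inf
    distances = sorted(math.dist(state, other) for other in stored_states)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


class NoveltyReplayBuffer:
    """Replay buffer that keeps only transitions whose state is novel enough."""

    def __init__(self, capacity, threshold):
        self.transitions = deque(maxlen=capacity)  # oldest entries drop out first
        self.threshold = threshold  # plays the role of tau in Table A1

    def maybe_store(self, state, action, reward, next_state):
        """Store the transition if N(s_t) > tau; return whether it was stored."""
        stored_states = [t[0] for t in self.transitions]
        if novelty_score(state, stored_states) > self.threshold:
            self.transitions.append((state, action, reward, next_state))
            return True
        return False

    def sample(self, batch_size):
        """Uniformly sample a mini-batch (smaller if the buffer is short)."""
        batch_size = min(batch_size, len(self.transitions))
        return random.sample(list(self.transitions), batch_size)


# Toy two-dimensional states; values are illustrative only.
buffer = NoveltyReplayBuffer(capacity=1000, threshold=0.5)
first = buffer.maybe_store((0.0, 0.0), 0, -1.0, (0.1, 0.0))   # empty buffer
second = buffer.maybe_store((0.1, 0.0), 1, -1.0, (0.2, 0.0))  # close to (0, 0)
third = buffer.maybe_store((2.0, 2.0), 0, -0.5, (2.1, 2.0))   # far from (0, 0)
batch = buffer.sample(8)
```

In this toy run the second transition is rejected because its state lies within the threshold of an already stored state, which is the filtering effect the pseudocode relies on. The exact score and threshold rule used in the paper may differ.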

References

  1. Nambiar, J.; Yu, S.; Lilley, I.; Makam, J. Coordinating vehicle-to-grid and distributed energy resources in multi-dwelling developments: A real-time gateway control framework. Sustainability 2026, 18, 3861. [Google Scholar] [CrossRef]
  2. Chen, J.; Lu, L. Renewable energy integration and application in buildings for carbon neutrality. Sustainability 2026, 18, 4310. [Google Scholar] [CrossRef]
  3. Chakraborty, S.; Modi, G.; Singh, B. A cost optimized-reliable-resilient-realtime-rule-based energy management scheme for a SPV-BES-based microgrid for smart building applications. IEEE Trans. Smart Grid 2023, 14, 2572–2581. [Google Scholar] [CrossRef]
  4. Sun, Y.; Luo, Z.; Li, Y.; Zhao, T. Grey-box model-based demand side management for rooftop PV and air conditioning systems in public buildings using PSO algorithm. Energy 2024, 296, 131052. [Google Scholar] [CrossRef]
  5. Zheng, Z.; Tang, R.; Luo, X.; Li, H.; Wang, S. A distributed coordination strategy for heterogeneous building flexible thermal loads in responding to smart grids. IEEE Trans. Smart Grid 2024, 15, 1620–1633. [Google Scholar] [CrossRef]
  6. Wang, C.; Wang, B.; You, F. Demand response for residential buildings using hierarchical nonlinear model predictive control for plug-and-play. Appl. Energy 2024, 369, 123581. [Google Scholar] [CrossRef]
  7. Pei, Y.; Yao, Y.; Zhao, J.; Hao, J.; Ding, F.; Wang, J. Multi-agent hierarchical deep reinforcement learning for HVAC control with flexible DERs. IEEE Trans. Smart Grid 2025, 16, 5589–5601. [Google Scholar] [CrossRef]
  8. Yin, Z.; Wang, S.; Zhao, Q. A flexibility scheduling method for distribution network based on robust graph DRL against state adversarial attacks. J. Mod. Power Syst. Clean Energy 2025, 13, 514–526. [Google Scholar] [CrossRef]
  9. Mocanu, E.; Mocanu, D.C.; Nguyen, P.H.; Liotta, A.; Webber, M.E.; Gibescu, M.; Slootweg, J.G. On-line building energy optimization using deep reinforcement learning. IEEE Trans. Smart Grid 2019, 10, 3698–3708. [Google Scholar] [CrossRef]
  10. Tsaousoglou, G.; Efthymiopoulos, N.; Makris, P.; Varvarigos, E. Multistage energy management of coordinated smart buildings: A multiagent markov decision process approach. IEEE Trans. Smart Grid 2022, 13, 2788–2797. [Google Scholar] [CrossRef]
  11. Guo, Y.; Du, C.; Liu, X.; Zhang, X.; Jin, Z. Research on attention-based fault diagnosis and multi-parameter joint optimization of CO2 heat pump system. Appl. Therm. Eng. 2026, 289, 129942. [Google Scholar] [CrossRef]
  12. Zhang, W.; Li, Y. Aggregator-grid interactive building dual-layer price-responsive demand response scheduling based on federated deep reinforcement learning. IEEE Trans. Smart Grid 2025, 16, 1142–1154. [Google Scholar] [CrossRef]
  13. Liu, H.; You, C.; Han, L.; Yang, N.; Liu, B. Off-road hybrid electric vehicle energy management strategy using multi-agent soft actor-critic with collaborative-independent algorithm. Energy 2025, 328, 136463. [Google Scholar] [CrossRef]
  14. Liu, J.; Ma, Y.; Chen, Y.; Zhao, C.; Meng, X.; Wu, J. Multi-agent deep reinforcement learning-based cooperative energy management for regional integrated energy system incorporating active demand-side management. Energy 2025, 319, 135056. [Google Scholar] [CrossRef]
  15. Pan, S.J.; Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 1345–1359. [Google Scholar] [CrossRef]
  16. Incecco, M.D.; Squartini, S.; Zhong, M. Transfer learning for non-intrusive load monitoring. IEEE Trans. Smart Grid 2020, 11, 1419–1429. [Google Scholar] [CrossRef]
  17. Khan, M.; Silva, B.N.; Khattab, O.; Alothman, B.; Joumaa, C. A transfer reinforcement learning framework for smart home energy management systems. IEEE Sens. J. 2023, 23, 4060–4068. [Google Scholar] [CrossRef]
  18. Fang, X.; Gong, G.; Li, G.; Chun, L.; Peng, P.; Li, W.; Shi, X. Cross temporal-spatial transferability investigation of deep reinforcement learning control strategy in the building HVAC system level. Energy 2023, 263, 125679. [Google Scholar] [CrossRef]
  19. Li, H.; Ma, Z.; Weng, Y. A transfer learning framework for power system event identification. IEEE Trans. Power Syst. 2022, 37, 4424–4435. [Google Scholar] [CrossRef]
  20. Ariwoola, R.; Kamalasadan, S. An integrated hybrid thermal dynamics model and energy aware optimization framework for grid-interactive residential building management. IEEE Trans. Ind. Appl. 2023, 59, 2519–2531. [Google Scholar] [CrossRef]
  21. Wang, W.; Tian, G.; Sun, Q.Z.; Liu, H. A control framework to enable a commercial building HVAC system for energy and regulation market signal tracking. IEEE Trans. Power Syst. 2023, 38, 290–301. [Google Scholar] [CrossRef]
  22. Kim, Y.-J. A supervised-learning-based strategy for optimal demand response of an HVAC system in a multi-zone office building. IEEE Trans. Smart Grid 2020, 11, 4212–4226. [Google Scholar] [CrossRef]
  23. Chen, Y.; Lu, J.; Liu, Z.; Peng, P.; Yang, X.; Wu, M. A real-time energy management strategy for sustainable operation of electrified railway grid-source-storage-vehicle system integrating rule and optimization. Sustainability 2026, 18, 3914. [Google Scholar] [CrossRef]
  24. Liu, X.; Tang, D.; Dai, Z. A Bayesian game approach for demand response management considering incomplete information. J. Mod. Power Syst. Clean Energy 2025, 10, 492–501. [Google Scholar] [CrossRef]
  25. Lu, Y.; Zuo, Y. Multiobjective optimization of path planning and communication capacity based on DQN with weighted prioritized experience replay. IEEE Internet Things J. 2025, 12, 53262–53273. [Google Scholar] [CrossRef]
  26. Dridi, J.; Amayri, M.; Bouguila, N. Transfer learning for estimating occupancy and recognizing activities in smart buildings. Build. Environ. 2022, 217, 109057. [Google Scholar] [CrossRef]
  27. Yan, L.; Chen, X.; Chen, Y.; Wen, J. A hierarchical deep reinforcement learning-based community energy trading scheme for a neighborhood of smart households. IEEE Trans. Smart Grid 2022, 13, 4747–4758. [Google Scholar] [CrossRef]
  28. Energy Australia. Solar Rebates and Feed-in Tariffs. Available online: https://www.energyaustralia.com.au/home/electricity-and-gas/solar-power/feed-in-tariffs (accessed on 10 August 2024).
  29. Ratnam, E.L.; Weller, S.R.; Kellett, C.M.; Murray, A.T. Residential load and rooftop PV generation: An Australian distribution network dataset. Int. J. Sustain. Energy 2017, 36, 787–806. [Google Scholar] [CrossRef]
  30. Australian Government. Rainfall, Temperature and Wind Forecast and Observations. 2024. Available online: https://data.gov.au/data/dataset/rainfall-and-temperature-forecast-and-observations-verification-2017-05-to-2018-04 (accessed on 15 August 2024).
  31. Li, H.; Wan, Z.; He, H. Real-time residential demand response. IEEE Trans. Smart Grid 2020, 11, 4144–4154. [Google Scholar] [CrossRef]
  32. Huang, Y.; Sun, Q.; Zhang, N.; Wang, R. A multi-slack bus model for bi-directional energy flow analysis of integrated power-gas systems. CSEE J. Power Energy Syst. 2024, 10, 2186–2196. [Google Scholar] [CrossRef]
  33. U.S. Department of Transportation. National Household Travel Survey. 2024. Available online: https://nhts.ornl.gov/ (accessed on 20 March 2024).
  34. Yang, F.; Meng, J.; Ci, M.; Lin, N.; Gao, F. An efficient reconfigurable battery network based on the asynchronous advantage actor–critic paradigm. IEEE Trans. Transp. Electrif. 2025, 11, 1479–1487. [Google Scholar] [CrossRef]
  35. Liang, Y.; Guo, C.; Ding, Z.; Hua, H. Agent-based modeling in electricity market using deep deterministic policy gradient algorithm. IEEE Trans. Power Syst. 2020, 35, 4180–4192. [Google Scholar] [CrossRef]
  36. Hu, Z.; Zheng, P.; Chan, K.W.; Bu, S.; Zhu, Z.; Wei, X.; Nakanishi, Y. A hybrid data-driven approach integrating temporal fusion transformer and soft actor-critic algorithm for optimal scheduling of building integrated energy systems. J. Mod. Power Syst. Clean Energy 2025, 13, 878–891. [Google Scholar] [CrossRef]
  37. Lu, P.; Wu, Y.; Li, J.; Zhang, N.; Li, K.; Shahidehpour, M. Distributed proximal policy optimization with embedded dual rules for power systems considering wind and photovoltaic forecasting. IEEE Trans. Sustain. Energy 2026, 17, 421–434. [Google Scholar] [CrossRef]
Figure 1. The system architecture.
Figure 2. The proposed multi-thread DRL architecture.
Figure 3. The sketch map of network-based TL.
Figure 4. The framework of the proposed TMDRL.
Figure 5. The evaluation metrics for transfer performance.
Figure 6. Training performance. (a) Single-threaded DRL; (b) multi-thread DRL.
Figure 7. The loss curves of different critic networks. (a) Single-critic network; (b) decoupled double-critic network.
Figure 8. The total rewards of transferring to different times. (a) Transfer the first layer; (b) transfer the first two layers; (c) transfer the first three layers; (d) transfer all layers.
Figure 9. The total rewards of transferring to different areas. (a) Transfer the first layer; (b) transfer the first two layers; (c) transfer the first three layers; (d) transfer all layers.
Figure 10. The results of energy management in the smart building. (a) Power of EV; (b) power of ES; (c) indoor and outdoor temperature; (d) SOC of EV; (e) SOC of ES; (f) energy trading and deferrable load power.
Figure 11. The results of energy management in the smart building.
Table 1. The parameters of electrical appliances.
| Appliance | Parameters | Value |
|---|---|---|
| HVAC system | Minimum comfort temperature | 19 °C |
| | Maximum comfort temperature | 24 °C |
| | Penalty coefficients of thermal comfort | 10 |
| | Rated power of HVAC system | 2.5 kW |
| | Thermal resistance | 7.5 °C/kW |
| | Thermal capacitance | 0.594 kWh/°C |
| EVs | Maximum battery capacity of EV | 55 kWh |
| | Maximum charging and discharging power of EV | 10 kW |
| | Charging and discharging efficiency of EV | 0.98 |
| | Arrival time | N(17,12,15,9) ¹ |
| | Departure time | N(8,12,6,10) ¹ |
| | Penalty coefficients of range anxiety | 10 |
| ES | Maximum battery capacity of ES | 20 kWh |
| | Maximum charging and discharging power of ES | 3 kW |
| | Charging and discharging efficiency of ES | 0.98 |
| Deferrable loads | Power consumption of deferrable load | 1.65 kW |
| | Duration time of deferrable load | 2 h |
| | Start time of deferrable load | N(24,12,22,26) ¹ |
| | End time of deferrable load | N(4,12,2,6) ¹ |

¹ N(μ,σ,a,b) denotes a truncated normal distribution with mean μ, standard deviation σ, lower bound a, and upper bound b.
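
The truncated normal distribution in the footnote to Table 1 can be sampled by simple rejection, as in the following sketch. It uses the EV departure-time entry with mean 8 h and bounds 6 h and 10 h. The standard deviation of 1 h is an assumption: the second argument appears as "12" in the table, which may be an extraction of 1², and that reading is not confirmed here. The function name `sample_truncated_normal` and the random seed are illustrative.

```python
import random


def sample_truncated_normal(mu, sigma, lower, upper, rng=random):
    """Draw from a normal(mu, sigma) restricted to [lower, upper].

    Uses rejection sampling: redraw until the value falls inside the bounds.
    This is efficient when the bounds sit a couple of sigmas from the mean.
    """
    if lower >= upper:
        raise ValueError("lower bound must be below upper bound")
    while True:
        value = rng.gauss(mu, sigma)
        if lower <= value <= upper:
            return value


# EV departure time in hours: mean 8, assumed sigma 1, bounds [6, 10].
rng = random.Random(0)  # fixed seed so the sketch is repeatable
departure_hours = [sample_truncated_normal(8, 1, 6, 10, rng) for _ in range(1000)]
mean_departure = sum(departure_hours) / len(departure_hours)
```

Rejection sampling keeps the example dependency-free. A library routine such as SciPy's `truncnorm` would be a reasonable alternative when many samples are needed or the bounds lie far in the tails.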
Table 2. Performance evaluation of novelty-based filtering mechanism.
| Parameters | TMDRL (Proposed) | TMDRL (with Mean Threshold) | TMDRL (Without Novelty-Based Filtering Mechanism) |
|---|---|---|---|
| Jumpstart | −225.14 ± 9.55 | −248.32 ± 10.44 | −260.72 ± 12.36 |
| Number of outliers (%) | 1.84 ± 0.09% | 2.35 ± 0.12% | 3.81 ± 0.34% |
| Convergence efficiency (episodes) | 582 ± 17 | 654 ± 22 | 894 ± 41 |
| Total reward | −18.47 ± 0.85 | −19.85 ± 0.91 | −20.23 ± 0.98 |
Table 3. Performance evaluation of critic combination strategy.
| Parameters | Proposed | Min Selection | Max Selection |
|---|---|---|---|
| Jumpstart | −225.14 ± 9.55 | −278.55 ± 12.60 | −245.62 ± 11.15 |
| Number of outliers (%) | 1.84 ± 0.09% | 3.15 ± 0.14% | 3.94 ± 0.19% |
| Convergence efficiency (episodes) | 582 ± 17 | 845 ± 32 | 760 ± 29 |
| Total reward | −18.47 ± 0.85 | −21.31 ± 1.02 | −19.98 ± 0.94 |
Table 4. Performance evaluation of transferring to different times.
| Parameters | Baseline | The First Layer | The First Two Layers | The First Three Layers | All Layers |
|---|---|---|---|---|---|
| Jumpstart | −449.59 ± 18.42 | −302.56 ± 12.15 | −300.15 ± 11.83 | −225.14 ± 9.55 | −262.43 ± 10.28 |
| Number of outliers (%) | 2.58 ± 0.12% | 2.61 ± 0.15% | 2.73 ± 0.14% | 1.84 ± 0.09% | 1.63 ± 0.08% |
| Convergence efficiency (episodes) | 2671 ± 45 | 1231 ± 28 | 1042 ± 25 | 582 ± 17 | 683 ± 19 |
| Total reward | −26.73 ± 1.25 | −24.86 ± 1.10 | −24.93 ± 1.12 | −18.47 ± 0.85 | −20.32 ± 0.92 |
Table 5. Performance evaluation of transferring to different seasons.
| Parameters | Baseline | The First Layer | The First Two Layers | The First Three Layers | All Layers |
|---|---|---|---|---|---|
| Jumpstart | −480.25 ± 20.14 | −350.12 ± 15.35 | −345.61 ± 14.82 | −240.35 ± 11.21 | −290.47 ± 13.56 |
| Number of outliers (%) | 2.85 ± 0.16% | 2.78 ± 0.14% | 2.65 ± 0.12% | 1.95 ± 0.10% | 2.13 ± 0.11% |
| Convergence efficiency (episodes) | 2853 ± 53 | 1451 ± 32 | 1214 ± 29 | 652 ± 18 | 783 ± 21 |
| Total reward | −29.53 ± 1.35 | −26.15 ± 1.20 | −25.81 ± 1.15 | −19.65 ± 0.88 | −22.37 ± 0.95 |
Table 6. Performance evaluation of transferring to different areas.
| Parameters | Baseline | The First Layer | The First Two Layers | The First Three Layers | All Layers |
|---|---|---|---|---|---|
| Jumpstart | −379.68 ± 15.85 | −327.61 ± 14.22 | −329.11 ± 14.52 | −256.43 ± 12.18 | −288.42 ± 13.05 |
| Number of outliers (%) | 2.21 ± 0.11% | 2.17 ± 0.12% | 2.09 ± 0.09% | 1.89 ± 0.08% | 1.97 ± 0.08% |
| Convergence efficiency (episodes) | 2487 ± 41 | 1226 ± 26 | 1198 ± 25 | 973 ± 20 | 1153 ± 23 |
| Total reward | −25.65 ± 1.15 | −15.74 ± 0.75 | −14.63 ± 0.68 | −11.85 ± 0.55 | −14.22 ± 0.65 |
Table 7. Performance of various methods.
| Methods | Total Reward (Mean ± Std [95% CI]) | Total Cost ($) | Parameters (K) | Constraints Violation Rate (%) | Convergence Time (min) | Convergence Efficiency (Episodes) | p-Value |
|---|---|---|---|---|---|---|---|
| A3C | −20.17 ± 1.43 [−21.19, −19.15] | 18.52 ± 0.75 | 110 | 3.82 ± 0.25 | 58.5 ± 2.1 | 1100 ± 33 | <0.01 |
| DDPG | −15.36 ± 0.78 [−15.92, −14.80] | 15.21 ± 0.55 | 126 | 2.15 ± 0.18 | 72.1 ± 2.5 | 1005 ± 28 | 0.012 |
| SAC | −23.23 ± 1.82 [−24.53, −21.93] | 19.84 ± 0.88 | 249 | 4.42 ± 0.32 | 85.4 ± 3.2 | 1154 ± 36 | <0.01 |
| DPPO | −21.17 ± 1.74 [−22.41, −19.93] | 18.93 ± 0.81 | 131 | 2.91 ± 0.22 | 65.7 ± 2.2 | 983 ± 31 | <0.01 |
| TMDRL | −12.43 ± 0.56 [−12.83, −12.03] | 12.45 ± 0.42 | 144 | 1.13 ± 0.09 | 41.2 ± 1.5 | 775 ± 21 | - |
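
The percentage improvements quoted in the abstract can be recomputed from the mean values in Table 7. The sketch below assumes they are relative reductions of the form (baseline − TMDRL) / baseline, which reproduces the reported figures to within about 0.01 percentage points. The helper name `relative_reduction` is illustrative.

```python
def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` is lower than `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


# Mean values copied from Table 7:
# (total cost in $, convergence efficiency in episodes).
table7 = {
    "A3C": (18.52, 1100),
    "DDPG": (15.21, 1005),
    "SAC": (19.84, 1154),
    "DPPO": (18.93, 983),
}
tmdrl_cost, tmdrl_episodes = 12.45, 775

# For each baseline: (cost reduction %, episode reduction %) achieved by TMDRL.
improvements = {
    method: (
        relative_reduction(cost, tmdrl_cost),
        relative_reduction(episodes, tmdrl_episodes),
    )
    for method, (cost, episodes) in table7.items()
}

for method, (cost_cut, episode_cut) in improvements.items():
    print(f"{method}: cost reduced {cost_cut:.2f}%, episodes reduced {episode_cut:.2f}%")
```

Only the means are used here. The standard deviations in Table 7 are ignored, so the sketch says nothing about the uncertainty of these percentages.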

Share and Cite

MDPI and ACS Style

Feng, J.; Hu, J.; Sun, Q. Heterogeneous Exploration and Double-Critic Transfer Reinforcement Learning for Sustainable Cross-Domain Energy Management in Smart Buildings. Sustainability 2026, 18, 5685. https://doi.org/10.3390/su18115685

