Heterogeneous Exploration and Double-Critic Transfer Reinforcement Learning for Sustainable Cross-Domain Energy Management in Smart Buildings
Abstract
1. Introduction
- (1) A heterogeneous exploration architecture is introduced for the source domain, using multi-thread DRL to construct a transferable knowledge base. Unlike conventional distributed methods and standard prioritized experience replay, which focus on training speed, the architecture applies a quantitative novelty-based filtering mechanism to store transitions selectively. This maximizes coverage of the different source-domain environments and improves the generality and transferability of the learned knowledge.
- (2) A decoupled double-critic optimization mechanism is designed to mitigate policy evaluation bias during cross-domain transfer. Unlike symmetrically initialized double-critic methods, the proposed mechanism introduces a randomly initialized local-critic network that learns the target-domain dynamics independently, while the transfer-critic network supplies shared source-domain prior knowledge. The two critics jointly guide the policy update, which improves the robustness and adaptability of the management strategy in new environments.
- (3) A TMDRL framework combining TL and multi-thread DRL is proposed for cross-domain smart building energy management. By integrating the source-domain heterogeneous exploration architecture with the decoupled double-critic optimization mechanism, the framework achieves robust knowledge construction and strategy transfer. In the benchmark comparison of the case study, TMDRL converges in 41.2 min (58.5 to 85.4 min for the benchmark algorithms) at a total cost of $12.45 ($15.21 to $19.84 for the benchmarks), which shortens development time, reduces the total cost, and improves the sustainability of smart building energy management.
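The novelty-based filtering in contribution (1) can be illustrated with a small replay buffer. This is only a sketch under stated assumptions: the novelty score used here (mean Euclidean distance to the `k` nearest stored states) and the fixed `threshold` are hypothetical stand-ins, not the paper's Equation (25) or its threshold rule, and the class name `NoveltyFilteredBuffer` is invented for illustration.

```python
import math
import random
from collections import deque


class NoveltyFilteredBuffer:
    """Replay buffer that keeps a transition only if its state looks novel.

    Assumption: novelty is the mean Euclidean distance to the k nearest
    stored states. This is a stand-in, not the paper's Equation (25).
    """

    def __init__(self, capacity=10000, k=5, threshold=0.5):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop first
        self.k = k
        self.threshold = threshold  # hypothetical fixed threshold

    def novelty(self, state):
        if not self.buffer:
            return math.inf  # an empty buffer makes every state novel
        dists = sorted(math.dist(state, t[0]) for t in self.buffer)
        nearest = dists[: self.k]
        return sum(nearest) / len(nearest)

    def maybe_store(self, state, action, reward, next_state):
        """Store the transition if it is novel; return True when stored."""
        if self.novelty(state) > self.threshold:
            self.buffer.append((tuple(state), action, reward, tuple(next_state)))
            return True
        return False

    def sample(self, batch_size, rng=random):
        """Draw a uniform mini-batch from the stored transitions."""
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))


# Usage: the second state is too close to the first and is filtered out.
buf = NoveltyFilteredBuffer(k=1, threshold=0.5)
first = buf.maybe_store((0.0, 0.0), 0, -1.0, (0.1, 0.0))
second = buf.maybe_store((0.1, 0.0), 0, -1.0, (0.2, 0.0))
third = buf.maybe_store((2.0, 2.0), 1, -0.5, (2.1, 2.0))
```

Filtering at insertion time, rather than re-weighting at sampling time as prioritized replay does, is what lets the buffer favor coverage of the state space over training speed.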
2. System Models and Problem Formulation
2.1. Smart Building Model
2.1.1. HVAC Systems
2.1.2. Electric Vehicles
2.1.3. ES System
2.1.4. Deferrable Appliances
2.1.5. Power Balance Constraint
2.2. MDP Formulation for Energy Management
2.2.1. Environment State Set
2.2.2. Control Action Set
2.2.3. State Transition Probability
2.2.4. The Reward of MDP
3. The TMDRL-Based Energy Management Framework
3.1. Source-Domain Heterogeneous Exploration
3.2. Cross-Domain Knowledge Transfer Protocol
3.3. Target-Domain Adaptation
3.4. Metrics of TL
3.4.1. Total Reward
3.4.2. Jumpstart
3.4.3. Convergence Efficiency
3.4.4. Number of Outliers
4. Case Study
4.1. Simulation Setup
- Asynchronous Advantage Actor–Critic (A3C)
- Deep Deterministic Policy Gradient (DDPG)
- Soft Actor–Critic (SAC)
- Distributed Proximal Policy Optimization (DPPO)
4.2. Heterogeneous Exploration Architecture Training Performance
4.3. Decoupled Double-Critic Mechanism Performance
4.4. Novelty-Based Filtering Mechanism Performance
4.5. Transfer to Different Times
4.6. Transfer to Different Areas
4.7. Energy Management Results
4.8. Performance Comparison with Benchmarks
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Nomenclature
| TMDRL | Transferred multi-thread deep reinforcement learning |
| DRL | Deep reinforcement learning |
| TL | Transfer learning |
| DERs | Distributed energy resources |
| HVAC | Heating, ventilation, and air conditioning |
| EV | Electric vehicle |
| ES | Energy storage |
| PV | Photovoltaic |
| MDP | Markov decision process |
| SOC | State of charge |
| TOU | Time-of-use |
| A3C | Asynchronous advantage actor–critic |
| DDPG | Deep deterministic policy gradient |
| SAC | Soft actor–critic |
| Time slot index and duration | |
| Indoor and outdoor temperature | |
| Minimum and maximum comfort temperature | |
| Equivalent thermal resistance and capacitance | |
| Power generation of PV | |
| Inherent randomness of the environment | |
| Power consumption of HVAC system | |
| Rated power of HVAC system | |
| Energy stored in EV and ES | |
| Charging and discharging power of EV | |
| Charging and discharging power of ES | |
| Charging and discharging efficiency of EV | |
| Charging and discharging efficiency of ES | |
| Maximum charging and discharging power of EV | |
| Maximum charging and discharging power of ES | |
| Maximum battery capacity of EV and ES | |
| Required and expected energy of EV at departure | |
| Power consumption and binary activation status of deferrable load | |
| Start, end, and duration time of deferrable load | |
| Electricity purchasing price and selling price | |
| Cost coefficients for deferrable loads and ES | |
| Traded electrical power with the grid | |
| Power purchased from and sold to the grid | |
| State set and Action set | |
| Power regulation action of HVAC system | |
| Actions of EV and ES | |
| Activation status action of deferrable load | |
| Total reward function | |
| Operating cost reward function | |
| Thermal comfort reward function | |
| Range anxiety reward function | |
| Penalty coefficients for thermal comfort and range anxiety | |
| Weight coefficients | |
| Policy function and Value function | |
| Novelty-based experience replay buffer | |
| Novelty score function of the state | |
| Parameters of the policy and critic networks | |
| Discount factor and look-ahead steps | |
| Mean total reward | |
Appendix A
| Source-Domain Pre-Training |
| 1: Initialize global actor and critic networks |
| 2: Operate 8 parallel threads asynchronously: |
| 3: For each step do: |
| 4: Interact with the source-domain environment, sample an action, and observe the reward and next state |
| 5: Calculate the novelty score according to Equation (25) |
| 6: If the novelty score passes the filtering threshold then: |
| 7: Store the transition in the novelty-based experience replay buffer |
| 8: End If |
| 9: If the current step is a multiple of 100 then: |
| 10: Sample a mini-batch from the experience replay buffer and update the global actor and critic networks |
| 11: Synchronize local threads with global networks |
| 12: End If |
| 13: End For |
| 14: Save pre-trained actor and critic network parameters |
| Target-Domain Strategy Transfer |
| 15: Initialize the actor and transfer-critic networks with pre-trained parameters |
| 16: Initialize the local-critic network with random parameters |
| 17: Initialize the novelty-based experience replay buffer |
| 18: Operate 8 parallel threads asynchronously: |
| 19: For each step do: |
| 20: Interact with the target-domain environment, sample an action, and observe the reward and next state |
| 21: Calculate the novelty score according to Equation (25) |
| 22: If the novelty score passes the filtering threshold then: |
| 23: Store the transition in the novelty-based experience replay buffer |
| 24: End If |
| 25: If the current step is a multiple of 100 then: |
| 26: Sample a mini-batch from the experience replay buffer |
| 27: Calculate target R-values according to Equations (31) and (32) |
| 28: Calculate independent critic losses according to Equations (33) and (34) |
| 29: Update critics independently |
| 30: Update the actor network using the clipping probability ratio mechanism |
| 31: Synchronize local threads with global networks |
| 32: End If |
| 33: End For |
| 34: Return optimized target strategy |
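Steps 27 to 30 of the target-domain phase can be sketched in plain Python. The listing names the steps but not the formulas, so everything specific here is an assumption: the targets are taken to be bootstrapped discounted returns, the critic losses are mean squared errors, and the two critics are combined into one advantage through a hypothetical blending weight `beta`. Equations (31) to (34) of the paper may differ, and all function names are illustrative.

```python
def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Discounted returns R_t = r_t + gamma * R_{t+1}, bootstrapped at the end."""
    returns, running = [], bootstrap_value
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def mse(values, targets):
    """Mean squared error between critic outputs and their targets."""
    return sum((v - t) ** 2 for v, t in zip(values, targets)) / len(values)


def clipped_surrogate(ratios, advantages, eps=0.2):
    """Clipped probability-ratio objective (to be maximized by the actor)."""
    total = 0.0
    for rho, adv in zip(ratios, advantages):
        clipped = max(1.0 - eps, min(1.0 + eps, rho))
        total += min(rho * adv, clipped * adv)
    return total / len(ratios)


def double_critic_step(rewards, v_transfer, v_local, boot_transfer, boot_local,
                       ratios, beta=0.5, gamma=0.99):
    """One update's worth of losses for the two critics and the actor."""
    # Each critic bootstraps from its own estimate, giving separate targets.
    r_transfer = n_step_returns(rewards, boot_transfer, gamma)
    r_local = n_step_returns(rewards, boot_local, gamma)
    # Independent losses: neither critic's error feeds the other's update.
    loss_transfer = mse(v_transfer, r_transfer)
    loss_local = mse(v_local, r_local)
    # Assumption: the actor sees a beta-weighted blend of both advantages.
    advantages = [
        beta * (rt - vt) + (1.0 - beta) * (rl - vl)
        for rt, vt, rl, vl in zip(r_transfer, v_transfer, r_local, v_local)
    ]
    return loss_transfer, loss_local, clipped_surrogate(ratios, advantages)
```

Keeping the two losses separate means an inaccurate source-domain prior in the transfer critic cannot corrupt the local critic's estimate of the target-domain dynamics; only the actor sees both signals.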











| Appliance | Parameter | Value |
|---|---|---|
| HVAC system | Minimum comfort temperature | 19 °C |
| HVAC system | Maximum comfort temperature | 24 °C |
| HVAC system | Penalty coefficient of thermal comfort | 10 |
| HVAC system | Rated power of HVAC system | 2.5 kW |
| HVAC system | Thermal resistance | 7.5 °C/kW |
| HVAC system | Thermal capacitance | 0.594 kWh/°C |
| EVs | Maximum battery capacity of EV | 55 kWh |
| EVs | Maximum charging and discharging power of EV | 10 kW |
| EVs | Charging and discharging efficiency of EV | 0.98 |
| EVs | Arrival time | N(17, 1², 15, 9) ¹ |
| EVs | Departure time | N(8, 1², 6, 10) ¹ |
| EVs | Penalty coefficient of range anxiety | 10 |
| ES | Maximum battery capacity of ES | 20 kWh |
| ES | Maximum charging and discharging power of ES | 3 kW |
| ES | Charging and discharging efficiency of ES | 0.98 |
| Deferrable loads | Power consumption of deferrable load | 1.65 kW |
| Deferrable loads | Duration time of deferrable load | 2 h |
| Deferrable loads | Start time of deferrable load | N(24, 1², 22, 26) ¹ |
| Deferrable loads | End time of deferrable load | N(4, 1², 2, 6) ¹ |

| Parameters | TMDRL (Proposed) | TMDRL (with Mean Threshold) | TMDRL (Without Novelty-Based Filtering Mechanism) |
|---|---|---|---|
| Jumpstart | −225.14 ± 9.55 | −248.32 ± 10.44 | −260.72 ± 12.36 |
| Number of outliers (%) | 1.84 ± 0.09% | 2.35 ± 0.12% | 3.81 ± 0.34% |
| Convergence efficiency (episodes) | 582 ± 17 | 654 ± 22 | 894 ± 41 |
| Total reward | −18.47 ± 0.85 | −19.85 ± 0.91 | −20.23 ± 0.98 |

| Parameters | Proposed | Min Selection | Max Selection |
|---|---|---|---|
| Jumpstart | −225.14 ± 9.55 | −278.55 ± 12.60 | −245.62 ± 11.15 |
| Number of outliers (%) | 1.84 ± 0.09% | 3.15 ± 0.14% | 3.94 ± 0.19% |
| Convergence efficiency (episodes) | 582 ± 17 | 845 ± 32 | 760 ± 29 |
| Total reward | −18.47 ± 0.85 | −21.31 ± 1.02 | −19.98 ± 0.94 |

| Parameters | Baseline | The First Layer | The First Two Layers | The First Three Layers | All Layers |
|---|---|---|---|---|---|
| Jumpstart | −449.59 ± 18.42 | −302.56 ± 12.15 | −300.15 ± 11.83 | −225.14 ± 9.55 | −262.43 ± 10.28 |
| Number of outliers (%) | 2.58 ± 0.12% | 2.61 ± 0.15% | 2.73 ± 0.14% | 1.84 ± 0.09% | 1.63 ± 0.08% |
| Convergence efficiency (episodes) | 2671 ± 45 | 1231 ± 28 | 1042 ± 25 | 582 ± 17 | 683 ± 19 |
| Total reward | −26.73 ± 1.25 | −24.86 ± 1.10 | −24.93 ± 1.12 | −18.47 ± 0.85 | −20.32 ± 0.92 |

| Parameters | Baseline | The First Layer | The First Two Layers | The First Three Layers | All Layers |
|---|---|---|---|---|---|
| Jumpstart | −480.25 ± 20.14 | −350.12 ± 15.35 | −345.61 ± 14.82 | −240.35 ± 11.21 | −290.47 ± 13.56 |
| Number of outliers (%) | 2.85 ± 0.16% | 2.78 ± 0.14% | 2.65 ± 0.12% | 1.95 ± 0.10% | 2.13 ± 0.11% |
| Convergence efficiency (episodes) | 2853 ± 53 | 1451 ± 32 | 1214 ± 29 | 652 ± 18 | 783 ± 21 |
| Total reward | −29.53 ± 1.35 | −26.15 ± 1.20 | −25.81 ± 1.15 | −19.65 ± 0.88 | −22.37 ± 0.95 |

| Parameters | Baseline | The First Layer | The First Two Layers | The First Three Layers | All Layers |
|---|---|---|---|---|---|
| Jumpstart | −379.68 ± 15.85 | −327.61 ± 14.22 | −329.11 ± 14.52 | −256.43 ± 12.18 | −288.42 ± 13.05 |
| Number of outliers (%) | 2.21 ± 0.11% | 2.17 ± 0.12% | 2.09 ± 0.09% | 1.89 ± 0.08% | 1.97 ± 0.08% |
| Convergence efficiency (episodes) | 2487 ± 41 | 1226 ± 26 | 1198 ± 25 | 973 ± 20 | 1153 ± 23 |
| Total reward | −25.65 ± 1.15 | −15.74 ± 0.75 | −14.63 ± 0.68 | −11.85 ± 0.55 | −14.22 ± 0.65 |

| Methods | Total Reward (Mean ± Std [95% CI]) | Total Cost ($) | Parameters (K) | Constraints Violation Rate (%) | Convergence Time (min) | Convergence Efficiency (Episodes) | p-Value |
|---|---|---|---|---|---|---|---|
| A3C | −20.17 ± 1.43 [−21.19, −19.15] | 18.52 ± 0.75 | 110 | 3.82 ± 0.25 | 58.5 ± 2.1 | 1100 ± 33 | <0.01 |
| DDPG | −15.36 ± 0.78 [−15.92, −14.80] | 15.21 ± 0.55 | 126 | 2.15 ± 0.18 | 72.1 ± 2.5 | 1005 ± 28 | 0.012 |
| SAC | −23.23 ± 1.82 [−24.53, −21.93] | 19.84 ± 0.88 | 249 | 4.42 ± 0.32 | 85.4 ± 3.2 | 1154 ± 36 | <0.01 |
| DPPO | −21.17 ± 1.74 [−22.41, −19.93] | 18.93 ± 0.81 | 131 | 2.91 ± 0.22 | 65.7 ± 2.2 | 983 ± 31 | <0.01 |
| TMDRL | −12.43 ± 0.56 [−12.83, −12.03] | 12.45 ± 0.42 | 144 | 1.13 ± 0.09 | 41.2 ± 1.5 | 775 ± 21 | - |
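The relative gains implied by the benchmark table can be computed directly from its mean values. The short script below is a convenience sketch: the dictionaries copy the Total Cost and Convergence Time columns above, while the helper name `reduction_pct` and the choice to ignore the reported standard deviations are assumptions made for illustration.

```python
# Mean values copied from the benchmark comparison table above.
total_cost = {"A3C": 18.52, "DDPG": 15.21, "SAC": 19.84, "DPPO": 18.93}
conv_time_min = {"A3C": 58.5, "DDPG": 72.1, "SAC": 85.4, "DPPO": 65.7}
TMDRL_COST, TMDRL_TIME = 12.45, 41.2


def reduction_pct(baseline, proposed):
    """Percentage by which `proposed` is lower than `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


for method in total_cost:
    cost_gain = reduction_pct(total_cost[method], TMDRL_COST)
    time_gain = reduction_pct(conv_time_min[method], TMDRL_TIME)
    print(f"{method}: cost -{cost_gain:.1f}%, convergence time -{time_gain:.1f}%")
```

Because the comparison uses means only, the percentages should be read together with the standard deviations and p-values reported in the table.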
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Feng, J.; Hu, J.; Sun, Q. Heterogeneous Exploration and Double-Critic Transfer Reinforcement Learning for Sustainable Cross-Domain Energy Management in Smart Buildings. Sustainability 2026, 18, 5685. https://doi.org/10.3390/su18115685
