A Physics-Informed Reinforcement Learning Framework for HVAC Optimization: Thermodynamically-Constrained Deep Deterministic Policy Gradients with Simulation-Based Validation
Abstract
1. Introduction
Contributions (Scope, Novelty, and Validation Context)
- Physics-informed continuous control: A TC-DDPG architecture that operates directly on continuous HVAC actions, avoiding discretization artifacts inherent to DQN-style methods.
- Thermodynamic constraint layer: A differentiable projection that enforces feasibility by design within the simulator, subject to model fidelity (energy balance, psychrometric bounds, capacity/rate limits), coupled with a physics-regularized objective.
- Simulation-Based Performance Validation: In a multi-zone RC simulator, the method yields a 34.7% annual energy reduction vs. rule-based control and keeps occupied-hour PMV within [−0.5, 0.5] for 98.3% of hours. Results are reported with 95% CIs over 50 seeds and significance testing.
- Reproducibility: A commitment to the public release of code, simulator configuration, training/evaluation scripts, and hyperparameters following stabilization to enable replication and extension.
- Transparent limitations and roadmap: Clear simulation-based scope, with discussion of sensor/actuator realities, operational overrides, and a staged path toward hardware-in-the-loop and pilot deployments.
2. Related Work
- Recent progress in RL for building/HVAC control (2020–2025)
- Benchmarks and application-level evidence
- Safety and explicit handling of constraints
- Physics-informed learning for building control
2.1. Traditional HVAC Control
2.2. Machine Learning Approaches
2.3. Physics-Informed Machine Learning
2.4. Research Gap
- Gap 1—Continuous action feasibility:
- Gap 2—Safety and physics violations during exploration:
- Gap 3—Limited reporting of constraint violation metrics:
- Gap 4—Weak reproducibility and benchmarking standards:
- Gap 5—Lack of a roadmap for real-world deployment:
3. Mathematical Framework
3.1. System Dynamics
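- $T_i$: temperature of zone $i$;
- $C_i$: thermal capacitance of zone $i$;
- $U_{ij}$: heat transfer coefficient between zones $i$ and $j$;
- $A_{ij}$: surface area between zones $i$ and $j$;
- $Q_{\mathrm{HVAC},i}$: HVAC heat transfer rate into zone $i$;
- $Q_{\mathrm{int},i}$: internal heat gains;
- $Q_{\mathrm{sol},i}$: solar heat gains;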
- ;
- .
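As a rough illustration of the zone energy balance that these quantities enter, the sketch below advances a lumped RC network by one forward-Euler step. It assumes the common form C_i dT_i/dt = Σ_j U_ij A_ij (T_j − T_i) + UA_out,i (T_out − T_i) + Q_HVAC,i + Q_int,i + Q_sol,i. The function name, argument names, the envelope term, and all numeric values are illustrative assumptions rather than the paper's implementation.

```python
from typing import List


def rc_euler_step(
    T: List[float],          # zone temperatures [degC]
    C: List[float],          # zone thermal capacitances [J/K]
    UA: List[List[float]],   # inter-zone conductances U_ij * A_ij [W/K], symmetric
    UA_out: List[float],     # envelope conductance to outdoors [W/K] (assumed term)
    T_out: float,            # outdoor temperature [degC]
    Q_hvac: List[float],     # HVAC heat flow into each zone [W]
    Q_int: List[float],      # internal gains [W]
    Q_sol: List[float],      # solar gains [W]
    dt: float = 60.0,        # integrator step [s]
) -> List[float]:
    """Advance zone temperatures by one forward-Euler step of the RC balance."""
    n = len(T)
    T_next = []
    for i in range(n):
        # Net conductive exchange with the other zones.
        q_zones = sum(UA[i][j] * (T[j] - T[i]) for j in range(n) if j != i)
        # Exchange with outdoor air through the envelope.
        q_env = UA_out[i] * (T_out - T[i])
        q_net = q_zones + q_env + Q_hvac[i] + Q_int[i] + Q_sol[i]
        T_next.append(T[i] + dt * q_net / C[i])
    return T_next


# Two-zone example with hypothetical parameters.
T1 = rc_euler_step(
    T=[22.0, 24.0], C=[1.0e6, 1.0e6],
    UA=[[0.0, 100.0], [100.0, 0.0]], UA_out=[200.0, 200.0], T_out=30.0,
    Q_hvac=[-1000.0, -1000.0], Q_int=[300.0, 300.0], Q_sol=[0.0, 0.0],
)
```

With symmetric inter-zone conductances and no external terms, the step conserves the stored energy Σ C_i T_i, which is a convenient sanity check for any implementation of this balance.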
3.2. Psychrometric Constraints
- specific enthalpy
- humidity ratio ;
- relative humidity
- saturation pressure at temperature T
- barometric pressure .
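The relations among these psychrometric quantities can be sketched as follows. This is a minimal example under stated assumptions: the saturation pressure uses a Magnus-type approximation as a stand-in (the paper's exact correlation is not shown here), the enthalpy uses the usual moist-air form h = 1.006 T + ω (2501 + 1.86 T) in kJ/kg, and all function names are hypothetical.

```python
import math

P_ATM = 101325.0  # standard barometric pressure [Pa]


def saturation_pressure(t_c: float) -> float:
    """Saturation vapour pressure over water [Pa] (Magnus-type approximation)."""
    return 610.94 * math.exp(17.625 * t_c / (t_c + 243.04))


def humidity_ratio(t_c: float, rh: float, p: float = P_ATM) -> float:
    """Humidity ratio [kg water / kg dry air] from dry-bulb temperature and RH in [0, 1]."""
    p_w = rh * saturation_pressure(t_c)  # water vapour partial pressure
    return 0.622 * p_w / (p - p_w)


def enthalpy(t_c: float, w: float) -> float:
    """Moist-air specific enthalpy [kJ per kg dry air]."""
    return 1.006 * t_c + w * (2501.0 + 1.86 * t_c)


def relative_humidity(t_c: float, w: float, p: float = P_ATM) -> float:
    """Invert the humidity-ratio relation to recover RH from (T, w)."""
    p_w = w * p / (0.622 + w)
    return p_w / saturation_pressure(t_c)


def is_feasible(t_c: float, w: float, p: float = P_ATM) -> bool:
    """A (T, w) pair is feasible when w >= 0 and the implied RH lies in [0, 1]."""
    return w >= 0.0 and 0.0 <= relative_humidity(t_c, w, p) <= 1.0


w_25 = humidity_ratio(25.0, 0.5)   # roughly 0.01 kg/kg at 25 degC and 50% RH
h_25 = enthalpy(25.0, w_25)        # roughly 50 kJ/kg
```

A pair such as 10 °C with ω = 0.02 implies a relative humidity well above 1 and would be rejected by `is_feasible`, which is the kind of state a psychrometric bound is meant to exclude.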
3.3. Energy Conservation
- mass flow rate of air in zone
- fan power consumption
- pump power consumption .
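A small worked sketch may help connect the air-side heat rate to electrical power. It assumes the sensible relation Q = ṁ c_p (T_sup − T_zone) and a simple accounting in which electrical demand is the thermal load divided by COP plus fan and pump power. The function names and the example numbers are hypothetical and are not taken from the paper.

```python
CP_AIR = 1006.0  # specific heat of dry air [J/(kg K)]


def sensible_heat_rate(m_dot: float, t_sup: float, t_zone: float) -> float:
    """Sensible heat delivered to a zone [W]; negative values mean cooling."""
    return m_dot * CP_AIR * (t_sup - t_zone)


def electrical_power(q_hvac_w: float, cop: float, p_fan_w: float, p_pump_w: float) -> float:
    """Electrical demand [W]: thermal load over COP plus auxiliaries (assumed accounting)."""
    if cop <= 0.0:
        raise ValueError("COP must be positive")
    return abs(q_hvac_w) / cop + p_fan_w + p_pump_w


# 1 kg/s of 14 degC supply air into a 24 degC zone, COP of 4, 300 W fan, 100 W pump.
q = sensible_heat_rate(m_dot=1.0, t_sup=14.0, t_zone=24.0)
p_el = electrical_power(q, cop=4.0, p_fan_w=300.0, p_pump_w=100.0)
```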

3.4. Model Calibration and Validation
4. Physics-Informed Reinforcement Learning
4.1. State Space Definition
4.2. Action Space
4.3. Reward Function
- is instantaneous energy use normalized by a fixed reference
- [2].
- is the 15 min rolling average electrical demand; Pref is a site-level reference for normalization.
- below a threshold (or penalizes exceedance), normalized to .
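The reward terms described above could be combined roughly as in the sketch below. The weights, the use of CO2 concentration as the IAQ signal, the 1000 ppm limit, and the shape of each term are assumptions made for illustration; only the four objectives (energy, comfort, 15 min rolling demand, IAQ) and their normalization come from the text.

```python
from collections import deque


class RollingDemand:
    """Rolling-average electrical demand; 3 steps of 5 min give a 15 min window."""

    def __init__(self, window_steps: int = 3):
        self.buf = deque(maxlen=window_steps)

    def update(self, p_kw: float) -> float:
        self.buf.append(p_kw)
        return sum(self.buf) / len(self.buf)


def reward(energy_kw, e_ref_kw, pmv, demand_kw, p_ref_kw, co2_ppm,
           co2_limit_ppm=1000.0, weights=(1.0, 1.0, 0.5, 0.5)):
    """Negative weighted sum of normalized penalty terms (hypothetical weights)."""
    w_e, w_c, w_p, w_q = weights
    e = energy_kw / e_ref_kw                                   # normalized energy use
    c = max(0.0, abs(pmv) - 0.5)                               # PMV excursion beyond +/-0.5
    p = demand_kw / p_ref_kw                                   # normalized rolling demand
    q = max(0.0, co2_ppm - co2_limit_ppm) / co2_limit_ppm      # IAQ exceedance
    return -(w_e * e + w_c * c + w_p * p + w_q * q)


demand = RollingDemand()
history = [demand.update(p) for p in (30.0, 60.0, 90.0, 120.0)]
r = reward(energy_kw=50.0, e_ref_kw=100.0, pmv=0.2,
           demand_kw=60.0, p_ref_kw=120.0, co2_ppm=800.0)
```

Dividing every term by a fixed reference keeps the components at comparable magnitudes, so the weights express relative priority rather than compensating for unit differences.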
4.4. Thermodynamically-Constrained Deep Deterministic Policy Gradient (TC-DDPG)
4.4.1. Differentiable Projection
4.4.2. Actor–Critic Networks
- Actor MLP with tanh output; actions are scaled and then projected by $\Pi_{\mathrm{phys}}$.
- Critic MLP that estimates $Q(s, a)$ on projected actions.
- Targets: target actor and critic networks with soft updates (rate $\tau$).
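The scaling step between the tanh output and the physical actuator range can be sketched as a per-channel affine map. The channel order and the function name are assumptions; the ranges are those listed for the five action channels in Section 5.2.

```python
import math
from typing import List

# Channel order (assumed): zone setpoint [degC], supply temp [degC],
# supply flow [kg/s], OA damper [-], chiller load [-].
LO = [20.0, 12.0, 0.0, 0.0, 0.0]
HI = [24.0, 20.0, 10.0, 1.0, 1.0]


def scale_action(a_tanh: List[float], lo: List[float] = LO, hi: List[float] = HI) -> List[float]:
    """Map tanh outputs in [-1, 1] linearly onto [lo, hi] for each channel."""
    return [l + 0.5 * (a + 1.0) * (h - l) for a, l, h in zip(a_tanh, lo, hi)]


# Pre-activations from a hypothetical actor head, squashed by tanh and then scaled.
pre_activation = [0.0, -3.0, 3.0, 0.5, -0.5]
a_physical = scale_action([math.tanh(z) for z in pre_activation])
```

Because tanh is bounded, the scaled action always lies inside the box before the projection is applied, so the projection only has to handle rate limits and the physics-based constraints.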
4.4.3. Objectives and Physics Regularization
- $L_{\mathrm{energy}}$: normalized residual between required and modeled HVAC power (sensible + latent + auxiliaries).
- $L_{\mathrm{psychro}}$: deviation from consistent (T, ω, ϕ) via saturation-based relations.
- $L_{\mathrm{comfort}}$: soft corridor penalties (∣PMV∣ ≤ 0.5).
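One way to picture these three terms is as scalar penalties combined with the fixed weights that appear in Algorithm 1 (1, 0.1, and 0.05). The sketch below uses plain Python scalars; the quadratic shapes, the simplified psychrometric term, and the function names are assumptions, not the paper's definitions.

```python
def energy_residual(p_required_w: float, p_modeled_w: float, p_scale_w: float) -> float:
    """Squared, normalized mismatch between required and modeled HVAC power."""
    return ((p_required_w - p_modeled_w) / p_scale_w) ** 2


def psychro_residual(rh: float) -> float:
    """Quadratic penalty when the implied relative humidity leaves [0, 1]."""
    return max(0.0, -rh) ** 2 + max(0.0, rh - 1.0) ** 2


def comfort_corridor_penalty(pmv: float, half_width: float = 0.5) -> float:
    """Zero inside |PMV| <= half_width, quadratic outside (assumed shape)."""
    excess = max(0.0, abs(pmv) - half_width)
    return excess * excess


def physics_loss(l_energy: float, l_psychro: float, l_comfort: float) -> float:
    """Weighted sum using the coefficients shown in Algorithm 1."""
    return l_energy + 0.1 * l_psychro + 0.05 * l_comfort


l_phys = physics_loss(
    energy_residual(5000.0, 4500.0, 10000.0),
    psychro_residual(1.1),
    comfort_corridor_penalty(-1.0),
)
```

In a training loop these terms would be computed on batched tensors so that gradients flow back to the actor; the scalar version only shows the arithmetic.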
4.4.4. Training Procedure
Algorithm 1. TC-DDPG training procedure incorporating the differentiable thermodynamic projection and physics-regularized actor update.

    # s: state, a: action, rb: replay buffer
    for episode in range(E):
        s = env.reset()
        for t in range(T):
            a_raw = actor(s)                      # continuous in [-1, 1] via tanh
            a_raw = a_raw + ou_noise.sample()
            a = Pi_phys.enforce(s, a_raw)         # differentiable projection
            s2, r, done, info = env.step(a)
            rb.add(s, a_raw, r, s2, done)         # store raw; projection is deterministic
            s = s2

            if len(rb) > batch:
                S, Araw, R, S2, D = rb.sample(batch)

                # Critic update
                with torch.no_grad():
                    A2 = Pi_phys.enforce(S2, actor_t(S2))
                    y = R + gamma * (1 - D) * critic_t(S2, A2)
                A = Pi_phys.enforce(S, Araw)
                Lc = mse(critic(S, A), y)
                optC.zero_grad()
                Lc.backward()
                optC.step()

                # Actor update
                Ahat = Pi_phys.enforce(S, actor(S))
                Lphys = L_energy(S, Ahat) + 0.1 * L_psychro(S, Ahat) + 0.05 * L_comfort(S, Ahat)
                La = -critic(S, Ahat).mean() + lambda_phys * Lphys
                optA.zero_grad()
                La.backward()
                optA.step()

                # Soft update targets
                soft_update(actor_t, actor, tau)
                soft_update(critic_t, critic, tau)

            if done:
                break
4.4.5. Design Notes and Caveats
- Constraint handling is architectural (projection plus loss) rather than a formal guarantee of the kind provided by constrained optimal control; results are from simulation and depend on model fidelity.
- Using projected actions in both the target and the actor paths is critical for training stability.
4.5. Hyperparameters and Implementation Details
- Algorithm: DDPG with target networks and OU exploration.
- Actor/Critic LRs: ; Batch: 64; Buffer: .
- Discount/Soft-update: γ = 0.99, τ = 0.005.
- Exploration: OU noise σ = 0.1 (decayed).
- Runs: 50 independent seeds.
- Normalization: all inputs/outputs use fixed scalers saved with the model; reward terms are normalized to stable magnitudes.
- Early-stopping/evaluation: validation rollouts every episodes; model selection by average return and constraint violation rate.
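Two of the ingredients listed above, OU exploration noise with σ = 0.1 and soft target updates with τ = 0.005, can be sketched in a few lines of plain Python. The mean-reversion rate `theta`, the per-sample decay factor, and the flat-list representation of network weights are assumptions for illustration; a real implementation would operate on framework tensors.

```python
import math
import random
from typing import List


class OUNoise:
    """Ornstein-Uhlenbeck noise with a geometrically decaying sigma."""

    def __init__(self, dim: int, sigma: float = 0.1, theta: float = 0.15,
                 dt: float = 1.0, decay: float = 0.999, seed: int = 0):
        self.x = [0.0] * dim
        self.sigma, self.theta, self.dt, self.decay = sigma, theta, dt, decay
        self.rng = random.Random(seed)

    def sample(self) -> List[float]:
        for i in range(len(self.x)):
            drift = -self.theta * self.x[i] * self.dt            # pull toward zero mean
            diffusion = self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            self.x[i] += drift + diffusion
        self.sigma *= self.decay                                  # anneal exploration
        return list(self.x)


def soft_update(target: List[float], source: List[float], tau: float = 0.005) -> List[float]:
    """Polyak averaging: target <- tau * source + (1 - tau) * target."""
    return [tau * s + (1.0 - tau) * t for t, s in zip(target, source)]


noise = OUNoise(dim=5, seed=1)
first = noise.sample()
new_target = soft_update([0.0, 2.0], [1.0, 2.0])
```

With τ = 0.005 the target network moves 0.5% of the way toward the online network at each update, which is what keeps the bootstrapped critic targets slowly varying.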
4.6. Statistical Evaluation and Uncertainty Quantification
- Network initialization;
- Replay buffer order;
- Weather-day sampling sequence.
- p < 0.05 → significant;
- p < 0.01 → highly significant.
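A percentile bootstrap over per-seed results, together with the significance labels above, might look like the following sketch. The number of resamples, the percentile method, and the helper names are assumptions; the text states only that 95% bootstrap CIs over 50 seeds are reported.

```python
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_ci(samples: List[float], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and collect the mean of each resample.
    means = sorted(mean(rng.choices(samples, k=n)) for _ in range(n_boot))
    lo = means[int((alpha / 2.0) * n_boot)]
    hi = means[int((1.0 - alpha / 2.0) * n_boot) - 1]
    return lo, hi


def significance_label(p_value: float) -> str:
    """Map a p-value to the labels used in the text."""
    if p_value < 0.01:
        return "highly significant"
    if p_value < 0.05:
        return "significant"
    return "not significant"


# Hypothetical per-seed metric values for 50 seeds.
per_seed = [100.0 + (i % 7) for i in range(50)]
ci_lo, ci_hi = bootstrap_ci(per_seed)
```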
5. Experimental Setup
- Building Archetype and Zone Layout
5.1. Building Simulation Environment
- Thermal capacitance $C_i$: heat storage of zone air and interior surfaces.
- Inter-zone conduction $U_{ij}A_{ij}$: walls/floors/ceilings between adjacent zones.
- Envelope exchange $U_{\mathrm{out}}A_{\mathrm{out}}$: external walls, roof, glazing; convective exchange with outdoor air.
- Solar gains $Q_{\mathrm{sol}}$: computed from window orientation, glazing properties, and synthetic irradiance profiles (direct + diffuse).
- Internal gains $Q_{\mathrm{int}}$: occupants/lighting/equipment based on office-style schedules.
- HVAC sensible/latent terms: via supply mass flow $\dot{m}$, supply temperature $T_{\mathrm{sup}}$, and humidity ratio $\omega$ (Section 3).
- State integration: forward Euler, $\Delta t = 60$ s (internal step).
- Control interval: 5 min (the agent acts every 5 min).
- Episode length: 288 steps (24 h per episode) unless otherwise noted.
- Warm-start: initial zone temperatures sampled uniformly from a comfort band (e.g., )
- Rate limits: per control step, where ≈ ( Δt)/Ci.
- Psychrometric: ϕ ∈ [0, 1] and consistent ω.
- Equipment: fan/pump/chiller capacity and ramp constraints.
Thermal Simulator Validation
5.2. Baseline Controllers and Fairness Protocol
5.2.1. Rule-Based Controller (RBC)
- Occupied band: 21–25 °C (08:00–18:00); Unoccupied: 18–30 °C [45].
- Heating/Cooling enable: Two-position with deadband 1.0 °C; anti-short-cycle timer = 10 min.
- Morning warmup: If at 07:00, preheat with SAT ramp to 32 °C capped by rate-limit.
- Outdoor air (OA) damper: min 10%, economizer up to 100% if and humidity < 65%.
- VAV flows: zone PI loops (Kp = 0.7, Ki = 0.03) to hold the active setpoint.
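The two-position logic and the zone PI loops described above can be sketched as follows. The deadband (1.0 °C), the 10 min anti-short-cycle timer (two 5 min steps), and the gains Kp = 0.7 and Ki = 0.03 come from the list; the class names, the per-step integration, the output clamp to [0, 1], and the anti-windup rule are assumptions.

```python
class TwoPositionCooling:
    """Two-position cooling enable with a deadband and a minimum hold time."""

    def __init__(self, deadband_c: float = 1.0, min_hold_steps: int = 2):
        self.deadband_c = deadband_c
        self.min_hold_steps = min_hold_steps          # 10 min at a 5 min control step
        self.on = False
        self.steps_since_switch = min_hold_steps      # permit an immediate first switch

    def step(self, t_zone: float, t_set: float) -> bool:
        self.steps_since_switch += 1
        half = 0.5 * self.deadband_c
        want_on = self.on
        if t_zone > t_set + half:
            want_on = True
        elif t_zone < t_set - half:
            want_on = False
        # Anti-short-cycle: only switch once the minimum hold has elapsed.
        if want_on != self.on and self.steps_since_switch >= self.min_hold_steps:
            self.on = want_on
            self.steps_since_switch = 0
        return self.on


class PILoop:
    """Discrete PI loop with output clamping and conditional integration."""

    def __init__(self, kp: float = 0.7, ki: float = 0.03,
                 u_min: float = 0.0, u_max: float = 1.0):
        self.kp, self.ki, self.u_min, self.u_max = kp, ki, u_min, u_max
        self.integral = 0.0

    def step(self, error: float) -> float:
        candidate = self.integral + error
        u = self.kp * error + self.ki * candidate
        if self.u_min <= u <= self.u_max:
            self.integral = candidate                 # integrate only while unsaturated
        return max(self.u_min, min(self.u_max, u))


ctrl = TwoPositionCooling()
states = [ctrl.step(24.0, 23.0), ctrl.step(22.0, 23.0), ctrl.step(22.0, 23.0)]
pi = PILoop()
u_small = pi.step(1.0)
u_large = pi.step(10.0)
```

In the example the cooling stage stays on for two steps after switching, even though the zone has already dropped below the lower edge of the deadband, which is the behavior the anti-short-cycle timer is meant to enforce.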
5.2.2. Model Predictive Control (MPC)
- MPC-PF (Perfect Forecasts): uses simulator-truth weather/internal-gains (favorable to MPC).
- MPC-EF (Error Forecasts): uses biased/noisy forecasts (hour-ahead MAE: 1.0 °C for outdoor temperature, 15% for solar; gains ±10%).
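To generate forecast errors with a prescribed MAE, one option is zero-mean Gaussian noise whose standard deviation is chosen from the identity E|X| = σ·sqrt(2/π), so that σ = MAE·sqrt(π/2). The sketch below applies this to a temperature series; the Gaussian error model, the optional bias argument, and the function names are assumptions, since the text gives only the target error levels.

```python
import math
import random
from typing import List


def sigma_for_mae(mae: float) -> float:
    """Standard deviation of zero-mean Gaussian noise that yields the given MAE."""
    return mae * math.sqrt(math.pi / 2.0)


def noisy_temperature_forecast(truth_c: List[float], mae_c: float = 1.0,
                               bias_c: float = 0.0, seed: int = 0) -> List[float]:
    """Perturb simulator-truth temperatures with additive Gaussian error.

    A nonzero bias raises the realized MAE above mae_c.
    """
    rng = random.Random(seed)
    sigma = sigma_for_mae(mae_c)
    return [t + bias_c + rng.gauss(0.0, sigma) for t in truth_c]


truth = [20.0] * 20000
forecast = noisy_temperature_forecast(truth, mae_c=1.0, seed=1)
realized_mae = sum(abs(f - t) for f, t in zip(forecast, truth)) / len(truth)
```

For a 1.0 °C MAE this gives σ of about 1.25 °C; the same construction can be applied multiplicatively for the 15% solar error.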
5.2.3. RL Baselines
- Standard DDPG: identical network sizes, replay ratio, target-update, and exploration schedule as TC-DDPG; no feasibility/projection layer, no physics regularization.
- TC-DDPG (ours): adds the differentiable thermodynamic feasibility projection and physics-regularized loss.
- Both RL agents:
- Train on the same scenario set and seeds (50 seeds).
- Observe the same state vector (temperatures, OA, schedules, etc.).
- Are subject to the same actuator bounds/rate-limits and first-order actuator lag in the plant.
- Use the same episode length, learning steps per episode, and wall-clock training budget (early-stopping by validation reward).
5.2.4. Fairness Charter
- Common plant and constraints: identical physics, bounds, rate-limits, latencies, and disturbances.
- Matched knowledge: RBC/MPC/RL all receive the same state; MPC-PF is reported separately from MPC-EF.
- Budget parity: equal hyperparameter-tuning passes (grid for MPC weights; sweep for RL rewards) and identical seed counts.
- Identical metrics: energy (kWh), comfort drift (°C·h outside band), violation counts, and 95% CIs with the same bootstrap.
- Transparent reporting: we present both MPC-PF and MPC-EF to avoid over-crediting perfect foresight.
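The comfort drift metric (°C·h outside the band) can be computed from a zone temperature trace as in this sketch. The default band is the occupied band from Section 5.2.1 and the step is the 5 min control interval; the function names and the optional occupancy mask are assumptions.

```python
from typing import List, Optional, Tuple


def comfort_drift_degree_hours(temps_c: List[float],
                               band: Tuple[float, float] = (21.0, 25.0),
                               dt_hours: float = 5.0 / 60.0,
                               occupied: Optional[List[bool]] = None) -> float:
    """Integrate the temperature excursion outside the comfort band over time."""
    lo, hi = band
    if occupied is None:
        occupied = [True] * len(temps_c)
    total = 0.0
    for t, occ in zip(temps_c, occupied):
        if occ:
            # Excursion is zero inside the band and grows linearly outside it.
            total += max(0.0, lo - t, t - hi)
    return total * dt_hours


def violation_count(temps_c: List[float], band: Tuple[float, float] = (21.0, 25.0)) -> int:
    """Number of steps with the temperature outside the band."""
    lo, hi = band
    return sum(1 for t in temps_c if t < lo or t > hi)


trace = [20.0, 22.0, 26.0, 27.0]
drift = comfort_drift_degree_hours(trace)
```

For the four-step trace the excursions are 1, 0, 1, and 2 °C, so the drift is 4 °C × (1/12) h, or about 0.33 °C·h.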
| Channel | Symbol | Range | Max Step Δ per 5 min | Units |
|---|---|---|---|---|
| Zone setpoint | | 20–24 | 0.5 | °C |
| Supply temp | | 12–20 | 1.0 | °C |
| Supply flow | | 0–10 | 1.0 | kg·s−1 |
| OA damper | | 0–1 | 0.2 | — |
| Chiller load | | 0–1 | 0.2 | — |
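The bounds and per-step limits tabulated above can be enforced with a simple two-stage clamp, sketched below in plain Python. This covers only the box and rate constraints; the energy-balance and psychrometric parts of the paper's projection are not reproduced, and the channel order and function names are assumptions.

```python
from typing import List

# Channel order (assumed): zone setpoint, supply temp, supply flow, OA damper, chiller load.
LO = [20.0, 12.0, 0.0, 0.0, 0.0]
HI = [24.0, 20.0, 10.0, 1.0, 1.0]
MAX_STEP = [0.5, 1.0, 1.0, 0.2, 0.2]   # maximum change per 5 min control step


def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def project_box_and_rate(a_prev: List[float], a_cmd: List[float]) -> List[float]:
    """Limit each channel to its rate window around a_prev, then to its absolute range."""
    out = []
    for prev, cmd, lo, hi, dmax in zip(a_prev, a_cmd, LO, HI, MAX_STEP):
        rate_limited = clamp(cmd, prev - dmax, prev + dmax)
        out.append(clamp(rate_limited, lo, hi))
    return out


a_prev = [22.0, 16.0, 5.0, 0.5, 0.5]
a_cmd = [30.0, 10.0, 5.5, 1.0, 0.0]
a_proj = project_box_and_rate(a_prev, a_cmd)
```

In an autodiff framework the same clamps pass gradients through unchanged wherever the command is inside its limits and block them where a limit is active, which is what makes this kind of projection usable inside the actor update.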
| Aspect | Shared by All | RBC | MPC-PF/MPC-EF | Standard DDPG | TC-DDPG (Ours) |
|---|---|---|---|---|---|
| Plant, bounds, rate-limits | Yes | — | — | — | — |
| Comfort bands and schedules | Yes | — | — | — | — |
| Forecast type | — | n/a | PF: perfect; EF: biased/noisy | n/a | n/a |
| Actuator lag in plant | Yes | Honored | Honored | Honored | Honored |
| Hyperparam tuning budget | Equal | Deadband sweep | $(\alpha, \beta, \gamma)$ grid | Reward sweep | Reward + λ_phys sweep |
| Observations | Same | Setpoint/zone | Forecasts/zone | Zone state | Zone + projection residuals |
| Optimization/Training budget | Equal | — | Same horizon/solver | Same steps/seeds | Same steps/seeds |
| Method | Energy (kWh) | Comfort Drift (°C·h) | Violations (Count/Day) |
|---|---|---|---|
| RBC | 100.9 ± 2.8 | 7.6 ± 1.1 | 1.8 ± 0.5 |
| MPC-PF | 93.4 ± 2.1 | 3.9 ± 0.7 | 0.0 ± 0.0 |
| MPC-EF | 96.8 ± 2.5 | 5.2 ± 0.8 | 0.0 ± 0.0 |
| Standard DDPG | 92.7 ± 2.3 | 4.8 ± 0.9 | 2.6 ± 0.7 |
| TC-DDPG (ours) | 88.9 ± 2.0 | 3.1 ± 0.6 | 0.4 ± 0.2 |
5.3. Training Configuration
- Normalization: all state features and reward components are normalized using fixed scalers saved with the model.
- Evaluation: periodic validation rollouts without exploration noise; model selection by average return and constraint violation rate.
- Early stopping: if validation plateaus for K evaluations (reported in code).
- Software: Python 3.10+, PyTorch ≥ 2.0.
- Hardware (reference): a single consumer GPU (e.g., RTX-class) is sufficient; CPU-only is feasible with longer training time
Baseline Configuration
- Rule-based. Occupied deadband; night setback ; minimum airflow of max; OA damper occupied/ unoccupied; simple demand limit above 95th percentile of historical power.
- MPC. Linearized RC predictor; horizon H = 24 steps (2 h), move-blocking 2 steps; quadratic cost on energy, setpoint tracking, and demand; hard bounds and rate limits as in Table 6; solver: OSQP via CVXPY; forecasts: perfect (simulator truth) for outdoor temperature, occupancy, and solar (favorable to MPC).
- Standard DDPG. Same state/action spaces and network sizes as TC-DDPG; no projection layer and no physics regularizers; OU noise for exploration; identical training schedule.
6. Theoretical Framework Validation and Simulated Performance
6.1. Physics-Based Validation Methodology
- Unit-tested implementation of all thermal/psychrometric relations (Section 3).
- Energy balance residuals checked per step with relative tolerance ≤ 1 × 10−4.
- Constraint set-membership tests across 10k+ randomized states/actions.
- Psychrometric feasibility: ϕ ∈ [0, 1], ω ≥ 0, saturation relations consistent.
- Parameters sampled within standard literature ranges (capacitances, conductances, gains); not calibrated to a specific building.
- Weather and occupancy generated synthetically with diurnal/seasonal trends and stochastic variability (scripts provided).
- Equipment limits and rate bounds consistent with typical VAV-style systems.
- Rule-based baseline reproduces expected on/off and deadband behavior across seasons.
- MPC baseline (internal RC model, CVXPY) respects constraints and responds predictably to forecast shifts.
- Standard DDPG baseline (no physics) matches published qualitative trends (faster but less safe exploration)
- Monte Carlo: parameter draws; report dispersion of metrics.
- Sensitivity: ±30% sweeps over key parameters (e.g., , gains).
- Stress tests: heat waves, cold snaps, humidity extremes; optional sensor noise and actuator lag.
- Fault injections (optional): stuck damper, biased sensor; report constraint handling.
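The residual and set-membership checks listed above can be expressed as small test helpers. The sketch assumes a relative-residual definition normalized by the larger magnitude and a generic `project` callable that maps a raw action to a feasible one; both are illustrative choices, and only the 1 × 10−4 tolerance and the 10k randomized trials are taken from the text.

```python
import random
from typing import Callable, List


def relative_residual(lhs: float, rhs: float, eps: float = 1e-12) -> float:
    """Absolute mismatch normalized by the larger magnitude."""
    return abs(lhs - rhs) / max(abs(lhs), abs(rhs), eps)


def check_energy_balance(c_j_per_k: float, d_temp_k: float, dt_s: float,
                         q_net_w: float, rtol: float = 1e-4) -> bool:
    """True if the stored-energy change C*dT matches q_net*dt within rtol."""
    return relative_residual(c_j_per_k * d_temp_k, q_net_w * dt_s) <= rtol


def random_membership_test(project: Callable[[List[float]], List[float]],
                           lo: List[float], hi: List[float],
                           n_trials: int = 10_000, seed: int = 0) -> bool:
    """Feed randomized raw actions, many outside the box, and verify the outputs."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        raw = [rng.uniform(l - (h - l), h + (h - l)) for l, h in zip(lo, hi)]
        projected = project(raw)
        if any(not (l <= x <= h) for x, l, h in zip(projected, lo, hi)):
            return False
    return True


LO = [20.0, 12.0, 0.0, 0.0, 0.0]
HI = [24.0, 20.0, 10.0, 1.0, 1.0]


def clip_project(raw: List[float]) -> List[float]:
    """Stand-in projection used only to exercise the test helper."""
    return [max(l, min(h, x)) for x, l, h in zip(raw, LO, HI)]


ok_balance = check_energy_balance(1.0e6, 0.066, 60.0, 1100.0)
ok_members = random_membership_test(clip_project, LO, HI)
```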
6.2. Synthetic Data Generation
Algorithm 2. Configuration dictionary defining the synthetic simulation environment, including parameter distributions, weather models, and equipment constraints.

    sim_cfg = {
        "zones": 5,
        "dt_ode_sec": 60,        # internal integrator step
        "dt_ctrl_sec": 300,      # 5-min control interval
        "horizon_steps": 288,    # 24 h per episode
        "params": {
            "C_i_J_per_K": "Uniform [0.8e6, 1.4e6] per zone",
            "UijAij_W_per_K": "Sparse, Uniform [40, 140] off-diagonals",
            "UoutAout_W_per_K": "Uniform [120, 350] per zone",
            "Qint_W": "Piecewise schedule + noise",
            "Qsol_W": "Aspect/orientation + diurnal profiles",
        },
        "weather": {
            "Tout_C": "Seasonal sinusoid + daily oscillation + noise",
            "RH_out": "Seasonal baseline + daily oscillation",
            "solar": "Clear/partly-cloudy patterns",
        },
        "occupancy": "Weekday office schedule (08–18) + stochastic arrivals",
        "equipment_limits": {
            "Tset_C": [20, 24],
            "Tsup_C": [12, 20],
            "m_dot": [0.0, 10.0],    # kg/s per zone
            "damper": [0.0, 1.0],
            "chiller": [0.0, 1.0],
        },
    }
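Drawing one building instance from the distributions named in the configuration might look like the sketch below. The uniform ranges, zone count, and time steps are taken from the configuration; the chain-style adjacency used to make the inter-zone matrix sparse, the function name, and the returned dictionary layout are assumptions.

```python
import random
from typing import Dict, List


def sample_zone_params(n_zones: int = 5, seed: int = 0) -> Dict[str, list]:
    """Sample capacitances and conductances for one synthetic building."""
    rng = random.Random(seed)
    C = [rng.uniform(0.8e6, 1.4e6) for _ in range(n_zones)]          # J/K per zone
    UA_out = [rng.uniform(120.0, 350.0) for _ in range(n_zones)]     # W/K per zone
    # Sparse, symmetric inter-zone conductances: only neighbours i and i+1
    # are coupled here (assumed adjacency, for illustration).
    UA: List[List[float]] = [[0.0] * n_zones for _ in range(n_zones)]
    for i in range(n_zones - 1):
        g = rng.uniform(40.0, 140.0)                                  # W/K
        UA[i][i + 1] = g
        UA[i + 1][i] = g
    return {"C": C, "UA": UA, "UA_out": UA_out}


DT_ODE_SEC = 60
DT_CTRL_SEC = 300
HORIZON_STEPS = 288

substeps_per_action = DT_CTRL_SEC // DT_ODE_SEC          # integrator steps per control step
episode_hours = HORIZON_STEPS * DT_CTRL_SEC / 3600.0     # episode length in hours
params = sample_zone_params()
```

Fixing the seed makes each sampled building reproducible, which matters when the same parameter draws must be shared across controllers for a fair comparison.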
6.3. Simulated Energy Performance from Framework Validation

| Method | Energy Use (kWh/m2·yr) | Savings vs. Baseline | Peak Power (kW) | COP (–) |
|---|---|---|---|---|
| Rule-Based (Baseline) | 187.3 ± 4.2 | — | 498.6 ± 12.3 | 2.87 ± 0.08 |
| MPC | 156.8 ± 3.8 | 16.3% | 456.2 ± 11.7 | 3.42 ± 0.09 |
| Standard DDPG | 152.4 ± 4.1 | 18.6% | 441.8 ± 10.9 | 3.68 ± 0.11 |
| TC-DDPG (Ours) | 122.4 ± 3.6 (95% CI: 119.0–125.7) | 34.7% | 320.1 ± 9.2 | 4.12 ± 0.10 |
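The savings column follows directly from the mean energy-use values, as the short calculation below shows. The dictionary simply restates the table means; the function name is hypothetical.

```python
# Mean energy use intensity from the table [kWh/m2.yr].
EUI = {
    "Rule-Based": 187.3,
    "MPC": 156.8,
    "Standard DDPG": 152.4,
    "TC-DDPG": 122.4,
}


def savings_pct(method: str, baseline: str = "Rule-Based") -> float:
    """Relative reduction in energy use versus the baseline, in percent."""
    return 100.0 * (EUI[baseline] - EUI[method]) / EUI[baseline]


savings = {name: round(savings_pct(name), 1) for name in EUI if name != "Rule-Based"}
# savings == {"MPC": 16.3, "Standard DDPG": 18.6, "TC-DDPG": 34.7}
```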

6.4. Summary of Validation Results
- Energy savings: 34.7% vs. rule-based; 16.1 percentage points better than standard DDPG.
- Comfort: Occupied-hour PMV within [−0.5, 0.5] for 98.3% of hours (mean); lower setpoint deviation than baselines.
- Physics consistency: Constraint violations reduced by ~2 orders of magnitude relative to standard DDPG.
- Convergence: Faster learning (Section 6.9) with reduced exploration of infeasible regions.
6.5. Comfort Metrics
6.6. Physics Constraint Satisfaction
6.7. Robustness to Sensing/Actuation Imperfections and Faults
- Sensor noise: with
- Sensor bias: with
- Telemetry latency: controller observes measurements delayed by 1–5 control steps (5–25 min at our 5 min control period)
- Actuator lag (first-order): min, applied to supply-air temperature (SAT), flow, damper position, and chiller load.
- Saturation & rate limits: and
- Stuck outdoor air damper (partial-open): damper held at 20% for 2 h (08:00–10:00).
- Temperature sensor bias spike: +1.5 °C bias applied to South zone for 3 h (13:00–16:00).
- Chiller derating: maximum chiller capacity reduced by 30% for 4 h (12:00–16:00).
- Operator override: occupied-hour setpoint band forcibly changed to 23–26 °C for 2 h (11:00–13:00) independent of the agent.
- Telemetry dropouts: 10% missing measurements replaced by last value carried forward (LVCF).
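Several of these imperfections are easy to emulate as thin wrappers around the measurement and command streams. The sketch below shows last-value-carried-forward for dropouts, a fixed telemetry delay, and a first-order actuator lag using the exact discretization α = 1 − exp(−Δt/τ). The function names and the example time constant are assumptions; the time constant in particular is a placeholder, not the paper's setting.

```python
import math
from typing import List, Optional


def lvcf(values: List[Optional[float]]) -> List[Optional[float]]:
    """Replace missing samples (None) with the last valid value."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out


def delay(values: List[float], d_steps: int) -> List[float]:
    """Observation at step t is the value from step t - d (first value held initially)."""
    return [values[max(0, t - d_steps)] for t in range(len(values))]


def first_order_lag(commands: List[float], tau_s: float,
                    dt_s: float = 300.0, y0: float = 0.0) -> List[float]:
    """Actuator position responding to commands with time constant tau_s."""
    alpha = 1.0 - math.exp(-dt_s / tau_s)
    y, out = y0, []
    for u in commands:
        y += alpha * (u - y)      # move a fraction alpha toward the command
        out.append(y)
    return out


filled = lvcf([1.0, None, None, 2.0])
late = delay([1.0, 2.0, 3.0, 4.0], d_steps=2)
response = first_order_lag([1.0] * 5, tau_s=300.0)   # hypothetical 5 min time constant
```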
6.8. Model Architecture Summary
- Parameters: ~0.47M trainable (actor + critic + small constraint/aux heads).
- Model size: ~0.55 MB (fp32 weights saved with scalers).
- Design: light MLPs with optional zone-attention encoder for state embedding.
6.9. Convergence Analysis
- Episodes to convergence (standard DDPG).
- Speedup:
- Mechanism: the projection reduces the effective action space and discourages trajectories that violate constraints, improving sample efficiency.
6.10. Computational Complexity
6.11. Validation Confidence and Limitations
- Energy performance: 85–90% confidence that realized savings will lie within ±8% of simulated values under similar assumptions.
- Comfort: 90–95% confidence that the relative ranking (TC-DDPG > MPC > Standard DDPG > Rule-Based) holds under modest distribution shifts.
- Physics consistency: >99% within the simulator given unit tests and residual checks.
- Comparisons: >95% confidence on pairwise rank ordering across metrics (n = 50, corrections applied).
7. Framework Validation and Analysis
7.1. Ablation Study (Framework Analysis)
7.2. Key Insights and Mechanisms
7.3. Sensitivity and Hyperparameter Robustness
7.4. Reward Weight Sensitivity
8. Deployment Considerations and Future Implementation
8.1. Implementation Pathway
8.2. Expected Real-World Challenges
9. Discussion
9.1. Theoretical Contributions
9.2. Empirical Validation Roadmap
- Hardware and I/O: Access to a BMS with programmable points (setpoints/commands), high-resolution telemetry, and safety overrides; time sync and reliable trend logging.
- Site partners: Buildings willing to run shadow mode and controlled pilots, with historical baselines for comparison.
- Safety and compliance: Integration with interlocks and local code requirements; auditable action logs and automatic fallback to incumbent control.
- Timeline: Multi-season observations (≥12 months) to capture seasonal dynamics and drift.
- Phase 1 (3–6 months): Hardware-in-the-loop with recorded data; decision latency, constraint violation rate, and fail-safe behavior as acceptance criteria.
- Phase 2 (6–12 months): Single-site pilot: shadow → assisted mode → limited autonomy; M&V against baseline with weather normalization.
- Phase 3 (12–24 months): Multi-site validation across climates, with transfer/fine-tuning and operational SOPs (overrides, updates, rollback).
9.3. Comparison with Existing Approaches
9.4. Broader Impact
- Smart grids: feeder and transformer limits, power-flow consistency.
- Water networks: hydraulic feasibility and pump curves.
- Industrial processes: reaction/phase equilibrium constraints.
- Transportation: vehicle and traffic flow dynamics.
9.5. Limitations
- Model–reality gap: unmodeled effects (thermal bridges, infiltration variability, operator overrides) and equipment aging can alter real responses.
- Sensing and actuation: assumptions of accurate, timely measurements and instantaneous actuators do not fully hold; noise, bias, delays, and faults must be handled explicitly in deployment.
- Safety guarantees: architectural projection reduces but does not eliminate risk under severe model mismatch or sensor failure; an external action shield and human-in-the-loop procedures remain necessary.
- Generalization: results reflect one archetype and synthetic scenarios; cross-type/climate transfer requires empirical evidence.
- Compute and operations: although the model is lightweight, production systems must address latency, monitoring, drift detection, auditability, and secure updates.
10. Conclusions
- A thermodynamic constraint layer that projects actions into a feasible region during the forward pass (feasibility enforced by design, subject to model accuracy and numerical precision).
- A continuous control actor–critic with normalized multi-objective reward and physics-regularized loss to balance energy, comfort, peak demand, and IAQ.
- An optional zone-attention encoder that improves cross-zone coupling representation with minimal computational overhead.
- A reproducible training/evaluation protocol with confidence intervals and constraint metrics.
Research Enabling Framework
- Benchmarking baselines and metrics: rule-based, MPC, and standard DDPG baselines; energy/comfort/peak/violation metrics with CIs.
- Multi-building coordination. Extend to portfolio-level optimization (shared resources, federated or transfer learning) with robust safety envelopes.
- Grid integration. Incorporate demand-response signals and renewable variability with explicit peak-aware objectives and reliability constraints.
- Fault-tolerant control. Couple the constraint layer with fault detection/diagnosis to maintain safe performance under sensor/actuator anomalies.
- Human-centric objectives. Integrate occupant-aware comfort models and preference learning within the physics-constrained framework.
- Climate adaptation. Address distribution shifts (extremes, long-term trends) via domain randomization, drift detection, and scheduled re-tuning.
- Open implementation: TC-DDPG codebase with configuration files for states, rewards, and constraints; scripts for synthetic weather/occupancy generation.
- Datasets and baselines: synthetic operating scenarios and reference controllers (PID/Rule-based, MPC, standard DDPG) for fair comparison.
- Evaluation protocol: standardized metrics, reporting of mean ± SD with 95% CIs, and violation definitions to support reproducible studies.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
|---|---|
| AHRI | Air-Conditioning, Heating, and Refrigeration Institute |
| ASHRAE | American Society of Heating, Refrigerating, and Air-Conditioning Engineers |
| ASHRAE 55 | ASHRAE Standard 55: Thermal Environmental Conditions for Human Occupancy |
| BAS | Building Automation System (used interchangeably with BMS) |
| BMS | Building Management System |
| CI | Confidence Interval |
| CIs (95%) | 95% bootstrap confidence intervals (as reported for metrics) |
| CBECS | Commercial Buildings Energy Consumption Survey (U.S. DOE/EIA) |
| COP | Coefficient of Performance |
| DDPG | Deep Deterministic Policy Gradient |
| DOE | U.S. Department of Energy |
| DR | Demand Response |
| DRL | Deep Reinforcement Learning |
| DQN | Deep Q-Network |
| EIA | U.S. Energy Information Administration |
| EUI | Energy Use Intensity (kWh·m−2·yr−1) |
| HIL | Hardware-in-the-Loop |
| HVAC | Heating, Ventilation, and Air-Conditioning |
| IAQ | Indoor Air Quality |
| ISO 7730 | International standard for PMV/PPD thermal comfort |
| MDPI | Multidisciplinary Digital Publishing Institute (publisher of Energies) |
| MPC | Model Predictive Control |
| NOAA | National Oceanic and Atmospheric Administration (weather data validation) |
| PID | Proportional–Integral–Derivative (rule-based) |
| PIML | Physics-Informed Machine Learning (general) |
| PINN | Physics-Informed Neural Network |
| PMV | Predicted Mean Vote (comfort index) |
| PPD | Predicted Percentage Dissatisfied (comfort index) |
| PER | Prioritized Experience Replay |
| QA/QC | Quality Assurance/Quality Control |
| RC (model) | Resistance–Capacitance thermal network model |
| RH | Relative Humidity |
| RL | Reinforcement Learning |
| RMSE | Root Mean Square Error |
| SD | Standard Deviation |
| SOTA | State of the Art |
| TC-DDPG | Thermodynamically-Constrained DDPG (this paper’s method) |
| TMY3 | Typical Meteorological Year (version 3) weather datasets |
| UA | Overall heat-transfer coefficient–area product (U·A) |
| VAV | Variable Air Volume |
| ZAM | Zone Attention Mechanism (inter-zone interaction module) |
Appendix A. Detailed Mathematical Derivations
Appendix A.1. Thermodynamic & Psychrometric Gradient Computation
Appendix A.1.1. Energy Balance Term
Appendix A.1.2. Psychrometric Consistency Term
Appendix A.1.3. Comfort Corridor Term
Appendix A.1.4. Gradients (Chain Rule)
Appendix A.2. Convergence Considerations
Appendix A.3. Nomenclature
| Symbol | Description | Units |
|---|---|---|
| $T$ | Dry-bulb temperature | |
| $T_i$ | Zone $i$ temperature | |
| $C_i$ | Thermal capacitance of zone $i$ | |
| $U_{ij}$ | Thermal conductance between zones $i$ and $j$ | |
| $A_{ij}$ | Exchange area between zones $i$ and $j$ | |
| | HVAC heat flow into zone $i$ | |
| $\omega$ | Humidity ratio | |
| $\phi$ | Relative humidity | – |
| | Water vapor partial pressure | |
| | Saturation vapor pressure at $T$ | |
| | Barometric pressure | |
| | Moist air enthalpy | |
| | Latent heat of vaporization | |
| | Specific heat (dry air) | |
| | Specific heat (water vapor) | |
| PMV, PPD | Comfort metrics (ISO 7730) | – |
| COP | Coefficient of Performance | – |
Appendix B. Implementation Details
Appendix C. Extended Results
References
- Wang, Z.; Hong, T. Reinforcement learning for building controls: The opportunities and challenges. Appl. Energy 2020, 269, 115036. [Google Scholar] [CrossRef]
- ISO 7730:2005; Ergonomics of the Thermal Environment—Analytical Determination and Interpretation of Thermal Comfort Using Calculation of the PMV and PPD Indices and Local Thermal Comfort Criteria. International Organization for Standardization: Geneva, Switzerland, 2005.
- Nagy, Z.; Henze, G.; Dey, S.; Arroyo, J.; Helsen, L.; Zhang, X.; Chen, B.; Amasyali, K.; Kurte, K.; Zamzam, A.; et al. Ten Questions Concerning Reinforcement Learning for Building Energy Management. Build. Environ. 2023, 241, 110435. [Google Scholar] [CrossRef]
- Ziarati, T.; Hedayat, S.; Moscatiello, C.; Sappa, G.; Manganelli, M. Overview of the Impact of Artificial Intelligence on the Future of Renewable Energy. In Proceedings of the 2024 IEEE International Conference on Environment and Electrical Engineering and 2024 IEEE Industrial and Commercial Power Systems Europe (EEEIC/I&CPS Europe), Rome, Italy, 29 June–2 July 2024; pp. 1–6. [Google Scholar] [CrossRef]
- U.S. EIA. Commercial Buildings Energy Consumption Survey (CBECS) 2018; U.S. Energy Information Administration: Washington, DC, USA, 2024. Available online: https://www.eia.gov/consumption/commercial/ (accessed on 4 October 2025).
- Filippova, E.; Hedayat, S.; Ziarati, T.; Manganelli, M. Artificial Intelligence and Digital Twins for Bioclimatic Building Design: Innovations in Sustainability and Efficiency. Energies 2025, 18, 5230. [Google Scholar] [CrossRef]
- Shaikh, P.H.; Nor, N.B.M.; Nallagownden, P.; Elamvazuthi, I.; Ibrahim, T. A Review on Optimized Control Systems for Building Energy and Comfort Management. Renew. Sustain. Energy Rev. 2014, 34, 409–429. [Google Scholar] [CrossRef]
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018; Available online: https://www.andrew.cmu.edu/course/10-703/textbook/BartoSutton.pdf (accessed on 4 October 2025).
- Weinberg, D.; Wang, Q.; Timoudas, T.O.; Fischione, C. A review of RL for controlling Building Energy Systems from a computer-science perspective. Sustain. Cities Soc. 2023, 89, 104351. [Google Scholar] [CrossRef]
- Tien, P.W.; Wu, P.; Choe, S. Machine Learning and Deep Learning Methods for Enhancing Building Energy Efficiency and Indoor Environmental Quality–A Review. Energy AI 2022, 10, 100198. [Google Scholar] [CrossRef]
- Mason, K.; Grijalva, S. A review of reinforcement learning for autonomous building energy management. Comput. Electr. Eng. 2019, 78, 300–312. [Google Scholar] [CrossRef]
- Al Sayed, K.; Boodi, A.; Broujeny, R.S.; Beddiar, K. Reinforcement learning for HVAC control in intelligent buildings: A technical and conceptual review. Smart Energy 2024, 95, 110085. [Google Scholar] [CrossRef]
- Wei, T.; Wang, Y.; Zhu, Q. Deep Reinforcement Learning for Building HVAC Control. In Proceedings of the 54th Annual Design Automation Conference (DAC), Austin, TX, USA, 18–22 June 2017; pp. 1–6. [Google Scholar] [CrossRef]
- Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic Policy Gradient Algorithms. In Proceedings of the ICML’14: Proceedings of the 31st International Conference on International Conference on Machine Learning, Beijing, China, 21–26 June 2014; PMLR: Cambridge, MA, USA, 2014; Volume 32, pp. 387–395. [Google Scholar]
- Yu, L.; Qin, S.; Zhang, M.; Shen, C.; Jiang, T.; Guan, X. A review of Deep Reinforcement Learning for Smart Building Energy Management. IEEE Internet Things J. 2021, 8, 12046–12063. [Google Scholar] [CrossRef]
- Manjavacas, A.; Nieves, A.C.; Jiménez-Raboso, J.; Molina-Solana, M. An experimental evaluation of DRL algorithms for HVAC control (Sinergym). Artif. Intell. Rev. 2024, 57, 173. [Google Scholar] [CrossRef]
- Dai, M.; Li, H.; Wang, S. A reinforcement learning-enabled iterative learning control strategy of air-conditioning systems for building energy saving by shortening the morning start period. Appl. Energy 2023, 334, 120650. [Google Scholar] [CrossRef]
- García, J.; Fernández, F. A Comprehensive Survey on Safe Reinforcement Learning. J. Mach. Learn. Res. 2015, 16, 1437–1480. [Google Scholar]
- Ruelens, F.; Claessens, B.J.; Vandael, S.; De Schutter, B.; Babuska, R.; Belmans, R. Residential Demand Response of Thermostatically Controlled Loads Using Batch Reinforcement Learning. IEEE Trans. Smart Grid 2017, 8, 214–225. [Google Scholar] [CrossRef]
- Esmaeili, M.; Hammes, S.; Tosatto, S.; Geisler-Moroder, D.; Zech, P. Safe Reinforcement Learning for Buildings: Minimizing Energy Use While Maximizing Occupant Comfort. Energies 2025, 18, 5313. [Google Scholar] [CrossRef]
- Sanchez, J.; Cai, J. Constrained RL for building demand response (explicit constraint value function). Appl. Energy 2025, in press. [Google Scholar]
- Vázquez-Canteli, J.R.; Nagy, Z. Reinforcement Learning for Demand Response: A Review. Appl. Energy 2019, 235, 1072–1089. [Google Scholar] [CrossRef]
- Karniadakis, G.E.; Kevrekidis, I.G.; Lu, L.; Perdikaris, P.; Wang, S.; Yang, L. Physics-informed Machine Learning. Nat. Rev. Phys. 2021, 3, 422–440. [Google Scholar] [CrossRef]
- Jiang, Z.; Wang, X.; Li, H.; Hong, T.; You, F.; Drgoňa, J.; Vrabie, D.; Dong, B. Physics-informed ML for building performance simulation—Review. Patterns/Cell Press 2025, 18, 100223. [Google Scholar]
- Saeed, M.H.; Kazmi, H.; Deconinck, G. Dyna-PINN: Physics-informed Deep Dyna-Q for building heating control. Energy Build. 2025, 324, 114879. [Google Scholar] [CrossRef]
- Jiang, Z.; Wang, X.; Dong, B. Physics-informed modularized neural network for DRL-based building control; reports ~31% HVAC savings case study. Adv. Appl. Energy 2025, 19, 100237. [Google Scholar] [CrossRef]
- Drgoňa, J.; Arroyo, J.; Figueroa, I.C.; Blum, D.; Arendt, K.; Kim, D.; Perarnau, E.; Oravec, J.; Wetter, M.; Vrabie, D.L.; et al. All You Need to Know about Model Predictive Control for Buildings. Annu. Rev. Control 2020, 50, 190–232. [Google Scholar] [CrossRef]
- Killian, M.; Kozek, M. Ten Questions Concerning Model Predictive Control for Energy Efficient Buildings. Build. Environ. 2016, 105, 403–412. [Google Scholar] [CrossRef]
- Oldewurtel, F.; Parisio, A.; Jones, C.N.; Gyalistras, D.; Gwerder, M.; Stauch, V.; Lehmann, B.; Morari, M. Use of Model Predictive Control and Weather Forecasts for Energy Efficient Building Climate Control. Energy Build. 2012, 45, 15–27. [Google Scholar] [CrossRef]
- Dobbs, J.R.; Hencey, B.M. Model Predictive HVAC Control with Online Occupancy Model. Energy Build. 2014, 82, 675–684. [Google Scholar] [CrossRef]
- Privara, S.; Váňa, Z.; Široký, J.; Ferkl, L.; Cigler, J.; Oldewurtel, F. Building Modeling as a Crucial Part for Building Predictive Control. Energy Build. 2013, 56, 8–22. [Google Scholar] [CrossRef]
- Zhang, Z.; Lam, K.P. Practical implementation and evaluation of deep reinforcement learning control for a radiant heating system. In Proceedings of the 5th Conference on Systems for Built Environments, Shenzen, China, 7–8 November 2018; pp. 148–157. [Google Scholar] [CrossRef]
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level Control through Deep Reinforcement Learning. Nature 2015, 518, 529–533.
- Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous Control with Deep Reinforcement Learning. arXiv 2015, arXiv:1509.02971.
- Fujimoto, S.; van Hoof, H.; Meger, D. Addressing Function Approximation Error in Actor–Critic Methods. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; PMLR: Cambridge, MA, USA, 2018; Volume 80, pp. 1587–1596.
- Raissi, M.; Perdikaris, P.; Karniadakis, G.E. Physics-Informed Neural Networks: A Deep Learning Framework for Solving Forward and Inverse Problems Involving Nonlinear PDEs. J. Comput. Phys. 2019, 378, 686–707.
- Afram, A.; Janabi-Sharifi, F. Review of Modeling Methods for HVAC Systems. Appl. Therm. Eng. 2014, 67, 507–519.
- ASHRAE Handbook—Fundamentals; Chapter 1: Psychrometrics; ASHRAE: Atlanta, GA, USA, 2021.
- Deru, M.; Field, K.; Studer, D.; Benne, K.; Griffith, B.; Torcellini, P. U.S. Department of Energy Commercial Reference Building Models of the National Building Stock; NREL/TP-5500-46861; NREL: Golden, CO, USA, 2011. Available online: https://www.nrel.gov/docs/fy11osti/46861.pdf (accessed on 4 October 2025).
- ASHRAE Guideline 14-2014; Measurement of Energy, Demand, and Water Savings. ASHRAE: Atlanta, GA, USA, 2014.
- Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized Experience Replay. arXiv 2016, arXiv:1511.05952.
- Efron, B. Bootstrap Methods: Another Look at the Jackknife. Ann. Stat. 1979, 7, 1–26.
- Wilcox, S.; Marion, W. Users Manual for TMY3 Data Sets; NREL/TP-581-43156; NREL: Golden, CO, USA, 2008.
- Fanger, P.O. Thermal Comfort: Analysis and Applications in Environmental Engineering; Danish Technical Press: Copenhagen, Denmark, 1970.
- ASHRAE Standard 55-2020; Thermal Environmental Conditions for Human Occupancy. ASHRAE: Atlanta, GA, USA, 2020.
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. Adv. Neural Inf. Process. Syst. (NeurIPS) 2019, 32, 8024–8035.
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2015, arXiv:1412.6980.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. (NeurIPS) 2017, 30, 5998–6008.
- Roijers, D.M.; Vamplew, P.; Whiteson, S.; Dazeley, R. A Survey of Multi-Objective Sequential Decision-Making. J. Artif. Intell. Res. 2013, 48, 67–113.

| Zone | RMSE (°C) | MBE (°C) | Correlation r |
|---|---|---|---|
| North | 0.35 | 0.03 | 0.992 |
| South | 0.39 | −0.04 | 0.986 |
| East | 0.42 | 0.02 | 0.981 |
| West | 0.37 | −0.05 | 0.988 |
| Core | 0.30 | 0.01 | 0.995 |
| Mean ± SD | 0.37 ± 0.04 | — | 0.988 ± 0.005 |
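
The three validation metrics in the table above can be computed as in the following minimal sketch. It assumes the bias is defined as simulated minus reference temperature, and the two short traces are hypothetical values chosen only for illustration. The Mean ± SD row for RMSE is reproduced when SD is taken as the population standard deviation across the five zones, which is an inference from the numbers rather than something the table states.

```python
import math
import statistics

def rmse(sim, ref):
    """Root-mean-square error between simulated and reference series."""
    return math.sqrt(sum((s - r) ** 2 for s, r in zip(sim, ref)) / len(sim))

def mbe(sim, ref):
    """Mean bias error, assumed here to be simulated minus reference."""
    return sum(s - r for s, r in zip(sim, ref)) / len(sim)

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical zone temperature traces in degrees Celsius (not from the paper).
ref = [21.0, 21.4, 22.1, 22.8, 23.0, 22.5]
sim = [21.2, 21.3, 22.4, 22.6, 23.3, 22.4]
print(rmse(sim, ref), mbe(sim, ref), pearson_r(sim, ref))

# Per-zone RMSE values copied from the table (North, South, East, West, Core).
zone_rmse = [0.35, 0.39, 0.42, 0.37, 0.30]
summary = (round(statistics.mean(zone_rmse), 2),
           round(statistics.pstdev(zone_rmse), 2))
print(summary)
```
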

| Comparison | Metric | Mean Δ | 95% CI | p-Value | Significance |
|---|---|---|---|---|---|
| TC-DDPG—Standard DDPG | Energy (kWh) | −3.8 | [−4.9, −2.7] | 0.004 | Yes (p < 0.01) |
| TC-DDPG—Standard DDPG | Comfort drift (°C·h) | −1.7 | [−2.3, −1.1] | 0.002 | Yes (p < 0.01) |
| TC-DDPG—MPC-PF | Energy (kWh) | −0.5 | [−1.4, +0.4] | 0.32 | No |
| TC-DDPG—MPC-PF | Comfort drift (°C·h) | −0.4 | [−1.0, +0.2] | 0.18 | No |
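
A confidence interval of the kind reported above can be obtained with a percentile bootstrap over per-seed differences. The sketch below is one way to do that and is not claimed to be the authors' exact procedure; the function name `bootstrap_ci`, the number of resamples, and the synthetic differences are all assumptions made for illustration.

```python
import random
import statistics

def bootstrap_ci(diffs, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the differences with replacement and collect the resampled means.
    means = sorted(statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.fmean(diffs), lo, hi

# Synthetic per-seed energy differences (controller A minus controller B) for
# 50 seeds. These are NOT the paper's data; they only exercise the function.
data_rng = random.Random(42)
diffs = [data_rng.gauss(-3.8, 4.0) for _ in range(50)]

mean_delta, ci_lo, ci_hi = bootstrap_ci(diffs)
# A difference is flagged when the interval excludes zero.
excludes_zero = ci_hi < 0 or ci_lo > 0
print(f"mean = {mean_delta:.2f}, 95% CI = [{ci_lo:.2f}, {ci_hi:.2f}], excludes 0: {excludes_zero}")
```

An interval that straddles zero, as in the two MPC-PF rows, corresponds to a difference that is not distinguishable from zero at the chosen level.
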

| Zone | Orientation | Area (m²) | WWR (%) | Glazing U (W/m²·K) | SHGC (-) | Occupancy Density (m²/Person) | Ventilation (L/s·Person) | Infiltration (ACH) | Lighting (W/m²) | Equipment (W/m²) | Internal Gains (Peak, W/m²) | Occupied Comfort Band (°C) | Unoccupied Band (°C) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| North | N | 180 | 40 | 2.1 | 0.40 | 10 | 10 | 0.30 | 7 | 10 | 22 | 21–25 | 18–28 |
| East | E | 180 | 40 | 2.1 | 0.40 | 10 | 10 | 0.30 | 7 | 10 | 22 | 21–25 | 18–28 |
| South | S | 180 | 40 | 2.1 | 0.40 | 10 | 10 | 0.30 | 7 | 10 | 22 | 21–25 | 18–28 |
| West | W | 180 | 40 | 2.1 | 0.40 | 10 | 10 | 0.30 | 7 | 10 | 22 | 21–25 | 18–28 |
| Core | – | 180 | 0 | – | – | 10 | 10 | 0.20 | 7 | 10 | 22 | 21–25 | 18–28 |

| Hyperparameter | Value | Notes |
|---|---|---|
| Algorithm | DDPG (actor–critic) | Continuous actions |
| Actor learning rate | | Adam |
| Critic learning rate | | Adam |
| Discount factor (γ) | 0.99 | Long-horizon energy effects |
| Soft target update (τ) | 0.005 | Polyak averaging |
| Batch size | 64 | Stable on single GPU |
| Replay buffer | transitions | ≈35 days at 5-min |
| OU noise (σ, θ) | 0.1, 0.15 | Added to actor output during training |
| Gradient clip (L2) | 1.0 | Prevents exploding grads |
| Physics reg. weight () | 0.10 | (= 1.0, = 0.1, = 0.05) |
| Reward weights | Energy/Comfort/Peak/IAQ | |
| Steps per episode | 288 | 24 h at 5-min control |
| Training episodes | 5000 | Day-long episodes |
| Independent runs | 50 seeds | For CIs and significance |
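
Two of the mechanisms named in the table, soft target updates with Polyak averaging and Ornstein-Uhlenbeck (OU) exploration noise, can be sketched in plain Python using the listed values. This is an illustrative sketch, not the authors' implementation: the names `soft_update` and `OUNoise` are hypothetical, network weights are stood in for by flat lists, and the OU process is discretized with a unit time step, which is an assumption.

```python
import random

TAU = 0.005                       # soft target update rate (table value)
OU_SIGMA, OU_THETA = 0.1, 0.15    # OU noise parameters (table values)
STEPS_PER_EPISODE = 24 * 60 // 5  # one day at 5-min control

def soft_update(target, online, tau=TAU):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * t for t, w in zip(target, online)]

class OUNoise:
    """Ornstein-Uhlenbeck process with a unit time step (an assumption)."""

    def __init__(self, size, sigma=OU_SIGMA, theta=OU_THETA, mu=0.0, seed=0):
        self.sigma, self.theta, self.mu = sigma, theta, mu
        self.rng = random.Random(seed)
        self.state = [mu] * size

    def sample(self):
        # x <- x + theta * (mu - x) + sigma * N(0, 1)
        self.state = [
            x + self.theta * (self.mu - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)

# Stand-in weight vectors: the target moves 0.5% of the way toward the online net.
target = soft_update([0.0, 1.0], [1.0, 1.0])

# One episode of temporally correlated noise for a 3-dimensional action.
noise = OUNoise(size=3)
exploration = [noise.sample() for _ in range(STEPS_PER_EPISODE)]
print(STEPS_PER_EPISODE, target, exploration[0])
```

The small τ keeps the target networks slowly varying, and the mean-reverting noise produces smoother exploratory actions than independent Gaussian samples would.
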

| Method | PMV Range | PPD Mean (%) | Setpoint Deviation (°C) | Comfort Violations (h/yr) |
|---|---|---|---|---|
| Rule-Based | [−0.8, 0.9] | 18.3 ± 1.7 | 1.2 ± 0.3 | 487 ± 34 |
| MPC | [−0.6, 0.7] | 12.7 ± 1.4 | 0.8 ± 0.2 | 234 ± 28 |
| Standard DDPG | [−0.7, 0.8] | 14.2 ± 1.6 | 0.9 ± 0.2 | 298 ± 31 |
| TC-DDPG | [−0.5, 0.5] | 8.4 ± 1.1 | 0.5 ± 0.1 | 62 ± 12 |
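
The PMV and PPD columns are linked by Fanger's relation, as standardized in ISO 7730. The short sketch below evaluates that relation at a few PMV values; the function name `ppd_from_pmv` is hypothetical, and the table's PPD means are averages over time, so they are not expected to equal the formula evaluated at a single PMV.

```python
import math

def ppd_from_pmv(pmv):
    """Predicted Percentage Dissatisfied (%) for a given PMV (Fanger relation)."""
    return 100.0 - 95.0 * math.exp(-0.03353 * pmv ** 4 - 0.2179 * pmv ** 2)

# Evaluate at neutral comfort, at the edges of the [-0.5, 0.5] band, and beyond.
for pmv in (-0.9, -0.5, 0.0, 0.5, 0.9):
    print(f"PMV {pmv:+.1f} -> PPD {ppd_from_pmv(pmv):.1f}%")
```

The relation gives a floor of 5% at PMV = 0 and roughly 10% at the edges of the ±0.5 band, which is why a controller that keeps PMV inside that band is expected to show a mean PPD below about 10%.
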

| Controller | Violations (per 10k Steps) | Violations (yr⁻¹) | Reduction (%) | Notes |
|---|---|---|---|---|
| Baseline DDPG | 2.6 ± 0.7 | ≈950 ± 260 | — | Frequent constraint breaches during exploration |
| + Feasibility Projection | 1.1 ± 0.4 | ≈400 ± 145 | −58% | Rate-limit and saturation respected by design |
| + Physics Regularization | 1.3 ± 0.5 | ≈475 ± 180 | −50% | Reduced infeasible thermal states |
| Full TC-DDPG (ours) | 0.4 ± 0.2 | ≈145 ± 70 | −85% | Only rare transient violations |
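
The Reduction column follows from the violation rates in the second column. As a quick arithmetic check, the sketch below recomputes each percentage relative to the baseline DDPG rate; the helper name `reduction_pct` is hypothetical.

```python
# Mean violation rates copied from the table above.
rates = {
    "Baseline DDPG": 2.6,
    "+ Feasibility Projection": 1.1,
    "+ Physics Regularization": 1.3,
    "Full TC-DDPG": 0.4,
}

def reduction_pct(rate, baseline):
    """Relative change versus the baseline in percent (negative = fewer violations)."""
    return 100.0 * (rate / baseline - 1.0)

baseline = rates["Baseline DDPG"]
reductions = {
    name: round(reduction_pct(rate, baseline))
    for name, rate in rates.items()
    if name != "Baseline DDPG"
}
print(reductions)
```
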

| Perturbation | Level | Energy | Comfort Drift | Violations | Notes |
|---|---|---|---|---|---|
| Sensor noise σ (°C) | 0.1 | +0.7 [+0.3, +1.1] | +2.3 [+1.2, +3.4] | +0.0 [0.0, +0.1] | TC-DDPG |
| Sensor noise σ (°C) | 0.3 | +1.9 [+1.1, +2.8] | +6.8 [+4.9, +8.5] | +0.2 [0.0, +0.4] | TC-DDPG |
| Bias b (°C) | +0.5 | +1.1 [+0.6, +1.7] | +4.2 [+2.9, +5.6] | +0.1 [0.0, +0.3] | TC-DDPG |
| Latency τ (steps) | 3 | +2.6 [+1.7, +3.6] | +7.9 [+5.8, +9.9] | +0.3 [+0.1, +0.6] | TC-DDPG |
| Actuator lag τₐ (min) | 10 | +1.4 [+0.8, +2.1] | +5.6 [+3.9, +7.2] | +0.2 [0.0, +0.5] | TC-DDPG |
| Same rows (standard DDPG) | — | +3.8 to +7.5 | +12.1 to +24.9 | +1.2 to +3.7 | Worse under all perturbations |
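
Sensor-side perturbations of the kind listed above (Gaussian noise, constant bias, and reporting latency) can be injected by wrapping the true zone temperature before it reaches the controller. The following is a minimal sketch under stated assumptions: the class `PerturbedSensor` is hypothetical, latency is modeled as a fixed delay in control steps, and the first reading is held until the delay buffer fills.

```python
import random
from collections import deque

class PerturbedSensor:
    """Applies bias, Gaussian noise, and a fixed delay to a temperature signal."""

    def __init__(self, sigma=0.0, bias=0.0, latency_steps=0, seed=0):
        self.sigma = sigma
        self.bias = bias
        self.rng = random.Random(seed)
        # Holds the last latency_steps + 1 true readings; index 0 is the oldest.
        self.buffer = deque(maxlen=latency_steps + 1)

    def read(self, true_temp_c):
        self.buffer.append(true_temp_c)
        delayed = self.buffer[0]  # value from latency_steps ago once the buffer is full
        return delayed + self.bias + self.rng.gauss(0.0, self.sigma)

# Deterministic case: +0.5 degC bias and a 3-step latency, no noise.
biased = PerturbedSensor(bias=0.5, latency_steps=3)
outs = [biased.read(t) for t in [20.0, 21.0, 22.0, 23.0, 24.0, 25.0]]
print(outs)

# Stochastic case: zero-mean noise with sigma = 0.3 degC.
noisy = PerturbedSensor(sigma=0.3)
print([round(noisy.read(22.0), 2) for _ in range(5)])
```

Actuator lag would be modeled on the action side rather than the observation side, for example with a first-order filter on the commanded setpoint, and is left out of this sketch.
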

| Scenario | Energy (kWh) | Comfort Drift (°C·h) | Violations (Count) | Actuation TV (norm.) |
|---|---|---|---|---|
| Stuck damper 20% (2 h) | +2.4 ± 0.9 | +1.8 ± 0.6 | +0.3 ± 0.2 | +0.06 ± 0.02 |
| South sensor +1.5 °C (3 h) | +1.1 ± 0.5 | +2.7 ± 0.8 | +0.5 ± 0.2 | +0.04 ± 0.02 |
| Chiller −30% cap (4 h) | +4.8 ± 1.7 | +3.6 ± 1.1 | +0.9 ± 0.3 | +0.09 ± 0.03 |
| Operator override (2 h) | +0.6 ± 0.3 | +1.2 ± 0.5 | +0.2 ± 0.1 | +0.02 ± 0.01 |
| Telemetry dropouts 10% | +0.9 ± 0.4 | +1.9 ± 0.7 | +0.3 ± 0.1 | +0.03 ± 0.01 |

| Method | Training Time * | Inference Time † | Peak Memory ‡ | FLOPs/Decision § |
|---|---|---|---|---|
| MPC | N/A | 847 ms | 2.3 GB | |
| Standard DDPG | ~72 h | 12 ms | 4.1 GB | |
| TC-DDPG | ~69 h | 18 ms | 4.8 GB | |
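
To put the inference times in context, the sketch below compares them with the 5-minute control step used in training (288 steps per day). It is simple arithmetic on the table values; the variable names are hypothetical.

```python
CONTROL_STEP_S = 5 * 60  # 5-minute control interval, in seconds

# Inference times per decision, copied from the table (milliseconds).
inference_ms = {"MPC": 847, "Standard DDPG": 12, "TC-DDPG": 18}

def step_fraction(ms):
    """Fraction of one control step consumed by a single decision."""
    return (ms / 1000.0) / CONTROL_STEP_S

speedup_vs_mpc = inference_ms["MPC"] / inference_ms["TC-DDPG"]
overhead_vs_ddpg = inference_ms["TC-DDPG"] / inference_ms["Standard DDPG"]

for name, ms in inference_ms.items():
    print(f"{name}: {100 * step_fraction(ms):.4f}% of a control step")
print(f"TC-DDPG vs MPC: {speedup_vs_mpc:.1f}x faster; vs DDPG: {overhead_vs_ddpg:.1f}x slower")
```

All three controllers use well under 1% of the control interval per decision, so the added cost of the constraint layer relative to standard DDPG does not affect real-time feasibility at this step size.
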

| Configuration | Energy Savings vs. Baseline (%) | Comfort Improvement † (%) | Physics Violations ‡ |
|---|---|---|---|
| Full TC-DDPG | 34.7 ± 1.2 | 54.1 ± 3.4 | 12 ± 3 |
| w/o Physics Layer | 28.3 ± 1.8 | 42.3 ± 3.7 | 847 ± 67 |
| w/o Attention Encoder | 31.2 ± 1.5 | 48.7 ± 3.1 | 34 ± 6 |
| w/o Psychrometric Consistency | 32.1 ± 1.4 | 45.2 ± 3.0 | 156 ± 19 |

| Physics reg. weight | Annual Energy (kWh/m²·yr) | Violations per 10k Steps |
|---|---|---|
| 0.01 | 158.3 ± 4.6 | 234 ± 28 |
| 0.05 | 153.7 ± 4.0 | 67 ± 11 |
| 0.10 | 150.8 ± 3.9 | 12 ± 3 |
| 0.20 | 152.1 ± 4.1 | 8 ± 2 |
| 0.50 | 161.4 ± 4.8 | 3 ± 1 |
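
The sweep above shows a trade-off: energy is lowest at an intermediate setting, while violations keep falling as the swept value grows. The sketch below makes that reading explicit. It assumes the first column is the physics regularization weight, which is inferred from the value 0.10 listed for that weight in the hyperparameter table.

```python
# weight -> (annual energy in kWh/m2/yr, violations per 10k steps), table means.
sweep = {
    0.01: (158.3, 234),
    0.05: (153.7, 67),
    0.10: (150.8, 12),
    0.20: (152.1, 8),
    0.50: (161.4, 3),
}

# Setting with the lowest mean annual energy.
best_weight = min(sweep, key=lambda w: sweep[w][0])

# Check that violations decrease strictly as the weight increases.
violations = [sweep[w][1] for w in sorted(sweep)]
monotone = all(a > b for a, b in zip(violations, violations[1:]))

print(best_weight, monotone)
```

Beyond the energy-optimal setting, further increases still reduce violations but cost energy, so the choice between neighbouring settings depends on how the two objectives are weighed.
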

| Actor LR | Episodes to Converge | Final Energy (kWh/m²·yr) | Notes |
|---|---|---|---|
| | 4821 ± 510 | 156.2 ± 4.4 | Slow learning |
| | 2234 ± 260 | 152.3 ± 4.1 | Stable |
| | 1823 ± 214 | 150.8 ± 3.9 | Best overall |
| | 1567 ± 190 | 154.7 ± 4.3 | Faster but slightly worse final |
| | — | — | Diverged |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Hedayat, S.; Ziarati, T.; Manganelli, M. A Physics-Informed Reinforcement Learning Framework for HVAC Optimization: Thermodynamically-Constrained Deep Deterministic Policy Gradients with Simulation-Based Validation. Energies 2025, 18, 6310. https://doi.org/10.3390/en18236310

