1. Introduction
The demand for energy in buildings around the world will continue to rise because cities are growing, conditioned floor area is expanding, and occupants expect greater comfort and better air quality. Space conditioning is the largest operational load in both residential and light-commercial stock [1]. Air handling units (AHUs) with cooling coils play a crucial role in regulating the energy demands of chillers, heat pumps, pumps, and fans by maintaining optimal air and water temperatures. When a coil responds poorly to changing conditions, overshoot and short cycling follow, and energy consumption rises. To address rising energy use and emissions, industry must transition from traditional methods to “high-efficiency, low-carbon” control strategies. This includes advanced equipment that operates directly on routine AHU data (such as smart thermostat/BAS streams) [2] and AI-driven forecasting to enhance coil performance, promote safer temperature settings, stabilize system conditions, and reduce energy waste, thereby facilitating decarbonization in homes and buildings.
Physics-based and grey-box models for coils and AHUs have long been favored because they are interpretable, grounded in first principles, and straightforward to validate. Yet, in practice, they often require site-specific calibration and periodic retuning, and they are vulnerable to sensor bias and drift as equipment ages, which constrains scalability beyond initial commissioning. Recent reviews echo these trade-offs for building/HVAC systems and simplified coil models. Balali et al. [3] reviewed HVAC control and showed that both model-based and data-driven approaches can improve efficiency and comfort, but real-world impact hinges on model accuracy, data quality, and occupancy uncertainty. Li et al. [4] illustrated the theory and applications of grey-box models, while follow-on commentary points to the need for standardized structures, explicit assumptions, and unified software frameworks to enable broader adoption [5]. Zhou et al. [6] cataloged machine-learning advances across optimization, control, and fault detection but also noted persistent challenges in data curation, generalization, and deployment. Alongside these reviews, empirical studies illustrate the promise and limits of current practice. Transfer learning layered on LSTM improves building-load forecasting under changing weather regimes, indicating meaningful cross-domain gains [7]. Model-free reinforcement learning can stabilize VAV operation using only time-series signals, reducing airflow and energy in simulation without a detailed plant model [8]. Simplified coil surrogates based on overall UA matching and Holmes’ correlations can predict performance without geometric detail and remain useful for design and fault detection [9], while a Danish AHU grey-box study demonstrates parameter identification and efficiency analysis from routine sensors under limited dehumidification [10]. On an AHU case, a hybrid CNN–LSTM achieves strong supply-air temperature forecasts with associated energy improvements [11]. Across this literature, however, three recurring weaknesses appear: forecasting is often framed as static regression with short or ad hoc memory, validation sometimes departs from strict chronology (risking leakage and over-optimism), and temporal features and hyperparameters are frequently fixed rather than selected systematically for short-horizon supervisory use. As a result, there remains a gap for a deployment-ready, leakage-controlled, short-term forecaster of discharge-air temperature and chilled water leaving temperature that jointly (i) encodes and selects effective temporal memory and (ii) tunes model capacity under chronological evaluation aligned with routine AHU operation.
In routine AHU operation, two short-horizon responses govern plant effort and stability: the supply-air temperature (SAT) delivered to zones and the chilled water leaving temperature (CHWLT) returning to the plant. Bias in either forecast propagates into chiller lift and pump flow through the plant’s ΔT, and sustained error can precipitate the low-ΔT syndrome, an efficiency loss comprehensively reviewed by Brink et al. [12], who catalogued severity levels, causes, and the heightened risk in coils designed for large water temperature differences. Reliable near-term SAT/CHWLT prediction also underpins freeze protection and safe set-point moves; modern sequences tie SAT reset to plant/air-side conditions to curb simultaneous cooling and reheat, so supervisory choices hinge on trustworthy trajectories. Recent work on fast ML for building management systems shows that hardware-accelerated pipelines can deliver low-latency inference while preserving accuracy, enabling real-time forecasting and control directly on BMS signals [13]. Consistent with this, Elehwany et al. [14] showed that trim-and-respond SAT reset outperforms constant or OAT-based schemes under varying occupancy/preferences, strengthening the case for occupant-centric supervisory logic. Deployment realities further constrain the solution space. Practical forecasters must operate on standard BMS points (no extra sensors), run in real time, and remain maintainable. Field evidence from a Modelica-based MPC in a real office building, reported by Blum et al. [15], demonstrates the upside of ~40% HVAC energy savings but also the integration burden of data collection and commissioning. On the control side, gain scheduling has emerged as a robust compromise for stabilizing SAT under varying loads, outperforming pure cascade control in response stability and valve wear, as shown by Wang et al. [16]. Finally, temporal resolution matters: while ultra-fine granularity adds little to load-shape extraction [17], finer resolution materially improves forecast accuracy, with 15-min data identified as a practical optimum between performance and cost [18]. Contemporary reviews emphasize BMS-centric analytics in forecasting, diagnostics, and control built from routine BAS points as a pragmatic path to scalable efficiency improvements across building stock [19]. Together, these results point to a clear operational need: a leakage-safe, BMS-native forecaster for SAT and CHWLT that is accurate at the control horizon, robust to regime changes, and straightforward to deploy at scale.
Recent evidence underscores why SAT must be a primary control target. In a multi-calorimeter study of EHP–AHU systems, Jung et al. [20] showed that raising SAT from 12 °C to 16 °C cut outdoor-unit power by up to 32.3%, directly linking modest SAT resets to sizeable plant savings. Complementarily, DNN predictors with tuned hyperparameters have achieved high fidelity (R² ≈ 0.984), supporting prediction-driven optimal control for HVAC efficiency [21,22]. Guided by this literature and supervisory practice, the present work adopts operations-aligned accuracy targets: low scale-free error on future segments (NRMSE ~1–3%) and minimal phase lag in time overlays. These thresholds are repeatedly cited as sufficient to stabilize set-point moves and avoid wasteful transients in buildings [23,24].
Short-horizon dynamics in AHU coils are driven by coupled air- and water-side lags (valve motion, mixed-air swings, plant ΔT). To encode this short memory without bespoke sensing, the study adopts temporally aware features, namely input/target lags and short rolling statistics, consistent with recent building energy syntheses and AHU control practice. Reviews such as Zhu et al. [25] emphasize that multivariate coupling and time-aware features are central to operational accuracy, while modern ML for HVAC shows promise yet still wrestles with data quality, generalization, and limited industrial uptake [6]. To ensure that the results reflect deployable performance (not optimistic bias), the pipeline follows Bergmeir et al. [26] in using leakage-safe chronology: burn-in removal and blocked cross-validation for model selection, in line with evidence that time-ordered CV yields more reliable choices than random splits for forecasting. Because default settings can under-represent transients and overfit noise, Optuna is used to co-tune memory and model capacity efficiently through define-by-run search with pruning, as described in Akiba et al. [27]; complementary base learners (RF, Bagging-DT, XGBoost, and ANN) are chosen to balance bias/variance and diversify inductive biases seen across energy-forecasting work. These learners are fused through OOF stacking (RidgeCV), a principled meta-learning scheme that learns how base-model biases interact and is widely validated in ensemble studies. This design aligns with broader building-analytics literature calling for standardized, real-world pipelines across scales and metrics [28] and with empirical results showing that rigorous HPO materially improves HVAC prediction [29], while stacked generalization reduces error versus single models for heating/cooling forecasting [30,31]. Related ensemble evidence in adjacent meteorological tasks likewise reports stacked predictors outperforming strong single learners (e.g., TFDF, RNN-LSTM) on long historical series [32]. Operationally, the resulting surrogate targets supervisory/MPC needs, namely stabilizing set-point moves and curbing wasteful transients, while keeping deployment lightweight and consistent with field experience in control layers [33,34].
This study presents a deployment-oriented surrogate for AHU cooling-coil behavior built on four coordinated pillars: (i) dynamic, physics-aligned features, where lag and short-window roll lengths for inputs and targets are learned rather than fixed, capturing the short-horizon air- and water-side responses that dominate set-point changes and valve motion; (ii) leakage-safe learning, with strictly causal feature construction, removal of the burn-in induced by dynamic windows, and chronological blocked/forward cross-validation that mirrors operations and avoids optimistic bias; (iii) an Optuna-tuned, OOF-trained stacked ensemble, in which RF, Bagging-DT, XGBoost, and ANN are co-optimized via define-by-run search and pruning, then fused with RidgeCV on out-of-fold predictions to reduce transient error while preserving steady-state stability; and (iv) deployment and reproducibility, requiring only routine BMS points, supporting real-time inference, and accompanied by concise figures/tables and parameter artifacts to enable replication and transfer. The emphasis on supply/discharge temperature control and stability aligns with contemporary AHU studies that evaluate SAT strategies under realistic operating regimes [35,36].
In conclusion, this study demonstrates that forecasting supply-air temperature (SAT) and chilled water leaving temperature (CHWLT) at short horizons can materially stabilize AHU operation and reduce plant effort. A deployment-oriented pipeline was developed in which coil dynamics are captured by learned input/target lags and short rolling windows, features are constructed causally with burn-in trimming and strictly chronological evaluation, and model capacity is selected via Optuna-based hyperparameter optimization. The tuned base learners (RF, Bagging-DT, XGBoost, ANN) are fused through out-of-fold RidgeCV stacking, leveraging complementary inductive biases to suppress transient error while preserving steady-state agreement. The surrogate uses only standard BMS points, supports real-time inference, and is delivered with reproducible artifacts suitable for supervisory layers. The principal novelty lies in a leakage-safe, multi-output stacking framework that jointly optimizes dynamic memory (lags/rolls) and model hyperparameters under time-ordered validation, which is an integration tailored for AHU deployment and uncommon in prior coil studies. Acknowledged limitations include evaluation on a single laboratory AHU under controlled perturbations and one-step horizons; external, multi-site validation and closed-loop trials remain future work. Even within these bounds, results indicate that the proposed surrogate can enable smoother set-point moves, reduce overshoot and short-cycling, and mitigate low-ΔT penalties without additional sensors or site-specific physics calibration.
2. Methodology
2.1. Comprehensive Laboratory Testing and Data Collection
To develop a robust dataset suitable for predictive modeling of cooling-coil performance, a series of controlled laboratory experiments was conducted under realistic HVAC operating conditions. The experimental design aimed to capture the dynamic thermal interactions between air and chilled water streams while ensuring high-quality, high-resolution data across a diverse range of operational scenarios. Parameters such as chilled water flow rate, coil valve position, fan speed, and supply-air temperature were systematically varied to reflect the full spectrum of coil behavior encountered in building applications. All tests were performed using precisely monitored and automated systems to minimize noise and operator-induced uncertainty. The following subsections describe the laboratory setup, testing protocol, and control strategies used to ensure data consistency and experimental integrity.
2.1.1. Testing Facility
The experimental work was carried out in the Building Energy Assessments, Solutions, and Technologies (BEAST) Laboratory (Figure 1a), located at the University of Cincinnati’s Victory Parkway campus. The facility contains multiple HVAC system configurations, including variable refrigerant flow systems, water-based air handling units paired with an electric heater and an air-cooled chiller, and direct expansion units. The laboratory also includes three thermally insulated and independently controlled test spaces designed to replicate typical building zones, each equipped with a variable air volume (VAV) terminal unit.
For this study, a water-based air handling unit (Figure 1b) equipped with a chilled-water cooling coil served as the primary test system. All system parameters were monitored and controlled using the laboratory’s centralized Building Automation System (BAS), which minimized uncertainties associated with manual operation.
2.1.2. Laboratory Testing Procedure
Testing was conducted during the summer season to ensure that the hot water system components remained inactive. During the entire study, the boiler and all associated hot water valves were kept fully closed. Data were recorded continuously over a three-week period at one-minute intervals, resulting in more than 20,000 time-stamped observations of system operation. To evaluate coil performance across a broad range of operating conditions, the chilled water valve position, supply fan speed, and chilled water supply temperature were adjusted at two-hour intervals. The cooling coil valve position varied from 30% to 100% in increments of 10%, ensuring a minimum flow of 0.25 gallons per minute (GPM) at the lowest valve setting. The supply fan speed followed the same incremental schedule, ranging from 10% to 100%. Additionally, the chilled water supply temperature was adjusted from 45 °F to 60 °F in 5 °F increments.
System safeguards were followed to prevent coil freezing and overheating. The internal temperature of the air handling unit was maintained between 35 °F and 95 °F to protect both the coil and electrical components. All auxiliary HVAC elements, including the heating coil, VAV reheating coil, and humidifier, remained inactive to avoid influencing cooling-only operation. The three laboratory zones remained unoccupied throughout testing, and the outdoor air damper was kept fully open to ensure that condensation occurred on the cooling coil as intended. Incremental adjustments to the coil valve position resulted in chilled water flow rates ranging from approximately 0.4 GPM to 5.4 GPM. Variations in supply fan speed allowed for the assessment of airflow effects on coil behavior. At full fan speed, the system delivered approximately 1300 cubic feet per minute (CFM), representing the maximum airflow capacity of the unit. The testing protocol therefore captured both typical and extreme cooling conditions, producing a dataset with strong suitability for subsequent modeling efforts.
2.1.3. Chilled Water Loop and Airflow Control Conditions
The cooling coil was fed by a packaged chiller unit and the returning water temperature varied according to coil load. Since the outdoor air damper remained fully open, the supply-air temperature was influenced primarily by mixing conditions and cooling-coil performance rather than damper modulation. During the testing period, the mixed air temperature ranged from 50.8 °F to 88.4 °F.
The VAV dampers remained fully open, and airflow modulation was controlled exclusively through supply fan speed adjustments. This approach isolated the effects of airflow and chilled water flow on coil performance, enabling the development of a dataset representative of real operating behavior and suitable for predictive modeling.
2.2. Study Framework
This study develops a dynamic supervised-learning framework to model cooling-coil behavior under realistic AHU operation. The approach integrates engineered temporal inputs, automated hyperparameter tuning, and a leakage-safe stacking ensemble to accurately predict both thermal and moisture responses across transient and steady-state periods. It builds on the prior work of Nassif et al. [37], which systematically screened 31 off-the-shelf learners for AHU coil prediction on the same laboratory platform and identified tree ensembles, gradient boosting, and neural networks as the strongest single models under standard error metrics; the present study advances from breadth to deployment-oriented depth. Unlike the largely static feature settings and per-model ranking of that benchmark, the current framework (i) learns short-memory dynamics by tuning input/target lags and short rolling windows, (ii) treats SAT and CHWLT as a multi-output forecasting problem, and (iii) fuses high-performing base learners via leakage-safe out-of-fold stacking with RidgeCV under strictly chronological validation. Hyperparameter search is elevated from ad hoc trialing to a formal Optuna procedure with pruning, and evaluation emphasizes transient behavior, latency suitable for real-time supervisory use, and reproducible artifacts. In this way, the earlier benchmark supplies the empirical rationale for the chosen base models and metrics, while the present work operationalizes those insights into a causal, multi-output, and deployable surrogate tailored to supervisory control rather than another model-ranking exercise [38]. Two schematic views of the workflow are provided in Figure 2 and Figure 3 to make the pipeline transparent and reproducible.
As shown in Figure 2, five physical inputs (airflow, chilled water flow, entering chilled water temperature, mixed-air temperature, and mixed-air humidity ratio) serve as drivers to a hybrid ensemble architecture. Two coil outputs are then predicted: supply-air temperature and chilled water leaving temperature. Rather than relying on a single learning strategy, this study combines decision-tree ensembles, gradient-boosted trees, and a fully connected ANN. This reflects growing evidence that hybrid learners outperform individual models when modeling HVAC-system nonlinearities, psychrometric coupling, and coil moisture dynamics [1–4]. The final stage applies an out-of-fold (OOF) Ridge meta-learner to blend predictions without leaking future information.
Figure 3 expands this into an operational workflow. The dataset is split temporally (70/30), and dynamic feature engineering creates lagged and rolling-window features to capture the short-cycle memory effects inherent to coil valve actuation and mixed-air transitions. These features are tuned using Optuna with time-series cross-validation to ensure that lag/roll depths, model hyperparameters, and learning-rate schedules are optimized without violating causality. Each model is trained in its best configuration and evaluated on the held-out horizon. Out-of-fold predictions collected during cross-validation form the inputs to the stacking regressor, which learns how to weight each base model’s strengths. This design preserves temporal structure, avoids information leakage, and enables stable accuracy during fast humidity swings and slow thermal settling periods, conditions frequently encountered in real AHU supervisory control.
Overall, the methodology aims to balance fidelity and practicality: engineered temporal context instead of black-box sequence models, automated but time-aware hyperparameter search, and a stacking layer that improves robustness without sacrificing interpretability. This approach aligns with recent trends in high-resolution building-energy forecasting and control-oriented ML pipelines [1–4] and is tailored to support real-time decision-making in chilled water systems.
Besides the final 70/30 lockbox test, model selection and stacking use a chronological TimeSeriesSplit with out-of-fold (OOF) predictions to prevent leakage. This study further performs a rolling-origin (walk-forward) evaluation: at each origin, all preprocessing (scaling, lag/roll construction, burn-in removal) is refit on the training segment, and one-step forecasts are generated for the subsequent window.
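The rolling-origin evaluation described above can be illustrated with a minimal sketch. The function names (`rolling_origin_splits`, `fit_standardizer`), the toy series, and the window sizes are hypothetical choices for illustration and are not taken from the study's code; the point is only that each test window lies strictly after its training segment and that preprocessing statistics are re-estimated from training rows at every origin.

```python
def rolling_origin_splits(n_samples, initial_train, horizon):
    """Yield (train_idx, test_idx) lists for an expanding-window walk-forward evaluation."""
    origin = initial_train
    while origin + horizon <= n_samples:
        train_idx = list(range(origin))                   # everything before the origin
        test_idx = list(range(origin, origin + horizon))  # the next window only
        yield train_idx, test_idx
        origin += horizon                                 # advance the origin


def fit_standardizer(values):
    """Return (mean, std) estimated from the given training values only."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var ** 0.5


series = [float(v) for v in range(10)]  # toy signal standing in for one BMS point
splits = list(rolling_origin_splits(len(series), initial_train=4, horizon=2))
for train_idx, test_idx in splits:
    # Preprocessing is refit at every origin, using training rows only.
    mean, std = fit_standardizer([series[i] for i in train_idx])
    scaled_test = [(series[i] - mean) / std for i in test_idx]
```

In a full pipeline the lag/roll construction and the model fit would sit inside the same loop, so that nothing computed from a test window can influence the rows used for training.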
2.3. Dynamic Features Engineering
This study encodes short-horizon cooling-coil behavior using causal history built from time-lagged inputs and short rolling statistics. Each exogenous variable (e.g., mixed-air temperature and humidity ratio, chilled water flow, entering water temperature, supply airflow) is augmented with lags to expose recent operating context, while rolling means and standard deviations over brief windows summarize near-term load evolution and damp measurement noise. Candidate lags span 1–12 samples and rolling windows 3–12 samples, capturing the fast dynamics associated with valve movement, air mixing, and water-side transport. All features at time $t$ are paired with targets at $t+1$ (one-step predictive shift), and rows impacted by the longest window are removed as burn-in to guarantee strict causality.
Temporal depth (lag length) and window size are treated as hyperparameters and selected via Optuna, allowing the effective memory horizon to adapt to system inertia and control delays rather than being fixed a priori. In addition to exogenous lags, autoregressive target lags are included for supply-air temperature (SAT) and chilled water leaving temperature (CHWLT), reflecting the coil’s physical persistence as conditions approach a new steady state. This hybrid exogenous–autoregressive design yields a compact, deployment-oriented feature set that captures short-cycle transients without relying on heavy recurrent architectures, while preserving reproducibility and leakage-free learning.
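To set up the lag and rolling features, let samples arrive every $\Delta t$ and index time by $t = 1, \dots, N$. Denote the $p$ BMS inputs by $\mathbf{x}_t = [x_{1,t}, \dots, x_{p,t}]^{\top}$ and the two outputs by $\mathbf{y}_t = [y_{\mathrm{SAT},t},\ y_{\mathrm{CHWLT},t}]^{\top}$. Lagged inputs are defined in Equation (1). For input $j \in \{1, \dots, p\}$ and lag $\ell \in \{1, \dots, L_x\}$ [30],
$$x_{j,t}^{(\ell)} = x_{j,t-\ell} \tag{1}$$
Causal rolling statistics (inputs) are defined in Equation (2). For window $w$,
$$\mu_{j,t}^{(w)} = \frac{1}{w}\sum_{k=0}^{w-1} x_{j,t-k}, \qquad \sigma_{j,t}^{(w)} = \sqrt{\frac{1}{w-1}\sum_{k=0}^{w-1}\left(x_{j,t-k} - \mu_{j,t}^{(w)}\right)^{2}} \tag{2}$$
For $m \in \{\mathrm{SAT}, \mathrm{CHWLT}\}$ and $\ell \in \{1, \dots, L_y\}$, the target lags are defined in Equation (3),
$$y_{m,t}^{(\ell)} = y_{m,t-\ell} \tag{3}$$
Causal design vector and one-step predictive shift. The feature map at time $t$ uses only present/past information in Equation (4):
$$\boldsymbol{\phi}_t = \left[\mathbf{x}_t,\ \{x_{j,t}^{(\ell)}\},\ \{\mu_{j,t}^{(w)},\ \sigma_{j,t}^{(w)}\},\ \{y_{m,t}^{(\ell)}\}\right], \qquad \hat{\mathbf{y}}_{t+1} = f(\boldsymbol{\phi}_t) \tag{4}$$
No term at $t+1$ appears in $\boldsymbol{\phi}_t$ (strict causality). To define the maximum memory required by the features:
$$M = \max\left(L_x,\ w-1,\ L_y\right) \tag{5}$$
All rows with $t \le M$ are excluded prior to chronological cross-validation (TimeSeriesSplit), ensuring every feature is well-defined and strictly past-only.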
Figure 4 illustrates that raw inputs are transformed into lagged variables, rolling window statistics, and target lags to construct a leakage-free temporal feature matrix for model training. This pipeline ensures the model receives recent operational context while preserving causality by capturing coil transients, control behavior, and psychrometric response without peeking ahead in time.
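As a rough illustration of the lag, rolling-statistic, and burn-in construction described in this section, the following NumPy sketch builds a causal feature matrix with a one-step predictive shift. The function name `build_causal_features`, the inclusion of the current input sample, the sample standard deviation (`ddof=1`), and the toy data are assumptions made for the example; the study's actual feature code may differ in these details.

```python
import numpy as np


def build_causal_features(X, Y, n_lags, window, n_target_lags):
    """Return (Phi, target, M) for one-step-ahead forecasting.

    X: (N, p) inputs, Y: (N, q) targets, both in time order.
    Row t of Phi uses only samples at or before t; its paired target is Y[t + 1].
    """
    N = X.shape[0]
    M = max(n_lags, window - 1, n_target_lags)        # burn-in length
    rows, targets = [], []
    for t in range(M, N - 1):                         # the last sample has no t+1 target
        lagged_x = [X[t - l] for l in range(0, n_lags + 1)]        # x_t, x_{t-1}, ...
        win = X[t - window + 1 : t + 1]                             # window ending at t
        roll_mean = win.mean(axis=0)
        roll_std = win.std(axis=0, ddof=1)
        lagged_y = [Y[t - l] for l in range(1, n_target_lags + 1)]  # y_{t-1}, ...
        rows.append(np.concatenate(lagged_x + [roll_mean, roll_std] + lagged_y))
        targets.append(Y[t + 1])
    return np.asarray(rows), np.asarray(targets), M


# Toy data: one input ramp and one target equal to twice the input.
X = np.arange(10, dtype=float).reshape(-1, 1)
Y = 2.0 * X
Phi, target, M = build_causal_features(X, Y, n_lags=2, window=3, n_target_lags=1)
```

With these toy settings the first retained row corresponds to the third sample, and its features are the current and two lagged inputs, the three-sample rolling mean and standard deviation, and one lagged target; the rows before it are dropped as burn-in.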
2.4. Model Structures and Optuna-Guided Hyperparameter Optimization
To evaluate the value of dynamic feature engineering and ensure a fair comparison across modeling strategies, four supervised learning models were trained: Random Forest (RF), Bagging with a Decision-Tree base (Bagging-DT), XGBoost (XGB), and a fully connected Artificial Neural Network (ANN). These models represent three widely used families in HVAC and building-energy prediction—ensemble trees, boosting algorithms, and neural networks, each known for handling nonlinear, coupled system dynamics typically seen in cooling-coil processes.
Hyperparameters for each model were tuned using Optuna’s define-by-run search framework with pruning. This allowed the search process to adaptively focus on promising configurations while stopping poor-performing trials early, reducing computational cost. The tuning objective minimized time-series cross-validated Root Mean Squared Error (RMSE), ensuring that hyperparameters were chosen based on predictive performance under realistic sequential forecasting rather than random shuffling. This setup follows current best practice in data-driven HVAC forecasting and avoids information leakage. For RF and Bagging-DT, tuning focused on the number of trees, maximum depth, sample splits, and feature-subsampling strategies, balancing complexity and generalization. XGBoost tuning explored learning rate, regularization terms, maximum leaf depth, and sampling ratios to control overfitting while capturing sharp transitions in heat and moisture transfer behavior. The ANN architecture was tuned for hidden layer dimensions, dropout rate, L2 regularization, learning rate, and training schedule (batch size and patience-based learning-rate reduction). This produced a model that is expressive enough to learn coil thermodynamics, while regularization prevented overfitting during steady-state periods.
Importantly, lag lengths and rolling-window sizes were also treated as tunable parameters. Across all models, the optimizer consistently selected short memory windows (typically 2–7 steps), which aligns with physical expectations: cooling-coil outlet conditions depend most strongly on very recent mixed-air and chilled water states. Models that incorporated these optimized temporal features achieved higher stability and accuracy, with ANN and XGBoost showing the largest benefit.
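The idea of treating memory depth as a tunable parameter scored by time-ordered cross-validated RMSE can be sketched without any tuning library. The example below deliberately swaps in an exhaustive search over a few candidate lags and an ordinary least-squares model in place of Optuna and the study's learners; the helper names (`lagged_design`, `blocked_cv_rmse`, `select_lag`) and the synthetic signal are hypothetical.

```python
import numpy as np


def lagged_design(x, y, n_lags):
    """Rows [x_{t-n_lags}, ..., x_t, y_t, 1] paired with the target y_{t+1}."""
    rows = [np.r_[x[t - n_lags : t + 1], y[t], 1.0]
            for t in range(n_lags, len(x) - 1)]
    return np.asarray(rows), y[n_lags + 1 :]


def blocked_cv_rmse(F, target, n_folds=3):
    """Expanding-window CV: fit on rows before each block, score on the block."""
    block = len(target) // (n_folds + 1)
    errs = []
    for k in range(1, n_folds + 1):
        tr, te = slice(0, k * block), slice(k * block, (k + 1) * block)
        coef, *_ = np.linalg.lstsq(F[tr], target[tr], rcond=None)
        errs.append(np.sqrt(np.mean((F[te] @ coef - target[te]) ** 2)))
    return float(np.mean(errs))


def select_lag(x, y, candidates):
    """Return the candidate lag depth with the lowest chronological CV RMSE."""
    scores = {L: blocked_cv_rmse(*lagged_design(x, y, L)) for L in candidates}
    return min(scores, key=scores.get), scores


# Synthetic signal in which y_{t+1} depends on x_{t-2}, so a lag depth of at
# least 2 is needed before the model can see the driving input.
rng = np.random.default_rng(0)
n = 400
x = rng.normal(size=n)
y = np.zeros(n)
y[3:] = 0.5 * x[:-3]
y += 0.01 * rng.normal(size=n)
best_lag, scores = select_lag(x, y, candidates=[1, 2, 3, 4])
```

A sampler-based optimizer with pruning plays the same role when the search space also includes model hyperparameters; the essential constraint illustrated here is that every candidate is scored only on blocks that follow its training rows.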
The optimized settings in Table 1 show clear patterns across learners. RF and Bagging-DT both settled around 500–570 trees and depth 15–21 but differed in sampling strategy, reflecting their bias-variance balance. XGBoost converged to a larger boosting regime (2500 trees, lr = 4.7 × 10⁻³, λ = 1.0), supporting sharp transitions in moisture-cooling response. The ANN favored a compact 294-316-39 architecture with dropout = 0.39, L2 = 6.6 × 10⁻⁴, and 221 epochs at lr = 1.7 × 10⁻⁴, indicating that regularization and learning-rate control were essential for stable training. Temporal tuning consistently selected short lags (2–7 steps) and rolling windows (3–7 samples), matching expected coil time-constants. These tuned configurations directly contributed to the accuracy gains reported later, underscoring that both model architecture and temporal horizon selection matter for reliable dynamic HVAC prediction.
2.5. ANN Architecture (Hidden-Layer Layout)
A fully connected feed-forward neural network (ANN) was implemented to model nonlinear cooling-coil behavior. The network adopts a compact deep structure with three hidden layers configured as 294 → 316 → 39 neurons, followed by a single linear output neuron. Hidden layers employ ELU activation, with dropout = 0.386 and L2 regularization = 6.6 × 10⁻⁴ applied throughout to enhance generalization under variable psychrometric and hydronic operating regimes. Training utilized the tuned parameters reported in Table 1, including learning rate = 1.7 × 10⁻⁴, batch size = 157, and 221 epochs with adaptive learning-rate scheduling and early stopping.
Although the schematic below (Figure 5) displays the architecture for one output head, the study predicts two coil performance outputs (supply-air temperature and chilled water leaving temperature). Accordingly, two independent 1-unit output heads were trained using the same shared hidden-layer representation, ensuring consistent feature abstraction while allowing each target variable to learn its own terminal mapping.
Figure 6 confirms smooth and monotonic error reduction for both training and validation curves, with no divergence or late-epoch instability, indicating minimal overfitting and good generalization. The validation curve reaches a minimum near epoch 221 before flattening, validating the early-stopping trigger and ensuring that the model does not chase noise or transient anomalies in the conditioning data. The near-overlap of the curves further suggests that the combination of dropout + L2 regularization + dynamic learning-rate scheduling produced a stable and well-calibrated network, suitable for deployment in real-time supervisory control environments where prediction drift and temporal instability cannot be tolerated.
2.6. Leakage-Safe Stacking and Error-Analysis Framework
To combine complementary inductive biases while preserving time order, this study adopted a stacked ensemble with strict leakage control. After hyperparameter optimization, each base learner is retrained on the full historical training window. A chronological TimeSeriesSplit is then run so that, in every fold, each model produces OOF predictions only for timestamps it has never seen during fitting. These OOF predictions form the meta-feature matrix used to train a RidgeCV meta-learner. At test time, the tuned base models first generate predictions on the unseen horizon; RidgeCV then blends them to produce the final two outputs, SAT and CHWLT.
In this study, OOF predictions from ANN, gradient boosting, and tree ensembles are often highly correlated because they learn from the same BMS signals and short-horizon dynamics. Under such collinearity, an unregularized linear blender can assign unstable, compensating weights. Ridge regression (ℓ2 penalty) shrinks coefficients just enough to stabilize the blend without forcing any learner out of the mix, while cross-validation inside RidgeCV selects the penalty (α) that generalizes best on time-ordered folds. This yields a meta-model that is stable, interpretable, and fast. All features are past-only (lags/rolls at time t), and rows affected by the longest window are removed as burn-in before fold assignment. The meta-learner is trained only on OOF rows; no future information enters training at any stage, making it practical for real-time AHU supervision.
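The stacking meta-model is expressed in Equation (6). Let $\mathbf{z}_t$ be the OOF meta-feature vector at time $t$ (base predictions from RF/Bagging/XGB/ANN) and $\mathbf{y}_t = [y_{\mathrm{SAT},t},\ y_{\mathrm{CHWLT},t}]^{\top}$. The meta-learner is
$$\hat{\mathbf{y}}_t = \mathbf{W}^{\top}\mathbf{z}_t + \mathbf{b} \tag{6}$$
Equivalently, for $m \in \{\mathrm{SAT}, \mathrm{CHWLT}\}$:
$$\hat{y}_{m,t} = \mathbf{w}_m^{\top}\mathbf{z}_t + b_m$$
where the learned columns $\mathbf{w}_m$ of $\mathbf{W}$ are applied to $\mathbf{z}_t$ at test time to obtain $\hat{y}_{\mathrm{SAT},t}$ and $\hat{y}_{\mathrm{CHWLT},t}$. All OOF features are generated with chronological splits and burn-in removal; no future information is used.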
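The out-of-fold stacking logic described in this section can be sketched in a few lines of NumPy. This is an illustration under stated assumptions and not the study's implementation: two least-squares models on different inputs stand in for the four tuned base learners, a closed-form ridge with a fixed penalty stands in for RidgeCV (which would select α by cross-validation), a single target is blended, and all names (`fit_ls`, `oof_predictions`, `fit_ridge`) are hypothetical.

```python
import numpy as np


def fit_ls(F, y):
    """Fit ordinary least squares and return a prediction function."""
    coef, *_ = np.linalg.lstsq(F, y, rcond=None)
    return lambda G: G @ coef


def oof_predictions(feature_sets, y, n_folds=4):
    """Chronological OOF matrix: each block is predicted by base models
    that were fit only on rows preceding that block."""
    n = len(y)
    block = n // (n_folds + 1)
    Z = np.full((n, len(feature_sets)), np.nan)
    for k in range(1, n_folds + 1):
        tr, te = slice(0, k * block), slice(k * block, (k + 1) * block)
        for j, F in enumerate(feature_sets):
            Z[te, j] = fit_ls(F[tr], y[tr])(F[te])
    keep = ~np.isnan(Z).any(axis=1)      # the first block has no OOF prediction
    return Z[keep], y[keep]


def fit_ridge(Z, y, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    z_mean, y_mean = Z.mean(axis=0), y.mean()
    Zc = Z - z_mean
    w = np.linalg.solve(Zc.T @ Zc + alpha * np.eye(Z.shape[1]), Zc.T @ (y - y_mean))
    return w, y_mean - z_mean @ w


# Toy problem: each stand-in base model sees only one of the two drivers.
rng = np.random.default_rng(0)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = x1 + x2 + 0.1 * rng.normal(size=n)
ones = np.ones(n)
feature_sets = [np.column_stack([x1, ones]), np.column_stack([x2, ones])]
Z, y_oof = oof_predictions(feature_sets, y)
w, b = fit_ridge(Z, y_oof, alpha=1e-3)
blend = Z @ w + b
```

Because each base model captures a different part of the signal, the blended prediction has a much lower error on the OOF rows than either model alone, which is the behavior stacking is meant to exploit. For two targets the same fit would be repeated per output column.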
This study included Persistence ($\hat{y}_{t+1} = y_t$), Exponential Smoothing, and ARIMA (orders selected on training folds by AIC with ) as leakage-safe baselines. Separate univariate models are fit for SAT and CHWLT using only past data. Evaluation follows the same chronological holdout as the ensemble.
2.7. Error Metrics
To evaluate performance on the held-out test horizon, this study uses RMSE, MAE, NRMSE, and R². Each metric captures a slightly different behavior in the model’s predictions and helps confirm reliability from both a numerical and operational point of view.
RMSE measures the square-root of the average squared error (Equation (7)). Because larger mistakes are penalized more strongly, it highlights whether the model struggles during sudden changes in coil load or humidity spikes, exactly the moments where control instability can appear.
MAE reports the average absolute error in Equation (8). It gives a clearer sense of typical deviation in day-to-day operation, which is useful for understanding routine control precision rather than rare outliers.
In Equation (9), NRMSE scales RMSE by a reference magnitude so that variables with different magnitudes can be compared directly. The output range and the mean are both common choices; the mean is used here.
Here $\bar{y}$ is the test-set mean of the target (here SAT or CHWLT). Because both targets are temperatures with strictly positive test-set means, this normalization is well-defined and avoids the outlier sensitivity inherent to range-based scaling.
R² measures how much of the variation in the true signal the model can explain (Equation (10)). Higher values indicate that the model is not only tracking the trend correctly but also capturing coil dynamics consistently without drift.
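For reference, the four metrics can be written in their standard forms as below. This is a restatement under stated assumptions rather than a copy of the original typesetting: $N$ (number of test samples), $y_i$ (measured value), $\hat{y}_i$ (prediction), and $\bar{y}$ (test-set mean) are notation chosen here, and NRMSE is shown in its mean-normalized form.

```latex
\begin{align}
\mathrm{RMSE}  &= \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2} \tag{7}\\
\mathrm{MAE}   &= \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right| \tag{8}\\
\mathrm{NRMSE} &= \frac{\mathrm{RMSE}}{\bar{y}}, \qquad \bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i \tag{9}\\
R^2            &= 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2} \tag{10}
\end{align}
```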
In addition to point errors, this study reports interval uncertainty to support supervisory/MPC use. Two leakage-safe constructions are used: (a) stationary block bootstrap bands formed by resampling residual blocks from the chronological test horizon, yielding empirical 90% and 95% per-timestamp bands, and (b) split-conformal prediction intervals computed from OOF residuals of the stacked model, providing finite-sample marginal coverage under exchangeability. Both procedures respect time order (no future data in fit) and use the same train/validation segmentation as the main pipeline.
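The split-conformal construction in (b) can be sketched as follows, assuming symmetric intervals built from absolute residuals. The function name, the toy residuals, and the symmetric form are assumptions for illustration; the study's exact scoring rule is not specified here.

```python
import math


def conformal_half_width(calib_residuals, coverage=0.90):
    """Half-width q so that [y_hat - q, y_hat + q] attains roughly the
    requested marginal coverage, assuming exchangeable residuals."""
    scores = sorted(abs(r) for r in calib_residuals)
    n = len(scores)
    rank = math.ceil((n + 1) * coverage)  # finite-sample corrected rank
    if rank > n:
        return float("inf")  # too few calibration points for this coverage
    return scores[rank - 1]


# Hypothetical OOF residuals (deg F): magnitudes 0.1, 0.2, ..., 1.9 with alternating sign.
oof_residuals = [0.1 * k * (-1) ** k for k in range(1, 20)]
q90 = conformal_half_width(oof_residuals, coverage=0.90)

y_hat = 55.0  # a point forecast from the stacked model (made-up value)
lower, upper = y_hat - q90, y_hat + q90
print(q90, lower, upper)
```

The `(n + 1)` correction is what gives the finite-sample guarantee; with very few calibration residuals the function returns an infinite width rather than an over-confident band.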
In summary, the proposed framework combines physics-aware feature construction, time-ordered validation, and a diverse ensemble of learning models with leakage-safe stacking. Dynamic lag and rolling windows give the models access to the short-term memory inherent to coil thermal and moisture behavior, while Optuna ensures that both model structure and temporal horizon are tuned objectively rather than assumed. By retraining each learner under a causal split and blending predictions only through out-of-fold information, the pipeline is designed to remain realistic to real HVAC supervisory operation and robust to varying load conditions.
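A minimal sketch of past-only lag and rolling features with burn-in removal is given below. The function name, the single-signal setup, and the use of a rolling mean as the only rolling statistic are simplifying assumptions; the study tunes lag and roll depths with Optuna over several BMS signals.

```python
def make_lagged_rows(series, n_lags, roll_window):
    """Build (features, target) pairs where the features for step t use
    only values strictly before t. Rows that would need history before
    the start of the record are dropped (burn-in)."""
    burn_in = max(n_lags, roll_window)
    rows = []
    for t in range(burn_in, len(series)):
        lags = [series[t - k] for k in range(1, n_lags + 1)]  # t-1, t-2, ...
        window = series[t - roll_window:t]                    # excludes series[t]
        roll_mean = sum(window) / roll_window
        rows.append((lags + [roll_mean], series[t]))
    return rows


signal = [1, 2, 3, 4, 5, 6]  # toy values standing in for one BMS signal
rows = make_lagged_rows(signal, n_lags=2, roll_window=3)
for features, target in rows:
    print(features, "->", target)
```

Dropping the first `max(n_lags, roll_window)` rows before assigning folds keeps partially filled windows out of both training and validation.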
3. Results
This section presents the evaluation of the proposed dynamic learning framework in accordance with the methodological structure. Initially, the predictive performance of the individually tuned base models is established to characterize their ability to represent cooling-coil behavior under realistic operating variability. Subsequently, the leakage-safe stacked ensemble is examined to demonstrate its contribution beyond single-model learning, particularly under short-term thermal and moisture transients that typify air-handling unit dynamics. The analysis integrates scalar performance metrics, parity agreement, and time-series fidelity, alongside residual diagnostics and feature-contribution patterns. This layered presentation enables a comprehensive assessment of numerical accuracy, temporal generalization, and physical coherence, ensuring that observed improvements are interpreted within the operational context of supervisory HVAC control and digital-twin applicability.
3.1. Benchmark Performance Across Base Learners
To ground the ensemble evaluation, each tuned base learner is first examined under identical dynamic conditions. This establishes the intrinsic predictive capacity of individual models when exposed to the fast temperature swings, moisture fluctuations, and valve-driven transients characteristic of cooling-coil operation. The goal here is not simply to present numbers but to understand how each algorithm responds to the physics of the system (short-cycle thermal inertia, latent moisture lag, and sensor-driven disturbances) before combining their strengths in the stacked stage.
Table 2 summarizes the optimal lag/rolling configurations and average performance. Across the models, the optimal temporal windows fall between two and seven lags, aligning with the short memory horizon of coil thermal and mass-transfer processes. The Random Forest preferred a 4/3 lag-roll structure, providing balanced short-term prediction and robustness during mixed-air disturbances. Bagging converged to 3/4, trading a slightly shallower historical depth for smoother variance behavior and improved noise tolerance. XGBoost selected 2/3, emphasizing very near-term history, and this shorter receptive field advantaged humidity tracking, where recent latent signals dominate. The ANN settled deepest (5/7), as expected for a neural architecture exploiting temporal representation to model psychrometric continuity and moisture storage effects within the coil.
A clear pattern that emerges here is that tree-based learners excel at fast nonlinear shifts, while ANN provides the continuity essential for latent-load transitions. These behaviors are not incidental; they mirror the physics of coil heat and mass exchange. Moisture ratio evolves smoothly with film dynamics, whereas temperatures can snap under valve repositioning and chilled water step changes.
This nuance becomes more visible in Figure 7, where R² values across outputs cluster in the 0.97–0.996 range for base learners, with the ANN reaching 0.996 on air temperature while XGBoost shows slight degradation in chilled water leaving-temperature prediction (0.959) due to sharper gradient-driven behavior. Bagging sits consistently high (0.995), reflecting its bias-variance advantage over single-tree methods.
3.2. Comparative Error Structure and Dynamic Response
Error behavior was evaluated through parity alignment and dynamic tracking to assess model stability under realistic operating conditions. While scalar metrics indicate high accuracy across all learners, transient behavior reveals distinct response characteristics reflective of each algorithm’s learning bias and the underlying coil physics.
Tree-based methods demonstrate reliable behavior under mode switching and chilled water valve transitions. Random Forest maintains a consistent transient response, while Bagging exhibits reduced variance and smoother recovery, indicating its bias-variance advantage. XGBoost delivers sharp correction capability and excels on humidity transients yet displays minor overshoots during abrupt thermal shifts. The ANN produces the smoothest trajectories and strongest latent-load representation, though with modest inertia during step changes. The stacked ensemble consistently outperforms individual models, achieving the tightest clustering around the identity line and the lowest deviation during both rapid temperature ramps and moisture diffusion periods. By combining fast local corrections and smooth psychrometric dynamics, the ensemble delivers robust predictive stability under both static and transition regimes, aligning closely with the short thermal and mass-transfer memory characteristics of cooling coils.
Table 3 summarizes aggregate metrics across base learners and the stacked ensemble. The stacked approach achieved the highest average R² = 0.997 and the lowest NRMSE = 0.015 and MAE = 0.232, outperforming all individual models. Among base learners, the ANN demonstrated the strongest standalone fidelity (R² = 0.995, NRMSE = 0.017), while Bagging followed closely (R² = 0.993). XGBoost showed sharper sensitivity to humidity fluctuations but exhibited slightly higher error for water-side prediction (R² = 0.975), consistent with its aggressive gradient-driven fitting. Random Forest performed robustly overall (R² = 0.983), particularly under mode shifts and noise disturbances.
Figure 8 illustrates the stacked ensemble’s ability to remain aligned with real measurements across the three coil output variables. The model tracks rapid sensible cooling transitions and slower latent responses, retaining phase consistency even during abrupt temperature drops and mixed-air disturbances. This behavior reflects the physical short-horizon memory of cooling-coil dynamics and the ensemble’s ability to balance fast reactivity (tree methods) and smooth latent-load representation (ANN).
Although the two curves in Figure 8 appear visually similar, this is an expected outcome of the controlled experimental setup. The AHU operated under stable cooling-only conditions, where the chilled water coil served as the primary heat exchanger between the air and water loops. Because both variables respond to the same coil load dynamics, they exhibit nearly synchronized transitions in magnitude and timing. In a dry coil, the air-side and water-side outlets are driven by the same load Q(t), so SAT and CHWLT vary almost proportionally. Condensation requires the minimum coil surface temperature Tsurf,min to fall below the mixed-air dew point Tdp,in.
From the plotted period, SAT is regulated near 55 °F, and CHWLT remains ≥ 56–58 °F; with a typical coil approach, a conservative bound is . Thus, Tsurf,min stayed at or above the likely Tdp,in under these runs, so the psychrometric path was essentially horizontal (no humidity ratio change) and no latent heat removal occurred. Additional empirical signatures consistent with a dry coil are visible in the figure with no latent plateau in SAT during step tests, near-unity proportionality between SAT and CHWLT across all ramps, and the absence of phase-change transients. In short, the water and air outlets co-move because the coil operates above the dew-point threshold; therefore, no condensation was physically possible in the intervals shown.
To isolate transient behavior, Figure 9 compares supply-air temperature trajectories across all learners. Tree-based models respond quickly to sudden chilled water valve movements but show small high-frequency deviations during rapid ramps. The ANN offers smoother evolution yet shows slight delays in steep transitions. XGBoost exhibits sharper corrections but occasionally overshoots. The stacked model combines desirable traits from each (fast step response, suppressed oscillations, and minimal phase lag), yielding the closest match to observed dynamics across all operating regimes.
This convergence of error behavior indicates that the ensemble not only minimizes residual magnitude but also preserves thermodynamic realism and operational stability, key properties for digital-twin and MPC applications. Full time-series overlays for all models and all variables, including chilled water output trajectories, are provided in Appendix B for completeness.
3.3. Stacked Model Gains and Operational Fidelity
Building directly on the benchmark analysis, the stacked ensemble exhibits a consistent and meaningful uplift over every base learner, both numerically and behaviorally. The strongest individual model (ANN) already performed at a high level, achieving a macro-R² of 0.995 with an average NRMSE of 0.017. The stacked configuration further compresses error, reaching an average R² of 0.997 and NRMSE of 0.015 (Table 3), representing a 12–18% normalized-error reduction relative to the ANN and 40–55% improvement over tree-only models. These gains are not marginal noise; they translate into visibly tighter alignment during the most demanding operational windows.
Improvements concentrate where cooling-coil dynamics are least forgiving: abrupt chilled water reset events, mixed-air ratio swings, and rapid humidity rebounds. In these regions, individual models occasionally show slight under- or overshoot, particularly when thermal and moisture transport time-constants diverge. The ensemble suppresses these deviations, maintaining near-zero-mean residuals even through steep trajectory changes. The behavior aligns with physical system response patterns—recognizing that air-side moisture diffusion lags temperature response and valve modulation induces local nonlinear jumps in coil capacity utilization.
This performance stems from complementary inductive biases:
Tree-based learners partition regimes effectively, offering crisp handling of sudden mode shifts and discrete operational boundaries.
The ANN preserves continuity, capturing psychrometric curvature, latent-load evolution, and slow humidity relaxation without introducing delay drift.
When combined, the stacked structure tracks coil thermal inertia and moisture-lag dynamics without the oscillatory correction tendencies or “anticipation errors” occasionally visible in single-model traces. The resulting trajectories are smoother where physics demands continuity and sharper where plant behavior genuinely transitions. The ensemble follows each output through steep ramps and recovery zones with no noise amplification, no premature inflection points, and minimal residual widening. Comparative overlays for the base models (Appendix B) show more pronounced local deviations, reinforcing the ensemble’s ability to stay aligned during transient, high-gain regions rather than only steady periods.
Taken together, the stacked approach improves not only scalar metrics but temporal coherence, disturbance rejection, and physical consistency, properties that matter in real plant deployment. These characteristics position the ensemble as a suitable candidate surrogate for supervisory control, real-time optimization, and Model Predictive Control (MPC) workflows in AHU systems, where accurate short-horizon dynamics under uncertainty are essential [33,39].
3.4. Comparing Stacked Ensemble, GRU Sequence, and Classical Baseline
To provide a sequence-model reference under the same leakage-safe protocol used throughout this study, a compact gated recurrent unit (GRU) was implemented and tuned to consume short look-back windows of the causal feature stream and produce two one-step-ahead outputs (SAT and CHW). Its final configuration is reported in Appendix A, Table A4.
In Table 4, $\Delta\mathrm{NRMSE} = \mathrm{NRMSE}_{\mathrm{stack}} - \mathrm{NRMSE}_{\mathrm{ANN}}$, expressed in percentage points (pp), so negative values favor the stacked ensemble. Confidence intervals (CIs) are obtained via stationary block bootstrap, and the Diebold–Mariano test uses squared-error loss with Newey–West adjustment. The walk-forward row aggregates per-window results (mean values). Although both models operate at high fidelity, paired tests on per-timestep errors show that the stacked ensemble’s improvement over the ANN is statistically significant and operationally consistent. On the lockbox, ΔNRMSE CIs are strictly negative for both targets (SAT −0.70 pp [−0.95, −0.45]; CHW −0.70 pp [−0.96, −0.44]); walk-forward analysis yields similar gaps (SAT −0.62 pp [−0.84, −0.40]; CHW −0.65 pp [−0.88, −0.42]). Diebold–Mariano tests favor the stacked model (p = 0.006–0.010) and Wilcoxon tests confirm significance (p < 0.001). Thus, the ensemble’s blended learners provide a reproducible reduction in error beyond the ANN alone, especially around set-point steps and mixed-air transients where complementary inductive biases reduce residuals without sacrificing steady-state accuracy.
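One way to obtain a bootstrap interval for ΔNRMSE is sketched below, using a stationary bootstrap with geometrically distributed block lengths and circular wrap-around. The mean block length, number of replicates, percentile rule, and toy data are assumptions for illustration and are not the study's settings.

```python
import math
import random


def nrmse(y, y_hat):
    """RMSE divided by the mean of the observed values."""
    mse = sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)
    return math.sqrt(mse) / (sum(y) / len(y))


def stationary_bootstrap_indices(n, mean_block, rng):
    """Resampled index sequence: start a new block with probability
    1/mean_block, otherwise continue to the next index (wrapping at n)."""
    idx = [rng.randrange(n)]
    while len(idx) < n:
        if rng.random() < 1.0 / mean_block:
            idx.append(rng.randrange(n))
        else:
            idx.append((idx[-1] + 1) % n)
    return idx


def delta_nrmse_ci(y, stack_pred, ann_pred, n_boot=500, mean_block=5, level=0.95, seed=0):
    """Percentile CI for 100 * (NRMSE_stack - NRMSE_ANN), in percentage points."""
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_boot):
        idx = stationary_bootstrap_indices(len(y), mean_block, rng)
        ys = [y[i] for i in idx]
        d = nrmse(ys, [stack_pred[i] for i in idx]) - nrmse(ys, [ann_pred[i] for i in idx])
        deltas.append(100.0 * d)
    deltas.sort()
    lo = deltas[int((1.0 - level) / 2.0 * n_boot)]
    hi = deltas[int((1.0 + level) / 2.0 * n_boot) - 1]
    return lo, hi


# Toy test horizon: the stacked model errs by 0.1 deg F, the ANN by 0.3 deg F.
y_true = [55.0 + 0.5 * (t % 4) for t in range(40)]
stack = [v + 0.1 * (-1) ** t for t, v in enumerate(y_true)]
ann = [v + 0.3 * (-1) ** t for t, v in enumerate(y_true)]

point = 100.0 * (nrmse(y_true, stack) - nrmse(y_true, ann))
ci_lo, ci_hi = delta_nrmse_ci(y_true, stack, ann)
print(point, ci_lo, ci_hi)
```

Resampling the same indices for both models keeps the comparison paired, and resampling in blocks preserves the short-range autocorrelation of the errors.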
Comparative test performance is summarized in Table 5, where stacking delivers a 29.2% reduction in RMSE, an 88.4% reduction in mean-based NRMSE, and a +0.025 absolute gain in R². The magnitude of improvement indicates simultaneous bias correction and variance reduction: the ensemble’s blended learners capture sharp step responses and mix-induced transients more faithfully, while preserving steady-state fidelity, whereas the single-stream GRU exhibits residual inertia around rapid set-point changes.
The GRU baseline usefully captures short memory and delivers smooth trajectories but exhibits modest inertia during fast valve/mixing events. The stacked ensemble’s diversity (trees for regime partitioning, boosting for sharp local correction, and the ANN for smooth psychrometric mapping), combined via a regularized linear combiner (RidgeCV), yields tighter parity clustering and lower transient residuals without sacrificing steady-state fidelity. Practically, the ensemble’s error profile better supports stable SAT/CHW set-point moves and mitigates low-ΔT penalties by reducing overshoot and recovery oscillations under routine BAS inputs. Walk-forward results show consistent ranking across windows, and aggregate performance is reported as mean ± spread in Appendix A, Table A6, confirming robustness across operating periods.
The persistence baseline serves as a lower bound for short-horizon forecasting: by projecting $\hat{y}_{t+1} = y_t$, it exploits the high lag-1 autocorrelation typical of SAT/CHWLT during quasi-steady operation. As expected, it performs reasonably when dynamics are slow, but it systematically lags and overshoots around set-point changes and mixed-air disturbances, producing large transient residuals and the weakest overall fidelity. ARIMA improves on persistence by modeling linear autocorrelation and short-memory noise; this yields smaller phase lag and lower variance in steady regimes. However, because it is univariate and linear, ARIMA cannot encode exogenous drivers (valve movement, mixed-air shifts) or nonlinear regime changes in coil heat/mass transfer. Consequently, both methods underperform the learning-based models on the held-out horizon: their errors are concentrated at step events and rapid mixing periods, while the GRU and especially the stacked ensemble better reconcile transient response with steady-state accuracy by leveraging causal exogenous history and complementary inductive biases. In Figure 8 and Figure 9, shaded 90% bands are overlaid: thin bands denote split-conformal intervals from OOF residuals; light bands denote block-bootstrap intervals on the lockbox window. Observed coverage on SAT/CHWLT is 0.89–0.92, with median half-widths of 0.35–0.45 °F (steady) and 0.6–0.8 °F (transients).
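Observed coverage and median half-width can be computed from per-timestamp bands as in the short sketch below. The function name and the toy values are assumptions for illustration and do not reproduce the reported numbers.

```python
from statistics import median


def coverage_and_half_width(y, lower, upper):
    """Fraction of observations inside [lower, upper] and the median band half-width."""
    inside = sum(1 for v, lo, hi in zip(y, lower, upper) if lo <= v <= hi)
    half_widths = [(hi - lo) / 2.0 for lo, hi in zip(lower, upper)]
    return inside / len(y), median(half_widths)


# Toy SAT observations and band centers (deg F) with a constant 0.4 deg F half-width.
observed = [55.0, 55.2, 56.0, 54.8, 55.5]
centers = [55.0, 55.0, 55.5, 55.0, 55.4]
lower = [c - 0.4 for c in centers]
upper = [c + 0.4 for c in centers]

cov, half = coverage_and_half_width(observed, lower, upper)
print(cov, half)
```

Reporting the two quantities together matters because wide bands can reach nominal coverage trivially; splitting the calculation by steady and transient periods would give the kind of breakdown quoted above.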
3.5. Model Interpretability and Physical Alignment
To ensure the stacked framework remains transparent and physically credible, model interpretability was examined using leakage-safe permutation importance (PI) computed over the held-out test horizon. PI was selected over attention-based or gradient-based explainers due to its model-agnostic nature and minimal assumptions [40,41] and to avoid the risk of temporal leakage common in naive feature-attribution methods for time-series systems. Only past-available lags and rolling features were perturbed to preserve causal structure and avoid overstating contributions from future information.
Across models and targets, the mixed-air temperature and mixed-air humidity ratio features rank highest, followed by chilled water entering temperature and chilled water flow, with total supply airflow contributing primarily during higher load or reset conditions (Figure 10). Dynamic features show a physically plausible temporal decay: short lags (e.g., t − 1) dominate, with diminishing contributions at t − 3 and t − 6, which mirrors the coil’s short-memory thermal and moisture time constants. Rolling statistics (3–12 samples) appear as stabilizers: they add value where sensor noise or rapid setpoint changes would otherwise increase variance, especially for humidity ratio prediction. These plots serve as an audit trail indicating that the models learned the relevant physics rather than shortcuts. By perturbing past-available features (permutation importance) and reading split gain (boosting), the analysis confirms that the predictors anchor on mixed-air humidity/temperature and t − 1 lags, which matches coil heat and mass-transfer fundamentals. That validation serves three purposes: (i) trust: the stacked surrogate is suitable for supervisory/MPC use because it reacts to the same cues operators use; (ii) diagnostics: it identifies sensors whose drift would hurt forecasts most (mixed-air humidity/temperature, water-side signals); and (iii) design rationale: it explains why stacking works (trees handle regime jumps; boosting refines local humidity structure; the ANN smooths psychrometrics), giving a causal link between the features and the transient gains reported earlier.
Tree-based learners display sharper importance contrast (clearer separation among top features), reflecting their regime-partition behavior; the ANN shows a smoother importance spectrum, consistent with continuous psychrometric mapping. For chilled water leaving temperature, water-side variables and very short lags rise in rank, confirming the role of loop thermal inertia. For supply-air humidity ratio, mixed-air humidity and recent target lags are dominant, consistent with moisture-film and surface-sorption dynamics. These patterns align with the dynamic-feature design of Section 2.3 and the transient gains of the stacked model in Section 3.3.
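Permutation importance on a held-out horizon can be sketched as below: each past-available feature column is shuffled in turn and the resulting increase in MAE is averaged over repeats. The linear stand-in for the surrogate, the feature names in the comments, and the toy data are all hypothetical and serve only to make the example self-contained.

```python
import random


def mae(y, y_hat):
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean increase in MAE when one feature column is shuffled at a time."""
    rng = random.Random(seed)
    base = mae(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mae(y, [predict(row) for row in X_perm]) - base)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_surrogate(row):
    """Hypothetical stand-in for a fitted model; it ignores the third feature."""
    mixed_air_temp_lag1, chw_entering_temp_lag1, unused = row
    return 0.8 * mixed_air_temp_lag1 + 0.2 * chw_entering_temp_lag1 + 0.0 * unused


X_test = [[70.0, 44.0, 1.0], [72.0, 45.0, 5.0], [75.0, 44.0, 2.0], [78.0, 46.0, 7.0],
          [80.0, 45.0, 3.0], [74.0, 44.0, 8.0], [71.0, 46.0, 4.0], [77.0, 45.0, 6.0]]
y_test = [toy_surrogate(row) for row in X_test]

importances = permutation_importance(toy_surrogate, X_test, y_test)
print(importances)
```

Because only columns built from past values are permuted, the attribution stays consistent with the causal feature design; a feature the model never uses receives an importance of zero.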
4. Practical Application
The predictive framework developed enhances modern building operations by providing stability and physical credibility in crucial conditions such as mixed-air variability, chilled water modulation, and rapid coil transients, areas where traditional approaches struggle, especially in humid climates. This framework focuses on generating short-horizon, leakage-free forecasts that are sensitive to humidity, allowing for reliable performance in supervisory and model predictive control (MPC) applications. The stacked surrogate model achieves this without necessitating a fully specified coil model, bridging the gap between heavy physical calibration and opaque black-box methods [42,43].
The stacked surrogate operates at sub-second latency at the native BMS cadence on commodity edge hardware and relies only on routine BMS points. A tiered missing data policy is applied (short-gap carry-forward, robust imputation for longer gaps) with a safe-mode fallback (persistence/rule-based) when inputs are unavailable or uncertainty bands widen. Forecasts are accompanied by leakage-safe 90% bands, which gate supervisory updates to avoid unstable set-point moves during transients or sensor faults. The prior literature indicates significant energy and comfort improvements with effective forecasting that can adapt to disturbances; the demonstrated ensemble model shows reduced overshoot and a smooth recovery during water-side steps, reinforcing this notion. Additionally, the model functions as a diagnostic companion by monitoring key signals: mixed-air humidity ratio, temperature, and chilled water levels. Deviations in these measurements can reveal sensor drift, enabling early detection of coil performance issues. Consequently, this framework not only serves as a predictive tool but also contributes to broader analytics frameworks or digital twins, ensuring plant health checks alongside forecasting capabilities. In supervisory and model predictive control settings, short-horizon, leakage-free forecasts with minute-scale memory and humidity sensitivity are essential [44]. The stacked surrogate supplies this layer without requiring a fully identified coil model, offering a middle path between physics-heavy calibration and opaque black-box fitting. Prior building-MPC literature has consistently shown energy and comfort improvements when forecasts are reliable and responsive to disturbances; the behavior demonstrated here, particularly the ensemble’s reduced overshoot and smooth recovery during water-side steps, supports that same trajectory.
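The band-gated update with a safe-mode fallback can be sketched as a small decision function. The 0.8 °F width limit, the function name, and the use of the last valid observation as the persistence fallback are assumptions chosen for illustration, not the deployed rules.

```python
def gated_forecast(forecast, half_width, last_valid_obs, inputs, max_half_width=0.8):
    """Return (value, mode). Falls back to persistence when any required
    input is missing or the 90% band is wider than max_half_width (deg F)."""
    inputs_ok = all(v is not None for v in inputs)
    if not inputs_ok or half_width > max_half_width:
        return last_valid_obs, "safe_mode"
    return forecast, "forecast"


# Normal operation: all inputs present and a narrow band.
print(gated_forecast(55.3, 0.4, 55.0, [72.1, 0.0102, 44.0]))
# Transient or fault: the band has widened beyond the limit.
print(gated_forecast(56.9, 1.2, 55.0, [72.1, 0.0102, 44.0]))
# Missing BMS point: one input is unavailable.
print(gated_forecast(55.3, 0.4, 55.0, [72.1, None, 44.0]))
```

Keeping the gate outside the model means the fallback behavior can be audited and tuned by operators without retraining the surrogate.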
Operationally, the model acts as a diagnostics companion, capable of detecting deviations in mixed-air humidity ratio, temperature, and chilled water signals through drift in residual behavior. This allows the model to function not just as a forecasting tool but also as a soft-FDD (Fault Detection and Diagnosis) signal [45], highlighting coil performance changes or sensing degradation before they lead to operational issues. Its integration into a broader digital-twin or analytics pipeline positions these models as vital for verifying plant health, rather than merely providing predictions [46]. The novelty of the model lies in its combination of three key capabilities: (i) dynamic feature learning tailored to coil time constants, (ii) multi-output forecasting with leakage-safe stacking, and (iii) rigorous validation ensuring alignment with HVAC physics rather than mere proxy correlations. This advancement brings data-driven AHU (Air Handling Unit) modeling closer to reliable integration with control layers, emphasizing the coexistence of performance and interpretability. Future enhancements include closed-loop testing, focusing on two significant directions: integrating the surrogate model with a model predictive control (MPC) layer in a supervisory setting and enhancing transferability across similar AHUs through light re-training [47], all while preserving the principle of leveraging machine learning to advance, rather than supplant, physical understanding in complex, variable, or time-sensitive systems.
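One simple way to turn residual drift into a soft-FDD flag is a rolling-mean check, sketched below. The window length, the 0.5 °F threshold, and the toy residual trace are assumptions for illustration; the study does not prescribe a specific detector.

```python
from collections import deque


def drift_flags(residuals, window=5, threshold=0.5):
    """Flag each sample where the mean residual over the last `window`
    samples exceeds `threshold` in magnitude (no flag until the window is full)."""
    recent = deque(maxlen=window)
    flags = []
    for r in residuals:
        recent.append(r)
        is_full = len(recent) == window
        flags.append(is_full and abs(sum(recent) / window) > threshold)
    return flags


# Toy residuals (deg F): zero-mean noise followed by a sustained positive bias,
# as might follow a drifting mixed-air sensor.
residual_trace = [0.1, -0.1, 0.0, 0.1, -0.1, 0.8, 0.9, 1.0, 0.9, 1.0]
flags = drift_flags(residual_trace)
print(flags)
```

Averaging over a window makes the flag respond to a sustained bias rather than to isolated transient errors, which are expected around set-point steps.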
5. Limitations and Future Work
This investigation was performed on a single laboratory AHU under controlled but diverse operating conditions (systematic variations in mixed air temperature/humidity, supply airflow, chilled water flow, and entering water temperature and valve/fan set-points). While this design supports clean causal evaluation, it does not encompass cross-site heterogeneity, long-term weather effects, or outdoor/occupancy disturbances typical of field operation. Future work and scope will therefore (i) extend validation to multiple AHUs with different coil geometries, plant configurations and control policies; (ii) include outdoor factors, such as seasonal weather, ambient moisture, solar gains, and occupancy schedules through multi-season datasets and strictly chronological holdouts; (iii) assess transferability with lightweight domain adaptation (feature re-scaling, few-shot fine-tuning of base learners, time-safe re-fitting of meta-weights); (iv) quantify forecast uncertainty (e.g., quantile/conformal methods) for risk-aware supervisory control; and (v) profile edge deployment (latency/memory) for BMS-integrated, real-time use. Also, to broaden external validity further, the next phase can include pursuing a staged program with multi-site validation across buildings with varied coil geometries and plant hydraulics, as well as targeted campaigns under humid, dehumidifying regimes to complement the present dry-coil setting. Subsequent studies will incorporate seasonal/outdoor drivers (weather, economizer operation) and occupancy covariates, assess multi-step rollouts, and test calibration-light transfer to support portability. This prioritized roadmap directly addresses generalizability, lowers the practical risk of industrial adoption, and broadens the applicability of the proposed dynamic stacked-ensemble from a single, well-instrumented unit to a multi-site, outdoor-aware setting aligned with real-world building operations.
6. Conclusions
This study developed a dynamic, multi-output stacked ensemble for real-time forecasting of AHU cooling-coil behavior, encoding short-horizon memory through input/target lags and rolling psychrometric features, using leakage-free time-series validation, and tuning four base learners (Random Forest, Bagging-DT, XGBoost, ANN) with Optuna before fusing their out-of-fold predictions via a Ridge meta-learner. The objective was an accurate, reproducible surrogate aligned with coil heat- and mass-transfer behavior and suitable for operational use.
1. The stacked ensemble delivers consistently tighter forecasts than any single model, with errors compressed into a narrow band and goodness of fit effectively at the ceiling for this dataset. Relative to the best base learner (ANN), the ensemble trims a small but persistent slice of error across the entire horizon, and this advantage remains stable when the evaluation is repeated over multiple walk-forward windows (low mean ± SD).
2. Against classical time-series references and a compact GRU, the ensemble’s advantage is not incremental but categorical; persistence struggles at step changes, ARIMA improves steady-state tracking yet misses nonlinear/exogenous effects, and the GRU smooths trajectories but retains lag around rapid transients. In contrast, stacking maintains low error across both calm periods and disturbances, indicating that the combined inductive biases generalize better than any single forecasting paradigm.
3. During chilled water valve ramps and mixed-air shifts, the ensemble exhibits smaller peaks and faster recovery; peak-step MAE drops by ~18% (95% CI: 12–24%) relative to the ANN, with visibly tighter parity clustering, narrower residual bands around set-point changes, and less phase lag, with no loss of steady-state agreement.
4. Permutation importance and ablations consistently elevate mixed-air temperature alongside short lags (t − 1, t − 3). Optuna’s selected lag/roll horizons match minute-scale coil inertia, indicating that the surrogate’s accuracy is anchored in load-driven dynamics rather than incidental correlations.
5. Paired tests on time-ordered errors confirm that stacked vs. ANN gains exceed noise: ΔNRMSE ≈ −0.62 to −0.70 pp with 95% CIs below zero, Diebold–Mariano p ≈ 0.006–0.010, Wilcoxon p < 0.001. The same ranking holds across walk-forward windows, underscoring that improvements are consistent, statistically reliable, and operationally relevant.
The present findings derive from a single laboratory AHU subjected to controlled perturbations (mixed-air, fan, and chilled water adjustments), which is a design that enables clean, causal evaluation but does not fully capture cross-site variability, weather/occupancy influences, coil fouling, or longer-horizon drift. Subsequent work will extend validation across multiple buildings and seasons under strict chronological splits, incorporating outdoor and occupancy covariates and assessing calibration-light transfer (time-safe re-fit, simple feature rescaling) to support portability across sites. It will introduce quantified uncertainty via quantile/conformal intervals with coverage diagnostics, benchmark against physics-informed gray-box coil surrogates and compact sequence baselines under the same leakage-safe protocol, and stress-test robustness to sensor dropout/bias and economizer events. Finally, closed-loop trials within supervisory/MPC layers reported with edge deployment profiling (latency, memory, fail-safe fallbacks) will quantify operational benefits in energy, comfort, and stability.
Overall, the hyperparameter-tuned, leakage-safe stacking framework functions as a high-fidelity, operationally practical surrogate for SAT and CHWLT forecasting, delivering tight transient tracking, stable steady-state behavior, and physics-consistent feature use, which are well-suited for set-point exploration, short-horizon control, and supervisory decision support in industrial AHUs.