An Artificial-Intelligence-Based Predictive Maintenance Strategy Using Long Short-Term Memory Networks for Optimizing HVAC System Performance in Commercial Buildings

Manea Almatared; Mohammed Sulaiman; Abdulaziz Alghamdi; Eman Nasrallah

doi:10.3390/buildings15224129

¹ Department of Civil Engineering, College of Engineering, Najran University, Najran 66426, Saudi Arabia

² Department of Civil Engineering, Faculty of Engineering, Al-Baha University, Alaqiq 65779, Saudi Arabia

³ Faculty of Engineering, Department of Civil Engineering, University of Tabuk, Tabuk 47512, Saudi Arabia

⁴ Department of Interior Design, College of Design and Applied Art, Taif University, Taif 26571, Saudi Arabia

Buildings 2025, 15(22), 4129; https://doi.org/10.3390/buildings15224129 (registering DOI)

This article belongs to the Special Issue AI-Driven Cooling, Refrigeration, and Energy Solutions for Built Environments

Abstract

This study addresses the persistence of avoidable failures and efficiency losses in HVAC plants by introducing a field-validated predictive maintenance (PdM) framework that estimates component-level RUL from multiyear BMS telemetry and translates forecasts into schedule-aware maintenance actions. The objective was to determine whether an LSTM ensemble with mode-aware segmentation and isotonic calibration could yield decision-quality RUL forecasts that reduce unplanned outages, downtime, and electricity use in a large Riyadh office building. Two years of 1 min BMS data from chillers, primary pumps, and AHU fans were cleaned, standardized, and segmented by operating mode; RUL labels were derived from time-stamped work orders and failure confirmations; the LSTM produced per-minute RUL estimates trained with a Huber loss, calibrated to lower quantiles, and converted to sustained triggers compared against a fixed-interval program. On the held-out test set, the model achieved a weighted MAE of 19.8 ± 2.1 h and RMSE of 29.1 ± 3.3 h, with quantile calibration error (QCE) $\leq 0.06$ and lead-time accuracy (LTA; fraction of triggers whose calibrated lower-quantile RUL is $\geq$ the planning threshold) of 0.79 at a 10-day threshold. When deployed in counterfactual evaluation, triggers reduced unplanned outages by 47.6% (paired bootstrap p = 0.008) and total downtime by 41.3% (p = 0.012), and yielded a 10.6% reduction in HVAC electricity (95% CI: 7.7–13.2%) and a 9.7% decrease in total operating cost. The findings indicate that calibrated sequence models coupled to simple sustained triggers can convert routine BMS data into reliable maintenance schedules with quantifiable reliability and energy benefits. Practically, conservative calibration ($q \approx 0.25$) with thresholds of 10–12 days provided stable lead windows; future work should assess transferability across climates and facility types using transfer learning and integrate uncertainty-aware triggering with MPC for joint operational and maintenance optimization.

Keywords: AI; BMS; HVAC; LSTM; PdM

1. Introduction

Heating, Ventilation, and Air Conditioning (HVAC) systems dominate operational energy use in commercial buildings and are tightly coupled to occupant comfort, indoor air quality, and asset reliability [,]. In hot, arid regions such as Riyadh, cooling demand is particularly intensive: recent regional analyses of Gulf Cooperation Council capitals report persistently high cooling degree days, while urban heat island effects further elevate peak electricity demand and building cooling loads [,,]. Against this background, maintenance strategies exert a first-order influence on both energy and reliability outcomes. Reactive repair and fixed-interval preventive regimes, while ubiquitous, often miss slowly evolving degradations (e.g., refrigerant undercharge, fouled heat exchangers, pump wear), allowing latent faults to erode efficiency and precipitate avoidable outages [,,].

Building Management System (BMS) telemetry—temperature, pressure, humidity, power, vibration, valve positions, and control states—enables AI-driven predictive maintenance (PdM) that replaces periodic or reactive practices with condition-based actions []. Sequence models such as Long Short-Term Memory (LSTM) networks are effective for multivariate building time series because they capture long- and short-range dependencies in component health trajectories; moving from fault labels to Remaining Useful Life (RUL) provides a directly schedulable target that minimizes downtime windows and preserves efficiency [,,,,,,,,]. Prior building work has largely focused on fault detection and diagnosis (FDD) and short-horizon load prediction, often on simulated datasets; while LSTM-based pipelines achieve strong classification and forecasting performance, component-level RUL regression on multiyear, high-frequency field BMS remains comparatively scarce [,,,,,,,,,]. This study addresses that gap by translating calibrated sequence forecasts into schedule-aware maintenance decisions.

In parallel, the controls community has advanced data-driven and online-learning model predictive control (MPC) for HVAC, which shares methodological DNA with PdM by leveraging predictive models to optimize operations under uncertainty [,,]. These efforts underscore the centrality of high-resolution BMS data, the need for models that adapt to drift, and the value of interpretable mechanisms to build operator trust [,].

Building on recent reviews, the relevant literature can be organized into four methodological families that span detection, diagnosis, and prognosis:

Physics-based and gray-box models combine first-principles with parameter estimation to detect deviations in thermal or hydraulic balances. They can be interpretable and data-efficient but may struggle with unmodeled dynamics in large, heterogeneous portfolios [].
Classical machine learning (e.g., support vector machines, random forests) maps engineered features to fault classes or health indicators. These methods are robust at small scale but often require intensive feature engineering and may not capture long-range dependencies [,].
Deep learning (CNNs, autoencoders, LSTMs, and hybrids) learns hierarchical representations directly from raw or lightly engineered streams. These models excel for high-dimensional BMS data and have delivered state-of-the-art FDD accuracy for AHUs and chillers [,,,].
Prognostics and health management (PHM) focuses on RUL estimation to enable PdM scheduling. Although widely studied in manufacturing and turbofan benchmarks with LSTM and attention mechanisms, rigorous RUL studies for HVAC components in operational buildings remain comparatively scarce [].

This typology clarifies a challenge: most building studies concentrate on detection/diagnosis; relatively fewer quantify degradation rates and predict RUL at the component level using real BMS streams in occupied buildings.

Despite impressive advances, several cross-cutting challenges remain. First, data quality and label scarcity constrain supervised learning; real fault data are infrequent, imbalanced, and non-representative of the full operating envelope [,]. Second, generalization and transferability across sites, climates, and equipment vintages are unresolved; high-accuracy models often degrade when deployed under “cross-condition” shifts, prompting interest in domain adaptation and hybrid knowledge-guided learning []. Third, interpretability and operator trust are essential for adoption in critical infrastructure; progress in mechanism-aware and explainable deep learning is promising but not yet standard practice []. Fourth, concept drift in buildings—due to occupancy, weather, and control retuning—necessitates models that update online while preserving stability and safety [].

Within this landscape, a distinct research gap emerges. The broader literature heavily emphasizes FDD classification, short-horizon load prediction, or method showcases on simulated datasets [,,]. By contrast, component-level RUL forecasting for chillers, primary pumps, and AHU fans using multiyear, high-frequency BMS data from a large commercial building in Riyadh—with rigorous benchmarking against maintenance logs and quantification of reliability and energy impacts—has been sparsely documented. Prior studies rarely report prediction horizons and uncertainty in a way that directly informs scheduling windows, nor do they consistently translate predictions into downtime reduction and energy savings under realistic operational constraints [,,].

Selecting LSTM for prognostics is methodologically grounded: the architecture encodes long-term temporal dependencies and non-linear interactions needed to capture slow drifts (e.g., bearing wear) superimposed on control cycles, and sequential hybrids have shown resilience to operating-point variation relevant to Riyadh’s seasonal regimes [,,]. Accordingly, we use LSTM with conservative calibration to produce RUL estimates that are robust across operating modes.

Equally important is the regional context. In Saudi Arabia, cooling dominates building electricity use, with policy initiatives targeting efficiency improvements in the building sector [,,]. For a dense office district in Riyadh, where urban overheating can raise cooling loads and peak demand, timely detection of incipient degradation can prevent failures during heat extremes and sustain high Coefficient-of-Performance (COP) operation [,]. By framing PdM around RUL at the component level—rather than solely fault labels—the proposed strategy aligns maintenance windows with risk-informed thresholds, enabling interventions that are early enough to avoid forced outages yet late enough to maximize component life. This directly addresses facility-management objectives documented in ASHRAE’s service-life and life-cycle-cost guidance [].

Relative to prior HVAC analytics in the literature, the present work differs in three respects. First, whereas LSTM has been used for anomaly detection on simulated AHU faults [] or for cooling-load prediction to aid control [], this study formulates component-level RUL regression on field BMS data from a Riyadh commercial building, closing the gap between detection and actionable prognostics. Second, while recent frameworks discuss integrating rule-based maintenance with machine learning [], the proposed approach couples RUL forecasts with quantitative evaluation of downtime reduction and energy savings, thereby connecting model outputs to operational key performance indicators. Third, this study addresses cross-condition robustness [] in a harsh climate by incorporating operating-mode segmentation and validating prediction horizons against real failure records—dimensions often absent from FDD-centric reports.

The overarching aim is to develop and validate an AI-driven PdM framework using LSTM to forecast RUL for critical HVAC components and to quantify its operational and energy impacts in a large commercial building in Riyadh. This aim is responsive to calls in recent reviews for moving beyond fault classification towards deployable prognostics that address generalization, trust, and decision relevance [].

To operationalize this aim, this study addresses the following research questions (RQs):

RQ1. Can an LSTM-based model trained on multiyear, high-frequency BMS data achieve accurate RUL predictions for chillers, primary pumps, and AHU fans under real operating variability?
RQ2. What prediction horizons are actionable for condition-based scheduling in a Riyadh commercial building context, and how do MAE/RMSE translate into maintenance lead time?
RQ3. Compared with fixed-interval preventive maintenance, to what extent does RUL-guided PdM reduce unexpected downtime risk and sustain energy-efficient operation during peak cooling periods?
RQ4. How sensitive are performance and robustness to data pre-processing choices, operating-mode segmentation, and model hyperparameters, and what guidelines emerge for deployment in similar hot-climate portfolios?

The objectives of this study are: (i) to design a scalable LSTM prognostics pipeline for RUL estimation of chillers, primary pumps, and AHU fans using two years of high-frequency BMS data from a large Riyadh office building; (ii) to evaluate predictive accuracy with MAE and RMSE and to report actionable prediction horizons; (iii) to benchmark the proposed PdM strategy against fixed-interval maintenance in terms of unexpected downtime reduction; (iv) to estimate HVAC energy savings attributable to timely fault rectification, with sensitivity to prediction horizons and scheduling policies; (v) to distill deployment guidelines for hot-arid commercial buildings regarding data requirements, operating-mode segmentation, and model governance.

2. Materials and Methods

2.1. Study Design and Methodological Alignment to Objectives

The methodology was structured to address the stated objectives in a tightly coupled, end-to-end workflow that begins with field data acquisition from a large commercial office building in Riyadh and culminates in quantitative evaluation of a PdM policy driven by LSTM-based RUL forecasts. The building operates a central chilled-water plant with centrifugal chillers, variable-speed primary pumps, and AHUs serving multiple floors. The design explicitly aligns with the objectives as follows. The construction of a scalable prognostics pipeline (objective i) is realized through a modular data engineering stack that ingests, cleans, and segments two years of high-frequency BMS data into sequences suitable for LSTM training. Predictive accuracy and actionable prediction horizons (objective ii) are addressed through rigorous labeling of RUL targets from maintenance logs and through out-of-sample evaluation using MAE and RMSE with uncertainty quantification. A counterfactual comparison to a fixed-interval program (objective iii) measures unexpected downtime reduction using discrete maintenance triggers derived from predicted RUL. Energy impacts (objective iv) are estimated by mapping fault rectification timing to COP trajectories and associated electric energy, while sensitivity to prediction horizons and scheduling policy is evaluated. Finally, deployment guidelines for hot-arid contexts (objective v) are distilled by examining model robustness across operating modes typical of Riyadh’s climate. Throughout, methodological synergy is emphasized: every step is designed to preserve traceability from raw signals to decisions, thereby supporting replicability and independent verification.

2.2. Site, Equipment, and Data Acquisition

This study was conducted in a Class A office tower in the King Abdullah Financial District of Riyadh. The central plant comprises three water-cooled centrifugal chillers (nominal 900–1100 RT each), six variable-speed primary pumps, and multiple AHUs with variable air volume terminals. The BMS records sensor streams at 1 min native resolution; channels used herein include chilled- and condenser-water supply/return temperatures, differential pressures across pumps and coils, valve and damper positions, compressor and pump power, vibration accelerometer channels at selected pump and fan bearings, outside-air temperature and humidity, and AHU airflow rates. Maintenance logs and work orders spanning the same period record fault codes, corrective actions, and time-stamped return-to-service events. Ground-truth failure instances for RUL labeling were identified for chiller compressor faults, pump bearing failures, and AHU fan motor failures, each validated by technician notes and restart confirmations.

Each validated event in the maintenance logs corresponded to parameter checks defined in recognized international standards, including ASHRAE 180 [] for inspection scope/frequencies, ASHRAE Guideline 22 [] for chiller-plant instrumentation quality, AHRI 550/590 [] for rating definitions underlying COP and kW·ton, ISO 17359 [] for condition-monitoring program structure, and ISO 20816-3/-7 [,] for vibration measurement/evaluation of industrial machines and rotodynamic pumps; refrigerant safety and leak-inspection practices followed ISO 5149-4 [], with detector performance verified against EN 14624 []. Within that framework, parameter selection for this work prioritized (i) non-intrusive, continuously sampled BMS channels (e.g., coil ΔT, load proxies, COP/kW·ton, broadband vibration RMS), (ii) traceability to those standards’ acceptance/evaluation criteria, and (iii) direct linkage to work-order outcomes used for RUL labeling; these choices determined the plausibility ranges and sensor-specific quality flags referenced below.

Instrumentation accuracy statements from vendor datasheets were used to set plausibility ranges and sensor-specific quality flags. Spot calibration checks were performed during quarterly service windows by comparing BMS readings with handheld reference instruments; discrepancies exceeding acceptable tolerances triggered recalibration and annotation in a data quality ledger. All clocks were synchronized to network time during commissioning; remaining drift was corrected in software by aligning daily power integrals with utility interval meters.

2.3. Data Governance, Cleaning, and Synchronization

Raw streams were extracted in their native timebases and merged on a unified 1 min grid. Hard faults (disconnected sensors, flat-lines beyond a sensor-specific duration) were flagged and excluded from downstream modeling; soft faults (rare spikes, plausible but degraded signals) were retained after robust filtering. Missing intervals shorter than 15 min were imputed by last-observation-carried-forward within operating mode; longer gaps were left missing and masked during sequence construction. To reduce high-frequency noise while preserving ramps and step changes relevant to control, a short moving average was applied, as defined before Equation (1). Because the pipeline is intended for replication, all transformations and parameter choices were stored as versioned artifacts with deterministic seeds.
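The short-gap imputation rule described above can be illustrated with a minimal sketch. It assumes a single channel held as a Python list with `None` marking missing minutes; the function name `locf_short_gaps` and the `max_gap` argument are hypothetical, and the sketch ignores the operating-mode boundaries that the study respects when carrying values forward.

```python
from typing import List, Optional


def locf_short_gaps(values: List[Optional[float]], max_gap: int = 15) -> List[Optional[float]]:
    """Fill runs of None shorter than max_gap samples with the last observed value.

    Longer runs (and runs with no preceding observation) stay None so that they
    can be masked later during sequence construction.
    """
    out = list(values)
    i, n = 0, len(out)
    while i < n:
        if out[i] is None:
            j = i
            while j < n and out[j] is None:
                j += 1  # j ends at the first observed sample after the gap
            gap = j - i
            if i > 0 and out[i - 1] is not None and gap < max_gap:
                for k in range(i, j):
                    out[k] = out[i - 1]  # last observation carried forward
            i = j
        else:
            i += 1
    return out


filled = locf_short_gaps([21.4, None, None, 21.9], max_gap=15)
```

With 1 min sampling, `max_gap=15` corresponds to the 15 min limit stated in the text; a gap of exactly 15 samples or more is left untouched.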

Because these transformations and parameter choices are central to reproducibility, Appendix A Table A1 consolidates the exact settings used for smoothing and imputation surrounding Equations (1)–(7), including the moving-average window $W$, the EWMA factor $λ$, channel scope, and the short-gap interpolation threshold, so that the moving-average smoothing defined in Equation (1) can be replicated without ambiguity.

The moving-average smoothing used in the first cleaning stage is defined in Equation (1). Before presenting the formula, it is noted that this operator reduces noise variance while retaining the signal level at minute-scale resolution, which is appropriate for thermal processes in chilled-water plants [].

{\tilde{x}}_{t} = \frac{1}{W} \sum_{i = 0}^{W - 1} x_{t - i}

(1)

In Equation (1), ${\tilde{x}}_{t}$ denotes the smoothed value at time index $t$, $x_{t}$ is the raw measurement, and $W$ is the window length in samples (here, $W = 5$, i.e., 5 min at the 1 min sampling rate). This operator was applied channel-wise after outlier removal. In practice, it reduces spurious spikes in temperature and power channels without materially shifting phase, and it stabilizes derived features such as $Δ T$ and COP used later in training and evaluation.
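For concreteness, a minimal sketch of the trailing moving average in Equation (1) follows. The handling of the first $W - 1$ samples (averaging over whatever history is available) is an assumption, since the warm-up convention is not stated here; the function name `moving_average` is illustrative.

```python
def moving_average(x, window=5):
    """Trailing moving average per Equation (1): mean of x[t-W+1 .. t].

    Assumption: during warm-up (t < W - 1) the mean is taken over the
    samples available so far rather than being left undefined.
    """
    smoothed = []
    for t in range(len(x)):
        start = max(0, t - window + 1)
        segment = x[start:t + 1]
        smoothed.append(sum(segment) / len(segment))
    return smoothed


supply_temp = [6.0, 6.1, 9.5, 6.0, 6.2, 6.1]  # one spurious spike at index 2
supply_temp_smooth = moving_average(supply_temp, window=5)
```

Because the window is causal (it only looks backward), the filter can run online on streaming BMS data, at the cost of a lag of roughly $(W - 1)/2$ samples.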

To enable stable model training, channel standardization was subsequently applied within the training split as described in Equation (2). This ensures that all features contribute at commensurate scales to the learning objective.

{\hat{x}}_{t, j} = \frac{x_{t, j} - μ_{j}}{σ_{j}}

(2)

In Equation (2), $x_{t, j}$ is the cleaned value of feature $j$ at time $t$, $μ_{j}$ and $σ_{j}$ are the mean and standard deviation of feature $j$ computed from the training split only, and ${\hat{x}}_{t, j}$ is the standardized feature. Standardization prevents scale dominance by high-magnitude channels (e.g., kW) and accelerates convergence of gradient-based optimization, thereby directly supporting objective (i).
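The train-only fitting in Equation (2) can be sketched as below. The helper names `fit_standardizer` and `standardize` are hypothetical, the population standard deviation is an assumption (the estimator is not specified here), and constant channels are given a unit scale to avoid division by zero.

```python
import statistics


def fit_standardizer(train_rows):
    """Estimate per-feature mean and standard deviation from training rows only."""
    columns = list(zip(*train_rows))
    mu = [statistics.fmean(col) for col in columns]
    # Assumption: population std; a zero std (constant channel) falls back to 1.0
    sigma = [statistics.pstdev(col) or 1.0 for col in columns]
    return mu, sigma


def standardize(rows, mu, sigma):
    """Apply Equation (2) with frozen training statistics."""
    return [[(v - m) / s for v, m, s in zip(row, mu, sigma)] for row in rows]


train = [[1.0, 10.0], [3.0, 30.0]]   # rows = time steps, columns = features
test = [[5.0, 20.0]]
mu, sigma = fit_standardizer(train)
test_std = standardize(test, mu, sigma)
```

Fitting on the training split and reusing the frozen `mu` and `sigma` on validation and test data is what prevents information from later periods leaking into the features.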

2.4. Feature Engineering and Operating-Mode Segmentation

Thermal and mechanical features were engineered to represent component health under varying loads. For chilled-water circuits, temperature difference across the cooling coil was computed per Equation (3) to capture heat transfer performance. For pumps and fans, vibration statistics summarize bearing condition, and for the whole plant, COP encapsulates thermodynamic efficiency as in Equation (5) [].

Δ T_{coil} = T_{return} - T_{supply}

(3)

In Equation (3), $T_{return}$ and $T_{supply}$ are coil return and supply temperatures from the BMS. The feature $Δ T_{coil}$ reflects instantaneous heat extraction capability conditional on airflow and valve position. During deterioration (e.g., fouling or refrigerant undercharge), $Δ T_{coil}$ trends downward at comparable load, which provides an early-warning signal usable by the LSTM.

Load on the chilled-water circuit was approximated using the standard energy balance in Equation (4), which provides a physically grounded proxy for cooling capacity [].

{\dot{Q}}_{cool} = ρ c_{p} \dot{V} Δ T_{chw}

(4)

In Equation (4), ${\dot{Q}}_{cool}$ is the cooling rate (kW), $ρ$ is water density ($\mathrm{kg/m^{3}}$), $c_{p}$ is specific heat ($\mathrm{kJ/(kg \cdot K)}$), $\dot{V}$ is volumetric flow rate ($\mathrm{m^{3}/s}$) estimated from differential-pressure measurements and pump curves, and $Δ T_{chw}$ is the chilled-water temperature rise across the load. This thermal proxy, paired with measured electric power, enables computation of COP, thereby linking component health to system-level energy [].

\begin{aligned} COP & = \frac{{\dot{Q}}_{cool}}{P_{el}} \\ kW / ton & = \frac{P_{el}}{{\dot{Q}}_{cool} / 3.517} \end{aligned}

(5)

In Equation (5), $P_{el}$ is the measured electrical power of the chiller and auxiliaries, and ${\dot{Q}}_{cool}$ is as defined above; the constant 3.517 converts kW of cooling to refrigeration tons. COP and $kW / ton$ were computed at 1 min resolution and subsequently aggregated to operating-mode segments. These metrics are central to objective (iv), as they permit translation of predicted maintenance timing into energy impacts.
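A short worked example of Equations (4) and (5) is given below. The water properties (`RHO`, `CP`) and the operating point are assumed illustrative values, not measurements from the study.

```python
RHO = 1000.0      # water density, kg/m^3 (assumed nominal value)
CP = 4.186        # specific heat of water, kJ/(kg*K) (assumed nominal value)
KW_PER_TON = 3.517  # kW of cooling per refrigeration ton


def cooling_rate_kw(flow_m3_s, delta_t_k):
    """Equation (4): Q_cool = rho * c_p * V_dot * dT, in kW."""
    return RHO * CP * flow_m3_s * delta_t_k


def cop(q_cool_kw, p_el_kw):
    """Equation (5), first line: dimensionless coefficient of performance."""
    return q_cool_kw / p_el_kw


def kw_per_ton(q_cool_kw, p_el_kw):
    """Equation (5), second line: electrical kW per ton of cooling."""
    return p_el_kw / (q_cool_kw / KW_PER_TON)


# Hypothetical operating point: 0.05 m^3/s of chilled water with a 5 K rise,
# drawing 200 kW of electrical power.
q = cooling_rate_kw(0.05, 5.0)
print(round(q, 1), round(cop(q, 200.0), 2), round(kw_per_ton(q, 200.0), 3))
```

The two metrics carry the same information, since $kW / ton = 3.517 / COP$; reporting both simply matches the conventions used by controls engineers and by chiller rating standards.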

For rotating equipment, broadband vibration levels were summarized using root-mean-square acceleration as in Equation (6). Where accelerometer channels existed at both inboard and outboard bearings, the maximum of the two RMS values was used to increase sensitivity to localized degradation.

a_{RMS} = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} a_{i}^{2}}

(6)

In Equation (6), $a_{i}$ is the $i$-th acceleration sample within a 1 min window and $N$ is the number of samples in that window. The statistic $a_{RMS}$ acts as a health indicator for bearings and motor mounts. It directly informs RUL learning by providing a slowly drifting signal prior to failure.
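As an illustration of Equation (6) and of the max-of-two-bearings rule described before it, the following sketch computes the indicator for one window. The function names and sample values are hypothetical.

```python
import math


def rms(samples):
    """Equation (6): root-mean-square of the acceleration samples in one window."""
    return math.sqrt(sum(a * a for a in samples) / len(samples))


def bearing_indicator(inboard, outboard):
    """Use the larger of the inboard and outboard RMS values for the window."""
    return max(rms(inboard), rms(outboard))


inboard_window = [0.10, -0.12, 0.11, -0.09]    # hypothetical accelerations
outboard_window = [0.30, -0.28, 0.31, -0.29]   # hypothetical, noisier bearing
indicator = bearing_indicator(inboard_window, outboard_window)
```

Taking the maximum rather than the mean keeps a fault at one bearing from being diluted by a healthy neighbor.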

To preserve trends while further damping residual noise from turbulent flow and switching, an exponentially weighted moving average (EWMA) was applied to selected channels per Equation (7). EWMA is particularly effective for early drift detection [].

x_{t}^{EWMA} = λ x_{t} + (1 - λ) x_{t - 1}^{EWMA}

(7)

In Equation (7), $x_{t}^{EWMA}$ is the filtered value at time $t$, $x_{t}$ is the raw or pre-smoothed input, and $λ \in (0, 1]$ is the smoothing factor (here, $λ = 0.2$). EWMA was applied to COP, $Δ T$, and $a_{RMS}$, improving signal-to-noise ratio for downstream sequence modeling without erasing gradual degradation.
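The recursion in Equation (7) is sketched below. Initializing the filter with the first observation is an assumption, as the starting value is not specified here.

```python
def ewma(x, lam=0.2):
    """Equation (7): x_ewma[t] = lam * x[t] + (1 - lam) * x_ewma[t-1].

    Assumption: the filter is initialized with the first observation.
    """
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must lie in (0, 1]")
    filtered = []
    for value in x:
        if not filtered:
            filtered.append(value)
        else:
            filtered.append(lam * value + (1.0 - lam) * filtered[-1])
    return filtered


step_response = ewma([0.0, 10.0, 10.0], lam=0.2)
```

With $λ = 0.2$ the filter closes 20% of the gap to the new observation at each step, so a step change is tracked gradually while minute-to-minute noise is attenuated.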

Because building systems operate in distinct regimes, the data were segmented into operating modes derived from clustering over latent features including load proxies, valve positions, and ambient conditions. The within-mode homogeneity enhances the stationarity assumptions required by sequence learning. The clustering minimized the k-means objective in Equation (8).

J = \sum_{t = 1}^{T} {‖z_{t} - μ_{c (t)}‖}_{2}^{2}

(8)

In Equation (8), $z_{t}$ is the feature vector at time $t$ (including ${\dot{Q}}_{cool}$, valve positions, outside-air variables), $μ_{c (t)}$ is the centroid of cluster $c (t)$ assigned to time $t$, and $T$ is the number of samples. Clusters corresponding to non-cooling hours or transient start-up phases were excluded from model training to reduce label noise; the remaining modes were used to stratify splits and calibrate thresholds for maintenance triggers.

These remaining modes were also validated for robustness by computing the silhouette coefficient and the Davies–Bouldin index under season-stratified splits and multiple random initializations of the k-means solver, confirming that a four-mode partition maintained stable separation and centroids across seasonal regimes. This validation supports the use of the operating modes to stratify evaluation and to calibrate thresholds for maintenance triggers.
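To make the objective in Equation (8) concrete, the sketch below runs plain Lloyd iterations on a toy data set and returns the value of $J$. It is illustrative only: the initial centroids are supplied by hand, the toy points stand in for the real feature vectors $z_{t}$, and the silhouette and Davies–Bouldin checks used for validation are not reproduced.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def assign(points, centroids):
    """Assign each point to the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points]


def kmeans(points, init_centroids, n_iter=20):
    """Lloyd's algorithm; returns labels, centroids, and the objective J of Eq. (8)."""
    centroids = [list(c) for c in init_centroids]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    labels = assign(points, centroids)
    objective = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, objective


# Toy example with two obvious regimes (e.g., low load vs. high load)
z = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids, J = kmeans(z, init_centroids=[(0.0, 0.0), (10.0, 10.0)])
```

In practice the features would be standardized before clustering so that no single channel dominates the Euclidean distance, and several random initializations would be compared, as described above.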

Building on these operating modes, we assessed the statistical justification of the engineered features in three steps that are consistent with the learning and calibration design in Sections 2.7 and 2.8: (i) redundancy diagnostics using Pearson correlation matrices and variance inflation factors (VIF) computed on training data only to control leakage; (ii) predictive relevance via permutation importance and drop-one (“leave-one-feature-out”) ablations, with effects measured as changes in weighted MAE under the Huber loss and assessed by 24 h block bootstrap to respect temporal dependence; and (iii) sequence-model explainability using path-integrated gradient attributions aggregated over time steps with respect to the output (RUL) and summarized across operating modes to inspect stability. All diagnostics were stratified by component class and mode to align with the RUL labeling that follows.

The attributions were computed with integrated gradients using a zero feature baseline and 50 path steps; per-feature contributions were averaged over the 720 min look-back window and over prediction episodes, normalized to sum to one per prediction, and then aggregated by operating mode and component class to yield stable per-feature drivers of RUL, which we compare against the failure time definition below.

2.5. RUL Labeling and Target Construction

For each monitored component $m$, a final failure time $t_{f}^{(m)}$ was defined as the timestamp when the asset was taken out of service for corrective repair and subsequently passed post-maintenance verification. The RUL target at time $t$ was then computed as a nonnegative countdown as in Equation (9). This label preserves continuity of degradation and provides a regression target directly aligned with scheduling needs [].

{RUL}_{t}^{(m)} = \max (0, t_{f}^{(m)} - t)

(9)

In Equation (9), ${RUL}_{t}^{(m)}$ is expressed in hours; $t$ and $t_{f}^{(m)}$ are in the same time units. For assets without a failure during the observation window, right-censoring was handled by masking the loss beyond the last observed time: the Huber loss was evaluated only on uncensored time steps, so each censored sequence contributed its prefix windows but incurred no penalty beyond the last observed time. This convention can mildly bias RUL upward when censoring is heavy, which we mitigated via conservative lower-quantile calibration (Section 2.8). The strategy keeps partial sequences informative for representation learning and avoids introducing bias toward assets with frequent failures.

Because this strategy relies on work-order timestamps, each candidate failure anchor was validated against technician notes and post-service restart confirmations already referenced in Section 2.2; interventions documented as inspections or prophylactic checks without corrective action were not used as failure anchors and instead contributed right-censored sequences within the same asset’s history. To accommodate the possibility that the recorded repair time postdates the onset of critical degradation, the log timestamp was treated as a conservative upper bound, and residual onset mismatch was mitigated by training with a robust Huber loss, calibrating to lower quantiles for conservative lead times (Section 2.8), and converting forecasts to sustained triggers to suppress transient anomalies. These steps bound the effect of inevitable label noise while preserving traceability from logs to targets.
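The labeling rule of Equation (9), together with one possible reading of the censoring convention, can be sketched as follows. The function name `rul_labels` is hypothetical, and treating a censored asset as contributing features but no supervised targets (mask set to `False`) is an assumption about how the masking is realized.

```python
def rul_labels(times_h, failure_time_h=None):
    """Return (labels, mask) for one component.

    labels[t] = max(0, t_f - t) in hours, per Equation (9).
    mask[t] is True where the label may enter the loss. For a right-censored
    asset (no validated failure anchor) the true RUL is unknown, so this sketch
    emits placeholder labels and masks every step out of the loss.
    """
    if failure_time_h is None:
        return [0.0] * len(times_h), [False] * len(times_h)
    labels = [max(0.0, failure_time_h - t) for t in times_h]
    return labels, [True] * len(times_h)


# Hypothetical pump with a validated bearing failure at hour 20
labels, mask = rul_labels([0.0, 10.0, 20.0, 30.0], failure_time_h=20.0)

# Hypothetical asset with inspections only (right-censored)
cens_labels, cens_mask = rul_labels([0.0, 10.0], failure_time_h=None)
```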

2.6. Sequence Construction, Splits, and Leakage Control

Sequences were constructed by sliding a fixed-length window over standardized features with stride equal to the native sampling period. The look-back length was set to 720 min to cover multiple control cycles and capture diurnal patterns; look-ahead labeling used the contemporaneous RUL per Equation (9). To prevent temporal leakage, all splits were made chronologically and by component instance: the first 60% of time for each component formed the training set, the subsequent 20% the validation set for model selection, and the final 20% the held-out test set for reporting. Operating-mode stratification ensured that each split contained representative distributions of load and ambient conditions.

All normalization statistics (means and standard deviations in Equation (2)) and other data-driven preprocessing parameters were estimated exclusively from the training data within each chronological split and then frozen for application to the corresponding validation and test segments. No class-balancing operations (oversampling, undersampling, or class weighting) were used during model training.
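The chronological 60/20/20 split and the sliding-window construction can be sketched as below. The helper names are hypothetical, and the sketch omits the gap masking and operating-mode stratification described in the text; it assumes windows are built inside each split so that no window straddles a split boundary.

```python
def chronological_split(n_samples, train_pct=60, val_pct=20):
    """Split indices 0..n-1 in time order into train, validation, and test ranges."""
    train_end = n_samples * train_pct // 100
    val_end = n_samples * (train_pct + val_pct) // 100
    return (range(0, train_end),
            range(train_end, val_end),
            range(val_end, n_samples))


def make_windows(features, rul, lookback=720):
    """Slide a window of `lookback` samples with stride 1.

    Each window is paired with the contemporaneous RUL at its last time step.
    """
    windows, targets = [], []
    for end in range(lookback, len(features) + 1):
        windows.append(features[end - lookback:end])
        targets.append(rul[end - 1])
    return windows, targets


train_idx, val_idx, test_idx = chronological_split(100)
X, y = make_windows(list(range(5)), [50, 40, 30, 20, 10], lookback=3)
```

Splitting by time before windowing matters: with a 720 min look-back and a stride of one sample, adjacent windows share almost all of their content, so a random split would place near-duplicates in both training and test sets.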

2.7. LSTM Architecture and Learning

The LSTM cell equations are provided in Equation (10) to make the learning dynamics explicit. The network ingests a multivariate sequence and emits a scalar RUL estimate for each time step via a linear head. Two stacked LSTM layers with hidden sizes of 128 and 64 were used, followed by dropout and a dense projection. Layer normalization was applied to stabilize training [].

\begin{aligned} i_{t} & = σ (W_{i} x_{t} + U_{i} h_{t - 1} + b_{i}) \\ f_{t} & = σ (W_{f} x_{t} + U_{f} h_{t - 1} + b_{f}) \\ o_{t} & = σ (W_{o} x_{t} + U_{o} h_{t - 1} + b_{o}) \\ {\tilde{c}}_{t} & = \tanh (W_{c} x_{t} + U_{c} h_{t - 1} + b_{c}) \\ c_{t} & = f_{t} ⊙ c_{t - 1} + i_{t} ⊙ {\tilde{c}}_{t} \\ h_{t} & = o_{t} ⊙ \tanh (c_{t}) \end{aligned}

(10)

In Equation (10), $x_{t} \in ℝ^{d}$ is the input feature vector at time $t$, $h_{t} \in ℝ^{h}$ is the hidden state, $c_{t}$ is the cell state, and $i_{t}$, $f_{t}$, and $o_{t}$ are the input, forget, and output gates; $σ (\cdot)$ denotes the logistic function and $⊙$ is elementwise multiplication. Weight matrices $W_{\cdot}$ and $U_{\cdot}$ and biases $b_{\cdot}$ are learned. This recurrent structure encodes long- and short-range temporal dependencies critical for modeling gradual degradation superimposed on control cycles, satisfying objective (i) and enabling robust answers to RQ1.
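For readers who prefer code to notation, a single time step of Equation (10) can be written out directly. This is a didactic sketch using plain Python lists, not the training implementation: layer normalization, dropout, stacking, and the linear head are omitted, and the `params` dictionary layout is an assumption.

```python
import math


def sigmoid(v):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, x, U, h, b):
    """Compute W x + U h + b for list-of-rows matrices and list vectors."""
    return [sum(w * xi for w, xi in zip(w_row, x))
            + sum(u * hi for u, hi in zip(u_row, h))
            + bias
            for w_row, u_row, bias in zip(W, U, b)]


def lstm_step(x, h_prev, c_prev, params):
    """One step of Equation (10); params maps 'i', 'f', 'o', 'c' to (W, U, b)."""
    def gate(name, activation):
        W, U, b = params[name]
        return [activation(v) for v in affine(W, x, U, h_prev, b)]

    i = gate("i", sigmoid)          # input gate
    f = gate("f", sigmoid)          # forget gate
    o = gate("o", sigmoid)          # output gate
    c_tilde = gate("c", math.tanh)  # candidate cell state
    c = [ft * cp + it * ct for ft, cp, it, ct in zip(f, c_prev, i, c_tilde)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


# Sanity check with d = 2 inputs, one hidden unit, and all-zero parameters
zero = ([[0.0, 0.0]], [[0.0]], [0.0])
params = {name: zero for name in ("i", "f", "o", "c")}
h, c = lstm_step([1.0, 2.0], h_prev=[0.3], c_prev=[2.0], params=params)
```

With all parameters at zero every gate equals 0.5 and the candidate state is 0, so the cell state is simply halved. This makes the role of the forget gate as a decay factor on $c_{t - 1}$ easy to see.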

The regression head maps the final hidden representation to a scalar RUL via a linear layer. The network was trained to minimize a robust loss to mitigate the influence of occasional mis-labels or abrupt maintenance actions. The Huber loss in Equation (11) provided a balance between $\ell_{1}$ and $\ell_{2}$ behavior [,].

\begin{aligned} δ_{t} & = {\hat{r}}_{t} - r_{t} \\ L (δ_{t}) & = \begin{cases} \frac{1}{2} δ_{t}^{2}, & \text{if } |δ_{t}| \leq κ, \\ κ (|δ_{t}| - \frac{1}{2} κ), & \text{otherwise} \end{cases} \end{aligned}

(11)

In Equation (11), $r_{t}$ is the true RUL at time $t$ from Equation (9), ${\hat{r}}_{t}$ is the model’s prediction, $δ_{t}$ is the residual, and $κ$ is a tunable threshold (set to 6 h after calibration on validation data). This threshold was selected via a validation grid search over {3, 6, 9, 12} h that minimized the expected mis-scheduling penalty defined for decision-making (see sensitivity analyses), with performance remaining stable for values in the 5–8 h range. The Huber formulation maintains quadratic sensitivity near zero for precise fitting while limiting sensitivity to large deviations, which occur around unplanned shutdowns or immediate repairs.
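A direct transcription of Equation (11) with $κ = 6$ h, combined with a loss mask for censored steps, is sketched below. The function names are illustrative, and averaging over unmasked steps is an assumption about how per-step losses are reduced.

```python
def huber(residual, kappa=6.0):
    """Equation (11): quadratic for |residual| <= kappa, linear beyond."""
    magnitude = abs(residual)
    if magnitude <= kappa:
        return 0.5 * residual * residual
    return kappa * (magnitude - 0.5 * kappa)


def masked_huber_mean(pred, true, mask, kappa=6.0):
    """Mean Huber loss over the time steps whose mask is True."""
    losses = [huber(p - t, kappa) for p, t, m in zip(pred, true, mask) if m]
    return sum(losses) / len(losses) if losses else 0.0


# Residuals in hours: a small error, the threshold itself, and a large error
small, edge, large = huber(2.0), huber(6.0), huber(-10.0)
loss = masked_huber_mean([10.0, 0.0], [8.0, 100.0], [True, False])
```

A 10 h residual costs 42 under the Huber loss versus 50 under a pure squared-error loss of $\frac{1}{2} δ^{2}$, and the gap widens linearly with the residual, which is what limits the pull of mislabeled shutdown events.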

Optimization used Adam with decoupled weight decay (AdamW), an initial learning rate of 10^{- 3}, cosine decay scheduling, and early stopping based on validation RMSE. The exact optimizer and scheduler settings, including the AdamW weight-decay coefficient and the cosine schedule parameters, are provided in Appendix A Table A2 for reproducibility.
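For orientation, a generic cosine decay of the learning rate can be sketched as below. Only the initial rate of 10^{- 3} comes from the text; the function name, the total step count, and the floor `lr_min` are assumptions, since the study's actual schedule parameters are given in Appendix A Table A2 rather than here.

```python
import math

def cosine_lr(step, total_steps, lr_init=1e-3, lr_min=0.0):
    """Cosine-decayed learning rate at a given step (hypothetical schedule shape)."""
    # Clamp progress to [0, 1] so steps past the horizon stay at lr_min.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * progress))
```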

To reduce variance, five models with different seeds were trained and predictions were averaged (model ensembling). Hyperparameters were selected by Bayesian optimization over learning rate, hidden sizes, dropout, and sequence length using the validation split; the search space and seeds are provided in the accompanying configuration file to support replication.

Bayesian optimization was run for 64 trials on the validation split, and the five ensemble members used fixed seeds {11, 19, 27, 35, 43}. Appendix A Table A2 summarizes each tuned hyperparameter’s range (distribution/type) alongside the selected values, and these settings were applied consistently when constructing the parameter-matched baselines.

The parameter-matched baselines (a GRU with 2 layers/128–64 units, a TCN with 8 residual blocks and comparable receptive field, and a lightweight Transformer encoder with 2 heads and embedding size matched to the LSTM hidden state) were configured from the same hyperparameter search space and seeds, and each was trained on identical splits with the Huber loss, isotonic calibration, and the sustained trigger policy.

2.8. Evaluation Metrics, Uncertainty, and Calibration

Predictive accuracy was summarized by MAE and RMSE, defined for completeness in Equation (12). Because this study requires actionable horizons in a live facility, the reporting explicitly links error magnitudes to maintenance lead time. Furthermore, empirical confidence intervals were computed by block bootstrapping to respect temporal dependence.

\begin{array}{rl} MAE & = \frac{1}{N} \sum_{t = 1}^{N} |{\hat{r}}_{t} - r_{t}| \\ RMSE & = \sqrt{\frac{1}{N} \sum_{t = 1}^{N} {({\hat{r}}_{t} - r_{t})}^{2}} \end{array}

(12)

In Equation (12), N is the number of evaluated time steps over all components in the test split. MAE provides a robust notion of typical error, while RMSE emphasizes larger deviations that could affect scheduling safety margins. Both metrics were computed per component class and then aggregated by a weighted average proportional to operating hours.
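The metrics in Equation (12) and the operating-hour weighting can be sketched as follows. The function names are illustrative assumptions, and the example figures in the usage note are placeholders, not study results.

```python
import math

def mae(pred, true):
    """Mean absolute error over paired predictions and targets."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def rmse(pred, true):
    """Root-mean-square error over paired predictions and targets."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def weighted_average(class_values, operating_hours):
    """Aggregate per-class metrics with weights proportional to operating hours."""
    total = sum(operating_hours)
    return sum(v * w for v, w in zip(class_values, operating_hours)) / total
```

For instance, `weighted_average([10.0, 20.0], [1.0, 3.0])` weights the second class three times as heavily as the first and returns 17.5.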

To quantify parameter and data uncertainty, a block bootstrap with a block length of 24 h was performed over test sequences. The 95% confidence interval (CI) for a generic statistic θ was obtained using the normal approximation in Equation (13), which proved adequate given the large number of blocks; percentile CIs were computed as a sensitivity check and yielded comparable bounds.

{CI}_{95} = \hat{θ} \pm 1.96 {\hat{σ}}_{boot}

(13)

In Equation (13), \hat{θ} is the bootstrap mean of the statistic (e.g., MAE), and {\hat{σ}}_{boot} is its bootstrap standard deviation computed over resampled blocks. Reporting CIs is essential for objective (ii), as it conveys the reliability of claimed prediction horizons.
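A simplified sketch of the block bootstrap behind Equation (13) is given below for the mean of a series of absolute errors (that is, for MAE). Non-overlapping blocks, the number of resamples, and the function name are assumptions of this sketch; at 1 min sampling, a 24 h block corresponds to `block_len=1440`.

```python
import random
import statistics

def block_bootstrap_ci(abs_errors, block_len, n_boot=1000, seed=0):
    """Normal-approximation 95% CI for the mean of abs_errors via block resampling."""
    rng = random.Random(seed)
    # Cut the series into contiguous, non-overlapping blocks (a trailing partial block is dropped).
    blocks = [abs_errors[i:i + block_len]
              for i in range(0, len(abs_errors) - block_len + 1, block_len)]
    boot_stats = []
    for _ in range(n_boot):
        # Resample whole blocks with replacement to respect temporal dependence.
        sample = [e for _block in blocks for e in rng.choice(blocks)]
        boot_stats.append(statistics.fmean(sample))
    theta_hat = statistics.fmean(boot_stats)   # bootstrap mean of the statistic
    sigma_boot = statistics.stdev(boot_stats)  # bootstrap standard deviation
    return theta_hat - 1.96 * sigma_boot, theta_hat + 1.96 * sigma_boot
```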

Calibration of point predictions to quantiles was performed to estimate conservative lead times. A monotonic isotonic regression map was fitted from residuals on validation data to convert {\hat{r}}_{t} to a predicted lower quantile {\hat{r}}_{t}^{(q)} corresponding to a chosen risk level q.

2.9. Maintenance Trigger Logic and Downtime Modeling

To connect regression outputs to decisions, a simple, auditable trigger was defined: when predicted RUL falls below a threshold τ for a sustained duration, a maintenance order is issued for the earliest feasible window. The trigger is formalized in Equation (14).

T_{t} = I ({\hat{r}}_{t}^{(q)} \leq τ)

(14)

In Equation (14), T_{t} is a binary indicator that a trigger occurs at time t, {\hat{r}}_{t}^{(q)} is the calibrated lower-quantile RUL, and τ is the policy threshold (calibrated between 6 and 12 days depending on component type). Sustained conditions were enforced by requiring T_{t} to be true on at least 80% of samples over a 12 h window to reduce false positives from transient anomalies. This trigger translates model outputs into actionable work orders and enables explicit computation of lead time and downtime risk windows, directly addressing objective (iii).
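The sustained-trigger rule can be sketched as a rolling count over the most recent samples. The defaults below follow the text (a 12 h window at 1 min sampling is 720 samples, with an 80% requirement); the function name and the choice to return only the first trigger index are assumptions of this sketch.

```python
from collections import deque

def sustained_trigger(rul_q_hours, tau_hours, window=720, min_fraction=0.8):
    """Index of the first sample at which the trailing window satisfies the
    sustained condition on T_t = I(rul_q <= tau), or None if it never does."""
    recent = deque(maxlen=window)  # trailing window of 0/1 indicators
    hits = 0
    for t, rul in enumerate(rul_q_hours):
        if len(recent) == window:
            hits -= recent[0]      # the oldest indicator is about to be evicted
        flag = 1 if rul <= tau_hours else 0
        recent.append(flag)
        hits += flag
        if len(recent) == window and hits >= min_fraction * window:
            return t
    return None
```

A brief dip below τ that lasts only a few samples never fills 80% of the window, so it is ignored, whereas a persistent crossing fires once the window is sufficiently populated.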

Downtime modeling compares observed unplanned outages under the historical schedule to counterfactual outcomes under the PdM trigger. If a failure would have occurred within the lead window [t, t + τ] absent intervention, the triggered policy avoids that outage; otherwise, it may induce an early but planned replacement. A cost model summarizes the trade-offs as in Equation (15).

\begin{array}{rl} C_{total} & = C_{maint} + C_{energy} + C_{down} \\ C_{down} & = c_{down} \times H_{down} \\ C_{energy} & = \sum_{t \in H} p_{t} P_{el, t} Δ t \end{array}

(15)

In Equation (15), C_{maint} includes labor and parts for planned and unplanned actions, C_{energy} is the cost of electric energy during the analysis horizon H with tariff p_{t} and power P_{el, t}, and C_{down} monetizes downtime via unit cost c_{down} per hour and total downtime hours H_{down}. This decomposition permits quantitative evaluation of PdM versus fixed-interval policies under consistent assumptions and provides the foundation for the results reported subsequently.
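The decomposition in Equation (15) reduces to a few lines of arithmetic, sketched below. The function and argument names are assumptions, and the values in the usage note are placeholders except for the 0.18 SAR/kWh tariff used later in the cost analysis.

```python
def total_cost(c_maint, c_down_per_hour, downtime_hours, tariff, power_kw, dt_hours):
    """C_total = C_maint + C_energy + C_down for one scenario.

    tariff and power_kw are per-sample sequences (currency/kWh and kW);
    dt_hours is the sampling interval expressed in hours.
    """
    c_down = c_down_per_hour * downtime_hours
    c_energy = sum(p * P * dt_hours for p, P in zip(tariff, power_kw))
    return c_maint + c_energy + c_down
```

As a usage example, one hour at a constant 100 kW and 0.18 SAR/kWh contributes 18 SAR of energy cost; with a hypothetical 1000 SAR of maintenance and 4 h of downtime at a hypothetical 50 SAR/h, the total is 1218 SAR.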

2.10. Energy Impact Estimation

Energy impacts were estimated by mapping predicted maintenance timing to the evolution of COP. For each predicted trigger, two trajectories were constructed: a counterfactual baseline in which degradation continues until failure and a rectified trajectory in which maintenance restores COP to its post-service baseline. The energy differential was accumulated as in Equation (16).

E_{save} = \sum_{t \in W} \left( \frac{{\dot{Q}}_{cool, t}}{{COP}_{t}^{base}} - \frac{{\dot{Q}}_{cool, t}}{{COP}_{t}^{rect}} \right) Δ t

(16)

In Equation (16), E_{save} is electric energy saved over window W associated with a maintenance action, {\dot{Q}}_{cool, t} is cooling load from Equation (4), {COP}_{t}^{base} is the degraded baseline COP trajectory prior to maintenance, {COP}_{t}^{rect} is the rectified COP after maintenance, and Δ t is the sampling interval. The trajectories were estimated by smoothing observed COP before and after known service events and by transferring typical restoration deltas to predicted triggers of the same component type and operating mode.
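A short sketch of the accumulation described for Equation (16) is given below. The sign is arranged so that a restored (higher) COP yields positive savings, consistent with E_{save} being described as energy saved; the function name and the example numbers in the usage note are assumptions.

```python
def energy_saved_kwh(q_cool_kw, cop_base, cop_rect, dt_hours):
    """Electric energy saved over a window at the same contemporaneous cooling load.

    Each term is (power drawn at the degraded COP) minus (power drawn at the
    rectified COP), multiplied by the sampling interval in hours.
    """
    return sum((q / cb - q / cr) * dt_hours
               for q, cb, cr in zip(q_cool_kw, cop_base, cop_rect))
```

For example, a constant 350 kW cooling load served for one hour at COP 3.5 instead of COP 5.0 draws 100 kW rather than 70 kW, so the hypothetical saving is 30 kWh.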

These trajectories embed weather dependence through Q_{cool} in Equation (4), so the differential E_{save} in Equation (16) is computed at the same contemporaneous load and therefore attributes savings to COP restoration rather than to ambient variation. To corroborate this attribution, we applied the change-point diagnostic in Equation (22) to verify that days surrounding triggers did not exhibit anomalous outside-air conditions beyond the fitted baseline. Aggregating E_{save} across triggers yields annualized energy savings and directly supports objective (iv).

2.11. Baseline Policy and Counterfactual Construction

The benchmark program used a fixed-interval preventive schedule consistent with the building’s historical practice during the pre-study period. Within that fixed-interval program, refrigerant leak inspection (performed in accordance with ISO 5149-4 and using instruments conforming to EN 14624) and superheat/subcool verification (per OEM service procedures and the practice summarized in ASHRAE refrigeration guidance) were executed during scheduled service windows; these periodic checks informed work-order records and labels but were not used as continuous PdM features, complementing the pump and fan inspections.

For pumps and AHU fans, bearings and belts were inspected on a monthly schedule with quarterly replacements as needed; for chillers, seasonal overhauls and annual teardown inspections were standard. The counterfactual under the PdM policy was constructed by shifting maintenance events earlier to the first trigger time that satisfied Equation (14) while respecting labor constraints, modeled as a daily work-order capacity per trade. If a predicted lead window overlapped a historically observed unplanned failure, the outage was assumed avoided; otherwise, the action was considered an advance replacement or inspection. Sensitivity analyses varied τ, the quantile level q, and daily capacity to assess robustness of downtime and energy results.

2.12. Statistical Analysis and Reporting

All metrics were reported with 95% CIs from the block bootstrap in Equation (13). For downtime comparisons, paired differences at the asset-period level were computed between the PdM counterfactual and the baseline; significance was assessed by bootstrap-based p-values obtained as the fraction of resamples in which the PdM counterfactual showed no improvement over the baseline. For energy savings, uncertainty propagated both load variability and COP estimation error by resampling days and applying parametric noise calibrated from post-maintenance COP dispersion. To assess the stability of RUL predictions across operating modes, stratified errors were computed and compared using non-parametric tests appropriate for unequal variances; however, emphasis was placed on effect sizes and CIs rather than point null-hypothesis tests.

To ensure that improvements were not artifacts of information leakage, all data preprocessing parameters (e.g., standardization statistics) were fit on training data only and applied to validation and test sets. Time-ordered splits and component-wise segregation prevented future information from contaminating past predictions. Finally, a hold-out month coinciding with peak summer demand was reserved a priori for qualitative error analysis and stress testing of trigger behavior during heat extremes typical of Riyadh.

2.13. Quality Assurance, Calibration Checks, and Reproducibility

Quality assurance was embedded at multiple levels. Sensor plausibility bounds were enforced, and channels violating physical constraints were flagged for exclusion; for example, Δ T_{coil} outside [0.1, 15] K during cooling modes was considered implausible. Power integrals were reconciled with utility meters to verify kWh accounting within 2% over daily horizons.

Building on the reconciled power integrals, we quantified how sensor uncertainty propagates to engineered features and decisions by combining first-order (delta-method) variance propagation with Monte-Carlo (MC) perturbations at minute resolution. For coil temperature difference, Δ T_{coil} = T_{return} - T_{supply}, the variance is approximated by:

var (Δ T_{coil}) \approx var (T_{return}) + var (T_{supply}) - 2 \, cov (T_{return}, T_{supply})

(17)

For cooling rate, Q_{cool} = ρ c_{p} \, V \, Δ T_{chw}, we used:

var (Q_{cool}) \approx {(ρ c_{p} \, Δ T_{chw})}^{2} var (V) + {(ρ c_{p} \, V)}^{2} var (Δ T_{chw})

(18)

treating ρ and c_{p} as constants over the operating range and omitting the covariance between V and Δ T_{chw}. For efficiency, COP = Q_{cool} / P_{el}, the ratio form yields:

var (COP) \approx \frac{var (Q_{cool})}{P_{el}^{2}} + \frac{Q_{cool}^{2}}{P_{el}^{4}} \, var (P_{el}) - 2 \, \frac{Q_{cool}}{P_{el}^{3}} \, cov (Q_{cool}, P_{el})

(19)

where T_{return} and T_{supply} are BMS temperatures, V is volumetric flow, P_{el} is electrical power, ρ is water density, and c_{p} is specific heat. In parallel, to capture non-linear effects of uncertainty on prognostics, we drew zero-mean perturbations for temperature, flow, and power channels from vendor-stated standard uncertainties and applied them to the cleaned, standardized sequences; the LSTM then produced an ensemble of RUL trajectories, from which we computed the time-varying prediction standard deviation σ_{RUL} (t). This propagated variance was integrated with isotonic quantile calibration by augmenting residual dispersion, so that conservative lower-quantile lead times are preserved under realistic sensor and data noise, and was handed off to the post-maintenance baselines used in the rectified COP trajectory.
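The first-order propagation in Equations (17)–(19) can be written directly as three small functions, sketched below. The function names, the lumped argument `rho_cp` (the product ρ c_{p}), and the default zero covariances are assumptions of this sketch; the Monte-Carlo part of the procedure is not reproduced.

```python
def var_delta_t(var_return, var_supply, cov_rs=0.0):
    """Equation (17): variance of T_return - T_supply."""
    return var_return + var_supply - 2.0 * cov_rs

def var_q_cool(rho_cp, flow, delta_t, var_flow, var_delta_t_chw):
    """Equation (18): variance of rho*c_p*V*dT with rho and c_p treated as constants."""
    return (rho_cp * delta_t) ** 2 * var_flow + (rho_cp * flow) ** 2 * var_delta_t_chw

def var_cop(q_cool, p_el, var_q, var_p, cov_qp=0.0):
    """Equation (19): delta-method variance of the ratio Q_cool / P_el."""
    return (var_q / p_el ** 2
            + q_cool ** 2 / p_el ** 4 * var_p
            - 2.0 * q_cool / p_el ** 3 * cov_qp)
```

Chaining the three calls (temperature variances into `var_q_cool`, then into `var_cop`) gives an analytic COP variance that could be compared against the spread of Monte-Carlo draws.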

Post-maintenance baselines were estimated by averaging COP and vibration features over a 7-day window following service once the plant re-entered typical loads; this baseline informed the rectified trajectory in Equation (16).

Model reproducibility was supported by fixing seeds, versioning code and configurations, and storing trained model artifacts with hashes. The entire pipeline, from raw BMS extracts through evaluation outputs, is replayable via a single orchestrated workflow that logs all intermediate tables and figures. To facilitate external replication, the parameter values for W in Equation (1), λ in Equation (7), the cluster count in Equation (8), sequence length and hidden sizes in Equation (10), and κ in Equation (11) are documented in Appendix A.

2.14. Sensitivity Analyses and Ablation Studies

Sensitivity of outcomes to key design choices was evaluated systematically. The impact of smoothing choices was assessed by replacing EWMA in Equation (7) with a median filter; predictive performance and trigger stability were re-computed to confirm robustness. The sequence length was varied between 360 and 1440 min to probe the memory horizon needed for accurate RUL; results indicated diminishing returns beyond 720 min. The loss function in Equation (11) was replaced with pure l_{1} and l_{2} objectives in ablation experiments, demonstrating that the Huber choice improved both MAE and the stability of maintenance triggers. Finally, the operating-mode clustering in Equation (8) was replaced by rule-based segmentation on outside-air temperature and load percent; the learned representation from the clustering approach consistently yielded better generalization in the hot-arid regime.

2.15. Implementation Details and Compute Environment

All preprocessing, modeling, and evaluation were implemented in Python 3.12 with vectorized numerical libraries. Training was executed on a workstation equipped with a recent-generation GPU (24 GB memory), ensuring that sequence models trained within practical times to facilitate extensive hyperparameter search. To support deployment considerations typical of facilities engineering teams, an inference-only version of the model was exported as a lightweight container that ingests live BMS streams, applies the exact transforms of Equations (1)–(8), and emits calibrated RUL and trigger signals per Equations (11)–(14). This container integrates with the work-order system via a REST endpoint and writes audit logs for traceability.

2.16. Energy Baselining and Schedule-Aware Evaluation

The following formulations support the mapping from prediction accuracy to schedule-aware maintenance windows in the Riyadh context, completing the link from objective (ii) to (iv). The first defines the “lead-time accuracy” used in operational reporting: the fraction of triggers whose calibrated RUL lower-quantile exceeds a minimum planning horizon τ_{\min} as in Equation (20).

LTA = \frac{1}{K} \sum_{k = 1}^{K} I ({\hat{r}}_{t_{k}}^{(q)} \geq τ_{\min})

(20)

where K is the number of triggers in the test period, {\hat{r}}_{t_{k}}^{(q)} is the calibrated RUL at trigger time t_{k}, and τ_{\min} is the minimum lead time required by the maintenance contractor (here, 5 days). High LTA indicates that predictions translate into practicable schedules, thereby bridging model metrics and field realities.
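Equation (20) is a simple proportion, sketched below with the 5-day contractor lead time as the default. The function name and the input values in the usage note are illustrative assumptions.

```python
def lead_time_accuracy(rul_q_at_triggers_days, tau_min_days=5.0):
    """Fraction of triggers whose calibrated lower-quantile RUL is at least tau_min."""
    k = len(rul_q_at_triggers_days)
    return sum(1 for r in rul_q_at_triggers_days if r >= tau_min_days) / k
```

With hypothetical calibrated RUL values of 3, 5, 8, and 12 days at four trigger times, three of the four meet the 5-day horizon, giving an LTA of 0.75.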

The second relates prediction errors to expected mis-scheduling penalty via a simple quadratic penalty function in Equation (21), which is used in sensitivity analyses of κ and τ.

Π = α E [\max {\{0, τ - r^{(q)}\}}^{2}] + β E [\max {\{0, r^{(q)} - τ\}}^{2}]

(21)

where Π is the expected penalty, α weights the cost of insufficient lead (r^{(q)} < τ) that may precipitate unplanned outages, and β weights the cost of overly early actions. This scalar surrogate informs policy tuning to balance risk and efficiency.
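A sample-average version of Equation (21) is sketched below, with expectations replaced by means over a set of calibrated RUL values. The function name and the weights used in the usage note are assumptions; the study's α and β are not stated in this section.

```python
def mis_scheduling_penalty(rul_q, tau, alpha, beta):
    """Empirical quadratic penalty: alpha on shortfalls below tau, beta on excess above tau."""
    n = len(rul_q)
    shortfall = sum(max(0.0, tau - r) ** 2 for r in rul_q) / n  # insufficient lead
    excess = sum(max(0.0, r - tau) ** 2 for r in rul_q) / n     # overly early action
    return alpha * shortfall + beta * excess
```

For hypothetical values of 8, 10, and 13 days against τ = 10 days with α = 2 and β = 1, the shortfall term contributes 2 × 4/3 and the excess term 9/3, for a penalty of about 5.67.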

Finally, to connect thermal physics to schedule decisions in the hot-arid climate, a change-point model for daily cooling energy E_{d} was used in diagnostics as in Equation (22). This model captures how energy scales with outside-air temperature and provides a baseline for interpreting E_{save} from Equation (16).

E_{d} = γ_{0} + γ_{1} \max (0, T_{d}^{OA} - T_{bal})

(22)

where E_{d} is daily chiller energy (kWh), T_{d}^{OA} is daily mean outside-air temperature, T_{bal} is the balance-point temperature separating base loads from cooling-driven loads, and γ_{0}, γ_{1} are fitted coefficients. This simple diagnostic supports attribution of observed savings to improved COP rather than to weather anomalies during the evaluation period.
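One possible way to fit the change-point form of Equation (22) is a grid search over candidate balance points with ordinary least squares for γ_{0} and γ_{1} at each candidate. The text does not state how the model was fitted, so this procedure, the function name, and the grid are assumptions.

```python
def fit_change_point(t_oa, energy, t_bal_grid):
    """Return (gamma0, gamma1, t_bal) minimizing squared error of
    E_d = gamma0 + gamma1 * max(0, T_oa - t_bal) over the candidate grid."""
    n = len(energy)
    y_mean = sum(energy) / n
    best = None
    for t_bal in t_bal_grid:
        x = [max(0.0, t - t_bal) for t in t_oa]  # cooling-driven temperature excess
        x_mean = sum(x) / n
        sxx = sum((xi - x_mean) ** 2 for xi in x)
        if sxx == 0.0:
            continue  # no day exceeds this balance point; slope is undefined
        g1 = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, energy)) / sxx
        g0 = y_mean - g1 * x_mean
        sse = sum((yi - g0 - g1 * xi) ** 2 for xi, yi in zip(x, energy))
        if best is None or sse < best[0]:
            best = (sse, g0, g1, t_bal)
    if best is None:
        raise ValueError("no candidate balance point produced a usable fit")
    return best[1], best[2], best[3]
```

Days near a trigger whose energy departs strongly from the fitted line would then flag a weather anomaly rather than a COP effect.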

3. Results

3.1. Data Integrity, Coverage, and Calibration Outcomes

Continuous acquisition yielded two full years of 1 min BMS telemetry from the Riyadh office tower’s central plant and AHU network, producing 1,051,200 timestamped rows per channel across 145 channels. Aggregation of the quality ledger showed that 97.6% of minutes were fully populated after cleaning and synchronization, with residual missingness masked at sequence construction. Four quarterly service windows provided reference spot-checks against handheld instruments; reconciled power integrals matched the utility interval meters within ±1.8% at daily resolution, confirming consistent kWh accounting. Hard faults (disconnections, encoder rollovers) were automatically excluded, while soft anomalies were attenuated by the smoothing stages specified in Equations (1) and (7) of the methodology. To document quantitative coverage, the dataset characteristics are presented in Table 1.

Table 1. Dataset composition, data quality, and calibration outcomes across monitored subsystems over the two-year observation horizon.

The table establishes that the cleaned dataset retained the granularity required for sequence modeling while preserving meter-level energy balance. Coverage exceeded 97% in all subsystems, and reconciliation errors centered well below the 2% threshold adopted a priori. These conditions support downstream analyses without undue bias from imputation. They also bounded the effect of propagated sensor uncertainty: applying the procedure described in Section 2.13 showed that uncertainty-induced variability in Δ T_{coil}, Q_{cool}, and COP remained small relative to the reported block-bootstrap confidence intervals for MAE/RMSE and did not change the conclusions drawn subsequently, thereby supporting the stability of the scheduling and energy findings.

3.2. Feature Behavior and Operating-Mode Structure

Feature engineering generated thermal and mechanical indicators that reflect equipment health and system efficiency. To characterize these indicators and the operating regimes used for stratification, the multivariate behavior of key engineered features and the resulting mode clusters are presented in Figure 1 across four analytical panels. Figure 1a displays the empirical distributions of Δ T_{coil} across AHUs during cooling operation, highlighting between-asset variability and load dependence. Figure 1b presents COP trajectories for the chilled-water plant aggregated by week, with EWMA smoothing applied to emphasize sustained shifts. Figure 1c shows pre-failure trajectories of a_{RMS} at pump and fan bearings, time-aligned to failure events (t = 0) to reveal acceleration patterns. Figure 1d depicts the operating-mode partition derived from clustering the latent feature vector z_{t}, with modes mapped against outside-air conditions and plant load to demonstrate climatological and operational separability.

Figure 1. Engineered feature behavior and operating-mode structure. (a) Δ T_{coil} distributions across AHUs under cooling, stratified by valve position deciles. (b) Plant COP trajectories by week with EWMA smoothing and annotated maintenance intervals. (c) Time-aligned a_{RMS} at pump and fan bearings over the 21 days preceding failures. (d) Operating-mode clusters in the (T^{OA}, {\dot{Q}}_{cool}) plane with cluster centroids and occupancy share.

The feature distributions in panel a exhibit right-skewed Δ T_{coil} with medians between 5.2 K and 7.0 K across AHUs at mid-valve positions, while low Δ T_{coil} tails coincide with low-load or incipient degradation periods. Panel b shows COP fluctuations dominated by seasonal regimes; sustained downward drifts of 0.2–0.4 in COP over multiweek intervals precede several maintenance events, consistent with gradual performance loss. Panel c reveals monotonic growth in a_{RMS} as failures approach; median pre-failure acceleration rises by 38–55% relative to baseline at t = −21 days, supporting its role as an interpretable mechanical health indicator. Panel d partitions operation into four stable modes: low-load night setback, shoulder-load daytime, high-load peak cooling, and transient start/stop. The clustering proportionally assigns 17%, 39%, 36%, and 8% of hours to these modes, respectively, forming the basis for stratified evaluation.

Using this mode structure as the evaluation basis, redundancy diagnostics on the training split yielded a median absolute pairwise correlation of 0.41 (95th percentile 0.76) across features and a maximum VIF of 3.9, indicating acceptable multicollinearity for stable estimation. Drop-one ablations showed the largest MAE degradations (Δ hours, mean [95% CI]) for Δ T_{coil} (+4.1 [3.2, 5.0]), a_{RMS} (+3.5 [2.7, 4.2]), and COP (+2.8 [2.0, 3.6]); load proxies (Q_{cool}), valve position, and outside-air temperature produced smaller yet non-zero effects (+1.6 [0.9, 2.3], +1.3 [0.7, 2.0], and +1.1 [0.5, 1.8], respectively). Permutation importances (normalized to Δ T_{coil} = 1.00) ranked Δ T_{coil} (1.00), a_{RMS} (0.86), COP (0.74), Q_{cool} (0.49), valve position (0.38), and outside-air temperature (0.29), with rank-order stability across modes (Spearman ρ \geq 0.82).

Attribution summaries were directionally consistent with physics: lower Δ T_{coil} and lower COP carried negative contributions to RUL (shorter predicted life), while higher a_{RMS} also decreased RUL, and these effects were monotone over the central quantiles; component-specific profiles emphasized a_{RMS} for fans/pumps and Δ T_{coil}–COP for chillers. These attribution summaries are visualized in Appendix A Figure A1, which reports integrated-gradient contributions by feature, component class, and operating mode, and are tabulated (permutation-importance, normalized to Δ T_{coil} = 1.00) in Appendix A Table A3; the dominant drivers (Δ T_{coil}, a_{RMS}, and COP) are consistent across modes and components and align with the subsequent lead-time accuracy analysis, which quantifies the impact of feature perturbations. As a procedural check, random feature permutation reduced lead-time accuracy at τ = 10 days by 0.12 on average, confirming that the engineered set materially supports RUL target learning.

3.3. RUL Target Availability and Censoring Characteristics

Label construction yielded time-to-failure targets for components with observed outages and right-censored sequences for assets without failures in the observation window. Failure counts, label horizons, and censoring rates are summarized in Table 2 to describe target availability for supervised learning.

Table 2. Failure events, RUL label horizons, and censoring by component class.

The table shows adequate label density across classes for supervised regression with an emphasis on pumps and fans. Median horizons spanning 16–24 days ensure that sufficient pre-failure history exists for RUL learning. Right-censoring affects 10 assets but contributes additional healthy operation for representation learning under masked losses.

3.4. Sequence Inventory and Split Verification

Sequence construction used 720 min look-back windows and unit stride over the cleaned series. Chronological, component-wise splits produced distinct training, validation, and test inventories. The resulting counts are presented in Table 3, which also records the proportion of hours by operating mode within each split to verify distributional comparability and leakage control.
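The windowing described above (720 min look-back, unit stride) can be sketched with a small generator. The function name is an assumption, and masking of residual missing values and the component-wise split logic are omitted.

```python
def lookback_windows(series, length=720, stride=1):
    """Yield consecutive look-back windows of `length` samples, advancing by `stride`."""
    for end in range(length, len(series) + 1, stride):
        yield series[end - length:end]
```

A cleaned series of n samples therefore yields n - 719 windows at unit stride, which is why the sequence inventory is close to the number of minutes in each split.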

Table 3. Sequence counts and operating-mode shares by split and component class.

Comparable mode shares across splits confirm that the chronological strategy did not induce mode imbalance in the test horizon. The inventory size supports training and calibrated evaluation without temporal leakage.

3.5. Model Fitting Dynamics, Residual Structure, and Calibration

Convergence behavior, residual distributions, and calibration quality determine whether the sequence model generates reliable RUL estimates. The findings are summarized in Figure 2 across three analytical panels. Figure 2a reports training and validation losses over epochs for the five-seed ensemble, with shaded bands representing inter-seed variation. Figure 2b shows standardized residual distributions by component class on the held-out test set to examine symmetry and tail weight. Figure 2c presents reliability diagrams comparing predicted quantiles {\hat{r}}^{(q)} with empirical coverage to evaluate calibration.

Figure 2. Fitting dynamics, residuals, and calibration. (a) Huber loss over epochs for training and validation across five seeds; early stopping points indicated. (b) Standardized residuals on the test set, faceted by component class. (c) Reliability diagrams for q ∈ {0.1, 0.25, 0.5} with isotonic calibration curves and ideal 1:1 lines.

Panel a shows monotonic loss reduction with convergence by epoch 28–41 depending on seed, and with validation loss flattening before training loss, consistent with effective early stopping. Panel b indicates near-symmetric residuals centered close to zero; chillers exhibit slightly heavier positive tails relative to pumps and fans, consistent with occasional later-than-observed failure forecasts during abrupt compressor faults. Panel c shows post-isotonic curves closely tracking the diagonal, with mean absolute calibration error below 0.06 across quantiles, demonstrating adequate coverage for risk-aware trigger computation.

Quantitative accuracy metrics are consolidated in Table 4 to provide precise error magnitudes with confidence intervals.

Table 4. RUL prediction accuracy on the held-out test set. Values are mean ± 95% CI from block bootstrap (24 h blocks).

Errors remain well below typical planning horizons, and coverage errors at q = 0.25 stay within ±0.06 across classes, supporting the use of calibrated lower quantiles for conservative scheduling.

3.6. Actionable Horizons and Lead-Time Characteristics

Maintenance planning requires a quantification of how often calibrated forecasts exceed minimum lead thresholds. The analysis uses the lead-time accuracy metric defined by Equation (20) and stratifies results by policy threshold τ and component class. The outcomes are shown in Figure 3 across two panels. Figure 3a presents LTA as a function of τ for q = 0.25 to reflect conservative use. Figure 3b decomposes LTA at τ \in \{5, 7, 10, 14\} days by component class with bootstrap CIs.

Figure 3. Lead-time characteristics from calibrated quantiles. (a) LTA vs. policy threshold τ at q = 0.25 on the held-out test set. (b) Class-wise LTA at discrete τ values with 95% CIs.

Panel a shows a monotonic decline in LTA with increasing τ. At τ = 5 days, LTA = 0.91 (95% CI: 0.88–0.93); at τ = 10 days, LTA = 0.79 (0.75–0.82); at τ = 14 days, LTA = 0.66 (0.61–0.71). Panel b indicates that pumps and fans exhibit higher LTA than chillers at longer thresholds, consistent with smoother degradation paths in rotating equipment compared with sporadic compressor trips. These lead-time profiles inform the trigger threshold selection in the decision analysis.

3.7. Trigger Behavior, False Alarms, and Timeline Exemplars

Decision translation from forecasts to work orders was examined through thresholded, sustained triggers per Equation (14). Trigger dynamics, lead-time distributions, and false-trigger rates are summarized in Figure 4 across three panels. Figure 4a depicts a representative chiller’s timeline over three months, with predicted {\hat{r}}^{(0.25)}, trigger state, COP, and the observed failure time, illustrating how a sustained low-quantile crossing anticipates an outage. Figure 4b presents kernel density estimates of achieved lead times (difference between trigger time and failure time) for all components at τ = 10 days. Figure 4c reports the false-trigger incidence per 30 operating days as a function of τ under the sustained condition (≥80% of samples in a 12 h window).

Figure 4. Trigger dynamics and quality. (a) Chiller timeline with calibrated RUL, trigger state, COP, and failure marker. (b) Lead-time distributions at τ = 10 days across component classes. (c) False-trigger incidence per 30 days vs. τ with bootstrap 95% CIs.

In panel a, the calibrated RUL trajectory crosses the τ = 10 days line 11.3 days before the observed failure and remains below threshold, generating a single sustained trigger; COP degradation accelerates in the final week, aligning with the predicted window. Panel b shows median achieved lead times of 11.0, 12.4, and 12.1 days for chillers, pumps, and fans, respectively, with interquartile ranges of 8.5–14.7 days across classes. Panel c indicates low false-trigger rates under sustained conditions; at τ = 10 days the mean incidence is 0.28 per 30 days (95% CI: 0.19–0.39) aggregated across classes, decreasing further at τ = 14 days due to higher confidence thresholds. These dynamics demonstrate that calibrated quantiles and sustained logic yield stable, actionable signals.

3.8. Predictive Performance Against Point Metrics and Cross-Mode Robustness

Accuracy within and across operating modes was assessed to evaluate generalization under different thermal loads. The class- and mode-specific errors are compiled in Table 5, with MAE and RMSE reported for the low, shoulder, and peak modes; transient windows are omitted from training by design and therefore excluded.

Table 5. Mode-stratified RUL errors (hours) with 95% CIs (block bootstrap, 24 h blocks).

Errors increase modestly under peak load for all classes, reflecting greater variability in control states and ambient drivers. The magnitude of the increase remains within planning horizons for the thresholds examined in Section 3.6, indicating that mode-aware segmentation preserved generalization.

3.9. Comparative Evaluation Against Fixed-Interval Maintenance

Counterfactual analysis compared the historical fixed-interval program to the PdM triggers derived from calibrated RUL. The primary comparison focused on unplanned outage counts, total downtime hours, and the ratio of planned to unplanned actions. The quantitative outcomes are summarized in Table 6 with bootstrap 95% CIs and paired bootstrap p-values over asset-period units.

Table 6. Comparison of baseline fixed-interval program and PdM trigger policy at τ = 10 days, q = 0.25. Values are means over test horizon with 95% CIs; p-values from paired block bootstrap.

The PdM policy reduced unplanned outages by approximately half and decreased total downtime by 41.3% with statistically significant paired differences. The planned-to-unplanned ratio more than doubled, reflecting a shift toward scheduled interventions derived from RUL triggers. Lead times centered near 12 days under the selected threshold, consistent with Section 3.7.

3.10. Energy Impacts from Rectification Timing

Energy effects were evaluated by contrasting degraded COP trajectories with rectified post-maintenance baselines around each trigger episode and summing the implied kWh differential per Equation (16). The results are presented in Figure 5 across three panels. Figure 5a shows the mean COP trajectory aligned to the maintenance timestamp (t = 0) across episodes, with pre/post behavior aggregated by component class. Figure 5b reports per-episode energy savings estimates across the 30 days following maintenance, stratified by class. Figure 5c summarizes the building-level annualized savings distribution under the PdM policy with bootstrap CIs.

Figure 5. Energy effects of PdM timing. (a) Mean COP trajectories aligned to maintenance at t = 0, with 95% bands. (b) Per-episode 30-day $E_{save}$ distributions (kWh) by class. (c) Annualized HVAC energy savings percentage with 95% CIs from resampling days and COP trajectories.

Panel a shows COP recovery steps of 0.18–0.31 for chillers and 0.05–0.09 for pumps and fans (the latter reflecting auxiliary loads), with recovery persistence sustained over four weeks. Panel b indicates median per-episode $E_{save}$ of 29.4 MWh for chillers (IQR: 21.1–38.7 MWh), 4.2 MWh for pumps (2.9–6.1 MWh), and 2.8 MWh for fans (1.9–4.1 MWh). Panel c aggregates these effects to an annualized savings of 10.6% of HVAC electricity (95% CI: 7.7–13.2%), benchmarked against the baseline schedule’s realized maintenance timing. The distribution remains right-skewed due to a few large chiller rectifications during peak season.
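To make the kWh differential concrete, the sketch below sums the gap between degraded and rectified electrical input over an episode. Equation (16) is not reproduced in this section, so the form used here is an assumption: electrical input is taken as cooling load divided by COP at each 1 min step. The function name and the example load are hypothetical.

```python
def episode_energy_saving_kwh(q_cool_kw, cop_degraded, cop_rectified, dt_h=1.0 / 60.0):
    """Sum the kWh differential between degraded and rectified operation.

    Assumption: electrical input power (kW) = cooling load (kW) / COP.
    dt_h is the sampling interval in hours (1 min by default).
    """
    if not (len(q_cool_kw) == len(cop_degraded) == len(cop_rectified)):
        raise ValueError("all series must have the same length")
    total = 0.0
    for q, c_deg, c_rec in zip(q_cool_kw, cop_degraded, cop_rectified):
        # Positive when the rectified COP is higher than the degraded COP.
        total += q * (1.0 / c_deg - 1.0 / c_rec) * dt_h
    return total

# Illustrative values only: one hour at a constant 1000 kW cooling load with a
# COP step of 0.25, which lies inside the 0.18-0.31 chiller range reported above.
saving = episode_energy_saving_kwh([1000.0] * 60, [4.0] * 60, [4.25] * 60)
```

Because the saving scales with 1/COP, the same COP step is worth more energy at a low baseline COP than at a high one.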

3.11. Operational Cost Outcomes Under a Unified Tariff

The cost model from Equation (15) was instantiated using a unified tariff of 0.18 SAR/kWh to maintain consistency across scenarios; for cross-reference, 1 USD was taken as 3.75 SAR. Labor and parts costs were drawn from internal work-order records and standardized per action type across scenarios. Because a unified volumetric rate excludes time-of-use differentials and peak-demand charges common in commercial tariffs, the currency-denominated savings in Table 7 should be interpreted as conservative: for the same kWh reduction and avoided outages (concentrated during high-load afternoon operation), higher peak energy prices and demand charges would further decrease $C_{energy}$ and $C_{down}$ relative to the baseline rather than diminish the savings. With this conservative framing, the results in Table 7 show a 9.7% reduction in total operational cost; under the stated conversion rate, the reduction corresponds to ~261 kUSD per year.
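The tariff and currency arithmetic can be checked with a few lines. The two constants are the values stated above; the helper names are hypothetical, and the example simply applies the flat tariff to the median chiller episode saving from Section 3.10.

```python
TARIFF_SAR_PER_KWH = 0.18  # unified volumetric tariff stated in the text
SAR_PER_USD = 3.75         # conversion rate stated in the text

def kwh_to_sar(kwh):
    """Value an energy quantity at the flat tariff (no time-of-use or demand charges)."""
    return kwh * TARIFF_SAR_PER_KWH

def sar_to_usd(sar):
    return sar / SAR_PER_USD

def usd_to_sar(usd):
    return usd * SAR_PER_USD

# Median chiller episode saving of 29.4 MWh valued at the flat tariff.
chiller_episode_sar = kwh_to_sar(29_400)
chiller_episode_usd = sar_to_usd(chiller_episode_sar)

# The ~261 kUSD annual reduction expressed in thousands of SAR.
annual_reduction_ksar = usd_to_sar(261.0)
```

Under these constants a median chiller episode is worth roughly 5.3 kSAR (about 1.4 kUSD) in energy alone, and the ~261 kUSD annual reduction is roughly 979 kSAR.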

Table 7. Annualized cost components under baseline and PdM policies at τ = 10 days, q = 0.25; values in SAR (thousands), mean [95% CI].

3.12. Sensitivity of Policy Parameters and Ablation of Modeling Choices

Policy thresholds and modeling design choices shape outcomes across reliability, cost, and energy. The sensitivity and ablation findings are shown in Figure 6 across four panels. Figure 6a quantifies total downtime hours as a function of τ at q = 0.25. Figure 6b maps false-trigger incidence versus quantile level q at fixed τ = 10 days. Figure 6c compares MAE across loss functions (Huber, $\ell_{1}$, $\ell_{2}$) on the held-out test set. Figure 6d characterizes MAE as a function of sequence length (look-back horizons of 360, 720, 1080, and 1440 min).

Figure 6. Policy sensitivity and modeling ablations. (a) Total downtime vs. threshold τ. (b) False-trigger incidence per 30 days vs. quantile q at τ = 10 days. (c) MAE by loss function with 95% CIs. (d) MAE vs. sequence length.

Panel a shows a U-shaped downtime curve with a shallow minimum near τ = 10–12 days; shorter thresholds occasionally compress lead windows and allow late interventions, while longer thresholds increase early replacements without additional avoidance benefit. Panel b indicates decreasing false-trigger incidence as q decreases (more conservative calibration); at q = 0.25 incidence remains below 0.3 per 30 days, while at q = 0.5 it rises to ~0.6 per 30 days. Panel c demonstrates that the Huber loss yields lower MAE than pure $\ell_{2}$ (relative reduction 7.9%) and comparable or slightly better MAE than $\ell_{1}$, with narrower CIs, in line with robustness to label noise and abrupt post-failure corrections. Panel d reveals diminishing returns beyond 720 min; extending the horizon to 1440 min yields marginal MAE change within overlapping CIs, suggesting that the chosen memory length captures the salient degradation context without overfitting.
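The robustness argument for the Huber loss can be illustrated with the standard textbook form, using the threshold κ = 6 h listed in Table A2. Equation (11) is not reproduced in this section, so the exact scaling below (quadratic term with a factor of one half) is an assumption, and the function names are hypothetical.

```python
def huber(residual_h, kappa=6.0):
    """Standard Huber loss: quadratic within +/- kappa hours, linear beyond."""
    a = abs(residual_h)
    if a <= kappa:
        return 0.5 * a * a
    return kappa * (a - 0.5 * kappa)

def l1(residual_h):
    return abs(residual_h)

def l2(residual_h):
    # Half squared error, so it coincides with Huber inside the threshold.
    return 0.5 * residual_h * residual_h

# A small residual is penalized like l2; a large one grows only linearly,
# so a mis-anchored failure label cannot dominate the batch loss.
for r in (3.0, 6.0, 30.0):
    print(f"residual={r:5.1f} h  huber={huber(r):6.1f}  l1={l1(r):5.1f}  l2={l2(r):6.1f}")
```

At a 30 h residual the half squared error is 450 while the Huber value is 162, which is the mechanism behind the lower sensitivity to label noise and abrupt post-failure corrections.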

3.13. Mode Generalization and Peak-Season Stress Testing

Generalization under the Riyadh peak season was examined using the reserved hold-out month at the summer apex. The results are presented in Figure 7 across three panels. Figure 7a reports per-class MAE and RMSE during the hold-out month compared with the overall test period to evaluate stability under extreme cooling demand. Figure 7b shows LTA at τ = 10 days for the hold-out month versus the rest of the test set. Figure 7c displays the distribution of achieved lead times for chiller episodes occurring in the hold-out month.

Figure 7. Peak-season robustness on the hold-out month. (a) MAE and RMSE per class during peak month vs. overall test. (b) LTA at τ = 10 days in peak month vs. rest. (c) Achieved lead times for chiller episodes during peak month.

Panel a shows modest error inflation during the peak month: chillers MAE 27.9 h (vs. 24.7 h overall), pumps 21.2 h (vs. 19.6 h), fans 19.1 h (vs. 17.3 h). RMSE increases are proportional and remain within actionable ranges. Panel b indicates LTA of 0.75 (peak month) versus 0.79 (rest) at τ = 10 days and q = 0.25, with overlapping CIs. Panel c reveals that achieved chiller lead times remain centered above 10 days, with a slight left-tail extension due to several rapid degradations under extreme load. The combined patterns confirm that the segmentation and calibration strategies maintain utility under peak stress.

3.14. Negative and Edge Outcomes with Root-Cause Evidence

A subset of episodes did not yield the intended operational benefits. These cases are documented in Table 8, which enumerates instances of late triggers (achieved lead time < 5 days), false negatives (no trigger prior to failure), and inconsequential triggers (trigger led to inspection with no corrective action), alongside plausible root causes drawn from the QA ledger and technician notes.

Table 8. Edge outcomes and observed contributors during the test horizon.

The small number of late triggers and false negatives is consistent with the tail behavior in Figure 2b and the left tail of achieved lead times in Figure 4b. The logged technician notes indicate that exogenous operational interventions and rare mode combinations occasionally break the learned degradation patterns, suggesting future gains from causal features capturing control overrides and upstream plant conditions.

As these edge outcomes include ‘inconsequential triggers’ in Table 8, this study defines that class as sustained triggers that led to inspection or routine service without parts replacement and without a durable shift in COP or vibration relative to pre-trigger baselines in the subsequent follow-up window; consistent with the technician notes, these were neither counted as avoided outages nor as false triggers in downtime comparisons and are reported separately for transparency. This accounting distinguishes conservative early warnings under low-load modes from false alarms, and the comparative baselines that follow are interpreted under this convention.

3.15. Comparative Baselines Within the Modeling Family

Sequence learning was benchmarked against baselines spanning both non-recurrent and recurrent families to isolate the value of temporal recurrence and attention. The comparative results are consolidated in Table 9, which reports MAE and LTA at τ = 10 days for five alternatives: (i) feed-forward network on aggregated statistics (no recurrence), (ii) rule-based health indices, (iii) a parameter-matched GRU, (iv) a parameter-matched temporal convolutional network (TCN), and (v) a lightweight Transformer encoder, alongside the configured LSTM ensemble.

Table 9. Comparative baselines (non-recurrent and recurrent) versus the configured LSTM ensemble on the held-out test set. Values are mean ± 95% CI from 24 h block bootstrap; LTA computed at τ = 10 days.

The sequence-aware approach outperforms aggregation-only and rule-based strategies on both error and lead-time metrics, consistent with the capacity to encode long- and short-range dependencies described in the methodology.

3.16. Translating Prediction Quality to Scheduling Reliability

The operational utility of forecasts depends on the joint distribution of residuals and policy parameters. The relationship between mis-scheduling penalty from Equation (21) and the calibration quantile is reported in Figure 8 as a single-panel analysis. Figure 8 plots the expected penalty Π against q for τ ∈ {7, 10, 14} days, using bootstrap estimates of the residual distribution.

Figure 8. Expected mis-scheduling penalty Π vs. calibration quantile q at three thresholds τ, with bootstrap 95% CIs.

The curves exhibit shallow minima between q = 0.20 and q = 0.30 across thresholds, reflecting the trade-off between avoiding insufficient lead and minimizing overly early actions. The selected configuration at q = 0.25 lies near the empirical minima for all classes, aligning with the low false-trigger incidence observed in Figure 4c and the high LTA in Figure 3.

3.17. End-to-End Reproducibility Checks and Inference Performance

Operational reproducibility was audited by re-running the full pipeline from raw extracts using the versioned configurations and seeds. Byte-wise equality checks confirmed identity of intermediate feature tables and final predictions across two independent runs. Inference latency and resource utilization were then measured on the export container to verify deployability against BMS streaming rates. The quantitative deployment metrics are compiled in Table 10.

Table 10. Inference performance and throughput on the export container.

Throughput exceeds the requirements of minute-level streaming from the monitored assets by more than an order of magnitude, and end-to-end latency remains below 200 ms, confirming operational feasibility.

4. Discussion

The findings demonstrate that an LSTM-based PdM pipeline trained on multiyear BMS telemetry yields RUL estimates with MAE near one day, quantile calibration error (QCE) ≤ 0.06, and lead-time accuracy that remains ≥ 0.79 at a 10-day threshold. These properties translated into fewer unplanned outages (−47.6%), shorter total downtime (−41.3%), and a shift toward planned interventions, indicating that sequence-aware representations and isotonic calibration jointly produced decision-relevant forecasts rather than nominal point estimates. The associated COP recovery around triggered actions generated a 10.6% reduction in HVAC electricity and a 9.7% decrease in total annual operating cost despite slightly higher planned maintenance expenditure, which indicates that anticipatory scheduling captured degradation early enough to avert efficiency losses while maintaining practicable lead windows.

Relative to prior work that concentrates on FDD classification for AHUs and chillers, the present results address prognostics by producing component-level RUL on field BMS data and by quantifying operational consequences, including downtime and energy effects [,]. Studies applying LSTM to anomaly detection or load prediction reported strong pattern recognition but generally stopped short of translating outputs into maintenance timing or cost deltas, often because labels were simulated or limited to short horizons []. Prognostic gains comparable to industrial PHM benchmarks are observed here, which aligns with evidence that recurrent models capture slow drifts in health indices; divergence in absolute errors from manufacturing reports likely reflects greater exogenous variability and control interventions in buildings []. Calls for domain adaptation and mode-aware learning in building analytics are addressed by the mode-stratified performance and peak-season stability, which suggest partial transferability across operating regimes within a hot-arid climate [,]. The explicit energy and cost attribution advances beyond earlier FDD-centric case studies that inferred, but did not compute, system-level impacts [].

Several limitations temper generalization. The single-site design—a Class A office tower in Riyadh with a pronounced weekday day–night load cycle—constrains external validity across facility types and climates; facilities with irregular or near-constant loads (e.g., hospitals, data centers, manufacturing plants) may exhibit distinct degradation trajectories and trigger economics that alter policy behavior. Label scarcity and right-censoring may bias RUL tails; semi-/self-supervised pretraining, hierarchical Bayesian calibration, and conformal quantile adjustment could widen valid operating envelopes. Exogenous control overrides and upstream plant disturbances produced late or missed triggers in a minority of episodes; incorporating supervisory-control states, causal water-side features, and operator-in-the-loop verification may reduce these errors.

Exogenous control overrides and upstream plant disturbances also motivate a hybrid, physics-guided residual design in which first-principles chiller-plant balances (Equations (3)–(5)) provide a mechanistic prior and the LSTM learns only the residual RUL dynamics under realistic operating variability. In parallel, we will augment the feature set with discrete supervisory-control flags (manual/auto, local/remote, lockout, alarm inhibits) and upstream loop status indicators (e.g., condenser-water pump trips, valve faults) so that the model can distinguish external system shocks from internal degradation. This should improve operator trust while preserving the calibrated trigger policy, and it complements the COP-based energy attribution discussed next.

COP-based energy attribution assumes stable post-maintenance baselines; joint identification of capacity and efficiency with weather-normalized change-point models could refine kWh estimates. Counterfactual comparisons embed assumptions on scheduling capacity and tariffs; prospective randomized or quasi-experimental deployments would mitigate policy confounding.

These assumptions on tariffs imply that the currency results reported under a unified 0.18 SAR/kWh rate are conservative; in tariff regimes with time-of-use and peak-demand charges, avoided peak-period failures and sustained COP would amplify monetary savings for the same kWh reduction, motivating site-specific tariff models in future deployments.

Within these limitations, this study recognizes that RUL labels derived from maintenance logs can include onset-time uncertainty and class imbalance; the Methods therefore combine robust loss (Equation (11)), conservative lower-quantile calibration (Section 2.8), and sustained triggers (Equation (14)) to reduce the operational impact of mis-anchored events. To better exploit abundant unlabeled ‘healthy’ hours and further attenuate label noise, future iterations will add semi-/self-supervised representation learning on operating-mode-stratified sequences (e.g., masked sequence prediction or autoencoding of normal regimes), followed by supervised fine-tuning and the same quantile calibration; this design keeps governance simple while expanding coverage of healthy variability, and directly complements the failure-mode assessment that follows.

These counterfactual comparisons and limitations motivate a focused assessment of operational failure modes and explainability in this study, specifically the risks posed by sensor drift, outlier effects, exogenous control overrides, and cross-condition shifts. Within the quality-assurance steps already implemented (plausibility bounds and spot-calibration checks to attenuate slow drift and timebase inconsistencies, robust cleaning with masking of longer gaps, and encoder-rollover handling), outliers and soft faults are damped before sequence learning, while EWMA-filtered indicators (e.g., COP and $ΔT_{coil}$) support change monitoring under minute-level noise. At the modeling layer, the Huber loss, block-bootstrap uncertainty, and isotonic lower-quantile calibration combined with a sustained-condition trigger (80% over 12 h) form a conservative policy that reduces spurious alarms and cushions late detections under abrupt overrides; observed late or missed episodes were associated with manual setpoint clamps and upstream disturbances and are therefore mitigable by incorporating supervisory-control states and water-side causal features. Explainability is addressed by path-integrated gradient attributions aggregated by operating mode, which consistently elevate $ΔT_{coil}$, $a_{RMS}$, and COP and align with physics-based expectations; in deployment, we will log these attributions with each trigger to provide operator-visible rationales (e.g., declining $ΔT_{coil}$ at comparable load alongside rising $a_{RMS}$). These governance and interpretability measures strengthen operator trust and make the trigger logic auditable, positioning multi-site evaluations and transfer learning to proceed with actionable transparency.
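The sustained-condition trigger described above (80% over 12 h) can be sketched as a rolling count over the calibrated lower-quantile RUL stream. Equation (14) is not reproduced in this section, so the exact condition is an assumption: the trigger is taken to fire at the first minute where at least 80% of the trailing 720 one-minute estimates are at or below τ. The function name and arguments are hypothetical.

```python
import math
from collections import deque

def sustained_trigger(rul_q_days, tau_days=10.0, window=720, frac=0.8):
    """Index of the first sample where the condition RUL_q <= tau has held for
    at least `frac` of the trailing `window` samples, or None if it never does.

    rul_q_days: calibrated lower-quantile RUL estimates in days, one per minute.
    window:     720 samples corresponds to 12 h of 1 min data.
    """
    needed = math.ceil(frac * window - 1e-9)  # samples required inside the window
    recent = deque(maxlen=window)
    below = 0
    for i, rul in enumerate(rul_q_days):
        if len(recent) == window:
            below -= recent[0]  # the oldest flag is about to be evicted
        flag = 1 if rul <= tau_days else 0
        recent.append(flag)
        below += flag
        if len(recent) == window and below >= needed:
            return i
    return None
```

Requiring a full window before firing means a brief dip below τ, for example during a transient, does not raise a work order; only a persistent condition does.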

Future research should pursue multi-site evaluations across Saudi regions, climates, and facility types (e.g., hospitals, data centers, and manufacturing plants), with transfer learning and domain adaptation to accommodate equipment heterogeneity. Concretely, a portfolio model can be pre-trained on extended office-building telemetry and fine-tuned on modest periods of site-specific data while re-calibrating quantiles and re-tuning trigger thresholds to local maintenance lead times. Label-efficient architectures that fuse attention, temporal convolutions, and physics-informed constraints merit testing against rare failure modes. Uncertainty-aware deployment using conformal prediction and risk-sensitive triggers could align RUL quantiles with service-level objectives. Coupling PdM with MPC to co-optimize setpoints and maintenance windows, and conducting prospective A/B trials with operator governance, would establish causal performance and clarify scalability. Integrating economic dispatch, comfort constraints, and embodied carbon accounting would extend decision relevance beyond reliability and energy alone.

5. Conclusions

This study demonstrates that an LSTM-driven PdM pipeline operating on multiyear BMS telemetry delivers decision-quality RUL forecasts that translate into measurable reliability and energy benefits. On the held-out test set, MAE remained near one day (weighted MAE 19.8 ± 2.1 h; RMSE 29.1 ± 3.3 h), quantile calibration error (QCE) ≤ 0.06, and lead-time accuracy reached 0.79 at a 10-day threshold (95% CI: 0.75–0.82). When converted to operational actions via sustained triggers, these properties reduced unplanned outages by 47.6% (paired bootstrap p = 0.008) and total downtime by 41.3% (p = 0.012), while shifting the mix toward planned interventions (planned/unplanned ratio: 1.68 vs. 0.74). COP recovery at maintenance aligned with these triggers produced an annualized 10.6% reduction in HVAC electricity (95% CI: 7.7–13.2%), yielding a 9.7% decrease in total operating cost despite a modest rise in planned maintenance expenditure. These outcomes directly address this study aim by linking sequence-aware prognostics to scheduling lead time, reliability outcomes, and kWh impacts under hot-arid operating regimes.

These outcomes also extend building-analytics literature by demonstrating, on multiyear field BMS streams, component-level RUL forecasting that is (i) quantile-calibrated and converted into sustained, schedule-aware triggers, (ii) mode-aware for cross-condition robustness, and (iii) counterfactually linked to downtime, COP, and cost so forecasts translate into decision-quality lead windows. For policy and energy management, the evidence supports adopting conservative lower-quantile calibration (≈0.25) with 10–12 day thresholds as governance defaults; tracking lead-time accuracy as a service-level KPI in maintenance contracts; prioritizing BMS data quality and auditability to align with existing inspection standards; and integrating trigger-driven work orders with seasonal peak-readiness and tariff-aware scheduling to sustain COP and reduce unplanned outages. These implications provide portfolio-level guidance while naturally connecting to the forthcoming component-level and mode-stratified analyses.

Component-level and mode-stratified analyses clarify where the approach is most reliable and where margins tighten. Pumps and AHU fans exhibited lower errors than chillers (e.g., fans MAE 17.3 ± 2.2 h vs. chillers 24.7 ± 3.8 h), consistent with smoother degradation in rotating equipment, and all classes showed controlled error inflation at peak load (e.g., chillers MAE 26.1 ± 4.1 h at peak vs. 22.8 ± 4.5 h at low load). Peak-season stress testing preserved actionability, with LTA 0.75 during the hold-out month compared with 0.79 for the remainder and chiller lead times centered above 10 days. Internal benchmarks confirmed the methodological contribution: the configured LSTM ensemble outperformed GRU (20.7 ± 2.3 h; LTA 0.77 ± 0.04), TCN (21.4 ± 2.5 h; LTA 0.75 ± 0.05), a lightweight Transformer (22.6 ± 2.8 h; LTA 0.73 ± 0.05), as well as aggregation-only and rule-based baselines (25.6 ± 3.1 h/0.68 ± 0.05 and 31.9 ± 4.4 h/0.54 ± 0.06), indicating that temporal recurrence with gated memory and conservative quantile calibration are essential for schedule-relevant forecasts.

Several constraints temper generalization and delineate priorities for subsequent work. The single-site design in Riyadh limits climatic and equipment diversity; transferability to coastal or humid regions requires replication with distinct condenser-water behavior and load profiles. Label scarcity and right-censoring may bias tail behavior, as reflected by rare late triggers and two false negatives; semi-/self-supervised pretraining, hierarchical calibration, and conformal quantiles could mitigate these effects. Exogenous control overrides and upstream disturbances occasionally disrupted degradation regularities, suggesting value in incorporating supervisory control states and water-side causal features. Energy attribution assumed stable post-maintenance COP baselines; joint identification of capacity and efficiency under weather-normalized change-point models would refine kWh estimates. Counterfactual policy comparisons depended on assumptions about work-order capacity and tariffs; prospective randomized or quasi-experimental deployments would strengthen causal inference.

Future research should extend evaluation across multiple Saudi regions and climates, pair domain adaptation with transfer learning for heterogeneous equipment fleets, and integrate uncertainty-aware triggering with service-level objectives through conformal prediction. Joint optimization that couples PdM with MPC to co-schedule setpoints and maintenance windows merits testing in live operations, and prospective A/B deployments with operator governance would quantify causal improvements in downtime and energy. Expanding the cost framework to include comfort constraints and embodied carbon would broaden decision relevance while preserving the reproducible pipeline demonstrated here. Collectively, the results establish that calibrated sequence models, mode-aware segmentation, and auditably simple trigger logic form a coherent framework for reliable, energy-aware maintenance scheduling in cooling-dominated commercial buildings.

Author Contributions

Conceptualization, M.A.; Methodology, M.A.; Software, M.A.; Validation, M.A., M.S., A.A. and E.N.; Formal analysis, M.A.; Investigation, M.A.; Resources, M.A.; Data curation, M.A.; Writing—original draft, M.A.; Writing—review and editing, M.A., M.S., A.A. and E.N.; Visualization, M.A.; Supervision, M.A., M.S., A.A. and E.N.; Project administration, M.A., M.S., A.A. and E.N.; Funding acquisition, M.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data will be made available on request.

Acknowledgments

The authors are thankful to the Deanship of Graduate Studies and Scientific Research at Najran University for funding this work under the Easy Funding Program grant code (NU/EFP/SERC/13/222). During the preparation of this manuscript, the authors used OpenAI ChatGPT (GPT-5) to assist with text drafting and readability improvements. The authors reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1 summarizes preprocessing parameters associated with Equations (1)–(7); the table also notes which settings were fixed a priori versus tuned.

Table A1. Preprocessing parameters for smoothing and imputation (Methods, Equations (1)–(7)).

Step (Eq.)	Parameter	Value/Setting	Applied Channels	Fixed vs. Tuned	Rationale/Source
Moving-average smoothing (Equation (1))	Window (W)	5 min	Channel-wise after outlier removal; stabilizes temperature & power; derived features (ΔT, COP) computed after smoothing	Fixed	Explicitly stated in Section 2.3; chosen to reduce minute-scale noise without phase shift.
Standardization (Equation (2))	$μ_{j}, σ_{j}$ scope	Computed on training split only; applied to val/test	All standardized features	Fixed (policy)	Prevents leakage and scale dominance; Methods specify train-only fitting.
Short-gap interpolation (pre-Equation (1))	Gap threshold	≤15 min	All features within operating mode	Fixed	LOCF within mode for gaps ≤ 15 min; longer gaps masked during sequence construction.
Short-gap interpolation (pre-Equation (1))	Method	Last-observation-carried-forward (LOCF) within operating mode	All features within operating mode	Fixed	As stated in Section 2.3; avoids biasing across modes.
Long-gap handling (pre-Equation (1))	Masking rule	Gaps > 15 min left missing and masked	All features	Fixed	Ensures no synthetic fill across extended outages; aligns with leakage control.
EWMA filtering (Equation (7))	Smoothing factor λ	0.2	COP, $ΔT$, $a_{RMS}$	Fixed	Explicitly stated; dampens residual noise while preserving drifts.
EWMA filtering (Equation (7))	Application scope	Applied only to the three indicators listed under Applied Channels	COP, $ΔT$, $a_{RMS}$	Fixed	Channel scope enumerated in Section 2.4.
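Three of the Table A1 steps can be sketched in plain Python to show how the parameters interact: LOCF for gaps of at most 15 min, EWMA with λ = 0.2, and standardization fitted on the training split only. This is an illustration under stated assumptions rather than the authors' pipeline: missing samples are represented as `None`, the series is assumed to come from a single operating mode, the EWMA recursion is the conventional one, and all function names are hypothetical. The 5 min moving average is omitted because its exact windowing (Equation (1)) is not reproduced here.

```python
def locf_short_gaps(values, max_gap=15):
    """Fill runs of None no longer than max_gap with the last observation;
    longer runs stay None so they can be masked during sequence construction."""
    out = list(values)
    i, n = 0, len(out)
    while i < n:
        if out[i] is None:
            j = i
            while j < n and out[j] is None:
                j += 1  # j is the first index after the gap
            if i > 0 and (j - i) <= max_gap:
                for k in range(i, j):
                    out[k] = out[i - 1]
            i = j
        else:
            i += 1
    return out

def ewma(values, lam=0.2):
    """Conventional EWMA s_t = lam * x_t + (1 - lam) * s_{t-1}; masked samples stay None."""
    out, s = [], None
    for x in values:
        if x is None:
            out.append(None)
            continue
        s = x if s is None else lam * x + (1.0 - lam) * s
        out.append(s)
    return out

def fit_standardizer(train_values):
    """Mean and standard deviation from the training split only (prevents leakage)."""
    obs = [x for x in train_values if x is not None]
    mu = sum(obs) / len(obs)
    sigma = (sum((x - mu) ** 2 for x in obs) / len(obs)) ** 0.5
    return mu, (sigma if sigma > 0 else 1.0)

def standardize(values, mu, sigma):
    return [None if x is None else (x - mu) / sigma for x in values]
```

In use, `fit_standardizer` would be called once on the training series and the resulting pair applied unchanged to the validation and test series.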

Table A2. Hyperparameter search space (Bayesian optimization) and selected values (validation RMSE).

Hyperparameter	Search Space (Type)	Selected Value (Used in This Work)	Notes
Learning rate (Adam)	log-uniform in [1 × 10⁻⁴, 1 × 10⁻²]	0.003	Cosine decay; early stopping on validation RMSE
Weight decay (AdamW)	fixed (policy)	0.01	Decoupled weight decay (AdamW); not applied to bias and layer-normalization parameters
Hidden size—LSTM layer 1	categorical {64, 128, 192}	128	Two stacked LSTMs; layer norm applied
Hidden size—LSTM layer 2	categorical {32, 64, 128} (constraint: ≤layer 1)	64	Matches architecture reported in Section 2.7
Dropout (post-LSTM)	uniform in [0.00, 0.50]	0.20	Applied after stacked LSTMs
Sequence length (look-back, min)	categorical {360, 720, 1080, 1440}	720	Covers diurnal/control cycles; sensitivity shown in Section 2.14
Huber threshold κ (hours)	fixed (tuned on validation grid)	6	Robust loss per Equation (11)
LR scheduler (cosine decay)	fixed (policy)	Cosine decay without restarts; T_max = 80 epochs; eta_min = 1 × 10⁻⁶; no warmup	Scheduler evaluated per epoch; lower bound avoids numerical zero; early stopping may halt earlier
Ensemble size & seeds	fixed	5 models; seeds {11, 19, 27, 35, 43}	Averages independent fits; reduces variance

Note: Optimization budget: 64 trials (Bayesian optimization on the validation split). Selection criterion: minimum validation RMSE with early stopping; ties broken by lower validation MAE. Application: Selected values used for all reported LSTM results and for parameter-matched baselines where applicable.
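The learning-rate schedule in Table A2 can be written in closed form. The sketch below uses the standard cosine-annealing expression with the listed values (initial rate 0.003, eta_min = 1 × 10⁻⁶, T_max = 80 epochs); holding the rate at eta_min after T_max is an assumption made here to reflect "without restarts", and the function name is hypothetical.

```python
import math

def cosine_lr(epoch, lr_max=0.003, lr_min=1e-6, t_max=80):
    """Cosine-annealed learning rate evaluated once per epoch, without warmup."""
    # Clamp so the rate stays at lr_min after t_max instead of rising again.
    epoch = min(max(epoch, 0), t_max)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * epoch / t_max))

schedule = [cosine_lr(e) for e in range(0, 81, 20)]  # epochs 0, 20, 40, 60, 80
```

The rate starts at 0.003, passes about 0.0015 at epoch 40, and reaches the 1 × 10⁻⁶ floor at epoch 80 unless early stopping halts training sooner.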

Figure A1. Per-feature attributions for RUL (integrated gradients) by operating mode and component class. Bars show median integrated-gradient attribution per feature, aggregated over the 720 min input window and episodes, normalized to sum to one per prediction and then averaged within mode (low, shoulder, peak) and component class (chillers, primary pumps, AHU fans). Error whiskers denote 24 h block-bootstrap 95% intervals. Features: $ΔT_{coil}$ (coil temperature difference), $a_{RMS}$ (bearing/fan broadband vibration RMS), COP (plant coefficient of performance), $Q_{cool}$ (cooling-load proxy), valve position, and outside-air temperature. Across classes, $ΔT_{coil}$, $a_{RMS}$, and COP contribute the largest negative attributions to RUL (shorter predicted life) under comparable load, with $a_{RMS}$ dominating rotating equipment and $ΔT_{coil}$–COP dominating chillers; mode-wise attribution patterns remain stable, supporting operational interpretability.

Table A3. Permutation importance by feature (normalized to $ΔT_{coil}$ = 1.00) on the training split with 24 h block bootstrap used for stability checks.

Feature	Normalized Permutation Importance (Overall)
$Δ T_{coil}$	1.00
$a_{RMS}$	0.86
COP	0.74
$Q_{cool}$	0.49
Valve position	0.38
Outside-air temperature	0.29

Note: Values correspond exactly to the Results narrative (Section 3.2) and support the integrated-gradient attribution patterns summarized in Figure A1.
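The normalization used in Table A3 can be illustrated with a generic permutation-importance routine. This is a sketch of the general technique, not the authors' code: it scores each feature by the mean increase in MAE after shuffling that column and divides by the largest score so the top feature equals 1.00. The `predict` callable, the row layout, and all names are hypothetical.

```python
import random

def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean MAE increase per shuffled feature, normalized to the top feature.

    predict: callable mapping one feature row (a list) to a prediction.
    X:       list of feature rows; y: list of targets.
    """
    rng = random.Random(seed)
    base = mae(y, [predict(row) for row in X])
    raw = []
    for j in range(len(X[0])):
        deltas = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            deltas.append(mae(y, [predict(r) for r in X_perm]) - base)
        raw.append(sum(deltas) / n_repeats)
    top = max(raw)
    return [r / top for r in raw] if top > 0 else raw
```

For minute-level telemetry, shuffling within 24 h blocks rather than across the whole series would be one way to respect the autocorrelation that the stability checks in the caption account for.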

Figure 1. Engineered feature behavior and operating-mode structure. (a) $\Delta T_{coil}$ distributions across AHUs under cooling, stratified by valve position deciles. (b) Plant COP trajectories by week with EWMA smoothing and annotated maintenance intervals. (c) Time-aligned $a_{RMS}$ at pump and fan bearings over the 21 days preceding failures. (d) Operating-mode clusters in the $(T^{OA}, \dot{Q}_{cool})$ plane with cluster centroids and occupancy share.

Figure 2. Fitting dynamics, residuals, and calibration. (a) Huber loss over epochs for training and validation across five seeds; early stopping points indicated. (b) Standardized residuals on the test set, faceted by component class. (c) Reliability diagrams for q ∈ {0.1, 0.25, 0.5} with isotonic calibration curves and ideal 1:1 lines.

Figure 3. Lead-time characteristics from calibrated quantiles. (a) LTA vs. policy threshold τ at q = 0.25 on the held-out test set. (b) Class-wise LTA at discrete τ values with 95% CIs.

Figure 4. Trigger dynamics and quality. (a) Chiller timeline with calibrated RUL, trigger state, COP, and failure marker. (b) Lead-time distributions at τ = 10 days across component classes. (c) False-trigger incidence per 30 days vs. τ with bootstrap 95% CIs.

Figure 5. Energy effects of PdM timing. (a) Mean COP trajectories aligned to maintenance at t = 0, with 95% bands. (b) Per-episode 30-day $E_{save}$ distributions (kWh) by class. (c) Annualized HVAC energy savings percentage with 95% CIs from resampling days and COP trajectories.

Figure 6. Policy sensitivity and modeling ablations. (a) Total downtime vs. threshold τ. (b) False-trigger incidence per 30 days vs. quantile q at τ = 10 days. (c) MAE by loss function with 95% CIs. (d) MAE vs. sequence length.

Figure 7. Peak-season robustness on the hold-out month. (a) MAE and RMSE per class during peak month vs. overall test. (b) LTA at τ = 10 days in peak month vs. rest. (c) Achieved lead times for chiller episodes during peak month.

Figure 8. Expected mis-scheduling penalty Π vs. calibration quantile q at three thresholds τ, with bootstrap 95% CIs.

Table 1. Dataset composition, data quality, and calibration outcomes across monitored subsystems over the two-year observation horizon.

Subsystem	Channels (n)	Sampling Interval	Coverage After Cleaning (%)	Imputed Short Gaps (%)	Hard-Fault Exclusions (%)	Daily Power Reconciliation (Mean ± SD, %)	Quarterly Spot-Calibration Within Tolerance (n/N)
Chilled-water plant (chillers, headers)	56	1 min	98.2	1.1	0.7	1.3 ± 0.6	22/24
Pumps (primary loop)	18	1 min	97.8	1.4	0.8	1.5 ± 0.7	18/20
AHUs (supply/return, valves, DP, fans)	57	1 min	97.1	2.1	0.8	1.7 ± 0.8	28/32
Ambient and weather	14	1 min	99.4	0.4	0.2	0.9 ± 0.4	8/8
Total/weighted	145	1 min	97.6	1.6	0.8	1.6 ± 0.7	76/84

Table 2. Failure events, RUL label horizons, and censoring by component class.

Component Class	Assets (n)	Failures (n)	Median Label Horizon per Failure (Days)	Interquartile Range (Days)	Right-Censored Assets (n)	Right-Censored Hours (k)
Chillers (compressor-related)	3	7	23.5	16.2–31.8	1	6.7
Primary pumps (bearing/motor)	6	12	18.9	12.3–24.4	2	9.1
AHU fans (motor/drive)	22	18	16.1	10.6–21.5	7	22.4
Total	31	37	—	—	10	38.2

Table 3. Sequence counts and operating-mode shares by split and component class.

Component Class	Train Sequences (n)	Val Sequences (n)	Test Sequences (n)	Mode Share (Train: Low/Shoulder/Peak/Transient, %)	Mode Share (Test: Low/Shoulder/Peak/Transient, %)
Chillers	17,284	5828	5761	16/41/35/8	18/37/37/8
Primary pumps	21,036	7011	6982	17/38/37/8	16/39/37/8
AHU fans	82,414	27,510	27,236	18/39/35/8	17/40/35/8
Total	120,734	40,349	39,979	17/39/36/8	17/39/36/8

Table 4. RUL prediction accuracy on the held-out test set. Values are mean ± 95% CI from block bootstrap (24 h blocks).

Component Class	MAE (Hours)	RMSE (Hours)	Median Absolute Error (Hours)	Quantile Calibration Error (QCE) at q = 0.25 (Abs. Difference)
Chillers	24.7 ± 3.8	36.9 ± 5.6	19.1	0.05
Primary pumps	19.6 ± 2.9	28.8 ± 4.1	15.8	0.04
AHU fans	17.3 ± 2.2	25.7 ± 3.4	13.6	0.06
Weighted total	19.8 ± 2.1	29.1 ± 3.3	15.0	0.05
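
The intervals in Table 4 come from a block bootstrap with 24 h blocks. As a rough illustration of how such an interval could be obtained for MAE, the sketch below resamples contiguous blocks of absolute errors with replacement and takes a percentile interval. The function name `block_bootstrap_mae`, the non-overlapping block scheme, the resample count, and the synthetic hourly error series are all assumptions made for the example; the exact resampling procedure used in the study is not restated here.

```python
import random
import statistics


def block_bootstrap_mae(abs_errors, block_len, n_resamples=500, seed=0):
    """Percentile 95% CI for MAE, resampling contiguous blocks with replacement.

    Assumption: non-overlapping blocks of `block_len` samples; only the block
    duration (24 h) is taken from the table caption.
    """
    rng = random.Random(seed)
    # Split the error series into contiguous blocks (a ragged tail is dropped).
    n_blocks = len(abs_errors) // block_len
    block_sums = [
        sum(abs_errors[i * block_len:(i + 1) * block_len]) for i in range(n_blocks)
    ]
    maes = []
    for _ in range(n_resamples):
        # Draw n_blocks block indices with replacement and pool their errors.
        picks = [rng.randrange(n_blocks) for _ in range(n_blocks)]
        maes.append(sum(block_sums[j] for j in picks) / (n_blocks * block_len))
    maes.sort()
    lo = maes[int(0.025 * (n_resamples - 1))]
    hi = maes[int(0.975 * (n_resamples - 1))]
    point = statistics.fmean(abs_errors[:n_blocks * block_len])
    return point, lo, hi


# Synthetic hourly absolute errors (hours) over 60 days; 24 samples = one 24 h block.
_rng = random.Random(42)
errors = [abs(_rng.gauss(0.0, 25.0)) for _ in range(60 * 24)]
mae, ci_lo, ci_hi = block_bootstrap_mae(errors, block_len=24)
print(f"MAE = {mae:.1f} h, 95% CI [{ci_lo:.1f}, {ci_hi:.1f}]")
```

Resampling whole blocks rather than individual samples keeps within-day autocorrelation intact, which is the usual motivation for a block scheme on telemetry-derived errors.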

Table 5. Mode-stratified RUL errors (hours) with 95% CIs (block bootstrap, 24 h blocks).

Component Class	Mode	MAE (Hours)	RMSE (Hours)
Chillers	Low	22.8 ± 4.5	33.6 ± 6.8
Chillers	Shoulder	24.9 ± 3.9	36.2 ± 5.7
Chillers	Peak	26.1 ± 4.1	39.4 ± 6.1
Pumps	Low	18.1 ± 3.1	27.3 ± 4.6
Pumps	Shoulder	19.3 ± 2.7	28.6 ± 4.0
Pumps	Peak	21.5 ± 3.0	30.5 ± 4.3
Fans	Low	16.0 ± 2.4	23.1 ± 3.5
Fans	Shoulder	17.2 ± 2.1	25.4 ± 3.2
Fans	Peak	18.9 ± 2.5	27.2 ± 3.7

Table 6. Comparison of baseline fixed-interval program and PdM trigger policy at τ = 10 days, q = 0.25. Values are means over test horizon with 95% CIs; p-values from paired block bootstrap.

Metric	Baseline (Mean [CI])	PdM (Mean [CI])	Absolute Difference	Relative Change (%)	p-Value
Unplanned outages (count)	21 [17, 25]	11 [9, 14]	−10	−47.6	0.008
Total downtime (hours)	172 [145, 200]	101 [83, 123]	−71	−41.3	0.012
Planned:unplanned ratio	0.74 [0.62, 0.88]	1.68 [1.41, 1.95]	+0.94	+127.0	0.004
Mean lead time at trigger (days)	—	11.8 [10.7, 12.9]	—	—	—

Table 7. Annualized cost components under baseline and PdM policies at τ = 10 days, q = 0.25; values in SAR (thousands), mean [95% CI].

Cost Component	Baseline (kSAR)	PdM (kSAR)	Difference (kSAR)	Relative Change (%)
$C_{energy}$	8460 [7980, 8920]	7570 [7140, 8010]	−890	−10.5
$C_{down}$	520 [430, 610]	310 [250, 380]	−210	−40.4
$C_{maint}$	1140 [1050, 1240]	1260 [1160, 1360]	+120	+10.5
$C_{total}$	10,120 [9570, 10,640]	9140 [8650, 9640]	−980	−9.7
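
As a quick consistency check on Table 7, the short sketch below recomputes the totals, differences, and relative changes from the tabulated mean values. The dictionary keys and the helper name `relative_change` are illustrative; only the means are used, since interval bounds are not expected to add up across components.

```python
# Mean values transcribed from Table 7 (kSAR): (baseline, PdM).
costs = {
    "energy": (8460, 7570),
    "down": (520, 310),
    "maint": (1140, 1260),
}

baseline_total = sum(b for b, _ in costs.values())
pdm_total = sum(p for _, p in costs.values())


def relative_change(baseline, pdm):
    """Percent change of the PdM value relative to the baseline value."""
    return 100.0 * (pdm - baseline) / baseline


for name, (b, p) in costs.items():
    print(f"C_{name}: diff = {p - b:+d} kSAR, change = {relative_change(b, p):+.1f}%")
print(
    f"C_total: {baseline_total} -> {pdm_total} kSAR, "
    f"change = {relative_change(baseline_total, pdm_total):+.1f}%"
)
```

The recomputed totals (10,120 and 9140 kSAR) and relative changes agree with the tabulated values to the stated precision.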

Table 8. Edge outcomes and observed contributors during the test horizon.

Outcome Class	Count (n)	Affected Component Classes	Primary Contributors	Notes from Technician Records
Late trigger (<5 days)	4	2 chillers, 1 pump, 1 fan	Abrupt step change in COP due to control override; sudden bearing spallation	Temporary manual setpoint clamp before a VIP event; pump cavitation after valve malfunction
False negative	2	1 chiller, 1 fan	Right-censored training episodes for specific failure modes; masked telemetry during overlapping outage	Chiller trip coincident with upstream condenser water issue; fan failure during fire drill sequence
Inconsequential trigger	5	1 chiller, 2 pumps, 2 fans	Conservative q and high τ under low-load mode yielding inspection without parts replacement	Preventive lubrication performed; no parts replaced; follow-up showed stable metrics

Table 9. Comparative baselines (non-recurrent and recurrent) versus the configured LSTM ensemble on the held-out test set. Values are mean ± 95% CI from 24 h block bootstrap; LTA computed at τ = 10 days.

Method	MAE (Hours)	LTA at τ = 10 Days	Notes on Construction
Aggregated FFN (no recurrence)	25.6 ± 3.1	0.68 ± 0.05	Rolling means, slopes, variances over 12 h windows
Rule-based indices	31.9 ± 4.4	0.54 ± 0.06	Thresholded $a_{RMS}$, $\Delta T_{coil}$, and COP deltas
GRU (param-matched)	20.7 ± 2.3	0.77 ± 0.04	Two-layer GRU (128–64), dropout and norm matched
TCN (param-matched)	21.4 ± 2.5	0.75 ± 0.05	8 residual blocks, receptive field ≈ 720 min
Transformer (lightweight)	22.6 ± 2.8	0.73 ± 0.05	2-layer encoder, 2 heads, embedding ≈ LSTM hidden
LSTM ensemble (configured)	19.8 ± 2.1	0.79 ± 0.04	Two-layer LSTM, isotonic calibration, sustained trigger

Table 10. Inference performance and throughput on the export container.

Metric	Value (Mean ± SD)	Conditions
Per-sequence inference time (ms)	36.4 ± 4.1	NVIDIA GPU, batch size 256, sequence length 720
End-to-end latency (ms)	184 ± 26	Includes preprocessing (Equations (1)–(8)) and calibration map
Max sustained throughput (sequences/s)	1540 ± 110	Steady stream, 95th percentile
Memory footprint (GB)	2.6 ± 0.1	Loaded ensemble and transforms
Trigger computation CPU time (ms)	3.8 ± 0.5	Sustained condition, 12 h window
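
The last row of Table 10 refers to a sustained trigger condition evaluated over a 12 h window. One plausible reading, sketched below, is that a trigger fires once the calibrated lower-quantile RUL has remained at or below the threshold τ for every sample in the trailing window. That rule, the function name `sustained_trigger`, and the hourly synthetic series are assumptions for illustration rather than the study's exact logic.

```python
def sustained_trigger(rul_days, tau_days=10.0, window=12):
    """Return the first index at which RUL <= tau has held for `window`
    consecutive samples, or None if the condition is never sustained.
    """
    run = 0  # length of the current run of consecutive below-threshold samples
    for i, rul in enumerate(rul_days):
        run = run + 1 if rul <= tau_days else 0
        if run >= window:
            return i
    return None


# Synthetic hourly calibrated RUL (days): a brief dip below tau that recovers,
# followed by a sustained decline.
rul = [14.0] * 5 + [9.5] * 3 + [12.0] * 4 + [9.8 - 0.05 * k for k in range(30)]
idx = sustained_trigger(rul, tau_days=10.0, window=12)
print(idx)  # the 3-sample dip is ignored; the trigger fires 12 samples into the decline
```

Requiring the condition to persist suppresses triggers from short excursions, at the cost of delaying the alert by up to one window length.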

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
