Article

Explainable Smart-Building Energy Consumption Forecasting and Anomaly Diagnosis Framework Based on Multi-Head Transformer and Dual-Stream Detection

1 School of Physics and Optoelectronics, Xiangtan University, Xiangtan 411105, China
2 School of Automation and Electronic Information, Xiangtan University, Xiangtan 411105, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(8), 3836; https://doi.org/10.3390/app16083836
Submission received: 26 March 2026 / Revised: 12 April 2026 / Accepted: 13 April 2026 / Published: 15 April 2026
(This article belongs to the Special Issue Emerging Applications of AI and Machine Learning in Industry)

Abstract

Fine-grained energy management in smart-campus buildings requires accurate load forecasting together with reliable and interpretable anomaly diagnosis. This study presents an integrated forecasting–diagnosis framework for building energy systems. Hourly energy demand is modeled using a Transformer-based sequence-to-sequence architecture, in which a domain-aware attention mechanism is introduced to separately represent historical consumption dynamics, environmental influences, and temporal regularities commonly observed in building energy use. Anomaly diagnosis is conducted through a dual-scale strategy that supports both the timely detection of abrupt abnormal events and the identification of gradual performance degradation. Short-term anomalies are detected from forecasting residuals using adaptive thresholds, while long-term anomalies are identified by comparing current residual patterns with same-season historical baselines and validating multi-window trends over a 48 h horizon. The two detection streams are jointly used to distinguish point, pattern, and composite anomalies. To support practical operation and maintenance, SHAP-based explanations are provided to interpret both energy predictions and detected anomalies. Case studies on two educational buildings from the Building Data Genome Project 2 demonstrate that the proposed framework achieves the best overall forecasting performance against both conventional baselines and stronger recent Transformer-based models, with mean absolute percentage errors of approximately 3%. The results indicate that the proposed framework provides a practical solution for data-driven energy monitoring and decision support in smart buildings.

1. Introduction

The global push toward carbon neutrality has placed energy efficiency in the built environment at the forefront of emission-reduction strategies. Buildings and the construction sector remain major contributors to global energy consumption and greenhouse gas emissions, accounting for approximately 32% of global energy use and 34% of CO2 emissions [1]. Campus building clusters represent a typical yet challenging energy-management scenario: they include heterogeneous building types, behavior-driven demand patterns, schedules governed by academic calendars, and relatively clear management boundaries. Pronounced seasonality (e.g., heating and cooling periods) and strong periodicities (weekday-weekend differences, teaching versus holiday schedules) complicate energy-demand modeling, while comprehensive campus metering and monitoring infrastructures provide a practical foundation for data-driven forecasting, anomaly diagnosis, and energy optimization. Improving the accuracy and reliability of forecasting and anomaly detection for campus buildings can therefore deliver immediate energy-saving benefits and produce transferable methodologies for broader public and institutional building portfolios.
Energy forecasting is a prerequisite for demand response, energy optimization, and equipment health management [2,3]. Traditional statistical models such as ARIMA are constrained by linear assumptions and often struggle to capture nonlinear couplings among weather, occupancy, and operational schedules [2]. Consequently, data-driven building energy forecasting has evolved from classical machine learning and recurrent architectures toward attention-based deep learning to better model long-range dependencies and complex exogenous interactions [2]. Related evidence from adjacent urban-energy forecasting further highlights the importance of heterogeneous exogenous information: Cavus et al. developed a hybrid deep learning framework for EV charging-demand forecasting in smart cities, showing that behavioral, demographic, and spatial features can support meaningful and scalable demand prediction even in the absence of high-resolution temporal inputs [4]. Recently, Transformer-based long-term series forecasting (LTSF) has gained increasing attention due to its capability to model long sequences and rich temporal structures, as summarized by systematic LTSF Transformer reviews [5]. Representative architectures include Informer with efficient ProbSparse attention, Autoformer with progressive decomposition and an auto-correlation mechanism, and FEDformer with frequency-enhanced decomposition [6]. Patch-based and channel-independent designs such as PatchTST further improve efficiency and accuracy for multivariate long-horizon forecasting [7]. These advances suggest Transformers can serve as strong backbones for building energy forecasting—provided that domain heterogeneity and interpretability requirements are explicitly addressed.
A key domain challenge is heterogeneous feature composition: energy signals, environmental measurements (e.g., temperature and humidity), and calendar/time indicators differ in meaning, scale, and causal roles [2]. When standard multi-head self-attention mixes all channels in a shared attention space, heterogeneous variables may compete within the same attention distribution, leading to semantic entanglement and unstable explanations. Prior work addresses heterogeneity mainly via (i) feature-/variable-aware attention that reweights inputs through variable-importance weighting or channel-wise gating [8] and (ii) semantic grouping with group-wise encoding-fusion that separates sources (e.g., load-weather-calendar) into dedicated branches (or attention groups) before fusion, reported in both building and smart-grid forecasting [9]. Decomposition-oriented Transformers provide complementary inductive bias on temporal structure but do not explicitly constrain cross-domain feature interactions [10].
Rather than introducing a fundamentally new attention operator, the proposed DS-MHA should be understood as a structured, building-oriented extension of existing feature-aware reweighting and semantic-grouping strategies. Its main distinction lies not in redefining the self-attention formula itself, but in where and how semantic separation is enforced.
Specifically, feature-aware reweighting or channel-wise gating methods mainly recalibrate the importance of heterogeneous inputs before or within shared feature interaction, while semantic grouping or branch-based fusion methods separate source groups at the branch level and then fuse them downstream. By contrast, DS-MHA imposes semantic constraints directly at the attention-head level within a unified attention module: energy-, environment-, and time-related features are routed to dedicated heads through semantic masks before fusion. In this sense, the proposed design makes cross-domain interaction control more explicit within the attention mechanism itself while preserving a single Transformer backbone.
The practical value of this design is therefore not to claim a fundamentally new theoretical operator but to provide a more explicit and interpretable architectural instantiation of domain-semantic separation for smart-building forecasting. This head-level formulation also enables head-wise ablation and interpretation, which are more directly aligned with the energy/environment/time decomposition used in the present study.
Beyond forecasting, anomaly detection and diagnosis are critical for the safe and efficient operation of building energy systems. Liu et al. [11] represent the line of classical unsupervised learning through isolation-based outlier detection. Himeur et al. [12] reviewed anomaly detection for building energy consumption and showed that the field has evolved from statistical and machine-learning methods to more recent AI-based approaches. Jia et al. [13] further summarized recent advances in deep anomaly detection for time series, particularly methods based on reconstruction error and prediction residuals. Therefore, smart-building anomaly detection has formed a mature methodological landscape spanning classical unsupervised learning, building-energy-specific review perspectives, and deep learning approaches for time-series anomalies. Surveys and HVAC-focused FDD reviews consistently emphasize that no single family dominates across all anomaly types; practical performance depends on anomaly morphology (point vs. collective/pattern), nonstationarity and concept drift, interpretability requirements, and deployment constraints [14].
Recent reviews have further highlighted the importance of season-aware calibration and context-aligned baselines for reducing false alarms under regime shifts [14]. An empirical example of this line is the seasonal-threshold design for prediction-based outlier detection in building energy data reported in [15]. Building on these mature lines, we propose a dual-stream detector that couples (a) a short-scale residual-based stream with a dynamically updated threshold for rapid point-outlier alerts and (b) a long-scale stream that performs baseline-and-trend validation under matched seasonal and weekday-type contexts, using persistence and monotonic-trend verification over consecutive windows, as a seasonally aligned evidence-fusion layer for operational diagnosis.
Explainable AI (XAI) provides an important pathway to mitigate the “black-box” limitation of deep models. In the context of smart-campus energy applications, Sadeeq [16] introduced the XDL-Energy framework, which combines energy forecasting with anomaly justification and demonstrates the practical value of explanation in linking model outputs to abnormal conditions. For building energy prediction, Ul Haq et al. [17] proposed an explainable attention-based LSTM framework that enhances transparency by quantifying and interpreting feature contributions from the supply-side perspective. At the building energy management level, Teixeira et al. [18] developed an explainable AI framework for reliable and transparent automated energy management, stressing that interpretability is essential for trustworthy operational decision support. Taken together, these studies show that XAI has been used not only to interpret forecasting drivers but also to improve confidence in operational applications. Nevertheless, explainability in most prior work is still treated as an additional analysis component rather than being tightly integrated with the full pipeline from forecasting to anomaly detection and diagnosis. As a result, frameworks that simultaneously support accurate forecasting, multi-type anomaly detection, and feature-level attribution for root-cause analysis are still scarce.
To provide a more structured positioning of the state of the art, Table 1 summarizes representative recent empirical and framework-oriented studies related to building-energy forecasting, anomaly detection, and explainability, with emphasis on their model design, application domain, data basis, evaluation focus, and main limitations.
Table 1 shows that representative recent empirical studies have typically advanced one aspect of the problem—such as heterogeneous-feature forecasting, residual- or baseline-based anomaly detection, or explainable analysis—but rarely integrate explicit semantic interaction control, seasonally aligned multi-scale anomaly evidence, and diagnosis-oriented attribution within one pipeline.
Motivated by the above observations across forecasting, anomaly detection, and explainability, we summarize the remaining gaps as follows. Despite rapid progress, three gaps still hinder practical and trustworthy deployment in campus buildings: (1) Limited domain-semantic inductive bias for heterogeneous inputs. Most Transformer forecasting advances focus on efficiency and temporal decomposition [19], but rarely constrain interactions among energy, environment, and time/calendar semantics, which can weaken robustness and explanation stability under heterogeneous inputs. (2) Insufficient anomaly-morphology coverage under seasonality and drift. Single-scale detectors often trade off fast spike detection against early degradation discovery, and seasonal regime shifts increase false alarms when evidence is not seasonally aligned. (3) Weak evidence chain from prediction to diagnosis. Post hoc explanations are often detached from actionable detection evidence (type, persistence, severity, trend), limiting root-cause utility. In addition, in the current study, anomaly source identification (e.g., equipment-/occupancy-/environment-related) is not formulated as a supervised classification task. This is because the evaluated public dataset does not provide event-level anomaly annotations, directly validated source labels, or maintenance-confirmed records; moreover, anomaly labels in practical building operation data are often sparse, incomplete, and difficult to align unambiguously with aggregate energy deviations [20]. Therefore, anomaly source identification is treated here as an attribution-guided, operator-in-the-loop triage procedure for practical decision support, rather than as an automated classifier.
To address these gaps, we propose an explainable forecasting-detection-diagnosis framework that integrates two explicitly structured mechanisms for smart-building applications. First, the proposed domain-semantic constrained attention (DS-MHA) can be viewed as a building-oriented extension of existing feature-grouping and feature-aware attention strategies, with the main distinction that semantic separation is imposed directly at the attention-head level. Second, the proposed seasonal-aligned dual-stream detector combines short-scale residual alerting with long-scale baseline-and-trend validation under matched seasonal and weekday contexts, thereby providing more comprehensive anomaly evidence for both abrupt and gradual deviations. Third, by linking forecasting outputs, structured anomaly evidence, and SHAP (SHapley Additive exPlanations)-based attribution in one workflow, the framework provides a more diagnosis-oriented use of explainability than prior post hoc add-on schemes.

2. Methods

This section details the proposed explainable framework, including the multi-head Transformer forecasting module, the dual-stream anomaly detection module, and the SHAP-based explainability module, as shown in Figure 1. The design integrates domain knowledge (feature subspaces and multi-scale anomalies) with data-driven learning to improve forecasting accuracy, broaden anomaly coverage, and provide interpretable diagnostic evidence for deployment in smart-campus energy management.

2.1. Framework Overview

The proposed framework consists of three synergistic modules: (1) multi-head Transformer forecasting module: processes historical multimodal data and predicts future energy consumption; (2) dual-stream anomaly detection module: identifies diverse anomalies through complementary short-scale and long-scale detection mechanisms; (3) SHAP-based explainability module: provides global explanations of the forecasting mechanism and local attribution for anomaly causes. The framework processes data streams with an hour as the basic time unit. The prediction module takes 24 h of historical data as input and generates energy consumption predictions for the next hour. The anomaly detection module runs continuously, identifying deviations that exceed preset thresholds by comparing actual measurements with predicted values. When anomalies are detected, the explainable module generates attribution analysis to assist operation and maintenance personnel in quickly locating the root cause of the problem. This end-to-end design ensures a complete workflow from data input to decision support, providing a comprehensive solution for energy management in campus buildings.

2.2. Multi-Head Transformer Forecasting Module

This subsection describes the forecasting component of the proposed framework in a step-by-step manner. We first introduce the input representation and feature embedding, then present the domain-specific multi-head attention mechanism, followed by the encoder layer and the masked decoder used for one-step prediction. This organization is intended to guide the reader from data representation to feature interaction modeling and finally to forecast generation.

2.2.1. Input Representation and Embedding Layer

The forecasting process begins with constructing a structured representation of the historical input window. Let the historical window length be L = 24 (hourly). Each time step contains F = 43 input features, forming the input tensor as defined in Equation (1):
X \in \mathbb{R}^{B \times L \times F}
where B is the batch size. The input features are divided into three semantic groups: (1) energy-related features (historical load and lag/rolling statistics), (2) environmental features (weather variables), and (3) time semantic features (hour, weekday/weekend, season, etc.). A complete list of the 43 input features is provided in Table 2 to improve transparency and reproducibility.
This grouping provides the basis for the domain-specific attention design. First, the F-dimensional vector at each time step is linearly projected to obtain a $d_{\text{model}}$-dimensional embedding (Equation (2)).
E = X W_e + b_e, \quad E \in \mathbb{R}^{B \times L \times d_{\text{model}}}
To preserve temporal positional information, a sinusoidal positional encoding P is added to the embedding, as shown in Equation (3):
Z = E + P, \quad P \in \mathbb{R}^{L \times d_{\text{model}}}
where the positional encoding is defined by sine/cosine functions at different frequencies, as shown in Equations (4) and (5), enabling the model to distinguish different time positions within the historical window.
P_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
P_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
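As a concrete illustration, Equations (4) and (5) can be computed in a few lines of pure Python (a minimal sketch; the 24 × 8 size below is illustrative and does not reflect the model's actual $d_{\text{model}}$):

```python
import math

def positional_encoding(L, d_model):
    """Sinusoidal positional encoding: even channels use sin, odd channels
    use cos, with wavelengths growing geometrically with the channel index."""
    P = [[0.0] * d_model for _ in range(L)]
    for pos in range(L):
        for i in range(0, d_model, 2):  # i plays the role of 2i in Eq. (4)
            angle = pos / (10000 ** (i / d_model))
            P[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                P[pos][i + 1] = math.cos(angle)
    return P

P = positional_encoding(24, 8)  # one 24 h window; d_model = 8 for illustration
```

Each row `P[pos]` is then added element-wise to the corresponding embedding row, as in Equation (3).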

2.2.2. Domain-Specific Multi-Head Attention (DS-MHA)

After obtaining the embedded input representation, the next step is to model interactions among heterogeneous feature groups through a domain-specific attention mechanism. In standard self-attention, linear projections are applied to obtain the query, key, and value matrices, as shown in Equation (6):
Q = Z W_Q, \quad K = Z W_K, \quad V = Z W_V
The scaled dot-product attention is computed as given in Equation (7):
\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V
To enable different attention heads to focus on distinct domain information and reduce feature interference, three parallel branches are introduced: an energy head, an environment head, and a time head. The energy head applies a feature mask to isolate energy-related inputs (e.g., lag_1h, lag_24h, lag_168h, roll_mean_24h, roll_max_24h), capturing temporal dependencies across multiple horizons and learning variations such as weekday/weekend differences and daily peak–valley shifts. The environment head uses a feature mask to process only environmental features (e.g., dry-bulb temperature, wind speed), learning correlations between weather conditions and energy consumption and adapting to the impact of external environmental changes. The time head focuses on temporal semantic features (hour, day of week, season, weekend flag) to capture periodic patterns, including daily, weekly, and seasonal cycles, and to identify energy-use behaviors under different time periods, such as working hours, nights, weekends, and holidays.
The core is to construct three mask vectors and filter the embedded representation channel-wise, as shown in Equation (8):
Z^{(e)} = Z \odot m^{(e)}, \quad Z^{(env)} = Z \odot m^{(env)}, \quad Z^{(t)} = Z \odot m^{(t)}
The three branches then perform multi-head attention, as shown in Equation (9):
H^{(e)} = \mathrm{MHA}\big(Z^{(e)}\big), \quad H^{(env)} = \mathrm{MHA}\big(Z^{(env)}\big), \quad H^{(t)} = \mathrm{MHA}\big(Z^{(t)}\big)
Finally, the three outputs are concatenated along the channel dimension and linearly fused back to d model , as shown in Equation (10):
H = \mathrm{Concat}\!\big(H^{(e)}, H^{(env)}, H^{(t)}\big)\, W_O
Compared with branch-level semantic grouping, the present design enforces semantic separation inside a single multi-head attention block so that each head is explicitly tied to a predefined domain subspace while the overall Transformer backbone remains unified. This formulation was chosen to make semantic interaction control more transparent and to support direct head-wise ablation and interpretation.
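A minimal numerical sketch of the head-level routing in Equations (8)–(10) is given below (NumPy, one head per branch, random weights, identity Q/K/V projections; the 4/4/4 channel split is purely illustrative and is not the paper's actual feature layout):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Z):
    # Scaled dot-product self-attention (Eq. (7)); identity projections
    # stand in for W_Q, W_K, W_V to keep the sketch short.
    scores = Z @ Z.T / np.sqrt(Z.shape[-1])
    return softmax(scores) @ Z

rng = np.random.default_rng(0)
L, d_model = 24, 12
Z = rng.standard_normal((L, d_model))

# Binary channel masks routing embedding channels to the three domains (Eq. (8)).
m_e   = np.array([1.0] * 4 + [0.0] * 8)              # energy channels
m_env = np.array([0.0] * 4 + [1.0] * 4 + [0.0] * 4)  # environment channels
m_t   = np.array([0.0] * 8 + [1.0] * 4)              # time channels

# Per-domain attention (Eq. (9)), then concatenate and fuse back to d_model (Eq. (10)).
H = [attention(Z * m) for m in (m_e, m_env, m_t)]
W_O = rng.standard_normal((3 * d_model, d_model)) / np.sqrt(3 * d_model)
fused = np.concatenate(H, axis=-1) @ W_O
```

Because each branch sees only its own masked channels, head-wise ablation reduces to zeroing one branch's output before the fusion step.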

2.2.3. Transformer Encoder Layer

Based on the attention-enhanced feature representation, the encoder is then used to capture global temporal dependencies across the historical window. Each encoder layer consists of a domain-specific multi-head attention (DS-MHA) module followed by a position-wise feed-forward network (FFN), both equipped with residual connections and layer normalization. Specifically, the output of the DS-MHA sublayer is computed via the residual connection and normalization shown in Equation (11), yielding the intermediate representation $\tilde{Z}$. Then, the FFN sublayer is applied to $\tilde{Z}$ and combined with another residual connection and layer normalization as defined in Equation (12), producing the final output $Z'$ of the encoder layer.
\tilde{Z} = \mathrm{LayerNorm}\!\left(Z + \mathrm{Dropout}\big(\mathrm{DS\text{-}MHA}(Z)\big)\right)
Z' = \mathrm{LayerNorm}\!\left(\tilde{Z} + \mathrm{Dropout}\big(\mathrm{FFN}(\tilde{Z})\big)\right)
The FFN is a position-wise nonlinear mapping defined in Equation (13):
\mathrm{FFN}(u) = W_2\, \sigma(W_1 u + b_1) + b_2
where $\sigma(\cdot)$ denotes the ReLU activation.
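The residual-plus-normalization pattern of Equations (11)–(13) can be sketched as follows (NumPy; dropout is omitted and random weights are used, so this illustrates the computation only, not trained behavior):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector to zero mean, unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(u, W1, b1, W2, b2):
    # Position-wise FFN with ReLU, Eq. (13): W2 * ReLU(W1 u + b1) + b2
    return np.maximum(u @ W1 + b1, 0.0) @ W2 + b2

rng = np.random.default_rng(1)
L, d, d_ff = 24, 12, 48
Z_tilde = rng.standard_normal((L, d))  # stands in for the DS-MHA sublayer output
W1, b1 = rng.standard_normal((d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d)), np.zeros(d)

# FFN sublayer with residual connection and layer normalization, Eq. (12)
Z_out = layer_norm(Z_tilde + ffn(Z_tilde, W1, b1, W2, b2))
```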

2.2.4. Masked Decoder and Prediction Head (Encoder–Decoder Architecture)

After encoding the historical context, the decoder maps the learned representation to the final next-step load prediction. This work focuses on one-step forecasting (H = 1, i.e., predicting the next hour) using an encoder–decoder structure with a masked decoder. Compared with encoder-only prediction, the masked decoder improves robustness and provides a unified architecture that can be naturally extended to multi-step forecasting. The decoder consists of three main components. First, masked self-attention with a causal mask prevents each position from attending to future information, ensuring strict temporal causality. The causal mask and the resulting masked self-attention output are given in Equations (14) and (15):
M_{ij} = \begin{cases} 0, & j \le i \\ -\infty, & j > i \end{cases}
D_1 = \mathrm{MHA}(Y, Y, Y;\, M)
Second, cross-attention allows the decoder to attend to the encoder output, integrating historical information into the decoding process, as shown in Equation (16):
D_2 = \mathrm{MHA}(D_1, Z, Z)
Third, a feed-forward network (FFN) with residual connections and layer normalization, identical to the encoder, enhances nonlinear representation while stabilizing training. The decoder output is finally mapped to the next-step scalar forecast through an MLP head, as defined in Equation (17):
\hat{y}_{t+1} = f_{\text{out}}(d_t), \quad f_{\text{out}} = \mathrm{MLP}\big(\mathrm{Linear}(d_{\text{model}} \to 32) \to \mathrm{ReLU} \to \mathrm{Linear}(32 \to 1)\big)
During inference, the model takes the last available historical window as input and outputs the next-hour prediction $\hat{y}_{t+1}$. When extended to multi-step forecasting, the predicted values can be fed back iteratively to maintain a generative forecasting structure.
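The causal mask can be built directly from its definition; the matrix below is added to the attention logits before the softmax so that forbidden positions receive zero attention weight (NumPy sketch):

```python
import numpy as np

L = 24
# M[i, j] = 0 where position i may attend to j (j <= i), -inf otherwise (j > i)
M = np.where(np.tril(np.ones((L, L), dtype=bool)), 0.0, -np.inf)
```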
The forecasting module provides the prediction baseline for the subsequent diagnosis stage. Specifically, the discrepancy between predicted and observed energy consumption is used as the core signal for the anomaly detection module described in the next subsection.

2.3. Dual-Stream Anomaly Detection Module

Building on the forecasting results, this subsection details how anomaly evidence is generated from prediction residuals. To address the diverse anomaly patterns in smart-building operation, the detector employs a dual-stream structure consisting of a short-scale stream for abrupt deviations and a long-scale stream for gradual deviations, followed by a decision-fusion stage to integrate the two types of evidence. Specifically designed for 24 × 7 continuous monitoring of campus energy systems, this module identifies both sudden point anomalies and gradual pattern anomalies using prediction residuals. By integrating fast response and long-term trend analysis, it is intended to provide broader anomaly coverage and more structured diagnostic evidence than a single-threshold design.

2.3.1. Short-Scale Threshold Stream (Point Anomaly Detection)

The short-scale stream aims to rapidly detect sudden point anomalies caused by abrupt faults, abnormal events, or sensor spikes. The core idea is to use a rolling-window residual distribution to compute a dynamic threshold that adapts to recent operating conditions. First, the absolute prediction error at each time step is computed as in Equation (18):
e_t = \left| y_t - \hat{y}_t \right|
Then, a dynamic threshold is estimated from the most recent 4 h error window, as defined in Equation (19):
\mathrm{Threshold}_t = \mu_s + k\,\sigma_s
where $\mu_s$ and $\sigma_s$ denote the mean and standard deviation within the 4 h error window, respectively, and k = 4. The main benchmark uses this default setting, while Section 4.4 varies k in a sensitivity analysis. When the error exceeds this dynamic threshold for 2 consecutive hours, the system identifies it as a point anomaly. This process primarily targets scenarios such as sudden equipment failure, transient sensor anomalies, and unexpected load changes.
This method features adaptability and low computational complexity, making it particularly suitable for continuous real-time detection in the distributed monitoring system of campus buildings. As the first line of defense for anomaly detection, the short-scale threshold stream can quickly identify sudden anomalous events that significantly deviate from normal patterns.
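The rolling-threshold logic above can be sketched in a few lines of pure Python (an illustrative sketch, not the deployed implementation; note that flagged residuals are excluded from the rolling window here, a choice the text does not specify, so that a spike does not inflate its own threshold):

```python
import statistics
from collections import deque

def short_scale_alerts(errors, k=4.0, window=4, persist=2):
    """Flag point anomalies when the absolute residual exceeds the
    mu + k*sigma threshold of the trailing `window`-hour error
    distribution for `persist` consecutive hours (Eqs. (18)-(19))."""
    buf = deque(maxlen=window)
    streak, flags = 0, []
    for e in errors:
        if len(buf) == window:
            threshold = statistics.mean(buf) + k * statistics.pstdev(buf)
            exceeded = e > threshold
        else:
            exceeded = False  # warm-up: no alarms until the window fills
        streak = streak + 1 if exceeded else 0
        flags.append(streak >= persist)
        if not exceeded:
            buf.append(e)  # keep the baseline window clean of anomalous residuals
    return flags

# Eight normal hours followed by a two-hour spike trigger one alert
flags = short_scale_alerts([0.1] * 8 + [5.0, 5.0])
```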

2.3.2. Long-Scale Threshold Stream (Pattern Anomaly Detection)

The long-scale stream focuses on pattern anomalies, i.e., gradual deviations that evolve over days to weeks (e.g., progressive efficiency degradation). Such anomalies may not exceed the short-scale threshold at early stages but can be identified by sustained deviation from historical normal error patterns. Principle and implementation: the core idea is to build a “normal error baseline” and detect anomalies by comparing current errors with historical normal errors. The workflow is as follows:
(1)
Historical error baseline: from historical normal-operation data, extract 20 historical windows (each 48 h) that match the same season and the same weekday type (weekday/weekend) as the current period. For each historical window $i$, we compute its mean absolute prediction error $\mu_{\text{err},i}$. The baseline error distribution is then summarized by the mean and standard deviation across the 20 windows, as defined in Equations (20) and (21):
\mu_{\text{baseline}} = \frac{1}{20} \sum_{i=1}^{20} \mu_{\text{err},i}
\sigma_{\text{baseline}} = \sqrt{\frac{1}{19} \sum_{i=1}^{20} \left( \mu_{\text{err},i} - \mu_{\text{baseline}} \right)^2}
where $\mu_{\text{err},i}$ denotes the mean absolute error of the $i$-th historical window.
(2)
Current-window error: For the current 48 h window, we compute the mean absolute prediction error $\mu_{\text{current}}$ as in Equation (22):
\mu_{\text{current}} = \frac{1}{48} \sum_{t=1}^{48} \left| y_t - \hat{y}_t \right|
(3)
Anomaly indicator: We quantify the deviation of the current window from the historical baseline using the standardized score in Equation (23):
\mathrm{Deviation} = \frac{\mu_{\text{current}} - \mu_{\text{baseline}}}{\sigma_{\text{baseline}}}
Pattern anomalies typically manifest as steadily accumulating prediction errors that eventually deviate significantly from a historical normal-error baseline. To capture such progressive anomalies while reducing false alarms, a three-level decision rule is adopted. First, the current prediction error must exceed the historical baseline by two standard deviations, i.e., Deviation > z with z = 2. The main benchmark uses this default setting, while Section 4.4 varies z in a sensitivity analysis. Second, this condition must persist across two consecutive 48 h windows. Third, the error must exhibit an increasing trend, with the deviation in the second window larger than that in the first. As an illustrative example, consider a chiller experiencing gradual heat-exchanger fouling. As fouling accumulates, system efficiency declines and energy consumption increases. In the early stage, the deviation may not trigger short-scale thresholds; however, relative to same-period historical normal-operation data, prediction errors continue to accumulate. When the standardized deviation exceeds 2.0 in two consecutive 48 h windows and shows a sustained upward trend, a pattern anomaly is declared.
The long-scale stream targets progressive anomalies such as gradual HVAC efficiency degradation (e.g., heat-exchanger fouling or refrigerant leakage), increased energy consumption from lighting system aging, and deterioration of building envelope insulation. To ensure robustness under varying operating conditions, the historical baseline is updated seasonally. This seasonal update re-estimates the baseline as operating conditions shift—for instance, between summer cooling and winter heating—thereby maintaining consistent comparability for anomaly detection decisions over time.
Overall, the short-scale stream provides rapid alarms for sudden anomalies at a 4 h horizon, while the long-scale stream uses a same-season baseline and 48 h trend verification to detect gradual deviations. The complementary design improves coverage of diverse anomalies and reduces missed detections for slow degradation that is common in real building operations.
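The baseline comparison of Equations (20)–(23) and the three-level decision rule reduce to a few lines (pure-Python sketch; the baseline numbers used in the usage example are illustrative, not drawn from the evaluated dataset):

```python
import statistics

def standardized_deviation(current_errors, baseline_window_maes):
    """Eqs. (20)-(23): compare the current 48 h window MAE with the mean/std
    of 20 season- and weekday-matched historical window MAEs."""
    mu_b = statistics.mean(baseline_window_maes)
    sigma_b = statistics.stdev(baseline_window_maes)  # n-1 denominator, Eq. (21)
    mu_cur = statistics.mean(abs(e) for e in current_errors)  # Eq. (22)
    return (mu_cur - mu_b) / sigma_b  # Eq. (23)

def pattern_anomaly(dev_first, dev_second, z=2.0):
    # Three-level rule: both consecutive windows exceed z AND the
    # deviation is still increasing (sustained upward trend).
    return dev_first > z and dev_second > z and dev_second > dev_first

# Illustrative: 20 historical window MAEs near 1.0, current window MAE of 2.0
baseline = [1.0, 1.1, 0.9, 1.05, 0.95] * 4
dev = standardized_deviation([2.0] * 48, baseline)
```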

2.3.3. Decision Fusion

The two streams operate in parallel and produce short-scale and long-scale anomaly flags. The fusion logic combines their outputs to infer anomaly type and trigger timing. The short-scale stream has higher urgency and prioritizes immediate response, while the long-scale stream provides trend evidence to support maintenance planning and root-cause analysis. When both conditions are satisfied, the system reports a composite anomaly to indicate concurrent sudden deviation and long-term degradation. Executable fusion rules for engineering deployment and reproducibility are as follows.
(1)
Short-scale trigger: Let $e_t = |y_t - \hat{y}_t|$, and define the dynamic threshold $T_t = \mu_e + 4\sigma_e$, where $\mu_e$ and $\sigma_e$ are calculated over the most recent 4 h error window. If $e_t > T_t$ is satisfied for $\tau_s = 2$ consecutive hours, a short-scale anomaly is determined with $S_t = 1$, and an alarm is triggered immediately.
(2)
Long-scale trigger: Calculate the window-level error $E_w$ (e.g., the window mean absolute error, MAE) over a 48 h window, and align it with the historical baseline error of the same period (mean $\mu_w$, standard deviation $\sigma_w$). Let $z_w = (E_w - \mu_w)/\sigma_w$. If $z_w > 2.0$ holds for $\tau_l = 2$ consecutive 48 h windows and $E_{w_2} > E_{w_1}$ (showing an upward trend), a long-scale anomaly is determined with $L_t = 1$, and an alarm is triggered at the end of the second window.
(3)
Fusion decision: If $S_t = 1$, output a point anomaly; otherwise, if $L_t = 1$, output a pattern anomaly; if both conditions are satisfied simultaneously, output a composite anomaly (prioritize disposal following the emergency logic of the short-scale stream, with supplementary long-scale trend information attached).
(4)
Anti-chattering: After an alarm is triggered, a cooling period of κ_s = 4 h (short-scale) or κ_l = 48 h (long-scale) is entered, during which no new alarms are raised. If the anomaly conditions persist, the events are merged into a single anomaly event and its duration is updated accordingly.
To help the energy management team allocate response resources, the system generates an “anomaly type–evidence” record based on the triggering stream(s) and decision details. An anomaly triggered by the short-scale stream is labeled as a “sudden point anomaly” and is associated with the maximum residual deviation within the 4 h window (as a severity indicator). An anomaly triggered by the long-scale stream is labeled as a “progressive pattern anomaly” and is associated with the error-increase slope across consecutive windows (as a degradation-rate indicator). If both streams are triggered (e.g., a sudden fault occurs on top of previously unnoticed progressive degradation), the event is labeled as a “composite anomaly”: the response follows the urgent logic of the short-scale stream while attaching the long-scale degradation trend to provide complete evidence for root-cause analysis.
To enable quantitative validation of the dual-stream detector beyond qualitative case studies, a synthetic anomaly injection protocol is adopted for benchmarking with standard detection metrics (detailed in Section 3.4), and a dedicated sensitivity analysis of the key detection parameters k and z is provided in Section 4.4. These additions are intended to demonstrate the detection capability and parameter robustness of the proposed anomaly module under controlled conditions.
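The four fusion rules above can be condensed into a short Python sketch. This is a minimal illustration under the stated defaults (4 h short window with k = 4, 48 h long window with z = 2, two consecutive confirmations); the cooling-period and event-merging logic of rule (4) is omitted for brevity, and function names are ours rather than the released implementation's.

```python
import numpy as np

def short_scale_flags(errors, window=4, k=4.0, consec=2):
    """Short-scale trigger: e_t exceeds mu_e + k*sigma_e (statistics over the
    most recent `window` hours) for `consec` consecutive hours."""
    n = len(errors)
    over = np.zeros(n, dtype=bool)
    for t in range(window, n):
        recent = errors[t - window:t]
        over[t] = errors[t] > recent.mean() + k * recent.std()
    flags = np.zeros(n, dtype=bool)
    for t in range(consec - 1, n):
        flags[t] = over[t - consec + 1:t + 1].all()
    return flags

def long_scale_flags(errors, baseline_mu, baseline_sigma, window=48, z_thr=2.0):
    """Long-scale trigger: the 48 h window MAE deviates from the same-season
    baseline by more than z_thr sigmas for two consecutive windows,
    with an upward trend between them."""
    n_win = len(errors) // window
    E = np.array([errors[i * window:(i + 1) * window].mean() for i in range(n_win)])
    z = (E - baseline_mu) / baseline_sigma
    flags = np.zeros(n_win, dtype=bool)
    for i in range(1, n_win):
        flags[i] = (z[i] > z_thr) and (z[i - 1] > z_thr) and (E[i] > E[i - 1])
    return flags

def fuse(short_flag, long_flag):
    """Decision fusion: point, pattern, composite, or normal."""
    if short_flag and long_flag:
        return "composite"
    if short_flag:
        return "point"
    if long_flag:
        return "pattern"
    return "normal"
```

In deployment, `fuse` would be evaluated at each hour from the latest short-scale flag and the flag of the most recently completed long-scale window.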

2.4. SHAP-Based Explainability Module

Following the forecasting and anomaly detection stages, the final component of the proposed framework is to generate interpretable diagnostic evidence. This subsection first provides a global explanation of the model’s forecasting behavior, followed by local attribution analysis tailored to anomaly interpretation, establishing a clear connection between model outputs and operationally meaningful insights for facility operators.
Campus building operation and maintenance (O&M) demands explanations that are comparable across operating regimes (e.g., weekday/weekend, seasonal variations), decomposable into traceable evidence chains for individual anomalies, and actionable for prioritizing troubleshooting. However, direct feature-level attributions are difficult for engineers to interpret because the model inputs include rich one-hot encodings and derived statistical features. To address this, we employ semantic feature aggregation: features reflecting consistent engineering semantics are grouped, and explanations are computed in this aggregated feature space to better align with practical O&M knowledge. We use SHAP as the core attribution method because its additive decomposition provides a unified framework for quantifying global feature importance and local contributions, supporting consistent importance ranking and reliable evidence accumulation. Critically, SHAP attributions are integrated with the dual-stream anomaly evidence (anomaly type, persistence, severity, and degradation rate) to generate diagnosis-oriented results rather than isolated visualizations. Specifically, we anchor explanations to anomaly morphology through carefully selected explanation windows and adopt regime-matched baselines for fair comparison; the resulting attribution patterns are further translated into a standardized heuristic triage guideline for operators, focusing on equipment, occupancy, and environmental factors, rather than relying on an automated classification system.

2.4.1. Forecast Explanation (Global)

This subsection summarizes the model’s forecasting logic for campus O&M. All analyses are performed in an aggregated semantic feature space to reduce one-hot fragmentation and align with operator terminology. We provide two global views:
(1)
Importance ranking via mean absolute SHAP values to identify key drivers and guide data-quality prioritization and threshold calibration;
(2)
SHAP summary visualization in the aggregated feature space to show the distribution, direction, and variability of sample-level feature contributions.
The global explanation outputs and their deployment applications are summarized in Table 3.

2.4.2. Anomaly Attribution (Local)

This subsection focuses on specific anomaly events detected by the dual-stream detector and aims to provide precise, actionable attribution to reduce diagnostic complexity and time cost. Four complementary functions are incorporated.
(1)
Feature attribution is performed by computing SHAP values for each anomalous sample, thereby quantifying the positive or negative contribution of individual features to the observed prediction deviation.
(2)
Counterfactual analysis adopts a hypothesis–verification paradigm by simulating targeted feature corrections—such as restoring an abnormal sensor temperature to its normal range—and observing whether the anomaly decision is alleviated. This process helps validate root-cause hypotheses and identify which feature adjustments are sufficient to suppress the alarm.
(3)
Temporal attribution is applied to time-series data to identify historical time steps that exert the greatest influence on the current anomaly decision. If a progressive anomaly is most sensitive to a time point several days earlier, operators can trace back to potential triggering events, such as operational changes or abrupt weather disturbances.
(4)
Anomaly-source triage (heuristic, operator-in-the-loop). At the current stage, anomaly source identification is performed through a rule-based, operator-in-the-loop triage procedure based on attribution signatures in the aggregated semantic feature space. After computing grouped SHAP contributions Φ(g), we rank semantic groups by |Φ(g)| and map the dominant groups to three O&M families: equipment-related (HVAC/equipment variables), occupancy/schedule-related (calendar and schedule indicators), and environment-related (weather/environment variables). An alarm is triaged according to which family dominates the top-k contributions (or exceeds a predefined share of total |Φ|); ambiguous cases are marked as "mixed/uncertain" for operator review. Operators then confirm or revise the triage label using contextual information (e.g., maintenance notes, schedule logs, or weather records). This protocol standardizes prioritization and response planning, but it does not constitute an automated classifier.
Overall, SHAP decomposes the forecast into feature-level contributions, enabling an explanation chain from “importance ranking” to “contribution direction” to “anomaly attribution.” In experiments, we present global importance and representative case explanations to verify consistency with domain knowledge. Because anomaly-source labels are produced via operator-in-the-loop triage in this study, we do not report automated classification metrics. Instead, we provide representative case studies and explanation-consistency checks to demonstrate that the attribution signatures align with domain knowledge. Systematic benchmarking with labeled maintenance records is left for future work.
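The triage rule in function (4) can be sketched as follows. The group-to-family mapping and the dominance threshold are illustrative assumptions, not the exact grouping used in the released code; the input is a dictionary of grouped SHAP contributions Φ(g) for the anomalous window.

```python
# Hypothetical semantic grouping; the actual groups follow the paper's feature table.
GROUP_TO_FAMILY = {
    "hvac_power": "equipment", "plug_load": "equipment",
    "hour_of_day": "occupancy", "weekday": "occupancy", "is_weekend": "occupancy",
    "air_temperature": "environment", "wind_speed": "environment",
}

def triage(shap_by_group, dominance_share=0.5):
    """Rank semantic groups by |phi| and return the dominant O&M family.

    shap_by_group: dict mapping a semantic group to its summed SHAP
    contribution. Returns the family whose share of total |phi| exceeds
    `dominance_share`, else "mixed/uncertain" for operator review.
    """
    family_mass, total = {}, 0.0
    for group, phi in shap_by_group.items():
        family = GROUP_TO_FAMILY.get(group, "other")
        family_mass[family] = family_mass.get(family, 0.0) + abs(phi)
        total += abs(phi)
    if total == 0.0:
        return "mixed/uncertain"
    family, mass = max(family_mass.items(), key=lambda kv: kv[1])
    return family if mass / total >= dominance_share else "mixed/uncertain"
```

For example, a window dominated by HVAC-group contributions would be triaged as equipment-related, while evenly spread contributions would be deferred to the operator as "mixed/uncertain".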

3. Data Acquisition and Preprocessing

This section describes the dataset, preprocessing pipeline, and model training protocol used in the experiments. To facilitate reproducibility, the complete implementation and experimental resources are publicly available in the open-source repository.

3.1. Dataset Description

This study uses the public Building Data Genome Project 2 (BDG2) dataset [21]. We select hourly electricity loads and corresponding weather records from two education buildings—Moose_education_Ricardo and Cockatoo_education_Erik—and construct two independent datasets (Moose and Cockatoo). The datasets cover different geographic and climatic backgrounds, capturing differences in seasonality, periodicity, and weather sensitivity, and thus provide a basis for evaluating model generalization. The key statistical characteristics of the two datasets are summarized in Table 4.
All datasets are sampled at an hourly temporal resolution, providing fine-grained records of building electricity consumption and associated environmental and contextual conditions. The Moose dataset contains 17,328 hourly records (approximately two years), while the Cockatoo dataset contains 15,360 hourly records (approximately 1.75 years). Each record includes a timestamp, total active power consumption (kW), key weather variables (dry-bulb temperature in °C and wind speed in m/s), and contextual features capturing calendar and seasonal effects, such as weekday type, a weekend indicator, and the season state. As a public benchmarking dataset for energy forecasting, BDG2 does not provide event-level anomaly annotations, anomaly-source labels, or maintenance-confirmed fault records for the selected buildings.

3.2. Data Preprocessing

To improve transparency and reproducibility, the data preprocessing procedure was organized into a structured pipeline and applied independently to the Moose and Cockatoo datasets. The main steps are described as follows.
(1)
Data quality control and record screening. All records were first checked for timestamp consistency and core-feature completeness. Duplicate timestamps were removed, and severely corrupted segments with irrecoverable long missing intervals were excluded before feature construction. The two datasets contained 32,688 hourly records in total before cleaning. After quality control and preprocessing, 27,784 valid records were retained for modeling, corresponding to 85.00% of the original data. This process improves data consistency while preserving the main temporal and seasonal characteristics of the two buildings.
(2)
Missing-value imputation. Missing values were handled according to the length of consecutive gaps. For short gaps of fewer than 3 consecutive hours, forward filling was applied. For longer gaps, a similar-day interpolation strategy was used. In this study, similar days were selected based on the same weekday type, the same seasonal condition, and comparable weather conditions so that the imputed values remained consistent with building operating patterns under similar temporal and environmental contexts. The 3 h cutoff was used as the default setting in the main experiments.
(3)
Outlier detection and treatment. Outliers in continuous variables were identified using a modified Z-score method with a threshold of 2.5. Detected outliers were corrected using median replacement to reduce the influence of abnormal observations on downstream modeling. This treatment was adopted to avoid distortion of temporal patterns while maintaining robustness to occasional measurement noise and recording errors.
(4)
Time-semantic feature engineering. Discrete temporal features were extracted from the timestamp to encode regular building operation patterns. Specifically, the preprocessing stage generated weekday indicators, a weekend/working-day flag, hour-of-day indicators, and season indicators. These time-semantic variables provide contextual information for capturing daily, weekly, and seasonal regularities in building energy consumption. Their final encoded feature groups are summarized in the input-feature table of the forecasting module.
(5)
Historical energy feature construction. To characterize short-term and multi-scale load dependence, historical energy features were derived from past electricity consumption, including lag_1h, lag_24h, lag_168h, roll_mean_24h, and roll_max_24h. These variables were used together with environmental and temporal features to form the final 43-dimensional input at each time step. The lag and rolling features were computed only from past observations so as to preserve temporal causality and avoid information leakage. This feature composition is consistent with the model input definition given in Section 2.2.1.
(6)
Data standardization. Continuous variables, including electricity load, air temperature, and wind speed, were standardized using Min–Max normalization. To prevent information leakage, the normalization parameters were estimated using the training set only, and the same transformation was then applied to the validation and test sets. All forecasting metrics were finally reported on the original physical scale (kW) after inverse transformation.
(7)
Robustness check of preprocessing heuristics. To examine whether the experimental conclusions were sensitive to specific preprocessing choices, a one-factor-at-a-time sensitivity analysis was further conducted on the missing-gap threshold, the outlier Z-score threshold, and the outlier replacement strategy. Across these settings, the fraction of modified samples ranged from 0.00% to 2.82% for Moose and from 1.55% to 6.77% for Cockatoo. The resulting performance variations remained limited (Moose: ΔMAE ≤ 2.82 kW, ΔRMSE ≤ 4.56 kW; Cockatoo: ΔMAE ≤ 3.11 kW, ΔRMSE ≤ 4.06 kW), indicating that the overall conclusions of this study are not driven by a particular preprocessing heuristic.
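Steps (3) and (6) of the pipeline can be illustrated with a short sketch. It assumes the common MAD-based definition of the modified Z-score (normalization constant 0.6745); helper names are ours, and the released pipeline may differ in detail.

```python
import numpy as np
import pandas as pd

def modified_zscore(x):
    """Modified Z-score based on the median absolute deviation (MAD)."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return 0.6745 * (x - med) / (mad if mad > 0 else 1e-9)

def replace_outliers(series, threshold=2.5):
    """Median-replace values whose |modified Z-score| exceeds the threshold."""
    z = modified_zscore(series.values)
    out = series.copy()
    out[np.abs(z) > threshold] = series.median()
    return out

def minmax_fit_transform(train, *others):
    """Fit Min-Max parameters on the training split only, then apply the
    same transformation to every other split (prevents leakage)."""
    lo, hi = train.min(), train.max()
    scale = (hi - lo) if hi > lo else 1.0
    return [(s - lo) / scale for s in (train, *others)]
```

Note that validation/test values outside the training range simply map outside [0, 1], which is the expected behavior of leakage-free scaling.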

3.3. Model Training and Evaluation

(1)
Data Split and Sample Construction
To preserve temporal causality and prevent data leakage, Moose and Cockatoo are split chronologically into training, validation, and test sets with ratios of 80%/10%/10%. Within each subset, supervised samples are constructed using a sliding window: historical length L = 24 h, prediction horizon H = 1 h, and stride 1 h. For a continuous sequence of length N, the total number of available windows is computed as Equation (24):
N_WIN = N − (L + H) + 1
This construction satisfies the need of short-term load forecasting to depend on recent history and ensures strict temporal separation across training/validation/test.
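Equation (24) amounts to a one-line computation. As a quick check under the settings used here (L = 24 h, H = 1 h), the Moose sequence of N = 17,328 hourly records yields 17,304 windows:

```python
def n_windows(n, history=24, horizon=1):
    """Number of sliding-window samples: N_WIN = N - (L + H) + 1."""
    return n - (history + horizon) + 1
```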
(2)
Training Settings and Hyperparameter Selection
The model is implemented in PyTorch 2.8.0 (with CUDA 12.9) and trained on the BDG2 datasets using the AdamW optimizer. Hyperparameters are selected via grid search on the validation set, with MAE and RMSE as primary criteria while also considering computational cost. The search ranges include window sizes of 12, 24, and 48 h; batch sizes of 16, 32, and 64; embedding dimensions of 64, 128, and 256; 1–3 encoder layers; and learning rates of 5 × 10−6, 5 × 10−5, and 5 × 10−4. Based on validation performance and training efficiency, the final configuration is a 24 h window, a batch size of 32, an embedding dimension of 128, a single encoder layer, and a learning rate of 5 × 10−5. Early stopping with a patience of 20 epochs is employed to mitigate overfitting. For transparency, the selected setting contains 0.594M parameters and requires 1.40 s/epoch (Moose) and 1.26 s/epoch (Cockatoo) on a CUDA-enabled GPU; training time is reported as an implementation-level reference under this hardware/software environment.
Hyperparameter sensitivity and cost trade-offs. To improve reproducibility, we additionally conduct a one-at-a-time sensitivity check around the selected configuration by varying the embedding dimension, encoder depth, and learning rate while keeping other settings fixed. Within these ranges, performance is most affected by the learning rate and model capacity, and some extreme settings yield suboptimal solutions under a fixed training budget; we therefore rely on validation-based selection with early stopping to identify stable configurations. Under the same training protocol, the maximum observed deviations relative to the selected setting are ΔMAE = 3.70 kW/ΔRMSE = 5.58 kW on Moose and ΔMAE = 3.87 kW/ΔRMSE = 8.03 kW on Cockatoo. Meanwhile, increasing the embedding dimension or the number of encoder layers raises computational cost (more parameters and longer wall-clock time per epoch), which motivates the final choice as a practical balance between accuracy and efficiency.
(3)
Evaluation Metrics and Experimental Design
The following metrics are reported on the test set: mean absolute error (MAE), root mean squared error (RMSE), mean absolute percentage error (MAPE), and coefficient of determination (R2). Ablation studies are conducted to quantify the contribution of each module. In addition, SHAP visualizations are used to interpret feature contributions, enabling a comprehensive evaluation from three aspects: forecasting accuracy, anomaly/pattern responsiveness, and explainability.
The MAPE may suffer from numerical instability when the actual electricity consumption is close to zero. To improve robustness, we adopt an ε-stabilized MAPE defined in Equation (25):
MAPE = (1/n) · Σ_{t=1}^{n} |y_t − ŷ_t| / max(|y_t|, ε) × 100%
where ε = 10⁻³ kW. In addition, MAE and RMSE are used as the main error indicators for supplementary verification, avoiding bias from relying on a single percentage indicator.
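Equation (25) translates directly into code. The sketch below applies the ε floor of 10⁻³ kW to the absolute actual load, as in the stabilized definition:

```python
import numpy as np

def stabilized_mape(y_true, y_pred, eps=1e-3):
    """Epsilon-stabilized MAPE (%): the denominator is floored at eps (kW)
    to avoid division by near-zero loads."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = np.maximum(np.abs(y_true), eps)
    return float(np.mean(np.abs(y_true - y_pred) / denom) * 100.0)
```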
Fair baseline setting: To avoid unfair comparisons, ARIMA, LSTM, Informer, Autoformer, FEDformer, PatchTST, the standard Transformer, and our proposed model use the same time-ordered split (80/10/10), the same input feature set, the same history length L = 24 (hourly), and the same horizon H = 1. Deep models use AdamW, early stopping on the validation set (patience = 20), and grid search within consistent ranges; ARIMA orders (p, d, q) are selected on the training set by jointly considering AIC/BIC and validation errors.

3.4. Synthetic Anomaly Injection for Quantitative Evaluation

Since the BDG2 dataset does not provide event-level anomaly annotations or maintenance-confirmed fault labels, a synthetic anomaly injection protocol is adopted to enable quantitative evaluation of the anomaly detection module using standard detection metrics. This approach is widely used in time-series anomaly detection benchmarking when labeled ground-truth data is unavailable [13,14].
Two types of synthetic anomalies are injected into the test-set ground-truth energy consumption series to simulate commonly observed building-energy fault patterns:
(1) Point anomalies: Short-lived perturbations spanning 2–3 consecutive hours, with a multiplicative magnitude factor of ±5%, simulating sensor faults, transient communication disturbances, or sudden equipment malfunctions. The injection ratio is approximately 1% of test-set samples, yielding 30 point-anomaly events on Cockatoo and 34 on Moose.
(2) Pattern anomalies: 48 h gradual drift segments, modeled as a linear ramp from zero to +10% of the base load at the injection start, simulating progressive equipment degradation such as heat-exchanger fouling or refrigerant leakage. The injection ratio provides approximately 25% temporal coverage of the test set, yielding 9 pattern anomaly events on Cockatoo and 13 on Moose.
The injected anomalies are applied to the original test-set observations, while the Transformer predictions remain based on the clean (un-injected) training data. Prediction residuals are then computed as the absolute difference between the anomaly-injected observations and the model predictions.
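The two injection types can be sketched as multiplicative perturbations on the ground-truth series. This is a simplified illustration: the released protocol also randomizes injection locations, durations, and perturbation signs, which are fixed arguments here.

```python
import numpy as np

def inject_point(series, start, duration=3, factor=0.05, sign=+1):
    """Point anomaly: +/-5% multiplicative perturbation over 2-3 hours."""
    out = series.copy()
    out[start:start + duration] *= (1.0 + sign * factor)
    return out

def inject_pattern(series, start, duration=48, max_drift=0.10):
    """Pattern anomaly: linear ramp from 0 to +10% of base load over 48 h,
    simulating progressive equipment degradation."""
    out = series.copy()
    ramp = np.linspace(0.0, max_drift, duration)
    out[start:start + duration] *= (1.0 + ramp)
    return out
```

Residuals for detection are then computed against predictions from the model trained on the clean series, as described above.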
To provide a rigorous benchmark, three baseline detectors are compared against the proposed dual-stream detector: (a) a 3σ dynamic threshold detector using a rolling window of 4 h and k = 3.0; (b) Isolation Forest applied to the residual series with contamination = 0.15; and (c) an LSTM predictor with 3σ threshold detection on LSTM-based residuals.
Evaluation is conducted at the event level rather than the point level, which better reflects operational utility. For point anomalies, an injected event (2–3 h window) is considered detected if at least one alarm hour falls within the event window. For pattern anomalies, an injected event (48 h window) is considered detected if at least 20 alarm hours fall within the event window. A false-positive event is defined as a contiguous alert segment that does not overlap any injected anomaly region. Standard metrics, including event-level precision, recall, F1-score, and false alarm rate (FAR), are reported.
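The event-level protocol above can be sketched as follows, with `min_hits` set to 1 for point events and 20 for pattern events, and a false positive defined as a contiguous alert segment overlapping no injected region. Function names are illustrative.

```python
def contiguous_segments(flags):
    """Convert a boolean hourly alarm series into [start, end) segments."""
    segs, start = [], None
    for t, f in enumerate(flags):
        if f and start is None:
            start = t
        elif not f and start is not None:
            segs.append((start, t))
            start = None
    if start is not None:
        segs.append((start, len(flags)))
    return segs

def event_metrics(flags, events):
    """events: list of (start, end, min_hits) injected windows.
    Returns event-level (precision, recall, f1)."""
    alarm_hours = [t for t, f in enumerate(flags) if f]
    tp = sum(
        sum(s <= h < e for h in alarm_hours) >= m for (s, e, m) in events
    )
    fn = len(events) - tp
    # A contiguous alert segment that overlaps no injected window is a FP event.
    fp = sum(
        all(seg_e <= s or seg_s >= e for (s, e, _) in events)
        for (seg_s, seg_e) in contiguous_segments(flags)
    )
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

For instance, an alarm inside an injected 3 h point window counts as one detected event, while an isolated alarm far from any injection counts as one false-positive event regardless of its hourly length.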

3.5. Reproducibility and Implementation Availability

All experimental results of this study were derived from an implementation based on Python 3.11.4 and PyTorch, with the complete code and experimental resources publicly available in the open-source repository (https://github.com/cai1157/Explainable-Smart-Building-Energy-Consumption-Forecasting (accessed on 3 April 2026)). The repository encapsulates the full computational environment configuration (including the required Python and dependency versions, such as PyTorch, NumPy 1.26.4, and Pandas 2.2.2), the processed datasets (Cockatoo_data.csv and Moose_data.csv), model training/evaluation notebooks, pre-saved model checkpoints, result visualization plots, and baseline output files for time-series forecasting of smart-building energy consumption. It also provides implementations of multiple forecasting models (ARIMA, LSTM, standard Transformer, and multi-head Transformer) and baseline outputs of advanced time-series models (Autoformer, FEDformer, Informer, PatchTST) for comparative experiments, alongside detailed feature-construction rules and the data preprocessing pipeline for the 43-dimensional input features used in single-step building energy consumption forecasting.

4. Results

4.1. Forecasting Performance

In this section, we compare the forecasting performance of different models on two datasets. Table 5 and Table 6 report the results of the proposed multi-head Transformer and seven baselines, including ARIMA, LSTM, PatchTST, FEDformer, Autoformer, Informer, and the standard Transformer, on the Cockatoo_data and Moose_data datasets, respectively. As shown in Table 5, the proposed model achieves the best overall performance on Cockatoo_data, outperforming not only conventional baselines such as ARIMA and LSTM but also stronger recent Transformer-based models. Among the added recent baselines, FEDformer is the strongest competitor on this dataset, but the proposed model still achieves the lowest MAE, RMSE, and MAPE. On Moose_data (Table 6), the proposed model again yields the best overall performance. In particular, PatchTST provides a competitive result on this dataset, indicating that recent patch-based Transformer designs are strong baselines for building-energy forecasting; however, the proposed model still attains the lowest MAE, RMSE, and MAPE, showing that explicit semantic separation remains beneficial even against stronger modern baselines.
The training and validation loss curves in Figure 2 show that losses on both datasets decrease rapidly and subsequently stabilize, indicating good convergence and stable training, which supports the model’s strong predictive performance. Figure 3 and Figure 4 present the first 200-step forecasting results for Cockatoo and Moose on both the training and test sets. In all cases, the predicted load (orange curve) closely follows the true load (blue curve), accurately capturing daily fluctuations as well as peak and valley patterns. These visual results are consistent with the quantitative low-error metrics and further suggest that the specialized attention heads effectively capture complex dependencies, including usage patterns and environmental influences.
To further strengthen the comparative presentation of the forecasting results, we additionally visualize the distribution of absolute forecasting errors for representative models on the two datasets in Figure 5. The compared models were selected according to both methodological relevance and dataset-specific competitiveness. Specifically, the standard Transformer was included as the direct backbone reference to highlight the benefit of introducing semantic head separation over a vanilla attention architecture. In addition, for each dataset, we selected the strongest recent competing model according to the quantitative results in Table 5 and Table 6, namely FEDformer for Cockatoo and PatchTST for Moose. This strategy avoids overloading the figure with too many models while still enabling a focused comparison against the most competitive recent baseline for each dataset.
As shown in Figure 5a, on the Cockatoo dataset, the proposed multi-head Transformer exhibits the lowest median absolute error and a more concentrated interquartile range than both the standard Transformer and FEDformer. This indicates that the proposed model not only improves the average forecasting accuracy but also reduces the dispersion of prediction errors and limits large-error cases more effectively. In Figure 5b, on the Moose dataset, PatchTST is used as the most competitive recent baseline because its overall average metrics are close to those of the proposed model. Although the central error distribution of PatchTST is broadly comparable to that of the multi-head Transformer, the proposed model shows noticeably fewer extreme outliers and a more controlled upper tail. This suggests that the multi-head Transformer provides more stable prediction behavior and better robustness against occasional large deviations. Overall, these distributional results complement the average metrics reported in Table 5 and Table 6 and further demonstrate that the proposed model achieves not only strong mean performance but also improved stability across different operating conditions.

4.2. Ablation Study

On the Cockatoo_data test set, we conduct ablation studies to verify the contribution of each component. The results are shown in Table 7 and Figure 6. The ablation results yield several important insights. First, the energy head plays a dominant role: removing it leads to a substantial performance drop, with MAE increasing by 34.5% and R2 decreasing by 12.0%. This highlights the critical importance of historical load dependencies—such as lagged values and rolling statistics—in modeling temporal dynamics, particularly weekday–weekend differences and daily peak–valley variations. Second, the time head is also essential. Its removal increases MAE and MAPE by 16.3% and 17.4%, respectively, and reduces R2 by 3.6%, confirming that explicit time semantics (hour of day, day of week, and season) are indispensable for capturing strong periodic patterns in campus buildings, where energy usage is highly schedule-driven. Third, the environment head contributes significantly to model accuracy. Excluding this head results in an 18.2% increase in MAE, reflecting the strong coupling between weather conditions (e.g., outdoor temperature and wind speed) and building energy consumption. In campuses with pronounced seasonal transitions, neglecting environmental factors weakens the model’s ability to adapt to heating and cooling load variations.
Overall, these results validate the effectiveness of the proposed domain-specific multi-head attention mechanism. By allowing each attention head to focus on a distinct semantic subspace through feature masking, the model reduces feature interference and achieves a more comprehensive representation of complex load dependencies. Notably, the combined contribution of the energy head and the time head is most critical for capturing campus-specific usage pattern changes, which is well aligned with domain knowledge.

4.3. Anomaly Detection Benchmark

To provide quantitative validation of the anomaly detection module, we evaluate the proposed dual-stream detector against three baselines using the synthetic anomaly injection protocol described in Section 3.4. Table 8 reports the event-level detection metrics on both datasets. The dual-stream detector in Table 8 uses the selected operating configuration with k = 4 and z = 2, while Section 4.4 reports a parameter sensitivity analysis over the predefined grid.
Several observations can be drawn from Table 8. First, the proposed dual-stream detector achieves the highest F1 scores on both datasets (0.386 on Cockatoo and 0.452 on Moose), substantially outperforming all three baselines. The advantage is most pronounced on the Moose dataset, where the dual-stream detector achieves an F1 of 0.452 compared to 0.1174 for the strongest baseline (Isolation Forest).
Second, the most distinctive advantage of the proposed detector lies in pattern anomaly recall. On Moose, the dual-stream detector captures 69.2% of pattern events (9 out of 13), compared to 0% for both the 3σ and LSTM + 3σ baselines and 30.8% for Isolation Forest. On Cockatoo, the dual-stream detector detects 66.7% of pattern events (6 out of 9), while the 3σ and LSTM + 3σ baselines detect none. This confirms that single-scale threshold methods are largely blind to gradual degradation patterns, which is precisely the gap that the long-scale stream is designed to address.
Third, the 3σ baseline achieves higher point recall than the dual-stream detector on Cockatoo (56.7% vs. 36.7%), because it triggers at every individual threshold violation without the consecutive-hour confirmation required by the proposed method. However, this higher point sensitivity comes at the cost of substantially lower precision (0.069 vs. 0.347), indicating that the 3σ baseline generates a much higher false-alarm burden that would be impractical for real-world building operation and maintenance.
Fourth, the overall precision values across all detectors remain moderate (ranging from 0.059 to 0.347), and the false alarm rates are correspondingly elevated. This is a systematic phenomenon shared by all compared methods rather than a limitation specific to the proposed detector, and it can be attributed to two factors. (a) The false positive events are predominantly generated by the short-scale stream responding to natural load variability, rather than by the long-scale stream, indicating that most false alarms arise from the point-detection pathway, where normal residual fluctuations can occasionally exceed adaptive thresholds. (b) Importantly, the BDG2 dataset used in this study is real-world operational data that has not been manually verified to be entirely anomaly-free. The test set may contain genuine but unlabeled anomalies—such as unrecorded equipment events, unreported schedule changes, or latent sensor issues—that the detector correctly identifies but that are counted as false positives because only the synthetically injected anomalies carry ground-truth labels. This labeling incompleteness is an inherent limitation of synthetic injection on real-world data and likely contributes to an underestimation of the true precision of all evaluated methods. Despite these factors, the proposed dual-stream detector consistently achieves precision values 3–5× higher than the baselines, demonstrating substantially better discrimination ability under the same evaluation conditions. These precision limitations are further discussed in Section 5.3.
Overall, the benchmark results demonstrate that the dual-stream design provides broader anomaly coverage than single-scale alternatives, particularly for progressive degradation patterns that are common in real building operations but difficult to detect with conventional threshold methods.

4.4. Sensitivity Analysis of Anomaly Detection Parameters

To examine the robustness of the dual-stream detector with respect to its key heuristic parameters, a systematic sensitivity analysis was conducted by varying the short-scale threshold multiplier k ∈ {2.0, 2.5, 3.0, 3.5, 4.0} and the long-scale deviation threshold z ∈ {1.5, 2.0, 2.5, 3.0}, yielding 20 parameter combinations for each dataset. All other parameters, including the short-window length, long-window length, consecutive-hour requirement, and baseline-window configuration, were kept fixed at their default values. In the implementation, the detector performs a grid search over the predefined k and z values, and the resulting metrics are computed using the same event-level evaluation protocol as the main benchmark. The sensitivity analysis of dual-stream detector parameters is summarized in Table 9.
The sensitivity analysis reveals three clear findings. First, the short-scale threshold multiplier k has a dominant and monotonic effect on detection performance. As k increases from 2.0 to 4.0 (at z = 2.0), the F1-score improves steadily on both datasets (Cockatoo: 0.229→0.386; Moose: 0.345→0.452), precision increases substantially (Cockatoo: 0.144→0.347; Moose: 0.218→0.333), and the number of false-positive events decreases. This improvement is driven by the suppression of short-scale false alarms: stricter thresholds filter out residual fluctuations caused by normal load variability, thereby improving precision without severely sacrificing recall. Across the full grid, the best F1-scores are achieved at k = 4.0 on both datasets (0.419 at z = 1.5 for Cockatoo, 0.504 at z = 1.5 for Moose).
Second, the long-scale threshold z also influences detection performance, though less strongly than k. On Moose, increasing z from 1.5 to 3.0 at k = 4.0 reduces F1 from 0.504 to 0.367 and precision from 0.380 to 0.261, because a stricter z suppresses some pattern anomaly detections. On Cockatoo, the effect of z is comparatively smaller: at k = 4.0, F1 decreases from 0.419 to 0.360 as z increases from 1.5 to 3.0. The direction of the z effect—lower z yields better performance—is consistent with the fact that a more lenient long-scale threshold allows the detector to capture more pattern anomaly events, which are a key strength of the proposed dual-stream design.
Third, the detection performance remains reasonably stable across the entire parameter grid: F1-scores range from 0.212 to 0.419 on Cockatoo and from 0.274 to 0.504 on Moose, with no configuration producing degenerate behavior such as zero recall or complete failure. This indicates that the dual-stream architecture is not overly sensitive to specific parameter choices and can be expected to function reliably under a range of practical settings.
The selected operational configuration (k = 4.0, z = 2.0) provides a practical balance between detection capability and false-alarm control, achieving the best F1 among the configurations with z = 2.0. For practical deployment, k should be treated as the primary tuning parameter, as it directly controls the trade-off between detection sensitivity and false-alarm burden. The long-scale threshold z can be set within a moderate range (1.5–2.5); lower values improve pattern anomaly recall at the cost of slightly more long-scale false alarms.
Overall, the sensitivity analysis indicates that the proposed dual-stream detector is reasonably robust across the explored parameter grid: no degenerate configuration is observed, metric variations remain gradual rather than abrupt, and k = 4.0 provides the best overall trade-off between detection performance and false-alarm suppression on both datasets.

4.5. Explainability Analysis

To enhance the transparency of the proposed multi-head Transformer predictor, we use SHAP for global explanation and local attribution. Because the input contains many one-hot encodings and derived variables, explanations are computed in an aggregated feature space: dimensions belonging to the same semantic feature (e.g., hour, weekday, historical_electricity_data) are grouped, and their SHAP contributions are summed, yielding attributions that match the engineering semantics of the inputs.
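This aggregation step can be sketched as follows. The column layout and group names below are illustrative rather than the exact model schema; the key property used is the additivity of SHAP values, which allows the columns of a one-hot block to be summed into a single semantic contribution.

```python
import numpy as np

# Hypothetical raw input layout: 24 one-hot hour columns plus two scalar features.
columns = [f"hour_{h}" for h in range(24)] + ["air_temperature", "historical_electricity_data"]
groups = {
    "hour": [c for c in columns if c.startswith("hour_")],
    "air_temperature": ["air_temperature"],
    "historical_electricity_data": ["historical_electricity_data"],
}

def aggregate_shap(shap_values: np.ndarray, columns: list, groups: dict) -> dict:
    """Sum per-column SHAP contributions into semantic feature groups.

    Because SHAP values are additive, summing a one-hot block's columns gives
    the contribution of the underlying categorical feature without changing
    the total attribution of any sample.
    """
    idx = {c: i for i, c in enumerate(columns)}
    return {g: shap_values[:, [idx[c] for c in cols]].sum(axis=1)
            for g, cols in groups.items()}

# Random stand-in SHAP matrix: 5 samples x 26 raw input columns.
rng = np.random.default_rng(1)
sv = rng.normal(size=(5, len(columns)))
agg = aggregate_shap(sv, columns, groups)
# Global importance in the aggregated space: mean |SHAP| per group.
global_importance = {g: float(np.abs(v).mean()) for g, v in agg.items()}
```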

4.5.1. Forecast Explanation (Global)

Figure 7 shows the global feature-importance results in the aggregated feature space (mean |SHAP| over all evaluated samples). The forecast is mainly driven by time and historical-load factors: hour and historical_electricity_data have the largest mean absolute contributions, followed by calendar features such as weekday and season. Environmental features, e.g., air temperature and wind speed, contribute less but still have a visible influence. This matches typical campus load patterns: loads are strongly shaped by schedules and recent inertia, while weather provides additional modulation.
Figure 8 presents a SHAP summary (beeswarm) plot for aggregated features, showing the distribution of sample-level contributions. The horizontal spread reflects the variability and strength of a feature’s influence; the sign indicates whether the feature increases or decreases the prediction relative to the baseline. The SHAP distributions of hour and historical_electricity_data have larger spans, indicating that the model relies heavily on these features to adjust predictions and capture intra-day fluctuations and short-term dependencies. Features such as wind speed are mostly concentrated near zero, suggesting a weaker overall influence with occasional corrective effects in certain scenarios.

4.5.2. Anomaly Attribution

Global explanations alone are insufficient for O&M-oriented anomaly diagnosis. Therefore, we further perform local attribution analysis for individual predictions (or anomalous moments). Figure 9 presents a representative SHAP force-plot explanation in the model’s aggregated feature space: for this single energy prediction, negative contributions (marked in blue, e.g., historical_electricity_data, hour) push the prediction downward, while positive contributions (marked in red, e.g., weekday, season) push it upward; the larger the absolute SHAP value, the stronger the influence of the corresponding feature.
As shown in Figure 9, single-sample forecasts are often dominated by a small number of key features (e.g., historical_electricity_data and hour), whose contributions far exceed those of others. This suggests that the model uses historical load and time-of-day as anchors for short-term forecasting and then applies calendar and environmental features as secondary refinements. Such a “dominant factors + secondary corrections” structure makes it easier to translate anomaly alarms into actionable troubleshooting clues.
For anomaly diagnosis, SHAP local attribution can be combined with the dual-stream detector outputs to support operator triage by highlighting dominant attribution signatures and linking alarms to likely source families (e.g., sensor/data issues, equipment behavior, or contextual schedule/weather factors).
The source labels used in the following case studies are obtained via the proposed heuristic triage guideline with operator confirmation rather than an automated classifier.
(1) Sensor drift (progressive deviation).
Anomalies caused by sensor drift typically present as prediction errors that increase gradually over several consecutive days. Although point-wise deviations within a short time span may appear mild, the long-term trend is pronounced. Therefore, this type of anomaly is usually captured more sensitively by the long-scale detection stream (i.e., pattern anomalies). From the interpretability perspective, SHAP often indicates that the attribution distribution becomes atypical even though operating conditions appear normal: either the contributions of environmental and temporal features exhibit unusual changes, or the model shows an unusually strong reliance on exogenous variables to fit the current observations. Such patterns suggest systematic measurement drift and a progressive change in the sensor signal. Accordingly, recommended actions include recalibrating the sensors, verifying the consistency and integrity of the data acquisition pipeline, and examining whether preprocessing procedures (e.g., imputation and cleaning) introduce unintended bias.
(2) Equipment degradation (efficiency drop).
Anomalies resulting from gradual equipment degradation are typically observed under operating contexts that are broadly similar to historical conditions, yet the actual energy consumption increases progressively, leading to sustained accumulation of forecasting errors over longer horizons. In detection, the long-scale stream commonly triggers first, and the short-scale stream may subsequently be activated as the deviation grows. SHAP explanations often reveal a transition from the normal “history- and time-dominated” attribution structure toward larger correction terms (e.g., strengthened environmental effects or systematic sign shifts), which is consistent with changes in equipment response caused by efficiency deterioration. Therefore, maintenance-oriented inspections are recommended, focusing on typical degradation mechanisms such as reduced HVAC heat-exchange efficiency, clogged filters, or refrigerant leakage, and the inferred onset time should be cross-validated using maintenance records and operational logs.
(3) Abnormal occupancy/activity (sudden load surge).
Anomalies driven by abnormal occupancy or temporary activities typically manifest as abrupt load spikes during expected low-demand periods (e.g., weekends or nights). Consequently, the short-scale stream tends to trigger rapidly. In this case, SHAP analysis often shows that time-semantic features (e.g., hour, is_weekend, weekday) and historical-load features contribute strongly in the “low-load expectation” direction, yet the observed consumption is substantially higher. This mismatch indicates that the anomaly is more likely caused by occupancy/schedule shocks rather than sensor distortion. Recommended actions include checking event schedules and space-use records, verifying whether control strategies were manually overridden, and confirming the presence of temporary activities such as meetings, sports events, or other special operating modes.
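The stream-level portion of the triage guideline illustrated by these three cases can be paraphrased as a small decision rule. This is an illustration only: as noted above, the actual guideline also weighs SHAP attribution signatures and requires operator confirmation before a source label is assigned.

```python
def triage(short_alarm: bool, long_alarm: bool, low_load_expected: bool) -> str:
    """Map dual-stream alarm states to a likely anomaly-source family.

    Encodes only the stream-level logic of the three cases above; SHAP
    signatures and operator confirmation are deliberately omitted.
    """
    if long_alarm and short_alarm:
        # Long-scale fired first, short-scale joined as the deviation grew.
        return "equipment degradation (deviation has grown large)"
    if long_alarm:
        # Gradual error accumulation without sharp excursions.
        return "sensor drift or early equipment degradation"
    if short_alarm and low_load_expected:
        # Abrupt spike during an expected low-demand period.
        return "abnormal occupancy or schedule shock"
    if short_alarm:
        return "transient point anomaly"
    return "no alarm"
```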
A deeper reading of these representative cases shows that the detected anomalies are associated with different error-evolution patterns rather than a single generic deviation mode. Sudden occupancy shocks or transient sensor disturbances typically generate sharp residual excursions within a short horizon, making them more readily captured by the short-scale stream. By contrast, sensor drift and equipment degradation are better characterized by gradual error accumulation and sustained departure from the historical residual baseline, which explains why the long-scale stream provides more reliable evidence in such cases. This distinction is important because it shows that the proposed dual-stream design does not merely increase detection complexity but captures two practically different anomaly morphologies: abrupt deviations and progressive degradation. In this sense, the anomaly results provide a more informative diagnosis basis than a single-threshold strategy, especially when model residuals evolve differently across operational scenarios.
In summary, SHAP explanations not only answer “which features are important” but also provide actionable evidence for “why the anomaly happens” and “what to check first.” This turns alarms into an operational diagnostic chain and improves the engineering deployability and decision trustworthiness of the proposed framework.

4.6. Condition-Dependent Forecasting Performance

To provide deeper insight into model robustness across different operating regimes, the test-set forecasting results are disaggregated along two dimensions: (1) by season (spring, summer, autumn, winter) and (2) by load regime (peak hours: weekday 08:00–18:00; off-peak hours: nights, weekends, and holidays). Table 10 reports the MAE, RMSE, and MAPE for each condition on both datasets.
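The disaggregation itself is a straightforward masked-metric computation. The sketch below uses a synthetic week and a hypothetical constant-bias forecast (both assumptions, not the paper's data) to make the MAE-versus-MAPE contrast between regimes explicit.

```python
import numpy as np

def regime_metrics(y_true: np.ndarray, y_pred: np.ndarray, mask: np.ndarray):
    """Return (MAE, RMSE, MAPE %) over the hours selected by a boolean mask."""
    t, p = np.asarray(y_true)[mask], np.asarray(y_pred)[mask]
    err = t - p
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    mape = 100.0 * np.abs(err / t).mean()
    return mae, rmse, mape

# One synthetic week of hourly load; peak = weekday 08:00-18:00, as in the text.
hours = np.arange(7 * 24)
weekday = (hours // 24) < 5            # treat the first five days as weekdays
hour_of_day = hours % 24
peak = weekday & (hour_of_day >= 8) & (hour_of_day < 18)

y_true = 50.0 + 30.0 * peak + np.sin(hours / 24 * 2 * np.pi)  # higher load at peak
y_pred = y_true + 1.5                                          # constant 1.5 kW bias

mae_peak, _, mape_peak = regime_metrics(y_true, y_pred, peak)
mae_off, _, mape_off = regime_metrics(y_true, y_pred, ~peak)
```

With an identical absolute bias, MAE is the same in both regimes while MAPE is lower at peak because the denominator (the load itself) is larger, which illustrates why peak-hour MAPE can be lower even when absolute errors are not smaller.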
Several observations emerge from the regime-based analysis. First, seasonal performance variations are observed in both datasets, though they remain moderate. On the Cockatoo dataset, the MAPE varies only slightly across seasons (3.36–3.43%), with summer achieving the lowest MAPE (3.36%) and winter the highest (3.43%). On the Moose dataset, seasonal variation is more pronounced: autumn achieves the lowest MAPE (2.29%), while summer shows the highest (3.02%). This difference is likely attributable to the more variable weather and occupancy patterns during summer months at the Moose building, whereas the Cockatoo building exhibits relatively stable seasonal behavior. Notably, the absolute MAE on Moose is highest during winter (10.05 kW), which is consistent with increased load variability during heating periods in a cold-climate building (temperature range: −28.8 to 33.9 °C).
Second, the peak versus off-peak comparison reveals complementary patterns across the two datasets. On Cockatoo, the model achieves slightly lower MAPE during peak hours (3.33%) than during off-peak hours (3.43%), while on Moose, peak-hour MAPE (2.35%) is substantially lower than off-peak MAPE (2.95%). However, in terms of absolute error (MAE), peak-hour errors are higher on both datasets (Cockatoo: 6.44 vs. 5.37 kW; Moose: 8.79 vs. 8.32 kW), indicating that while the model captures the proportional load patterns well during peak periods, the larger absolute load magnitudes during peak hours naturally lead to higher absolute errors. The lower peak-hour MAPE suggests that the model benefits from the more structured and predictable occupancy-driven load patterns during working hours compared to the more diverse and less regular consumption during off-peak periods.
Overall, the proposed model maintains MAPE below 3.5% across all conditions on both datasets, indicating that the framework provides robust forecasting performance across diverse operating regimes rather than being effective only under specific conditions. The maximum MAPE variation within any single dataset is 0.07 percentage points on Cockatoo and 0.73 percentage points on Moose, confirming that the domain-semantic attention mechanism and the temporal feature encoding effectively capture seasonal and diurnal regularities across different operating contexts.

5. Discussion

5.1. Comparison with Existing Studies

As shown in Table 5 and Table 6, the proposed multi-head Transformer achieves the best overall forecasting performance on both the Cockatoo and Moose datasets when compared with ARIMA, LSTM, the standard Transformer, Informer, Autoformer, FEDformer, and PatchTST. This indicates that the gain of the proposed model does not arise merely from using a Transformer backbone but also from the explicit semantic organization and fusion of heterogeneous energy-, environmental-, and time-related features.
On the Cockatoo dataset, the proposed model outperforms all compared methods, with FEDformer as the strongest competitor; this indicates that decomposition-based Transformer designs are competitive for building-energy forecasting and that the proposed semantic head separation provides further gains on top of them. In contrast, on the Moose dataset, PatchTST achieves a highly competitive result that nearly matches the proposed model, suggesting that patch-based sequence modeling is well suited to this specific building; even so, the proposed model still maintains the best overall accuracy on this dataset. Building on these two dataset-specific findings, we can draw a general conclusion: recent Transformer-based forecasting models serve as strong baselines for building-energy forecasting tasks, and the proposed semantic head separation remains effective across both datasets. Going a step further, the comparative advantage of the proposed framework should not be limited to lower aggregate MAE, RMSE, or MAPE values; the error-distribution analysis reveals that different Transformer variants exhibit distinct strengths depending on dataset characteristics, with FEDformer performing better on the more regular Cockatoo series and PatchTST excelling on the more variable Moose series. Against this backdrop, the core practical value of the proposed multi-head Transformer lies in its balanced robustness across both datasets: unlike methods relying on a single temporal inductive bias, it explicitly organizes heterogeneous energy, environmental, and temporal information, leading to advantages not only in mean accuracy but also in reduced error dispersion and better control of extreme deviations, factors that are critical for monitoring-oriented building energy applications.
From a methodological perspective, the contribution of the present study does not lie in proposing a fundamentally new theoretical attention operator. Instead, for the forecasting module, its contribution lies in making domain-semantic separation more explicit within the attention architecture itself, with three key distinctions relative to the closest related approaches. First, compared with feature-aware reweighting methods such as spectral clustering with Temporal Fusion Transformers [8], which recalibrate input importance before or within shared feature interaction, DS-MHA constrains cross-domain interactions directly within the attention mechanism itself by routing energy-, environment-, and time-related features to dedicated heads through semantic masks, preventing heterogeneous features from competing in the same attention distribution. Second, compared with branch-based semantic grouping methods [9,10] that separate source groups into parallel backbones and fuse downstream, DS-MHA achieves semantic separation within a single unified Transformer backbone, avoiding the parameter overhead of multiple parallel encoders while still ensuring explicit semantic control. Third, beyond these architectural differences, the head-level formulation enables two capabilities that are central to the proposed framework but difficult to achieve with alternative designs: (a) head-wise ablation that independently quantifies each semantic domain’s contribution (Table 7: removing the energy head causes a 34.5% MAE increase, the environment head 18.2%, and the time head 16.3%), and (b) natural alignment between attention heads and the SHAP-based explainability module, where the energy/environment/time decomposition at the attention level directly corresponds to the semantic feature groups used for diagnosis-oriented attribution in Section 4.5. 
These interpretability advantages make the head-level formulation not merely an incremental variant of existing approaches but a necessary architectural choice for the integrated forecasting–diagnosis–explanation pipeline proposed in this study.
From the anomaly-detection perspective, many existing studies rely on a single detection logic, such as thresholding, reconstruction error, or prediction residuals alone. By contrast, the proposed framework combines residual-based short-scale alerting and seasonally aligned long-scale validation to better reflect the multi-scale nature of real building anomalies. In addition, compared with prior XAI-based building-energy studies that mainly use explanation as a post hoc add-on, the present framework links forecasting, structured anomaly evidence, and SHAP-based attribution within one workflow, thereby providing a closer connection between prediction, alarm generation, and diagnosis.
The quantitative benchmark results (Table 8) further confirm that the dual-stream design provides measurably better anomaly coverage than single-scale alternatives. On both datasets, the proposed detector achieves the highest event-level F1 scores (0.386 and 0.452), with the most pronounced advantage in pattern anomaly detection, where single-scale baselines largely fail (0% pattern recall for 3σ and LSTM + 3σ baselines vs. 66.7–69.2% for the proposed method). The quantitative evidence complements the qualitative interpretation in Section 4.5.2 and substantiates the claim that the complementary short-scale and long-scale streams are both practically necessary.

5.2. Methodological Interpretation and Practical Implications

The ablation results provide important insight into why the proposed forecasting module is effective. The most significant performance degradation occurs when the energy head is removed, confirming that historical load dynamics remain the dominant driver of short-term building energy prediction. The time head also contributes strongly, indicating that campus buildings exhibit pronounced schedule-driven regularities shaped by teaching timetables, weekday–weekend transitions, and seasonal operation. The environment head further improves performance by accounting for weather-sensitive variation, especially during cooling and heating periods. Together, these findings support the methodological rationale of semantic head separation.
The dual-stream anomaly detector is also practically meaningful for operation and maintenance. The short-scale stream provides rapid alarms for sudden deviations, while the long-scale stream captures slow degradation that accumulates over time. Their combination is intended to provide richer and more structured anomaly evidence than a single-threshold strategy, which may be advantageous for real campus energy-management scenarios involving both abrupt deviations and gradual degradation.
The parameter sensitivity analysis (Section 4.4) provides additional insight into the detector’s practical behavior. The strong influence of k on the false-alarm rate underscores that the short-scale threshold is the primary control lever for balancing detection sensitivity and operational alarm burden. In contrast, the relative insensitivity to z indicates that the long-scale stream’s multi-criterion decision logic—combining deviation threshold, consecutive-window persistence, and trend verification—provides inherent robustness, making the long-scale detection behavior stable across a reasonable range of threshold settings. This finding supports the practical deployability of the proposed framework, as operators primarily need to calibrate a single parameter (k) to their site-specific noise levels.
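As a concrete illustration, this multi-criterion long-scale logic can be sketched as follows. The baseline statistics, the 48 h window, and the three-window persistence below are simplified stand-ins for the seasonally aligned implementation, not its exact form.

```python
import numpy as np

def long_scale_alarm(residuals, baseline_mu: float, baseline_sigma: float,
                     z: float = 2.0, window: int = 48, n_windows: int = 3) -> bool:
    """Fire only when three criteria hold simultaneously:
    (1) deviation: every recent window mean departs from the seasonal
        baseline by more than z standard deviations;
    (2) persistence: the condition holds for n_windows consecutive windows;
    (3) trend: the standardized deviation is non-decreasing (worsening).
    """
    res = np.asarray(residuals, dtype=float)
    if len(res) < n_windows * window:
        return False
    devs = np.array([
        (res[len(res) - i * window: len(res) - (i - 1) * window].mean()
         - baseline_mu) / baseline_sigma
        for i in range(n_windows, 0, -1)
    ])
    return bool(np.all(devs > z) and np.all(np.diff(devs) >= 0))

# A steadily worsening residual drift satisfies all three criteria...
drift = np.concatenate([np.full(48, 2.5), np.full(48, 3.0), np.full(48, 3.5)])
# ...while stationary noise around the baseline does not.
flat = np.zeros(144)
```

Because all three criteria must hold jointly, isolated threshold crossings do not trigger the long-scale stream, which is one reason its behavior is stable across a range of z settings.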
The explainability module further improves engineering usability. Rather than providing only feature-importance visualization, the framework uses SHAP to connect prediction results with anomaly evidence and operator-oriented interpretation. This helps move from “what is important” to “what likely happened” and “what should be checked first”, which is more valuable for building operation and maintenance than a stand-alone post hoc explanation.
The condition-dependent analysis (Section 4.6) provides further evidence of the framework’s practical robustness. The proposed model maintains consistent forecasting accuracy across different seasons and load regimes, with MAPE ranging from 3.33% to 3.43% on Cockatoo and from 2.29% to 3.02% on Moose. This is particularly relevant for campus buildings where operating conditions vary substantially between teaching and holiday periods and between heating and cooling seasons. The relatively small performance variation across conditions suggests that the explicit incorporation of temporal semantics (hour, weekday, season) through dedicated attention heads enables the model to adapt to regime-specific patterns without requiring separate models for different operating conditions.

5.3. Limitations and Future Work

Several limitations of this study should be acknowledged. First, the experiments are conducted only on two educational buildings from the BDG2 dataset, which limits the generalizability of the proposed framework. Further validation is therefore required across more diverse building categories, climate zones, and operational contexts to confirm its applicability in different scenarios. Second, anomaly-source identification in this study is not formulated as a supervised classification problem, primarily due to the lack of event-level fault labels and maintenance-confirmed source annotations in public datasets. As a result, the current diagnosis stage should be interpreted as attribution-guided operator support rather than an automated fault classification system. Third, the present work focuses solely on one-step-ahead forecasting, which restricts its practical application in scenarios requiring longer-horizon predictive insights. In addition, although DS-MHA is positioned relative to feature-aware reweighting and branch-based semantic grouping methods at the conceptual level, the current experiments do not include a direct implementation-level comparison against all such closely related architectural variants. Therefore, the empirical advantages and limitations of the proposed head-level formulation still need to be clarified more systematically through targeted benchmarking. Furthermore, while the synthetic anomaly injection protocol provides a controlled and reproducible evaluation framework, the injected anomalies may not fully represent the diversity of real-world building faults. Future work should validate the detector on datasets with maintenance-confirmed fault records when such datasets become available.
To address the above limitations and further improve the framework, several directions for future research are proposed. First, the framework can be extended to multi-step prediction, cross-building transfer learning, and maintenance-record-assisted anomaly source learning, which would improve both its predictive capability and generalizability. Second, while the current regime-based analysis (Section 4.6) demonstrates robust performance across seasons and peak/off-peak conditions, finer-grained regime definitions (e.g., holiday versus teaching periods, extreme weather events) could provide additional insight and are left for future investigation. In addition, future work will include more targeted benchmarking against alternative semantic-separation designs so as to clarify more rigorously the empirical advantages and boundaries of the proposed head-level DS-MHA formulation.

6. Conclusions

This study proposes an explainable smart-building energy forecasting and anomaly diagnosis framework that integrates domain-semantic constrained attention, dual-stream anomaly detection, and SHAP-driven interpretation into a unified pipeline. Rather than introducing a fundamentally new standalone algorithm, the main contribution of the framework lies in explicitly structuring and connecting three practically important components for smart-building energy management: heterogeneous-feature forecasting, multi-scale anomaly evidence generation, and diagnosis-oriented attribution. Specifically, the proposed framework extends semantic grouping ideas to the attention-head level in the forecasting model, combines residual-based short-scale alerting with seasonally aligned long-scale validation in the anomaly detector, and links forecasting outputs with anomaly evidence and SHAP-based explanation in a unified workflow.
Experimental results on the Moose and Cockatoo buildings from the Building Data Genome Project 2 dataset show that the proposed method consistently outperforms ARIMA, LSTM, the standard Transformer, and several recent Transformer-based baselines, including PatchTST, FEDformer, Autoformer, and Informer. The framework achieves MAPE values of approximately 3% on both datasets, while the ablation study further confirms the importance of semantic separation across energy, time, and environmental feature groups. The condition-dependent analysis further demonstrates that the model maintains robust forecasting accuracy across all seasons and load regimes, with MAPE varying by only 0.07 percentage points on Cockatoo and 0.73 percentage points on Moose, confirming the practical reliability of the framework under diverse operating conditions. In addition, the explainability analysis demonstrates that the proposed framework can provide actionable evidence for anomaly understanding and operator-oriented troubleshooting, rather than merely reporting prediction errors or feature rankings.
Furthermore, the anomaly detection module is quantitatively validated through a synthetic anomaly injection benchmark with standard event-level detection metrics. The proposed dual-stream detector achieves F1-scores of 0.386 and 0.452 on the Cockatoo and Moose datasets, respectively, substantially outperforming the 3σ threshold, Isolation Forest, and LSTM-based baselines. Its most notable advantage lies in pattern anomaly detection, where pattern recall reaches 66.7% on Cockatoo and 69.2% on Moose, confirming that the long-scale stream can effectively capture progressive degradation patterns that are largely missed by single-scale methods. A dedicated parameter sensitivity analysis further shows that the detector remains stable across a practical range of parameter settings, with the short-scale threshold multiplier k acting as the primary control lever for balancing detection sensitivity and false-alarm burden, while the long-scale threshold z has only a comparatively limited influence. Together, these results strengthen the practical credibility of the proposed anomaly detection module beyond qualitative case interpretation alone.
Several limitations should also be acknowledged, including the evaluation scope being limited to two educational buildings, the absence of supervised anomaly-source classification due to label unavailability, and the restriction to one-step-ahead forecasting. Future work will extend the framework toward multi-step prediction, cross-building generalization, maintenance-record-assisted anomaly source learning, and validation on datasets with confirmed fault records so as to further improve its robustness and deployment value in real building energy-management systems.

Author Contributions

Conceptualization, Y.C. and B.L.; methodology, Y.C.; software, Y.C.; validation, Y.C., D.L. and B.L.; formal analysis, Y.C.; investigation, Y.C. and D.L.; resources, B.L.; data curation, Y.C.; writing—original draft preparation, Y.C.; writing—review and editing, D.L. and B.L.; visualization, Y.C.; supervision, B.L.; project administration, B.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are publicly available from the Building Data Genome Project 2 (BDG2) dataset (https://github.com/buds-lab/building-data-genome-project-2 (accessed on 3 April 2026)). The processed data and all associated code for the study are openly accessible in the GitHub repository at https://github.com/cai1157/Explainable-Smart-Building-Energy-Consumption-Forecasting (accessed on 3 April 2026).

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT (OpenAI, GPT-4o) to identify publicly available datasets and for English language polishing. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. United Nations Environment Programme (UNEP); Global Alliance for Buildings and Construction (GlobalABC). Not Just Another Brick in the Wall: The Solutions Exist—Scaling Them Will Build on Progress and Cut Emissions Fast; Global Status Report for Buildings and Construction 2024/2025; United Nations Environment Programme: Paris, France, 2025. [Google Scholar]
  2. Li, Y.; Arellano-Espitia, F.; Aler, R.; Igualada, L.; Corchero, C. Data-driven methods and their applications to building HVAC energy consumption prediction: A review. J. Build. Eng. 2025, 116, 114612. [Google Scholar] [CrossRef]
  3. Gao, Y.; Shi, S.; Miyata, S.; Akashi, Y. Successful application of predictive information in deep reinforcement learning control: A case study based on an office building HVAC system. Energy 2024, 291, 130344. [Google Scholar] [CrossRef]
  4. Cavus, M.; Ayan, H.; Dissanayake, D.; Sharma, A.; Deb, S.; Bell, M. Forecasting Electric Vehicle Charging Demand in Smart Cities Using Hybrid Deep Learning of Regional Spatial Behaviours. Energies 2025, 18, 3425. [Google Scholar] [CrossRef]
  5. Su, L.; Zuo, X.; Li, R.; Wang, X.; Zhao, H.; Huang, B. A systematic review for transformer-based long-term series forecasting. Artif. Intell. Rev. 2025, 58, 80. [Google Scholar] [CrossRef]
  6. Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting. In Proceedings of the 39th International Conference on Machine Learning (ICML), Baltimore, MD, USA, 17–23 July 2022; Proceedings of Machine Learning Research: Cambridge, MA, USA, 2022; Volume 162, pp. 27268–27286. [Google Scholar]
  7. Zhao, J.; Chu, F.; Xie, L.; Che, Y.; Wu, Y.; Burke, A.F. A survey of transformer networks for time series forecasting. Comput. Sci. Rev. 2026, 60, 100883. [Google Scholar] [CrossRef]
  8. Zheng, P.; Zhou, H.; Liu, J.; Nakanishi, Y. Interpretable building energy consumption forecasting using spectral clustering algorithm and temporal fusion transformers architecture. Appl. Energy 2023, 349, 121607. [Google Scholar] [CrossRef]
  9. Lin, J.; Ma, J.; Zhu, J.; Cui, Y. Short-term load forecasting based on LSTM networks considering attention mechanism. Int. J. Electr. Power Energy Syst. 2022, 137, 107818. [Google Scholar] [CrossRef]
  10. Moveh, S.; Merchán-Cruz, E.A.; Abuhussain, M.; Dodo, Y.A.; Alhumaid, S.; Alhamami, A.H. Deep Learning Framework Using Transformer Networks for Multi Building Energy Consumption Prediction in Smart Cities. Energies 2025, 18, 1468. [Google Scholar] [CrossRef]
  11. Liu, F.T.; Ting, K.M.; Zhou, Z.-H. Isolation-based anomaly detection. ACM Trans. Knowl. Discov. Data 2012, 6, 3. [Google Scholar] [CrossRef]
  12. Himeur, Y.; Ghanem, K.; Alsalemi, A.; Bensaali, F.; Amira, A. Artificial intelligence based anomaly detection of energy consumption in buildings: A review, current trends and new perspectives. Appl. Energy 2021, 287, 116601. [Google Scholar] [CrossRef]
  13. Jia, X.; Xun, P.; Peng, W.; Zhao, B.; Li, H.; Shen, C. Deep anomaly detection for time series: A survey. Comput. Sci. Rev. 2025, 58, 100787. [Google Scholar] [CrossRef]
  14. Chen, J.; Zhang, L.; Li, Y.; Shi, Y.; Gao, X.; Hu, Y. A review of computing-based automated fault detection and diagnosis of heating, ventilation and air conditioning systems. Renew. Sustain. Energy Rev. 2022, 161, 112395. [Google Scholar] [CrossRef]
  15. Takahashi, K.; Ooka, R.; Kurosaki, A. Seasonal threshold to reduce false positives for prediction-based outlier detection in building energy data. J. Build. Eng. 2024, 84, 108539. [Google Scholar] [CrossRef]
  16. Sadeeq, M.A.M. XDL-Energy: Explainable hybrid deep learning framework for smart campus energy forecasting and anomaly justification using DPU-ALDOSKI dataset. Energy Build. 2025, 347, 116305. [Google Scholar] [CrossRef]
  17. Ul Haq, M.S.; Ji, W.; Pei, X.; Liu, S.; Geng, Y.; Lin, B.; Ali, H. Explainable deep learning combined attention-based LSTM for building energy prediction: A framework from the perspective of supply side. Energy Build. 2026, 350, 116638. [Google Scholar] [CrossRef]
  18. Teixeira, B.; Carvalhais, L.; Pinto, T.; Vale, Z. Explainable AI framework for reliable and transparent automated energy management in buildings. Energy Build. 2025, 347, 116246. [Google Scholar] [CrossRef]
  19. Minassian, R.; Mihăiţă, A.S.; Shirazi, A. Optimizing indoor environmental prediction in smart buildings: A comparative analysis of deep learning models. Energy Build. 2025, 327, 115086. [Google Scholar] [CrossRef]
  20. Wang, Z.W.; Qin, Y.J.; Kong, Y.F.; Wang, L.; Leng, Q.; Zhang, C.X. Advanced fault detection, diagnosis and prognosis in HVAC systems: Lifecycle insight, key challenges, and promising approaches. Renew. Sustain. Energy Rev. 2025, 219, 115867. [Google Scholar] [CrossRef]
  21. Miller, C.; Kathirgamanathan, A.; Picchetti, B.; Arjunan, P.; Park, J.Y.; Nagy, Z.; Raftery, P.; Hobson, B.W.; Shi, Z.; Meggers, F. The Building Data Genome Project 2, energy meter data from the ASHRAE Great Energy Predictor III competition. Sci. Data 2020, 7, 368. [Google Scholar] [CrossRef] [PubMed]
Figure 1. The schematic diagram of the proposed explainable framework.
Figure 2. Training/validation loss curves for (a) Cockatoo and (b) Moose.
Figure 3. First 200 prediction steps on the Cockatoo (a) training set and (b) test set.
Figure 4. First 200 prediction steps on the Moose (a) training set and (b) test set.
Figure 5. Distribution of absolute forecasting errors for representative models on (a) Cockatoo and (b) Moose.
Figure 6. Comparison of ablation-study results.
Figure 7. Global feature importance based on SHAP values.
Figure 8. SHAP summary for aggregated features.
Figure 9. SHAP explanation for a single prediction.
Table 1. Comparative summary of representative recent studies and the positioning of the proposed framework.
| Ref. | Model | Application Domain | Dataset | Evaluation Focus | Main Limitation |
|---|---|---|---|---|---|
| [8] | Spectral clustering + Temporal Fusion Transformer | Building energy consumption forecasting | Building energy consumption data | Forecasting accuracy and interpretability | Focuses on interpretable forecasting but does not integrate anomaly detection or diagnosis-oriented evidence |
| [9] | Attention-based LSTM | Short-term load forecasting | Power/load time-series data | Forecasting accuracy | Uses attention for feature emphasis, but semantic interactions among heterogeneous domains remain implicit |
| [10] | Transformer networks | Multi-building energy consumption prediction in smart cities | Multi-building smart-city energy data | Forecasting performance | Strengthens multi-source forecasting but does not explicitly constrain cross-domain semantic interactions or provide integrated anomaly diagnosis |
| [15] | Seasonal threshold / baseline-aligned residual detection | Building energy outlier detection | Building energy data | False-positive reduction under seasonal regime shifts | Improves seasonal robustness but focuses on one detection logic rather than multi-scale anomaly evidence |
| [16] | Explainable hybrid deep learning (XDL-Energy) | Smart-campus energy forecasting and anomaly justification | Smart-campus energy dataset | Forecasting and anomaly justification | Links explanation to anomalies, but does not integrate structured multi-scale anomaly evidence |
| [17] | Explainable attention-based LSTM | Building energy prediction | Building energy prediction data | Transparency and feature-contribution interpretation | Improves interpretability, but the explanation remains attached mainly to forecasting rather than an integrated diagnosis workflow |
| [18] | Explainable AI framework for automated energy management | Building energy management | Building energy management scenarios | Trustworthy operation and decision support | Highlights transparent management, but does not jointly address heterogeneous forecasting and anomaly evidence generation |
| This study | DS-MHA + seasonal-aligned dual-stream detector + SHAP-based attribution | Smart-campus building energy forecasting, anomaly detection, and diagnosis support | BDG2: Moose and Cockatoo buildings | Forecasting accuracy; event-level anomaly detection benchmarking; ablation analysis; SHAP-based explainability | Addresses heterogeneous-feature forecasting, abrupt/gradual anomaly evidence, and diagnosis-oriented attribution within one unified framework |
Table 2. Complete list of model input features used in the forecasting module.
| Category | Features | No. | Description |
|---|---|---|---|
| Environmental features | airTemperature, windSpeed | 2 | Outdoor weather variables used to characterize environmental conditions affecting building electricity demand. |
| Weekday encoding | weekday_0–weekday_6 | 7 | One-hot encoded weekday indicators representing Monday to Sunday. |
| Weekend indicator | is_weekend_1 | 1 | Binary indicator showing whether the timestamp falls on a weekend. |
| Hour encoding | hour_0–hour_23 | 24 | One-hot encoded hour-of-day indicators representing 24 hourly positions. |
| Season encoding | season_0–season_3 | 4 | One-hot encoded seasonal indicators representing spring, summer, autumn, and winter. |
| Historical energy features | lag_1h, lag_24h, lag_168h, roll_mean_24h, roll_max_24h | 5 | Lagged and rolling statistics derived from historical electricity consumption to capture short-term, daily, and weekly temporal patterns. |
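The lagged, rolling, and calendar encodings in Table 2 can be reproduced with standard pandas operations. The sketch below is illustrative rather than the paper's exact preprocessing: the season mapping (season_0 = spring through season_3 = winter) and the choice to exclude the current hour from the 24 h rolling statistics are assumptions.

```python
import numpy as np
import pandas as pd

def build_features(load: pd.Series, weather: pd.DataFrame) -> pd.DataFrame:
    """Build the Table 2 feature set (43 columns) for an hourly load series.

    `load` is hourly electricity demand (kW) with a DatetimeIndex; `weather`
    carries airTemperature and windSpeed on the same index.
    """
    idx = load.index
    df = weather[["airTemperature", "windSpeed"]].copy()

    def onehot(values, n, prefix):
        # Fix the category set so every indicator column exists even when a
        # category is absent from the sample period.
        cat = pd.Series(pd.Categorical(values, categories=range(n)), index=idx)
        return pd.get_dummies(cat, prefix=prefix)

    df = df.join(onehot(idx.dayofweek, 7, "weekday"))            # weekday_0..6
    df["is_weekend_1"] = (idx.dayofweek >= 5).astype(int)
    df = df.join(onehot(idx.hour, 24, "hour"))                   # hour_0..23
    # Assumed mapping: 0 = spring (Mar-May) ... 3 = winter (Dec-Feb).
    df = df.join(onehot((idx.month - 3) % 12 // 3, 4, "season"))

    # Historical energy features; shift(1) keeps the rolling window strictly
    # in the past so no feature leaks the target value.
    df["lag_1h"] = load.shift(1)
    df["lag_24h"] = load.shift(24)
    df["lag_168h"] = load.shift(168)
    df["roll_mean_24h"] = load.shift(1).rolling(24).mean()
    df["roll_max_24h"] = load.shift(1).rolling(24).max()
    return df.dropna()
```

With hourly data, the first 168 rows are dropped because the weekly lag is undefined there.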
Table 3. Global explanation outputs and how they are used in deployment.
| Output | Decision Supported | Why It Matters in Campus Buildings |
|---|---|---|
| Global importance (mean \|SHAP\|) | Data QA prioritization; driver-focused threshold calibration; feature selection for refinement | Campus loads are affected by heterogeneous drivers. Prioritizing key drivers improves reliability and reduces troubleshooting time |
| SHAP summary (aggregated-feature beeswarm) | Contribution-distribution inspection; direction and variability of driver effects; consistency check across samples | Campus operation is highly schedule-driven and weather-sensitive; sample-level contribution patterns help interpret how major drivers influence forecasts under different operating contexts |
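The "mean |SHAP|" aggregation behind the global-importance output in Table 3 reduces per-sample attributions to a single feature ranking. A minimal sketch, assuming the per-sample SHAP matrix has already been computed (for example with the shap package):

```python
import numpy as np

def global_importance(shap_values, feature_names):
    """Rank features by mean absolute SHAP value across samples.

    `shap_values` is an (n_samples, n_features) attribution matrix; how it
    is produced is outside this sketch.
    """
    scores = np.abs(np.asarray(shap_values, dtype=float)).mean(axis=0)
    order = np.argsort(scores)[::-1]  # descending importance
    return [(feature_names[i], float(scores[i])) for i in order]
```

The resulting ranked list is what a bar chart such as Figure 7 visualizes.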
Table 4. Key statistical characteristics of the Moose and Cockatoo datasets.
| Feature | Moose Dataset | Cockatoo Dataset |
|---|---|---|
| Collection period | 2016-01 to 2017-12 | 2016-04 to 2017-12 |
| Building area (m²) | 14,000 | 13,051.8 |
| Average electrical load (kW) | 308.89 | 167.74 |
| Temperature range (°C) | −28.8 to 33.9 | −20.0 to 33.9 |
Table 5. Comparison of forecasting performance on the Cockatoo_data dataset.
| Model | MAE (kW) | RMSE (kW) | MAPE (%) |
|---|---|---|---|
| ARIMA | 19.36 | 24.62 | 10.48 |
| LSTM | 12.69 | 15.66 | 8.42 |
| PatchTST | 6.50 | 8.49 | 4.14 |
| FEDformer | 5.67 | 7.58 | 3.63 |
| Autoformer | 12.05 | 16.87 | 7.64 |
| Informer | 9.63 | 12.17 | 6.37 |
| Standard Transformer | 6.14 | 7.63 | 3.79 |
| Multi-head Transformer | 4.84 | 6.16 | 2.99 |
Table 6. Comparison of forecasting performance on the Moose_data dataset.
| Model | MAE (kW) | RMSE (kW) | MAPE (%) |
|---|---|---|---|
| ARIMA | 43.98 | 48.19 | 15.37 |
| LSTM | 36.34 | 42.70 | 10.56 |
| PatchTST | 10.67 | 13.76 | 3.05 |
| FEDformer | 23.41 | 27.68 | 6.69 |
| Autoformer | 28.15 | 34.36 | 8.32 |
| Informer | 38.91 | 44.17 | 11.26 |
| Standard Transformer | 21.92 | 26.86 | 6.31 |
| Multi-head Transformer | 10.39 | 12.62 | 3.00 |
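The error metrics reported in Tables 5 and 6 follow their standard definitions; a minimal implementation (assuming strictly positive ground-truth loads for MAPE, which holds for kW-scale building demand):

```python
import numpy as np

def forecast_metrics(y_true, y_pred):
    """Return (MAE, RMSE, MAPE %) for a forecast against ground truth."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err ** 2)))
    # MAPE is undefined at zero load; building demand is assumed positive.
    mape = float(100.0 * np.mean(np.abs(err) / np.abs(y_true)))
    return mae, rmse, mape
```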
Table 7. Ablation study on the Cockatoo_data test set.
| Configuration | MAE (kW) | RMSE (kW) | MAPE (%) | R² |
|---|---|---|---|---|
| Full model | 4.838 | 6.165 | 3.00 | 0.891 |
| Without the time head | 5.634 | 7.020 | 3.52 | 0.859 |
| Without the environment head | 5.720 | 7.225 | 3.58 | 0.851 |
| Without the energy head | 6.514 | 8.690 | 4.14 | 0.784 |
Table 8. Quantitative anomaly detection benchmark using synthetic anomaly injection (event-level metrics).
| Dataset | Detector | Precision | Recall | F1 | FAR | Point Recall | Pattern Recall |
|---|---|---|---|---|---|---|---|
| Cockatoo | Dual-Stream | 0.347 | 0.436 | 0.386 | 0.776 | 36.7% (11/30) | 66.7% (6/9) |
| Cockatoo | 3σ Threshold | 0.069 | 0.436 | 0.119 | 0.761 | 56.7% (17/30) | 0.0% (0/9) |
| Cockatoo | Isolation Forest | 0.067 | 0.436 | 0.116 | 0.677 | 43.3% (13/30) | 44.4% (4/9) |
| Cockatoo | LSTM + 3σ | 0.059 | 0.333 | 0.101 | 0.731 | 43.3% (13/30) | 0.0% (0/9) |
| Moose | Dual-Stream | 0.333 | 0.702 | 0.452 | 0.657 | 70.6% (24/34) | 69.2% (9/13) |
| Moose | 3σ Threshold | 0.075 | 0.468 | 0.129 | 0.713 | 64.7% (22/34) | 0.0% (0/13) |
| Moose | Isolation Forest | 0.108 | 0.447 | 0.174 | 0.667 | 47.1% (16/34) | 38.5% (5/13) |
| Moose | LSTM + 3σ | 0.066 | 0.340 | 0.111 | 0.665 | 47.1% (16/34) | 0.0% (0/13) |
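Event-level scoring as in Table 8 requires a convention for matching flagged timesteps to injected anomaly events. The sketch below uses one plausible convention — an event counts as detected if any of its timesteps is flagged, while precision is computed at the point level — which may differ from the exact protocol used in the benchmark:

```python
def event_level_metrics(flags, events):
    """Precision/recall/F1 under a simple overlap-matching rule.

    `flags` is a boolean per-timestep detection mask; `events` is a list of
    (start, end) index ranges (inclusive) of injected anomalies. Assumed
    convention: an event is detected if any of its steps is flagged, and a
    flagged step outside every event counts as a false positive.
    """
    detected = sum(any(flags[s:e + 1]) for s, e in events)
    in_event = [False] * len(flags)
    for s, e in events:
        for t in range(s, e + 1):
            in_event[t] = True
    tp_points = sum(f and m for f, m in zip(flags, in_event))
    fp_points = sum(f and not m for f, m in zip(flags, in_event))
    precision = tp_points / max(tp_points + fp_points, 1)
    recall = detected / max(len(events), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1
```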
Table 9. Sensitivity analysis of dual-stream detector parameters (F1-score/Precision/FAR).
| k | z | Cockatoo F1 | Cockatoo Prec. | Cockatoo FAR | Moose F1 | Moose Prec. | Moose FAR |
|---|---|---|---|---|---|---|---|
| 2.0 | 1.5 | 0.247 | 0.156 | 0.823 | 0.377 | 0.244 | 0.756 |
| 2.0 | 2.0 | 0.229 | 0.144 | 0.830 | 0.345 | 0.218 | 0.749 |
| 2.0 | 2.5 | 0.227 | 0.142 | 0.832 | 0.289 | 0.181 | 0.745 |
| 2.0 | 3.0 | 0.212 | 0.132 | 0.818 | 0.274 | 0.169 | 0.746 |
| 2.5 | 1.5 | 0.295 | 0.200 | 0.809 | 0.420 | 0.284 | 0.724 |
| 2.5 | 2.0 | 0.276 | 0.186 | 0.814 | 0.369 | 0.243 | 0.723 |
| 2.5 | 2.5 | 0.276 | 0.186 | 0.814 | 0.317 | 0.206 | 0.729 |
| 2.5 | 3.0 | 0.256 | 0.171 | 0.795 | 0.303 | 0.195 | 0.732 |
| 3.0 | 1.5 | 0.319 | 0.237 | 0.800 | 0.427 | 0.299 | 0.709 |
| 3.0 | 2.0 | 0.295 | 0.217 | 0.807 | 0.386 | 0.264 | 0.705 |
| 3.0 | 2.5 | 0.295 | 0.217 | 0.807 | 0.328 | 0.221 | 0.713 |
| 3.0 | 3.0 | 0.274 | 0.200 | 0.800 | 0.317 | 0.211 | 0.718 |
| 3.5 | 1.5 | 0.364 | 0.300 | 0.783 | 0.476 | 0.350 | 0.670 |
| 3.5 | 2.0 | 0.337 | 0.274 | 0.790 | 0.430 | 0.306 | 0.667 |
| 3.5 | 2.5 | 0.337 | 0.274 | 0.790 | 0.364 | 0.254 | 0.678 |
| 3.5 | 3.0 | 0.311 | 0.250 | 0.781 | 0.353 | 0.244 | 0.683 |
| 4.0 | 1.5 | 0.419 | 0.383 | 0.766 | 0.504 | 0.380 | 0.652 |
| 4.0 | 2.0 | 0.386 | 0.347 | 0.776 | 0.452 | 0.333 | 0.657 |
| 4.0 | 2.5 | 0.386 | 0.347 | 0.776 | 0.379 | 0.274 | 0.670 |
| 4.0 | 3.0 | 0.360 | 0.320 | 0.780 | 0.367 | 0.261 | 0.676 |
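The k parameter in Table 9 scales an adaptive residual threshold in the short-term detection stream. A minimal sketch of that idea (the seasonal-baseline stream governed by z is omitted, and the 168 h history window is an assumption):

```python
import numpy as np

def short_term_flags(residuals, k=3.0, window=168):
    """Flag abrupt anomalies where |residual| exceeds k times a rolling std.

    `residuals` are forecasting errors (actual minus predicted load). The
    threshold adapts to local residual variability over the trailing
    `window` hours; this illustrates only the k-scaled stream, not the
    full dual-stream detector.
    """
    residuals = np.asarray(residuals, dtype=float)
    flags = np.zeros(len(residuals), dtype=bool)
    for t in range(window, len(residuals)):
        sigma = residuals[t - window:t].std()
        if sigma > 0 and abs(residuals[t]) > k * sigma:
            flags[t] = True
    return flags
```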
Table 10. Condition-dependent forecasting performance on the test sets.
| Condition | Cockatoo MAE (kW) | Cockatoo RMSE (kW) | Cockatoo MAPE (%) | Moose MAE (kW) | Moose RMSE (kW) | Moose MAPE (%) |
|---|---|---|---|---|---|---|
| Spring | 5.24 | 6.60 | 3.39 | 9.02 | 11.58 | 2.87 |
| Summer | 6.28 | 8.04 | 3.36 | 7.86 | 10.27 | 3.02 |
| Autumn | 5.79 | 7.49 | 3.42 | 7.00 | 9.21 | 2.29 |
| Winter | 5.17 | 6.72 | 3.43 | 10.05 | 12.63 | 2.93 |
| Peak | 6.44 | 8.19 | 3.33 | 8.79 | 11.12 | 2.35 |
| Off-peak | 5.37 | 6.91 | 3.43 | 8.32 | 10.91 | 2.95 |

Cai, Y.; Liao, D.; Liu, B. Explainable Smart-Building Energy Consumption Forecasting and Anomaly Diagnosis Framework Based on Multi-Head Transformer and Dual-Stream Detection. Appl. Sci. 2026, 16, 3836. https://doi.org/10.3390/app16083836


