1. Introduction
Buildings account for approximately 30% of global final energy consumption and 26% of energy-related greenhouse gas emissions [1, 2, 3], making operational energy performance assessment a critical lever for decarbonization. Over the past two decades, building energy benchmarking systems, from the U.S. ENERGY STAR Portfolio Manager [4] to the EU Energy Performance of Buildings Directive (EPBD) [5] and ASHRAE Building EQ [6], have become standard tools for portfolio-level performance evaluation. These systems predominantly rely on annual aggregate metrics, most commonly Energy Use Intensity (EUI), to rank buildings against peer populations or regulatory thresholds.
Existing regulatory frameworks reinforce this annual-metric paradigm. ENERGY STAR [4], the EPBD [5], and ASHRAE Standard 100 all evaluate buildings based on annual aggregate energy use without assessing whether hourly or daily operational patterns are consistent with efficient operation, creating a systematic blind spot for temporal anomalies.
Despite the proliferation of benchmarking frameworks, a fundamental gap persists: current systems evaluate how much energy a building consumes over a year but remain structurally blind to how that energy is consumed across temporal dimensions. A building that offsets daytime efficiency gains with nighttime waste, or one that follows erratic scheduling patterns invisible in annual aggregates, receives the same score as a genuinely well-operated peer. This paper addresses this gap by proposing a hierarchical diagnostic framework that integrates annual efficiency assessment with temporal pattern evaluation using zero-shot prediction error from a population-trained forecasting model.
1.1. The Limitations of Annual Energy Benchmarking
The ENERGY STAR framework [4] compresses dynamic operational behavior into a single annual scalar and is therefore structurally incapable of detecting buildings that achieve low annual EUI through offsetting patterns or that appear inefficient despite consistent operations.
Subsequent advances, namely EnergyStar++ [7] with gradient-boosted trees and SHAP interpretability, the Benchmark 8760 initiative [8] advocating hourly benchmarking, and Piscitelli et al.’s [9] multi-KPI framework on BDG-2, have partially addressed these gaps, but none provides a single temporal pattern metric benchmarked against population norms.
1.2. The Two-Dimensional EUI × CV Framework and Its Limitations
Combining EUI with the Coefficient of Variation (CV = σ/μ) in a two-dimensional framework [10] introduces a temporal variability dimension. However, CV is self-referential: it measures variability relative to a building’s own statistics, providing no information about whether observed variability is typical or atypical relative to comparable buildings. Under CV alone, a sports stadium with high CV due to event scheduling is indistinguishable from an office with anomalous HVAC cycling.
These limitations motivate replacing self-referenced CV with a population-referenced pattern consistency metric—one that measures deviation not from a building’s own history but from what buildings of its type, size, and climate context typically do. This is precisely what zero-shot prediction error from a population-trained model provides.
1.3. Foundation Models as Population-Referenced Pattern Benchmarks
The emergence of large-scale pre-trained foundation models for time series [11, 12, 13] has enabled zero-shot inference across diverse domains without task-specific fine-tuning. In the building energy domain, NREL’s BuildingsBench [14] provides a suite of Transformer-based models pre-trained on 900,000 simulated building load profiles, demonstrating competitive zero-shot performance on real buildings (Section 2.3).
We recognize a critical and previously unexploited property of zero-shot prediction error from such a population-trained model: when the model, trained to represent the joint distribution of building energy patterns across a diverse, statistically calibrated population, fails to predict a building’s consumption, that failure reflects the building’s deviation from population-representative operational norms. A building whose consumption pattern lies well within the learned manifold of typical buildings will have low CVRMSE; a building exhibiting genuinely unusual patterns will have high CVRMSE, not because the model is poorly calibrated, but because the building’s patterns are population-atypical.
1.4. Research Objectives and Contributions
Against this backdrop, this paper makes the following contributions:
Theoretical framing: We formally establish zero-shot CVRMSE from a population-trained forecasting model as a population-referenced pattern consistency metric that is theoretically and empirically distinct from both CV (self-referenced variability) and EUI (aggregate efficiency). We develop the conceptual justification for interpreting model prediction error as building-level diagnostic information rather than model accuracy information.
Empirical independence demonstration: We demonstrate across 611 real buildings (BDG-2) that EUI and CVRMSE are near-orthogonal (r = −0.029 for all buildings; r = −0.082, R² = 0.007 for 583 CBECS-mapped buildings), establishing their validity as independent diagnostic dimensions and confirming that neither is a proxy for the other.
Hierarchical diagnostic framework: We develop and validate a three-level hierarchical evaluation framework, comprising (L1) EUI × CVRMSE quadrant classification, (L2) CVRMSE decomposition into CV-driven and genuinely atypical components, and (L3) NMBE directional analysis, that progressively refines diagnosis from building-level classification to actionable, hourly-resolution intervention recommendations.
ENERGY STAR blind spot quantification: Benchmarking against CBECS 2018 [15] Table C14 median EUI thresholds, we empirically quantify that 64.7% of ENERGY STAR-certifiable buildings (EUI Score ≥ 75, n = 85) exhibit temporal operational irregularities undetectable by annual EUI alone. Conversely, a substantial proportion of buildings with above-median EUI exhibit consistent temporal patterns, indicating structural rather than operational inefficiency. We characterize four systematic types of evaluation reversal when CVRMSE is added to the assessment.
2. Related Work
2.1. Building Energy Benchmarking Systems: From Annual to Temporal
Building energy benchmarking has evolved from simple EUI league tables to regression-based systems that control for building characteristics. The ENERGY STAR Portfolio Manager [4] uses weighted least squares (WLS) regression on CBECS survey data to predict building-type-specific EUI given floor area, climate zone, operating hours, worker density, and plug load intensity, producing percentile scores adjusted for these confounders. The framework has achieved substantial policy uptake: over 240,000 U.S. commercial properties are benchmarked annually, and ENERGY STAR benchmarking is mandatory for large buildings in many U.S. cities under benchmarking ordinances and building performance standards.
However, fundamental critiques of single-indicator benchmarking have accumulated. Bordass [16] argued that single indicators systematically mislead because they reduce multidimensional performance to an oversimplified scalar that obscures the distinct contributions of building systems, occupancy, and operations. Scofield [17] demonstrated empirically that LEED certification does not reliably predict metered energy savings in practice, partly because annual EUI masks occupancy variability and operational dynamics, a critique equally applicable to any single-metric annual benchmarking system including ENERGY STAR. ASHRAE’s Building EQ [6] and the EU’s Smart Readiness Indicator [5] have moved toward multi-dimensional assessment, but these require extensive manual data collection that constrains deployment at scale.
Attempts to introduce temporal resolution into building performance assessment have followed several streams. The IMT’s Benchmark 8760 initiative [8] explicitly called for 8760 h benchmarking as necessary for capturing demand flexibility, grid interaction, and occupant comfort, but without prescribing a specific methodology. Granderson et al. [18] developed statistical methods for 15 min to hourly baseline modeling under ASHRAE Guideline 14 [19] for measurement and verification (M&V) applications, establishing CVRMSE and NMBE as standard accuracy metrics in that context. Our work repurposes these M&V metrics in a benchmarking context, inverting the interpretation: rather than measuring how accurately a model predicts a building (M&V view), we use the prediction error magnitude to characterize how atypical a building is relative to a population (benchmarking view).
2.2. Multi-Dimensional and Pattern-Based Frameworks
Park and Miller [20] proposed load shape benchmarking using archetypal profiles derived from k-means clustering on daily normalized load curves across a large, diverse building dataset, conceptually demonstrating that hourly pattern information adds value beyond annual metrics. However, clustering-based approaches provide no connection to a reference population benchmark: cluster membership reveals “which pattern archetype” but not “how far from typical”.
Andrews and Jain [21] introduced a two-dimensional framework combining energy efficiency with demand flexibility scores via k-medoids clustering on sub-hourly data from New York City’s Local Law 97 disclosure dataset. This framework captures the grid interaction dimension that ENERGY STAR misses, but flexibility is distinct from temporal consistency and requires sub-hourly resolution data not universally available.
Piscitelli et al. [9] applied the most rigorous multi-KPI temporal approach to BDG-2, extracting time series-based KPIs covering thermal sensitivity, load shape, volatility, and operational schedules. However, the approach requires domain expertise for KPI design, uses peer-relative rather than population-fixed references, and does not distinguish building-type-typical from genuinely anomalous patterns. Our single-metric CVRMSE approach addresses these limitations through the model’s learned representations.
The EUI × CV quadrant framework [10] is the direct conceptual predecessor to ours: we extend it by replacing CV with population-referenced CVRMSE as the second axis, and by introducing the CVRMSE decomposition (Level 2) that resolves the fundamental ambiguity in CV-based classification.
2.3. Foundation Models for Building Energy Time Series
Large-scale pre-trained foundation models for time series have advanced rapidly, demonstrating that representations learned from diverse multi-domain corpora support competitive zero-shot inference on unseen tasks and domains [11, 12, 13]. In the building energy domain, NREL’s BuildingsBench [14] established the first standardized benchmark, providing both a 900K-building simulation corpus and a diverse set of real-world evaluation datasets including BDG-2.
The TransformerWithGaussian-L model, which serves as the zero-shot forecasting backbone of this study, was shown by Emami et al. [14] to achieve state-of-the-art zero-shot performance on multiple real building datasets, outperforming persistence baselines and competing with building-specific ARIMA and LSTM models despite no access to target building data. This model is selected because it is, to the best of our knowledge, the only publicly available foundation model pre-trained specifically on building energy load data with CBECS-representative population statistics, a property essential for interpreting CVRMSE as population-referenced atypicality. General-purpose time series models (Chronos [12], MOMENT [11], TimesFM [13]) lack this domain-specific training and cannot provide the population-calibrated reference distribution the proposed framework requires. Pre-trained weights are publicly available, enabling full reproducibility.
The BuildingsBench evaluation framework includes transfer learning benchmarks where pre-trained models are fine-tuned on target buildings with limited data [14], and broader applications of foundation models in energy systems (including anomaly detection via prediction residuals and occupancy estimation) are an active area of research. However, none of this work uses the zero-shot prediction error magnitude itself as a building performance benchmarking metric; this is the distinctive contribution of the present study.
2.4. Positioning and Distinction from FDD
Compared to ENERGY STAR (annual EUI, CBECS regression, requires area/hours/workers) [4], EnergyStar++ (annual EUI, full metadata) [7], Benchmark 8760 (hourly, no metric specified) [8], EUI × CV quadrant (annual + self-referential CV) [10], and Piscitelli et al. (hourly multi-KPI, peer-relative, expert design) [9], the proposed framework uniquely combines hourly resolution, a 900K-building population reference, minimal inputs (hourly load + lat/lon), and four-dimensional diagnostics (efficiency + pattern + cause + direction).
The proposed framework is distinct from fault detection and diagnosis (FDD) approaches [22, 23], which diagnose specific equipment-level faults using detailed BMS subsystem data. The framework operates at the whole-building portfolio screening level using only aggregate meter data, identifying which buildings exhibit population-atypical temporal patterns to prioritize for detailed investigation. The two approaches are complementary.
3. Data
3.1. Evaluation Dataset: Building Data Genome Project 2 (BDG-2)
3.1.1. Dataset Overview
The Building Data Genome Project 2 (BDG-2) [24] is one of the largest publicly available collections of real building energy meter data, released in conjunction with the ASHRAE Great Energy Predictor III (GEPIII) competition. The full dataset contains 3053 meters from 1636 buildings across multiple sites in North America and Europe, covering electricity, chilled water, hot water, and steam meters.
For this study, the following inclusion criteria are applied:
Electricity meters only: Electricity is the only energy carrier available for all buildings; other carriers have partial coverage and require load conversion assumptions.
Valid zero-shot predictions: Buildings where Box–Cox normalization succeeded (positive mean load) and the model produced no NaN or infinite prediction sequences.
Sufficient prediction timesteps: A minimum of 8000 valid hourly observation-prediction pairs per building, ensuring reliable CVRMSE and NMBE estimation (equivalent to approximately 333 days).
Floor area availability: Required for EUI computation (Level 1).
These criteria yield a final study dataset of 611 buildings with 9,247,992 total observation-prediction timestep pairs, spanning four geographic sites (Bear, Fox, Rat, Panther) across North American climate zones 2C through 6A.
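The four inclusion criteria can be read as a single per-building predicate. The sketch below is illustrative only: the `BuildingRecord` class and its field names are hypothetical and do not correspond to the BDG-2 schema; only the 8000-pair minimum comes from the text.

```python
from dataclasses import dataclass
from typing import Optional

MIN_VALID_PAIRS = 8000  # minimum hourly observation-prediction pairs (from the text)


@dataclass
class BuildingRecord:
    """Hypothetical per-building summary; field names are illustrative only."""
    meter_type: str
    mean_load_kwh: float
    predictions_finite: bool      # no NaN or infinite values in the prediction sequence
    n_valid_pairs: int
    floor_area_sqft: Optional[float]


def is_eligible(b: BuildingRecord) -> bool:
    """Apply the four inclusion criteria in the order listed above."""
    electricity_only = b.meter_type == "electricity"
    valid_predictions = b.mean_load_kwh > 0 and b.predictions_finite
    enough_timesteps = b.n_valid_pairs >= MIN_VALID_PAIRS
    has_floor_area = b.floor_area_sqft is not None and b.floor_area_sqft > 0
    return electricity_only and valid_predictions and enough_timesteps and has_floor_area
```

Treating a positive mean load as the proxy for a successful Box–Cox fit mirrors the parenthetical in the criteria; a real pipeline would check the transformation itself.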
3.1.2. Study Dataset Characteristics
The dataset (Table 1) is dominated by education buildings (41.6%), reflecting the composition of the BDG-2 sites. All four sites are in North America, providing climate diversity (marine, humid subtropical, humid continental, subarctic) without requiring cross-continental generalization claims.
Floor area (100.0%) and building type (98.7%) are available for nearly all buildings. Operating hours, a required input for ENERGY STAR’s WLS regression, are entirely unavailable in BDG-2, motivating the CBECS population-referenced z-score approach for EUI scoring, which requires only floor area and building type and no operational metadata.
4. Methodology: Three-Level Hierarchical Evaluation Framework
4.1. Pre-Trained Model: BuildingsBench TransformerWithGaussian-L
4.1.1. Architecture
TransformerWithGaussian-L is a causal Transformer encoder [25] from the BuildingsBench model family [14] that outputs Gaussian predictive distributions over hourly building load. The architecture employs multi-head self-attention with causal masking, enabling it to process variable-length historical context sequences and produce 24-step-ahead probabilistic forecasts. Geographic context (latitude, longitude) is provided as static auxiliary features concatenated to the learned time series embeddings, enabling the model to implicitly condition on climate zone without explicit climate variable inputs.
4.1.2. Training Data: Buildings-900K
The model was pre-trained on Buildings-900K, a corpus of 900,000 one-year (8760 h) hourly load profiles generated using EnergyPlus building energy simulation software. The corpus is parameterized to reflect the statistical distribution of the U.S. commercial building stock as characterized by CBECS [26]:
Building types: A total of 11 commercial building archetypes (office, retail, warehouse, education, hotel, healthcare, food service, food sales, strip mall, religious worship, miscellaneous).
Climate zones: A total of 16 U.S. climate zones (ASHRAE 169-2013, spanning 2A through 8A).
Vintage: Representative construction vintages from pre-1980 through 2018, with HVAC system types and envelope properties calibrated accordingly.
Schedules: ASHRAE Standard 90.1 reference schedules for each building type, providing typical occupancy, lighting, and equipment load profiles.
CBECS sampling: Building geometry, floor area, number of floors, and system type assignments are drawn according to CBECS survey weights to ensure the 900K corpus represents the actual U.S. commercial stock distribution.
This training procedure means the model learns not individual building behaviors but the statistical manifold of energy patterns consistent with CBECS-characterized U.S. commercial building stock—the joint distribution over diurnal shape, seasonal variation, climate response, and magnitude-to-variability relationships. This is the theoretical foundation for interpreting CVRMSE as population-referenced atypicality.
4.1.3. Zero-Shot Inference Configuration
All predictions in this study are generated in strict zero-shot mode—no fine-tuning on BDG-2 data whatsoever. Model configuration:
Input context length: 168 h (7 days of historical consumption).
Prediction horizon: 24 h.
Sliding window stride: 24 h.
Input features: (1) Energy consumption time series, (2) latitude and longitude as static auxiliary inputs.
Load normalization: Box–Cox transformation with λ estimated independently per building from its historical data, applied before model input and inverted for post-prediction metric computation.
Output: Gaussian predictive distribution (μ, σ²) at each prediction step; the mean prediction μ is used for CVRMSE and NMBE computation.
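The window geometry implied by this configuration can be sketched as follows. This is a minimal illustration, not the BuildingsBench data loader: the function name and the exact alignment of the first window are assumptions. It shows that a 24 h stride with a 24 h horizon yields non-overlapping forecast blocks, so each hour after the first 168 is predicted exactly once.

```python
CONTEXT_H = 168   # input context length (7 days), from the configuration above
HORIZON_H = 24    # prediction horizon
STRIDE_H = 24     # sliding window stride


def window_indices(n_hours):
    """Return (context_start, context_end, horizon_end) index triples.

    Context covers [context_start, context_end) and the forecast covers
    [context_end, horizon_end). Assumes the first window starts at hour 0.
    """
    windows = []
    start = 0
    while start + CONTEXT_H + HORIZON_H <= n_hours:
        context_end = start + CONTEXT_H
        windows.append((start, context_end, context_end + HORIZON_H))
        start += STRIDE_H
    return windows
```

Under these assumptions, one full year of 8760 hourly readings gives 358 windows and 8592 predicted hours, which is consistent with the 8000-pair minimum used as an inclusion criterion.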
4.2. Metric Definitions
Energy Use Intensity (EUI): Annual electricity consumption divided by gross floor area (sqft), yielding kWh/sqft/year. Computed as (mean hourly consumption in kWh × 8760 hours per year)/sqft. Scaling the mean of the available hourly meter readings to a full year is equivalent to summing them when 365 days are present; for buildings with fewer than 365 days of data, it pro-rates the total under the assumption of a uniform seasonal distribution.
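As a small worked illustration of this definition, the following sketch computes EUI from a list of hourly readings. The function name is hypothetical, and the pro-rating is exactly the mean-times-8760 scaling described above.

```python
HOURS_PER_YEAR = 8760


def annual_eui(hourly_kwh, floor_area_sqft):
    """Return electricity EUI in kWh/sqft/year from hourly kWh readings.

    Scaling the mean hourly value to 8760 h pro-rates partial years, which
    assumes the observed hours are seasonally representative.
    """
    if not hourly_kwh or floor_area_sqft <= 0:
        raise ValueError("need at least one reading and a positive floor area")
    mean_hourly_kwh = sum(hourly_kwh) / len(hourly_kwh)
    return mean_hourly_kwh * HOURS_PER_YEAR / floor_area_sqft
```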
Data Consistency Requirement: Electricity-Only Evaluation
This framework evaluates building performance using Advanced Metering Infrastructure (AMI) electricity meter data, which is the most widely deployed high-frequency energy data source in commercial building portfolios. To ensure methodologically consistent evaluation, all EUI calculations and reference benchmarks use electricity consumption only, excluding natural gas, district heat, fuel oil, and other fuel sources.
Rationale for electricity-only evaluation:
AMI data availability: Smart electricity meters (AMI) provide hourly or sub-hourly resolution at scale. Gas metering is typically monthly or daily, and district energy/fuel oil are rarely sub-metered at all. A framework requiring multi-fuel hourly data would exclude 80%+ of buildings that have only electricity AMI.
Foundation model training data: The BuildingsBench TransformerWithGaussian-L model was trained on electricity load profiles from the Buildings-900K dataset [14], which contains simulated hourly electricity consumption. The model has learned electricity consumption patterns, not total site energy patterns. Feeding total site EUI to an electricity-trained model introduces category mismatch.
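Coefficient of Variation (CV): Standard deviation of hourly load divided by mean hourly load over the observation period:
$$\mathrm{CV}_i = \frac{\sigma_i}{\mu_i}$$
where $\sigma_i$ and $\mu_i$ are the standard deviation and mean of the measured hourly load of building $i$. CV is computed from measured consumption data only; no model prediction is involved. CV characterizes a building’s inherent load variability relative to its own mean and is therefore a self-referential metric.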
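A minimal sketch of the CV computation is given below. The use of the population standard deviation (rather than the sample form) is an assumption; over thousands of hourly readings the two are practically identical.

```python
import statistics


def coefficient_of_variation(hourly_load):
    """Return CV = sigma / mu of a measured hourly load series (dimensionless)."""
    mean_load = statistics.fmean(hourly_load)
    if mean_load <= 0:
        raise ValueError("CV is only defined here for a positive mean load")
    # Population standard deviation: an assumption, see the lead-in.
    return statistics.pstdev(hourly_load) / mean_load
```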
Coefficient of Variation of the Root Mean Square Error (CVRMSE): Standardized measure of zero-shot model prediction error:
$$\mathrm{CVRMSE}_i = \frac{1}{\bar{y}_i}\sqrt{\frac{1}{N_i}\sum_{t=1}^{N_i}\left(y_{i,t}-\hat{y}_{i,t}\right)^{2}}\times 100\%$$
where $y_{i,t}$ is measured consumption at timestep $t$ for building $i$, $\hat{y}_{i,t}$ is the model’s zero-shot point prediction, $\bar{y}_i$ is mean measured consumption, and $N_i$ is the number of valid timesteps. CVRMSE normalizes RMSE by the building’s mean load, enabling cross-building comparison of pattern deviation regardless of absolute consumption magnitude.
Normalized Mean Bias Error (NMBE): Systematic directional bias in model predictions:
$$\mathrm{NMBE}_i = \frac{\sum_{t=1}^{N_i}\left(y_{i,t}-\hat{y}_{i,t}\right)}{N_i\,\bar{y}_i}\times 100\%$$
with measured consumption $y_{i,t}$, zero-shot prediction $\hat{y}_{i,t}$, mean measured consumption $\bar{y}_i$, and number of valid timesteps $N_i$. Positive NMBE indicates the building systematically consumes more than the model predicts (model under-predicts actual; the building is over-consuming relative to population expectations). Negative NMBE indicates the building systematically consumes less (the building is under-consuming relative to expectations, potentially indicating efficient operational practices). The NMBE formulation follows ASHRAE Guideline 14 [19] conventions, ensuring consistency with M&V literature.
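The two prediction-error metrics can be sketched together, since they share the same inputs. This is an illustration under stated assumptions: the function names are hypothetical, the divisor is the plain count of valid timesteps, and the sign convention (measured minus predicted) is the one described in the text.

```python
import math


def _check(measured, predicted):
    if len(measured) != len(predicted) or not measured:
        raise ValueError("series must be non-empty and of equal length")


def cvrmse_pct(measured, predicted):
    """RMSE of the prediction normalised by mean measured load, in percent."""
    _check(measured, predicted)
    n = len(measured)
    mean_measured = sum(measured) / n
    mse = sum((y - y_hat) ** 2 for y, y_hat in zip(measured, predicted)) / n
    return 100.0 * math.sqrt(mse) / mean_measured


def nmbe_pct(measured, predicted):
    """Mean signed error (measured - predicted) over mean measured load, in percent.

    Positive values mean the building consumes more than the model predicts.
    """
    _check(measured, predicted)
    n = len(measured)
    mean_measured = sum(measured) / n
    bias = sum(y - y_hat for y, y_hat in zip(measured, predicted))
    return 100.0 * bias / (n * mean_measured)
```

Errors that alternate in sign can produce a large CVRMSE with an NMBE of zero, which is why the two metrics are read together in Level 3.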
Excess CVRMSE (defined in Section 4.5): The portion of a building’s CVRMSE that exceeds what would be predicted from its inherent load variability (CV) alone, capturing the model’s unique contribution to pattern atypicality diagnosis.
4.3. Conceptual Architecture
The proposed framework is designed around three diagnostic questions (Figure 1), each answered with increasing specificity:
Level 1: Is this building performing well on both the aggregate efficiency dimension (EUI) and the temporal pattern consistency dimension (CVRMSE)? → Four-quadrant classification.
Level 2: For buildings with irregular temporal patterns: Is this irregularity inherent to the building’s use type, or does it represent genuine population-level anomaly? → CVRMSE decomposition.
Level 3: For genuinely anomalous buildings: Does the anomaly manifest as over-consumption or under-consumption relative to population expectations? → NMBE directional analysis.
The necessity of each level can be illustrated through diagnostic failures that arise when levels are omitted. A building with EUI Score = 82 but CVRMSE = 35% and NMBE = +12% would be certified as efficient by EUI alone, yet it systematically over-consumes relative to its temporal context. Conversely, a Quadrant D building with CVRMSE = 28% appears problematic, but Level 2 decomposition reveals its high CV places it in the CV_DRIVEN category (Excess CVRMSE = 2 pp), requiring no intervention. Each level is designed to address a diagnostic ambiguity that the prior level cannot capture.
The framework is strictly hierarchical: each level’s question is only meaningful given the prior level’s answer. Applying NMBE analysis to all buildings regardless of CVRMSE classification would produce noisy, uninterpretable signals (a point validated statistically in Section 5.4).
4.4. Level 1: EUI × CVRMSE Quadrant Classification
4.4.1. EUI Score Computation
To enable absolute, population-referenced EUI evaluation, EUI Score is obtained through z-score normalization against CBECS 2018 Table C14 electricity consumption intensity benchmarks [15]. For a building $i$ of building type $k$, the EUI Score is computed as:
$$\mathrm{EUI\ Score}_i = 100\times\Phi\!\left(-\frac{\mathrm{EUI}_i-\tilde{m}_k}{\hat{\sigma}_k}\right)$$
where $\Phi$, $\tilde{m}_k$, and $\hat{\sigma}_k$ are the standard normal cumulative distribution function, the CBECS 2018 Table C14 median electricity EUI for building type $k$ (Table 2, Section 4.4.1), and the estimated standard deviation. The negative sign ensures that lower EUI (more efficient) yields higher scores.
Table 2 presents the electricity consumption intensity reference thresholds directly from CBECS 2018 Table C14. CBECS 2018 is used (rather than 2012) because the evaluation dataset (BDG-2) contains buildings with 2016–2017 meter data, making CBECS 2018 the temporally appropriate reference population.
CBECS 2018 Table C14 (“Electricity consumption and expenditure intensities”) provides comprehensive quartile distributions of electricity intensity by principal building activity, including median, 25th percentile, and 75th percentile values in kWh/sqft. Unlike aggregated consumption tables, C14 directly publishes the intensity distributions needed for benchmarking, eliminating the need for estimation. These values are used directly:
The IQR-based std estimation is robust and assumes an approximately normal distribution within the central 50% of the data—appropriate for population-level EUI distributions that are typically right-skewed but well-behaved in the interquartile range.
Building type coverage: 583 of 611 BDG-2 buildings map to C14 categories. A total of 28 buildings (Parking, Technology, Utility, Other without specific C14 categories) are excluded from EUI scoring but retain Pattern Score evaluation.
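The EUI Score computation described in this subsection can be sketched as below. Two points are assumptions rather than statements from the text: the conversion from interquartile range to standard deviation uses the normal-distribution constant 1.349, and the quartile values in the usage example are made-up placeholders, not CBECS Table C14 figures.

```python
import math

IQR_TO_SIGMA = 1.349  # assumption: IQR of a normal distribution is about 1.349 sigma


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def eui_score(eui, type_median, type_q1, type_q3):
    """Map a building's EUI to a 0-100 score against its type's reference quartiles.

    Lower EUI than the type median gives a score above 50.
    """
    sigma = (type_q3 - type_q1) / IQR_TO_SIGMA
    if sigma <= 0:
        raise ValueError("third quartile must exceed first quartile")
    z = (eui - type_median) / sigma
    return 100.0 * normal_cdf(-z)


# Placeholder quartiles (kWh/sqft) for illustration only.
example_score = eui_score(eui=9.0, type_median=12.0, type_q1=8.0, type_q3=18.0)
```

A building exactly at its type median scores 50, which is the Level 1 classification boundary.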
4.4.2. Pattern Score Computation
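Within-type z-score normalization of CVRMSE:
$$z_i = \frac{\mathrm{CVRMSE}_i-\mu^{\mathrm{CVRMSE}}_{k}}{\sigma^{\mathrm{CVRMSE}}_{k}}$$
where $\mu^{\mathrm{CVRMSE}}_{k}$ and $\sigma^{\mathrm{CVRMSE}}_{k}$ are the mean and standard deviation of CVRMSE across buildings of type $k$.
Pattern Score is mapped from z-score to percentile via the standard normal CDF, with sign inversion so that lower CVRMSE (more consistent pattern) yields higher Pattern Score:
$$\mathrm{Pattern\ Score}_i = 100\times\Phi\left(-z_i\right)$$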
The z-score normalization removes between-type CVRMSE differences driven by inherent building type characteristics: parking garages have structurally low CVRMSE (simple, predictable 24/7 or daytime-only patterns), while public assembly buildings have structurally high CVRMSE (event-driven, highly irregular consumption). The Pattern Score reflects within-type relative pattern consistency—directly analogous to ENERGY STAR’s type-conditional percentile scoring.
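A minimal sketch of the within-type Pattern Score follows. It assumes the z-score uses each type's mean and population standard deviation of CVRMSE; the tuple layout and function names are hypothetical.

```python
import math
from collections import defaultdict
from statistics import fmean, pstdev


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def pattern_scores(buildings):
    """Return {building_id: Pattern Score} from (id, building_type, cvrmse_pct) tuples."""
    by_type = defaultdict(list)
    for _, building_type, cvrmse in buildings:
        by_type[building_type].append(cvrmse)

    scores = {}
    for building_id, building_type, cvrmse in buildings:
        type_mean = fmean(by_type[building_type])
        type_std = pstdev(by_type[building_type])
        # A type with no spread (for example a single building) gets a neutral z of 0.
        z = 0.0 if type_std == 0 else (cvrmse - type_mean) / type_std
        scores[building_id] = 100.0 * normal_cdf(-z)  # sign flip: low CVRMSE scores high
    return scores
```

Because normalization is done per type, a public assembly building with a CVRMSE of 25% can score higher than a parking garage with 12%, if each is compared with its own peers.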
4.4.3. Four-Quadrant Classification
Buildings are classified using Score = 50 as the threshold on both axes, with inclusive inequality (≥50):
| | Pattern Score ≥ 50 (temporally consistent) | Pattern Score < 50 (temporally irregular) |
| EUI Score ≥ 50 (efficient) | Quadrant A: Excellent—Efficient and consistent | Quadrant B: Efficient but Irregular |
| EUI Score < 50 (inefficient) | Quadrant C: Consistent but Inefficient | Quadrant D: Needs Improvement |
Note on boundary handling: A score of exactly 50.0 corresponds to the population median and is classified as “at or above median” (favorable). This inclusive threshold follows the standard percentile convention used in ENERGY STAR (≥75 for certification).
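The quadrant rule, including the inclusive boundary at 50, can be written as a short function. The function name is illustrative; the labels and threshold are those of the table above.

```python
SCORE_THRESHOLD = 50.0  # population median on both axes; boundary counts as favorable


def quadrant(eui_score, pattern_score):
    """Return the Level 1 quadrant label (A, B, C or D) for a pair of scores."""
    efficient = eui_score >= SCORE_THRESHOLD
    consistent = pattern_score >= SCORE_THRESHOLD
    if efficient and consistent:
        return "A"  # Excellent: efficient and consistent
    if efficient:
        return "B"  # Efficient but irregular
    if consistent:
        return "C"  # Consistent but inefficient
    return "D"      # Needs improvement
```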
4.5. Level 2: CVRMSE Decomposition
4.5.1. The CV-CVRMSE Relationship
For buildings in Quadrants B and D (Pattern Score < 50), the elevated CVRMSE may be driven by either: (a) genuinely high inherent load variability (high CV) that naturally challenges prediction regardless of pattern typicality, or (b) a genuinely anomalous pattern that the model cannot capture even accounting for inherent variability.
Distinguishing these cases requires understanding the systematic relationship between CV and CVRMSE across the building population. An ordinary least squares (OLS) regression is fitted across all 611 buildings:
$$\mathrm{CVRMSE}_i = \alpha\,\mathrm{CV}_i + \beta + \varepsilon_i$$
yielding a slope of α = 0.541 with R² = 0.700.
Figure 2 visualizes this regression with the 5 pp Excess CVRMSE threshold that separates CV_DRIVEN from ATYPICAL buildings. This relationship reveals that 70% of cross-building variance in CVRMSE is explained by a building’s inherent load variability (CV). The coefficient α = 0.541 has a clear physical interpretation: for every unit increase in CV (every percentage point increase in load variability relative to mean), the model’s CVRMSE increases by approximately 0.541 percentage points, on average across the population. This is expected: buildings with more variable loads are inherently harder to predict, even if their patterns are population-typical.
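For readers who want to reproduce this kind of fit on their own portfolio, a closed-form simple linear regression is sufficient. The sketch below is a generic OLS implementation with an intercept, not the authors' code; the inclusion of an intercept term is an assumption.

```python
def ols_fit(x, y):
    """Fit y = slope * x + intercept by ordinary least squares.

    Returns (slope, intercept, r_squared).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long series with at least two points")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x has no variance")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return slope, intercept, r_squared
```

Here `x` would hold each building's CV and `y` its CVRMSE, both in percent, so that the slope is directly comparable with the reported 0.541.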
4.5.2. Excess CVRMSE and ATYPICAL Classification
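Excess CVRMSE is defined as:
$$\mathrm{Excess\ CVRMSE}_i = \mathrm{CVRMSE}_i - \widehat{\mathrm{CVRMSE}}_i$$
where $\widehat{\mathrm{CVRMSE}}_i$ is the CVRMSE predicted from the building’s CV by the OLS regression of Section 4.5.1.
Buildings with Excess CVRMSE > 5 percentage points are classified as ATYPICAL (genuine pattern deviation beyond what inherent variability would predict); buildings with Excess CVRMSE ≤ 5 pp are classified as CV_DRIVEN (elevated CVRMSE primarily explained by inherent load variability).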
The 5 pp threshold is selected based on convergent evidence: the IQR outlier fence (Q3 + 1.5 × IQR = 11.6 pp) lies well above the selected threshold; Cohen’s d for |NMBE| separation reaches the “large” effect size (d = 0.88) at 5 pp; and 5 pp is the highest threshold maintaining n ≥ 10 in all Level 3 subcategories. This threshold is sample-specific and requires recalibration on other datasets. A sensitivity analysis varying the Excess CVRMSE threshold from 3 to 10 pp confirms that Cohen’s d exceeds the “large” effect size (d = 0.8) for all thresholds at or above 4 pp (Appendix D, Figure A1).
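The effect size used in this threshold analysis can be computed as in the sketch below. It uses the common pooled-standard-deviation form of Cohen's d; whether the authors used this exact variant is an assumption, and the sample values in the comment are made up.

```python
import math
from statistics import fmean, variance


def cohens_d(group_a, group_b):
    """Cohen's d with pooled sample standard deviation (group_a minus group_b)."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two values")
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (fmean(group_a) - fmean(group_b)) / math.sqrt(pooled_var)


# Illustrative use: |NMBE| values of buildings above vs. at or below a candidate threshold.
# d = cohens_d(abs_nmbe_above_threshold, abs_nmbe_at_or_below_threshold)
```

Repeating the calculation for each candidate threshold between 3 and 10 pp gives the sensitivity curve described above.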
Buildings with Pattern Score ≥ 50 (Quadrants A and C) are classified as NORMAL—no CVRMSE decomposition is needed as their pattern consistency is already established.
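The Level 2 decision rule can be sketched as a single function. The slope 0.541 and the 5 pp threshold come from the text; the intercept value used in the example is a hypothetical placeholder, because the fitted intercept must be taken from the regression itself.

```python
EXCESS_THRESHOLD_PP = 5.0  # sample-specific threshold from the text


def excess_cvrmse(cvrmse_pct, cv_pct, slope, intercept):
    """CVRMSE minus the value predicted from CV by the fitted regression line."""
    return cvrmse_pct - (slope * cv_pct + intercept)


def level2_class(pattern_score, cvrmse_pct, cv_pct, slope, intercept,
                 threshold_pp=EXCESS_THRESHOLD_PP):
    """Return NORMAL, CV_DRIVEN or ATYPICAL for one building."""
    if pattern_score >= 50.0:
        return "NORMAL"  # Quadrants A and C: no decomposition needed
    excess = excess_cvrmse(cvrmse_pct, cv_pct, slope, intercept)
    return "ATYPICAL" if excess > threshold_pp else "CV_DRIVEN"


# Example with a placeholder intercept of 4.0 pp (hypothetical, not a reported value).
label = level2_class(pattern_score=30.0, cvrmse_pct=28.0, cv_pct=40.0,
                     slope=0.541, intercept=4.0)
```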
4.6. Level 3: NMBE Directional Analysis
For ATYPICAL buildings, the model’s systematic prediction error direction (NMBE) provides actionable guidance on the nature of the deviation:
OVER-CONSUMING (NMBE > +2%): The model systematically under-predicts actual consumption. The building consumes more than population-representative expectations for its temporal context. This may indicate after-hours equipment operation, HVAC scheduling failures, plug load proliferation, or other patterns where operational adjustments could reduce consumption toward model expectations.
UNDER-CONSUMING (NMBE < −2%): The model systematically over-predicts. The building consistently consumes less than expected given its temporal context. This potentially indicates efficient operational strategies, though the specific mechanisms cannot be confirmed from meter data alone.
NEUTRAL (|NMBE| ≤ 2%): The pattern is atypical but without directional bias. The model’s errors are symmetric around zero; the atypicality manifests in pattern shape or structure rather than systematic over- or under-consumption. Further investigation of the temporal error structure (hour-of-day NMBE profiles) is warranted.
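The three directional categories reduce to a threshold test on NMBE. The sketch below uses the category names and the ±2% band from the list above; the function name is illustrative, and it is meant to be applied only to buildings already classified as ATYPICAL.

```python
NMBE_BAND_PCT = 2.0  # directional threshold from the text


def nmbe_direction(nmbe_pct, band_pct=NMBE_BAND_PCT):
    """Return the Level 3 label for an ATYPICAL building's NMBE (in percent)."""
    if nmbe_pct > band_pct:
        return "OVER-CONSUMING"   # model under-predicts actual consumption
    if nmbe_pct < -band_pct:
        return "UNDER-CONSUMING"  # model over-predicts actual consumption
    return "NEUTRAL"              # atypical shape without directional bias
```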
The ±2% threshold is informed by the ASHRAE Guideline 14 [19] acceptance criteria for M&V models, which require |NMBE| ≤ 5% and CVRMSE ≤ 15% for monthly models. The more conservative ±2% is adopted to identify only buildings with clear, systematic directional bias.
The restriction to ATYPICAL buildings is not arbitrary; it is statistically validated. As demonstrated in Section 5.4, NMBE is systematically non-zero only for ATYPICAL buildings; for NORMAL and CV_DRIVEN buildings, NMBE is near zero and statistically indistinguishable from noise. Applying NMBE interpretation to non-ATYPICAL buildings would generate spurious recommendations.
4.7. Use of Artificial Intelligence Tools
This study employed Claude (Anthropic, Claude Opus 4.6), a large language model-based AI assistant, to support manuscript drafting, cross-checking computed evaluation metrics, generating figures from processed data, and verifying reference formatting. All core scientific computations, including BuildingsBench zero-shot inference [14], CVRMSE and NMBE computation across 9,247,992 observation–prediction pairs, OLS regression fitting, correlation testing, and hierarchical classification, were performed using Python 3.13.5 with NumPy 2.4.6, SciPy 1.16.3, pandas 2.3.1, scikit-learn 1.5.2, and Matplotlib 3.10.8, without AI involvement. The framework design, threshold selection, and all scientific interpretations reflect the authors’ domain expertise. All AI-generated outputs were critically reviewed and edited by the authors, who assume full responsibility for the content of this work.
5. Results
5.1. Baseline Statistics: CVRMSE Distribution and Its Drivers
Table 3 presents the CVRMSE distribution across building types. Parking garages exhibit the lowest median CVRMSE (7.7%), consistent with their simple, predictable daytime-operation-only patterns. Entertainment and public assembly buildings exhibit the highest median CVRMSE (22.3%), reflecting their event-driven, irregular consumption schedules. This between-type variation motivates the within-type z-score normalization for Pattern Score.
Entertainment/assembly exhibits notably high standard deviation (32.7%, exceeding its mean of 28.9%), reflecting a small number of high-CVRMSE outliers with event-driven consumption; the within-type z-score normalization for Pattern Score accounts for such skewness.
The CV-CVRMSE regression (R² = 0.700) confirms that inherent load variability is the primary driver of cross-building CVRMSE variation. Among the 185 buildings with CVRMSE > 20%, three distinct (non-mutually exclusive) causal mechanisms contribute: high inherent variability (87.6%), small mean load inflating the normalized metric (47.6%), and genuine pattern deviation (33.5%). The detailed causal decomposition is provided in Appendix B.
5.2. EUI and CVRMSE Are Empirically Independent
The Pearson correlation between raw EUI and raw CVRMSE across all 611 buildings is r = −0.029 (p = 0.48); on the 583 CBECS-mapped subset, r = −0.082 (p = 0.047, R² = 0.007). Although the latter is marginally significant, EUI explains less than 1% of CVRMSE variance, confirming near-independence. The correlation between EUI Score (C14 median reference) and Pattern Score across 583 CBECS-mapped buildings is r = −0.291 (p < 0.001). This weak negative association at the score level, which explains only 8.5% of variance, reflects a modest tendency for very energy-intensive buildings to exhibit irregular patterns, possibly due to oversized HVAC systems with excessive cycling. However, the explained variance is far from sufficient for either metric to substitute for the other; the vast majority of variation in each score is orthogonal to the other dimension.
Figure 3 (EUI Score vs. Pattern Score scatter plot, n = 583) illustrates this near-independence, with a relatively uniform distribution of buildings across the two-dimensional score space rather than a diagonal concentration that would indicate collinearity.
This finding has a critical implication: CVRMSE cannot be inferred from EUI, and EUI cannot be inferred from CVRMSE. An assessment based on either dimension alone will systematically miss the information provided by the other. Two buildings with identical EUI Scores may have Pattern Scores spanning the full 0–100 range; two buildings with identical Pattern Scores may have EUI Scores spanning the full range. Both dimensions must be measured.
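As a quick arithmetic check on the "explained variance" figures quoted in this section, the short sketch below squares the reported Pearson coefficients. The r values are copied from the text; the dictionary labels are only illustrative.

```python
# Correlations reported in Section 5.2 (values copied from the text).
reported_r = {
    "raw EUI vs raw CVRMSE, n = 611": -0.029,
    "raw EUI vs raw CVRMSE, n = 583": -0.082,
    "EUI Score vs Pattern Score, n = 583": -0.291,
}

# For a simple linear relationship, the share of variance explained is r squared.
explained = {label: r ** 2 for label, r in reported_r.items()}

for label, share in explained.items():
    print(f"{label}: r^2 = {share:.3f} ({100 * share:.1f}% of variance)")
```

Under this reading, the raw metrics share well under 1% of their variance and the normalized scores about 8.5%, which matches the figures stated above.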
5.3. Level 1: Quadrant Distribution
Table 4 presents the quadrant classification results. Note: 583 buildings represent CBECS-mappable types only (Education, Office, Lodging, Retail, Public Assembly, Public Services, Warehouse, Healthcare, Food Service, Worship). A total of 28 buildings without C14 mapping are excluded (
Section 4.4.1).
Quadrant C (Consistent but Inefficient) is the most populous category at 42.9% (n = 250), substantially higher than when using within-sample relative ranking. This finding reveals a critical insight: when BDG-2 buildings are benchmarked against the CBECS 2018 Table C14 national median rather than each other, over 42% exhibit higher-than-national-median electricity consumption while maintaining operationally consistent patterns. For these buildings, operational interventions (schedule changes, setpoint adjustments, BMS optimization) are unlikely to produce large efficiency gains; the framework correctly directs attention toward capital improvements in building systems or envelope. This distinction—between operational and capital problems—is precisely what EUI-only benchmarking cannot make.
The absolute CBECS 2018 C14 benchmark reveals that 39.3% of BDG-2 buildings perform at or below the national median electricity EUI (EUI Score ≥ 50, n = 229). This is substantially lower than the 50% expected from a nationally representative sample, suggesting that the BDG-2 dataset—drawn from four U.S. university campuses—may have systematically higher electricity intensity than the broader commercial building stock. Possible explanations include building vintage (many campus buildings predate modern energy codes), operational hours (24/7 research facilities), plug load density (laboratory equipment, servers), and HVAC requirements (ventilation for labs).
Quadrant B (18.4%, n = 107) represents the ENERGY STAR blind spot: these buildings appear efficient in annual EUI terms but exhibit irregular temporal patterns. As demonstrated in
Section 5.6, a majority of ENERGY STAR-certifiable buildings (EUI Score ≥ 75) exhibit pattern irregularities undetectable by annual EUI alone.
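The Level 1 quadrant rule can be sketched as a small function. This is an illustrative reading of the classification, assuming both scores lie on a 0–100 scale with a cutoff of 50 and that a score exactly at the cutoff counts as passing; the function name is hypothetical. The counts used for the share check are the quadrant totals reported in the paper.

```python
def quadrant(eui_score, pattern_score, cutoff=50.0):
    """Map the two 0-100 scores to a Level 1 quadrant label (assumed >= cutoff rule)."""
    efficient = eui_score >= cutoff        # at or better than the C14 median reference
    consistent = pattern_score >= cutoff   # temporally regular relative to its type
    if efficient and consistent:
        return "A"  # Excellent
    if efficient:
        return "B"  # Efficient but Irregular
    if consistent:
        return "C"  # Consistent but Inefficient
    return "D"      # Needs Improvement

# Quadrant counts reported for the 583 CBECS-mapped buildings.
counts = {"A": 122, "B": 107, "C": 250, "D": 104}
total = sum(counts.values())
shares = {q: round(100 * n / total, 1) for q, n in counts.items()}

print(quadrant(80.0, 30.0), total, shares)
```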
Building type patterns in quadrant distribution (
Appendix C) reveal systematic differences in operational consistency. Lodging facilities exhibit the highest Quadrant A proportion (55.6%), consistent with their operationally regular, schedule-driven patterns. Public Assembly buildings show the highest Quadrant C proportion (62.5%), indicating consistent temporal patterns despite high electricity intensity. Education buildings exhibit the highest Quadrant B proportion among listed types (22.8%), likely reflecting strong academic calendar seasonality and diverse sub-type composition (K-12 vs. university research facilities).
5.4. Level 2: CVRMSE Decomposition and Statistical Validation
Table 5 presents the complete Level 2 classification across 583 CBECS-mapped buildings using CBECS 2018 C14 absolute thresholds.
Of the 583 CBECS-mapped buildings, 372 (63.8%) are classified as NORMAL (Pattern Score ≥ 50). Of the 211 buildings in Quadrants B and D, 153 (72.5%) are CV_DRIVEN: their high CVRMSE is substantially explained by inherent load variability. Only 58 buildings (27.5% of B + D; 9.9% of 583) are ATYPICAL—exhibiting pattern deviations genuinely beyond what their inherent variability would predict.
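The Level 2 decision rule described above can be expressed compactly. The sketch assumes that Excess CVRMSE is supplied in percentage points and that the 5 pp threshold is applied as a strict inequality (following the "Excess > 5 pp" wording in Table A2); the function name is hypothetical.

```python
def level2_label(pattern_score, excess_cvrmse_pp, excess_threshold_pp=5.0):
    """Assign NORMAL, CV_DRIVEN, or ATYPICAL from Pattern Score and Excess CVRMSE."""
    if pattern_score >= 50.0:
        return "NORMAL"        # regular pattern; no decomposition needed
    if excess_cvrmse_pp > excess_threshold_pp:
        return "ATYPICAL"      # deviation beyond what inherent variability predicts
    return "CV_DRIVEN"         # high CVRMSE largely explained by load variability

# Counts reported in the text for the 583 CBECS-mapped buildings.
reported = {"NORMAL": 372, "CV_DRIVEN": 153, "ATYPICAL": 58}
irregular = reported["CV_DRIVEN"] + reported["ATYPICAL"]  # Quadrants B + D

print(sum(reported.values()), irregular)
print(round(100 * reported["ATYPICAL"] / irregular, 1))   # share of B + D that is ATYPICAL
```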
5.5. Level 3: NMBE Direction for ATYPICAL Buildings
Table 7 presents the NMBE directional classification. OVER-CONSUMING buildings exhibit the highest mean Excess CVRMSE (20.5 pp), indicating that the most directionally biased buildings also deviate most strongly from population-expected patterns—the directional bias and pattern atypicality are correlated, reinforcing the diagnostic signal.
Over-consuming ATYPICAL buildings (25 buildings, 43.1% of ATYPICAL) represent the highest-priority operational intervention targets: these buildings deviate from population-typical patterns and systematically consume more than the model predicts for their temporal context. The mean NMBE of +10.1% suggests these buildings consume approximately 10% more energy per hour than a comparable population-typical building would in the same temporal context. Reducing consumption toward model expectations—through HVAC scheduling optimization, after-hours load management, or occupancy-linked controls—represents a quantifiable, population-referenced efficiency target.
Under-consuming ATYPICAL buildings (10 buildings, 17.2% of ATYPICAL) have a negative NMBE (mean −3.6%), indicating that they consistently consume less than the model expects for their temporal context. This points to genuine energy savings that the model, calibrated to typical buildings, did not anticipate.
5.6. ENERGY STAR Blind Spot Analysis
ENERGY STAR eligibility is simulated using a within-type EUI percentile score ≥ 75 as a proxy for certification eligibility, benchmarked against CBECS 2018 Table C14 median EUI thresholds. Of the 583 CBECS-mapped buildings:
- A total of 85 buildings (14.6%) achieve EUI Score ≥ 75 and are ENERGY STAR-certifiable by the EUI criterion;
- Of these, 55 buildings (64.7%) have Pattern Score < 50 (irregular temporal operations);
- Of the 55 irregular ENERGY STAR-certifiable buildings, 16 (29.1%) are ATYPICAL, exhibiting genuine, non-CV-explained pattern anomalies;
- The remaining 39 of 55 are CV_DRIVEN: their pattern irregularity is largely inherent to their use type.
Conversely, 250 Quadrant C buildings (42.9%) exhibit consistent temporal patterns despite exceeding the national median EUI, confirming that their efficiency gap is structural rather than operational (
Section 5.3). Across all assessment reversal types, 71.2% of buildings receive substantively different assessments when CVRMSE is incorporated alongside EUI.
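The blind-spot funnel in this section (certifiable, then irregular, then ATYPICAL versus CV_DRIVEN) can be sketched as a filter chain. The tuple layout, function name, and toy portfolio are assumptions made for illustration; only the counts in the final check come from the text.

```python
def blind_spot_funnel(buildings):
    """buildings: iterable of (eui_score, pattern_score, level2_label) tuples."""
    certifiable = [b for b in buildings if b[0] >= 75.0]   # EUI-only proxy for eligibility
    irregular = [b for b in certifiable if b[1] < 50.0]    # irregular temporal pattern
    atypical = [b for b in irregular if b[2] == "ATYPICAL"]
    return {
        "certifiable": len(certifiable),
        "irregular": len(irregular),
        "atypical": len(atypical),
        "cv_driven": len(irregular) - len(atypical),
    }

# Hypothetical five-building portfolio.
portfolio = [
    (82.0, 64.0, "NORMAL"),
    (91.0, 31.0, "ATYPICAL"),
    (77.0, 42.0, "CV_DRIVEN"),
    (40.0, 20.0, "ATYPICAL"),
    (55.0, 80.0, "NORMAL"),
]
funnel = blind_spot_funnel(portfolio)
print(funnel)

# Percentages implied by the counts reported in the text (85 of 583, 55 of 85, 16 of 55).
reported_pct = (round(100 * 85 / 583, 1), round(100 * 55 / 85, 1), round(100 * 16 / 55, 1))
print(reported_pct)
```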
6. Discussion
6.1. Theoretical Implications: Zero-Shot Error as Population-Referenced Atypicality
The central theoretical contribution of this paper is the reframing of zero-shot prediction error from an accuracy measure to a population-referenced diagnostic signal. This reframing rests on a key structural asymmetry between a population-trained model and a building-specific model.
A building-specific model (e.g., ARIMA fitted to a single building’s history) learns whatever patterns that specific building exhibits; its prediction error reflects random fluctuations and measurement noise but not population-atypicality, because the model has no knowledge of what other buildings do. A population-trained model, by contrast, has learned the joint distribution of patterns across thousands of diverse buildings; its prediction error for a new building reflects the distance of that building’s patterns from the learned population manifold.
This creates a diagnostic instrument: large zero-shot prediction error implies population-atypical patterns, irrespective of absolute consumption level. As demonstrated in
Section 5.2, the model does not fail to predict high-EUI buildings systematically; it fails to predict buildings whose patterns are unusual, regardless of their magnitude.
The parallel to ENERGY STAR is instructive. ENERGY STAR’s WLS regression is also a population model: it predicts a building’s expected EUI from its characteristics, and the prediction error (actual vs. predicted EUI) becomes the score. The proposed framework applies the same logic at the temporal level: the model predicts a building’s expected hourly consumption from its own history and context, and the prediction error (CVRMSE) becomes the temporal pattern consistency score. The conceptual structure is identical; the temporal resolution is fundamentally different.
An important caveat is that ATYPICAL classification identifies population-atypical temporal patterns, not confirmed operational faults. A building may exhibit atypical patterns due to genuine inefficiency, mission-critical functions, or unique occupancy schedules. Without ground-truth labels from physical audits, the framework functions as a portfolio-level screening tool rather than a definitive diagnostic system. We recommend a two-stage pipeline: automated ATYPICAL screening to prioritize buildings (9.9% in this study) for targeted on-site audits.
6.2. The CVRMSE Decomposition: Resolving the CV Ambiguity
The CV-CVRMSE regression (CVRMSE = 0.541 × CV − 0.030, R2 = 0.700) is a methodologically important finding that resolves the fundamental ambiguity in CV-based pattern assessment. By establishing the expected CVRMSE for a building given its inherent variability, we can determine whether the building’s pattern is atypical conditional on its variability level—an assessment impossible with CV alone.
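To make the decomposition concrete, the sketch below computes the expected CVRMSE from the reported regression and the resulting excess for two hypothetical buildings. It assumes that CV and CVRMSE enter the regression as fractions (so 0.30 means 30%) and that Excess CVRMSE is the observed value minus the regression expectation, expressed in percentage points; the function names are illustrative.

```python
ALPHA = 0.541   # slope reported in the text
BETA = -0.030   # intercept reported in the text

def expected_cvrmse(cv):
    """CVRMSE expected from inherent load variability alone (fractional units assumed)."""
    return ALPHA * cv + BETA

def excess_cvrmse_pp(cvrmse, cv):
    """Observed minus expected CVRMSE, converted to percentage points."""
    return 100.0 * (cvrmse - expected_cvrmse(cv))

# Two hypothetical buildings with the same observed CVRMSE (30%) but different CV.
variable_building = excess_cvrmse_pp(cvrmse=0.30, cv=0.60)   # high inherent variability
steady_building = excess_cvrmse_pp(cvrmse=0.30, cv=0.35)     # low inherent variability

print(round(variable_building, 1), round(steady_building, 1))  # prints: 0.5 14.1
```

With these toy inputs, the same raw CVRMSE is almost fully accounted for by variability in the first building but leaves a large excess in the second, which is the distinction that CV alone cannot make.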
6.3. Robustness of the EUI-CVRMSE Independence Finding
The raw EUI-CVRMSE correlation (r = −0.082, p = 0.047, R² = 0.007) requires no EUI Score computation and directly demonstrates near-independence between annual energy intensity and temporal pattern deviation. The missing ENERGY STAR covariates (operating hours, worker count) primarily affect EUI level, not temporal pattern structure, which CVRMSE captures independently.
Appendix A confirms robustness across five alternative EUI scoring approaches. Additionally, the Box–Cox transformation used for load normalization (following BuildingsBench [
14]) was tested for sensitivity: perturbing λ by ±20% preserves CVRMSE-based rankings (Spearman ρ > 0.999) and Pattern Score rankings (ρ > 0.985), confirming that the framework’s results are not sensitive to the choice of transformation parameter.
6.4. Addressing the Simulation-to-Reality Gap
The sim-to-real gap concern is addressed at multiple levels. Empirically, buildings with predictable operational patterns achieve low zero-shot CVRMSE on real data: parking garages (median 7.7%) and lodging facilities (10.9%) both fall well below ASHRAE Guideline 14’s 15% threshold [
19], confirming that the model is well-calibrated for regular real buildings despite being trained exclusively on simulated data. Conceptually, the within-type z-score normalization for Pattern Score (
Section 4.4.2) absorbs systematic sim-to-real offsets: if all education buildings show consistently elevated CVRMSE on real versus simulated data, this shift is removed by the z-score, and only relative within-type deviations affect classification.
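The offset-absorption argument can be checked numerically: adding the same constant to every building of a type leaves the within-type z-scores unchanged. The sketch assumes a population standard deviation and a purely additive offset; the subsequent mapping from z-score to the 0–100 Pattern Score is not reproduced here, and all values are hypothetical.

```python
from statistics import mean, pstdev

def within_type_z(values):
    """Z-score each value against the mean and spread of its own building type."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical CVRMSE values (%) for one building type, before and after a uniform shift.
baseline = [12.0, 15.0, 18.0, 30.0]
sim_to_real_offset = 4.0
shifted = [v + sim_to_real_offset for v in baseline]

z_baseline = within_type_z(baseline)
z_shifted = within_type_z(shifted)
max_difference = max(abs(a - b) for a, b in zip(z_baseline, z_shifted))
print(max_difference < 1e-9)  # prints: True
```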
A critical distinction must be drawn between the training data and the evaluation data. The Buildings-900K training corpus reflects the actual age distribution and characteristics of U.S. commercial buildings as parameterized by CBECS survey weights [
26], spanning vintages from pre-1980 through 2018. The evaluation data (BDG-2) consists of real meter readings from 2016 to 2017. The model does not predict future behavior from historical data; it evaluates whether observed real-world patterns deviate from the statistical norm learned from a population-calibrated corpus.
Regarding model generalizability, the framework is model-agnostic (
Section 6.6).
6.5. Practical Deployment Pathway
The framework’s minimal data requirement (hourly metered energy consumption plus latitude and longitude) aligns naturally with smart meter infrastructure being deployed globally under mandatory benchmarking programs. In jurisdictions with building disclosure requirements (New York Local Law 97, California AB 802, EU Energy Performance of Buildings Directive), the required data are already collected annually or continuously.
A practical concern for deployment is interpretability: while the framework can flag a building as ATYPICAL, facility managers need guidance on the nature of the anomaly. The hourly prediction residuals provide a first level of temporal localization (
Figure 4), revealing patterns such as concentrated evening or weekend over-consumption, systematic weekday business-hour savings, or phase mismatches between actual and population-typical scheduling. While this temporal localization identifies when anomalies occur, it does not explain the underlying mechanistic causes, which would require model-internal analysis such as SHAP-based feature attribution or attention map analysis.
Framework Recalibration Protocol
To transfer the framework to a new building portfolio, the CV–CVRMSE regression must be re-fitted on the target portfolio, because the slope (α) and intercept (β) are population-dependent, and the Excess CVRMSE threshold must be re-derived using IQR-based outlier fencing and Cohen’s d analysis. A minimum of 30 buildings per type is recommended for stable within-type normalization; rare types should be aggregated or evaluated against the all-type pooled distribution. When a new foundation model replaces the forecasting backbone, the entire calibration sequence should be repeated. Within BDG-2, a Leave-One-Site-Out cross-validation across the four sites indicates that the fitted parameters and the resulting classifications are reasonably stable across sites (α range: 0.47–0.61; overall agreement: 91.4%, Cohen’s κ = 0.837; Appendix E).
6.6. Limitations and Boundary Conditions
Several limitations relate to the framework’s input data scope. The EUI Score computation uses a CBECS population-referenced z-score approach that does not adjust for operating hours, worker count, or plug load intensity within building types, unlike ENERGY STAR’s regression-based adjustment. This may introduce noise in the EUI Score but does not affect the CVRMSE-based Pattern Score, and the core EUI–CVRMSE independence finding is robust across alternative EUI scoring methods (
Appendix A). Buildings with mean hourly load below 5 kWh/h (23 of 611 in BDG-2) are excluded from ATYPICAL and NMBE analysis due to denominator inflation risk in normalized metrics. The framework currently uses latitude and longitude as static proxies for climate zone; while the model has internalized 16 ASHRAE climate zones through training, actual hourly dry-bulb temperature and humidity data would provide finer-grained meteorological context and help distinguish operational anomalies from weather-driven variability. Building envelope characteristics and cooling set-point strategies also influence temporal consumption patterns in ways not captured by meter data alone [
27]. Furthermore, the framework is limited to electricity because the BuildingsBench model was trained exclusively on electricity load profiles, and hourly gas and thermal metering data are rarely available at scale. Extension to multi-fuel diagnostics awaits foundation models trained on diverse energy carriers.
The CV–CVRMSE decomposition regression (CVRMSE = α × CV + β, with α = 0.541 and β = −0.030; Section 4.5) and the 5 pp Excess CVRMSE threshold are fitted on BDG-2’s 611 buildings across four North American sites; these parameters will differ for other building populations, climate contexts, or meter types, and require recalibration on new datasets (see the Framework Recalibration Protocol in Section 6.5). Validation on independent datasets (the full ASHRAE GEPIII dataset, urban energy disclosure datasets, European building portfolios) is necessary to confirm generalizability. All metrics in this study derive from a single model, TransformerWithGaussian-L; while the framework’s diagnostic logic is model-agnostic, concordance across alternative foundation models remains to be established. The model’s training corpus (Buildings-900K) reflects CBECS 2012-era building stock; recent developments such as heat pump electrification, rooftop solar, and post-pandemic work patterns may cause technologically progressive buildings to be misclassified as ATYPICAL, making periodic retraining on updated CBECS data essential.
The 168 h (7-day) context window, a design parameter of the TransformerWithGaussian-L architecture, is sufficient for capturing weekly operational cycles but may not fully represent broader seasonal transitions or multi-week operational patterns. During shoulder seasons where heating and cooling mode transitions occur, the limited temporal memory may generate elevated CVRMSE not attributable to operational anomalies; the framework’s annual aggregation of CVRMSE across all prediction windows partially mitigates this by averaging over multiple seasonal contexts, but systematic evaluation of context window sensitivity is warranted. BDG-2’s building type categories are coarse—the Education label spans K-12 schools to research universities with substantially different operational profiles—and finer-grained taxonomy, potentially automated through deep learning-based structural type identification [
28], would improve within-type normalization precision. Buildings with behind-the-meter resources (photovoltaics, battery storage, geothermal systems) create temporal load signatures indistinguishable from operational anomalies using net meter data alone; BDG-2 (2016–2017) predates large-scale commercial deployment of such resources, but application to post-2020 portfolios will require gross metering or BTM registries.
Collectively, these limitations point to several directions for future investigation: external validation on independent building portfolios, finer-grained building type taxonomy, cross-model concordance testing, integration of hourly weather data, and extension to non-electric energy carriers. Addressing these areas would strengthen the framework’s generalizability and practical utility across diverse deployment contexts.
7. Conclusions
This paper presented a hierarchical, three-level building energy performance evaluation framework that integrates EUI-based efficiency assessment with zero-shot prediction CVRMSE from a large-scale pre-trained load forecasting model as a temporal pattern consistency metric. Applied to 611 real buildings from the Building Data Genome Project 2 (BDG-2) encompassing 9,247,992 observation-prediction timestep pairs, the framework yielded the following principal findings:
Finding 1: EUI and zero-shot CVRMSE are empirically near-independent (r = −0.029, p = 0.48 for all 611 buildings; r = −0.082, p = 0.047, R2 = 0.007 for 583 CBECS-mapped buildings; r = −0.291, R2 = 0.085 for normalized scores). Neither metric is a proxy for the other; both must be measured for a complete building performance assessment. The independence is theoretically expected—EUI captures aggregate annual consumption intensity while CVRMSE captures the temporal structure of energy use patterns relative to a population-calibrated reference.
Finding 2: Benchmarked against CBECS 2018 Table C14 electricity-only median EUI thresholds, the hierarchical four-quadrant framework (Level 1) classifies 583 CBECS-mapped BDG-2 buildings as: Excellent (A: 122, 20.9%), Efficient but Irregular (B: 107, 18.4%), Consistent but Inefficient (C: 250, 42.9%), and Needs Improvement (D: 104, 17.8%). The framework correctly directs buildings in Quadrant C (consistent, inefficient) toward capital investment rather than operational changes—a distinction invisible to EUI-alone benchmarking. The dominance of Quadrant C reflects BDG-2’s university campus characteristics (research facilities, 24/7 operations, laboratory equipment), which systematically exceed national commercial building medians.
Finding 3: The CVRMSE decomposition (Level 2) reveals that 70% of cross-building CVRMSE variance is explained by inherent load variability (CV), and only 58 buildings (9.9% of 583 CBECS-mapped) are genuinely ATYPICAL—exhibiting pattern deviations exceeding what their inherent variability would predict. The large majority of high-CVRMSE buildings (CV_DRIVEN, 153 buildings) have inherently complex consumption patterns consistent with their use type, making operational intervention less immediately actionable.
Finding 4: Of the ENERGY STAR-certifiable buildings (EUI Score ≥ 75, n = 85), 64.7% exhibit temporal pattern irregularities (Pattern Score < 50) undetectable by annual EUI benchmarking. Conversely, 42.9% of buildings (Quadrant C, n = 250) with EUI Score < 50 (exceeding the CBECS 2018 C14 median) exhibit highly consistent temporal patterns (Pattern Score ≥ 50), indicating their efficiency gap is structural rather than operational. Across all four reversal case types, 71.2% of buildings receive substantively different assessments when the temporal dimension is incorporated alongside annual EUI.
The broader implication is that zero-shot prediction error from population-trained models constitutes a previously invisible diagnostic instrument. By reframing this error as population-referenced pattern atypicality and structuring its interpretation through a validated hierarchical framework, we demonstrate a practical pathway from “how much energy?” toward “how does this building use energy, and what should be done?” Future work should prioritize external validation on independent building portfolios, sensitivity analysis across multiple foundation models, and recalibration of decomposition parameters for diverse building populations and climate contexts.
Author Contributions
Conceptualization, H.-H.Y. and J.-U.K.; Methodology, H.-H.Y.; Software, H.-H.Y.; Formal Analysis, H.-H.Y.; Validation, H.-H.Y.; Investigation, H.-H.Y.; Data Curation, H.-H.Y.; Writing—Original Draft Preparation, H.-H.Y.; Writing—Review and Editing, J.-U.K.; Visualization, H.-H.Y.; Supervision, J.-U.K.; Funding Acquisition, J.-U.K. All authors have read and agreed to the published version of the manuscript.
Funding
This research was supported by the Ministry of Trade, Industry and Energy, Korea Institute of Energy Technology Evaluation and Planning (Project name: Development and demonstration of energy demand efficiency technology through autonomous operation of common devices in commercial facilities, Project number: RS-2023-00238487).
Data Availability Statement
Acknowledgments
The authors acknowledge the National Renewable Energy Laboratory (NREL) for developing and releasing the BuildingsBench framework, model weights, and Buildings-900K simulation dataset under open-source license. We acknowledge the Building Data Genome Project team for curating and releasing the BDG-2 dataset. We also acknowledge the U.S. Energy Information Administration for the CBECS survey data that underpins the Buildings-900K simulation corpus. During the preparation of this manuscript, the author(s) used Claude (Anthropic, Claude Opus 4.6) for the purposes of manuscript drafting, data analysis verification, figure generation, and reference verification. The authors have reviewed and edited the output and take full responsibility for the content of this publication.
Conflicts of Interest
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Jeong-Uk Kim reports administrative support was provided by the Ministry of Trade, Industry and Energy, Korea Institute of Energy Technology Evaluation and Planning (Project name: Development and demonstration of energy demand efficiency technology through autonomous operation of common devices in commercial facilities, Project number: RS-2023-00238487). The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
Appendix A. Sensitivity Analysis—EUI Score Calculation Method
To verify the robustness of the EUI-CVRMSE independence finding to EUI calculation methodology, we repeated the correlation analysis using alternative EUI scoring approaches:
Table A1.
Sensitivity analysis of EUI-CVRMSE correlation to EUI scoring method.
| EUI Scoring Method | EUI ↔ CVRMSE (Raw) | EUI Score ↔ Pattern Score |
|---|---|---|
| CBECS C14 z-score (our method, n = 583) | r = −0.082 (R² = 0.007) | r = −0.291 |
| Within-type percentile (BDG-2 relative ranking) | r = −0.082 (R² = 0.007) | r = −0.276 |
| All-buildings percentile (no type adjustment) | r = −0.082 (R² = 0.007) | r = −0.278 |
| Raw EUI (no normalization) | r = −0.082 (R² = 0.007) | – |
| Log-transformed EUI percentile | r = −0.082 (R² = 0.007) | r = −0.278 |
Appendix B. Causal Decomposition of High-CVRMSE Buildings
Table A2.
Causal decomposition of high-CVRMSE buildings (CVRMSE > 20%, n = 185).
| Cause | Count | % of High-CVRMSE | Description |
|---|---|---|---|
| HIGH_CV (CV above type median) | 162 | 87.6% | High inherent variability drives CVRMSE |
| LOW_DENOM (mean < 25th percentile) | 88 | 47.6% | Small mean load inflates normalized CVRMSE |
| ATYPICAL (Excess > 5 pp) | 62 | 33.5% | Genuine pattern deviation from population model |
| CV_DRIVEN only (single cause) | 66 | 35.7% | CVRMSE fully explained by CV alone |
Appendix C. Quadrant Distribution by Building Type
Table A3.
Quadrant distribution by building type (CBECS 2018 C14, selected types, n ≥ 15).
| Building Type | n | A: Excellent (%) | B: Efficient but Irregular (%) | C: Consistent but Inefficient (%) | D: Needs Improvement (%) |
|---|---|---|---|---|---|
| Lodging | 54 | 55.6 | 16.7 | 18.5 | 9.3 |
| Public Services | 98 | 24.5 | 22.4 | 34.7 | 18.4 |
| Office | 71 | 18.3 | 11.3 | 42.3 | 28.2 |
| Education | 254 | 17.3 | 22.8 | 44.9 | 15.0 |
| Public Assembly | 80 | 10.0 | 7.5 | 62.5 | 20.0 |
Appendix D. Threshold Sensitivity Analysis
Table A4.
Excess CVRMSE threshold sensitivity analysis (583 CBECS-mapped buildings, Pattern Score < 50, NMBE direction threshold fixed at ±2%).
| Threshold (pp) | n ATYPICAL | n CV_DRIVEN | Mean absolute NMBE, ATYPICAL (%) | Mean absolute NMBE, CV_DRIVEN (%) | Cohen’s d | % OVER | % UNDER | % NEUTRAL |
|---|---|---|---|---|---|---|---|---|
| 3 | 69 | 142 | 4.69 | 1.40 | 0.74 | 40.6 | 14.5 | 44.9 |
| 4 | 64 | 147 | 4.94 | 1.40 | 0.80 | 42.2 | 15.6 | 42.2 |
| 5 | 58 | 153 | 5.29 | 1.41 | 0.88 | 43.1 | 17.2 | 39.7 |
| 6 | 52 | 159 | 5.54 | 1.47 | 0.93 | 44.2 | 17.3 | 38.5 |
| 7 | 47 | 164 | 6.00 | 1.46 | 1.05 | 44.7 | 19.1 | 36.2 |
| 8 | 43 | 168 | 6.46 | 1.46 | 1.17 | 48.8 | 20.9 | 30.2 |
| 10 | 36 | 175 | 7.01 | 1.54 | 1.29 | 47.2 | 22.2 | 30.6 |
Figure A1.
Effect size (Cohen’s d) for |NMBE| separation between ATYPICAL and CV_DRIVEN groups as a function of Excess CVRMSE threshold. The dashed red line indicates the “large” effect size boundary (d = 0.8). The selected 5 pp threshold yields d = 0.88.
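The threshold sweep summarized in Table A4 and Figure A1 can be sketched as follows. The pooled-standard-deviation form of Cohen's d is assumed (the text does not state which variant was used), the strict "excess > threshold" split is an assumption, and the records are hypothetical toy values rather than BDG-2 data.

```python
import math
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Cohen's d with a pooled sample standard deviation (assumed variant)."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)

def sweep(records, thresholds_pp):
    """records: (excess_cvrmse_pp, abs_nmbe_pct) pairs for buildings with Pattern Score < 50."""
    rows = []
    for t in thresholds_pp:
        atypical = [nmbe for excess, nmbe in records if excess > t]
        cv_driven = [nmbe for excess, nmbe in records if excess <= t]
        rows.append((t, len(atypical), len(cv_driven), cohens_d(atypical, cv_driven)))
    return rows

# Hypothetical irregular buildings: (Excess CVRMSE in pp, |NMBE| in %).
records = [(1.0, 1.2), (2.5, 0.9), (3.5, 1.8), (4.5, 2.5),
           (6.0, 4.0), (7.5, 5.5), (9.0, 6.5), (12.0, 8.0)]

rows = sweep(records, thresholds_pp=[3, 5, 7])
for threshold, n_atypical, n_cv_driven, d in rows:
    print(threshold, n_atypical, n_cv_driven, round(d, 2))
```

Each group needs at least two buildings for the sample variance to be defined, so very high thresholds on small portfolios would need a guard that this sketch omits.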
Appendix E. Leave-One-Site-Out Cross-Validation
Table A5.
CV-CVRMSE regression parameters and classification agreement for Leave-One-Site-Out cross-validation (4 BDG-2 sites).
| Fold (Held-Out Site) | n Train | n Test | Alpha | Beta | R² Train | R² Test | Agreement (%) | Kappa |
|---|---|---|---|---|---|---|---|---|
| Bear | 520 | 91 | 0.541 | −0.034 | 0.707 | 0.634 | 100.0 | 1.000 |
| Fox | 476 | 135 | 0.475 | −0.005 | 0.698 | 0.679 | 95.4 | 0.895 |
| Rat | 331 | 280 | 0.613 | −0.045 | 0.698 | 0.642 | 85.5 | 0.743 |
| Panther | 506 | 105 | 0.559 | −0.039 | 0.713 | 0.485 | 95.7 | 0.893 |
| All (baseline) | 611 | – | 0.541 | −0.030 | 0.700 | – | 91.4 | 0.837 |
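The agreement and kappa columns in Table A5 can be computed along the lines of the sketch below. It assumes that each fold compares the labels obtained with the fold-specific regression against the labels from the all-site baseline for the held-out buildings; the function name and the ten toy labels are hypothetical.

```python
from collections import Counter

def agreement_and_kappa(labels_a, labels_b):
    """Percent agreement and Cohen's kappa between two label sequences of equal length."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    kappa = (observed - expected) / (1.0 - expected)  # undefined if expected == 1
    return 100.0 * observed, kappa

# Hypothetical held-out site: baseline labels vs. labels from the fold's re-fitted regression.
baseline = ["NORMAL"] * 6 + ["CV_DRIVEN"] * 3 + ["ATYPICAL"] * 1
fold = ["NORMAL"] * 6 + ["CV_DRIVEN"] * 2 + ["ATYPICAL"] * 2

agreement_pct, kappa = agreement_and_kappa(baseline, fold)
print(round(agreement_pct, 1), round(kappa, 3))  # prints: 90.0 0.821
```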
References
- United Nations Environment Programme (UNEP); Global Alliance for Buildings and Construction (GlobalABC). 2023 Global Status Report for Buildings and Construction: Beyond Foundations; UNEP: Nairobi, Kenya, 2023.
- United Nations Environment Programme (UNEP). 2022 Global Status Report for Buildings and Construction; UNEP: Nairobi, Kenya, 2022; Available online: https://globalabc.org/resources/publications/2022-global-status-report-buildings-and-construction (accessed on 5 April 2026).
- Pérez-Lombard, L.; Ortiz, J.; Pout, C. A review on buildings energy consumption information. Energy Build. 2008, 40, 394–398.
- US Environmental Protection Agency (EPA). ENERGY STAR Score for Commercial Buildings in the United States; Technical Reference; EPA: Washington, DC, USA, 2023. Available online: https://portfoliomanager.energystar.gov/pdf/reference/ENERGY%20STAR%20Score.pdf (accessed on 5 April 2026).
- European Parliament and Council. Directive (EU) 2018/844 amending Directive 2010/31/EU on the energy performance of buildings and Directive 2012/27/EU on energy efficiency. Official Journal of the European Union, 19 June 2018.
- American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE). Building EQ: A Tool for Building Energy Audits and Benchmarking; ASHRAE: Atlanta, GA, USA, 2014.
- Arjunan, P.; Miller, C.; Poolla, K. EnergyStar++: Towards more accurate and explanatory building energy benchmarking. Appl. Energy 2020, 276, 115413.
- Institute for Market Transformation (IMT). Benchmark 8760: The Case for Hourly Benchmarking of Commercial Buildings; IMT: Washington, DC, USA, 2022; Available online: https://benchmark8760.org/ (accessed on 5 April 2026).
- Piscitelli, M.S.; Giudice, R.; Capozzoli, A. A holistic time series-based energy benchmarking framework for applications in large stocks of buildings. Appl. Energy 2024, 357, 122550.
- Li, X.; Yao, R.; Li, Q.; Ding, Y.; Li, B. An object-oriented energy benchmark for the evaluation of the office building stock. Util. Policy 2018, 51, 1–11.
- Goswami, M.; Szafer, K.; Choudhry, A.; Cai, Y.; Li, S.; Dubrawski, A. MOMENT: A family of open time-series foundation models. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 21–27 July 2024.
- Ansari, A.F.; Stella, L.; Turkmen, C.; Zhang, X.; Mercado, P.; Shen, H.; Metaxas, D.; Lim, B.; de Bayser, M.; Bengio, S.; et al. Chronos: Learning the language of time series. Trans. Mach. Learn. Res. 2024. Available online: https://openreview.net/forum?id=gerNCVqqtR (accessed on 5 April 2026).
- Das, A.; Kong, W.; Sen, R.; Zhou, Y. A decoder-only foundation model for time-series forecasting. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 21–27 July 2024.
- Emami, P.; Adams, S.; Bhaskaran, S.; He, T.; Lunacek, M. BuildingsBench: A large-scale dataset of 900K buildings and benchmark for short-term load forecasting. In Advances in Neural Information Processing Systems (NeurIPS 2023); Curran Associates, Inc.: Red Hook, NY, USA, 2023; Volume 36.
- US Energy Information Administration (EIA). 2018 Commercial Buildings Energy Consumption Survey (CBECS): Consumption and Expenditures Tables; EIA: Washington, DC, USA, 2022. Available online: https://www.eia.gov/consumption/commercial/data/2018/ (accessed on 5 April 2026).
- Bordass, B.; Cohen, R.; Field, J. Energy performance of non-domestic buildings: Closing the credibility gap. In Proceedings of the International Conference on Improving Energy Efficiency in Commercial Buildings (IEECB’04), Frankfurt, Germany, 19–22 April 2004.
- Scofield, J.H. Do LEED-certified buildings save energy? Not really. Energy Build. 2009, 41, 1386–1390.
- Granderson, J.; Price, P.N.; Jump, D.; Addy, N.; Sohn, M.D. Automated measurement and verification: Performance of public domain whole-building electric baseline models. Appl. Energy 2015, 144, 106–113.
- American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE). ASHRAE Guideline 14-2014: Measurement of Energy, Demand, and Water Savings; ASHRAE: Atlanta, GA, USA, 2014.
- Park, J.Y.; Yang, X.; Miller, C.; Arjunan, P.; Nagy, Z. Apples or oranges? Identification of fundamental load shape profiles for benchmarking buildings using a large and diverse dataset. Appl. Energy 2019, 236, 1280–1295.
- Andrews, A.; Jain, R.K. Beyond energy efficiency: A clustering approach to embed demand flexibility into building energy benchmarking. Appl. Energy 2022, 327, 119989.
- Katipamula, S.; Brambley, M.R. Methods for Fault Detection, Diagnostics, and Prognostics for Building Systems—A Review, Part I. HVACR Res. 2005, 11, 3–25.
- Zhao, Y.; Li, T.; Zhang, X.; Zhang, C. Artificial Intelligence-Based Fault Detection and Diagnosis Methods for Building Energy Systems: Advantages, Challenges and the Future. Renew. Sustain. Energy Rev. 2019, 109, 85–101.
- Miller, C.; Kathirgamanathan, A.; Picchetti, B.; Arjunan, P.; Park, J.Y.; Nagy, Z.; Raftery, P.; Hobson, B.W.; Shi, Z.; Meggers, F. The building data genome project 2, energy meter data from the ASHRAE great energy predictor III competition. Sci. Data 2020, 7, 368.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS 2017); Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30, pp. 5998–6008.
- US Energy Information Administration (EIA). 2018 Commercial Buildings Energy Consumption Survey (CBECS): Building Characteristics; EIA: Washington, DC, USA, 2022. Available online: https://www.eia.gov/consumption/commercial/ (accessed on 5 April 2026).
- Xiong, J.; Zhang, Y.; Han, M.; Wu, J.; Tian, Z. Building energy-saving mechanism for indoor cooling temperature set-point with different envelope: A case study in Guangzhou. J. Chin. Archit. Urban. 2023, 5, 0877.
- Xu, Z.; Liang, C.; Mu, Z.; Tian, Y.; Gu, D. Structural Type Identification of City-Scale Building Groups Based on Deep Learning. J. Comput. Civ. Eng. 2026, 40, 4025119.