1. Introduction
Buildings account for a substantial share of global electricity consumption, with heating, cooling, and ventilation systems representing some of the most energy-intensive [1,2] and operationally flexible end uses. Improving visibility into these loads is essential for applications such as energy performance benchmarking, demand-side management, fault detection, and operational optimization. However, fine-grained sub-metering remains costly and difficult to deploy at scale. Non-intrusive load monitoring (NILM) offers a scalable alternative by estimating end-use consumption from aggregate electricity measurements.
NILM techniques offer several potential benefits in building energy analytics, including improved visibility into end-use consumption, support for personalized energy efficiency measures [3], inference of occupant-related patterns [4], fault detection in heating, ventilation, and air-conditioning (HVAC) systems [5], and more accurate demand-side management strategies [6]. Despite these advantages, reliable NILM remains particularly challenging for heating and cooling loads. Unlike many conventional appliances with discrete on-off behavior, HVAC systems typically operate in a continuously modulated manner, which complicates their separation from aggregate measurements [7]. This challenge is amplified in commercial and public buildings, where monitoring is often limited to low temporal resolutions such as hourly or 15 min data. While most NILM research has focused on high-resolution measurements that enable event-based pattern recognition, only a limited number of studies have demonstrated effective HVAC load disaggregation at low sampling rates [8,9,10].
Historically, NILM research has focused on residential environments and high-frequency electrical measurements, where appliance-level signatures and switching events can be exploited. In contrast, commercial and public buildings are typically monitored using low-frequency smart meter data, often at 15 min or hourly resolution [11]. At these temporal scales, transient signatures are largely absent, and classical event-based NILM techniques become ineffective [12]. Disaggregation in such settings must therefore rely on contextual information, including weather variables, calendar effects, and other exogenous drivers.
A growing body of work has demonstrated that temperature-dependent and other context-driven components can be partially isolated from aggregate measurements. These approaches include regression-based [13,14], probabilistic, spectral, and learning-based methods [15]. While these can achieve high apparent accuracy, many disaggregation formulations impose structural assumptions that effectively require all non-baseline energy to be attributed to the modeled drivers [16]. In practice, however, electricity consumption reflects contextual influences, control logic, occupant behavior, and stochastic effects that are not fully captured by the available drivers.
Previous work by the authors compared three representative algorithmic families for low-frequency temperature-dependent load disaggregation, including Bayesian, time–frequency mask-based, and bidirectional LSTM models [17]. While some methods achieved low pointwise error under favorable conditions, their performance degraded substantially when baseline assumptions were violated or observability was incomplete. These findings indicate that the central difficulty in low-frequency disaggregation lies not only in model selection, but in how the attribution problem is formulated under partial observability.
Beyond numerical accuracy, the practical value of load disaggregation in smart buildings depends critically on the interpretability and stability of the resulting estimates [18]. Disaggregation outputs are rarely used in isolation; instead, they support operational reasoning tasks such as fault diagnosis, control strategy assessment, retrofit evaluation, and cross-building benchmarking [19]. In these settings, over-attribution of unexplained energy to contextual drivers can lead to misleading conclusions, for example by overstating the influence of weather or concealing control-related inefficiencies. Models that implicitly assume complete explainability may therefore achieve high apparent accuracy while providing limited decision-support value. This motivates formulations that preserve unexplained energy rather than forcing attribution when explanatory evidence is weak [15]. Methodologically, the proposed framework departs from existing NILM formulations by explicitly decoupling attribution from completeness, enabling uncertainty-aware, multi-driver decomposition under partial observability rather than forcing full explanation of aggregate demand.
These limitations are illustrated by the ADRENALIN Load Disaggregation Challenge [20], which released a curated dataset for temperature-dependent load disaggregation [5]. Post hoc analysis of the winning algorithms showed that highly competitive solutions achieved low normalized mean absolute error by aggressively fitting seasonal structure and residual variability into temperature-dependent components [21]. While effective under the competition metric, these approaches were sensitive to baseline formulation, building type, and operational regime, and often relied on implicit completeness assumptions that are difficult to justify in real-world deployments. This tension between benchmark optimization and operational interpretability directly motivates the design choices of the proposed framework.
The proposed framework is explicitly not designed to optimize leaderboard-style error metrics. Instead, MD-ADD targets low-frequency, context-driven settings where partial observability, regime variability, and limited metadata are the dominant constraints. The framework is therefore designed for analytical and diagnostic workflows in which interpretability, robustness, and explicit uncertainty exposure are more critical than maximizing explained variance or minimizing pointwise error.
This paper proposes MD-ADD, a multi-driver automatic dependency disaggregation framework designed for low-frequency smart meter data in commercial and public buildings. The framework explicitly treats contextual drivers as informative but incomplete, supports an explicit unexplained energy component, and incorporates uncertainty-aware attribution mechanisms. Driver contributions are estimated using out-of-fold modeling to reduce leakage, and uncertainty is quantified through block bootstrap resampling. Optional temporal and time–frequency consistency constraints are included to restrict attributions to scales compatible with the expected physical influence of each driver. The framework is evaluated on the ADRENALIN Challenge dataset [5], which provides validated sub-metering for performance assessment.
Taken together, these findings establish that low-frequency HVAC load disaggregation is fundamentally constrained by partial observability, regime variability, and incomplete contextual information. The present work builds on these insights by proposing a conservative, uncertainty-aware disaggregation formulation for low-frequency smart meter data. Rather than introducing a new algorithmic family, the MD-ADD framework provides a unifying attribution pipeline designed to produce robust and interpretable decompositions under realistic smart meter conditions.
This paper makes three main contributions. First, it formulates low-frequency contextual disaggregation as a decomposition problem in which unexplained energy is an expected and informative outcome, rather than an error term that must be eliminated through forced attribution; the formulation prioritizes interpretability, attribution stability, and diagnostic usefulness under incomplete observability over leaderboard-style error minimization. Second, it presents an integrated pipeline that combines robust baseline estimation, leakage-resistant out-of-fold contextual modelling, conservative driver attribution derived from explainability outputs without hard completeness constraints, and uncertainty quantification using block bootstrap resampling, with optional temporal and time–frequency consistency mechanisms. Third, it evaluates the formulation on the ADRENALIN Challenge dataset using normalized mean absolute error alongside stability and residual structure diagnostics, clarifying the trade-off between metric optimization and operational interpretability in commercial and public buildings.
The remainder of this paper is organized as follows. Section 2 reviews the state of the art in non-intrusive load monitoring with emphasis on low-frequency data, contextual disaggregation, evaluation practices, and limitations related to interpretability and incomplete observability. Section 3 describes the ADRENALIN Challenge dataset, including building selection, data validation, and preprocessing procedures. Section 4 presents the proposed MD-ADD methodology, detailing the problem formulation, baseline estimation, multi-driver attribution framework, uncertainty quantification, and experimental evaluation setup. Section 5 reports and analyzes the empirical results in comparison with established low-frequency disaggregation methods and competition-optimized baselines. Section 6 discusses the findings, implications, limitations, and directions for future research. Finally, Section 7 concludes the paper and summarizes the main contributions.
4. Proposed Approach: MD-ADD
4.1. Problem Formulation and Scope
The objective of the proposed methodology is to disaggregate aggregate building electricity consumption into interpretable components associated with multiple contextual drivers, while explicitly accounting for uncertainty and incomplete observability. The method targets low-frequency data, such as hourly or sub-hourly smart meter measurements, and is designed for commercial and public buildings where appliance-level signatures are not available.
Let $y(t)$ denote the aggregate electricity consumption at time $t$, and let $x_1(t), \ldots, x_K(t)$ denote a set of contextual driver variables, such as outdoor temperature and other environmental measurements. Consistent with additive NILM formulations, the aggregate signal is decomposed as

$$y(t) = b(t) + \sum_{k=1}^{K} c_k(t) + u(t), \quad (1)$$

where $b(t)$ represents a baseline component largely independent of the modeled drivers, $c_k(t)$ denotes the estimated contribution associated with contextual driver $k$, and $u(t)$ represents unexplained energy.
Equation (1) follows the standard additive perspective widely adopted in non-intrusive load monitoring (NILM), in which aggregate electricity consumption is represented as the superposition of multiple contributing components and a residual term [15]. However, the decomposition above should be interpreted as an attribution-based representation rather than as a generative physical model of load formation. The contextual drivers $x_k(t)$ do not appear explicitly in Equation (1) because their influence is incorporated through the estimated attribution terms $c_k(t)$. In classical NILM formulations, additive components often correspond to appliance-level loads or signature-based functions. In the present work, the quantities $c_k(t)$ are inferred from empirical relationships between contextual variables $x_k(t)$ and variations in aggregate consumption, rather than being assumed to be deterministic functions of the drivers. For clarity, Table 1 summarizes the notation used in the decomposition and subsequent modeling stages.
In this work, the term “data-driven” refers to the fact that all driver-response relationships and attribution magnitudes are inferred directly from observed data rather than prescribed by fixed physical models, parametric energy signature forms, or rule-based assumptions. The baseline component, contextual regression relationships, and attribution terms are estimated from historical smart meter and contextual measurements using statistical learning under a time-series cross-validation scheme. No building-specific physical parameters, equipment specifications, or predefined temperature-load functions are imposed. Instead, the structure of the decomposition in Equation (1) is fixed, while the functional relationships that determine $c_k(t)$ are learned independently for each building from data. Uncertainty is quantified through block bootstrap resampling rather than imposed through parametric distributional assumptions. In this sense, the disaggregation behavior emerges from empirical patterns in the data rather than from predefined physical or rule-based models.
The methodology does not assume that the available contextual drivers are sufficient to explain all non-baseline energy. Instead, unexplained energy is treated as a meaningful component of the decomposition, reflecting unobserved influences, stochastic variation, and modeling uncertainty. This design choice avoids forcing attribution when explanatory evidence is weak and supports more interpretable and robust disaggregation outcomes.
While the experimental evaluation presented in this paper focuses primarily on weather-related drivers, particularly outdoor air temperature and calendar effects, this choice reflects the availability of validated ground truth and the dominant role of temperature-dependent demand in the ADRENALIN Challenge dataset rather than a limitation of the proposed formulation. The MD-ADD framework is designed to support multiple contextual drivers simultaneously, with each driver contributing an additive, explainability-derived attribution term and an associated uncertainty estimate. Additional drivers, such as occupancy proxies, indoor environmental variables, or control signals, can be incorporated directly into the contextual feature set without modification of the baseline separation, attribution logic, or uncertainty quantification pipeline.
Section 5 includes an explicit multi-driver experiment incorporating calendar-derived contextual features to demonstrate simultaneous driver attribution under controlled driver expansion.
MD-ADD is best understood as a disaggregation framework rather than a single estimator. Its core contribution lies in the formulation of conservative attribution under incomplete observability, supported by modular baseline strategies, interchangeable regression backends, and optional consistency constraints. This design allows the framework to be adapted to different building types, contextual feature sets, and analytical objectives without altering its attribution philosophy.
4.2. Data Preparation and Alignment
The method requires two aligned datasets: a dependent time series representing aggregate electricity consumption and a multivariate time series of contextual drivers. All series must share a common timestamp index.
Data are resampled to a uniform temporal resolution if required. For energy applications, the default configuration uses hourly aggregation, with energy values summed and driver values averaged over each interval. Non-numeric driver columns are excluded automatically.
Time steps with missing aggregate consumption values are removed. Driver values are retained as long as at least one driver is available for a given time step. This design avoids artificially imputing energy consumption while allowing flexible handling of incomplete contextual data.
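As an illustration of this alignment step, the following sketch shows one possible pandas implementation; the function name, the hourly default, and the handling of rows where all drivers are missing are illustrative assumptions rather than the exact implementation used in the experiments.

```python
import pandas as pd

def align_and_resample(energy: pd.Series, drivers: pd.DataFrame,
                       freq: str = "1h") -> tuple[pd.Series, pd.DataFrame]:
    """Align aggregate energy and contextual drivers on a common uniform index.

    Energy is summed per interval, drivers are averaged, and non-numeric
    driver columns are dropped (illustrative sketch).
    """
    # Resample to a uniform resolution: sum energy, average drivers.
    y = energy.resample(freq).sum()
    x = drivers.select_dtypes("number").resample(freq).mean()

    # Remove time steps with missing aggregate consumption.
    y = y.dropna()

    # Keep driver rows with at least one available driver, aligned to y;
    # rows where all drivers are missing are dropped here for simplicity.
    x = x.reindex(y.index).dropna(how="all")
    y = y.loc[x.index]
    return y, x
```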
4.3. Baseline Estimation
The baseline component represents electricity consumption that is largely insensitive to the modeled drivers. In commercial buildings, this component typically includes standby loads, continuously operating equipment, and other relatively stable consumption patterns.
It is important to emphasize that the baseline component in MD-ADD is an operational construct rather than a physically distinct end-use. The baseline is not assumed to be strictly independent of temperature or other contextual drivers in a physical sense. Instead, it represents a conservative lower-envelope separation of aggregate consumption intended to isolate excess variability for attribution analysis. In buildings with pronounced seasonal structure or year-round HVAC operation, the estimated baseline may therefore exhibit correlation with outdoor temperature. Within the MD-ADD formulation, such behavior is treated as diagnostically informative rather than erroneous, because it indicates the absence of a stable, temperature-neutral minimum load. Consequently, the baseline is not interpreted as a temperature-independent demand component, and MD-ADD explicitly avoids drawing physical conclusions from baseline magnitude alone.
The methodology supports several interchangeable baseline estimation strategies, all operating solely on the aggregate signal $y(t)$. A constant baseline may be defined as the median of $y(t)$. A rolling quantile baseline can be constructed as the lower envelope of $y(t)$ over a moving window, combining a rolling quantile and a rolling mean and selecting the minimum of the two to improve robustness against transient fluctuations. A schedule-based baseline may be estimated using categorical hour-of-day and day-of-week variables within a robust regression framework. In addition, a seasonal trend decomposition baseline may be derived using robust STL decomposition when sufficient periodicity is present.
By default, the rolling quantile baseline is applied, as it is non-parametric, domain-agnostic, and well suited to long-term building energy data. Edge effects are handled through forward- and backward-filling to ensure a complete baseline estimate across the full temporal range.
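A minimal sketch of the default rolling-envelope baseline is given below, assuming an hourly pandas series with a DatetimeIndex; the 14-day window and 0.2 quantile mirror the configuration reported in Section 4.15, but the function itself is an illustrative sketch rather than the reference implementation.

```python
import pandas as pd

def rolling_envelope_baseline(y: pd.Series, window: str = "14D",
                              quantile: float = 0.2) -> pd.Series:
    """Lower-envelope baseline b(t): minimum of a rolling quantile and a
    rolling mean of the aggregate signal, with forward/backward filling to
    cover edge effects (illustrative sketch)."""
    roll = y.rolling(window, min_periods=1)
    lower_quantile = roll.quantile(quantile)
    rolling_mean = roll.mean()
    baseline = pd.concat([lower_quantile, rolling_mean], axis=1).min(axis=1)
    return baseline.ffill().bfill()
```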
In addition to these generic formulations, MD-ADD also supports a contextual baseline strategy, implemented as a banded baseline, and used in selected experiments to address cases where envelope-based baselines are known to distort attribution results. In buildings with persistent temperature-driven operation, such as year-round heating or cooling, a stable lower envelope of aggregate demand may not exist. In these cases, the banded baseline is constructed by identifying periods of approximately neutral outdoor conditions, inferred from the data rather than fixed thresholds, and deriving a conservative, time-structured reference profile from those observations. This baseline remains an operational reference and does not aim to recover a physically meaningful baseline, but instead avoids systematically subtracting temperature-driven variability when no temperature-neutral minimum load is present.
4.4. Excess Energy Definition
After baseline estimation, the residual (excess energy) signal is defined as

$$E(t) = y(t) - b(t).$$
This excess energy represents the portion of consumption that may be influenced by the modeled drivers. Importantly, it is not assumed that this signal is fully explainable.
4.5. Driver-Based Modeling of Excess Energy
A nonlinear regression model is trained to approximate the relationship between excess energy and the contextual drivers,

$$\hat{E}(t) = f\big(x_1(t), \ldots, x_K(t)\big).$$
Tree-based ensemble models are used due to their ability to capture nonlinearities and interactions without requiring explicit feature engineering. Multiple backends are supported, including gradient-boosted decision trees implemented via XGBoost, LightGBM, or scikit-learn.
To avoid look-ahead bias and overly optimistic attribution, model training is performed using time-series cross-validation. For each fold, the model is trained on past data and evaluated on a held-out future segment. Out-of-fold predictions are retained for all time steps.
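The out-of-fold training scheme can be sketched as follows, using scikit-learn's TimeSeriesSplit and an XGBoost regressor; the hyperparameter values shown here are placeholders for illustration, not the fixed settings used in the reported experiments.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit
from xgboost import XGBRegressor

def out_of_fold_excess_model(excess: pd.Series, drivers: pd.DataFrame,
                             n_splits: int = 5):
    """Expanding-window time-series CV: train on past data, predict the
    held-out future segment, and retain out-of-fold predictions."""
    oof_pred = pd.Series(np.nan, index=excess.index)
    fold_models = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(drivers):
        model = XGBRegressor(n_estimators=300, max_depth=4,
                             learning_rate=0.05, subsample=0.8)
        model.fit(drivers.iloc[train_idx], excess.iloc[train_idx])
        oof_pred.iloc[test_idx] = model.predict(drivers.iloc[test_idx])
        fold_models.append((model, test_idx))
    return oof_pred, fold_models
```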
4.6. Attribution via Model Explainability
Local driver contributions are obtained using Shapley-value-based explainability methods. For tree-based models, TreeSHAP-compatible implementations are employed to compute additive contributions for each driver at each time step.
For a given time $t$, the model output can be expressed as

$$\hat{E}(t) = \phi_0 + \sum_{k=1}^{K} \phi_k(t),$$

where $\phi_0$ is the model intercept and $\phi_k(t)$ represents the contribution of driver $k$ at time $t$. In the MD-ADD decomposition, these per-driver contributions play the role of the attribution terms $c_k(t)$ in Equation (1).
These quantities are computed on out-of-fold predictions to ensure that attributions reflect generalizable relationships rather than in-sample fit.
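The per-driver contributions can be obtained with the SHAP TreeExplainer as sketched below, continuing the fold structure from the previous sketch; this is a hypothetical illustration of the out-of-fold attribution step rather than the exact code used.

```python
import numpy as np
import pandas as pd
import shap

def out_of_fold_shap(fold_models, drivers: pd.DataFrame) -> pd.DataFrame:
    """SHAP contributions phi_k(t), computed only on held-out segments so that
    attributions reflect generalizable relationships rather than in-sample fit."""
    phi = pd.DataFrame(np.nan, index=drivers.index, columns=drivers.columns)
    for model, test_idx in fold_models:
        explainer = shap.TreeExplainer(model)
        shap_values = explainer.shap_values(drivers.iloc[test_idx])
        phi.iloc[test_idx] = shap_values
    return phi
```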
4.7. Attribution Strategy and Treatment of Unexplained Energy
Driver contributions are derived directly from model explainability outputs and are not rescaled to enforce exact agreement with the observed excess energy. This attribution strategy avoids imposing hard mass-balance constraints that would require the modeled drivers to explain all variability in the excess signal. As a result, unexplained energy naturally emerges when the available drivers provide insufficient explanatory power or when excess energy contains stochastic or unobserved effects.
The unexplained component is therefore defined as

$$u(t) = E(t) - \sum_{k=1}^{K} \phi_k(t).$$

Negative unexplained values can optionally be clipped to zero when non-negativity is required for interpretability.
4.8. Attribution Logic and Mapping to HVAC Energy
The explainability formulation yields additive model decompositions that must be interpreted carefully in the context of HVAC energy attribution. This subsection clarifies the roles of SHAP values, the intercept term, and their mapping to the evaluated temperature-dependent HVAC signal.
As introduced in Section 4.6, the model prediction $\hat{E}(t)$ admits an additive SHAP decomposition:

$$\hat{E}(t) = \phi_0 + \sum_{k=1}^{K} \phi_k(t).$$
Within the MD-ADD framework, the driver contributions $\phi_k(t)$ are interpreted as contextual attributions to temperature-dependent HVAC energy. The intercept term $\phi_0$ is not associated with any physical driver. It represents the baseline expectation of the predictive model under the background data distribution and captures systematic structure not attributable to specific contextual variables. Accordingly, the intercept does not constitute attributable driver energy.

The estimated HVAC-related energy attributed to contextual drivers is defined as

$$\hat{E}_{\mathrm{HVAC}}(t) = \sum_{k=1}^{K} \phi_k(t).$$

The unexplained component is computed as

$$u(t) = E(t) - \hat{E}_{\mathrm{HVAC}}(t).$$

Importantly, this definition allows $u(t)$ to remain non-zero. When contextual drivers do not fully explain the observed excess energy, the residual term explicitly represents unexplained variability rather than being absorbed into contextual attributions. This separation preserves interpretability and aligns with the conservative attribution philosophy of MD-ADD.

In the experimental evaluation, $\hat{E}_{\mathrm{HVAC}}(t)$ is compared directly to the temperature-dependent HVAC ground-truth signal provided in the ADRENALIN dataset. The intercept term is not mapped to HVAC energy, ensuring that only driver-supported contributions are evaluated against the reference HVAC measurements.
4.9. Temporal and Time–Frequency Consistency Constraints
To suppress spurious attributions caused by temporal-scale mismatch, an optional time–frequency consistency mechanism is applied as a post-processing step on the driver attribution time series. A short-time Fourier transform (STFT) is computed for the excess-energy signal and for each driver signal using the same window length and overlap. For each driver, a time-varying spectral agreement score is computed by comparing the driver and excess-energy STFT magnitudes within a predefined frequency band that corresponds to plausible driver dynamics (for example, low-frequency components for meteorological drivers). This agreement score defines a gating factor in $[0, 1]$ that attenuates the driver attribution at time steps where spectral alignment is weak. The intent is to restrict driver attributions to temporal scales consistent with the driver's expected physical influence, while leaving the core modeling and SHAP attribution unchanged. This option is used for sensitivity analysis and is not required for the baseline MD-ADD formulation.
To formalize this operation, let $\phi_k(t)$ denote the raw attribution time series for driver $k$, and let $g_k(t) \in [0, 1]$ denote the time–frequency agreement score derived from STFT magnitude overlap between the excess-energy signal and driver $k$. The gated attribution is defined as

$$\tilde{\phi}_k(t) = g_k(t)\, \phi_k(t).$$

When $g_k(t) = 0$, attribution is fully suppressed at time $t$; when $g_k(t) = 1$, the original attribution is preserved. This post-processing step does not alter the regression or SHAP computation, but restricts attribution to temporally plausible scales.
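A sketch of the spectral gating step is shown below using scipy's STFT; the window length, frequency band, and cosine-similarity agreement score are illustrative assumptions consistent with hourly data, not prescribed parameters of the framework.

```python
import numpy as np
from scipy.signal import stft

def spectral_agreement_gate(excess: np.ndarray, driver: np.ndarray,
                            fs: float = 1.0, nperseg: int = 168,
                            band: tuple = (0.0, 1.0 / 12.0)) -> np.ndarray:
    """Time-varying agreement score g_k(t) in [0, 1] based on band-limited
    STFT magnitude overlap between the excess-energy signal and a driver."""
    freqs, frame_times, Z_excess = stft(excess, fs=fs, nperseg=nperseg)
    _, _, Z_driver = stft(driver, fs=fs, nperseg=nperseg)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    mag_e = np.abs(Z_excess[in_band, :])
    mag_d = np.abs(Z_driver[in_band, :])
    # Cosine similarity of band-limited magnitude spectra for each STFT frame.
    numerator = (mag_e * mag_d).sum(axis=0)
    denominator = np.linalg.norm(mag_e, axis=0) * np.linalg.norm(mag_d, axis=0) + 1e-12
    frame_score = np.clip(numerator / denominator, 0.0, 1.0)
    # Interpolate frame-level scores back to the original sampling instants.
    sample_times = np.arange(len(excess)) / fs
    return np.interp(sample_times, frame_times, frame_score)

# Gated attribution: phi_gated = gate * phi_raw, applied element-wise per driver.
```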
4.10. Attribution Thresholding and Sparsity
Small-magnitude attributions that arise from numerical noise or weak correlations are optionally suppressed using a global or percentile-based threshold. This step encourages sparse and interpretable decompositions and reduces the risk of attributing negligible energy to irrelevant drivers.
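For illustration, a percentile-based suppression rule could look like the following sketch; the 5th-percentile cutoff is an arbitrary example value rather than a recommended setting.

```python
import numpy as np
import pandas as pd

def threshold_attributions(phi: pd.DataFrame, percentile: float = 5.0) -> pd.DataFrame:
    """Zero out attributions whose magnitude falls below a global percentile,
    yielding a sparser and more interpretable decomposition."""
    cutoff = np.nanpercentile(np.abs(phi.to_numpy()), percentile)
    return phi.where(phi.abs() >= cutoff, 0.0)
```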
4.11. Uncertainty Quantification via Block Bootstrap
Uncertainty in driver attributions is quantified using a moving-block bootstrap procedure that preserves temporal autocorrelation. Contiguous blocks of fixed length are sampled with replacement to generate bootstrap replicates.
For each replicate, the full modeling and attribution pipeline is re-executed. Driver contributions are then aggregated over the desired temporal resolution, such as daily totals. Empirical confidence intervals are computed from the resulting distributions.
This procedure provides uncertainty estimates for both absolute attributions and relative energy shares.
Let $A_k^{(b)}$ denote the aggregated attribution for driver $k$ obtained from bootstrap replicate $b$, where aggregation is performed over a fixed temporal window such as daily totals. For $B$ bootstrap replicates, the empirical attribution distribution is

$$\big\{ A_k^{(1)}, A_k^{(2)}, \ldots, A_k^{(B)} \big\}.$$

Uncertainty intervals are computed as empirical quantiles of this distribution, and the median is used as the reported point estimate.
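The moving-block resampling step can be sketched as follows; the block length of 48 h matches the configuration reported in Section 4.15, while the helper function itself is an illustrative assumption.

```python
import numpy as np

def moving_block_indices(n: int, block_len: int = 48,
                         rng: np.random.Generator | None = None) -> np.ndarray:
    """Indices of one moving-block bootstrap replicate: contiguous blocks of
    fixed length sampled with replacement and concatenated to length n."""
    rng = rng or np.random.default_rng()
    n_blocks = int(np.ceil(n / block_len))
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)
    indices = np.concatenate([np.arange(s, s + block_len) for s in starts])
    return indices[:n]

# For each replicate b = 1, ..., B:
#   1. resample (y, drivers) with moving_block_indices,
#   2. re-run baseline estimation, regression, and SHAP attribution,
#   3. aggregate driver attributions to daily totals A_k^(b).
# Empirical quantiles over {A_k^(1), ..., A_k^(B)} give the reported intervals,
# with the median used as the point estimate.
```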
4.12. Diagnostics and Validation
Model adequacy is evaluated using a combination of quantitative accuracy measures and stability diagnostics that reflect both predictive performance and interpretability. Normalized mean absolute error is used to quantify the discrepancy between estimated and measured HVAC energy, providing a scale-independent measure suitable for comparison across buildings and aggregation levels. In addition to pointwise accuracy, residual structure is examined through autocorrelation analysis of the unexplained component, including inspection of the autocorrelation function and application of the Ljung–Box test. Importantly, persistent residual autocorrelation is not interpreted solely as a modeling defect. Instead, it is treated as a diagnostic signal indicating the presence of unobserved drivers, regime shifts, or operational dynamics that are not captured by the available contextual variables. The stability of driver attributions is evaluated by comparing attribution patterns across time series cross-validation folds and bootstrap resampling replicates. Together, these diagnostics provide a more comprehensive assessment of model behavior than accuracy metrics alone, supporting evaluation of robustness, transparency, and physical plausibility.
4.13. Attribution Stability and Residual-Structure Metrics
In addition to NMAE, two diagnostic families are used to evaluate whether driver attributions are sufficiently stable for operational interpretation. First, attribution stability is quantified using bootstrap replicates of aggregated driver energy (for example, daily totals). For each driver and building, stability is summarized by a normalized dispersion statistic computed as the interquartile range divided by the median absolute attribution across bootstrap replicates, and a sign-consistency rate computed as the fraction of replicates with the same attribution sign as the median. Let $A_k^{(b)}$ denote the aggregated attribution for driver $k$ in bootstrap replicate $b$. Attribution dispersion is defined as

$$D_k = \frac{\operatorname{IQR}\big(A_k^{(1)}, \ldots, A_k^{(B)}\big)}{\operatorname{median}_b \big|A_k^{(b)}\big|},$$

and sign consistency is defined as

$$S_k = \frac{1}{B} \sum_{b=1}^{B} \mathbb{1}\Big[\operatorname{sign}\big(A_k^{(b)}\big) = \operatorname{sign}\big(\operatorname{median}_b\, A_k^{(b)}\big)\Big],$$

where $\mathbb{1}[\cdot]$ denotes the indicator function. Low dispersion and high sign consistency indicate a transferable driver relationship that is robust to plausible temporal perturbations.
Second, residual structure is summarized using autocorrelation-derived statistics on the unexplained component. In addition to visual inspection of the autocorrelation function, a small set of fixed-lag autocorrelations is reported at lags corresponding to one interval, one day, and one week (depending on sampling rate). A Ljung–Box test is used as a complementary check for non-random residual structure over a defined lag window. Persistent residual structure is interpreted as evidence of missing drivers, regime changes, or operational logic not represented in the contextual features, rather than being treated solely as modeling error.
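The dispersion, sign-consistency, and residual-structure statistics defined above translate directly into short computations, sketched below with NumPy, pandas, and statsmodels; the lag choices correspond to hourly data and are illustrative.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.diagnostic import acorr_ljungbox

def attribution_dispersion(a: np.ndarray) -> float:
    """D_k: interquartile range of bootstrap attributions divided by the
    median absolute attribution."""
    q75, q25 = np.percentile(a, [75, 25])
    return (q75 - q25) / (np.median(np.abs(a)) + 1e-12)

def sign_consistency(a: np.ndarray) -> float:
    """S_k: fraction of replicates sharing the sign of the median attribution."""
    return float(np.mean(np.sign(a) == np.sign(np.median(a))))

def residual_structure(u: pd.Series, lags=(1, 24, 168)) -> dict:
    """Fixed-lag autocorrelations and a Ljung-Box p-value for the unexplained
    component (one interval, one day, one week for hourly data)."""
    stats = {f"acf_lag_{k}": u.autocorr(lag=k) for k in lags}
    lb = acorr_ljungbox(u.dropna(), lags=[max(lags)], return_df=True)
    stats["ljung_box_pvalue"] = float(lb["lb_pvalue"].iloc[0])
    return stats
```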
The attribution stability metrics introduced here are intended as diagnostic indicators rather than formal statistical hypothesis tests. No universal thresholds are assumed for dispersion or sign-consistency, and the framework does not claim statistical significance in the classical inferential sense. Instead, stability is interpreted comparatively across drivers, buildings, and temporal resolutions, with high dispersion or inconsistent attribution sign indicating weak or non-transferable explanatory relationships under the available contextual features. This design choice reflects the exploratory and decision-support-oriented nature of low-frequency disaggregation. The primary objective is to expose uncertainty and sensitivity rather than to assert definitive causal attribution.
4.14. Comparison Algorithms and Evaluation Setup
To assess the performance of the proposed disaggregation framework, results are compared against six reference algorithms drawn from prior work [17] and the winning algorithms from the ADRENALIN Challenge [21]. These algorithms are not re-derived in this paper; instead, they are implemented following their original descriptions and applied under identical data and evaluation conditions.
The first group of comparison methods consists of three general-purpose disaggregation algorithms. The Bayesian regression approach models energy consumption as a probabilistic function of contextual drivers, yielding posterior estimates of temperature-dependent load. The time–frequency masking method applies short-time Fourier transforms to isolate components of the aggregate signal that correlate with contextual variables. The Bi-LSTM model learns temporal dependencies directly from the data using a sequence-to-sequence formulation.
The algorithms from the ADRENALIN Challenge were specifically tuned for low-frequency commercial building data and incorporate domain-informed assumptions regarding baseline behavior and temperature dependency. While differing in implementation, these methods emphasize strong explanatory coverage of temperature-dependent energy under competition-oriented evaluation criteria, providing a useful contrast to the uncertainty-aware and non-forced attribution strategy adopted in this work.
For all comparison algorithms, the same input data, temporal resolution, and preprocessing steps are used as for the proposed method. Performance is evaluated using normalized mean absolute error (NMAE) against measured temperature-dependent energy. No post hoc adjustments are applied to favor any method.
This evaluation setup ensures that observed performance differences arise from methodological design choices rather than data handling or metric definition.
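For reference, a minimal NMAE computation is sketched below; normalization by the mean of the measured temperature-dependent energy is assumed here, since the exact normalization constant is defined by the benchmark protocol.

```python
import numpy as np

def nmae(estimate: np.ndarray, ground_truth: np.ndarray) -> float:
    """Normalized mean absolute error between estimated and measured HVAC
    energy (normalization by the mean of the ground truth assumed)."""
    return float(np.mean(np.abs(estimate - ground_truth)) / np.mean(ground_truth))
```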
4.15. Implementation Details for Experimental Evaluation
The implementation was carried out in Python 3.10.11 using the following packages: XGBoost 3.0.2, SHAP 0.48.0, statsmodels 0.14.4, NumPy 2.0.2, pandas 2.2.3, matplotlib 3.10.0, and scikit-learn 1.6.1.
All experiments were conducted at hourly temporal resolution. Aggregate electricity consumption was resampled by summation, while contextual drivers were averaged over each hourly interval. Weather drivers included outdoor air temperature (T), relative humidity (Rh), global horizontal solar irradiance (SolGlob), wind speed (Ws), and wind direction (Wd), depending on building-specific data availability. Calendar features were derived from timestamp information and included hour-of-day, day-of-week, month, week-of-year index, and a binary weekend indicator. Hour-of-day, day-of-week, month, and week-of-year were one-hot encoded, while the weekend indicator was retained as a binary variable.
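The calendar feature construction can be sketched as follows with pandas; column names and encoding details are illustrative and may differ from the exact implementation.

```python
import pandas as pd

def calendar_features(index: pd.DatetimeIndex) -> pd.DataFrame:
    """One-hot encoded hour-of-day, day-of-week, month, and week-of-year,
    plus a binary weekend indicator."""
    raw = pd.DataFrame({
        "hour": index.hour,
        "dayofweek": index.dayofweek,
        "month": index.month,
        "week": index.isocalendar().week.to_numpy(),
    }, index=index)
    features = pd.get_dummies(raw.astype("category"), dtype=float)
    features["is_weekend"] = (index.dayofweek >= 5).astype(float)
    return features
```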
Three experimental configurations were evaluated. In the first configuration, baseline estimation was performed using a rolling-envelope formulation with a fixed window length of 14 days and a lower quantile of 0.2. In the second configuration, a temperature-banded baseline was applied to restrict baseline estimation to quasi-neutral regimes. In the third configuration, baseline subtraction was omitted entirely, and attribution was applied directly to the aggregate signal using both weather and calendar-derived drivers.
The regression backend used in all reported experiments is XGBoost. Hyperparameters were fixed across buildings to ensure comparability and avoid building-specific tuning. Model training and attribution were performed using forward-chaining time-series cross-validation with five expanding-window folds. For each fold, models were trained exclusively on historical data and evaluated on a contiguous future segment. All reported attributions are based on out-of-fold predictions to prevent information leakage.
Uncertainty estimates were obtained using a moving-block bootstrap with a block length of 48 h and 200 bootstrap replicates. Optional time–frequency masking constraints were not activated in the final reported experiments unless explicitly stated.
All configuration parameters used in the reported experiments are summarized in
Table 2. In addition,
Table 3 documents implementation-sensitive defaults and operational settings, including SHAP computation mode, clipping behavior, alignment rules, bootstrap configuration, and preprocessing edge cases that may influence reproducibility.
4.16. Summary of the Workflow
The complete workflow consists of sequential and modular stages. The process begins with data alignment and preprocessing, followed by baseline estimation from aggregate consumption. Excess energy is then modeled using contextual drivers, after which attribution is derived through explainability methods. Optional temporal and time–frequency consistency constraints may be applied to refine attribution behavior. The framework explicitly computes unexplained energy to prevent enforced completeness and maintain conservative attribution. Uncertainty is quantified using a block bootstrap procedure, and the results are subjected to diagnostic evaluation and visualization. This modular structure ensures reproducibility, transparency, and adaptability across different buildings and datasets.
5. Results
This section evaluates MD-ADD under a structured comparison protocol and positions its behavior relative to established low-frequency disaggregation approaches. Two groups of reference results are used: three general low-frequency methods reported in the earlier comparison study [17] and the three top-performing ADRENALIN Challenge methods [21], all evaluated on the same dataset and metric.
In addition to point accuracy (NMAE), stability and residual-structure diagnostics are reported to assess attribution robustness under partial observability. These include bootstrap uncertainty intervals, dispersion measures, sign consistency, and residual autocorrelation.
Results are computed using the MD-ADD decomposition defined in Section 4. Hourly HVAC estimates correspond to the temperature-driver attribution term defined in Equation (1), optionally modified by the gating operation in Section 4.9. Daily values are computed by summation over non-overlapping 24 h windows. Normalized mean absolute error (NMAE) is computed between aggregated attributions and measured HVAC ground truth. Bootstrap uncertainty intervals, dispersion statistics, and sign-consistency rates are computed according to the definitions in Section 4.11, Section 4.12 and Section 4.13.
5.1. Experimental Design and Evaluation Protocol
The evaluation compares multiple disaggregation approaches under a consistent protocol applied to the same buildings, temporal resolution, and ground-truth definitions. Results are reported for the Bayesian approach without weekday separation, the Bayesian approach with weekday separation, and the time–frequency mask-based method from the earlier comparison study, as well as for the three ADRENALIN Challenge algorithms (Adjusted STL, GMM-based clustering, and base-load decomposition). These reference results are reproduced from their respective sources and serve as quantitative baselines under identical evaluation conditions. The proposed MD-ADD framework is evaluated alongside these methods to enable direct comparison.
Within MD-ADD, three configurations are reported to assess the influence of internal modeling choices, as summarized in Table 4. The initial configuration corresponds to the baseline implementation described in the original submission, while the refined configuration incorporates adjusted baseline handling and validation settings. This distinction allows the effect of baseline formulation and backend validation to be examined without altering the overall attribution philosophy.
Results are reported at both hourly and daily aggregation levels where applicable. Daily aggregation is obtained by summing over non-overlapping 24 h intervals. Reporting both resolutions allows separation of short-term variability from systematic bias and clarifies whether performance differences are driven primarily by high-frequency fluctuations or persistent structural effects.
Performance is primarily evaluated using normalized mean absolute error (NMAE). For the earlier comparison study, additional metrics (MAE, RMSE, and R2) are included as originally reported to preserve consistency with the reference publication. In addition to accuracy, stability and uncertainty diagnostics are also shown. These include bootstrap-based uncertainty intervals, dispersion and sign-consistency measures across replicates, and residual autocorrelation diagnostics. Together, these metrics provide a more comprehensive assessment of attribution robustness beyond point error alone.
5.2. Baseline and Comparative Methods Included
Two reference groups are included in the comparative evaluation. The first group consists of methods reported in the earlier comparison study, including two Bayesian variants and a time–frequency mask-based algorithm. These approaches provide a structured baseline for interpreting how alternative low-frequency modeling assumptions perform across buildings. The second group comprises the ADRENALIN Challenge reference methods, namely Adjusted STL, GMM-based clustering, and a base-load decomposition approach, which represent competition-optimized baselines evaluated on the ADRENALIN benchmark. In addition to these external references, MD-ADD is evaluated in both an initial and a refined configuration. The experimental protocol further incorporates sensitivity analyses and a baseline leakage diagnostic to assess robustness, attribution stability, and residual structure.
5.3. Reference Results from the First Comparison Study
Table 5, Table 6 and Table 7 summarize the results of the Bayesian and time–frequency mask-based algorithms from the earlier comparison study. These results are included here to establish a quantitative reference point for interpreting subsequent comparisons; readers are referred to the original publication for full methodological details and extended discussion [17].
Both Bayesian variants show extreme variability across buildings. While some buildings achieve NMAE below 0.30, others exceed 3.0, accompanied by strongly negative R2 values. This indicates severe over- or under-attribution when model assumptions are violated.
The mask-based approach achieves substantially lower NMAE than the Bayesian methods on most buildings, but still exhibits negative R2 on several cases, indicating systematic misallocation despite good aggregate error.
5.4. Reference Results from the ADRENALIN Challenge
Table 8 summarizes the NMAE results of the three winning competition algorithms across buildings and resolutions [21]. The ADRENALIN Challenge algorithms achieve uniformly low NMAE across buildings, reflecting effective optimization for the benchmark metric under the fixed evaluation protocol used in the challenge.
5.5. MD-ADD Results: Initial Configuration
Table 9 reports the performance of MD-ADD using an XGBoost backend with weather drivers. Both hourly and daily NMAE are shown. In this initial configuration, MD-ADD achieves its lowest error on L14.B03 and its highest on L06.B01, and daily aggregation reduces NMAE across all buildings, indicating that residual errors are dominated by short-term variability rather than long-term bias. MD-ADD yields higher NMAE than competition-optimized baselines, but the systematic error reduction under daily aggregation reflects the deliberate decision not to force short-term variability into driver attributions when contextual evidence is weak.
5.6. MD-ADD Results: Refined Configuration
Table 10 reports MD-ADD performance using a refined configuration, incorporating baseline handling and backend validation while preserving the core conservative attribution strategy. In the refined configuration, the effect of daily aggregation becomes building-dependent, reflecting the interaction between baseline formulation and attribution stability rather than a uniform reduction in short-term error. Relative to the initial configuration, the refined setup reduces hourly NMAE for several buildings, with the most pronounced improvement observed for L06.B01. In other cases, particularly for buildings with weaker or more irregular temperature dependence, the refined configuration trades numerical accuracy for improved baseline separation and attribution stability.
5.7. Quantitative Comparison with Established Low-Frequency Disaggregation Methods
To contextualize the performance of the proposed MD-ADD framework, its results are compared against six established low-frequency disaggregation methods drawn from two prior studies. The first group consists of three representative algorithmic families evaluated in an earlier comparative study, namely a Bayesian regression-based method, a time–frequency mask-based approach, and a bidirectional LSTM sequence model. The second group consists of the three top-performing algorithms from the ADRENALIN Load Disaggregation Challenge, including the adjusted STL-based method, a Gaussian mixture model (GMM)-based clustering approach, and a baseline-oriented decomposition strategy.
All comparison results are reported using normalized mean absolute error (NMAE) and are evaluated on the same buildings, temporal resolutions, and ground truth definitions as used for MD-ADD. No retraining, retuning, or post-processing adjustments were applied beyond those described in the original sources, ensuring that observed differences reflect methodological characteristics rather than experimental artifacts.
Table 11 summarizes hourly NMAE values for all methods across the evaluated buildings. The results show that the ADRENALIN Challenge algorithms achieve the lowest NMAE overall, reflecting their optimization for the competition metric under fixed evaluation constraints. The time–frequency mask-based method from the earlier comparison study also achieves relatively low NMAE on several buildings, although with notable variability.
In contrast, MD-ADD exhibits higher NMAE across most buildings, particularly at hourly resolution. However, unlike several reference methods, MD-ADD avoids extreme failure cases and maintains consistent behavior across heterogeneous building types. This reflects the deliberate design choice to avoid forced completeness and to preserve unexplained energy when contextual drivers provide insufficient explanatory power.
When results are aggregated to daily resolution, MD-ADD exhibits a consistent reduction in error, indicating that residual discrepancies are dominated by short-term variability rather than systematic bias. This behavior contrasts with competition-optimized methods, which aggressively fit short-term fluctuations to minimize pointwise error.
5.8. Multi-Driver Attribution Behavior
MD-ADD operates at the level of individual contextual drivers. Each feature in the contextual set produces a separate out-of-fold signed attribution time series. Aggregation into broader driver families, such as a composite weather-dependent component, is performed only for alignment with available benchmark targets.
To clarify this driver-level structure, Figure 1 presents the daily aggregated attributions for the individual meteorological variables prior to aggregation. The contributions correspond to SHAP-based signed driver attributions estimated by the trained model, alongside the unexplained residual component.
The figure illustrates that meteorological variables contribute independently and exhibit distinct temporal patterns. This confirms that MD-ADD estimates simultaneous contextual effects rather than relying on a single composite weather signal. The residual component remains non-zero, consistent with the conservative attribution philosophy of preserving unexplained energy.
To examine redistribution behavior when an additional driver family is introduced, the contextual feature set was extended by incorporating calendar-derived variables, specifically hour-of-day and weekday indicators, alongside the weather drivers. This configuration allows assessment of how explanatory mass is redistributed across distinct contextual driver families.
Figure 2 presents the mean diurnal decomposition for building L14_B03_1H under the multi-driver configuration. The figure shows that the calendar-attributed component closely follows the pronounced working-hour profile of the aggregate signal, capturing systematic daytime structure that cannot be explained by temperature variation alone. In contrast, the weather-attributed component exhibits smoother variation consistent with longer-term meteorological influence. The unexplained component remains non-zero across the full 24 h cycle, indicating that additional drivers redistribute explained energy rather than forcing complete attribution.
Figure 3 displays the daily stacked decomposition for January 2020 for the same building. The calendar component exhibits clear weekday structure, while the weather component varies more gradually across days. The persistence of a residual component throughout the month further confirms that the expanded driver set reduces but does not eliminate unexplained variability, consistent with the conservative attribution philosophy of MD-ADD.
Together, these results demonstrate that when a second contextual driver family is introduced, MD-ADD redistributes explanatory mass in a stable and interpretable manner rather than absorbing schedule-driven structure into weather attribution.
Table 12 summarizes the mean daily attribution shares for weather, calendar, and unexplained components for the evaluated L14 buildings (hourly data, daily aggregation). The results show that calendar contribution is non-trivial and building-dependent (approximately 0.31 to 0.53 mean share), supporting the claim that MD-ADD can attribute multiple drivers simultaneously rather than implicitly folding schedule effects into temperature attribution.
The stability diagnostics in Table 12 indicate that the driver shares are reasonably concentrated (share CI widths around 0.024 to 0.041 for weather and 0.028 to 0.040 for calendar), while the unexplained share remains consistently non-zero. This supports the intended interpretation that introducing an additional driver redistributes explained energy rather than eliminating unexplained energy.
5.9. Stability, Uncertainty, and Residual Diagnostics
Stability and residual-structure diagnostics defined in Section 4.13 are reported systematically across all experimental configurations to provide a consistent mapping between methodological definitions and empirical behavior. While Table 12 presents the full multi-driver diagnostics for configuration C3, Table 13 provides a configuration-consistent summary of key stability and residual-structure indicators across C1–C3. The table reports mean daily weather attribution share, mean unexplained share, bootstrap confidence interval width as a stability indicator, and the lag-1 autocorrelation of daily residual energy as a residual-structure indicator.

Across configurations, clear structural differences emerge. Relative to C1, the refined baseline configuration (C2) generally increases the unexplained share while maintaining comparable bootstrap confidence interval widths, indicating a more conservative separation of baseline and driver-attributed energy without loss of stability. When calendar drivers are introduced in C3, the weather share decreases substantially across buildings while the unexplained component remains non-zero, demonstrating redistribution of explanatory mass rather than forced completeness. These trends demonstrate that the stability and residual-structure metrics defined in Section 4.13 characterize attribution behavior consistently across configurations.
The high lag-1 autocorrelation values observed for daily residual energy (0.92–0.98) reflect persistent regime-level structure rather than short-term stochastic noise. Because residuals are evaluated at daily aggregation, they retain low-frequency patterns associated with operational schedules, occupancy regimes, and other slowly varying drivers not explicitly modeled. Consequently, the residual component should not be interpreted as random error but as temporally structured unexplained variability. This behavior is consistent with the design objective of MD-ADD, which preserves structured unexplained energy instead of forcing its absorption into contextual driver attributions.
As shown in Table 12 for the multi-driver configuration, weather and calendar share CI widths remain moderate (approximately 0.024–0.041 for weather and 0.028–0.040 for calendar), indicating stable attribution under bootstrap perturbations. The unexplained share remains consistently non-zero across buildings, and residual autocorrelation values between 0.92 and 0.98 indicate persistent structured variability not absorbed by contextual drivers. These results reinforce the conservative attribution behavior of MD-ADD under the expanded multi-driver configuration.
5.10. Sensitivity to Baseline Formulation and Backend Tuning
Additional experiments were conducted to assess the sensitivity of MD-ADD to backend hyperparameter tuning and baseline formulation.
Across all buildings, hyperparameter tuning of the XGBoost backend produced negligible changes in performance. Differences in hourly NMAE between tuned and untuned configurations remained below 0.3 percentage points for all cases. This indicates that backend optimization is not the dominant factor controlling performance.
In contrast, baseline formulation exhibited systematic and building-dependent effects. Configurations using a rolling-envelope baseline, no explicit baseline, and a temperature-banded baseline produced distinct error patterns. For buildings with stable schedules and pronounced seasonal structure, differences between baseline formulations were small. For buildings with mixed regimes or year-round operation, baseline choice measurably affected both hourly and daily NMAE.
These results confirm that baseline handling dominates performance differences within MD-ADD, while backend tuning plays a secondary role.
5.11. Baseline Leakage Diagnostic
The sensitivity analysis highlights that baseline behavior is a critical design choice in low-frequency contextual disaggregation, rather than a fixed modeling component. In particular, rolling-envelope baselines derived from the aggregate signal can exhibit correlation with outdoor temperature in buildings with strong seasonal demand. This behavior is quantified in Table 14, where high correlations between the estimated baseline and temperature are observed for several L14 buildings.
Importantly, this correlation does not indicate a modeling error within the MD-ADD framework. The rolling baseline is not intended to represent a physically temperature-independent load; rather, it serves as a conservative lower-envelope separation that adapts to long-term changes in aggregate consumption. In buildings with pronounced seasonal structure, such adaptation is expected and reflects genuine changes in minimum operational demand.
Crucially, MD-ADD does not rely on a single baseline formulation. When temperature coupling of the rolling baseline is undesirable, the framework provides alternative configurations that explicitly resolve this behavior. In the refined configuration, this is achieved either by enforcing a quasi-constant baseline through temperature-banded estimation or by omitting baseline subtraction altogether and allowing the unexplained component to absorb low-frequency structure.
The temperature-banded baseline enforces constancy by construction, restricting baseline estimation to periods where temperature influence is minimal. Where such neutral regimes are stable, this approach yields a baseline that is effectively independent of seasonal variation. Where neutral regimes are ill-defined or absent, degraded performance indicates that the data do not support a meaningful constant baseline, rather than a failure of the disaggregation formulation.
From this perspective, baseline-temperature correlation is best interpreted as a diagnostic signal rather than a defect. It reveals whether the building exhibits a stable, temperature-independent minimum load and guides the choice of baseline strategy. The MD-ADD framework accommodates this variability explicitly, ensuring that baseline behavior does not force spurious attribution or conceal unexplained structure.
5.12. Diagnostic Analysis: Qualitative Behavior, Residual Structure, and Attribution Stability
Quantitative error metrics provide only a partial view of model behavior in low-frequency contextual disaggregation. To complement NMAE-based evaluation, qualitative diagnostics and uncertainty-based analyses are used to examine temporal alignment, residual structure, and the stability of driver attributions under temporal perturbations.
Figure 4 and Figure 5 illustrate daily ground truth versus MD-ADD temperature-attributed estimates for two representative buildings with contrasting behavior. Figure 4 shows a building with well-defined weather sensitivity and relatively regular operation, where the model captures the dominant temperature-driven pattern while conservatively handling short-term deviations. The close alignment between measured and estimated temperature-dependent energy indicates that, in such cases, MD-ADD produces consistent and interpretable attributions.
In contrast, Figure 5 shows daily ground truth versus MD-ADD estimates for building L06.B01. Substantial deviations remain, particularly during periods of mixed or irregular operation. These discrepancies highlight the limitations of weather-driven contextual models in buildings where multiple drivers, control strategies, or occupancy patterns interact in complex ways. Rather than forcing attribution in such cases, MD-ADD preserves a significant unexplained component.
Residual structure is further examined in Figure 6, which presents the autocorrelation of unexplained energy for L06.B01. The residual exhibits strong temporal dependence rather than white noise, indicating the presence of unmodeled drivers or regime shifts. This behavior supports the interpretation that residual structure in MD-ADD is not merely noise but an informative signal of missing contextual information.
In addition to qualitative diagnostics, attribution stability is assessed using block bootstrap resampling. For each building, bootstrap replicates produce empirical distributions of driver contributions at the chosen aggregation level, enabling confidence intervals for both absolute energy attribution and relative energy shares. In buildings where temperature-driven behavior is stable across seasons, the resulting intervals are narrow, reflecting consistent driver relevance across resampled segments. In buildings with mixed regimes, shifting schedules, or unobserved control changes, intervals widen, indicating that driver contributions are not stable under plausible temporal perturbations of the data.
Beyond interval width, bootstrap distributions provide a direct stability diagnostic. When a driver contribution frequently changes sign or collapses toward zero across bootstrap replicates, it indicates weak or non-transferable explanatory power under the available features. Conversely, persistent contributions with limited dispersion indicate a robust relationship compatible with operational interpretation. These uncertainty and stability outputs therefore complement pointwise error metrics by revealing when driver attributions can be treated as reliable evidence and when they should instead be interpreted as tentative signals requiring additional contextual data.
Together, these diagnostics demonstrate that MD-ADD not only estimates temperature-dependent energy but also exposes when such estimates are stable and when they are highly sensitive to plausible temporal variations in building operation.
6. Discussion
This section provides a deeper interpretation of the results by explicitly contrasting MD-ADD with two groups of reference methods, namely the earlier low-frequency NILM algorithms evaluated in the comparison study and the competition-optimized algorithms from the ADRENALIN Challenge. The emphasis is placed on methodological implications, attribution stability, and robustness under partial observability.
MD-ADD is evaluated against methods that prioritize different objectives under the same data constraints. Some baselines are designed to minimize pointwise error by maximizing explained variance and enforcing near-complete allocation of excess energy, while MD-ADD prioritizes conservative attribution and explicit separation of unexplained variability. The discussion therefore interprets performance differences as outcomes of these objective choices, and it contrasts two MD-ADD configurations to isolate the effect of baseline handling without changing the attribution philosophy.
6.1. MD-ADD and Classical Low-Frequency NILM Algorithms
The Bayesian, mask-based, and BI-LSTM algorithms from the earlier comparison study represent three distinct modeling philosophies that are commonly applied to low-frequency NILM.
The Bayesian methods rely on strong structural assumptions, in particular that periods of low consumption correspond to HVAC inactivity and that temperature dependence dominates excess energy. The reported results confirm that these assumptions hold only for a subset of buildings. While Buildings 6 and 8 achieve very low NMAE values, several other buildings exhibit extreme errors, with NMAE values exceeding 3.0 and highly negative R2. These failures are not marginal. They indicate that the model is systematically misallocating energy when its core assumptions are violated, for example in buildings with year-round HVAC operation or strong non-thermal drivers.
MD-ADD differs fundamentally in this respect. Although its best-case NMAE values are higher than those of the Bayesian model, it avoids catastrophic failure across buildings. The worst MD-ADD case, L06.B01, reaches an NMAE of approximately 2.43, which is still large but notably lower than the Bayesian worst cases. In the refined configuration, this value is substantially reduced. This reflects the effect of explicitly allowing unexplained energy rather than forcing all residual consumption into a temperature-driven component. In practice, this means that MD-ADD sacrifices best-case accuracy to gain robustness across heterogeneous building behaviors.
The time–frequency mask-based algorithm demonstrates substantially better numerical performance. Its NMAE values are consistently below 0.8 and often below 0.4. From a metric-focused perspective, this makes it clearly superior to MD-ADD. However, the accompanying R2 values reveal an important limitation. Several buildings exhibit strongly negative R2 despite low NMAE, indicating that the model captures the magnitude of HVAC energy but misrepresents its temporal structure. This is consistent with a formulation that prioritizes reconstruction accuracy over physical plausibility.
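The coexistence of low NMAE and strongly negative R2 follows directly from the metric definitions. The sketch below assumes the common normalization of the mean absolute error by the mean of the ground truth; a synthetic estimate with the correct daily magnitude but a six-hour timing error keeps NMAE near 0.45 while driving R2 down to −1.

```python
# Illustrative example of low NMAE coexisting with negative R2 (synthetic data).
import numpy as np

def nmae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred)) / np.mean(y_true)

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

t = np.arange(240)                                   # ten days of hourly data
y_true = 10 + 5 * np.sin(2 * np.pi * t / 24)         # regular daily HVAC cycle
y_pred = 10 + 5 * np.sin(2 * np.pi * (t + 6) / 24)   # right magnitude, wrong timing
print(round(nmae(y_true, y_pred), 2), round(r2(y_true, y_pred), 2))  # ~0.45, -1.0
```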
MD-ADD produces higher NMAE values but avoids this inconsistency. Residual diagnostics show that unexplained energy retains structure rather than being absorbed into the HVAC estimate. This difference highlights a key conceptual distinction. The mask-based approach treats unexplained structure as error to be minimized, whereas MD-ADD treats it as information about missing drivers or non-stationary behavior.
The BI-LSTM model occupies an intermediate position. Its NMAE values are more stable than those of the Bayesian approach but higher than those of the mask-based method. MD-ADD performs comparably to BI-LSTM on several buildings, particularly those with strong seasonal patterns. The critical difference is interpretability. BI-LSTM implicitly learns to explain residual structure through latent representations, whereas MD-ADD explicitly exposes uncertainty and residual behavior. From an analytical standpoint, this makes MD-ADD more suitable for diagnostic use, even when its numerical accuracy is similar or worse.
6.2. MD-ADD and ADRENALIN Challenge Algorithms
The comparison with the ADRENALIN Challenge algorithms reveals the largest numerical gap. The adjusted STL, GMM-based clustering, and base-load decomposition methods achieve average NMAE values between 0.23 and 0.27, far below those of MD-ADD.
This difference should not be interpreted as a simple measure of methodological superiority. The competition algorithms were explicitly optimized for NMAE under fixed evaluation constraints. Their formulations include strong heuristic elements such as envelope fitting, clipping, reference-week selection, and implicit or explicit forcing of completeness. These design choices are effective for leaderboard performance but rely on assumptions that may not generalize beyond the benchmark setting.
MD-ADD intentionally avoids these mechanisms. As a result, it does not exploit opportunities to reduce error through aggressive fitting. This is particularly visible in buildings such as L14.B05, where competition algorithms achieve near-zero NMAE while MD-ADD remains around 0.59 despite the apparent regularity of the building. This gap persists in the refined configuration, indicating that the remaining error is not primarily due to a misestimated baseline magnitude but to the deliberate decision not to force short-term deviations into temperature-driven attributions when contextual evidence is weak.
In buildings with known complexity, such as L06.B01, the contrast becomes even more informative. Competition algorithms achieve moderate NMAE values, while MD-ADD performs poorly in absolute terms. However, residual diagnostics show that MD-ADD leaves substantial structured unexplained energy. The competition results themselves identify this building as problematic across methods. The difference is that MD-ADD makes this difficulty explicit rather than concealing it within the HVAC estimate.
To summarize the conceptual distinctions discussed above and to provide a structured comparison of modeling assumptions and attribution behavior, Table 15 presents a qualitative comparison between MD-ADD and representative low-frequency disaggregation methods considered in this study.
6.3. The Role of Temporal Aggregation
One consistent observation across MD-ADD results is the reduction in NMAE when moving from hourly to daily aggregation. This pattern suggests that a large fraction of the error arises from short-term variability rather than systematic bias. In other words, MD-ADD captures the correct long-term magnitude of HVAC energy but does not attempt to explain all intra-day fluctuations.
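This aggregation check is straightforward to reproduce. The sketch below assumes hourly pandas series indexed by a DatetimeIndex and the same mean-normalized NMAE variant; a pronounced drop from hourly to daily NMAE indicates short-term variability rather than magnitude bias.

```python
# Sketch of the hourly-versus-daily NMAE comparison (series names illustrative).
import numpy as np
import pandas as pd

def nmae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred)) / np.mean(y_true)

def aggregation_check(truth_hourly: pd.Series, estimate_hourly: pd.Series) -> dict:
    """Recompute NMAE after resampling both series to daily energy totals."""
    truth_daily = truth_hourly.resample("D").sum()
    estimate_daily = estimate_hourly.resample("D").sum()
    return {
        "nmae_hourly": nmae(truth_hourly.to_numpy(), estimate_hourly.to_numpy()),
        "nmae_daily": nmae(truth_daily.to_numpy(), estimate_daily.to_numpy()),
    }
```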
This behavior contrasts with competition algorithms, which often fit short-term variations aggressively. While this reduces hourly NMAE, it increases the risk of attributing non-HVAC behavior, such as occupancy-driven or control-related effects, to temperature-dependent loads. MD-ADD effectively filters out this behavior by design.
The reduction in error under daily aggregation is observed consistently across both the initial and refined MD-ADD configurations, confirming that baseline refinements primarily affect magnitude separation rather than the treatment of short-term variability.
6.4. Residual Structure as an Analytical Signal
A defining outcome of MD-ADD is the presence of structured residuals. Residual autocorrelation plots show clear temporal dependence rather than white noise, particularly in buildings with complex operation. This should not be interpreted as model inadequacy alone. Instead, it indicates that the available drivers are insufficient to explain observed behavior.
In practical terms, this information is valuable. Structured residuals can signal missing contextual variables, changes in control strategy, or abnormal operation. Competition-optimized models suppress this signal by construction, whereas MD-ADD preserves it, enabling subsequent analysis.
The multi-driver experiment presented in Section 5 further reinforces this interpretation. When calendar-derived contextual features are introduced alongside weather drivers, attribution mass is redistributed rather than absorbed into a single dominant component. Buildings exhibiting strong schedule regularity show increased calendar attribution, while weather-dominated buildings retain temperature-driven shares. Importantly, the unexplained component remains consistently non-zero. This confirms that adding contextual drivers refines explanatory structure without enforcing completeness, and that residual energy reflects genuinely unmodeled variability rather than mere numerical error.
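One way to realize such a multi-driver split without enforcing completeness is a constrained regression over weather and calendar features whose remainder is reported explicitly. The non-negative least-squares sketch below is an illustrative stand-in for, not a description of, the estimator used inside MD-ADD.

```python
# Hedged sketch of multi-driver attribution with an explicit unexplained term.
import numpy as np
from scipy.optimize import nnls

def attribute_drivers(excess, weather_feats, calendar_feats):
    """Split above-baseline energy into weather, calendar, and unexplained parts.

    excess         : 1-D array of baseline-removed energy
    weather_feats  : 2-D array (n_samples, n_weather_features)
    calendar_feats : 2-D array (n_samples, n_calendar_features)
    """
    X = np.hstack([weather_feats, calendar_feats])
    coefs, _ = nnls(X, excess)                    # non-negative driver weights
    contrib = X * coefs                           # per-sample, per-feature energy
    weather_energy = contrib[:, :weather_feats.shape[1]].sum()
    calendar_energy = contrib[:, weather_feats.shape[1]:].sum()
    # The remainder is reported as unexplained energy instead of being
    # redistributed among the drivers.
    unexplained = excess.sum() - weather_energy - calendar_energy
    return weather_energy, calendar_energy, unexplained
```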
6.5. Implications for NILM Evaluation
The comparisons reinforce a central insight. Low NMAE is not equivalent to reliable disaggregation in low-frequency, context-driven NILM. Methods that enforce completeness can achieve excellent numerical performance while providing limited insight into underlying building behavior.
MD-ADD demonstrates an alternative evaluation philosophy. It prioritizes robustness, interpretability, and explicit uncertainty representation over metric optimization. This leads to higher error but produces outputs that are better aligned with diagnostic and decision-support applications.
MD-ADD is intended for settings where disaggregation outputs are used as evidence in diagnostic reasoning. In such workflows, conservative attributions and explicit residual structure can be more informative than maximizing explained variance, because they separate driver-supported variability from variability that requires additional contextual data or operational investigation.
6.6. Baseline Formulation and Leakage Effects
The sensitivity analysis highlights baseline formulation as a dominant factor in low-frequency contextual disaggregation. In the rolling-envelope baseline used by default in MD-ADD, the estimated baseline can increase during periods of elevated seasonal demand. As a result, a portion of temperature-driven energy may be absorbed into the baseline rather than attributed to contextual drivers.
This effect is quantified by the observed correlation between baseline estimates and outdoor temperature, which exceeds 0.8 for several buildings with strong seasonal structure. Such coupling can reduce apparent temperature dependence under rolling baseline formulations, but it can be mitigated within MD-ADD by selecting an alternative baseline strategy or by omitting baseline subtraction altogether.
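This diagnostic amounts to correlating the estimated baseline with outdoor temperature. In the sketch below, a rolling lower quantile of the load stands in for the rolling-envelope baseline; the seven-day window and fifth-percentile level are illustrative assumptions rather than MD-ADD's actual settings.

```python
# Sketch of the baseline-temperature leakage diagnostic (pandas, DatetimeIndex assumed).
import pandas as pd

def baseline_temperature_correlation(load: pd.Series, temperature: pd.Series,
                                     window: str = "7D", q: float = 0.05) -> float:
    """Pearson correlation between a rolling lower-quantile baseline and temperature."""
    baseline = load.rolling(window, min_periods=24).quantile(q)
    # Strong coupling (magnitudes approaching the 0.8 reported for seasonal
    # buildings) signals that temperature-driven energy is leaking into the baseline.
    return baseline.corr(temperature)
```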
The temperature-banded baseline was introduced to mitigate this effect by restricting baseline estimation to HVAC-neutral regimes. Its mixed performance across buildings is informative. Where neutral regimes are stable, the approach reduces leakage and improves separation. Where buildings exhibit continuous or overlapping operation, performance degrades, indicating that neutral regimes are either unstable or poorly defined.
Rather than constituting a failure, this degradation exposes regime complexity that would otherwise be concealed by forced attribution. Stronger baseline assumptions therefore act as a stress test on the data, revealing limitations in observability and driver availability.
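For illustration, a temperature-banded baseline of the kind discussed here can be sketched as a low quantile computed only over hours falling inside an assumed HVAC-neutral temperature band; the band limits and quantile below are hypothetical, and the explicit failure path mirrors the degradation case just described.

```python
# Sketch of a temperature-banded baseline with an explicit "poorly defined regime" path.
import math
import pandas as pd

def temperature_banded_baseline(load: pd.Series, temperature: pd.Series,
                                band=(12.0, 18.0), q: float = 0.10) -> float:
    """Baseline level estimated from hours within a neutral outdoor-temperature band."""
    neutral = (temperature >= band[0]) & (temperature <= band[1])
    if int(neutral.sum()) < 24:
        # Too few neutral hours: the neutral regime is unstable or poorly defined,
        # which is exactly the degradation behavior discussed in the text.
        return math.nan
    return float(load[neutral].quantile(q))
```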
6.7. Limitations
Several limitations of the current work should be acknowledged. First, MD-ADD relies on the availability and quality of contextual drivers; when relevant drivers are missing or poorly measured, residual energy can remain large, leading to higher error metrics. Although the framework supports multiple contextual drivers by design and additional drivers can be incorporated without changes to the core attribution and uncertainty mechanisms, the reported experiments focus primarily on weather-driven disaggregation as a representative and widely studied use case. Second, the current implementation does not explicitly model occupancy, control logic, or operational schedules beyond baseline estimation. Third, the evaluation is limited to the ADRENALIN Challenge dataset, which, while diverse and carefully curated, does not cover the full range of commercial building behaviors.
MD-ADD is expected to perform poorly in buildings where no contextual variables exhibit stable or interpretable relationships with aggregate consumption, such as environments dominated by stochastic occupancy, rapidly changing control strategies, or highly heterogeneous end-use mixes. In such cases, the framework will typically produce large unexplained components and high attribution uncertainty. Within the intended analytical scope, this behavior is considered an informative outcome rather than a failure, as it indicates that additional sensing, metadata, or operational insight is required for credible interpretation.
These limitations reflect deliberate scope choices rather than implementation deficiencies. Nevertheless, they constrain the applicability of the current framework and motivate future extensions.
6.8. Future Research Needs
Future research should extend MD-ADD toward multi-context disaggregation in settings where weather alone is insufficient. While the framework is inherently multi-driver, the reported experimental evaluation focuses primarily on weather-driven disaggregation as a representative and widely studied use case enabled by the availability of validated HVAC sub-metering. This choice does not restrict the generality of the formulation, which supports the integration of additional contextual drivers without changes to the attribution or uncertainty mechanisms. In particular, incorporating additional drivers such as occupancy proxies, indoor environmental measurements, or control signals is expected to reduce structured residuals and improve attribution quality without reintroducing forced completeness. This is particularly relevant for buildings where temperature alone explains only a fraction of variability.
Future versions of MD-ADD could include automated mechanisms to assess driver relevance and suppress drivers that do not exhibit consistent explanatory power. Such mechanisms could prevent spurious attributions and further improve robustness across heterogeneous buildings.
Integrating causal assumptions or soft physical constraints may help distinguish coincidental correlations from meaningful dependencies, especially in the presence of correlated drivers. This could improve interpretability while preserving the framework’s conservative attribution philosophy.
Extending the framework to operate in an online or rolling setting would enable detection of regime changes, control strategy shifts, or sensor faults. Structured residuals could then be used as triggers for further investigation rather than as static outputs.
Future work should continue to explore evaluation criteria beyond pointwise error metrics. Stability, uncertainty, residual structure, and plausibility should be treated as first-class evaluation dimensions, particularly for decision-support applications in smart buildings.
7. Conclusions
This work introduced MD-ADD, a multi-driver disaggregation framework designed for low-frequency smart meter data in commercial and public buildings. Unlike most existing NILM approaches, the framework explicitly avoids forced completeness and treats unexplained energy as a valid and informative outcome rather than a modeling failure.
The experimental results demonstrate that this design choice has measurable consequences. In terms of normalized mean absolute error (NMAE), MD-ADD does not outperform state-of-the-art algorithms optimized for leaderboard performance. Competition-winning methods achieve average hourly NMAE values in the range of approximately 0.23–0.27, whereas MD-ADD yields higher values under both its initial and refined configurations.
However, relative performance analysis highlights important improvements in robustness. In the initial configuration, MD-ADD exhibits a worst-case hourly NMAE of 2.43 (L06.B01). In the refined configuration, this worst-case value is reduced to 0.65, corresponding to an approximate 73% relative reduction in maximum error. This improvement is achieved without altering the attribution philosophy, but through revised baseline handling and validation consistency.
In contrast, classical Bayesian formulations exhibit worst-case NMAE values exceeding 3.9 under identical evaluation conditions, indicating catastrophic failure when structural assumptions are violated. MD-ADD maintains bounded error across all evaluated buildings and does not exhibit such extreme instability.
Temporal aggregation further clarifies the nature of residual discrepancies. Across buildings, daily aggregation reduces NMAE by approximately 10–40% relative to hourly values in the initial configuration. This confirms that remaining discrepancies are dominated by short-term variability rather than systematic bias in long-term HVAC magnitude.
These findings support a formulation that preserves unexplained energy as an explicit analytical signal under realistic smart meter conditions.
In practical applications, this formulation is particularly suitable for diagnostic and decision-support workflows in which interpretability, failure transparency, and uncertainty awareness are more valuable than maximal explained variance. Suitable use cases include screening-level HVAC benchmarking, identification of regimes requiring additional sensing, and investigation of abnormal operational behavior. In such contexts, the unexplained component serves as an operational signal highlighting regimes that cannot be credibly interpreted using weather and calendar drivers alone.