1. Introduction
Buildings account for a substantial share of global electricity consumption, with heating, cooling, and ventilation systems representing some of the most energy-intensive [1,2] and operationally flexible end uses. Improving visibility into these loads is essential for applications such as energy performance benchmarking, demand-side management, fault detection, and operational optimization. However, fine-grained sub-metering remains costly and difficult to deploy at scale. Non-intrusive load monitoring (NILM) offers a scalable alternative by estimating end-use consumption from aggregate electricity measurements.
NILM techniques offer several potential benefits in building energy analytics, including improved visibility into end-use consumption, support for personalized energy efficiency measures [3], inference of occupant-related patterns [4], fault detection in heating, ventilation, and air-conditioning (HVAC) systems [5], and more accurate demand-side management strategies [6]. Despite these advantages, reliable NILM remains particularly challenging for heating and cooling loads. Unlike many conventional appliances with discrete on-off behavior, HVAC systems typically operate in a continuously modulated manner, which complicates their separation from aggregate measurements [7]. This challenge is amplified in commercial and public buildings, where monitoring is often limited to low temporal resolutions such as hourly or 15 min data. While most NILM research has focused on high-resolution measurements that enable event-based pattern recognition, only a limited number of studies have demonstrated effective HVAC load disaggregation at low sampling rates [8,9,10].
Historically, NILM research has focused on residential environments and high-frequency electrical measurements, where appliance-level signatures and switching events can be exploited. In contrast, commercial and public buildings are typically monitored using low-frequency smart meter data, often at 15 min or hourly resolution [11]. At these temporal scales, transient signatures are largely absent, and classical event-based NILM techniques become ineffective [12]. Disaggregation in such settings must therefore rely on contextual information, including weather variables, calendar effects, and other exogenous drivers.
A growing body of work has demonstrated that temperature-dependent and other context-driven components can be partially isolated from aggregate measurements. These approaches include regression-based [13,14], probabilistic, spectral, and learning-based methods [15]. While these can achieve high apparent accuracy, many disaggregation formulations impose structural assumptions that effectively require all non-baseline energy to be attributed to the modeled drivers [16]. In practice, however, electricity consumption reflects contextual influences, control logic, occupant behavior, and stochastic effects that are not fully captured by the available drivers.
Previous work by the authors compared three representative algorithmic families for low-frequency temperature-dependent load disaggregation, including Bayesian, time–frequency mask-based, and bidirectional LSTM models [17]. While some methods achieved low pointwise error under favorable conditions, their performance degraded substantially when baseline assumptions were violated or observability was incomplete. These findings indicate that the central difficulty in low-frequency disaggregation lies not only in model selection, but in how the attribution problem is formulated under partial observability.
Beyond numerical accuracy, the practical value of load disaggregation in smart buildings depends critically on the interpretability and stability of the resulting estimates [18]. Disaggregation outputs are rarely used in isolation; instead, they support operational reasoning tasks such as fault diagnosis, control strategy assessment, retrofit evaluation, and cross-building benchmarking [19]. In these settings, over-attribution of unexplained energy to contextual drivers can lead to misleading conclusions, for example by overstating the influence of weather or concealing control-related inefficiencies. Models that implicitly assume complete explainability may therefore achieve high apparent accuracy while providing limited decision-support value. This motivates formulations that preserve unexplained energy rather than forcing attribution when explanatory evidence is weak [15]. Methodologically, the proposed framework departs from existing NILM formulations by explicitly decoupling attribution from completeness, enabling uncertainty-aware, multi-driver decomposition under partial observability rather than forcing full explanation of aggregate demand.
These limitations are illustrated by the ADRENALIN Load Disaggregation Challenge [20], which released a curated dataset for temperature-dependent load disaggregation [5]. Post hoc analysis of the winning algorithms showed that highly competitive solutions achieved low normalized mean absolute error by aggressively fitting seasonal structure and residual variability into temperature-dependent components [21]. While effective under the competition metric, these approaches were sensitive to baseline formulation, building type, and operational regime, and often relied on implicit completeness assumptions that are difficult to justify in real-world deployments. This tension between benchmark optimization and operational interpretability directly motivates the design choices of the proposed framework.
The proposed framework is explicitly not designed to optimize leaderboard-style error metrics. Instead, MD-ADD targets low-frequency, context-driven settings where partial observability, regime variability, and limited metadata are the dominant constraints. The framework is therefore designed for analytical and diagnostic workflows in which interpretability, robustness, and explicit uncertainty exposure are more critical than maximizing explained variance or minimizing pointwise error.
This paper proposes MD-ADD, a multi-driver automatic dependency disaggregation framework designed for low-frequency smart meter data in commercial and public buildings. The framework explicitly treats contextual drivers as informative but incomplete, supports an explicit unexplained energy component, and incorporates uncertainty-aware attribution mechanisms. Driver contributions are estimated using out-of-fold modeling to reduce leakage, and uncertainty is quantified through block bootstrap resampling. Optional temporal and time–frequency consistency constraints are included to restrict attributions to scales compatible with the expected physical influence of each driver. The framework is evaluated on the ADRENALIN Challenge dataset [5], which provides validated sub-metering for performance assessment.
Taken together, these findings establish that low-frequency HVAC load disaggregation is fundamentally constrained by partial observability, regime variability, and incomplete contextual information. The present work builds on these insights by proposing a conservative, uncertainty-aware disaggregation formulation for low-frequency smart meter data. Rather than introducing a new algorithmic family, the MD-ADD framework provides a unifying attribution pipeline designed to produce robust and interpretable decompositions under realistic smart meter conditions.
This paper makes three main contributions. First, it formulates low-frequency contextual disaggregation as a decomposition problem in which unexplained energy is an expected and informative outcome, rather than an error term that must be eliminated through forced attribution; the formulation prioritizes interpretability, attribution stability, and diagnostic usefulness under incomplete observability over leaderboard-style error minimization. Second, it presents an integrated pipeline that combines robust baseline estimation, leakage-resistant out-of-fold contextual modelling, conservative driver attribution derived from explainability outputs without hard completeness constraints, and uncertainty quantification using block bootstrap resampling, with optional temporal and time–frequency consistency mechanisms. Third, it evaluates the formulation on the ADRENALIN Challenge dataset using normalized mean absolute error alongside stability and residual structure diagnostics, clarifying the trade-off between metric optimization and operational interpretability in commercial and public buildings.
The remainder of this paper is organized as follows. Section 2 reviews the state of the art in non-intrusive load monitoring with emphasis on low-frequency data, contextual disaggregation, evaluation practices, and limitations related to interpretability and incomplete observability. Section 3 describes the ADRENALIN Challenge dataset, including building selection, data validation, and preprocessing procedures. Section 4 presents the proposed MD-ADD methodology, detailing the problem formulation, baseline estimation, multi-driver attribution framework, uncertainty quantification, and experimental evaluation setup. Section 5 reports and analyzes the empirical results in comparison with established low-frequency disaggregation methods and competition-optimized baselines. Section 6 discusses the findings, implications, limitations, and directions for future research. Finally, Section 7 concludes the paper and summarizes the main contributions.
4. Proposed Approach: MD-ADD
4.1. Problem Formulation and Scope
The objective of the proposed methodology is to disaggregate aggregate building electricity consumption into interpretable components associated with multiple contextual drivers, while explicitly accounting for uncertainty and incomplete observability. The method targets low-frequency data, such as hourly or sub-hourly smart meter measurements, and is designed for commercial and public buildings where appliance-level signatures are not available.
Let $y(t)$ denote the aggregate electricity consumption at time $t$, and let $x_1(t), \ldots, x_K(t)$ denote a set of contextual driver variables, such as outdoor temperature and other environmental measurements. Consistent with additive NILM formulations, the aggregate signal is decomposed as

$$y(t) = b(t) + \sum_{k=1}^{K} c_k(t) + u(t), \quad (1)$$

where $b(t)$ represents a baseline component largely independent of the modeled drivers, $c_k(t)$ denotes the estimated contribution associated with contextual driver $k$, and $u(t)$ represents unexplained energy.
Equation (1) follows the standard additive perspective widely adopted in non-intrusive load monitoring (NILM), in which aggregate electricity consumption is represented as the superposition of multiple contributing components and a residual term [15]. However, the decomposition above should be interpreted as an attribution-based representation rather than as a generative physical model of load formation. The contextual drivers $x_k(t)$ do not appear explicitly in Equation (1) because their influence is incorporated through the estimated attribution terms $c_k(t)$. In classical NILM formulations, additive components often correspond to appliance-level loads or signature-based functions. In the present work, the quantities $c_k(t)$ are inferred from empirical relationships between contextual variables $x_k(t)$ and variations in aggregate consumption, rather than being assumed to be deterministic functions of the drivers. For clarity, Table 1 summarizes the notation used in the decomposition and subsequent modeling stages.
In this work, the term “data-driven” refers to the fact that all driver-response relationships and attribution magnitudes are inferred directly from observed data rather than prescribed by fixed physical models, parametric energy signature forms, or rule-based assumptions. The baseline component, contextual regression relationships, and attribution terms are estimated from historical smart meter and contextual measurements using statistical learning under a time-series cross-validation scheme. No building-specific physical parameters, equipment specifications, or predefined temperature-load functions are imposed. Instead, the structure of the decomposition in Equation (1) is fixed, while the functional relationships that determine $c_k(t)$ are learned independently for each building from data. Uncertainty is quantified through block bootstrap resampling rather than imposed through parametric distributional assumptions. In this sense, the disaggregation behavior emerges from empirical patterns in the data rather than from predefined physical or rule-based models.
The methodology does not assume that the available contextual drivers are sufficient to explain all non-baseline energy. Instead, unexplained energy is treated as a meaningful component of the decomposition, reflecting unobserved influences, stochastic variation, and modeling uncertainty. This design choice avoids forcing attribution when explanatory evidence is weak and supports more interpretable and robust disaggregation outcomes.
While the experimental evaluation presented in this paper focuses primarily on weather-related drivers, particularly outdoor air temperature and calendar effects, this choice reflects the availability of validated ground truth and the dominant role of temperature-dependent demand in the ADRENALIN Challenge dataset rather than a limitation of the proposed formulation. The MD-ADD framework is designed to support multiple contextual drivers simultaneously, with each driver contributing an additive, explainability-derived attribution term and an associated uncertainty estimate. Additional drivers, such as occupancy proxies, indoor environmental variables, or control signals, can be incorporated directly into the contextual feature set without modification of the baseline separation, attribution logic, or uncertainty quantification pipeline.
Section 5 includes an explicit multi-driver experiment incorporating calendar-derived contextual features to demonstrate simultaneous driver attribution under controlled driver expansion.
MD-ADD is best understood as a disaggregation framework rather than a single estimator. Its core contribution lies in the formulation of conservative attribution under incomplete observability, supported by modular baseline strategies, interchangeable regression backends, and optional consistency constraints. This design allows the framework to be adapted to different building types, contextual feature sets, and analytical objectives without altering its attribution philosophy.
4.2. Data Preparation and Alignment
The method requires two aligned datasets: a dependent time series representing aggregate electricity consumption and a multivariate time series of contextual drivers. All series must share a common timestamp index.
Data are resampled to a uniform temporal resolution if required. For energy applications, the default configuration uses hourly aggregation, with energy values summed and driver values averaged over each interval. Non-numeric driver columns are excluded automatically.
Time steps with missing aggregate consumption values are removed. Driver values are retained as long as at least one driver is available for a given time step. This design avoids artificially imputing energy consumption while allowing flexible handling of incomplete contextual data.
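As an illustration of this alignment step, the following sketch shows one possible pandas implementation; the function name, the hourly default, and the handling of rows where all drivers are missing are illustrative assumptions rather than the exact implementation used in the experiments.

```python
import pandas as pd

def align_and_resample(energy: pd.Series, drivers: pd.DataFrame,
                       freq: str = "1h") -> tuple[pd.Series, pd.DataFrame]:
    """Align aggregate energy and contextual drivers on a common uniform index.

    Energy is summed per interval, drivers are averaged, and non-numeric
    driver columns are dropped (illustrative sketch).
    """
    # Resample to a uniform resolution: sum energy, average drivers.
    y = energy.resample(freq).sum()
    x = drivers.select_dtypes("number").resample(freq).mean()

    # Remove time steps with missing aggregate consumption.
    y = y.dropna()

    # Keep driver rows with at least one available driver, aligned to y;
    # rows where all drivers are missing are dropped here for simplicity.
    x = x.reindex(y.index).dropna(how="all")
    y = y.loc[x.index]
    return y, x
```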
4.3. Baseline Estimation
The baseline component represents electricity consumption that is largely insensitive to the modeled drivers. In commercial buildings, this component typically includes standby loads, continuously operating equipment, and other relatively stable consumption patterns.
It is important to emphasize that the baseline component in MD-ADD is an operational construct rather than a physically distinct end-use. The baseline is not assumed to be strictly independent of temperature or other contextual drivers in a physical sense. Instead, it represents a conservative lower-envelope separation of aggregate consumption intended to isolate excess variability for attribution analysis. In buildings with pronounced seasonal structure or year-round HVAC operation, the estimated baseline may therefore exhibit correlation with outdoor temperature. Within the MD-ADD formulation, such behavior is treated as diagnostically informative rather than erroneous, because it indicates the absence of a stable, temperature-neutral minimum load. Consequently, the baseline is not interpreted as a temperature-independent demand component, and MD-ADD explicitly avoids drawing physical conclusions from baseline magnitude alone.
The methodology supports several interchangeable baseline estimation strategies, all operating solely on the aggregate signal $y(t)$. A constant baseline may be defined as the median of $y(t)$. A rolling quantile baseline can be constructed as the lower envelope of $y(t)$ over a moving window, combining a rolling quantile and a rolling mean and selecting the minimum of the two to improve robustness against transient fluctuations. A schedule-based baseline may be estimated using categorical hour-of-day and day-of-week variables within a robust regression framework. In addition, a seasonal trend decomposition baseline may be derived using robust STL decomposition when sufficient periodicity is present.
By default, the rolling quantile baseline is applied, as it is non-parametric, domain-agnostic, and well suited to long-term building energy data. Edge effects are handled through forward- and backward-filling to ensure a complete baseline estimate across the full temporal range.
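A minimal sketch of the default rolling-envelope baseline is given below, assuming an hourly pandas series with a DatetimeIndex; the 14-day window and 0.2 quantile mirror the configuration reported in Section 4.15, but the function itself is an illustrative sketch rather than the reference implementation.

```python
import pandas as pd

def rolling_envelope_baseline(y: pd.Series, window: str = "14D",
                              quantile: float = 0.2) -> pd.Series:
    """Lower-envelope baseline b(t): minimum of a rolling quantile and a
    rolling mean of the aggregate signal, with forward/backward filling to
    cover edge effects (illustrative sketch)."""
    roll = y.rolling(window, min_periods=1)
    lower_quantile = roll.quantile(quantile)
    rolling_mean = roll.mean()
    baseline = pd.concat([lower_quantile, rolling_mean], axis=1).min(axis=1)
    return baseline.ffill().bfill()
```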
In addition to these generic formulations, MD-ADD also supports a contextual baseline strategy, implemented as a banded baseline, and used in selected experiments to address cases where envelope-based baselines are known to distort attribution results. In buildings with persistent temperature-driven operation, such as year-round heating or cooling, a stable lower envelope of aggregate demand may not exist. In these cases, the banded baseline is constructed by identifying periods of approximately neutral outdoor conditions, inferred from the data rather than fixed thresholds, and deriving a conservative, time-structured reference profile from those observations. This baseline remains an operational reference and does not aim to recover a physically meaningful baseline, but instead avoids systematically subtracting temperature-driven variability when no temperature-neutral minimum load is present.
4.4. Excess Energy Definition
After baseline estimation, the residual (excess energy) signal is defined as

$$E(t) = y(t) - b(t).$$
This excess energy represents the portion of consumption that may be influenced by the modeled drivers. Importantly, it is not assumed that this signal is fully explainable.
4.5. Driver-Based Modeling of Excess Energy
A nonlinear regression model is trained to approximate the relationship between excess energy and the contextual drivers,

$$\hat{E}(t) = f\big(x_1(t), \ldots, x_K(t)\big).$$
Tree-based ensemble models are used due to their ability to capture nonlinearities and interactions without requiring explicit feature engineering. Multiple backends are supported, including gradient-boosted decision trees implemented via XGBoost, LightGBM, or scikit-learn.
To avoid look-ahead bias and overly optimistic attribution, model training is performed using time-series cross-validation. For each fold, the model is trained on past data and evaluated on a held-out future segment. Out-of-fold predictions are retained for all time steps.
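The out-of-fold training scheme can be sketched as follows, using scikit-learn's TimeSeriesSplit and an XGBoost regressor; the hyperparameter values shown here are placeholders for illustration, not the fixed settings used in the reported experiments.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit
from xgboost import XGBRegressor

def out_of_fold_excess_model(excess: pd.Series, drivers: pd.DataFrame,
                             n_splits: int = 5):
    """Expanding-window time-series CV: train on past data, predict the
    held-out future segment, and retain out-of-fold predictions."""
    oof_pred = pd.Series(np.nan, index=excess.index)
    fold_models = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(drivers):
        model = XGBRegressor(n_estimators=300, max_depth=4,
                             learning_rate=0.05, subsample=0.8)
        model.fit(drivers.iloc[train_idx], excess.iloc[train_idx])
        oof_pred.iloc[test_idx] = model.predict(drivers.iloc[test_idx])
        fold_models.append((model, test_idx))
    return oof_pred, fold_models
```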
4.6. Attribution via Model Explainability
Local driver contributions are obtained using Shapley-value-based explainability methods. For tree-based models, TreeSHAP-compatible implementations are employed to compute additive contributions for each driver at each time step.
For a given time $t$, the model output can be expressed as

$$\hat{E}(t) = \phi_0 + \sum_{k=1}^{K} \phi_k(t),$$

where $\phi_0$ is the model intercept and $\phi_k(t)$ represents the contribution of driver $k$ at time $t$. In the MD-ADD decomposition, these per-driver contributions play the role of the attribution terms $c_k(t)$ in Equation (1).
These quantities are computed on out-of-fold predictions to ensure that attributions reflect generalizable relationships rather than in-sample fit.
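The per-driver contributions can be obtained with the SHAP TreeExplainer as sketched below, continuing the fold structure from the previous sketch; this is a hypothetical illustration of the out-of-fold attribution step rather than the exact code used.

```python
import numpy as np
import pandas as pd
import shap

def out_of_fold_shap(fold_models, drivers: pd.DataFrame) -> pd.DataFrame:
    """SHAP contributions phi_k(t), computed only on held-out segments so that
    attributions reflect generalizable relationships rather than in-sample fit."""
    phi = pd.DataFrame(np.nan, index=drivers.index, columns=drivers.columns)
    for model, test_idx in fold_models:
        explainer = shap.TreeExplainer(model)
        shap_values = explainer.shap_values(drivers.iloc[test_idx])
        phi.iloc[test_idx] = shap_values
    return phi
```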
4.7. Attribution Strategy and Treatment of Unexplained Energy
Driver contributions are derived directly from model explainability outputs and are not rescaled to enforce exact agreement with the observed excess energy. This attribution strategy avoids imposing hard mass-balance constraints that would require the modeled drivers to explain all variability in the excess signal. As a result, unexplained energy naturally emerges when the available drivers provide insufficient explanatory power or when excess energy contains stochastic or unobserved effects.
The unexplained component is therefore defined as

$$u(t) = E(t) - \sum_{k=1}^{K} \phi_k(t).$$

Negative unexplained values can optionally be clipped to zero when non-negativity is required for interpretability.
4.8. Attribution Logic and Mapping to HVAC Energy
The explainability formulation yields additive model decompositions that must be interpreted carefully in the context of HVAC energy attribution. This subsection clarifies the roles of SHAP values, the intercept term, and their mapping to the evaluated temperature-dependent HVAC signal.
As introduced in Section 4.6, the model prediction $\hat{E}(t)$ admits an additive SHAP decomposition:

$$\hat{E}(t) = \phi_0 + \sum_{k=1}^{K} \phi_k(t).$$
Within the MD-ADD framework, the driver contributions $\phi_k(t)$ are interpreted as contextual attributions to temperature-dependent HVAC energy. The intercept term $\phi_0$ is not associated with any physical driver. It represents the baseline expectation of the predictive model under the background data distribution and captures systematic structure not attributable to specific contextual variables. Accordingly, the intercept does not constitute attributable driver energy.

The estimated HVAC-related energy attributed to contextual drivers is defined as

$$\hat{E}_{\mathrm{HVAC}}(t) = \sum_{k=1}^{K} \phi_k(t).$$

The unexplained component is computed as

$$u(t) = E(t) - \hat{E}_{\mathrm{HVAC}}(t).$$

Importantly, this definition allows $u(t)$ to remain non-zero. When contextual drivers do not fully explain the observed excess energy, the residual term explicitly represents unexplained variability rather than being absorbed into contextual attributions. This separation preserves interpretability and aligns with the conservative attribution philosophy of MD-ADD.

In the experimental evaluation, $\hat{E}_{\mathrm{HVAC}}(t)$ is compared directly to the temperature-dependent HVAC ground-truth signal provided in the ADRENALIN dataset. The intercept term is not mapped to HVAC energy, ensuring that only driver-supported contributions are evaluated against the reference HVAC measurements.
4.9. Temporal and Time–Frequency Consistency Constraints
To suppress spurious attributions caused by temporal-scale mismatch, an optional time–frequency consistency mechanism is applied as a post-processing step on the driver attribution time series. A short-time Fourier transform (STFT) is computed for the excess-energy signal and for each driver signal using the same window length and overlap. For each driver, a time-varying spectral agreement score is computed by comparing the driver and excess-energy STFT magnitudes within a predefined frequency band that corresponds to plausible driver dynamics (for example, low-frequency components for meteorological drivers). This agreement score defines a gating factor in $[0, 1]$ that attenuates the driver attribution at time steps where spectral alignment is weak. The intent is to restrict driver attributions to temporal scales consistent with the driver's expected physical influence, while leaving the core modeling and SHAP attribution unchanged. This option is used for sensitivity analysis and is not required for the baseline MD-ADD formulation.
To formalize this operation, let $\phi_k(t)$ denote the raw attribution time series for driver $k$, and let $g_k(t) \in [0, 1]$ denote the time–frequency agreement score derived from STFT magnitude overlap between the excess-energy signal and driver $k$. The gated attribution is defined as

$$\tilde{\phi}_k(t) = g_k(t)\, \phi_k(t).$$

When $g_k(t) = 0$, attribution is fully suppressed at time $t$; when $g_k(t) = 1$, the original attribution is preserved. This post-processing step does not alter the regression or SHAP computation, but restricts attribution to temporally plausible scales.
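A sketch of the spectral gating step is shown below using scipy's STFT; the window length, frequency band, and cosine-similarity agreement score are illustrative assumptions consistent with hourly data, not prescribed parameters of the framework.

```python
import numpy as np
from scipy.signal import stft

def spectral_agreement_gate(excess: np.ndarray, driver: np.ndarray,
                            fs: float = 1.0, nperseg: int = 168,
                            band: tuple = (0.0, 1.0 / 12.0)) -> np.ndarray:
    """Time-varying agreement score g_k(t) in [0, 1] based on band-limited
    STFT magnitude overlap between the excess-energy signal and a driver."""
    freqs, frame_times, Z_excess = stft(excess, fs=fs, nperseg=nperseg)
    _, _, Z_driver = stft(driver, fs=fs, nperseg=nperseg)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    mag_e = np.abs(Z_excess[in_band, :])
    mag_d = np.abs(Z_driver[in_band, :])
    # Cosine similarity of band-limited magnitude spectra for each STFT frame.
    numerator = (mag_e * mag_d).sum(axis=0)
    denominator = np.linalg.norm(mag_e, axis=0) * np.linalg.norm(mag_d, axis=0) + 1e-12
    frame_score = np.clip(numerator / denominator, 0.0, 1.0)
    # Interpolate frame-level scores back to the original sampling instants.
    sample_times = np.arange(len(excess)) / fs
    return np.interp(sample_times, frame_times, frame_score)

# Gated attribution: phi_gated = gate * phi_raw, applied element-wise per driver.
```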
4.10. Attribution Thresholding and Sparsity
Small-magnitude attributions that arise from numerical noise or weak correlations are optionally suppressed using a global or percentile-based threshold. This step encourages sparse and interpretable decompositions and reduces the risk of attributing negligible energy to irrelevant drivers.
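For illustration, a percentile-based suppression rule could look like the following sketch; the 5th-percentile cutoff is an arbitrary example value rather than a recommended setting.

```python
import numpy as np
import pandas as pd

def threshold_attributions(phi: pd.DataFrame, percentile: float = 5.0) -> pd.DataFrame:
    """Zero out attributions whose magnitude falls below a global percentile,
    yielding a sparser and more interpretable decomposition."""
    cutoff = np.nanpercentile(np.abs(phi.to_numpy()), percentile)
    return phi.where(phi.abs() >= cutoff, 0.0)
```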
4.11. Uncertainty Quantification via Block Bootstrap
Uncertainty in driver attributions is quantified using a moving-block bootstrap procedure that preserves temporal autocorrelation. Contiguous blocks of fixed length are sampled with replacement to generate bootstrap replicates.
For each replicate, the full modeling and attribution pipeline is re-executed. Driver contributions are then aggregated over the desired temporal resolution, such as daily totals. Empirical confidence intervals are computed from the resulting distributions.
This procedure provides uncertainty estimates for both absolute attributions and relative energy shares.
Let $A_k^{(b)}$ denote the aggregated attribution for driver $k$ obtained from bootstrap replicate $b$, where aggregation is performed over a fixed temporal window such as daily totals. For $B$ bootstrap replicates, the empirical attribution distribution is

$$\big\{ A_k^{(1)}, A_k^{(2)}, \ldots, A_k^{(B)} \big\}.$$

Uncertainty intervals are computed as empirical quantiles of this distribution, and the median is used as the reported point estimate.
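The moving-block resampling step can be sketched as follows; the block length of 48 h matches the configuration reported in Section 4.15, while the helper function itself is an illustrative assumption.

```python
import numpy as np

def moving_block_indices(n: int, block_len: int = 48,
                         rng: np.random.Generator | None = None) -> np.ndarray:
    """Indices of one moving-block bootstrap replicate: contiguous blocks of
    fixed length sampled with replacement and concatenated to length n."""
    rng = rng or np.random.default_rng()
    n_blocks = int(np.ceil(n / block_len))
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)
    indices = np.concatenate([np.arange(s, s + block_len) for s in starts])
    return indices[:n]

# For each replicate b = 1, ..., B:
#   1. resample (y, drivers) with moving_block_indices,
#   2. re-run baseline estimation, regression, and SHAP attribution,
#   3. aggregate driver attributions to daily totals A_k^(b).
# Empirical quantiles over {A_k^(1), ..., A_k^(B)} give the reported intervals,
# with the median used as the point estimate.
```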
4.12. Diagnostics and Validation
Model adequacy is evaluated using a combination of quantitative accuracy measures and stability diagnostics that reflect both predictive performance and interpretability. Normalized mean absolute error is used to quantify the discrepancy between estimated and measured HVAC energy, providing a scale-independent measure suitable for comparison across buildings and aggregation levels. In addition to pointwise accuracy, residual structure is examined through autocorrelation analysis of the unexplained component, including inspection of the autocorrelation function and application of the Ljung–Box test. Importantly, persistent residual autocorrelation is not interpreted solely as a modeling defect. Instead, it is treated as a diagnostic signal indicating the presence of unobserved drivers, regime shifts, or operational dynamics that are not captured by the available contextual variables. The stability of driver attributions is evaluated by comparing attribution patterns across time series cross-validation folds and bootstrap resampling replicates. Together, these diagnostics provide a more comprehensive assessment of model behavior than accuracy metrics alone, supporting evaluation of robustness, transparency, and physical plausibility.
4.13. Attribution Stability and Residual-Structure Metrics
In addition to NMAE, two diagnostic families are used to evaluate whether driver attributions are sufficiently stable for operational interpretation. First, attribution stability is quantified using bootstrap replicates of aggregated driver energy (for example, daily totals). For each driver and building, stability is summarized by a normalized dispersion statistic computed as the interquartile range divided by the median absolute attribution across bootstrap replicates, and a sign-consistency rate computed as the fraction of replicates with the same attribution sign as the median. Let $A_k^{(b)}$ denote the aggregated attribution for driver $k$ in bootstrap replicate $b$. Attribution dispersion is defined as

$$D_k = \frac{\operatorname{IQR}\big(A_k^{(1)}, \ldots, A_k^{(B)}\big)}{\operatorname{median}_b \big|A_k^{(b)}\big|},$$

and sign consistency is defined as

$$S_k = \frac{1}{B} \sum_{b=1}^{B} \mathbb{1}\Big[\operatorname{sign}\big(A_k^{(b)}\big) = \operatorname{sign}\big(\operatorname{median}_b\, A_k^{(b)}\big)\Big],$$

where $\mathbb{1}[\cdot]$ denotes the indicator function. Low dispersion and high sign consistency indicate a transferable driver relationship that is robust to plausible temporal perturbations.
Second, residual structure is summarized using autocorrelation-derived statistics on the unexplained component. In addition to visual inspection of the autocorrelation function, a small set of fixed-lag autocorrelations is reported at lags corresponding to one interval, one day, and one week (depending on sampling rate). A Ljung–Box test is used as a complementary check for non-random residual structure over a defined lag window. Persistent residual structure is interpreted as evidence of missing drivers, regime changes, or operational logic not represented in the contextual features, rather than being treated solely as modeling error.
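The dispersion, sign-consistency, and residual-structure statistics defined above translate directly into short computations, sketched below with NumPy, pandas, and statsmodels; the lag choices correspond to hourly data and are illustrative.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.diagnostic import acorr_ljungbox

def attribution_dispersion(a: np.ndarray) -> float:
    """D_k: interquartile range of bootstrap attributions divided by the
    median absolute attribution."""
    q75, q25 = np.percentile(a, [75, 25])
    return (q75 - q25) / (np.median(np.abs(a)) + 1e-12)

def sign_consistency(a: np.ndarray) -> float:
    """S_k: fraction of replicates sharing the sign of the median attribution."""
    return float(np.mean(np.sign(a) == np.sign(np.median(a))))

def residual_structure(u: pd.Series, lags=(1, 24, 168)) -> dict:
    """Fixed-lag autocorrelations and a Ljung-Box p-value for the unexplained
    component (one interval, one day, one week for hourly data)."""
    stats = {f"acf_lag_{k}": u.autocorr(lag=k) for k in lags}
    lb = acorr_ljungbox(u.dropna(), lags=[max(lags)], return_df=True)
    stats["ljung_box_pvalue"] = float(lb["lb_pvalue"].iloc[0])
    return stats
```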
The attribution stability metrics introduced here are intended as diagnostic indicators rather than formal statistical hypothesis tests. No universal thresholds are assumed for dispersion or sign-consistency, and the framework does not claim statistical significance in the classical inferential sense. Instead, stability is interpreted comparatively across drivers, buildings, and temporal resolutions, with high dispersion or inconsistent attribution sign indicating weak or non-transferable explanatory relationships under the available contextual features. This design choice reflects the exploratory and decision-support-oriented nature of low-frequency disaggregation. The primary objective is to expose uncertainty and sensitivity rather than to assert definitive causal attribution.
4.14. Comparison Algorithms and Evaluation Setup
To assess the performance of the proposed disaggregation framework, results are compared against six reference algorithms drawn from prior work [17] and the winning algorithms from the ADRENALIN Challenge [21]. These algorithms are not re-derived in this paper; instead, they are implemented following their original descriptions and applied under identical data and evaluation conditions.
The first group of comparison methods consists of three general-purpose disaggregation algorithms. The Bayesian regression approach models energy consumption as a probabilistic function of contextual drivers, yielding posterior estimates of temperature-dependent load. The time–frequency masking method applies short-time Fourier transforms to isolate components of the aggregate signal that correlate with contextual variables. The Bi-LSTM model learns temporal dependencies directly from the data using a sequence-to-sequence formulation.
The algorithms from the ADRENALIN Challenge were specifically tuned for low-frequency commercial building data and incorporate domain-informed assumptions regarding baseline behavior and temperature dependency. While differing in implementation, these methods emphasize strong explanatory coverage of temperature-dependent energy under competition-oriented evaluation criteria, providing a useful contrast to the uncertainty-aware and non-forced attribution strategy adopted in this work.
For all comparison algorithms, the same input data, temporal resolution, and preprocessing steps are used as for the proposed method. Performance is evaluated using normalized mean absolute error (NMAE) against measured temperature-dependent energy. No post hoc adjustments are applied to favor any method.
This evaluation setup ensures that observed performance differences arise from methodological design choices rather than data handling or metric definition.
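For reference, a minimal NMAE computation is sketched below; normalization by the mean of the measured temperature-dependent energy is assumed here, since the exact normalization constant is defined by the benchmark protocol.

```python
import numpy as np

def nmae(estimate: np.ndarray, ground_truth: np.ndarray) -> float:
    """Normalized mean absolute error between estimated and measured HVAC
    energy (normalization by the mean of the ground truth assumed)."""
    return float(np.mean(np.abs(estimate - ground_truth)) / np.mean(ground_truth))
```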
4.15. Implementation Details for Experimental Evaluation
The implementation was carried out in Python 3.10.11 using the following packages: XGBoost 3.0.2, SHAP 0.48.0, statsmodels 0.14.4, NumPy 2.0.2, pandas 2.2.3, matplotlib 3.10.0, and scikit-learn 1.6.1.
All experiments were conducted at hourly temporal resolution. Aggregate electricity consumption was resampled by summation, while contextual drivers were averaged over each hourly interval. Weather drivers included outdoor air temperature (T), relative humidity (Rh), global horizontal solar irradiance (SolGlob), wind speed (Ws), and wind direction (Wd), depending on building-specific data availability. Calendar features were derived from timestamp information and included hour-of-day, day-of-week, month, week-of-year index, and a binary weekend indicator. Hour-of-day, day-of-week, month, and week-of-year were one-hot encoded, while the weekend indicator was retained as a binary variable.
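The calendar feature construction can be sketched as follows with pandas; column names and encoding details are illustrative and may differ from the exact implementation.

```python
import pandas as pd

def calendar_features(index: pd.DatetimeIndex) -> pd.DataFrame:
    """One-hot encoded hour-of-day, day-of-week, month, and week-of-year,
    plus a binary weekend indicator."""
    raw = pd.DataFrame({
        "hour": index.hour,
        "dayofweek": index.dayofweek,
        "month": index.month,
        "week": index.isocalendar().week.to_numpy(),
    }, index=index)
    features = pd.get_dummies(raw.astype("category"), dtype=float)
    features["is_weekend"] = (index.dayofweek >= 5).astype(float)
    return features
```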
Three experimental configurations were evaluated. In the first configuration, baseline estimation was performed using a rolling-envelope formulation with a fixed window length of 14 days and a lower quantile of 0.2. In the second configuration, a temperature-banded baseline was applied to restrict baseline estimation to quasi-neutral regimes. In the third configuration, baseline subtraction was omitted entirely, and attribution was applied directly to the aggregate signal using both weather and calendar-derived drivers.
The regression backend used in all reported experiments is XGBoost. Hyperparameters were fixed across buildings to ensure comparability and avoid building-specific tuning. Model training and attribution were performed using forward-chaining time-series cross-validation with five expanding-window folds. For each fold, models were trained exclusively on historical data and evaluated on a contiguous future segment. All reported attributions are based on out-of-fold predictions to prevent information leakage.
Uncertainty estimates were obtained using a moving-block bootstrap with a block length of 48 h and 200 bootstrap replicates. Optional time–frequency masking constraints were not activated in the final reported experiments unless explicitly stated.
All configuration parameters used in the reported experiments are summarized in
Table 2. In addition,
Table 3 documents implementation-sensitive defaults and operational settings, including SHAP computation mode, clipping behavior, alignment rules, bootstrap configuration, and preprocessing edge cases that may influence reproducibility.
4.16. Summary of the Workflow
The complete workflow consists of sequential and modular stages. The process begins with data alignment and preprocessing, followed by baseline estimation from aggregate consumption. Excess energy is then modeled using contextual drivers, after which attribution is derived through explainability methods. Optional temporal and time–frequency consistency constraints may be applied to refine attribution behavior. The framework explicitly computes unexplained energy to prevent enforced completeness and maintain conservative attribution. Uncertainty is quantified using a block bootstrap procedure, and the results are subjected to diagnostic evaluation and visualization. This modular structure ensures reproducibility, transparency, and adaptability across different buildings and datasets.
5. Results
This section evaluates MD-ADD under a structured comparison protocol and positions its behavior relative to established low-frequency disaggregation approaches. Two groups of reference results are used: three general low-frequency methods reported in the earlier comparison study [17] and the three top-performing ADRENALIN Challenge methods [21], all evaluated on the same dataset and metric.
In addition to point accuracy (NMAE), stability and residual-structure diagnostics are reported to assess attribution robustness under partial observability. These include bootstrap uncertainty intervals, dispersion measures, sign consistency, and residual autocorrelation.
Results are computed using the MD-ADD decomposition defined in Section 4. Hourly HVAC estimates correspond to the temperature-driver attribution term defined in Equation (1), optionally modified by the gating operation in Section 4.9. Daily values are computed by summation over non-overlapping 24 h windows. Normalized mean absolute error (NMAE) is computed between aggregated attributions and measured HVAC ground truth. Bootstrap uncertainty intervals, dispersion statistics, and sign-consistency rates are computed according to the definitions in Section 4.11, Section 4.12 and Section 4.13.
5.1. Experimental Design and Evaluation Protocol
The evaluation compares multiple disaggregation approaches under a consistent protocol applied to the same buildings, temporal resolution, and ground-truth definitions. Results are reported for the Bayesian approach without weekday separation, the Bayesian approach with weekday separation, and the time–frequency mask-based method from the earlier comparison study, as well as for the three ADRENALIN Challenge algorithms (Adjusted STL, GMM-based clustering, and base-load decomposition). These reference results are reproduced from their respective sources and serve as quantitative baselines under identical evaluation conditions. The proposed MD-ADD framework is evaluated alongside these methods to enable direct comparison.
Within MD-ADD, three configurations are reported to assess the influence of internal modeling choices, as summarized in Table 4. The initial configuration corresponds to the baseline implementation described in the original submission, while the refined configuration incorporates adjusted baseline handling and validation settings. This distinction allows the effect of baseline formulation and backend validation to be examined without altering the overall attribution philosophy.
Results are reported at both hourly and daily aggregation levels where applicable. Daily aggregation is obtained by summing over non-overlapping 24 h intervals. Reporting both resolutions allows separation of short-term variability from systematic bias and clarifies whether performance differences are driven primarily by high-frequency fluctuations or persistent structural effects.
Performance is primarily evaluated using normalized mean absolute error (NMAE). For the earlier comparison study, additional metrics (MAE, RMSE, and R2) are included as originally reported to preserve consistency with the reference publication. In addition to accuracy, stability and uncertainty diagnostics are also shown. These include bootstrap-based uncertainty intervals, dispersion and sign-consistency measures across replicates, and residual autocorrelation diagnostics. Together, these metrics provide a more comprehensive assessment of attribution robustness beyond point error alone.
5.2. Baseline and Comparative Methods Included
Two reference groups are included in the comparative evaluation. The first group consists of methods reported in the earlier comparison study, including two Bayesian variants and a time–frequency mask-based algorithm. These approaches provide a structured baseline for interpreting how alternative low-frequency modeling assumptions perform across buildings. The second group comprises the ADRENALIN Challenge reference methods, namely Adjusted STL, GMM-based clustering, and a base-load decomposition approach, which represent competition-optimized baselines evaluated on the ADRENALIN benchmark. In addition to these external references, MD-ADD is evaluated in both an initial and a refined configuration. The experimental protocol further incorporates sensitivity analyses and a baseline leakage diagnostic to assess robustness, attribution stability, and residual structure.
5.3. Reference Results from the First Comparison Study
Table 5, Table 6 and Table 7 summarize the results of the Bayesian and time–frequency mask-based algorithms from the earlier comparison study. These results are included here to establish a quantitative reference point for interpreting subsequent comparisons; readers are referred to the original publication for full methodological details and extended discussion [17].
Both Bayesian variants show extreme variability across buildings. While some buildings achieve NMAE below 0.30, others exceed 3.0, accompanied by strongly negative R2 values. This indicates severe over- or under-attribution when model assumptions are violated.
The mask-based approach achieves substantially lower NMAE than the Bayesian methods on most buildings, but still exhibits negative R2 on several cases, indicating systematic misallocation despite good aggregate error.
5.4. Reference Results from the ADRENALIN Challenge
Table 8 summarizes the NMAE results of the three winning competition algorithms across buildings and resolutions [21]. The ADRENALIN Challenge algorithms achieve uniformly low NMAE across buildings, reflecting effective optimization for the benchmark metric under the fixed evaluation protocol used in the challenge.
5.5. MD-ADD Results: Initial Configuration
Table 9 reports the performance of MD-ADD using an XGBoost backend with weather drivers. Both hourly and daily NMAE are shown. In this initial configuration, MD-ADD achieves its lowest error on L14.B03 and its highest on L06.B01, and daily aggregation reduces NMAE across all buildings, indicating that residual errors are dominated by short-term variability rather than long-term bias. MD-ADD yields higher NMAE than competition-optimized baselines, but the systematic error reduction under daily aggregation reflects the deliberate decision not to force short-term variability into driver attributions when contextual evidence is weak.
5.6. MD-ADD Results: Refined Configuration
Table 10 reports MD-ADD performance using a refined configuration, incorporating baseline handling and backend validation while preserving the core conservative attribution strategy. In the refined configuration, the effect of daily aggregation becomes building-dependent, reflecting the interaction between baseline formulation and attribution stability rather than a uniform reduction in short-term error. Relative to the initial configuration, the refined setup reduces hourly NMAE for several buildings, with the most pronounced improvement observed for L06.B01. In other cases, particularly for buildings with weaker or more irregular temperature dependence, the refined configuration trades numerical accuracy for improved baseline separation and attribution stability.
5.7. Quantitative Comparison with Established Low-Frequency Disaggregation Methods
To contextualize the performance of the proposed MD-ADD framework, its results are compared against six established low-frequency disaggregation methods drawn from two prior studies. The first group consists of three representative algorithmic families evaluated in an earlier comparative study, namely a Bayesian regression-based method, a time–frequency mask-based approach, and a bidirectional LSTM sequence model. The second group consists of the three top-performing algorithms from the ADRENALIN Load Disaggregation Challenge, including the adjusted STL-based method, a Gaussian mixture model (GMM)-based clustering approach, and a baseline-oriented decomposition strategy.
All comparison results are reported using normalized mean absolute error (NMAE) and are evaluated on the same buildings, temporal resolutions, and ground truth definitions as used for MD-ADD. No retraining, retuning, or post-processing adjustments were applied beyond those described in the original sources, ensuring that observed differences reflect methodological characteristics rather than experimental artifacts.
Table 11 summarizes hourly NMAE values for all methods across the evaluated buildings. The results show that the ADRENALIN Challenge algorithms achieve the lowest NMAE overall, reflecting their optimization for the competition metric under fixed evaluation constraints. The time–frequency mask-based method from the earlier comparison study also achieves relatively low NMAE on several buildings, although with notable variability.
In contrast, MD-ADD exhibits higher NMAE across most buildings, particularly at hourly resolution. However, unlike several reference methods, MD-ADD avoids extreme failure cases and maintains consistent behavior across heterogeneous building types. This reflects the deliberate design choice to avoid forced completeness and to preserve unexplained energy when contextual drivers provide insufficient explanatory power.
When results are aggregated to daily resolution, MD-ADD exhibits a consistent reduction in error, indicating that residual discrepancies are dominated by short-term variability rather than systematic bias. This behavior contrasts with competition-optimized methods, which aggressively fit short-term fluctuations to minimize pointwise error.
5.8. Multi-Driver Attribution Behavior
MD-ADD operates at the level of individual contextual drivers. Each feature in the contextual set produces a separate out-of-fold signed attribution time series. Aggregation into broader driver families, such as a composite weather-dependent component, is performed only for alignment with available benchmark targets.
To clarify this driver-level structure, Figure 1 presents the daily aggregated attributions for the individual meteorological variables prior to aggregation. The contributions correspond to SHAP-based signed driver attributions estimated by the trained model, alongside the unexplained residual component.
The figure illustrates that meteorological variables contribute independently and exhibit distinct temporal patterns. This confirms that MD-ADD estimates simultaneous contextual effects rather than relying on a single composite weather signal. The residual component remains non-zero, consistent with the conservative attribution philosophy of preserving unexplained energy.
To examine redistribution behavior when an additional driver family is introduced, the contextual feature set was extended by incorporating calendar-derived variables, specifically hour-of-day and weekday indicators, alongside the weather drivers. This configuration allows assessment of how explanatory mass is redistributed across distinct contextual driver families.
Figure 2 presents the mean diurnal decomposition for building L14_B03_1H under the multi-driver configuration. The figure shows that the calendar-attributed component closely follows the pronounced working-hour profile of the aggregate signal, capturing systematic daytime structure that cannot be explained by temperature variation alone. In contrast, the weather-attributed component exhibits smoother variation consistent with longer-term meteorological influence. The unexplained component remains non-zero across the full 24 h cycle, indicating that additional drivers redistribute explained energy rather than forcing complete attribution.
Figure 3 displays the daily stacked decomposition for January 2020 for the same building. The calendar component exhibits clear weekday structure, while the weather component varies more gradually across days. The persistence of a residual component throughout the month further confirms that the expanded driver set reduces but does not eliminate unexplained variability, consistent with the conservative attribution philosophy of MD-ADD.
Together, these results demonstrate that when a second contextual driver family is introduced, MD-ADD redistributes explanatory mass in a stable and interpretable manner rather than absorbing schedule-driven structure into weather attribution.
Table 12 summarizes the mean daily attribution shares for weather, calendar, and unexplained components for the evaluated L14 buildings (hourly data, daily aggregation). The results show that calendar contribution is non-trivial and building-dependent (approximately 0.31 to 0.53 mean share), supporting the claim that MD-ADD can attribute multiple drivers simultaneously rather than implicitly folding schedule effects into temperature attribution.
The stability diagnostics in Table 12 indicate that the driver shares are reasonably concentrated (share CI widths around 0.024 to 0.041 for weather and 0.028 to 0.040 for calendar), while the unexplained share remains consistently non-zero. This supports the intended interpretation that introducing an additional driver redistributes explained energy rather than eliminating unexplained energy.
5.9. Stability, Uncertainty, and Residual Diagnostics
Stability and residual-structure diagnostics defined in Section 4.13 are reported systematically across all experimental configurations to provide a consistent mapping between methodological definitions and empirical behavior. While Table 12 presents the full multi-driver diagnostics for configuration C3, Table 13 provides a configuration-consistent summary of key stability and residual-structure indicators across C1–C3. The table reports mean daily weather attribution share, mean unexplained share, bootstrap confidence interval width as a stability indicator, and the lag-1 autocorrelation of daily residual energy as a residual-structure indicator.

Across configurations, clear structural differences emerge. Relative to C1, the refined baseline configuration (C2) generally increases the unexplained share while maintaining comparable bootstrap confidence interval widths, indicating a more conservative separation of baseline and driver-attributed energy without loss of stability. When calendar drivers are introduced in C3, the weather share decreases substantially across buildings while the unexplained component remains non-zero, demonstrating redistribution of explanatory mass rather than forced completeness. These trends demonstrate that the stability and residual-structure metrics defined in Section 4.13 characterize attribution behavior consistently across configurations.
The high lag-1 autocorrelation values observed for daily residual energy (0.92–0.98) reflect persistent regime-level structure rather than short-term stochastic noise. Because residuals are evaluated at daily aggregation, they retain low-frequency patterns associated with operational schedules, occupancy regimes, and other slowly varying drivers not explicitly modeled. Consequently, the residual component should not be interpreted as random error but as temporally structured unexplained variability. This behavior is consistent with the design objective of MD-ADD, which preserves structured unexplained energy instead of forcing its absorption into contextual driver attributions.
As shown in Table 12 for the multi-driver configuration, weather and calendar share CI widths remain moderate (approximately 0.024–0.041 for weather and 0.028–0.040 for calendar), indicating stable attribution under bootstrap perturbations. The unexplained share remains consistently non-zero across buildings, and residual autocorrelation values between 0.92 and 0.98 indicate persistent structured variability not absorbed by contextual drivers. These results reinforce the conservative attribution behavior of MD-ADD under the expanded multi-driver configuration.
5.10. Sensitivity to Baseline Formulation and Backend Tuning
Additional experiments were conducted to assess the sensitivity of MD-ADD to backend hyperparameter tuning and baseline formulation.
Across all buildings, hyperparameter tuning of the XGBoost backend produced negligible changes in performance. Differences in hourly NMAE between tuned and untuned configurations remained below 0.3 percentage points for all cases. This indicates that backend optimization is not the dominant factor controlling performance.
In contrast, baseline formulation exhibited systematic and building-dependent effects. Configurations using a rolling-envelope baseline, no explicit baseline, and a temperature-banded baseline produced distinct error patterns. For buildings with stable schedules and pronounced seasonal structure, differences between baseline formulations were small. For buildings with mixed regimes or year-round operation, baseline choice measurably affected both hourly and daily NMAE.
These results confirm that baseline handling dominates performance differences within MD-ADD, while backend tuning plays a secondary role.
5.11. Baseline Leakage Diagnostic
The sensitivity analysis highlights that baseline behavior is a critical design choice in low-frequency contextual disaggregation, rather than a fixed modeling component. In particular, rolling-envelope baselines derived from the aggregate signal can exhibit correlation with outdoor temperature in buildings with strong seasonal demand. This behavior is quantified in Table 14, where high correlations between the estimated baseline and temperature are observed for several L14 buildings.
Importantly, this correlation does not indicate a modeling error within the MD-ADD framework. The rolling baseline is not intended to represent a physically temperature-independent load; rather, it serves as a conservative lower-envelope separation that adapts to long-term changes in aggregate consumption. In buildings with pronounced seasonal structure, such adaptation is expected and reflects genuine changes in minimum operational demand.
Crucially, MD-ADD does not rely on a single baseline formulation. When temperature coupling of the rolling baseline is undesirable, the framework provides alternative configurations that explicitly resolve this behavior. In the refined configuration, this is achieved either by enforcing a quasi-constant baseline through temperature-banded estimation or by omitting baseline subtraction altogether and allowing the unexplained component to absorb low-frequency structure.
The temperature-banded baseline enforces constancy by construction, restricting baseline estimation to periods where temperature influence is minimal. Where such neutral regimes are stable, this approach yields a baseline that is effectively independent of seasonal variation. Where neutral regimes are ill-defined or absent, degraded performance indicates that the data do not support a meaningful constant baseline, rather than a failure of the disaggregation formulation.
From this perspective, baseline-temperature correlation is best interpreted as a diagnostic signal rather than a defect. It reveals whether the building exhibits a stable, temperature-independent minimum load and guides the choice of baseline strategy. The MD-ADD framework accommodates this variability explicitly, ensuring that baseline behavior does not force spurious attribution or conceal unexplained structure.
5.12. Diagnostic Analysis: Qualitative Behavior, Residual Structure, and Attribution Stability
Quantitative error metrics provide only a partial view of model behavior in low-frequency contextual disaggregation. To complement NMAE-based evaluation, qualitative diagnostics and uncertainty-based analyses are used to examine temporal alignment, residual structure, and the stability of driver attributions under temporal perturbations.
Figure 4 and Figure 5 illustrate daily ground truth versus MD-ADD temperature-attributed estimates for two representative buildings with contrasting behavior. Figure 4 shows a building with well-defined weather sensitivity and relatively regular operation, where the model captures the dominant temperature-driven pattern while conservatively handling short-term deviations. The close alignment between measured and estimated temperature-dependent energy indicates that, in such cases, MD-ADD produces consistent and interpretable attributions.
In contrast, Figure 5 shows daily ground truth versus MD-ADD estimates for building L06.B01. Substantial deviations remain, particularly during periods of mixed or irregular operation. These discrepancies highlight the limitations of weather-driven contextual models in buildings where multiple drivers, control strategies, or occupancy patterns interact in complex ways. Rather than forcing attribution in such cases, MD-ADD preserves a significant unexplained component.
Residual structure is further examined in Figure 6, which presents the autocorrelation of unexplained energy for L06.B01. The residual exhibits strong temporal dependence rather than white noise, indicating the presence of unmodeled drivers or regime shifts. This behavior supports the interpretation that residual structure in MD-ADD is not merely noise but an informative signal of missing contextual information.
In addition to qualitative diagnostics, attribution stability is assessed using block bootstrap resampling. For each building, bootstrap replicates produce empirical distributions of driver contributions at the chosen aggregation level, enabling confidence intervals for both absolute energy attribution and relative energy shares. In buildings where temperature-driven behavior is stable across seasons, the resulting intervals are narrow, reflecting consistent driver relevance across resampled segments. In buildings with mixed regimes, shifting schedules, or unobserved control changes, intervals widen, indicating that driver contributions are not stable under plausible temporal perturbations of the data.
Beyond interval width, bootstrap distributions provide a direct stability diagnostic. When a driver contribution frequently changes sign or collapses toward zero across bootstrap replicates, it indicates weak or non-transferable explanatory power under the available features. Conversely, persistent contributions with limited dispersion indicate a robust relationship compatible with operational interpretation. These uncertainty and stability outputs therefore complement pointwise error metrics by revealing when driver attributions can be treated as reliable evidence and when they should instead be interpreted as tentative signals requiring additional contextual data.
Together, these diagnostics demonstrate that MD-ADD not only estimates temperature-dependent energy but also exposes when such estimates are stable and when they are highly sensitive to plausible temporal variations in building operation.
6. Discussion
This section provides a deeper interpretation of the results by explicitly contrasting MD-ADD with two groups of reference methods, namely the earlier low-frequency NILM algorithms evaluated in the comparison study and the competition-optimized algorithms from the ADRENALIN Challenge. The emphasis is placed on methodological implications, attribution stability, and robustness under partial observability.
MD-ADD is evaluated against methods that prioritize different objectives under the same data constraints. Some baselines are designed to minimize pointwise error by maximizing explained variance and enforcing near-complete allocation of excess energy, while MD-ADD prioritizes conservative attribution and explicit separation of unexplained variability. The discussion therefore interprets performance differences as outcomes of these objective choices, and it contrasts two MD-ADD configurations to isolate the effect of baseline handling without changing the attribution philosophy.
6.1. MD-ADD and Classical Low-Frequency NILM Algorithms
The Bayesian, mask-based, and BI-LSTM algorithms from the earlier comparison study represent three distinct modeling philosophies that are commonly applied to low-frequency NILM.
The Bayesian methods rely on strong structural assumptions, in particular that periods of low consumption correspond to HVAC inactivity and that temperature dependence dominates excess energy. The reported results confirm that these assumptions hold only for a subset of buildings. While Buildings 6 and 8 achieve very low NMAE values, several other buildings exhibit extreme errors, with NMAE values exceeding 3.0 and highly negative R2. These failures are not marginal. They indicate that the model is systematically misallocating energy when its core assumptions are violated, for example in buildings with year-round HVAC operation or strong non-thermal drivers.
MD-ADD differs fundamentally in this respect. Although its best-case NMAE values are higher than those of the Bayesian model, it avoids catastrophic failure across buildings. The worst MD-ADD case, L06.B01, reaches an NMAE of approximately 2.43, which is still large but notably lower than the Bayesian worst cases. In the refined configuration, this value is substantially reduced. This reflects the effect of explicitly allowing unexplained energy rather than forcing all residual consumption into a temperature-driven component. In practice, this means that MD-ADD sacrifices best-case accuracy to gain robustness across heterogeneous building behaviors.
The time–frequency mask-based algorithm demonstrates substantially better numerical performance. Its NMAE values are consistently below 0.8 and often below 0.4. From a metric-focused perspective, this makes it clearly superior to MD-ADD. However, the accompanying R2 values reveal an important limitation. Several buildings exhibit strongly negative R2 despite low NMAE, indicating that the model captures the magnitude of HVAC energy but misrepresents its temporal structure. This is consistent with a formulation that prioritizes reconstruction accuracy over physical plausibility.
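The coexistence of low NMAE and strongly negative R2 follows directly from the metric definitions. The sketch below assumes the common normalization of the mean absolute error by the mean of the ground truth; a synthetic estimate with the correct daily magnitude but a six-hour timing error keeps NMAE near 0.45 while driving R2 down to −1.

```python
# Illustrative example of low NMAE coexisting with negative R2 (synthetic data).
import numpy as np

def nmae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred)) / np.mean(y_true)

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

t = np.arange(240)                                   # ten days of hourly data
y_true = 10 + 5 * np.sin(2 * np.pi * t / 24)         # regular daily HVAC cycle
y_pred = 10 + 5 * np.sin(2 * np.pi * (t + 6) / 24)   # right magnitude, wrong timing
print(round(nmae(y_true, y_pred), 2), round(r2(y_true, y_pred), 2))  # ~0.45, -1.0
```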
MD-ADD produces higher NMAE values but avoids this inconsistency. Residual diagnostics show that unexplained energy retains structure rather than being absorbed into the HVAC estimate. This difference highlights a key conceptual distinction. The mask-based approach treats unexplained structure as error to be minimized, whereas MD-ADD treats it as information about missing drivers or non-stationary behavior.
The BI-LSTM model occupies an intermediate position. Its NMAE values are more stable than those of the Bayesian approach but higher than those of the mask-based method. MD-ADD performs comparably to BI-LSTM on several buildings, particularly those with strong seasonal patterns. The critical difference is interpretability. BI-LSTM implicitly learns to explain residual structure through latent representations, whereas MD-ADD explicitly exposes uncertainty and residual behavior. From an analytical standpoint, this makes MD-ADD more suitable for diagnostic use, even when its numerical accuracy is similar or worse.
6.2. MD-ADD and ADRENALIN Challenge Algorithms
The comparison with the ADRENALIN Challenge algorithms reveals the largest numerical gap. The adjusted STL, GMM-based clustering, and base-load decomposition methods achieve average NMAE values between 0.23 and 0.27, far below those of MD-ADD.
This difference should not be interpreted as a simple measure of methodological superiority. The competition algorithms were explicitly optimized for NMAE under fixed evaluation constraints. Their formulations include strong heuristic elements such as envelope fitting, clipping, reference-week selection, and implicit or explicit forcing of completeness. These design choices are effective for leaderboard performance but rely on assumptions that may not generalize beyond the benchmark setting.
MD-ADD intentionally avoids these mechanisms. As a result, it does not exploit opportunities to reduce error through aggressive fitting. This is particularly visible in buildings such as L14.B05, where competition algorithms achieve near-zero NMAE while MD-ADD remains around 0.59 despite the apparent regularity of the building. This gap persists in the refined configuration, indicating that the remaining error is not primarily due to a misestimated baseline magnitude but to the deliberate decision not to force short-term deviations into temperature-driven attributions when contextual evidence is weak.
In buildings with known complexity, such as L06.B01, the contrast becomes even more informative. Competition algorithms achieve moderate NMAE values, while MD-ADD performs poorly in absolute terms. However, residual diagnostics show that MD-ADD leaves substantial structured unexplained energy. The competition results themselves identify this building as problematic across methods. The difference is that MD-ADD makes this difficulty explicit rather than concealing it within the HVAC estimate.
To summarize the conceptual distinctions discussed above and to provide a structured comparison of modeling assumptions and attribution behavior, Table 15 presents a qualitative comparison between MD-ADD and representative low-frequency disaggregation methods considered in this study.
6.3. The Role of Temporal Aggregation
One consistent observation across MD-ADD results is the reduction in NMAE when moving from hourly to daily aggregation. This pattern suggests that a large fraction of the error arises from short-term variability rather than systematic bias. In other words, MD-ADD captures the correct long-term magnitude of HVAC energy but does not attempt to explain all intra-day fluctuations.
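This aggregation check is straightforward to reproduce. The sketch below assumes hourly pandas series indexed by a DatetimeIndex and the same mean-normalized NMAE variant; a pronounced drop from hourly to daily NMAE indicates short-term variability rather than magnitude bias.

```python
# Sketch of the hourly-versus-daily NMAE comparison (series names illustrative).
import numpy as np
import pandas as pd

def nmae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred)) / np.mean(y_true)

def aggregation_check(truth_hourly: pd.Series, estimate_hourly: pd.Series) -> dict:
    """Recompute NMAE after resampling both series to daily energy totals."""
    truth_daily = truth_hourly.resample("D").sum()
    estimate_daily = estimate_hourly.resample("D").sum()
    return {
        "nmae_hourly": nmae(truth_hourly.to_numpy(), estimate_hourly.to_numpy()),
        "nmae_daily": nmae(truth_daily.to_numpy(), estimate_daily.to_numpy()),
    }
```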
This behavior contrasts with competition algorithms, which often fit short-term variations aggressively. While this reduces hourly NMAE, it increases the risk of attributing non-HVAC behavior, such as occupancy-driven or control-related effects, to temperature-dependent loads. MD-ADD effectively filters out this behavior by design.
The reduction in error under daily aggregation is observed consistently across both the initial and refined MD-ADD configurations, confirming that baseline refinements primarily affect magnitude separation rather than the treatment of short-term variability.
6.4. Residual Structure as an Analytical Signal
A defining outcome of MD-ADD is the presence of structured residuals. Residual autocorrelation plots show clear temporal dependence rather than white noise, particularly in buildings with complex operation. This should not be interpreted as model inadequacy alone. Instead, it indicates that the available drivers are insufficient to explain observed behavior.
In practical terms, this information is valuable. Structured residuals can signal missing contextual variables, changes in control strategy, or abnormal operation. Competition-optimized models suppress this signal by construction, whereas MD-ADD preserves it, enabling subsequent analysis.
The multi-driver experiment presented in Section 5 further reinforces this interpretation. When calendar-derived contextual features are introduced alongside weather drivers, attribution mass is redistributed rather than absorbed into a single dominant component. Buildings exhibiting strong schedule regularity show increased calendar attribution, while weather-dominated buildings retain temperature-driven shares. Importantly, the unexplained component remains consistently non-zero. This confirms that adding contextual drivers refines explanatory structure without enforcing completeness, and that residual energy reflects genuinely unmodeled variability rather than mere numerical error.
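One way to realize such a multi-driver split without enforcing completeness is a constrained regression over weather and calendar features whose remainder is reported explicitly. The non-negative least-squares sketch below is an illustrative stand-in for, not a description of, the estimator used inside MD-ADD.

```python
# Hedged sketch of multi-driver attribution with an explicit unexplained term.
import numpy as np
from scipy.optimize import nnls

def attribute_drivers(excess, weather_feats, calendar_feats):
    """Split above-baseline energy into weather, calendar, and unexplained parts.

    excess         : 1-D array of baseline-removed energy
    weather_feats  : 2-D array (n_samples, n_weather_features)
    calendar_feats : 2-D array (n_samples, n_calendar_features)
    """
    X = np.hstack([weather_feats, calendar_feats])
    coefs, _ = nnls(X, excess)                    # non-negative driver weights
    contrib = X * coefs                           # per-sample, per-feature energy
    weather_energy = contrib[:, :weather_feats.shape[1]].sum()
    calendar_energy = contrib[:, weather_feats.shape[1]:].sum()
    # The remainder is reported as unexplained energy instead of being
    # redistributed among the drivers.
    unexplained = excess.sum() - weather_energy - calendar_energy
    return weather_energy, calendar_energy, unexplained
```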
6.5. Implications for NILM Evaluation
The comparisons reinforce a central insight. Low NMAE is not equivalent to reliable disaggregation in low-frequency, context-driven NILM. Methods that enforce completeness can achieve excellent numerical performance while providing limited insight into underlying building behavior.
MD-ADD demonstrates an alternative evaluation philosophy. It prioritizes robustness, interpretability, and explicit uncertainty representation over metric optimization. This leads to higher error but produces outputs that are better aligned with diagnostic and decision-support applications.
MD-ADD is intended for settings where disaggregation outputs are used as evidence in diagnostic reasoning. In such workflows, conservative attributions and explicit residual structure can be more informative than maximizing explained variance, because they separate driver-supported variability from variability that requires additional contextual data or operational investigation.
6.6. Baseline Formulation and Leakage Effects
The sensitivity analysis highlights baseline formulation as a dominant factor in low-frequency contextual disaggregation. In the rolling-envelope baseline used by default in MD-ADD, the estimated baseline can increase during periods of elevated seasonal demand. As a result, a portion of temperature-driven energy may be absorbed into the baseline rather than attributed to contextual drivers.
This effect is quantified by the observed correlation between baseline estimates and outdoor temperature, which exceeds 0.8 for several buildings with strong seasonal structure. Such coupling can reduce apparent temperature dependence under rolling baseline formulations, but it can be mitigated within MD-ADD by selecting an alternative baseline strategy or by omitting baseline subtraction altogether.
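This diagnostic amounts to correlating the estimated baseline with outdoor temperature. In the sketch below, a rolling lower quantile of the load stands in for the rolling-envelope baseline; the seven-day window and fifth-percentile level are illustrative assumptions rather than MD-ADD's actual settings.

```python
# Sketch of the baseline-temperature leakage diagnostic (pandas, DatetimeIndex assumed).
import pandas as pd

def baseline_temperature_correlation(load: pd.Series, temperature: pd.Series,
                                     window: str = "7D", q: float = 0.05) -> float:
    """Pearson correlation between a rolling lower-quantile baseline and temperature."""
    baseline = load.rolling(window, min_periods=24).quantile(q)
    # Strong coupling (magnitudes approaching the 0.8 reported for seasonal
    # buildings) signals that temperature-driven energy is leaking into the baseline.
    return baseline.corr(temperature)
```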
The temperature-banded baseline was introduced to mitigate this effect by restricting baseline estimation to HVAC-neutral regimes. Its mixed performance across buildings is informative. Where neutral regimes are stable, the approach reduces leakage and improves separation. Where buildings exhibit continuous or overlapping operation, performance degrades, indicating that neutral regimes are either unstable or poorly defined.
Rather than constituting a failure, this degradation exposes regime complexity that would otherwise be concealed by forced attribution. Stronger baseline assumptions therefore act as a stress test on the data, revealing limitations in observability and driver availability.
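For illustration, a temperature-banded baseline of the kind discussed here can be sketched as a low quantile computed only over hours falling inside an assumed HVAC-neutral temperature band; the band limits and quantile below are hypothetical, and the explicit failure path mirrors the degradation case just described.

```python
# Sketch of a temperature-banded baseline with an explicit "poorly defined regime" path.
import math
import pandas as pd

def temperature_banded_baseline(load: pd.Series, temperature: pd.Series,
                                band=(12.0, 18.0), q: float = 0.10) -> float:
    """Baseline level estimated from hours within a neutral outdoor-temperature band."""
    neutral = (temperature >= band[0]) & (temperature <= band[1])
    if int(neutral.sum()) < 24:
        # Too few neutral hours: the neutral regime is unstable or poorly defined,
        # which is exactly the degradation behavior discussed in the text.
        return math.nan
    return float(load[neutral].quantile(q))
```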
6.7. Limitations
Several limitations of the current work should be acknowledged. First, MD-ADD relies on the availability and quality of contextual drivers; when relevant drivers are missing or poorly measured, residual energy can remain large, leading to higher error metrics. Although the framework supports multiple contextual drivers by design and additional drivers can be incorporated without changes to the core attribution and uncertainty mechanisms, the reported experiments focus primarily on weather-driven disaggregation as a representative and widely studied use case. Second, the current implementation does not explicitly model occupancy, control logic, or operational schedules beyond baseline estimation. Third, the evaluation is limited to the ADRENALIN Challenge dataset, which, while diverse and carefully curated, does not cover the full range of commercial building behaviors.
MD-ADD is expected to perform poorly in buildings where no contextual variables exhibit stable or interpretable relationships with aggregate consumption, such as environments dominated by stochastic occupancy, rapidly changing control strategies, or highly heterogeneous end-use mixes. In such cases, the framework will typically produce large unexplained components and high attribution uncertainty. Within the intended analytical scope, this behavior is considered an informative outcome rather than a failure, as it indicates that additional sensing, metadata, or operational insight is required for credible interpretation.
These limitations reflect deliberate scope choices rather than implementation deficiencies. Nevertheless, they constrain the applicability of the current framework and motivate future extensions.
6.8. Future Research Needs
Future research should extend MD-ADD toward multi-context disaggregation in settings where weather alone is insufficient. While the framework is inherently multi-driver, the reported experimental evaluation focuses primarily on weather-driven disaggregation as a representative and widely studied use case enabled by the availability of validated HVAC sub-metering. This choice does not restrict the generality of the formulation, which supports the integration of additional contextual drivers without changes to the attribution or uncertainty mechanisms. In particular, incorporating additional drivers such as occupancy proxies, indoor environmental measurements, or control signals is expected to reduce structured residuals and improve attribution quality without reintroducing forced completeness. This is particularly relevant for buildings where temperature alone explains only a fraction of variability.
Future versions of MD-ADD could include automated mechanisms to assess driver relevance and suppress drivers that do not exhibit consistent explanatory power. Such mechanisms could prevent spurious attributions and further improve robustness across heterogeneous buildings.
Integrating causal assumptions or soft physical constraints may help distinguish coincidental correlations from meaningful dependencies, especially in the presence of correlated drivers. This could improve interpretability while preserving the framework’s conservative attribution philosophy.
Extending the framework to operate in an online or rolling setting would enable detection of regime changes, control strategy shifts, or sensor faults. Structured residuals could then be used as triggers for further investigation rather than as static outputs.
Future work should continue to explore evaluation criteria beyond pointwise error metrics. Stability, uncertainty, residual structure, and plausibility should be treated as first-class evaluation dimensions, particularly for decision-support applications in smart buildings.
7. Conclusions
This work introduced MD-ADD, a multi-driver disaggregation framework designed for low-frequency smart meter data in commercial and public buildings. Unlike most existing NILM approaches, the framework explicitly avoids forced completeness and treats unexplained energy as a valid and informative outcome rather than a modeling failure.
The experimental results demonstrate that this design choice has measurable consequences. In terms of normalized mean absolute error (NMAE), MD-ADD does not outperform state-of-the-art algorithms optimized for leaderboard performance. Competition-winning methods achieve average hourly NMAE values in the range of approximately 0.23–0.27, whereas MD-ADD yields higher values under both its initial and refined configurations.
However, relative performance analysis highlights important improvements in robustness. In the initial configuration, MD-ADD exhibits a worst-case hourly NMAE of 2.43 (L06.B01). In the refined configuration, this worst-case value is reduced to 0.65, corresponding to an approximate 73% relative reduction in maximum error. This improvement is achieved without altering the attribution philosophy, but through revised baseline handling and validation consistency.
In contrast, classical Bayesian formulations exhibit worst-case NMAE values exceeding 3.9 under identical evaluation conditions, indicating catastrophic failure when structural assumptions are violated. MD-ADD maintains bounded error across all evaluated buildings and does not exhibit such extreme instability.
Temporal aggregation further clarifies the nature of residual discrepancies. Across buildings, daily aggregation reduces NMAE by approximately 10–40% relative to hourly values in the initial configuration. This confirms that remaining discrepancies are dominated by short-term variability rather than systematic bias in long-term HVAC magnitude.
These findings support a formulation that preserves unexplained energy as an explicit analytical signal under realistic smart meter conditions.
In practical applications, this formulation is particularly suitable for diagnostic and decision-support workflows in which interpretability, failure transparency, and uncertainty awareness are more valuable than maximal explained variance. Suitable use cases include screening-level HVAC benchmarking, identification of regimes requiring additional sensing, and investigation of abnormal operational behavior. In such contexts, the unexplained component serves as an operational signal highlighting regimes that cannot be credibly interpreted using weather and calendar drivers alone.