1. Introduction
The global energy landscape is undergoing a profound transformation driven by the dual pressures of escalating demand and the urgent need for environmental sustainability. Over the past two decades, industrialization, urbanization, and the proliferation of electrical appliances have led to an unprecedented surge in energy consumption worldwide [1,2]. Buildings constitute a major share of total energy consumption in urban environments, with estimates varying by region and accounting methodology: the International Energy Agency reports that buildings account for approximately 30% of global final energy use, while in certain highly urbanized regions this share can exceed 40% when considering both direct and indirect consumption [3,4]. This substantial contribution underscores the critical importance of developing effective energy management strategies that can mitigate environmental impact while meeting the growing needs of modern society. In this context, energy efficiency has emerged as a cornerstone of sustainable development to reduce greenhouse gas emissions without compromising economic growth or quality of life [2]. The adoption of smart meters and smart grids has created new opportunities for monitoring and optimizing energy consumption, enabling real-time feedback to consumers and utility providers alike [5,6].
Achieving effective energy management requires understanding and monitoring energy consumption at a granular level, which can be accomplished through various methods, most notably energy disaggregation. Energy disaggregation is the process of decomposing aggregate energy consumption into the contributions of individual appliances or systems [2,7]. By providing detailed insights into how energy is distributed across different loads, disaggregation enables consumers, facility managers, and utility providers to identify potential failures and inefficiencies, optimize usage patterns, and implement targeted conservation measures. Research has suggested that access to appliance-level consumption data, when combined with appropriate behavioral interventions and sustained engagement, can lead to energy savings in the range of 5–15% in household settings, though the magnitude of these savings depends on study duration, intervention type, and occupant engagement [8]. Furthermore, disaggregated energy data supports demand-side management strategies, enabling utilities to implement dynamic pricing schemes and load-balancing mechanisms that benefit both consumers and grid operators [9].
Energy disaggregation can be accomplished through two fundamentally different approaches: Intrusive Load Monitoring (ILM) and Non-Intrusive Load Monitoring (NILM). ILM involves the deployment of dedicated sensing hardware at each appliance or circuit, providing highly accurate measurements of individual energy consumption [2,10]. Although this approach offers high accuracy and reliability, its practical application is limited by high costs, installation complexity, and maintenance requirements, especially in large-scale deployments or in existing buildings where retrofitting is difficult [2,11]. The need for physical access to each appliance and the ongoing maintenance of multiple sensors make ILM impractical for most residential and commercial applications. In contrast, NILM offers a cost-effective alternative by leveraging aggregate energy data collected from a single metering point to infer the consumption of each individual appliance [7,12]. First proposed by George Hart in 1985 at the Massachusetts Institute of Technology, NILM eliminates the need for extensive sensor networks, making it an attractive solution for widespread adoption in both residential and commercial settings [7,12]. The non-intrusive nature of this approach means that it can be deployed without modifying existing electrical installations, significantly reducing implementation barriers.
NILM algorithms comprise different approaches, which can be classified according to different criteria. From a sampling frequency perspective, NILM approaches are categorized as either high-frequency or low-frequency methods [2,5]. High-frequency NILM systems operate at sampling rates exceeding 1 kHz, capturing detailed electrical signatures such as harmonic content, transient waveforms, and current-voltage trajectories that enable precise appliance identification [2,13]. However, these approaches require sophisticated and expensive data acquisition hardware, limiting their scalability for mass-market applications. Low-frequency NILM methods, conversely, operate at conventional smart meter rates (typically ≤1 Hz) and rely on features such as active and reactive power, steady-state consumption levels, and temporal usage patterns [2,14]. While low-frequency approaches may sacrifice some discriminative power, they offer significant advantages in terms of cost-effectiveness and compatibility with existing smart meter infrastructure, making them particularly suitable for large-scale deployment scenarios.
Additionally, NILM methodologies can be distinguished as event-based or non-event-based approaches [2,15]. Event-based methods detect discrete changes in energy consumption patterns, such as the switching on or off of appliances, and use these transition events to identify individual loads [2,16]. These methods typically achieve high accuracy for appliances with distinct on/off signatures but may struggle with continuously variable loads. Non-event-based approaches, in contrast, analyze the continuous aggregate signal to extract features indicative of different appliances without explicitly identifying switching events [2]. Both paradigms have demonstrated effectiveness across various scenarios, with the choice often depending on the specific application requirements and available computational resources.
Despite significant advances in NILM research, several persistent challenges limit its widespread practical adoption. A fundamental limitation is the reliance on extensive and often specialized datasets for model training, which can restrict the transferability of models across different settings, appliance types, and geographical regions [8,14,17,18]. The availability of high-quality labeled data remains a critical bottleneck, as most existing datasets focus on residential environments in specific countries [19,20,21]. Furthermore, many state-of-the-art NILM systems impose substantial computational demands due to their reliance on complex neural network architectures, limiting their practical deployment on embedded or edge devices with constrained resources [5,22,23]. The reproducibility and comparability of NILM research also remain problematic due to the lack of efficient benchmarking frameworks with standardized evaluation procedures [5,24,25]. Moreover, the vast majority of NILM research has concentrated on residential environments, leaving commercial and industrial settings largely unexplored despite their substantial energy footprints and unique consumption characteristics [2,3,26].
Commercial and industrial buildings present distinct challenges compared to residential settings, including the prevalence of similar or identical loads such as multiple air conditioning units, lighting systems, and industrial machinery [3]. The energy consumption patterns in these environments differ significantly due to factors such as centralized heating, ventilation, and air conditioning (HVAC) systems, consistent operational schedules, and higher load densities [3,27]. Recent case studies have revealed that NILM has largely failed to disaggregate loads effectively in commercial buildings due to the overlapping signatures of similar appliances [3]. This gap underscores the need for methodological adaptations tailored to the specific characteristics of non-residential environments and comprehensive evaluations across diverse building types.
Despite this growing body of work, three specific gaps remain insufficiently addressed. First, the majority of NILM studies evaluate a single model architecture within a single domain (typically residential), making it difficult to draw generalizable conclusions about which model families are best suited for which load types. Second, emerging architectures inspired by foundation models, such as GPT-style causal transformers and Prophet-style temporal decomposition, have not been systematically evaluated for NILM or compared against classical alternatives under controlled conditions. Third, while prior transformer-based NILM studies (e.g., GRU-BERT [28], OPT-NILM [22]) have demonstrated attention-based improvements for residential loads, their applicability to commercial and industrial environments with fundamentally different load characteristics (high inter-unit correlation, coordinated HVAC schedules, high-power transient machinery) remains unexplored.
This work addresses these gaps through the following contributions:
1. A systematic, cross-domain evaluation framework that compares five model architectures, namely Vision Transformer (ViT), Variational Autoencoder (VAE), Random Forest (RF), and custom architectures inspired by TimeGPT and Prophet, across residential, commercial, and industrial environments under standardized experimental conditions. Unlike prior comparative NILM studies that typically focus on a single building type, this tri-sector analysis enables the identification of architecture–load type interactions that are not observable in single-domain evaluations.
2. An empirical analysis of how load-level characteristics, including signal-to-noise ratio, Pearson correlation with aggregate consumption, operational state complexity, and sampling frequency, mediate disaggregation performance across model architectures, providing evidence-based criteria for model–load pairing.
3. The introduction and evaluation of TimeGPT-inspired and Prophet-inspired architectures for NILM, which adapt causal attention and temporal decomposition paradigms from the time series forecasting literature to the energy disaggregation task, together with an assessment of their strengths relative to established NILM approaches.
4. Evidence-based, practitioner-oriented guidelines for algorithm selection based on load characteristics, dataset properties, and operational constraints, facilitating informed decision-making for deploying NILM solutions in heterogeneous building environments.
It is important to note that all experiments in this study are conducted offline on publicly available benchmark datasets; no online deployment or real-time inference constraints are evaluated. The conclusions are therefore valid under these specific data and sampling conditions.
This paper is organized as follows: Section 2 reviews related work in NILM, covering the evolution of algorithmic approaches from traditional machine learning to modern deep learning architectures. Section 3 presents the methodology, encompassing the problem description, input and output variable definitions, the cost function formulation, and practical recommendations along with the limitations of the proposed approach. Section 4 describes the experimental results, detailing the evaluation metrics employed and the performance analysis across different scenarios and appliance categories. Finally, Section 5 discusses the findings in relation to prior literature, and Section 6 concludes the paper with a summary of the key findings and outlines directions for future work.
2. Related Work
The field of NILM has undergone a significant transformation with the advent of artificial intelligence, particularly deep learning techniques, which have substantially advanced the capabilities and performance of disaggregation algorithms [5,8,14]. Recent advances in Neural Network (NN) architectures, computational hardware, and training methodologies have enabled solutions that were previously unattainable. Deep learning (DL) for NILM was first introduced in 2015 and quickly demonstrated substantial improvements in disaggregation performance and generalization capability compared to conventional approaches [5,22]. Since then, the adoption of Deep Neural Networks (DNNs) in NILM has grown rapidly.
Among the most influential deep learning architectures applied to NILM are Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Long Short-Term Memory (LSTM) networks [8,23,29]. CNNs have proven particularly effective for extracting spatial and temporal features from energy consumption patterns, while RNNs and LSTMs excel at modeling sequential dependencies and capturing long-term temporal dynamics [8,14]. The sequence-to-point learning paradigm, which maps a window of aggregate consumption to a single appliance output, has emerged as a dominant framework, demonstrating state-of-the-art performance across multiple benchmark datasets [8,30].
More recently, transformer-based architectures, originally developed for Natural Language Processing (NLP), have been successfully adapted for NILM applications [31,32,33]. The self-attention mechanism at the core of transformers enables models to capture global dependencies between aggregate and appliance-level signals, offering advantages over traditional recurrent approaches that process sequences sequentially [31,32,34]. These transformer-based NILM models have achieved impressive results, with F1-scores exceeding 92%, and have demonstrated effectiveness in accurately diagnosing and estimating the energy consumption of individual home appliances [31]. The integration of attention mechanisms, temporal pooling, and residual connections has further enhanced model performance, establishing transformers as a promising direction for future NILM research [31,35].
The state-of-the-art in NILM encompasses diverse methodological contributions that address various aspects of the disaggregation problem. The multi-NILM framework proposed by Nalmpantis and Vrakas [23] tackles NILM as a multi-label classification problem, employing dimensionality reduction techniques such as Signal2Vec to develop privacy-preserving and cost-effective solutions suitable for deployment on embedded devices [36]. Transfer learning approaches have emerged as a powerful strategy for enhancing the adaptability of NILM models across different datasets, appliances, and geographical regions [8,37,38]. D’Incecco et al. [8] demonstrated that appliance transfer learning and cross-domain transfer learning can significantly improve model generalization, with findings showing that only the fully connected layers require fine-tuning for effective knowledge transfer.
Advanced feature extraction techniques have also contributed substantially to improving NILM accuracy. The Two-Stream Convolutional Neural Networks (TSCNN) proposed by Chen et al. [13] leverage both temporal and spectral load signatures to provide rich distinguishing features for appliance recognition. The Gramian Angular Field (GAF) representation and affinity propagation clustering strategies have been employed to mitigate the negative impact of intra-class variety caused by multi-state loads [13]. Furthermore, the Mother–Son Model (MSM) introduced by Mei et al. [39] addresses challenges related to missing data and limited transfer capabilities through a novel feature inheritance mechanism, enabling dynamic decomposition of multiple appliances, including unknown devices.
Research has also addressed the practical deployment challenges of NILM systems. Studies such as OPT-NILM [22] propose cost-effective pruning approaches that can reduce the number of trainable parameters by up to 95% with minimal performance loss, making NILM more feasible for edge deployment. The DeepEdge-NILM case study [3] demonstrated the viability of NILM edge devices for monitoring multiple air conditioners in commercial buildings, highlighting the potential for real-world implementation beyond laboratory settings. Additionally, trainingless approaches based on multi-objective evolutionary computing [9] and sparse coding techniques [40] have extended NILM methodology to scenarios where extensive training data or plug-level sensing are unavailable.
Recent comprehensive surveys have systematically categorized the diverse algorithmic approaches addressing these challenges, ranging from traditional machine learning methods to modern deep learning architectures, including CNNs, RNNs, autoencoders, and transformer-based models, each exhibiting distinct strengths and limitations depending on the application context [41].
Despite the breadth of these contributions, several critical observations emerge from this literature analysis. First, the vast majority of studies evaluate individual architectures on residential datasets, making cross-architecture and cross-domain comparisons difficult due to differences in preprocessing, metrics, and train/test protocols. Second, while transformer-based approaches such as GRU-BERT [28] and pruned models like OPT-NILM [22] have shown promising results, they have been validated primarily on residential benchmarks (e.g., REDD, UK-DALE); their behavior on commercial loads with high inter-unit correlation, or on industrial loads with high-power transient signatures, remains uncharacterized. Third, temporal-decomposition paradigms inspired by Prophet and causal-attention paradigms inspired by GPT have not been adapted or evaluated for NILM, despite their demonstrated effectiveness in related time series forecasting tasks. The present study addresses these gaps by providing a controlled, multi-architecture comparison across three building domains under a unified experimental protocol.
3. Methodology
This section presents the methodological framework developed for NILM, encompassing the formal problem description, the definition of input and output variables, the cost function formulation employed for model optimization, and the description of the implemented algorithms. The methodology integrates multiple ML and DL approaches to address the energy disaggregation challenge across different building environments.
3.1. Problem Description
From a mathematical perspective, the NILM problem can be formulated as a blind source separation task. The aggregate power consumption measured at time $t$ is expressed as the superposition of individual appliance consumptions:

$$P(t) = \sum_{i=1}^{N} p_i(t) + \varepsilon(t),$$

where $P(t)$ denotes the total aggregate power consumption at time $t$, $p_i(t)$ represents the power consumption of the $i$-th appliance, $N$ corresponds to the total number of appliances in the monitored environment, and $\varepsilon(t)$ accounts for the noise term, which encompasses measurement errors, base loads, and contributions from unknown or unmonitored devices.
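The additive formulation can be illustrated with a small synthetic example; the appliance profiles, cycle lengths, and noise level below are illustrative and are not drawn from any dataset used in this study.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 120  # number of 1-minute samples (illustrative horizon)

# Two synthetic appliance profiles: a cycling fridge and a one-shot oven.
fridge = 100.0 * (np.arange(T) % 30 < 15).astype(float)  # 15-min on/off cycle
oven = np.zeros(T)
oven[40:70] = 2000.0                                     # single 30-min activation

noise = rng.normal(0.0, 5.0, size=T)                     # zero-mean measurement noise

# Aggregate signal: P(t) = sum_i p_i(t) + eps(t)
aggregate = fridge + oven + noise
```

The disaggregation task is to recover `fridge` and `oven` given only `aggregate`, which is underdetermined unless the appliance signatures are distinctive.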
This formulation rests on several key assumptions. First, the noise term is assumed to be zero-mean and stationary over the observation window, i.e., $\mathbb{E}[\varepsilon(t)] = 0$ and $\operatorname{Var}[\varepsilon(t)] = \sigma_\varepsilon^2$ for all $t$. Second, individual appliance signals are assumed to be mutually independent conditional on external factors (time of day, occupancy), which is a simplification that may be violated when centralized building management systems coordinate multiple loads (e.g., AHU units). Third, the problem is inherently ill-posed and underdetermined since $N$ unknowns must be recovered from a single measurement; identifiability therefore relies on the assumption that each appliance possesses a sufficiently distinctive consumption signature—characterized by power levels, state-transition patterns, or temporal periodicity—such that supervised models can learn discriminative mappings from labeled examples. When these conditions are weakly satisfied (e.g., low signal-to-noise ratio, high inter-appliance correlation), disaggregation accuracy is expected to degrade regardless of model complexity, as confirmed by our experimental results.
The complexity of the NILM problem stems from several intrinsic characteristics of electrical consumption patterns. First, power signals exhibit non-linear behavior, with appliances displaying various operational states, transient responses, and consumption profiles that do not follow simple additive relationships. Second, multiple appliances may operate simultaneously with overlapping signatures, making it challenging to distinguish individual contributions from the aggregate signal. Third, the same appliance type may exhibit different consumption patterns depending on operating conditions, user behavior, and environmental factors, introducing intra-class variability that complicates recognition tasks.
In the context of supervised learning approaches, the NILM problem can be reformulated as learning a function $f$ that maps the aggregate consumption to individual appliance consumptions:

$$\hat{p}_i(t) = f_i(\mathbf{x}_t),$$

where $\hat{p}_i(t)$ is the predicted consumption of appliance $i$ at time $t$, $\mathbf{x}_t = \big[P(t - \lfloor w/2 \rfloor), \ldots, P(t + \lfloor w/2 \rfloor)\big]$ represents a temporal window of aggregate consumption centered at time $t$ with width $w$, and $f_i$ is the learned disaggregation function for appliance $i$. This formulation, known as sequence-to-point learning, has demonstrated state-of-the-art performance in recent NILM literature, as the temporal context surrounding a measurement point provides valuable information for accurate prediction [8,30].
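The sequence-to-point windowing can be sketched as follows; the function name and the toy signal are illustrative, not part of the study's codebase.

```python
import numpy as np

def seq2point_windows(aggregate, window):
    """Build centered sliding windows and their midpoint target indices.

    Each window of odd width `window` is paired with the index of its
    midpoint, mirroring the sequence-to-point mapping p_hat(t) = f(x_t).
    """
    half = window // 2
    X, mid_idx = [], []
    for t in range(half, len(aggregate) - half):
        X.append(aggregate[t - half : t + half + 1])
        mid_idx.append(t)
    return np.asarray(X), np.asarray(mid_idx)

signal = np.arange(10, dtype=float)  # toy aggregate series
X, idx = seq2point_windows(signal, window=5)
# The first window covers samples 0..4 and predicts the target at t = 2.
```

Each input window yields exactly one prediction at its center, which is why sequence-to-point avoids the averaging of overlapping outputs inherent in sequence-to-sequence methods.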
3.2. Input Variables
The input to the NILM algorithm consists of the aggregate power consumption signal acquired from a single metering point. In this work, the primary input variable is the active power measurement $P(t)$, sampled at regular intervals determined by the dataset’s temporal resolution. The input signal is processed as a time series that captures the cumulative electrical demand of all connected loads within the monitored environment.
For DL models, the input is structured as a sliding window of consecutive measurements:

$$\mathbf{x}_t = \big[P(t - \lfloor w/2 \rfloor), \ldots, P(t), \ldots, P(t + \lfloor w/2 \rfloor)\big], \quad t = 1, \ldots, T,$$

where $\mathbf{x}_t$ represents the input vector at time $t$, $T$ is the total sequence length, and $w$ defines the temporal context considered for prediction. This windowed approach enables models to capture temporal dependencies and contextual information that characterize appliance activation and deactivation patterns. In this study, the window length is set to $w = 480$ samples for all deep learning models, corresponding to 480 min (8 h) for AMPds, 240 min (4 h) for COmBED, and 480 s (8 min) for IMDELD. This value was selected to capture at least one full operational cycle of the target appliances while remaining computationally tractable. The sensitivity of the results to window size was not systematically explored and constitutes a limitation of the current study.
Prior to model training, the input data undergoes a standardization transformation to enhance learning efficiency:

$$\tilde{P}(t) = \frac{P(t) - \mu_{\text{train}}}{\sigma_{\text{train}}},$$

where $\mu_{\text{train}}$ and $\sigma_{\text{train}}$ represent the mean and standard deviation computed exclusively on the training partition. These statistics are then applied to normalize both validation and test sets, ensuring no information leakage from future data into the training process. This standardization ensures that input features have zero mean and unit variance, facilitating gradient-based optimization and improving convergence properties.
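A minimal sketch of this leakage-free standardization, using made-up values:

```python
import numpy as np

def fit_standardizer(train):
    """Compute mean and std on the training partition ONLY."""
    return float(np.mean(train)), float(np.std(train))

def standardize(x, mu, sigma):
    """Apply training-set statistics to any partition without refitting."""
    return (x - mu) / sigma

train = np.array([100.0, 200.0, 300.0, 400.0])
test = np.array([250.0])

mu, sigma = fit_standardizer(train)    # statistics from training split only
z_test = standardize(test, mu, sigma)  # applied unchanged to the test split
```

Refitting the statistics on validation or test data would leak information about those partitions into the preprocessing pipeline, which is exactly what the training-only convention prevents.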
Additionally, energy consumption datasets often exhibit significant class imbalance, with appliances remaining inactive for extended periods. To mitigate this, we employ a combination of strategies: (i) oversampling of minority-class windows containing appliance activation events during training batch construction; and (ii) asymmetric weighting of the loss function that penalizes missed activations more heavily than false positives. Recent research has established criteria for assessing dataset suitability for data reduction techniques based on class imbalance, temporal structure, and feature redundancy, enabling computational cost reduction while maintaining predictive accuracy [42].
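The asymmetric loss weighting in strategy (ii) can be sketched as follows; the on-threshold and weight values are illustrative assumptions, not the values used in this study.

```python
import numpy as np

def asymmetric_weighted_mse(y_true, y_pred, on_threshold=10.0, on_weight=4.0):
    """MSE that up-weights samples where the appliance is active.

    `on_threshold` (watts) and `on_weight` are illustrative hyperparameters:
    errors during activation periods are penalized `on_weight` times more
    than errors during inactivity, discouraging missed activations.
    """
    weights = np.where(y_true > on_threshold, on_weight, 1.0)
    return float(np.mean(weights * (y_true - y_pred) ** 2))

y_true = np.array([0.0, 0.0, 1000.0])
y_pred = np.array([0.0, 0.0, 900.0])
loss = asymmetric_weighted_mse(y_true, y_pred)
```

With mostly-off appliances, an unweighted loss can be minimized by predicting zero everywhere; the asymmetric weights remove that degenerate optimum.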
For traditional ML approaches such as RF, the input is augmented with engineered features extracted from temporal windows, including statistical features (mean, median, standard deviation, variance, percentiles), change detection features (successive differences, total variation, zero-crossing rate), and spectral features derived from Fast Fourier Transform (FFT) analysis.
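The feature extraction for one window can be sketched as follows; the exact feature set and band boundaries are illustrative and need not match the study's implementation in detail.

```python
import numpy as np

def extract_window_features(window):
    """Statistical, change-detection, and FFT-band features for one window."""
    diffs = np.diff(window)
    spectrum = np.abs(np.fft.rfft(window - np.mean(window)))
    # Split spectral energy into four coarse frequency bands.
    bands = [float(np.sum(b ** 2)) for b in np.array_split(spectrum, 4)]
    features = [
        float(np.mean(window)), float(np.median(window)),      # central tendency
        float(np.std(window)), float(np.percentile(window, 90)),  # dispersion/shape
        float(np.sum(np.abs(diffs))),                           # total variation
        float(np.max(np.abs(diffs))),                           # maximum change
        int(np.sum(np.diff(np.sign(diffs)) != 0)),              # sign-change count
    ] + bands
    return np.array(features)

feat = extract_window_features(np.array([1.0, 2.0, 4.0, 2.0, 1.0, 2.0, 4.0, 2.0]))
```

The resulting fixed-length vector replaces the raw window as the RF input, which is what lets a tree ensemble handle variable-looking temporal patterns.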
3.3. Output Variables
The output of the NILM system corresponds to the estimated power consumption of individual target appliances. For each appliance
, where
M denotes the number of target appliances, the model generates a prediction:
where
represents the predicted power consumption of appliance
i at time
t, and
T is the total number of time steps in the evaluation period. The non-negativity constraint reflects the physical reality that power consumption cannot be negative.
In sequence-to-point architectures, the output corresponds to the predicted consumption at the midpoint of the input window:

$$\hat{p}_i(t) = f_{\theta_i}(\mathbf{x}_t),$$

where $f_{\theta_i}$ represents the neural network parameterized by $\theta_i$ for appliance $i$. This approach generates a single prediction for each input window, avoiding the averaging effects inherent in sequence-to-sequence methods and providing more precise temporal localization of consumption events.
Post-processing operations are applied to ensure physically meaningful predictions:

$$\hat{p}_i(t) \leftarrow \min\!\big(\max(\hat{p}_i(t), 0), \, P_i^{\max}\big),$$

where $P_i^{\max}$ represents the maximum rated power of appliance $i$, constraining predictions within feasible operational bounds.
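A minimal sketch of this clipping step; the rated-power value is illustrative.

```python
import numpy as np

def postprocess(pred, p_max):
    """Clamp raw model outputs to the physically feasible range [0, p_max]."""
    return np.clip(pred, 0.0, p_max)

raw = np.array([-12.0, 150.0, 2600.0])
clean = postprocess(raw, p_max=2400.0)  # 2400 W is an illustrative rating
# clean == [0.0, 150.0, 2400.0]
```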
3.4. Cost Function
The training objective minimizes the discrepancy between predicted and actual appliance consumption values. The cost function quantifies this discrepancy and guides the optimization process through gradient-based parameter updates. The selection of an appropriate cost function significantly influences model behavior and generalization capabilities.
3.4.1. Primary Loss Functions
The Mean Squared Error (MSE) serves as the primary loss function for regression-based NILM models:

$$\mathcal{L}_{\text{MSE}} = \frac{1}{n} \sum_{t=1}^{n} \big(p_i(t) - \hat{p}_i(t)\big)^2,$$

where $p_i(t)$ denotes the ground-truth consumption of appliance $i$ at time $t$, $\hat{p}_i(t)$ represents the corresponding prediction, and $n$ is the number of training samples. The MSE penalizes larger errors more heavily due to the quadratic term, making it sensitive to outliers and encouraging the model to minimize substantial prediction errors.

The Mean Absolute Error (MAE) provides an alternative loss formulation that exhibits greater robustness to outliers:

$$\mathcal{L}_{\text{MAE}} = \frac{1}{n} \sum_{t=1}^{n} \big|p_i(t) - \hat{p}_i(t)\big|.$$

The linear penalty structure of MAE treats all error magnitudes proportionally, making it more suitable for datasets with significant noise or occasional measurement anomalies.
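The differing outlier sensitivity of the two losses can be verified with a small numerical example:

```python
import numpy as np

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

def mae(y, y_hat):
    return float(np.mean(np.abs(y - y_hat)))

y = np.zeros(10)
y_hat = np.zeros(10)
y_hat[0] = 100.0          # a single large outlier error

# The quadratic penalty makes MSE react far more strongly than MAE.
mse_val = mse(y, y_hat)   # 1000.0
mae_val = mae(y, y_hat)   # 10.0
```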
3.4.2. Variational Autoencoder Loss
For the VAE implementation, the loss function combines reconstruction accuracy with regularization of the latent space distribution:

$$\mathcal{L}_{\text{VAE}} = \mathcal{L}_{\text{rec}} + \beta \, D_{\text{KL}},$$

where $\mathcal{L}_{\text{rec}}$ represents the reconstruction loss, and the Kullback-Leibler (KL) divergence term is defined as:

$$D_{\text{KL}} = -\frac{1}{2} \sum_{j=1}^{d} \big(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\big),$$

where $\mu_j$ and $\sigma_j$ are the mean and standard deviation of the latent distribution, $d$ is the latent space dimension, and $\beta$ is a weighting coefficient that balances reconstruction fidelity against latent space regularization.
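The closed-form KL term can be sketched as follows; the log-variance parameterization is a common implementation convention assumed here, not a detail stated in the paper.

```python
import numpy as np

def kl_divergence(mu, log_var):
    """Closed-form KL between N(mu, diag(sigma^2)) and the standard normal prior."""
    return float(-0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var)))

def vae_loss(recon_loss, mu, log_var, beta=1.0):
    """Composite VAE objective: reconstruction plus beta-weighted KL."""
    return recon_loss + beta * kl_divergence(mu, log_var)

# KL is zero when the posterior already matches the prior N(0, I).
kl_at_prior = kl_divergence(np.zeros(4), np.zeros(4))
```

Setting `beta` above 1 pushes the latent posterior toward the prior at the cost of reconstruction fidelity, which is the trade-off the weighting coefficient controls.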
3.4.3. Optimization Procedure
All deep learning models are optimized using the Adam optimizer [43], with an initial learning rate selected based on preliminary experiments and consistent with common practice in the NILM literature [8]. L2 weight regularization is applied to prevent overfitting, given the high temporal autocorrelation in energy time series, which reduces the effective number of independent training samples. Training proceeds for a maximum of 100 epochs, with early stopping (patience = 10 epochs) based on validation loss. The choice of MSE as the primary loss for regression models and the composite VAE loss (Equation (10)) reflects a deliberate trade-off: MSE penalizes large errors more heavily, which is desirable for NILM, where missing high-power activation events are more consequential than small steady-state errors.
3.5. Vision Transformer
The ViT architecture [44], originally designed for image processing, was adapted for temporal signal processing in NILM applications. The implementation divides energy consumption time series into patches or discrete segments, analogous to how the original ViT processes images.
The architecture comprises three main components. Patch Embedding transforms the input time series into a sequence of patch embeddings, where each segment of size $P$ is linearly projected to a feature space of dimension $D$. Additionally, a classification token (CLS token) and learnable positional embeddings are incorporated to maintain temporal information. The number of patches is calculated as

$$N_{\text{patches}} = \left\lfloor \frac{w}{P} \right\rfloor,$$

where $w$ is the input sequence length.
Transformer Blocks implement encoder blocks including Multi-Head Self-Attention (MHSA), Feedforward Neural Networks (FFNN) with Gaussian Error Linear Unit (GELU) activation, residual connections, and layer normalization. The Multilayer Perceptron (MLP) Head processes the CLS token to generate final consumption predictions per device.
Two configurations were implemented: ViT-small (3 transformer blocks, 23,425 parameters) and ViT-large (12 transformer blocks, 1,252,481 parameters). The overall ViT-NILM architecture is illustrated in Figure 1.
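The patch construction can be sketched as follows, assuming a 480-sample input window (as in the AMPds minute-resolution configuration) and an illustrative patch size of 16; neither the patch size nor the function name is taken from the paper.

```python
import numpy as np

def patchify(x, patch_size):
    """Split a 1-D window into non-overlapping patches (ViT-style tokens)."""
    n_patches = len(x) // patch_size
    return np.asarray(x[: n_patches * patch_size]).reshape(n_patches, patch_size)

window = np.arange(480, dtype=float)       # a 480-sample input window
patches = patchify(window, patch_size=16)  # 480 // 16 = 30 patch tokens
# Prepending a CLS token would give a transformer sequence of 31 tokens.
```

Each row of `patches` is then linearly projected to the embedding dimension before positional embeddings are added.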
The transformer-based approach for NILM has been further advanced through hybrid architectures that combine sequential models with attention mechanisms. Recent work has demonstrated that integrating Gated Recurrent Unit (GRU) networks with Bidirectional Encoder Representations from Transformers (BERT) style attention captures both short-term temporal dependencies and long-range contextual information, achieving improved disaggregation accuracy and robustness [28].
3.6. Variational Autoencoder
The VAE implementation uses an unsupervised learning approach for energy disaggregation, based on the model’s capacity to learn latent representations of energy consumption patterns. The encoder projects input data to a lower-dimensional latent space using 1D convolutional and dense layers, while the decoder reconstructs consumption predictions from this latent space. The implemented VAE-NILM architecture is shown in Figure 2.
The loss function combines reconstruction loss and KL divergence as described in Equation (11). The architecture includes 75,264 total parameters with a configurable latent dimension.
3.7. Random Forest
The RF algorithm for NILM disaggregation extracts multi-domain features from sliding temporal windows. Statistical features include measures of central tendency (mean, median), dispersion (standard deviation, variance), and distribution shape characteristics (percentiles). Change detection features capture temporal dynamics through the analysis of successive differences $\Delta P(t) = P(t) - P(t-1)$, including total variation, maximum change, and zero-crossing rate. The RF-NILM architecture is depicted in Figure 3.
Spectral features from FFT provide complementary frequency information, including total spectral energy, dominant frequency, and energy distribution across four frequency bands. The ensemble model comprises $N_T$ decision trees with a maximum depth of 30 levels, a minimum of 5 samples per split, and a minimum of 5 samples per leaf. These hyperparameters were selected through manual tuning on a held-out validation set, starting from a default configuration of 100 trees and progressively increasing complexity until validation performance plateaued. The final prediction for appliance $j$ is obtained by averaging over the ensemble:

$$\hat{p}_j = \frac{1}{N_T} \sum_{i=1}^{N_T} h_i(\mathbf{f}),$$

where $\mathbf{f}$ represents the extracted feature vector and $h_i(\mathbf{f})$ denotes the prediction from the $i$-th decision tree.
3.8. TimeGPT-Inspired Architecture
The TimeGPT-inspired model represents a custom architecture that draws on principles from Generative Pre-trained Transformer (GPT) language models, which are specifically adapted for energy consumption time series. Unlike the proprietary Nixtla TimeGPT service, this implementation is trained from scratch on NILM data and does not use pre-trained weights. The architecture includes temporal embedding with learnable positional encoding, temporal convolutions with different dilation rates 1, 2, 4 and 8 for multi-scale feature extraction, four transformer blocks with causal attention (represented as gray blocks in
Figure 4) to preserve the temporal ordering of the input sequence, and global average pooling followed by dense layers for point prediction. The causal attention mask—the key design element borrowed from GPT-style models—ensures that predictions at time
t depend only on past and present observations, unlike the bidirectional attention in ViT. The model comprises 854,656 parameters with 4 transformer blocks.
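The causal-masking idea can be illustrated with a minimal single-head attention sketch in NumPy; this is a simplified illustration (no learned query/key/value projections), not the paper's implementation:

```python
import numpy as np

def causal_mask(T):
    # Lower-triangular mask: position t may attend only to positions <= t.
    return np.tril(np.ones((T, T), dtype=bool))

def causal_self_attention(x, mask):
    """Single-head scaled dot-product attention over x of shape (T, d),
    using x itself as queries, keys, and values for simplicity."""
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)             # (T, T) similarity scores
    scores = np.where(mask, scores, -np.inf)  # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x
```

Because the mask zeroes out attention weights on future positions, perturbing a later timestep leaves all earlier outputs unchanged, which is exactly the property that distinguishes this design from the bidirectional attention used in ViT.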
3.9. Prophet-Inspired Architecture
The Prophet-inspired model adapts the temporal decomposition philosophy of Facebook’s Prophet—which decomposes time series into trend, seasonality, and holiday components—but implements it through a fundamentally different neural architecture rather than the additive regression model of the original Prophet. Specifically, seasonal features are extracted through sinusoidal functions at multiple frequencies (daily, weekly, and custom periods), and trend is modeled using linear and quadratic temporal features, consistent with the Prophet philosophy. However, unlike the original Prophet, sequential dependencies are captured through optional bidirectional LSTM layers, and a custom attention mechanism with batch normalization and residual connections is employed to learn adaptive weighting of seasonal and trend components. This hybrid design is motivated by the observation that appliance consumption patterns exhibit both deterministic periodicity (captured by the decomposition) and stochastic transient behavior (captured by the LSTM and attention components). The architecture includes 109,184 total parameters. The Prophet-NILM architecture is illustrated in
Figure 5.
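The seasonal and trend feature construction described above can be sketched as follows; the period values (in minutes, for minute-resolution data) and the use of a single Fourier pair per period are illustrative assumptions, as the original implementation may use more terms:

```python
import numpy as np

def seasonal_trend_features(t, periods=(1440.0, 10080.0)):
    """Sinusoidal seasonal features (daily = 1440 min, weekly = 10080 min)
    plus linear and quadratic trend terms. Illustrative sketch."""
    t = np.asarray(t, dtype=float)
    cols = []
    for p in periods:                     # one sin/cos pair per period
        cols.append(np.sin(2.0 * np.pi * t / p))
        cols.append(np.cos(2.0 * np.pi * t / p))
    span = max(t.max() - t.min(), 1e-9)
    t_norm = (t - t.min()) / span
    cols.append(t_norm)                   # linear trend
    cols.append(t_norm ** 2)              # quadratic trend
    return np.stack(cols, axis=1)
```

These deterministic features would then feed the LSTM and attention components described in the text, which handle the stochastic transient behavior the decomposition cannot capture.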
3.10. Datasets
Three benchmark datasets were employed to evaluate the proposed framework across different building types:
1. AMPds (Almanac of Minutely Power Dataset) [21]: A residential dataset containing minute-resolution measurements from a single Canadian household over two years (2012–2014), collected at Simon Fraser University (Burnaby, BC, Canada). Data acquisition was performed using a custom metering board based on a Raspberry Pi single-board computer combined with a Powerhouse Dynamics eMonitor whole-home energy monitor. The dataset is publicly distributed via the Harvard Dataverse repository. It includes 21 sub-metered loads representing typical residential appliances. For evaluation, three representative loads were selected: refrigerator (continuously variable consumption), heat pump (finite-state machine behavior), and electric oven (on/off behavior).
2. COmBED (Commercial Building Energy Dataset) [25]: A commercial dataset monitoring energy consumption in an office building at the Indian Institute of Technology Delhi (New Delhi, India) at 30-s resolution over one month. Power measurements were collected using Watts Up? Pro energy meters, and dataset formatting and preprocessing were performed using the open-source NILMTK v0.2 toolkit [5]. Monitored loads include five Air Handling Units (AHU), lighting systems, two socket circuits, and an elevator. This dataset most closely represents the target commercial building application.
3. IMDELD (Industrial Machines Dataset for Electrical Load Disaggregation) [45]: An industrial dataset with 1-s resolution collected over 111 days from a feed production plant in Minas Gerais, Brazil, by researchers at the Universidade Federal de Minas Gerais (UFMG, Belo Horizonte, Brazil). Electrical variables—including voltage, current, and active, reactive, and apparent power—were recorded using WEG CFW-11 variable-frequency drive monitoring units with calibrated current transducers at each sub-circuit. The dataset is publicly available via IEEE Dataport. It includes pelletizers, bipolar contactors, exhaust fans, and milling machines. Three representative loads were selected: Pelletizer I (finite states), Exhaust Fan I (on/off), and Bipolar Contactor I (continuously variable).
For each dataset, three representative loads were selected to span the range of operational behaviors encountered in practice: (i) a continuously variable load; (ii) a finite-state or quasi-periodic load; (iii) a sporadic or event-driven load. This selection strategy ensures that the evaluation covers loads with diverse signal-to-noise ratios, Pearson correlations with the aggregate signal, and duty cycle characteristics.
Table 1 summarizes the Pearson correlation between each selected load and the corresponding aggregate signal, which serves as a proxy for disaggregation difficulty.
Table 2 provides a consolidated summary of the hyperparameter configurations for all models to facilitate reproducibility.
3.11. Evaluation Metrics
Performance evaluation employed multiple metrics to capture different aspects of prediction accuracy:
Root Mean Squared Error (RMSE): $\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(\hat{y}_t - y_t\right)^2}$

Mean Absolute Error (MAE): $\mathrm{MAE} = \frac{1}{T}\sum_{t=1}^{T}\left|\hat{y}_t - y_t\right|$

where $y_t$ and $\hat{y}_t$ denote the actual and predicted appliance power at time $t$, and $T$ is the number of samples.
F1-Score: In the NILM context, the F1-score is computed by first converting continuous power predictions into binary ON/OFF classifications using a power threshold. Specifically, an appliance is considered ON at time $t$ if $\hat{y}_t > \tau$, where $\tau$ is set to 15 W for low-power appliances (e.g., lighting, sockets) and 50 W for high-power appliances (e.g., heat pumps, pelletizers), following thresholds commonly used in the NILM literature [20]. Precision, recall, and F1 are then computed on a per-sample basis:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
This metric captures the model’s ability to correctly detect appliance activation events, complementing the continuous regression metrics (RMSE, MAE, R2) with a state-detection perspective. For loads that are continuously ON (e.g., refrigerators, sockets with constant base load), the F1-score may approach 1.0 trivially and should be interpreted with caution.
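The threshold-based state detection can be made concrete with a short function; the 15 W and 50 W thresholds follow the text, while the function name is illustrative:

```python
import numpy as np

def nilm_f1(y_true, y_pred, threshold=50.0):
    """Event-detection F1: threshold continuous power into ON/OFF states.

    threshold: 15 W for low-power loads (lighting, sockets) or 50 W for
    high-power loads (heat pumps, pelletizers), as described in the text.
    """
    on_true = np.asarray(y_true, dtype=float) > threshold
    on_pred = np.asarray(y_pred, dtype=float) > threshold
    tp = int(np.sum(on_true & on_pred))
    fp = int(np.sum(~on_true & on_pred))
    fn = int(np.sum(on_true & ~on_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    return 2.0 * precision * recall / denom if denom else 0.0
```

Note that a load that never crosses the threshold in either signal yields an F1 of 0 by convention here, and a continuously ON load yields an F1 near 1 regardless of regression quality, echoing the caution above.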
Normalized Error in Assigned Power (NEP) [46]: $\mathrm{NEP}_j = \dfrac{\sum_{t=1}^{T}\left|\hat{y}_j(t) - y_j(t)\right|}{\sum_{t=1}^{T} y_j(t)}$
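Assuming the common definition of NEP used in the NILM literature (total absolute error normalized by the load's total true consumption), the metric reduces to a one-liner:

```python
import numpy as np

def nep(y_true, y_pred):
    """Normalized Error in Assigned Power: total absolute disaggregation
    error divided by the load's total true consumption (common definition)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.abs(y_true - y_pred).sum() / y_true.sum())
```

Unlike RMSE, this metric is scale-free, which makes it comparable across loads with very different power ratings.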
3.12. Limitations
Several limitations warrant explicit consideration when interpreting the findings of this study.
Dataset heterogeneity and generalizability. The three datasets differ substantially in temporal resolution (1 s, 30 s, 60 s), duration (1 month to 2 years), geographic origin, and number of monitored loads. These differences introduce a structural bias: the higher performance observed in the IMDELD industrial dataset may be attributable, at least in part, to its finer temporal resolution rather than to inherent properties of industrial loads. We do not attempt to control for this confound, and direct cross-dataset performance comparisons should be interpreted cautiously. Furthermore, all three datasets are publicly available benchmarks that may not fully represent the variability of real-world building environments, where factors such as sensor drift, missing data, and evolving occupancy patterns introduce additional challenges.
Limited load selection. Only three representative loads per dataset were evaluated. While these were selected to span a range of operational behaviors (continuous, finite-state, sporadic), the results may not generalize to all appliance types—particularly to loads with intermediate characteristics or to multi-state appliances with many operating modes.
Window size sensitivity. A single, fixed sequence length was used across all datasets and models. The optimal window size likely varies by appliance duty cycle, sampling rate, and model architecture. A systematic window-size ablation study would strengthen the conclusions but was not conducted due to computational constraints.
Absence of computational cost analysis. Although computational efficiency is relevant for practical NILM deployment, a detailed quantitative analysis of training time, inference latency, and memory requirements was not performed in this study. Such an analysis would require controlled benchmarking on standardized hardware and is identified as an important direction for future work.
Offline evaluation only. All experiments are conducted in an offline, batch-processing setting. Real-time deployment introduces additional constraints (streaming inference, concept drift, limited memory) that are not addressed here.
4. Results
This section presents the experimental results obtained from evaluating six disaggregation models in three different energy consumption environments, using five metrics widely adopted in the NILM literature: RMSE, MAE, F1-score, R2, and NDE. Results are organized by deployment context, beginning with the residential environment (AMPds dataset) featuring household appliances with diverse consumption patterns, followed by the industrial environment (IMDELD dataset) containing high-power electromechanical equipment, and concluding with the commercial environment (COmBED dataset) representing building-level loads including HVAC systems and auxiliary circuits.
For each environment, we analyze performance across appliances exhibiting different operational characteristics—continuous variable consumption, finite-state operation, and on/off switching behavior—to provide a comprehensive assessment of model suitability across load types. Best-performing values for each metric are highlighted in bold within the corresponding tables.
4.1. Residential Environment (AMPds)
Table 3 presents the complete results for the AMPds dataset across the three selected appliances.
For the refrigerator, which exhibits the lowest Pearson correlation (0.097) with aggregate consumption, Prophet emerged as the best option with an F1-score of 0.55 and the lowest RMSE (71.99). Time-series-based models showed an advantage due to their inherent capacity for modeling seasonal trends and periodic patterns. However, the near-zero R2 (0.06) confirms the inherent difficulty of detecting low relative consumption loads operating continuously with subtle variations.
The heat pump, with a Pearson correlation of 0.512 with the aggregate consumption, proved to be the most identifiable load. RF achieved outstanding performance with an F1-score of 0.97 and an R2 of 0.90, suggesting that loads with well-defined operational states and a high contribution to total consumption are ideal candidates for disaggregation using traditional machine learning techniques. The RF model accurately captures the ON/OFF state transitions characteristic of this finite-state load, as illustrated in
Figure 6.
The electric oven results revealed the greatest divergence from expectations. With an intermediate Pearson correlation of 0.182, only Prophet (F1-score = 0.50) and TimeGPT (F1-score = 0.41) achieved any detection capability. Most models, including RF and ViT variants, obtained F1-scores of 0.00, indicating complete failure in identification. This unexpected result suggests that Pearson correlation alone is not a sufficient predictor of NILM performance, with sporadic oven usage and potential confusion with other high-power loads complicating the disaggregation task. The TimeGPT-inspired model detects some activation events but struggles with the sporadic usage pattern, as shown in
Figure 7.
4.2. Industrial Environment (IMDELD)
Table 4 presents results for the industrial dataset.
For the pelletizer, ViT-based models demonstrated superior performance. ViT-small achieved the best overall RMSE (10,013.82), the lowest NDE (0.18), and an R2 of 0.94. ViT-large obtained comparable results with the best MAE (3748.17). Both ViT variants significantly outperformed traditional methods and baselines, while TimeGPT, VAE, and Mean failed completely, with R2 = 0 and RMSE exceeding 45,000.
Figure 8 illustrates the ViT-small disaggregation output for Pelletizer I, confirming accurate tracking of high-power state transitions at 1-s resolution.
For the exhaust fan, close competition was observed among the approaches. ViT-small led with the best RMSE (593.42), the lowest NDE (0.26), and the best R2 (0.88). Prophet showed practically equivalent performance with an RMSE of 640.00, the best MAE (180.68), and an excellent F1-score of 0.97. RF also achieved an F1-score of 0.97, indicating excellent on/off state detection capability.
For the bipolar contactor, near-equivalent performance was observed across methodologies. Prophet obtained the best RMSE (181.41) and the best NDE (0.22), with an R2 of 0.91. ViT-small followed closely with an RMSE of 187.41, the best MAE (108.63), and the same R2 of 0.91. Both models demonstrated strong capacity for modeling this low-power load.
4.3. Commercial Environment (COmBED)
Table 5 presents results for the Air Handling Units, representing the primary HVAC loads in commercial buildings.
Analysis of the four AHU units revealed distinctive behavior patterns, indicating different operational dynamics across the systems. For AHU I, RF emerged as the best-performing model with an RMSE of 688.12 and an R2 of 0.67, demonstrating a strong balance between precision and generalization. The RF disaggregation output for AHU I is shown in
Figure 9. AHU II exhibited similar characteristics, with RF maintaining its superiority (RMSE 633.47, R2 0.77).
A different pattern was observed for AHU III, where transformer-based architectures demonstrated superiority. ViT-large achieved the lowest RMSE (595.63), an R2 of 0.90, and an F1-score of 0.80. This shift toward attention-based models suggests that this unit exhibits more complex temporal patterns that benefit from the transformer’s capacity to capture long-range dependencies.
AHU IV exhibited the most distinctive behavior, with ViT-small achieving a near-perfect fit (R2 = 0.97) and the lowest NDE of 0.12. However, the marked variability in performance across models (some exceeding RMSE 5000.00) suggests that this unit possesses unique operational characteristics that require specific architectural choices for effective modeling.
Table 6 presents results for other commercial building loads.
The elevator proved to be a challenging load, with a low correlation (0.251) and low R2 values across all models. Prophet and RF demonstrated comparable performance (R2 of 0.08 and 0.11, respectively), with RF achieving a slightly lower RMSE (725.09). The irregular, event-driven usage pattern that limits all models to low R2 values is illustrated in
Figure 10.
The lighting system exhibited characteristics that clearly favor transformer architectures. ViT-large achieved the best performance with an RMSE of 381.95 and an R2 of 0.76, along with the lowest NDE of 0.16. These results indicate that lighting consumption patterns, although well-defined, contain temporal subtleties that transformers capture more effectively than traditional methods.
Figure 11 shows the ViT-large predictions for the lighting system, capturing the temporal subtleties that transformers exploit more effectively than traditional methods.
The socket circuits presented the most complex and instructive scenarios. Socket I exhibited a paradoxical result: the baseline Mean model outperformed all sophisticated architectures (RMSE 120.51, MAE 74.68, F1-score 0.88). This suggests that this load exhibits such high stochastic variability that complex models suffer from overfitting, while a simple mean estimate captures the central tendency more effectively. In contrast, Socket II showed moderately improved predictability, with ViT-small achieving the best performance (RMSE 271.30, R2 0.30). The ViT-small disaggregation output for Socket II is shown in
Figure 12.
5. Discussion
The comprehensive evaluation across three distinct building types reveals several important insights for NILM implementation using AI techniques.
5.1. Model Performance by Load Type
The results demonstrate clear patterns in model effectiveness related to load characteristics. For loads with well-defined finite states and high contribution to aggregate consumption (heat pump, AHU systems), traditional machine learning approaches, particularly RF, consistently achieved superior performance. This finding aligns with prior observations by Kelly and Knottenbelt [20] and Bousbiat et al. [5], who state that simpler models can match or exceed deep learning on well-structured NILM tasks, and can be attributed to RF’s effectiveness at capturing discrete state transitions through axis-aligned partitions of the engineered feature space without requiring the extensive temporal context that deep learning models need.
Transformer-based architectures excelled in scenarios with complex temporal patterns and high-power industrial loads. The ViT-small configuration frequently outperformed ViT-large, a finding consistent with the broader machine learning literature on the “double descent” phenomenon and the bias-variance trade-off in overparameterized models. Specifically, for time series with strong autocorrelation, the effective number of independent training examples is substantially smaller than the nominal sample count. Belkin et al. [47] showed that models entering the interpolation regime without sufficient effective data tend to memorize rather than generalize. In our NILM context, ViT-large (1.25 M parameters) likely enters this regime on the smaller effective datasets, while ViT-small (23 K parameters) remains in the classical regime where generalization improves with training. This has important practical implications: lightweight architectures may simultaneously offer superior accuracy and reduced inference costs for edge deployment.
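The effective-sample-size argument can be made concrete with a standard estimator based on the integrated autocorrelation time; the truncation rule (stop at the first non-positive lag) and the lag limit are illustrative choices, not the paper's method:

```python
import numpy as np

def effective_sample_size(x, max_lag=200):
    """Rough effective sample size under autocorrelation:
    N_eff ~ N / (1 + 2 * sum of positive sample autocorrelations)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    denom = np.dot(x, x)
    tau = 1.0
    for lag in range(1, min(max_lag, n - 1)):
        rho = np.dot(x[:-lag], x[lag:]) / denom
        if rho <= 0:          # truncate once correlations vanish
            break
        tau += 2.0 * rho
    return n / tau
```

Applied to an energy time series with strong autocorrelation, this estimator reports far fewer independent samples than the nominal count, which is the mechanism invoked above to explain why the larger ViT variant overfits.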
Prophet-inspired models demonstrated particular strength in capturing seasonal and periodic patterns, making them especially suitable for residential appliances with regular usage cycles. The interpretable decomposition into trend and seasonality components provides added value for energy auditing applications, echoing the emphasis on model transparency highlighted by Huzzat et al. [41].
5.2. Analysis of Model Failures
Several models produced near-zero R2 values or extremely high RMSE for specific loads (e.g., TimeGPT-inspired and VAE on the IMDELD pelletizer; VAE on the COmBED AHUs). Inspection of learning curves revealed that these failures are attributable to two distinct mechanisms. For the VAE, the KL divergence regularization term tends to dominate the reconstruction loss during early training, causing the decoder to collapse to predicting the population mean—a known pathology of VAE training when the weight of the KL term is not carefully annealed. For the TimeGPT-inspired model, causal attention restricts the model to using only past context, which proves insufficient for loads whose state at time t depends on future scheduled events (e.g., centrally controlled HVAC). These are not instances of training instability per se, but rather fundamental architectural mismatches with specific load types. We did not observe sensitivity to random seeds or learning rate for the successful model–load pairings.
5.3. Practitioner Guidance: When to Prefer Simple Baselines
The finding that a Mean baseline outperforms all complex models for Socket I is instructive for practitioners. We propose the following heuristic criteria for identifying loads where simple baselines are likely preferable: (i) low Pearson correlation between the target load and the aggregate signal; (ii) high coefficient of variation of the target load; (iii) absence of repeatable temporal patterns (no significant peaks in the autocorrelation function beyond lag 0). When these conditions are met, the target load contributes negligibly to the aggregate signal and exhibits high stochastic variability, making it indistinguishable from noise for any learned model. In such cases, practitioners should either accept the Mean baseline or consider sub-metering the load directly.
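The three heuristic criteria can be packaged into a screening function. The numeric cutoffs below (0.1 for correlation, 1.0 for coefficient of variation, 0.2 for autocorrelation peaks) are illustrative placeholders, since the paper's exact thresholds are not reproduced in this excerpt:

```python
import numpy as np

def prefer_mean_baseline(load, aggregate,
                         corr_thresh=0.1, cv_thresh=1.0,
                         acf_thresh=0.2, max_lag=100):
    """Flag loads where a Mean baseline is likely competitive.

    Thresholds are illustrative placeholders, not the paper's values.
    """
    load = np.asarray(load, dtype=float)
    aggregate = np.asarray(aggregate, dtype=float)

    # (i) Low Pearson correlation with the aggregate signal.
    corr = np.corrcoef(load, aggregate)[0, 1]

    # (ii) High coefficient of variation of the target load.
    cv = load.std() / max(load.mean(), 1e-9)

    # (iii) No significant autocorrelation peaks beyond lag 0.
    c = load - load.mean()
    denom = np.dot(c, c)
    acf = np.array([np.dot(c[:-k], c[k:]) / denom
                    for k in range(1, max_lag)])
    has_structure = bool(np.any(np.abs(acf) > acf_thresh))

    return (abs(corr) < corr_thresh) and (cv > cv_thresh) and not has_structure
```

A load passing all three checks is, for practical purposes, noise relative to the aggregate, and sub-metering it directly is likely a better investment than any disaggregation model.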
5.4. Building Type Considerations
Commercial buildings present unique challenges compared to residential environments. The high correlation among AHU units (exceeding 0.90 in some cases) suggests coordinated operation typical of centralized building management systems, creating difficulties in differentiating individual units. However, this correlation also means that accurate disaggregation of any single AHU provides substantial insight into overall HVAC consumption. A promising direction for addressing correlated commercial loads is the adoption of multi-output or multi-task learning frameworks that jointly model all correlated loads simultaneously, thereby leveraging inter-unit dependencies as an informative signal rather than treating them as a confound. Integration of external signals from building management systems (e.g., control setpoints, occupancy schedules) could further improve disambiguation.
Industrial environments showed the highest predictability for high-power machinery, with R2 values exceeding 0.90 for pelletizers and exhaust fans. The 1-s sampling rate of IMDELD, combined with distinct operational signatures of industrial equipment, facilitated accurate disaggregation. However, as noted in the Limitations, the contribution of sampling rate versus intrinsic load characteristics cannot be fully disentangled in this study.
5.5. Signal-to-Noise Ratio and Pearson Correlation Effects
The observed performance disparities across appliances can be fundamentally attributed to the signal-to-noise ratio (SNR) inherent in each disaggregation task. Loads exhibiting low Pearson correlation with the aggregate signal, such as the refrigerator (0.097) in the residential dataset, present a challenging scenario in which the target signal is effectively masked by the superposition of other concurrent loads. The near-zero R2 values obtained for such appliances are not indicative of model failure but rather reflect an intrinsic limitation of the disaggregation problem when the target load contributes minimally to the aggregate measurement.
Conversely, high-correlation loads such as the heat pump (0.512) and the pelletizers provide stronger discriminative signals that enable models to achieve R2 values exceeding 0.85. This correlation–performance relationship underscores the importance of considering load contribution ratios when selecting target appliances for NILM deployment, as energy-intensive loads inherently offer more favorable disaggregation conditions regardless of the chosen algorithm.
5.6. Random Forest Superiority for Finite-State Loads
The consistent outperformance of RF over deep learning architectures for finite-state appliances such as heat pumps and AHU systems can be explained through the lens of decision boundary complexity. RF constructs an ensemble of decision trees that naturally partition the feature space into discrete regions corresponding to appliance operational states (ON, OFF, intermediate levels). This architectural characteristic aligns well with the nature of finite-state loads, where power consumption transitions between well-defined levels rather than varying continuously. Furthermore, the ensemble approach handles non-linear relationships between aggregate power measurements and individual appliance states without requiring the extensive temporal context that recurrent or attention-based architectures demand.
The 200-tree configuration with a maximum depth of 30 levels provides sufficient representational capacity to capture state transitions, while the minimum sample constraints (5 per split and leaf) prevent overfitting to noise in the training data. These results are consistent with previous findings in the literature demonstrating that traditional machine learning methods can outperform deep learning when the underlying problem structure matches the model’s inductive biases.
5.7. Model Complexity and Overfitting Trade-Offs
The counterintuitive finding that ViT-small (23,425 parameters) frequently outperformed ViT-large (1,252,481 parameters) reveals important insights about the relationship between model capacity and NILM task requirements. Deep learning models with excessive parameterization relative to the available training data are prone to memorizing training patterns rather than learning generalizable representations.
The NILM datasets employed, despite spanning months to years of measurements, exhibit substantial temporal autocorrelation that effectively reduces the number of independent training examples. Additionally, the fundamental patterns in appliance signatures—characterized by state transitions and power levels—may not require the extensive hierarchical feature extraction that larger transformer architectures provide. The ViT-small configuration appears to strike an optimal balance, offering sufficient attention capacity to capture relevant temporal dependencies without the regularization challenges posed by overparameterized networks. This finding carries practical implications for edge deployment, where lightweight architectures may simultaneously offer superior accuracy and reduced inference costs.
5.8. Challenges of Overlapping Signatures in Commercial Environments
The elevated Pearson correlation coefficients observed among AHU units (exceeding 0.90) exemplify a fundamental challenge in commercial building NILM: the presence of multiple similar or identical loads operating under coordinated control strategies. Building management systems typically activate HVAC equipment in synchronized patterns based on occupancy schedules and thermal comfort requirements, resulting in highly correlated consumption profiles that confound traditional disaggregation approaches. Unlike residential environments, in which appliance diversity naturally provides discriminative signatures, commercial buildings frequently deploy multiple units of identical equipment whose individual contributions become statistically indistinguishable.
The elevator load, with a low correlation of 0.251 and consistently poor R2 values across all models, represents an extreme case in which irregular, event-driven operation combined with brief duty cycles creates insufficient statistical structure for reliable disaggregation. These findings suggest that commercial NILM implementations may require complementary approaches—such as sub-metering at distribution panels or incorporation of building automation system data—to resolve ambiguities among similar loads.
5.9. Sampling Rate Impact on Disaggregation Performance
The strong performance achieved in the industrial environment (R2 exceeding 0.90 for pelletizers and exhaust fans) can be substantially attributed to the 1-s sampling rate of the IMDELD dataset, compared to the 30-s and 1-min intervals in the commercial and residential datasets, respectively. Higher sampling frequencies capture transient signatures—the distinctive power fluctuations during appliance startup, shutdown, and state transitions—which provide rich discriminative information beyond steady-state power levels.
Industrial machinery, characterized by high power ratings and pronounced operational signatures, generates transient events with amplitudes that significantly exceed measurement noise, enabling robust detection even in complex multi-load scenarios. The finer temporal resolution also facilitates the identification of duty cycling patterns and variable-speed drive characteristics that would be aliased or averaged out at lower sampling rates.
This observation has important implications for NILM system design: strategic investment in higher-frequency metering infrastructure may yield substantially improved disaggregation accuracy, particularly for loads whose signatures are predominantly transient in nature. The trade-off between data acquisition costs, storage requirements, and disaggregation performance merits careful consideration in practical deployments.
5.10. Summary: Model Selection Guidelines
Table 7 synthesizes the experimental findings into a practitioner-oriented model selection guide, mapping load type and environment to the recommended architecture.
6. Conclusions
This study presented a systematic, cross-domain evaluation of five AI-based architectures for NILM across residential, commercial, and industrial environments. Beyond reporting per-model accuracy figures, the work yields three principal insights that advance our understanding of the NILM problem.
First, the results provide empirical evidence that the interaction between model architecture and load-level characteristics is a stronger determinant of disaggregation accuracy than either factor in isolation. RF excels on finite-state loads because its decision-boundary structure aligns with discrete power-level transitions; ViT-small excels on complex industrial loads because its attention mechanism captures transient temporal dependencies; and the Prophet-inspired model excels on seasonal residential loads because its decomposition architecture matches the generative structure of periodic consumption. This architecture–load interaction perspective offers a more nuanced and actionable framework than the prevailing “which model is best?” paradigm.
Second, the study demonstrates that model complexity is not monotonically related to performance in the NILM domain. The consistent superiority of ViT-small over ViT-large can be attributed to the reduced effective sample size caused by temporal autocorrelation in energy time series, which pushes overparameterized models into a memorization regime. This finding carries practical significance for edge deployment, where lightweight models are preferable.
Third, the analysis of signal-to-noise ratio and Pearson correlation reveals fundamental identifiability limits: loads contributing less than approximately 10% to the aggregate signal, with correspondingly low Pearson correlation, present disaggregation challenges that no model architecture can overcome, as the target signal is effectively masked by superimposed loads. This insight reframes certain “model failures” as problem-intrinsic limitations and provides practitioners with quantitative criteria for assessing NILM feasibility prior to deployment.
These findings support the use of NILM as a practical tool for energy monitoring in buildings, particularly when existing metering infrastructure is available and analysis is conducted offline. However, the characterization of NILM as “cost-effective” should be understood as conditional on these assumptions and on targeting high-impact loads where disaggregation is technically feasible. The high inter-unit correlations observed among AHU systems (exceeding 0.90) highlight the limitations of purely algorithmic approaches in commercial environments with coordinated building management, suggesting that hybrid solutions integrating NILM with building automation data may be necessary for comprehensive disaggregation in complex facilities.
Future Work
Several promising research directions emerge from this study’s findings.
The development of adaptive sampling strategies that dynamically adjust acquisition frequency based on detected load events could optimize the trade-off between data storage requirements and disaggregation accuracy.
Investigating semi-supervised and self-supervised learning paradigms may address the persistent challenge of obtaining labeled ground-truth data in commercial environments where sub-metering installation is impractical.
The integration of contextual information from building management systems, occupancy sensors, and weather data as auxiliary inputs could enhance model performance for highly correlated loads such as HVAC systems operating under coordinated control strategies.
Exploring lightweight Neural Architecture Search (NAS) techniques specifically tailored for NILM applications could yield optimized model configurations that balance accuracy with computational efficiency for resource-constrained edge devices.
Model interpretability: Leveraging feature importance analysis (RF) and attention map visualization (ViT, TimeGPT-inspired) to provide diagnostic insights to facility managers about which temporal patterns drive specific appliance predictions would improve trust and facilitate practical adoption of NILM systems.
A rigorous computational cost analysis—benchmarking training time, inference latency, and memory usage across models on standardized hardware—is needed to complement the accuracy-focused evaluation presented here.
Longitudinal studies examining model degradation over time due to appliance replacement, seasonal variations, and changes in occupancy patterns would provide valuable insights for developing robust, self-adaptive NILM systems suitable for long-term deployment in real-world facilities.