1. Introduction
The global energy landscape is undergoing a profound transformation driven by the dual pressures of escalating demand and the urgent need for environmental sustainability. Over the past two decades, industrialization, urbanization, and the proliferation of electrical appliances have led to an unprecedented surge in energy consumption worldwide [1,2]. Buildings constitute a major share of total energy consumption in urban environments, with estimates varying by region and accounting methodology: the International Energy Agency reports that buildings account for approximately 30% of global final energy use, while in certain highly urbanized regions this share can exceed 40% when considering both direct and indirect consumption [3,4]. This substantial contribution underscores the critical importance of developing effective energy management strategies that can mitigate environmental impact while meeting the growing needs of modern society. In this context, energy efficiency has emerged as a cornerstone of sustainable development to reduce greenhouse gas emissions without compromising economic growth or quality of life [2]. The adoption of smart meters and smart grids has created new opportunities for monitoring and optimizing energy consumption, enabling real-time feedback to consumers and utility providers alike [5,6].
Achieving effective energy management requires understanding and monitoring energy consumption at a granular level, which can be accomplished through various methods, most notably energy disaggregation. Energy disaggregation is the process of decomposing aggregate energy consumption into the contributions of individual appliances or systems [2,7]. By providing detailed insights into how energy is distributed across different loads, disaggregation enables consumers, facility managers, and utility providers to identify potential failures and inefficiencies, optimize usage patterns, and implement targeted conservation measures. Research has suggested that access to appliance-level consumption data, when combined with appropriate behavioral interventions and sustained engagement, can lead to energy savings in the range of 5–15% in household settings, though the magnitude of these savings depends on study duration, intervention type, and occupant engagement [8]. Furthermore, disaggregated energy data supports demand-side management strategies, enabling utilities to implement dynamic pricing schemes and load-balancing mechanisms that benefit both consumers and grid operators [9].
Energy disaggregation can be accomplished through two fundamentally different approaches: Intrusive Load Monitoring (ILM) and Non-Intrusive Load Monitoring (NILM). ILM involves the deployment of dedicated sensing hardware at each appliance or circuit, providing highly accurate measurements of individual energy consumption [2,10]. Although this approach offers high accuracy and reliability, its practical application is limited by high costs, installation complexity, and maintenance requirements, especially in large-scale deployments or in existing buildings where retrofitting is difficult [2,11]. The need for physical access to each appliance and the ongoing maintenance of multiple sensors make ILM impractical for most residential and commercial applications. In contrast, NILM offers a cost-effective alternative by leveraging aggregate energy data collected from a single metering point to infer the consumption of each individual appliance [7,12]. First proposed by George Hart in 1985 at the Massachusetts Institute of Technology, NILM eliminates the need for extensive sensor networks, making it an attractive solution for widespread adoption in both residential and commercial settings [7,12]. The non-intrusive nature of this approach means that it can be deployed without modifying existing electrical installations, significantly reducing implementation barriers.
NILM algorithms comprise different approaches, which can be classified according to different criteria. From a sampling frequency perspective, NILM approaches are categorized as either high-frequency or low-frequency methods [2,5]. High-frequency NILM systems operate at sampling rates exceeding 1 kHz, capturing detailed electrical signatures such as harmonic content, transient waveforms, and current-voltage trajectories that enable precise appliance identification [2,13]. However, these approaches require sophisticated and expensive data acquisition hardware, limiting their scalability for mass-market applications. Low-frequency NILM methods, conversely, operate at conventional smart meter rates (typically ≤1 Hz) and rely on features such as active and reactive power, steady-state consumption levels, and temporal usage patterns [2,14]. While low-frequency approaches may sacrifice some discriminative power, they offer significant advantages in terms of cost-effectiveness and compatibility with existing smart meter infrastructure, making them particularly suitable for large-scale deployment scenarios.
Additionally, NILM methodologies can be distinguished as event-based or non-event-based approaches [2,15]. Event-based methods detect discrete changes in energy consumption patterns, such as the switching on or off of appliances, and use these transition events to identify individual loads [2,16]. These methods typically achieve high accuracy for appliances with distinct on/off signatures but may struggle with continuously variable loads. Non-event-based approaches, in contrast, analyze the continuous aggregate signal to extract features indicative of different appliances without explicitly identifying switching events [2]. Both paradigms have demonstrated effectiveness across various scenarios, with the choice often depending on the specific application requirements and available computational resources.
Despite significant advances in NILM research, several persistent challenges limit its widespread practical adoption. A fundamental limitation is the reliance on extensive and often specialized datasets for model training, which can restrict the transferability of models across different settings, appliance types, and geographical regions [8,14,17,18]. The availability of high-quality labeled data remains a critical bottleneck, as most existing datasets focus on residential environments in specific countries [19,20,21]. Furthermore, many state-of-the-art NILM systems impose substantial computational demands due to their reliance on complex neural network architectures, limiting their practical deployment on embedded or edge devices with constrained resources [5,22,23]. The reproducibility and comparability of NILM research also remain problematic due to the lack of efficient benchmarking frameworks with standardized evaluation procedures [5,24,25]. Moreover, the vast majority of NILM research has concentrated on residential environments, leaving commercial and industrial settings largely unexplored despite their substantial energy footprints and unique consumption characteristics [2,3,26].
Commercial and industrial buildings present distinct challenges compared to residential settings, including the prevalence of similar or identical loads such as multiple air conditioning units, lighting systems, and industrial machinery [3]. The energy consumption patterns in these environments differ significantly due to factors such as centralized heating, ventilation, and air conditioning (HVAC) systems, consistent operational schedules, and higher load densities [3,27]. Recent case studies have revealed that NILM has largely failed to disaggregate loads effectively in commercial buildings due to the overlapping signatures of similar appliances [3]. This gap underscores the need for methodological adaptations tailored to the specific characteristics of non-residential environments and comprehensive evaluations across diverse building types.
Despite this growing body of work, three specific gaps remain insufficiently addressed. First, the majority of NILM studies evaluate a single model architecture within a single domain (typically residential), making it difficult to draw generalizable conclusions about which model families are best suited for which load types. Second, emerging architectures inspired by foundation models, such as GPT-style causal transformers and Prophet-style temporal decomposition, have not been systematically evaluated for NILM or compared against classical alternatives under controlled conditions. Third, while prior transformer-based NILM studies (e.g., GRU-BERT [28], OPT-NILM [22]) have demonstrated attention-based improvements for residential loads, their applicability to commercial and industrial environments with fundamentally different load characteristics (high inter-unit correlation, coordinated HVAC schedules, high-power transient machinery) remains unexplored.
This work addresses these gaps through the following contributions:
1. A systematic, cross-domain evaluation framework that compares five model architectures, namely Vision Transformer (ViT), Variational Autoencoder (VAE), Random Forest (RF), and custom architectures inspired by TimeGPT and Prophet, across residential, commercial, and industrial environments under standardized experimental conditions. Unlike prior comparative NILM studies that typically focus on a single building type, this tri-sector analysis enables the identification of architecture–load type interactions that are not observable in single-domain evaluations.
2. An empirical analysis of how load-level characteristics, including signal-to-noise ratio, Pearson correlation with aggregate consumption, operational state complexity, and sampling frequency, mediate disaggregation performance across model architectures, providing evidence-based criteria for model–load pairing.
3. The introduction and evaluation of TimeGPT-inspired and Prophet-inspired architectures for NILM, which adapt causal attention and temporal decomposition paradigms from the time series forecasting literature to the energy disaggregation task, together with an assessment of their strengths relative to established NILM approaches.
4. Evidence-based, practitioner-oriented guidelines for algorithm selection based on load characteristics, dataset properties, and operational constraints, facilitating informed decision-making for deploying NILM solutions in heterogeneous building environments.
It is important to note that all experiments in this study are conducted offline on publicly available benchmark datasets; no online deployment or real-time inference constraints are evaluated. The conclusions are therefore valid under these specific data and sampling conditions.
This paper is organized as follows: Section 2 reviews related work in NILM, covering the evolution of algorithmic approaches from traditional machine learning to modern deep learning architectures. Section 3 presents the methodology, encompassing the problem description, input and output variable definitions, the cost function formulation, and practical recommendations along with the limitations of the proposed approach. Section 4 describes the experimental results, detailing the evaluation metrics employed and the performance analysis across different scenarios and appliance categories. Finally, Section 5 discusses the findings in relation to prior literature, and Section 6 concludes the paper with a summary of the key findings and outlines directions for future work.
2. Related Work
The field of NILM has undergone a significant transformation with the advent of artificial intelligence, particularly deep learning techniques, which have substantially advanced the capabilities and performance of disaggregation algorithms [5,8,14]. Recent advances in Neural Network (NN) architectures, computational hardware, and training methodologies have enabled solutions that were previously unattainable. Deep learning (DL) for NILM was first introduced in 2015 and quickly demonstrated substantial improvements in disaggregation performance and generalization capability compared to conventional approaches [5,22]. Since then, the adoption of Deep Neural Networks (DNNs) in NILM has grown rapidly.
Among the most influential deep learning architectures applied to NILM are Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Long Short-Term Memory (LSTM) networks [8,23,29]. CNNs have proven particularly effective for extracting spatial and temporal features from energy consumption patterns, while RNNs and LSTMs excel at modeling sequential dependencies and capturing long-term temporal dynamics [8,14]. The sequence-to-point learning paradigm, which maps a window of aggregate consumption to a single appliance output, has emerged as a dominant framework, demonstrating state-of-the-art performance across multiple benchmark datasets [8,30].
More recently, transformer-based architectures, originally developed for Natural Language Processing (NLP), have been successfully adapted for NILM applications [31,32,33]. The self-attention mechanism at the core of transformers enables models to capture global dependencies between aggregate and appliance-level signals, offering advantages over traditional recurrent approaches that process sequences sequentially [31,32,34]. These transformer-based NILM models have achieved impressive results, with F1-scores exceeding 92%, and have demonstrated effectiveness in accurately diagnosing and estimating the energy consumption of individual home appliances [31]. The integration of attention mechanisms, temporal pooling, and residual connections has further enhanced model performance, establishing transformers as a promising direction for future NILM research [31,35].
The state-of-the-art in NILM encompasses diverse methodological contributions that address various aspects of the disaggregation problem. The multi-NILM framework proposed by Nalmpantis and Vrakas [23] tackles NILM as a multi-label classification problem, employing dimensionality reduction techniques such as Signal2Vec to develop privacy-preserving and cost-effective solutions suitable for deployment on embedded devices [36]. Transfer learning approaches have emerged as a powerful strategy for enhancing the adaptability of NILM models across different datasets, appliances, and geographical regions [8,37,38]. D’Incecco et al. [8] demonstrated that appliance transfer learning and cross-domain transfer learning can significantly improve model generalization, with findings showing that only the fully connected layers require fine-tuning for effective knowledge transfer.
Advanced feature extraction techniques have also contributed substantially to improving NILM accuracy. The Two-Stream Convolutional Neural Networks (TSCNN) proposed by Chen et al. [13] leverage both temporal and spectral load signatures to provide rich distinguishing features for appliance recognition. The Gramian Angular Field (GAF) representation and affinity propagation clustering strategies have been employed to mitigate the negative impact of intra-class variety caused by multi-state loads [13]. Furthermore, the Mother–Son Model (MSM) introduced by Mei et al. [39] addresses challenges related to missing data and limited transfer capabilities through a novel feature inheritance mechanism, enabling dynamic decomposition of multiple appliances, including unknown devices.
Research has also addressed the practical deployment challenges of NILM systems. Studies such as OPT-NILM [22] propose cost-effective pruning approaches that can reduce the number of trainable parameters by up to 95% with minimal performance loss, making NILM more feasible for edge deployment. The DeepEdge-NILM case study [3] demonstrated the viability of NILM edge devices for monitoring multiple air conditioners in commercial buildings, highlighting the potential for real-world implementation beyond laboratory settings. Additionally, trainingless approaches based on multi-objective evolutionary computing [9] and sparse coding techniques [40] have extended NILM methodology to scenarios where extensive training data or plug-level sensing are unavailable.
Recent comprehensive surveys have systematically categorized the diverse algorithmic approaches addressing these challenges, ranging from traditional machine learning methods to modern deep learning architectures, including CNNs, RNNs, autoencoders, and transformer-based models, each exhibiting distinct strengths and limitations depending on the application context [41].
Despite the breadth of these contributions, several critical observations emerge from this literature analysis. First, the vast majority of studies evaluate individual architectures on residential datasets, making cross-architecture and cross-domain comparisons difficult due to differences in preprocessing, metrics, and train/test protocols. Second, while transformer-based approaches such as GRU-BERT [28] and pruned models like OPT-NILM [22] have shown promising results, they have been validated primarily on residential benchmarks (e.g., REDD, UK-DALE); their behavior on commercial loads with high inter-unit correlation, or on industrial loads with high-power transient signatures, remains uncharacterized. Third, temporal-decomposition paradigms inspired by Prophet and causal-attention paradigms inspired by GPT have not been adapted or evaluated for NILM, despite their demonstrated effectiveness in related time series forecasting tasks. The present study addresses these gaps by providing a controlled, multi-architecture comparison across three building domains under a unified experimental protocol.
3. Methodology
This section presents the methodological framework developed for NILM, encompassing the formal problem description, the definition of input and output variables, the cost function formulation employed for model optimization, and the description of the implemented algorithms. The methodology integrates multiple ML and DL approaches to address the energy disaggregation challenge across different building environments.
3.1. Problem Description
From a mathematical perspective, the NILM problem can be formulated as a blind source separation task. The aggregate power consumption measured at time $t$ is expressed as the superposition of individual appliance consumptions:

$$P(t) = \sum_{i=1}^{N} p_i(t) + \varepsilon(t),$$

where $P(t)$ denotes the total aggregate power consumption at time $t$, $p_i(t)$ represents the power consumption of the $i$-th appliance, $N$ corresponds to the total number of appliances in the monitored environment, and $\varepsilon(t)$ accounts for the noise term, which encompasses measurement errors, base loads, and contributions from unknown or unmonitored devices.
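The additive formulation can be illustrated with a small synthetic example; the appliance profiles, cycle lengths, and noise level below are illustrative and are not drawn from any dataset used in this study.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 120  # number of 1-minute samples (illustrative horizon)

# Two synthetic appliance profiles: a cycling fridge and a one-shot oven.
fridge = 100.0 * (np.arange(T) % 30 < 15).astype(float)  # 15-min on/off cycle
oven = np.zeros(T)
oven[40:70] = 2000.0                                     # single 30-min activation

noise = rng.normal(0.0, 5.0, size=T)                     # zero-mean measurement noise

# Aggregate signal: P(t) = sum_i p_i(t) + eps(t)
aggregate = fridge + oven + noise
```

The disaggregation task is to recover `fridge` and `oven` given only `aggregate`, which is underdetermined unless the appliance signatures are distinctive.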
This formulation rests on several key assumptions. First, the noise term is assumed to be zero-mean and stationary over the observation window, i.e., $\mathbb{E}[\varepsilon(t)] = 0$ and $\operatorname{Var}[\varepsilon(t)] = \sigma_\varepsilon^2$ for all $t$. Second, individual appliance signals are assumed to be mutually independent conditional on external factors (time of day, occupancy), which is a simplification that may be violated when centralized building management systems coordinate multiple loads (e.g., AHU units). Third, the problem is inherently ill-posed and underdetermined since $N$ unknowns must be recovered from a single measurement; identifiability therefore relies on the assumption that each appliance possesses a sufficiently distinctive consumption signature—characterized by power levels, state-transition patterns, or temporal periodicity—such that supervised models can learn discriminative mappings from labeled examples. When these conditions are weakly satisfied (e.g., low signal-to-noise ratio, high inter-appliance correlation), disaggregation accuracy is expected to degrade regardless of model complexity, as confirmed by our experimental results.
The complexity of the NILM problem stems from several intrinsic characteristics of electrical consumption patterns. First, power signals exhibit non-linear behavior, with appliances displaying various operational states, transient responses, and consumption profiles that do not follow simple additive relationships. Second, multiple appliances may operate simultaneously with overlapping signatures, making it challenging to distinguish individual contributions from the aggregate signal. Third, the same appliance type may exhibit different consumption patterns depending on operating conditions, user behavior, and environmental factors, introducing intra-class variability that complicates recognition tasks.
In the context of supervised learning approaches, the NILM problem can be reformulated as learning a function $f$ that maps the aggregate consumption to individual appliance consumptions:

$$\hat{p}_i(t) = f_i(\mathbf{x}_t),$$

where $\hat{p}_i(t)$ is the predicted consumption of appliance $i$ at time $t$, $\mathbf{x}_t = \big[P(t - \lfloor w/2 \rfloor), \ldots, P(t + \lfloor w/2 \rfloor)\big]$ represents a temporal window of aggregate consumption centered at time $t$ with width $w$, and $f_i$ is the learned disaggregation function for appliance $i$. This formulation, known as sequence-to-point learning, has demonstrated state-of-the-art performance in recent NILM literature, as the temporal context surrounding a measurement point provides valuable information for accurate prediction [8,30].
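The sequence-to-point windowing can be sketched as follows; the function name and the toy signal are illustrative, not part of the study's codebase.

```python
import numpy as np

def seq2point_windows(aggregate, window):
    """Build centered sliding windows and their midpoint target indices.

    Each window of odd width `window` is paired with the index of its
    midpoint, mirroring the sequence-to-point mapping p_hat(t) = f(x_t).
    """
    half = window // 2
    X, mid_idx = [], []
    for t in range(half, len(aggregate) - half):
        X.append(aggregate[t - half : t + half + 1])
        mid_idx.append(t)
    return np.asarray(X), np.asarray(mid_idx)

signal = np.arange(10, dtype=float)  # toy aggregate series
X, idx = seq2point_windows(signal, window=5)
# The first window covers samples 0..4 and predicts the target at t = 2.
```

Each input window yields exactly one prediction at its center, which is why sequence-to-point avoids the averaging of overlapping outputs inherent in sequence-to-sequence methods.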
3.2. Input Variables
The input to the NILM algorithm consists of the aggregate power consumption signal acquired from a single metering point. In this work, the primary input variable is the active power measurement $P(t)$, sampled at regular intervals determined by the dataset’s temporal resolution. The input signal is processed as a time series that captures the cumulative electrical demand of all connected loads within the monitored environment.
For DL models, the input is structured as a sliding window of consecutive measurements:

$$\mathbf{x}_t = \big[P(t - \lfloor w/2 \rfloor), \ldots, P(t), \ldots, P(t + \lfloor w/2 \rfloor)\big], \quad t = 1, \ldots, T,$$

where $\mathbf{x}_t$ represents the input vector at time $t$, $T$ is the total sequence length, and $w$ defines the temporal context considered for prediction. This windowed approach enables models to capture temporal dependencies and contextual information that characterize appliance activation and deactivation patterns. In this study, the window length is set to $w = 480$ samples for all deep learning models, corresponding to 480 min (8 h) for AMPds, 240 min (4 h) for COmBED, and 480 s (8 min) for IMDELD. This value was selected to capture at least one full operational cycle of the target appliances while remaining computationally tractable. The sensitivity of the results to window size was not systematically explored and constitutes a limitation of the current study.
Prior to model training, the input data undergoes a standardization transformation to enhance learning efficiency:

$$\tilde{P}(t) = \frac{P(t) - \mu_{\text{train}}}{\sigma_{\text{train}}},$$

where $\mu_{\text{train}}$ and $\sigma_{\text{train}}$ represent the mean and standard deviation computed exclusively on the training partition. These statistics are then applied to normalize both validation and test sets, ensuring no information leakage from future data into the training process. This standardization ensures that input features have zero mean and unit variance, facilitating gradient-based optimization and improving convergence properties.
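A minimal sketch of this leakage-free standardization, using made-up values:

```python
import numpy as np

def fit_standardizer(train):
    """Compute mean and std on the training partition ONLY."""
    return float(np.mean(train)), float(np.std(train))

def standardize(x, mu, sigma):
    """Apply training-set statistics to any partition without refitting."""
    return (x - mu) / sigma

train = np.array([100.0, 200.0, 300.0, 400.0])
test = np.array([250.0])

mu, sigma = fit_standardizer(train)    # statistics from training split only
z_test = standardize(test, mu, sigma)  # applied unchanged to the test split
```

Refitting the statistics on validation or test data would leak information about those partitions into the preprocessing pipeline, which is exactly what the training-only convention prevents.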
Additionally, energy consumption datasets often exhibit significant class imbalance, with appliances remaining inactive for extended periods. To mitigate this, we employ a combination of strategies: (i) oversampling of minority-class windows containing appliance activation events during training batch construction; and (ii) asymmetric weighting of the loss function that penalizes missed activations more heavily than false positives. Recent research has established criteria for assessing dataset suitability for data reduction techniques based on class imbalance, temporal structure, and feature redundancy, enabling computational cost reduction while maintaining predictive accuracy [42].
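The asymmetric loss weighting in strategy (ii) can be sketched as follows; the on-threshold and weight values are illustrative assumptions, not the values used in this study.

```python
import numpy as np

def asymmetric_weighted_mse(y_true, y_pred, on_threshold=10.0, on_weight=4.0):
    """MSE that up-weights samples where the appliance is active.

    `on_threshold` (watts) and `on_weight` are illustrative hyperparameters:
    errors during activation periods are penalized `on_weight` times more
    than errors during inactivity, discouraging missed activations.
    """
    weights = np.where(y_true > on_threshold, on_weight, 1.0)
    return float(np.mean(weights * (y_true - y_pred) ** 2))

y_true = np.array([0.0, 0.0, 1000.0])
y_pred = np.array([0.0, 0.0, 900.0])
loss = asymmetric_weighted_mse(y_true, y_pred)
```

With mostly-off appliances, an unweighted loss can be minimized by predicting zero everywhere; the asymmetric weights remove that degenerate optimum.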
For traditional ML approaches such as RF, the input is augmented with engineered features extracted from temporal windows, including statistical features (mean, median, standard deviation, variance, percentiles), change detection features (successive differences, total variation, zero-crossing rate), and spectral features derived from Fast Fourier Transform (FFT) analysis.
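The feature extraction for one window can be sketched as follows; the exact feature set and band boundaries are illustrative and need not match the study's implementation in detail.

```python
import numpy as np

def extract_window_features(window):
    """Statistical, change-detection, and FFT-band features for one window."""
    diffs = np.diff(window)
    spectrum = np.abs(np.fft.rfft(window - np.mean(window)))
    # Split spectral energy into four coarse frequency bands.
    bands = [float(np.sum(b ** 2)) for b in np.array_split(spectrum, 4)]
    features = [
        float(np.mean(window)), float(np.median(window)),      # central tendency
        float(np.std(window)), float(np.percentile(window, 90)),  # dispersion/shape
        float(np.sum(np.abs(diffs))),                           # total variation
        float(np.max(np.abs(diffs))),                           # maximum change
        int(np.sum(np.diff(np.sign(diffs)) != 0)),              # sign-change count
    ] + bands
    return np.array(features)

feat = extract_window_features(np.array([1.0, 2.0, 4.0, 2.0, 1.0, 2.0, 4.0, 2.0]))
```

The resulting fixed-length vector replaces the raw window as the RF input, which is what lets a tree ensemble handle variable-looking temporal patterns.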
3.3. Output Variables
The output of the NILM system corresponds to the estimated power consumption of individual target appliances. For each appliance
, where
M denotes the number of target appliances, the model generates a prediction:
where
represents the predicted power consumption of appliance
i at time
t, and
T is the total number of time steps in the evaluation period. The non-negativity constraint reflects the physical reality that power consumption cannot be negative.
In sequence-to-point architectures, the output corresponds to the predicted consumption at the midpoint of the input window:

$$\hat{p}_i(t) = f_{\theta_i}(\mathbf{x}_t),$$

where $f_{\theta_i}$ represents the neural network parameterized by $\theta_i$ for appliance $i$. This approach generates a single prediction for each input window, avoiding the averaging effects inherent in sequence-to-sequence methods and providing more precise temporal localization of consumption events.
Post-processing operations are applied to ensure physically meaningful predictions:

$$\hat{p}_i(t) \leftarrow \min\!\big(\max(\hat{p}_i(t), 0), \, P_i^{\max}\big),$$

where $P_i^{\max}$ represents the maximum rated power of appliance $i$, constraining predictions within feasible operational bounds.
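A minimal sketch of this clipping step; the rated-power value is illustrative.

```python
import numpy as np

def postprocess(pred, p_max):
    """Clamp raw model outputs to the physically feasible range [0, p_max]."""
    return np.clip(pred, 0.0, p_max)

raw = np.array([-12.0, 150.0, 2600.0])
clean = postprocess(raw, p_max=2400.0)  # 2400 W is an illustrative rating
# clean == [0.0, 150.0, 2400.0]
```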
3.4. Cost Function
The training objective minimizes the discrepancy between predicted and actual appliance consumption values. The cost function quantifies this discrepancy and guides the optimization process through gradient-based parameter updates. The selection of an appropriate cost function significantly influences model behavior and generalization capabilities.
3.4.1. Primary Loss Functions
The Mean Squared Error (MSE) serves as the primary loss function for regression-based NILM models:

$$\mathcal{L}_{\text{MSE}} = \frac{1}{n} \sum_{t=1}^{n} \big(p_i(t) - \hat{p}_i(t)\big)^2,$$

where $p_i(t)$ denotes the ground-truth consumption of appliance $i$ at time $t$, $\hat{p}_i(t)$ represents the corresponding prediction, and $n$ is the number of training samples. The MSE penalizes larger errors more heavily due to the quadratic term, making it sensitive to outliers and encouraging the model to minimize substantial prediction errors.

The Mean Absolute Error (MAE) provides an alternative loss formulation that exhibits greater robustness to outliers:

$$\mathcal{L}_{\text{MAE}} = \frac{1}{n} \sum_{t=1}^{n} \big|p_i(t) - \hat{p}_i(t)\big|.$$

The linear penalty structure of MAE treats all error magnitudes proportionally, making it more suitable for datasets with significant noise or occasional measurement anomalies.
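The differing outlier sensitivity of the two losses can be verified with a small numerical example:

```python
import numpy as np

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

def mae(y, y_hat):
    return float(np.mean(np.abs(y - y_hat)))

y = np.zeros(10)
y_hat = np.zeros(10)
y_hat[0] = 100.0          # a single large outlier error

# The quadratic penalty makes MSE react far more strongly than MAE.
mse_val = mse(y, y_hat)   # 1000.0
mae_val = mae(y, y_hat)   # 10.0
```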
3.4.2. Variational Autoencoder Loss
For the VAE implementation, the loss function combines reconstruction accuracy with regularization of the latent space distribution:

$$\mathcal{L}_{\text{VAE}} = \mathcal{L}_{\text{rec}} + \beta \, D_{\text{KL}},$$

where $\mathcal{L}_{\text{rec}}$ represents the reconstruction loss, and the Kullback-Leibler (KL) divergence term is defined as:

$$D_{\text{KL}} = -\frac{1}{2} \sum_{j=1}^{d} \big(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\big),$$

where $\mu_j$ and $\sigma_j$ are the mean and standard deviation of the latent distribution, $d$ is the latent space dimension, and $\beta$ is a weighting coefficient that balances reconstruction fidelity against latent space regularization.
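The closed-form KL term can be sketched as follows; the log-variance parameterization is a common implementation convention assumed here, not a detail stated in the paper.

```python
import numpy as np

def kl_divergence(mu, log_var):
    """Closed-form KL between N(mu, diag(sigma^2)) and the standard normal prior."""
    return float(-0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var)))

def vae_loss(recon_loss, mu, log_var, beta=1.0):
    """Composite VAE objective: reconstruction plus beta-weighted KL."""
    return recon_loss + beta * kl_divergence(mu, log_var)

# KL is zero when the posterior already matches the prior N(0, I).
kl_at_prior = kl_divergence(np.zeros(4), np.zeros(4))
```

Setting `beta` above 1 pushes the latent posterior toward the prior at the cost of reconstruction fidelity, which is the trade-off the weighting coefficient controls.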
3.4.3. Optimization Procedure
All deep learning models are optimized using the Adam optimizer [43], with an initial learning rate selected based on preliminary experiments and consistent with common practice in the NILM literature [8]. L2 weight regularization is applied to prevent overfitting, given the high temporal autocorrelation in energy time series, which reduces the effective number of independent training samples. Training proceeds for a maximum of 100 epochs, with early stopping (patience = 10 epochs) based on validation loss. The choice of MSE as the primary loss for regression models and the composite VAE loss (Equation (10)) reflects a deliberate trade-off: MSE penalizes large errors more heavily, which is desirable for NILM, where missing high-power activation events are more consequential than small steady-state errors.
3.5. Vision Transformer
The ViT architecture [44], originally designed for image processing, was adapted for temporal signal processing in NILM applications. The implementation divides energy consumption time series into patches or discrete segments, analogous to how the original ViT processes images.
The architecture comprises three main components. Patch Embedding transforms the input time series into a sequence of patch embeddings, where each segment of size $P$ is linearly projected to a feature space of dimension $D$. Additionally, a classification token (CLS token) and learnable positional embeddings are incorporated to maintain temporal information. The number of patches is calculated as

$$N_{\text{patches}} = \left\lfloor \frac{w}{P} \right\rfloor,$$

where $w$ is the input sequence length.
Transformer Blocks implement encoder blocks including Multi-Head Self-Attention (MHSA), Feedforward Neural Networks (FFNN) with Gaussian Error Linear Unit (GELU) activation, residual connections, and layer normalization. The Multilayer Perceptron (MLP) Head processes the CLS token to generate final consumption predictions per device.
Two configurations were implemented: ViT-small (3 transformer blocks, 23,425 parameters) and ViT-large (12 transformer blocks, 1,252,481 parameters). The overall ViT-NILM architecture is illustrated in Figure 1.
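The patch construction can be sketched as follows, assuming a 480-sample input window (as in the AMPds minute-resolution configuration) and an illustrative patch size of 16; neither the patch size nor the function name is taken from the paper.

```python
import numpy as np

def patchify(x, patch_size):
    """Split a 1-D window into non-overlapping patches (ViT-style tokens)."""
    n_patches = len(x) // patch_size
    return np.asarray(x[: n_patches * patch_size]).reshape(n_patches, patch_size)

window = np.arange(480, dtype=float)       # a 480-sample input window
patches = patchify(window, patch_size=16)  # 480 // 16 = 30 patch tokens
# Prepending a CLS token would give a transformer sequence of 31 tokens.
```

Each row of `patches` is then linearly projected to the embedding dimension before positional embeddings are added.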
The transformer-based approach for NILM has been further advanced through hybrid architectures that combine sequential models with attention mechanisms. Recent work has demonstrated that integrating Gated Recurrent Unit (GRU) networks with Bidirectional Encoder Representations from Transformers (BERT) style attention captures both short-term temporal dependencies and long-range contextual information, achieving improved disaggregation accuracy and robustness [28].
3.6. Variational Autoencoder
The VAE implementation uses an unsupervised learning approach for energy disaggregation, based on the model’s capacity to learn latent representations of energy consumption patterns. The encoder projects input data to a lower-dimensional latent space using 1D convolutional and dense layers, while the decoder reconstructs consumption predictions from this latent space. The implemented VAE-NILM architecture is shown in Figure 2.
The loss function combines reconstruction loss and KL divergence as described in Equation (11). The architecture includes 75,264 total parameters with a configurable latent dimension.
3.7. Random Forest
The RF algorithm for NILM disaggregation extracts multi-domain features from sliding temporal windows. Statistical features include measures of central tendency (mean, median), dispersion (standard deviation, variance), and distribution shape characteristics (percentiles). Change detection features capture temporal dynamics through the analysis of successive differences $\Delta P(t) = P(t) - P(t-1)$, including total variation, maximum change, and zero-crossing rate. The RF-NILM architecture is depicted in Figure 3.
Spectral features from FFT provide complementary frequency information, including total spectral energy, dominant frequency, and energy distribution across four frequency bands. The ensemble model comprises $N_T$ decision trees with a maximum depth of 30 levels, a minimum of 5 samples per split, and a minimum of 5 samples per leaf. These hyperparameters were selected through manual tuning on a held-out validation set, starting from a default configuration of 100 trees and progressively increasing complexity until validation performance plateaued. The final prediction for appliance $j$ is obtained by averaging over the ensemble:

$$\hat{p}_j = \frac{1}{N_T} \sum_{i=1}^{N_T} h_i(\mathbf{f}),$$

where $\mathbf{f}$ represents the extracted feature vector and $h_i(\mathbf{f})$ denotes the prediction from the $i$-th decision tree.
3.8. TimeGPT-Inspired Architecture
The TimeGPT-inspired model represents a custom architecture that draws on principles from Generative Pre-trained Transformer (GPT) language models, which are specifically adapted for energy consumption time series. Unlike the proprietary Nixtla TimeGPT service, this implementation is trained from scratch on NILM data and does not use pre-trained weights. The architecture includes temporal embedding with learnable positional encoding, temporal convolutions with different dilation rates 1, 2, 4 and 8 for multi-scale feature extraction, four transformer blocks with causal attention (represented as gray blocks in
Figure 4) to preserve the temporal ordering of the input sequence, and global average pooling followed by dense layers for point prediction. The causal attention mask—the key design element borrowed from GPT-style models—ensures that predictions at time
t depend only on past and present observations, unlike the bidirectional attention in ViT. The model comprises 854,656 parameters with 4 transformer blocks.
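The causal-masking idea can be illustrated with a minimal single-head attention sketch in NumPy; this is a simplified illustration (no learned query/key/value projections), not the paper's implementation:

```python
import numpy as np

def causal_mask(T):
    # Lower-triangular mask: position t may attend only to positions <= t.
    return np.tril(np.ones((T, T), dtype=bool))

def causal_self_attention(x, mask):
    """Single-head scaled dot-product attention over x of shape (T, d),
    using x itself as queries, keys, and values for simplicity."""
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)             # (T, T) similarity scores
    scores = np.where(mask, scores, -np.inf)  # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x
```

Because the mask zeroes out attention weights on future positions, perturbing a later timestep leaves all earlier outputs unchanged, which is exactly the property that distinguishes this design from the bidirectional attention used in ViT.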
3.9. Prophet-Inspired Architecture
The Prophet-inspired model adapts the temporal decomposition philosophy of Facebook’s Prophet—which decomposes time series into trend, seasonality, and holiday components—but implements it through a fundamentally different neural architecture rather than the additive regression model of the original Prophet. Specifically, seasonal features are extracted through sinusoidal functions at multiple frequencies (daily, weekly, and custom periods), and trend is modeled using linear and quadratic temporal features, consistent with the Prophet philosophy. However, unlike the original Prophet, sequential dependencies are captured through optional bidirectional LSTM layers, and a custom attention mechanism with batch normalization and residual connections is employed to learn adaptive weighting of seasonal and trend components. This hybrid design is motivated by the observation that appliance consumption patterns exhibit both deterministic periodicity (captured by the decomposition) and stochastic transient behavior (captured by the LSTM and attention components). The architecture includes 109,184 total parameters. The Prophet-NILM architecture is illustrated in
Figure 5.
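The seasonal and trend feature construction described above can be sketched as follows; the period values (in minutes, for minute-resolution data) and the use of a single Fourier pair per period are illustrative assumptions, as the original implementation may use more terms:

```python
import numpy as np

def seasonal_trend_features(t, periods=(1440.0, 10080.0)):
    """Sinusoidal seasonal features (daily = 1440 min, weekly = 10080 min)
    plus linear and quadratic trend terms. Illustrative sketch."""
    t = np.asarray(t, dtype=float)
    cols = []
    for p in periods:                     # one sin/cos pair per period
        cols.append(np.sin(2.0 * np.pi * t / p))
        cols.append(np.cos(2.0 * np.pi * t / p))
    span = max(t.max() - t.min(), 1e-9)
    t_norm = (t - t.min()) / span
    cols.append(t_norm)                   # linear trend
    cols.append(t_norm ** 2)              # quadratic trend
    return np.stack(cols, axis=1)
```

These deterministic features would then feed the LSTM and attention components described in the text, which handle the stochastic transient behavior the decomposition cannot capture.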
3.10. Datasets
Three benchmark datasets were employed to evaluate the proposed framework across different building types:
1. AMPds (Almanac of Minutely Power Dataset) [21]: A residential dataset containing minute-resolution measurements from a single Canadian household over two years (2012–2014), collected at Simon Fraser University (Burnaby, BC, Canada). Data acquisition was performed using a custom metering board based on a Raspberry Pi single-board computer combined with a Powerhouse Dynamics eMonitor whole-home energy monitor. The dataset is publicly distributed via the Harvard Dataverse repository. It includes 21 sub-metered loads representing typical residential appliances. For evaluation, three representative loads were selected: refrigerator (continuously variable consumption), heat pump (finite-state machine behavior), and electric oven (on/off behavior).
2. COmBED (Commercial Building Energy Dataset) [25]: A commercial dataset monitoring energy consumption in an office building at the Indian Institute of Technology Delhi (New Delhi, India) at 30-s resolution over one month. Power measurements were collected using Watts Up? Pro energy meters, and dataset formatting and preprocessing were performed using the open-source NILMTK v0.2 toolkit [5]. Monitored loads include five Air Handling Units (AHU), lighting systems, two socket circuits, and an elevator. This dataset most closely represents the target commercial building application.
3. IMDELD (Industrial Machines Dataset for Electrical Load Disaggregation) [45]: An industrial dataset with 1-s resolution collected over 111 days from a feed production plant in Minas Gerais, Brazil, by researchers at the Universidade Federal de Minas Gerais (UFMG, Belo Horizonte, Brazil). Electrical variables—including voltage, current, and active, reactive, and apparent power—were recorded using WEG CFW-11 variable-frequency drive monitoring units with calibrated current transducers at each sub-circuit. The dataset is publicly available via IEEE Dataport. It includes pelletizers, bipolar contactors, exhaust fans, and milling machines. Three representative loads were selected: Pelletizer I (finite states), Exhaust Fan I (on/off), and Bipolar Contactor I (continuously variable).
For each dataset, three representative loads were selected to span the range of operational behaviors encountered in practice: (i) a continuously variable load; (ii) a finite-state or quasi-periodic load; (iii) a sporadic or event-driven load. This selection strategy ensures that the evaluation covers loads with diverse signal-to-noise ratios, Pearson correlations with the aggregate signal, and duty cycle characteristics.
Table 1 summarizes the Pearson correlation between each selected load and the corresponding aggregate signal, which serves as a proxy for disaggregation difficulty.
Table 2 provides a consolidated summary of the hyperparameter configurations for all models to facilitate reproducibility.
3.11. Evaluation Metrics
Performance evaluation employed multiple metrics to capture different aspects of prediction accuracy:
Root Mean Squared Error (RMSE): $\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(\hat{y}_t - y_t\right)^2}$

Mean Absolute Error (MAE): $\mathrm{MAE} = \frac{1}{T}\sum_{t=1}^{T}\left|\hat{y}_t - y_t\right|$

where $y_t$ and $\hat{y}_t$ denote the actual and predicted appliance power at time $t$, and $T$ is the number of samples.
F1-Score: In the NILM context, the F1-score is computed by first converting continuous power predictions into binary ON/OFF classifications using a power threshold. Specifically, an appliance is considered ON at time $t$ if $\hat{y}_t > \tau$, where $\tau$ is set to 15 W for low-power appliances (e.g., lighting, sockets) and 50 W for high-power appliances (e.g., heat pumps, pelletizers), following thresholds commonly used in the NILM literature [20]. Precision, recall, and F1 are then computed on a per-sample basis:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
This metric captures the model’s ability to correctly detect appliance activation events, complementing the continuous regression metrics (RMSE, MAE, R2) with a state-detection perspective. For loads that are continuously ON (e.g., refrigerators, sockets with constant base load), the F1-score may approach 1.0 trivially and should be interpreted with caution.
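The threshold-based state detection can be made concrete with a short function; the 15 W and 50 W thresholds follow the text, while the function name is illustrative:

```python
import numpy as np

def nilm_f1(y_true, y_pred, threshold=50.0):
    """Event-detection F1: threshold continuous power into ON/OFF states.

    threshold: 15 W for low-power loads (lighting, sockets) or 50 W for
    high-power loads (heat pumps, pelletizers), as described in the text.
    """
    on_true = np.asarray(y_true, dtype=float) > threshold
    on_pred = np.asarray(y_pred, dtype=float) > threshold
    tp = int(np.sum(on_true & on_pred))
    fp = int(np.sum(~on_true & on_pred))
    fn = int(np.sum(on_true & ~on_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    return 2.0 * precision * recall / denom if denom else 0.0
```

Note that a load that never crosses the threshold in either signal yields an F1 of 0 by convention here, and a continuously ON load yields an F1 near 1 regardless of regression quality, echoing the caution above.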
Normalized Error in Assigned Power (NEP) [46]: $\mathrm{NEP}_j = \dfrac{\sum_{t=1}^{T}\left|\hat{y}_j(t) - y_j(t)\right|}{\sum_{t=1}^{T} y_j(t)}$
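Assuming the common definition of NEP used in the NILM literature (total absolute error normalized by the load's total true consumption), the metric reduces to a one-liner:

```python
import numpy as np

def nep(y_true, y_pred):
    """Normalized Error in Assigned Power: total absolute disaggregation
    error divided by the load's total true consumption (common definition)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.abs(y_true - y_pred).sum() / y_true.sum())
```

Unlike RMSE, this metric is scale-free, which makes it comparable across loads with very different power ratings.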
3.12. Limitations
Several limitations warrant explicit consideration when interpreting the findings of this study.
Dataset heterogeneity and generalizability. The three datasets differ substantially in temporal resolution (1 s, 30 s, 60 s), duration (1 month to 2 years), geographic origin, and number of monitored loads. These differences introduce a structural bias: the higher performance observed in the IMDELD industrial dataset may be attributable, at least in part, to its finer temporal resolution rather than to inherent properties of industrial loads. We do not attempt to control for this confound, and direct cross-dataset performance comparisons should be interpreted cautiously. Furthermore, all three datasets are publicly available benchmarks that may not fully represent the variability of real-world building environments, where factors such as sensor drift, missing data, and evolving occupancy patterns introduce additional challenges.
Limited load selection. Only three representative loads per dataset were evaluated. While these were selected to span a range of operational behaviors (continuous, finite-state, sporadic), the results may not generalize to all appliance types—particularly to loads with intermediate characteristics or to multi-state appliances with many operating modes.
Window size sensitivity. A single, fixed sequence length was used across all datasets and models. The optimal window size likely varies by appliance duty cycle, sampling rate, and model architecture. A systematic window-size ablation study would strengthen the conclusions but was not conducted due to computational constraints.
Absence of computational cost analysis. Although computational efficiency is relevant for practical NILM deployment, a detailed quantitative analysis of training time, inference latency, and memory requirements was not performed in this study. Such an analysis would require controlled benchmarking on standardized hardware and is identified as an important direction for future work.
Offline evaluation only. All experiments are conducted in an offline, batch-processing setting. Real-time deployment introduces additional constraints (streaming inference, concept drift, limited memory) that are not addressed here.
4. Results
This section presents the experimental results obtained from evaluating six disaggregation models in three different energy consumption environments, using five metrics widely adopted in the NILM literature: RMSE, MAE, F1-score, R2, and NDE. Results are organized by deployment context, beginning with the residential environment (AMPds dataset) featuring household appliances with diverse consumption patterns, followed by the industrial environment (IMDELD dataset) containing high-power electromechanical equipment, and concluding with the commercial environment (COmBED dataset) representing building-level loads including HVAC systems and auxiliary circuits.
For each environment, we analyze performance across appliances exhibiting different operational characteristics—continuous variable consumption, finite-state operation, and on/off switching behavior—to provide a comprehensive assessment of model suitability across load types. Best-performing values for each metric are highlighted in bold within the corresponding tables.
4.1. Residential Environment (AMPds)
Table 3 presents the complete results for the AMPds dataset across the three selected appliances.
For the refrigerator, which exhibits the lowest Pearson correlation (0.097) with aggregate consumption, Prophet emerged as the best option with an F1-score of 0.55 and the lowest RMSE (71.99). Time-series-based models showed an advantage due to their inherent capacity for modeling seasonal trends and periodic patterns. However, the near-zero R2 (0.06) confirms the inherent difficulty of detecting low relative consumption loads operating continuously with subtle variations.
The heat pump, with a Pearson correlation of 0.512 with the aggregate consumption, proved to be the most identifiable load. RF achieved outstanding performance with an F1-score of 0.97 and an R2 of 0.90, suggesting that loads with well-defined operational states and a high contribution to total consumption are ideal candidates for disaggregation using traditional machine learning techniques. The RF model accurately captures the ON/OFF state transitions characteristic of this finite-state load, as illustrated in
Figure 6.
The electric oven results revealed the greatest divergence from expectations. With an intermediate Pearson correlation of 0.182, only Prophet (F1-score = 0.50) and TimeGPT (F1-score = 0.41) achieved any detection capability. Most models, including RF and ViT variants, obtained F1-scores of 0.00, indicating complete failure in identification. This unexpected result suggests that Pearson correlation alone is not a sufficient predictor of NILM performance, with sporadic oven usage and potential confusion with other high-power loads complicating the disaggregation task. The TimeGPT-inspired model detects some activation events but struggles with the sporadic usage pattern, as shown in
Figure 7.
4.2. Industrial Environment (IMDELD)
Table 4 presents results for the industrial dataset.
For the pelletizer, ViT-based models demonstrated superior performance. ViT-small achieved the best overall RMSE (10,013.82), the lowest NDE (0.18), and an R2 of 0.94. ViT-large obtained comparable results with the best MAE (3748.17). Both ViT variants significantly outperformed traditional methods and baselines, while TimeGPT, VAE, and Mean failed completely, with R2 = 0 and RMSE exceeding 45,000.
Figure 8 illustrates the ViT-small disaggregation output for Pelletizer I, confirming accurate tracking of high-power state transitions at 1-s resolution.
For the exhaust fan, close competition was observed among the approaches. ViT-small led with the best RMSE (593.42), the lowest NDE (0.26), and the best R2 (0.88). Prophet showed practically equivalent performance with an RMSE of 640.00, the best MAE (180.68), and an excellent F1-score of 0.97. RF also achieved an F1-score of 0.97, indicating excellent on/off state detection capability.
For the bipolar contactor, near-equivalent performance was observed across methodologies. Prophet obtained the best RMSE (181.41) and the best NDE (0.22), with an R2 of 0.91. ViT-small followed closely with an RMSE of 187.41, the best MAE (108.63), and the same R2 of 0.91. Both models demonstrated strong capacity for modeling this low-power load.
4.3. Commercial Environment (COmBED)
Table 5 presents results for the Air Handling Units, representing the primary HVAC loads in commercial buildings.
Analysis of the four AHU units revealed distinctive behavior patterns, indicating different operational dynamics across the systems. For AHU I, RF emerged as the best-performing model with an RMSE of 688.12 and an R2 of 0.67, demonstrating a strong balance between precision and generalization. The RF disaggregation output for AHU I is shown in
Figure 9. AHU II exhibited similar characteristics, with RF maintaining its superiority (RMSE 633.47, R2 0.77).
A different pattern was observed for AHU III, where transformer-based architectures demonstrated superiority. ViT-large achieved the lowest RMSE (595.63), an R2 of 0.90, and an F1-score of 0.80. This shift toward attention-based models suggests that this unit exhibits more complex temporal patterns that benefit from the transformer’s capacity to capture long-range dependencies.
AHU IV exhibited the most distinctive behavior, with ViT-small achieving a near-perfect fit (R2 = 0.97) and the lowest NDE of 0.12. However, the marked variability in performance across models (some exceeding RMSE 5000.00) suggests that this unit possesses unique operational characteristics that require specific architectural choices for effective modeling.
Table 6 presents results for other commercial building loads.
The elevator proved to be a challenging load, with a low correlation (0.251) and low R2 values across all models. Prophet and RF demonstrated comparable performance (R2 of 0.08 and 0.11, respectively), with RF achieving a slightly lower RMSE (725.09). The irregular, event-driven usage pattern that limits all models to low R2 values is illustrated in
Figure 10.
The lighting system exhibited characteristics that clearly favor transformer architectures. ViT-large achieved the best performance with an RMSE of 381.95 and an R2 of 0.76, along with the lowest NDE of 0.16. These results indicate that lighting consumption patterns, although well-defined, contain temporal subtleties that transformers capture more effectively than traditional methods.
Figure 11 shows the ViT-large predictions for the lighting system, capturing the temporal subtleties that transformers exploit more effectively than traditional methods.
The socket circuits presented the most complex and instructive scenarios. Socket I exhibited a paradoxical result: the baseline Mean model outperformed all sophisticated architectures (RMSE 120.51, MAE 74.68, F1-score 0.88). This suggests that this load exhibits such high stochastic variability that complex models suffer from overfitting, while a simple mean estimate captures the central tendency more effectively. In contrast, Socket II showed moderately improved predictability, with ViT-small achieving the best performance (RMSE 271.30, R2 0.30). The ViT-small disaggregation output for Socket II is shown in
Figure 12.
5. Discussion
The comprehensive evaluation across three distinct building types reveals several important insights for NILM implementation using AI techniques.
5.1. Model Performance by Load Type
The results demonstrate clear patterns in model effectiveness related to load characteristics. For loads with well-defined finite states and high contribution to aggregate consumption (heat pump, AHU systems), traditional machine learning approaches, particularly RF, consistently achieved superior performance. This finding aligns with prior observations by Kelly and Knottenbelt [20] and Bousbiat et al. [5], who state that simpler models can match or exceed deep learning on well-structured NILM tasks, and can be attributed to RF’s effectiveness at capturing discrete state transitions through axis-aligned partitions of the engineered feature space without requiring the extensive temporal context that deep learning models need.
Transformer-based architectures excelled in scenarios with complex temporal patterns and high-power industrial loads. The ViT-small configuration frequently outperformed ViT-large, a finding consistent with the broader machine learning literature on the “double descent” phenomenon and the bias-variance trade-off in overparameterized models. Specifically, for time series with strong autocorrelation, the effective number of independent training examples is substantially smaller than the nominal sample count. Belkin et al. [47] showed that models entering the interpolation regime without sufficient effective data tend to memorize rather than generalize. In our NILM context, ViT-large (1.25 M parameters) likely enters this regime on the smaller effective datasets, while ViT-small (23 K parameters) remains in the classical regime where generalization improves with training. This has important practical implications: lightweight architectures may simultaneously offer superior accuracy and reduced inference costs for edge deployment.
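The effective-sample-size argument can be made concrete with a standard estimator based on the integrated autocorrelation time; the truncation rule (stop at the first non-positive lag) and the lag limit are illustrative choices, not the paper's method:

```python
import numpy as np

def effective_sample_size(x, max_lag=200):
    """Rough effective sample size under autocorrelation:
    N_eff ~ N / (1 + 2 * sum of positive sample autocorrelations)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    denom = np.dot(x, x)
    tau = 1.0
    for lag in range(1, min(max_lag, n - 1)):
        rho = np.dot(x[:-lag], x[lag:]) / denom
        if rho <= 0:          # truncate once correlations vanish
            break
        tau += 2.0 * rho
    return n / tau
```

Applied to an energy time series with strong autocorrelation, this estimator reports far fewer independent samples than the nominal count, which is the mechanism invoked above to explain why the larger ViT variant overfits.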
Prophet-inspired models demonstrated particular strength in capturing seasonal and periodic patterns, making them especially suitable for residential appliances with regular usage cycles. The interpretable decomposition into trend and seasonality components provides added value for energy auditing applications, echoing the emphasis on model transparency highlighted by Huzzat et al. [41].
5.2. Analysis of Model Failures
Several models produced near-zero R2 values or extremely high RMSE for specific loads (e.g., TimeGPT-inspired and VAE on the IMDELD pelletizer; VAE on the COmBED AHUs). Inspection of learning curves revealed that these failures are attributable to two distinct mechanisms. For the VAE, the KL divergence regularization term tends to dominate the reconstruction loss during early training, causing the decoder to collapse to predicting the population mean—a known pathology of VAE training when the weight of the KL term is not carefully annealed. For the TimeGPT-inspired model, causal attention restricts the model to using only past context, which proves insufficient for loads whose state at time t depends on future scheduled events (e.g., centrally controlled HVAC). These are not instances of training instability per se, but rather fundamental architectural mismatches with specific load types. We did not observe sensitivity to random seeds or learning rate for the successful model–load pairings.
5.3. Practitioner Guidance: When to Prefer Simple Baselines
The finding that a Mean baseline outperforms all complex models for Socket I is instructive for practitioners. We propose the following heuristic criteria for identifying loads where simple baselines are likely preferable: (i) low Pearson correlation between the target load and the aggregate signal; (ii) high coefficient of variation of the target load; (iii) absence of repeatable temporal patterns (no significant peaks in the autocorrelation function beyond lag 0). When these conditions are met, the target load contributes negligibly to the aggregate signal and exhibits high stochastic variability, making it indistinguishable from noise for any learned model. In such cases, practitioners should either accept the Mean baseline or consider sub-metering the load directly.
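The three heuristic criteria can be packaged into a screening function. The numeric cutoffs below (0.1 for correlation, 1.0 for coefficient of variation, 0.2 for autocorrelation peaks) are illustrative placeholders, since the paper's exact thresholds are not reproduced in this excerpt:

```python
import numpy as np

def prefer_mean_baseline(load, aggregate,
                         corr_thresh=0.1, cv_thresh=1.0,
                         acf_thresh=0.2, max_lag=100):
    """Flag loads where a Mean baseline is likely competitive.

    Thresholds are illustrative placeholders, not the paper's values.
    """
    load = np.asarray(load, dtype=float)
    aggregate = np.asarray(aggregate, dtype=float)

    # (i) Low Pearson correlation with the aggregate signal.
    corr = np.corrcoef(load, aggregate)[0, 1]

    # (ii) High coefficient of variation of the target load.
    cv = load.std() / max(load.mean(), 1e-9)

    # (iii) No significant autocorrelation peaks beyond lag 0.
    c = load - load.mean()
    denom = np.dot(c, c)
    acf = np.array([np.dot(c[:-k], c[k:]) / denom
                    for k in range(1, max_lag)])
    has_structure = bool(np.any(np.abs(acf) > acf_thresh))

    return (abs(corr) < corr_thresh) and (cv > cv_thresh) and not has_structure
```

A load passing all three checks is, for practical purposes, noise relative to the aggregate, and sub-metering it directly is likely a better investment than any disaggregation model.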
5.4. Building Type Considerations
Commercial buildings present unique challenges compared to residential environments. The high correlation among AHU units (exceeding 0.90 in some cases) suggests coordinated operation typical of centralized building management systems, creating difficulties in differentiating individual units. However, this correlation also means that accurate disaggregation of any single AHU provides substantial insight into overall HVAC consumption. A promising direction for addressing correlated commercial loads is the adoption of multi-output or multi-task learning frameworks that jointly model all correlated loads simultaneously, thereby leveraging inter-unit dependencies as an informative signal rather than treating them as a confound. Integration of external signals from building management systems (e.g., control setpoints, occupancy schedules) could further improve disambiguation.
Industrial environments showed the highest predictability for high-power machinery, with R2 values exceeding 0.90 for pelletizers and exhaust fans. The 1-s sampling rate of IMDELD, combined with distinct operational signatures of industrial equipment, facilitated accurate disaggregation. However, as noted in the Limitations, the contribution of sampling rate versus intrinsic load characteristics cannot be fully disentangled in this study.
5.5. Signal-to-Noise Ratio and Pearson Correlation Effects
The observed performance disparities across appliances can be fundamentally attributed to the signal-to-noise ratio (SNR) inherent in each disaggregation task. Loads exhibiting low Pearson correlation with the aggregate signal, such as the refrigerator (0.097) in the residential dataset, present a challenging scenario in which the target signal is effectively masked by the superposition of other concurrent loads. The near-zero R2 values obtained for such appliances are not indicative of model failure but rather reflect an intrinsic limitation of the disaggregation problem when the target load contributes minimally to the aggregate measurement.
Conversely, high-correlation loads such as the heat pump (0.512) and the pelletizers provide stronger discriminative signals that enable models to achieve R2 values exceeding 0.85. This correlation–performance relationship underscores the importance of considering load contribution ratios when selecting target appliances for NILM deployment, as energy-intensive loads inherently offer more favorable disaggregation conditions regardless of the chosen algorithm.
5.6. Random Forest Superiority for Finite-State Loads
The consistent outperformance of RF over deep learning architectures for finite-state appliances such as heat pumps and AHU systems can be explained through the lens of decision boundary complexity. RF constructs an ensemble of decision trees that naturally partition the feature space into discrete regions corresponding to appliance operational states (ON, OFF, intermediate levels). This architectural characteristic aligns well with the nature of finite-state loads, where power consumption transitions between well-defined levels rather than varying continuously. Furthermore, the ensemble approach handles non-linear relationships between aggregate power measurements and individual appliance states without requiring the extensive temporal context that recurrent or attention-based architectures demand.
The 200-tree configuration with a maximum depth of 30 levels provides sufficient representational capacity to capture state transitions, while the minimum sample constraints (5 per split and leaf) prevent overfitting to noise in the training data. These results are consistent with previous findings in the literature demonstrating that traditional machine learning methods can outperform deep learning when the underlying problem structure matches the model’s inductive biases.
5.7. Model Complexity and Overfitting Trade-Offs
The counterintuitive finding that ViT-small (23,425 parameters) frequently outperformed ViT-large (1,252,481 parameters) reveals important insights about the relationship between model capacity and NILM task requirements. Deep learning models with excessive parameterization relative to the available training data are prone to memorizing training patterns rather than learning generalizable representations.
The NILM datasets employed, despite spanning months to years of measurements, exhibit substantial temporal autocorrelation that effectively reduces the number of independent training examples. Additionally, the fundamental patterns in appliance signatures—characterized by state transitions and power levels—may not require the extensive hierarchical feature extraction that larger transformer architectures provide. The ViT-small configuration appears to strike an optimal balance, offering sufficient attention capacity to capture relevant temporal dependencies without the regularization challenges posed by overparameterized networks. This finding carries practical implications for edge deployment, where lightweight architectures may simultaneously offer superior accuracy and reduced inference costs.
5.8. Challenges of Overlapping Signatures in Commercial Environments
The elevated Pearson correlation coefficients observed among AHU units (exceeding 0.90) exemplify a fundamental challenge in commercial building NILM: the presence of multiple similar or identical loads operating under coordinated control strategies. Building management systems typically activate HVAC equipment in synchronized patterns based on occupancy schedules and thermal comfort requirements, resulting in highly correlated consumption profiles that confound traditional disaggregation approaches. Unlike residential environments, in which appliance diversity naturally provides discriminative signatures, commercial buildings frequently deploy multiple units of identical equipment whose individual contributions become statistically indistinguishable.
The elevator load, with a low correlation of 0.251 and consistently poor R2 values across all models, represents an extreme case in which irregular, event-driven operation combined with brief duty cycles creates insufficient statistical structure for reliable disaggregation. These findings suggest that commercial NILM implementations may require complementary approaches—such as sub-metering at distribution panels or incorporation of building automation system data—to resolve ambiguities among similar loads.
5.9. Sampling Rate Impact on Disaggregation Performance
The strong performance achieved in the industrial environment (R2 exceeding 0.90 for pelletizers and exhaust fans) can be substantially attributed to the 1-s sampling rate of the IMDELD dataset, compared to the 30-s and 1-min intervals in the commercial and residential datasets, respectively. Higher sampling frequencies capture transient signatures—the distinctive power fluctuations during appliance startup, shutdown, and state transitions—which provide rich discriminative information beyond steady-state power levels.
Industrial machinery, characterized by high power ratings and pronounced operational signatures, generates transient events with amplitudes that significantly exceed measurement noise, enabling robust detection even in complex multi-load scenarios. The finer temporal resolution also facilitates the identification of duty cycling patterns and variable-speed drive characteristics that would be aliased or averaged out at lower sampling rates.
This observation has important implications for NILM system design: strategic investment in higher-frequency metering infrastructure may yield substantially improved disaggregation accuracy, particularly for loads whose signatures are predominantly transient in nature. The trade-off between data acquisition costs, storage requirements, and disaggregation performance merits careful consideration in practical deployments.
5.10. Summary: Model Selection Guidelines
Table 7 synthesizes the experimental findings into a practitioner-oriented model selection guide, mapping load type and environment to the recommended architecture.
6. Conclusions
This study presented a systematic, cross-domain evaluation of five AI-based architectures for NILM across residential, commercial, and industrial environments. Beyond reporting per-model accuracy figures, the work yields three principal insights that advance our understanding of the NILM problem.
First, the results provide empirical evidence that the interaction between model architecture and load-level characteristics is a stronger determinant of disaggregation accuracy than either factor in isolation. RF excels on finite-state loads because its decision-boundary structure aligns with discrete power-level transitions; ViT-small excels on complex industrial loads because its attention mechanism captures transient temporal dependencies; and the Prophet-inspired model excels on seasonal residential loads because its decomposition architecture matches the generative structure of periodic consumption. This architecture–load interaction perspective offers a more nuanced and actionable framework than the prevailing “which model is best?” paradigm.
Second, the study demonstrates that model complexity is not monotonically related to performance in the NILM domain. The consistent superiority of ViT-small over ViT-large can be attributed to the reduced effective sample size caused by temporal autocorrelation in energy time series, which pushes overparameterized models into a memorization regime. This finding carries practical significance for edge deployment, where lightweight models are preferable.
Third, the analysis of signal-to-noise ratio and Pearson correlation reveals fundamental identifiability limits: loads contributing less than approximately 10% to the aggregate signal, with correspondingly low Pearson correlation, present disaggregation challenges that no model architecture can overcome, as the target signal is effectively masked by superimposed loads. This insight reframes certain “model failures” as problem-intrinsic limitations and provides practitioners with quantitative criteria for assessing NILM feasibility prior to deployment.
These findings support the use of NILM as a practical tool for energy monitoring in buildings, particularly when existing metering infrastructure is available and analysis is conducted offline. However, the characterization of NILM as “cost-effective” should be understood as conditional on these assumptions and on targeting high-impact loads where disaggregation is technically feasible. The high inter-unit correlations observed among AHU systems (exceeding 0.90) highlight the limitations of purely algorithmic approaches in commercial environments with coordinated building management, suggesting that hybrid solutions integrating NILM with building automation data may be necessary for comprehensive disaggregation in complex facilities.
Future Work
Several promising research directions emerge from this study’s findings.
The development of adaptive sampling strategies that dynamically adjust acquisition frequency based on detected load events could optimize the trade-off between data storage requirements and disaggregation accuracy.
Investigating semi-supervised and self-supervised learning paradigms may address the persistent challenge of obtaining labeled ground-truth data in commercial environments where sub-metering installation is impractical.
The integration of contextual information from building management systems, occupancy sensors, and weather data as auxiliary inputs could enhance model performance for highly correlated loads such as HVAC systems operating under coordinated control strategies.
Exploring lightweight Neural Architecture Search (NAS) techniques specifically tailored for NILM applications could yield optimized model configurations that balance accuracy with computational efficiency for resource-constrained edge devices.
Model interpretability: Leveraging feature importance analysis (RF) and attention map visualization (ViT, TimeGPT-inspired) to provide diagnostic insights to facility managers about which temporal patterns drive specific appliance predictions would improve trust and facilitate practical adoption of NILM systems.
A rigorous computational cost analysis—benchmarking training time, inference latency, and memory usage across models on standardized hardware—is needed to complement the accuracy-focused evaluation presented here.
Longitudinal studies examining model degradation over time due to appliance replacement, seasonal variations, and changes in occupancy patterns would provide valuable insights for developing robust, self-adaptive NILM systems suitable for long-term deployment in real-world facilities.