Article

Evaluation of Traditional and Data-Driven Algorithms for Energy Disaggregation Under Sampling and Filtering Conditions

by
Carlos Rodriguez-Navarro
,
Francisco Portillo
*,
Isabel Robalo
and
Alfredo Alcayde
Department of Engineering, University of Almeria, ceiA3, 04120 Almeria, Spain
*
Author to whom correspondence should be addressed.
Inventions 2025, 10(3), 43; https://doi.org/10.3390/inventions10030043
Submission received: 12 May 2025 / Revised: 9 June 2025 / Accepted: 11 June 2025 / Published: 13 June 2025

Abstract

Non-intrusive load monitoring (NILM) enables the disaggregation of appliance-level energy consumption from aggregate electrical signals, offering a scalable solution for improving efficiency. This study compared the performance of traditional NILM algorithms (Mean, CO, Hart85, FHMM) and deep neural network-based approaches (DAE, RNN, Seq2Point, Seq2Seq, WindowGRU) under various experimental conditions. Factors such as sampling rate, harmonic content, and the application of power filters were analyzed. A key aspect of the evaluation was the difference in testing conditions: while traditional algorithms were evaluated under multiple experimental configurations, deep learning models, due to their extremely high computational cost, were analyzed exclusively under a specific configuration consisting of a 1-s sampling rate, with harmonic content present and without applying power filters. The results confirm that no universally superior algorithm exists, and performance varies depending on the type of appliance and signal conditions. Traditional algorithms are faster and more computationally efficient, making them more suitable for scenarios with limited resources or rapid response requirements. However, significantly more computationally expensive deep learning models showed higher average accuracy (MAE, RMSE, NDE) and event detection capability (F1-SCORE) in the specific configuration in which they were evaluated. These models excel in detailed signal reconstruction and handling harmonics without requiring filtering in this configuration. The selection of the optimal NILM algorithm for real-world applications must consider a balance between desired accuracy, load types, electrical signal characteristics, and crucially, the limitations of available computational resources.

1. Introduction

The global energy transition urgently requires innovative solutions to the intertwined challenges of climate change mitigation and household energy efficiency. Households account for approximately 20–30% of global energy use [1], with inefficient monitoring systems exacerbating financial burdens and greenhouse gas emissions from fossil fuel dependence. Non-intrusive load monitoring (NILM) [2] has emerged as a transformative approach that enables granular energy management by disaggregating appliance-level consumption from aggregated electrical signals. While traditional methods rely on intrusive submetering, which is costly and impractical for large-scale deployment, NILM offers a scalable alternative. Yet, its widespread adoption remains hindered by unresolved technical limitations.
Since Hart’s seminal work in the 1980s [3], NILM algorithms [4,5] have evolved considerably, incorporating machine learning and advanced signal processing techniques. In recent years, research in NILM has progressed significantly with the application of deep learning architectures and innovative models that enhance the extraction of temporal patterns and complex features from electrical signals. For instance, a convolutional neural network (CNN)-based approach that leverages temporal patternization to capture more detailed usage patterns and improve appliance classification from power signals has been proposed, demonstrating notable improvements in both accuracy and robustness compared to traditional methods [6]. Similarly, a transformer-based model that employs attention mechanisms to extract global dependencies between the aggregate signal and individual device signals (overcoming the limitations of recurrent models and reducing data preprocessing requirements) was introduced in [7]. Furthermore, harmonic and generative modeling for NILM have been explored, showing that the explicit incorporation of harmonic information and generative techniques can enhance disaggregation performance, particularly in scenarios with high harmonic distortion [8].
However, critical barriers persist in inconsistent benchmarking due to heterogeneous datasets, sensitivity to sampling rates (varying from milliseconds to minutes), and inadequate handling of harmonic distortions or low-power noise. Current evaluations often overlook the interplay between algorithmic robustness and real-world variables, such as the electrical characteristics of appliances or computational constraints for real-time applications. This gap limits the practical implementation of NILM systems (see Figure 1), despite their proven potential to reduce household consumption by up to 20% through behavioral feedback.
This paper advances NILM research by tackling key gaps in standardization, algorithmic evaluation, and metric-driven assessment. It promotes standardization using open-source tools, such as evaluating classical algorithms and neural network-based methods across diverse sampling rates, harmonic conditions, and power filters. The analysis integrates performance metrics, event detection, and computational efficiency to guide real-world deployment. By linking algorithmic performance [9] to dataset characteristics and hardware constraints, this work provides a framework for tailoring NILM solutions to specific appliance profiles and operational requirements [10]. The open tools and datasets underpinning this analysis enhance transparency, inviting replication and extension in future research.

2. Materials and Methods

Evaluating the performance of energy disaggregation algorithms requires a robust infrastructure for accurate data acquisition and standardized tools for comparative analysis. This study leverages datasets collected using open-source metering hardware, processed and analyzed through a dedicated software toolkit, and evaluated using a defined set of performance metrics under various experimental conditions.

2.1. Measurement Hardware

The datasets used in this study were generated using the OpenZmeter (oZm) platform, an open-source energy meter and power quality analyzer. While previous works describe versions v1 and v2 of the oZm, the core datasets analyzed in this article were obtained using the updated and high-precision oZm v3 [11]. This meter is distinguished by its capability to perform high-frequency electrical measurements at a sampling rate of 15,625 Hz, recording a broad spectrum of electrical variables, including voltage, current, and harmonics up to the 50th order for voltage, current, and power.
The oZm has been utilized in other research papers for different purposes, and its use in this work demonstrates the device’s robustness and versatility, establishing a new benchmark in high-frequency electrical monitoring for NILM research. Long-term data acquisition was facilitated via the oZm API.
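As a quick sanity check on the sampling specification above, the harmonic orders the oZm v3 can resolve follow directly from the Nyquist criterion. The 15,625 Hz rate and 50th-order limit come from the text; the 50 Hz fundamental is an assumption (European grid):

```python
# Sanity check: can a 15,625 Hz sampling rate resolve harmonics up to the
# 50th order? (50 Hz fundamental assumed, as in the European grid.)
FS = 15_625          # oZm v3 sampling rate in Hz (from the text)
FUNDAMENTAL = 50     # assumed grid fundamental frequency in Hz

nyquist = FS / 2                             # highest resolvable frequency (Hz)
max_harmonic = int(nyquist // FUNDAMENTAL)   # highest resolvable harmonic order

print(nyquist)        # 7812.5
print(max_harmonic)   # 156 -> comfortably above the 50th order recorded
```

The margin (156th vs. 50th order) suggests the recorded harmonic spectrum is well within the meter's bandwidth.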

2.2. Datasets

Evaluating the performance of energy disaggregation algorithms requires reliable electrical datasets that capture consumption at the aggregate and individual levels. In the field of NILM, the evaluation of algorithmic performance is fundamentally based on using such datasets. However, the heterogeneity of existing datasets often leads to inconsistency in algorithm benchmarking. Research on NILM has used several prominent public datasets for evaluation (see Figure 2), including the following:
  • Reference Energy Disaggregation Dataset (REDD) [12], a public dataset used for energy disaggregation research.
  • Almanac of Minutely Power Dataset (AMPds) [13], a public dataset for load disaggregation and eco-feedback that includes data on the electricity, water, and natural gas consumption of residential households in Canada from 2012 to 2014.
  • UK Domestic Appliance-Level Electricity (UK-DALE) dataset [14], obtained from a two-year longitudinal study of UK households [15].
  • Electricity Consumption and Occupancy (ECO) dataset [16], which is used to evaluate the performance of NILM algorithms.
  • Green Energy Consumption Dataset (GREEND) [17], which comprises data on household energy consumption in Italy and Austria.
This paper employed two publicly available datasets based on open hardware to address some of the challenges related to data variability and allow a focused analysis of the impact of sampling and filtering conditions. Both datasets were collected at the University of Almeria (Spain) using the oZm platform.
The first dataset was DSUALM10H [18], introduced in June 2023: a high-resolution, multichannel dataset for NILM containing measurements of 10 common household appliances sampled at 15,625 Hz. It captures a broad spectrum of electrical variables, including voltage, current, and harmonics up to the 50th order for voltage, current, and power, for a total of 150 electrical variables, such as current, voltage, and power transients.
DSUALM10H was generated using three oZm v3 units, providing 12 measurement channels, 1 for aggregate consumption and 10 for individual appliances. The monitored appliances included an electric oven, a microwave, a kettle, a vacuum cleaner, a radiator, a television, an electric shower heater, a fan, a refrigerator, and a freezer. A distinctive feature of DSUALM10H is that it includes harmonic content for each appliance.
The second dataset was DSUALM10, which was derived from DSUALM10H. It was obtained by eliminating the harmonic components of voltage, current, and power, thus containing only the fundamental electrical measurements. Like DSUALM10H, it is based on measurements from the 10 appliances using the oZm v3. Both datasets are publicly available [18,19] and generated with open hardware, which promotes transparency and reproducibility in NILM research. They serve as a basis for exploring how signal characteristics and sampling conditions impact the performance of algorithms.

2.3. Software Tools and Evaluation Metrics

Energy disaggregation and algorithm evaluation were conducted using the Non-Intrusive Load Monitoring Toolkit (NILMTK) v0.4.0 [20], an open-source framework designed to streamline NILM research by providing standardized workflows, algorithms, data converters, and a suite of built-in evaluation metrics. Within this framework, custom converters were developed to process data. The evaluation methodology followed the full NILMTK pipeline (from raw data conversion to disaggregation assessment).
To extend its capabilities, the NILMTK-Contrib extension was employed [21], providing implementations of traditional algorithms and deep learning-based approaches. In addition, it offers a rapid experimentation API for cross-building and cross-dataset validation, as well as support for training on synthetic aggregates and transfer learning across different sampling frequencies [22].
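The rapid experimentation API mentioned above is driven by a declarative experiment dictionary. The following is a sketch of that structure, with field names following the NILMTK-Contrib examples; the dataset path, building number, and date ranges are hypothetical, and the `methods` values are shown as placeholder strings standing in for instantiated disaggregator objects (e.g., `Mean()`, `FHMM()`, `Seq2Point()`):

```python
# Sketch of the experiment definition consumed by NILMTK's rapid
# experimentation API. Paths, buildings, and dates are hypothetical;
# 'methods' values are placeholders for instantiated disaggregators.
experiment = {
    "power": {"mains": ["active"], "appliance": ["active"]},
    "sample_rate": 1,                      # seconds between samples
    "appliances": ["fridge", "kettle", "television"],
    "methods": {"Mean": "Mean()", "FHMM": "FHMM()", "Seq2Point": "Seq2Point()"},
    "train": {"datasets": {"DSUALM10H": {
        "path": "dsualm10h.h5",
        "buildings": {1: {"start_time": "2023-01-01",
                          "end_time": "2023-01-20"}}}}},
    "test": {"datasets": {"DSUALM10H": {
        "path": "dsualm10h.h5",
        "buildings": {1: {"start_time": "2023-01-21",
                          "end_time": "2023-01-25"}}}},
        "metrics": ["mae", "rmse", "f1score", "nde"]},
}
```

Splitting the train and test date ranges within one dataset, as here, is how the 80/20-style evaluation described later can be expressed in this API.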
Core NILMTK-supported metrics were used to evaluate algorithm performance, covering both power estimation (regression) and event detection (classification) aspects [23] as follows:
1. Mean Absolute Error (MAE) (Equation (1)), which measures the average absolute difference between the predicted and actual power consumption of an appliance. Lower MAE values indicate better estimation performance:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\mathrm{predicted}_i - \mathrm{actual}_i\right| \tag{1}$$

where predicted_i and actual_i represent the individual prediction values and actual energy consumption values of the appliances, respectively, and N represents the total number of observations.
2. Root Mean Squared Error (RMSE) (Equation (2)), which is like MAE but gives greater weight to larger errors; it represents the standard deviation of the estimation errors:

$$\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(y_t^n - \hat{y}_t^n\right)^2} \tag{2}$$

where $\hat{y}_t^n$ represents the estimated power of device n at each time interval t, $y_t^n$ denotes the actual power of the same device, and T is the total number of observations or time intervals recorded during energy consumption.
3. F1-SCORE, which combines Precision (Equation (3)) and Recall (Equation (4)), assesses the accuracy in detecting ON/OFF events. A higher F1-SCORE reflects a better ability to correctly identify appliance operation states:

$$\mathrm{Precision} = \frac{TP}{TP+FP} \tag{3}$$

$$\mathrm{Recall} = \frac{TP}{TP+FN} \tag{4}$$

where TP represents true positives, FP false positives, and FN false negatives.
The F1-SCORE synthesizes these aspects, offering a composite measure that signals robust accuracy in identifying and predicting appliance states, as expressed in Equation (5):

$$F_1 = \frac{2\cdot\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}} \tag{5}$$
4. Normalized Disaggregation Error (NDE) (Equation (6)), which quantifies the total energy estimation error for each appliance, normalized by actual consumption to enable fair comparisons:

$$\mathrm{NDE}_i = \frac{\sum_{t=1}^{T}\left(\hat{y}_t^i - y_t^i\right)^2}{\sum_{t=1}^{T}\left(y_t^i\right)^2} \tag{6}$$

where $\hat{y}_t^i$ is the estimated power consumption of device i at time t, and $y_t^i$ reflects the actual power consumption of the device over the total time interval T.
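The four metrics above can be computed directly from paired actual/predicted power series. The following minimal sketch implements them in plain Python; the 10 W ON/OFF threshold used to derive events for the F1-SCORE is illustrative, not a value taken from the study:

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error (Equation (1))."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error (Equation (2))."""
    return math.sqrt(sum((a - p) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def nde(actual, predicted):
    """Normalized Disaggregation Error (Equation (6)): squared error over actual energy."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / sum(a ** 2 for a in actual)

def f1_score(actual, predicted, on_threshold=10.0):
    """F1-SCORE over ON/OFF events (Equations (3)-(5)); the 10 W threshold is illustrative."""
    act_on = [a >= on_threshold for a in actual]
    pred_on = [p >= on_threshold for p in predicted]
    tp = sum(a and p for a, p in zip(act_on, pred_on))
    fp = sum((not a) and p for a, p in zip(act_on, pred_on))
    fn = sum(a and (not p) for a, p in zip(act_on, pred_on))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

actual = [0, 0, 100, 100, 100, 0]       # ground-truth appliance power (W)
predicted = [0, 20, 90, 110, 100, 0]    # disaggregated estimate (W)
print(round(mae(actual, predicted), 2))   # 6.67
print(rmse(actual, predicted))            # 10.0
```

Note the different sensitivities: the single 20 W false activation barely moves MAE but costs precision, and hence F1-SCORE, illustrating why regression and event-detection metrics are reported separately.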

2.4. Algorithms

This study evaluated traditional algorithms and deep neural network-based models [24] for NILM using NILMTK [20] and NILMTK-Contrib [21]. The main characteristics of the traditional algorithms are described below:
  • Combinatorial Optimization (CO). This algorithm performs an exhaustive search of all possible combinations of appliance states to find the one that best explains the aggregate signal. While it can yield good results in simple scenarios, its complexity grows exponentially with the number of devices and possible states, which limits its scalability [25].
  • Hart85. Based on finite state machines, this method detects ON/OFF events using dynamic active/reactive power thresholds. It is susceptible to parameter configuration and signal quality, showing highly variable performance depending on the appliance type and experimental conditions [3].
  • Mean. This method uses a moving average of the aggregate consumption signal over a time window to estimate the consumption of each appliance. It is notable for its simplicity, robustness, and low computational demand, although its accuracy may be limited for devices with complex or variable consumption patterns.
  • Factorial Hidden Markov Model (FHMM) [26]. This probabilistic model uses the Viterbi algorithm to infer the most probable sequence of device states, considering temporal transitions and relationships between them, with complexity O(T·S^N) [27]. Unlike the standard FHMM implementation in NILMTK, this version incorporates several optimizations to improve efficiency: it leverages parallel processing on multicore CPUs via multithreading, replaces Python loops with vectorized NumPy operations, manages the state space more efficiently through probability precomputation and dynamic pruning, optimizes memory usage with adjustable data types and sparse array storage, and uses JIT compilation with Numba for critical routines. In contrast, the original NILMTK version relies on sequential, generic implementations and non-vectorized data structures, resulting in substantially lower computational performance and resource efficiency.
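The exhaustive search at the heart of CO, and its exponential scaling, can be sketched in a few lines. The appliance state sets below are hypothetical values chosen for illustration:

```python
from itertools import product

# Minimal sketch of Combinatorial Optimization (CO): exhaustively search
# every combination of discrete appliance states for the one whose total
# power best explains the aggregate reading. State values are hypothetical.
APPLIANCE_STATES = {
    "kettle": [0, 2000],      # OFF / ON power draw in watts
    "fridge": [0, 90, 140],   # OFF / compressor / compressor + defrost
    "tv":     [0, 60],
}

def co_disaggregate(aggregate_watts):
    """Return the per-appliance state assignment minimizing the residual."""
    names = list(APPLIANCE_STATES)
    best, best_err = None, float("inf")
    # The number of combinations is the product of the state counts, i.e.
    # exponential in the number of appliances -- the scalability limit
    # noted in the text.
    for combo in product(*(APPLIANCE_STATES[n] for n in names)):
        err = abs(aggregate_watts - sum(combo))
        if err < best_err:
            best, best_err = dict(zip(names, combo)), err
    return best

print(co_disaggregate(2150))  # {'kettle': 2000, 'fridge': 90, 'tv': 60}
```

With 3 appliances this search visits only 12 combinations, but each added appliance multiplies the count, which is why CO does not scale to large appliance sets.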
The main characteristics of deep neural network-based models are as follows:
  • WindowGRU. Implemented as a bidirectional GRU network that processes temporal windows of aggregate electrical consumption, WindowGRU uses recurrent layers with ReLU activation to predict the state of appliances at the last temporal step. It is integrated into the NILMTK experimentation API for cross-evaluation between datasets and is typically trained with Adam (30 epochs, learning rate 1 × 10−3) under GPU requirements (TensorFlow-GPU + CUDA) and normalization preprocessing, although non-optimized implementations of the activation functions may limit its practical performance [21].
  • Seq2Seq. This model implements an encoder–decoder-based deep neural network architecture, where a time window of the aggregated signal (e.g., 99 samples) is processed by the model to predict the corresponding sequence of consumption of an appliance, using recurrent and convolutional layers; each device has its own trained model, training is performed with normalized and batch data, and integration with the NILMTK API allows performance to be evaluated in different buildings and datasets in a flexible and reproducible way [21,28].
  • Denoising Autoencoder (DAE). Fully convolutional in this implementation, it uses Conv1D layers in the encoder (e.g., 3 layers with 8/16/32 filters and kernel = 4) and a symmetric transposed decoder to capture temporal patterns, injecting Gaussian noise (σ = 0.1–0.3) into the aggregate input to strengthen the model. It optimizes a combined loss function that integrates the Mean Squared Error (MSE) of reconstruction with L1 regularization over the weights (controlled by λ) to avoid overfitting. Training is performed with Adam (learning rate ~1 × 10−3) on time windows of 150–600 standardized samples, thus achieving disaggregation through sparse and locally invariant latent representations [21].
  • Recurrent Neural Networks (RNN). This model implements standard recurrent neural networks for energy disaggregation, processing temporal sequences of aggregate consumption through predefined windows (e.g., 100–600 samples). Each appliance has an independent RNN model trained with recurrent dense layers and ReLU activation, using Keras as a backend. The data are normalized and screened before training, which is performed with the Adam optimizer (learning rate ~1 × 10−3) and MSE loss function, integrated into the NILMTK API for cross-evaluation between buildings/datasets. The current implementation faces practical limitations due to version-specific dependencies (TensorFlow/Keras) and problems reported in the train–test division after migrating to internal Keras methods [21].
  • Seq2Point. This model implements a CNN-based neural network that takes temporal windows of the aggregated signal (e.g., 99 samples) and predicts the consumption value of an appliance only at the center point of the window, using a sequential architecture of Conv1D, Dense, Dropout, and Flatten layers, trained with MSE loss function, data normalization, and hyperparameter configuration, such as window size, epoch number, and batch size, all integrated into NILMTK’s rapid experimentation API to facilitate training, evaluation, and comparison between devices and datasets [29].
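The window-to-center-point pairing that Seq2Point trains on can be illustrated without any deep learning framework. The sketch below builds those training pairs from toy series; the 5-sample window is chosen for readability (the text mentions 99-sample windows in practice):

```python
# Sketch of the sliding-window/center-point pairing used by Seq2Point:
# each fixed-length window of the aggregate signal is matched with the
# target appliance's power at the window's center sample. The 5-sample
# window is illustrative (the text mentions 99-sample windows).
def seq2point_pairs(aggregate, appliance, window=5):
    """Yield (input_window, center_target) training pairs; window must be odd."""
    half = window // 2
    for c in range(half, len(aggregate) - half):
        yield aggregate[c - half : c + half + 1], appliance[c]

aggregate = [100, 100, 2100, 2100, 2100, 100, 100]   # mains power (W)
kettle    = [0,   0,   2000, 2000, 2000, 0,   0]     # appliance ground truth (W)

pairs = list(seq2point_pairs(aggregate, kettle, window=5))
print(pairs[0])  # ([100, 100, 2100, 2100, 2100], 2000)
```

Predicting only the center point, rather than a whole output sequence as in Seq2Seq, is what lets each output sample see symmetric context on both sides.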

2.5. Experimental Conditions

To explore how different factors influence algorithm performance, evaluations were conducted under multiple experimental conditions, focusing on the following:
  • Sampling rate. Tested at intervals ranging from 90 s and 60 s down to higher resolutions (1 s, 500 ms, 250 ms, and 125 ms).
  • Harmonic content. By comparing DSUALM10H (with harmonics) and DSUALM10 (without them), the impact of harmonic components was assessed [18].
  • Power filtering. The effect of applying aggregate power thresholds (10 W, 50 W, 100 W) was evaluated before disaggregation.
This multifactorial evaluation (employing DSUALM10, DSUALM10H, and NILMTK metrics) aimed to comprehensively understand energy disaggregation algorithms’ behavior.
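The two signal-level conditions above can be sketched as simple transformations of a power series. Note that the text does not specify the filter mechanics; zeroing readings below the threshold is one plausible reading, and the toy values are hypothetical:

```python
# Sketch of the two preprocessing steps varied in the experiments:
# (1) downsampling a 1 s series to a coarser interval by block averaging,
# (2) applying a power threshold filter. Zeroing sub-threshold readings
# is an assumed interpretation; the text does not specify the mechanics.
def downsample(series, factor):
    """Average consecutive blocks of `factor` samples (e.g. 1 s -> 60 s)."""
    return [sum(series[i:i + factor]) / factor
            for i in range(0, len(series) - factor + 1, factor)]

def power_filter(series, threshold_watts):
    """Zero out readings below the threshold (10/50/100 W in the study)."""
    return [w if w >= threshold_watts else 0.0 for w in series]

one_second = [5.0, 5.0, 120.0, 120.0]    # toy 1 s aggregate readings (W)
print(downsample(one_second, 2))         # [5.0, 120.0]
print(power_filter(one_second, 50))      # [0.0, 0.0, 120.0, 120.0]
```

Both transformations discard information: averaging smears brief events across the block, and thresholding deletes small loads entirely, which foreshadows the degraded event detection reported in the results.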

3. Results and Discussion

This section presents the evaluation outcomes for various energy disaggregation algorithms applied to DSUALM10H and DSUALM10, using the NILMTK-Contrib tool. In addition, the influence of different experimental conditions was investigated, including the application of power filters, the type of appliance, the sampling rate, and the presence of harmonics. Regarding algorithm performance, the results of visual disaggregation and an analysis of the execution times for all the algorithms considered are also included.

3.1. Evaluation Metrics

Four key metrics (MAE, NDE, F1-SCORE, and RMSE) were used to evaluate algorithm performance under identical experimental conditions (sampling rate, harmonic presence, power filters) for household appliances, with an 80/20 train–test data split, ensuring consistent scenario comparisons across metrics [22]. The inability to use sub-second sampling intervals for deep learning algorithms in NILMTK-Contrib (e.g., CNNs, LSTMs), unlike classical methods (CO, FHMM), stems from architectural constraints. Classical approaches rely on statistical/discrete-event models that are independent of temporal granularity, while neural networks require fixed-length input windows (e.g., 500 samples at 1 s = 500 s of context); shorter intervals (<1 s) reduce the effective temporal context, eroding the capture of long-term patterns that is critical for performance. Additionally, higher sampling frequencies sharply increase data volume and computational demands, and current NILMTK-Contrib implementations lack mechanisms to adapt hyperparameters (e.g., kernel sizes, pooling layers) to sub-second scales, unlike signal-level filters, which are applicable across frequencies. The detailed results of the MAE, NDE, F1-SCORE, and RMSE metrics for all algorithms and experimental conditions are presented in Appendix A.
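The fixed-window context argument above can be checked numerically. Only the 500-sample window length comes from the text's example; the loop simply spells out the arithmetic:

```python
# A fixed-length input window spans window_len * sampling_interval seconds,
# so shrinking the interval shrinks the temporal context the network sees.
WINDOW_SAMPLES = 500   # example window length from the text

for interval_s in (1.0, 0.5, 0.25, 0.125):
    context = WINDOW_SAMPLES * interval_s
    print(f"{interval_s} s sampling -> {context} s of context")
# 1 s keeps 500 s of context; 125 ms leaves only 62.5 s for the same window.
```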

3.1.1. F1-SCORE

Figure 3 shows that, in general, neural network-based algorithms achieve F1-SCOREs higher than or comparable to those of traditional algorithms for most home appliances. This suggests that neural networks are more robust and effective at handling the complexity and variability of electrical signals, especially in the presence of harmonics and without filtering. However, there are exceptions: for example, the traditional Mean algorithm obtains an exceptionally high F1-SCORE for the TV, and Hart85 achieves good results only for some appliances.
On the other hand, while neural networks often outperform traditional algorithms, they also present difficulties in specific applications, such as vacuum cleaners and electric space heaters, where F1-SCOREs are low even for these advanced models. This indicates that, despite their sophistication, neural networks cannot always correctly detect all events in appliances with more complex or less predictable consumption patterns. In summary, the analysis confirms that, under the condition of 1-s sampling with harmonic content, neural networks offer a better capacity for event detection than traditional algorithms, showing a more balanced and robust performance against different types of appliances. Nevertheless, this advantage is accompanied by a considerably higher computational cost, which can limit practical applications.

3.1.2. MAE

The MAE varies considerably for traditional algorithms depending on the algorithm, appliance, and conditions. Stable and straightforward appliances such as the TV or fan have lower MAEs, especially with Mean. High-power or complex appliances, such as the electric oven, consistently present high MAEs with all traditional algorithms. Hart85 significantly improves its MAE by applying power filters.
Neural network-based algorithms, evaluated in the configuration of 1 s with harmonic content, generally outperform classical algorithms in all average metrics, including MAE. They tend to achieve lower MAEs, even for complex or high-powered appliances. For example, for the electric oven, the MAE of RNN was 174.46 W, while Seq2Point was 285.67 W, Seq2Seq was 389.78 W, and DAE was 480.55 W. These values are considerably lower than the MAE reported for Mean (910.21 W), CO (553.7 W), and FHMM (322.3 W) without filters, and in some cases even better than Hart85 (423.29 W) without filters. This reinforces the idea that neural models can better capture complex consumption patterns. Figure 4 shows the differences in MAE per appliance among all the algorithms evaluated (including DAE, RNN, Seq2Point, Seq2Seq, and WindowGRU).

3.1.3. NDE

In traditional algorithms, Mean performs better with low NDEs for regular, unfiltered, low-consumption appliances. At the same time, Hart85 presents poor performance under these conditions, improving with the application of filters (see Figure 5).
The fridge has the worst overall performance with traditional algorithms. Algorithms based on neural networks, under the conditions evaluated (1 s with harmonic content), also present generally lower NDE on average than traditional algorithms, reflecting a greater precision in estimating normalized energy consumption. For the electric oven, the NDEs of RNN (0.295), Seq2Point (0.438), Seq2Seq (0.500), and DAE (0.610) are substantially lower than those of traditional unfiltered algorithms, indicating a better ability to estimate the consumption of this high-power appliance. For the fridge, whose NDEs are high with some traditional algorithms, the neural networks obtained NDEs between 0.762 and 0.7945, representing an improvement. The NDE confirms that while Mean is effective for regular loads, neural networks offer more robust performance for various consumption patterns.

3.1.4. RMSE

Like MAE, Mean shows the lowest RMSEs for stable, low-power unfiltered appliances (see Figure 6). Hart85 has a significantly high RMSE without filters, but improves substantially with its application. The electric oven consistently presents the highest RMSEs among all traditional appliances and algorithms. CO shows a significant increase in RMSE for the fridge and freezer with filters.
By including the neural network algorithms evaluated in the condition of 1 s with harmonic content, it is observed that they generally achieve lower RMSEs on average than classical algorithms. This is particularly notable for high-powered appliances such as the electric oven, where the RMSEs of RNN (323.0 W), Seq2Point (479.9 W), Seq2Seq (547.8 W), and DAE (667.6 W) are lower than those of traditional unfiltered algorithms, indicating a better ability to handle significant power fluctuations. Although Hart85 can achieve low RMSEs for high-power loads with filters, neural networks accomplish this without relying on this specific preprocessing in the tested configuration. The RMSE analyses, such as the MAE and NDE, underline that neural networks offer greater overall accuracy in estimating power, even in the presence of harmonics, although at a considerably higher computational cost.
Including neural network-based algorithms in the analysis of metrics such as F1-SCORE, MAE, NDE, and RMSE allows a more complete evaluation, since these models usually outperform traditional algorithms on average and demonstrate a greater ability to handle the complexity and variability of appliances, especially without the need for filters. However, this better performance implies a much higher computational cost, so the choice of the most appropriate algorithm must balance the desired accuracy, the characteristics of the devices to be monitored, and the resource limitations of the environment where it will be deployed.

3.2. Influence of Experimental Factors

The following section discusses the impact of specific experimental conditions (filters, appliance type, sampling rate) on evaluation metrics.

3.2.1. Effect of Power Filtering

The comparative analysis of the influence of power filters on the disaggregation of electricity consumption metrics shows that, when evaluating the average of all appliances, the classical algorithms experience a slight improvement in absolute errors (MAE, RMSE) when applying a power filter (e.g., 100 W). However, this improvement is marginal and is accompanied by a degradation in the ability to detect events, especially in small and cyclic loads, as evidenced by the drop in the average F1-SCORE in the Hart85 algorithm. In contrast, the Mean algorithm demonstrates greater robustness, maintaining stable performance across all metrics despite filtering. Nonetheless, the improvements remained relatively limited (see Figure 7).
The results show that combining harmonics with deep learning techniques such as Seq2Point and RNN, along with low filter values, optimizes the F1-SCORE in the experimental framework analyzed. In contrast, the use of high filter values improves the RMSE in RNN, and the inclusion of harmonics benefits the performance of Seq2Seq, although it is detrimental to FHMM. Neural networks, specifically RNN and Seq2Point, exhibit up to 54% less error compared to traditional methods such as FHMM. To minimize NDE, neural networks—particularly Seq2Point and Seq2Seq—with medium to high filter values (50–100 W) and the use of harmonics, are the most efficient option, which contrasts with the results observed for F1-SCORE, where the Mean method stands out. Finally, to reduce MAE, neural networks, especially RNN with a filter value of 50 W and Seq2Point, prove to be optimal, while statistical methods such as Mean show significant limitations in numerical accuracy. This analysis highlights the superiority of neural architectures over conventional statistical approaches, both in terms of accuracy and adaptability to different evaluation metrics.
Neural network-based algorithms, evaluated with and without filtering despite their computational demands, demonstrate clear superiority over traditional methods across all average metrics, achieving superior MAE, RMSE, F1-SCORE, and NDE values. This indicates greater accuracy in estimating the energy consumed and detecting events in all appliances, not just high-power ones. Although neural models require more computational resources, their overall performance is superior, and they do not depend on filtering to obtain good results.
Figure 7 summarizes these findings, highlighting that filtering only partially benefits classical algorithms and can distort the overall evaluation by impairing the detection of minor loads. At the same time, neural networks offer more balanced and accurate performance across the complete set of devices.

3.2.2. Effect of Appliance Type

The performance of the algorithms used for energy disaggregation depends mainly on the type of appliance analyzed. In the case of appliances with stable and straightforward consumption patterns, such as televisions, fans, or incandescent lamps, classical algorithms, especially the Mean method, usually offer the best results in terms of accuracy and stability, showing excellent robustness to variations in the electrical signal. However, in high-powered devices or devices with more complex consumption patterns, such as electric ovens, heaters, or vacuum cleaners, errors in metrics increase for all algorithms. In these cases, models based on deep neural networks achieve better results in error metrics and event detection, outperforming classical methods, especially when power filters are not applied (see Figure 8a–d).
Traditional algorithms exhibit reduced performance for appliances that exhibit cyclical or intermittent patterns, such as refrigerators, freezers, or microwaves, especially if filters are applied. Neural models maintain competitive values and can better model these complex patterns. In general, no algorithm is superior in all cases: the Mean method is efficient for simple loads but loses competitiveness in more complicated scenarios. Algorithms such as Hart85 can improve in specific contexts. However, their performance is inconsistent, while neural methods offer a more balanced and robust performance against a greater variety of consumption patterns (but with more computational resources).
The choice of the most appropriate algorithm for energy disaggregation must consider both the type of appliance and the particularities of its consumption. Classical methods are recommended for stable and straightforward loads. At the same time, neural models are more appropriate for devices with complex consumption, high power, or cyclic patterns, provided that the necessary computational capacity is available.

3.2.3. Effect of the Sampling Rate

Sampling frequency is a key factor in energy disaggregation, as it determines the ability of the algorithms to detect rapid events and complex patterns in appliance consumption. For the classical algorithms, a higher sampling rate (1 s intervals) significantly improves the error metrics (MAE, RMSE, NDE) and the event detection capability (F1-SCORE). As the interval between samples grows, errors increase and the F1-SCORE decreases, because brief events and rapid changes in consumption are lost, especially for small or intermittent loads. For example, the average MAE of the Mean algorithm drops from about 400 to 280 W, and its F1-SCORE improves from 0.48 to 0.51, when the sampling interval shortens from 90 s to 1 s, reflecting the importance of high-frequency measurement for accurate disaggregation (see Table 1). At higher sampling frequencies, the classical algorithms show smaller errors and higher F1-SCOREs, especially Mean and CO, while lowering the frequency (60 s or 90 s intervals) increases errors and decreases the F1-SCORE; Hart85 and FHMM are less sensitive in F1-SCORE, and their errors also grow more slowly, with Mean standing out as the most robust and stable algorithm.
In the case of algorithms based on deep neural networks, this study only evaluates the 1 s sampling condition because of their high computational cost. Although there are no direct data on the impact of reducing the frequency in these models, it can be inferred that, as with the classical models, a lower frequency would degrade their performance by discarding relevant temporal information. The neural models, evaluated at 1 s, present better average metrics than the classical models at any frequency, suggesting that they make the most of the available temporal detail.
In summary, a high sampling rate is preferable to achieve maximum accuracy in energy disaggregation. However, it must be balanced with computational cost, especially in real-time or resource-constrained applications.
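The loss of brief events at low sampling rates can be illustrated with a small sketch on synthetic data (not the study's dataset): downsampling a 1 s power series to 60 s averages a short appliance activation into the background load.

```python
import pandas as pd

# Synthetic 1 s aggregate power: a 5 s, 1000 W event inside 120 s of 50 W base load.
idx = pd.date_range("2025-01-01", periods=120, freq="1s")
power = pd.Series(50.0, index=idx)
power.iloc[30:35] = 1000.0  # brief appliance activation

# Downsampling to 60 s smears the 5 s spike into a single interval mean,
# so threshold-based event detection would miss it entirely.
low_rate = power.resample("60s").mean()
peak_1s, peak_60s = power.max(), low_rate.max()
# The 1000 W peak survives at 1 s but collapses to roughly 129 W at 60 s.
```

This mirrors the behavior reported for small and intermittent loads: the event still contributes energy, but its signature becomes indistinguishable from the base load.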

3.2.4. Effect of Harmonic Content

The presence of harmonics in the electrical signal affects the performance of energy disaggregation algorithms to varying degrees. For the models based on deep neural networks, the metrics were obtained with harmonics present. Under these conditions, these algorithms show robust and competitive performance, with an average F1-SCORE higher than most classical algorithms and errors generally lower for most appliances, especially in 1 s sampling setups without a power filter. Harmonics therefore do not prevent these models from detecting events and estimating power well, and they surpass the classical algorithms in most cases under the same conditions.
Classical algorithms were evaluated with and without harmonic content. Direct comparison of average metrics shows that the presence of harmonics does not produce drastic changes in the overall performance of these methods. However, it may cause slight variations in error metrics and event detection capability, depending on the type of appliance and experimental configuration. In general, Mean maintains its robustness and stability with or without harmonics, while Hart85, CO, and FHMM show small fluctuations in their metrics, but without a clear pattern of systematic deterioration. Harmonics can slightly affect the accuracy of event detection and slightly increase estimation errors in some devices. Still, the effect is usually less than that caused by other factors, such as the sampling rate or the application of filters.
In summary, the effect of the harmonic content on energy disaggregation metrics is limited compared to other experimental factors. Neural network-based algorithms maintain competitive performance even with harmonic content, and classical algorithms exhibit only slight variations in their metrics under these conditions. Therefore, the presence of harmonics does not represent a significant obstacle to the effectiveness of deep neural models in energy disaggregation tasks.
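A small numeric sketch helps explain why harmonic content has a limited impact on average-power metrics: adding 3rd and 5th harmonic components (amplitudes chosen arbitrarily here for illustration) raises the RMS value of the waveform only modestly.

```python
import numpy as np

fs, f0 = 5000, 50  # sampling frequency and mains fundamental (Hz)
t = np.arange(0, 1, 1 / fs)

# Fundamental current plus 3rd and 5th harmonics (illustrative amplitudes).
fundamental = 10.0 * np.sin(2 * np.pi * f0 * t)
harmonics = 2.0 * np.sin(2 * np.pi * 3 * f0 * t) + 1.0 * np.sin(2 * np.pi * 5 * f0 * t)
distorted = fundamental + harmonics

# RMS comparison over an integer number of cycles: the harmonics add in
# quadrature, so even sizable distortion shifts the RMS by only a few percent,
# consistent with their limited effect on average-power disaggregation metrics.
rms_clean = np.sqrt(np.mean(fundamental**2))
rms_dist = np.sqrt(np.mean(distorted**2))
```

Because the harmonic components are orthogonal to the fundamental, their power adds rather than their amplitude, which keeps aggregate power features, and hence most disaggregation metrics, close to the harmonic-free case.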

3.3. Disaggregated Data and Execution Times

The disparity in execution times between algorithms directly impacts their practical applicability: traditional methods are considerably faster than deep learning-based approaches, and visual fidelity in reconstructing individual signals generally correlates with longer execution time. Traditional algorithms stand out for their computational efficiency, although their performance and visual detail vary.
Mean is the fastest algorithm, with extremely low execution times (0.91 s for the visualization), although it offers a considerably simplified disaggregation, as shown in Figure 9a. In terms of metrics, Mean is extremely efficient and accurate for stable, straightforward loads such as TVs or fans, showing low MAEs for these (e.g., TV 20.89 W, fan 15.89 W, under 1 s sampling, with harmonics, no filter). However, its limited accuracy restricts its usefulness for appliances with complex or high-power patterns, which usually yield the highest MAE (e.g., 910.21 W for the electric oven). Despite this limitation, the Mean algorithm remains robust to variations, maintaining stable metrics even with filtering.
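The behavior described for Mean follows from its construction. The following is a simplified illustration of the idea behind a mean-based benchmark, not NILMTK's exact implementation:

```python
# Simplified illustration of a mean-based disaggregation benchmark: each
# appliance is predicted as its constant training-period mean, regardless of
# the aggregate signal. This is a conceptual sketch, not NILMTK's exact code.

def train_means(training_data: dict[str, list[float]]) -> dict[str, float]:
    """Learn one number per appliance: its average power during training."""
    return {name: sum(vals) / len(vals) for name, vals in training_data.items()}

def disaggregate(means: dict[str, float], n_samples: int) -> dict[str, list[float]]:
    """Predict the learned mean at every time step for every appliance."""
    return {name: [mu] * n_samples for name, mu in means.items()}

means = train_means({"tv": [0.0, 80.0, 80.0, 0.0], "fan": [40.0, 40.0, 40.0, 40.0]})
pred = disaggregate(means, n_samples=3)
# The TV prediction is a flat 40 W line: cheap and stable for steady loads,
# but unable to follow switching events, hence the poor fit for complex loads.
```

A flat prediction is hard to beat on a near-constant load, which explains both the excellent metrics for fans and TVs and the very high MAE for the electric oven.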
The CO algorithm is fast and maintains low, stable execution times (2.39 s for the visualization with harmonics), and it offers a remarkable level of disaggregation detail for its low computational cost, often higher than that of the other traditional methods (Figure 9b). However, its behavior can be more erratic: it is sensitive to the sampling rate and to filters, and it can deteriorate significantly for cyclical appliances such as the refrigerator or freezer. The 1 s configuration with harmonic content and no filter features high MAEs for the fridge (155.15 W) and the freezer (103.80 W), which aligns with the observation that harmonics can have a negative effect.
The FHMM algorithm has a considerably longer execution time than Mean and CO (25.67 s), which limits its viability in rapid-response scenarios. Despite the increased computational effort, the level of visual detail in its disaggregation is surprisingly modest compared to CO (Figure 10a). Regarding the MAE values in the 1 s configuration with harmonic content and no filter, they range from relatively low values for appliances such as the TV (18.56 W) to substantially higher values for more complex devices, such as the electric oven (322.33 W).
Hart85 exhibits slower execution than Mean and CO but is still manageable for non-strict environments (Figure 10b). Without filters, its MAE tends to be high for several appliances (TV 33.72 W, fan 23.88 W, fridge 26.61 W, freezer 24.34 W), and its performance improves with the application of power filters. Its dependence on the sampling rate is extreme, with near-zero event detection performance at high frequencies without filters.
Algorithms based on deep neural networks offer greater fidelity in reconstructing individual signals at the cost of much higher computational demand. Owing to that cost, they were evaluated only in the 1 s sampling configuration, with harmonic content and without a power filter, and under these conditions they outperform the classical algorithms in all average metrics.
DAE has a considerable execution time for visualization (555.63 s) with a window size of 99. It offers a much richer level of disaggregation, identifying specific consumption patterns. The MAEs in the 1 s configuration with harmonic content (Figure 11a) are generally low (e.g., microwave 108.51 W, TV 19.61 W, fan 14.59 W).
The scale of the Y-axis varies between algorithms because each NILMTK disaggregation method (such as Mean, CO, FHMM, or DNN) estimates appliance power from the total signal in a different way: simpler algorithms such as Mean tend to smooth the values and underestimate peaks, resulting in lower maximum power values. In contrast, others, such as CO, are more sensitive to abrupt changes and can estimate significantly higher maximum values. These differences reflect the unique behavior of each algorithm when interpreting and separating the power signal, so the variation in scale does not indicate an error, but rather the diversity in how each model processes and represents energy consumption.
RNN, while one of the most computationally expensive methods (11,892.43 s), offers a remarkable improvement in visual accuracy (Figure 11b). However, this benefit must be weighed against the high time cost, making it a suitable choice for applications where visual accuracy is a priority and computational resources are not a constraint.
Seq2Point has a runtime of 2012.42 s for the visualization. Its consistent disaggregated profile shows a good balance between complexity and detail (Figure 12a), and its MAE is very low for several appliances (e.g., microwave 55.76 W, vacuum cleaner 87.95 W, electric shower heater 273.68 W, freezer 12.61 W). Seq2Seq completed the task in 857.34 s for the visualization, using a sequence length of 99, 10 epochs, and a batch size of 512. Its visual performance aligns with the other neural models, evidencing good signal reconstruction, especially for complex loads (Figure 12b), and it achieves the lowest MAE for the microwave (51.14 W) and the electric shower heater (226.71 W).
WindowGRU is the algorithm that requires the longest processing time for the visualization (16,283.86 s). Despite this much higher computational cost, the level of disaggregation detail it produces is comparable to that of the other deep learning models (Figure 13), offering no clear advantage in output quality and suggesting that cost does not always translate into a commensurate improvement. Its MAEs are generally higher than those of the other neural models for various appliances.
In summary, there is a clear correlation between the execution time and the level of visual detail of the disaggregation results. Traditional methods offer fast times, although with less precision. At the same time, algorithms based on neural networks provide greater fidelity in reconstructing individual signals at the cost of a greater demand on computational resources. This information is key when choosing an algorithm according to the constraints and objectives of the practical application.
Figure 14 illustrates, on a logarithmic scale, the difference in execution times between the traditional and deep learning methods: the latter exceed the Mean and CO times by several orders of magnitude, which limits their viability in scenarios where response time is a critical factor.
The analysis of average metrics (Figure 15) reveals a clear superiority of the deep learning algorithms (Seq2Point, Seq2Seq, RNN, and DAE) over the traditional methods, although with significant differences in execution times. Seq2Point stands out as the most balanced, with the best F1-SCORE (0.56), the lowest MAE (111.63 W), and the second-best RMSE (189.78 W), but it requires 2012.42 s of processing. Seq2Seq achieves the best NDE (0.62) and RMSE (188.77 W) in 857.34 s, although with a slightly lower F1-SCORE (0.53). Traditional methods such as CO (1.22 s) and FHMM (66.31 s) show average performance (F1-SCORE ≤ 0.464, MAE > 189 W), and Hart85 (8.67 s) presents a contradiction, combining good MAE/RMSE (144.05 W/282.7 W) with very poor F1-SCORE/NDE (0.114/0.957), suggesting numerical accuracy at the expense of detection reliability. Mean shows limitations, with the worst MAE (241.03 W) despite an acceptable F1-SCORE (0.5) and minimal runtime (1.31 s), while WindowGRU (F1-SCORE = 0.5, MAE = 150.05 W) positions itself as an intermediate option but with the highest computational time (16,283.86 s). This hierarchy confirms that the sequential models (Seq2Point/Seq2Seq) simultaneously optimize precision and consistency but require up to 34 min of execution, whereas the traditional methods excel in speed (≤1 min) at the cost of compromising key metrics, establishing a decisive trade-off between efficiency and performance.
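For reference, the four metrics compared above can be computed from predicted and ground-truth appliance power. This sketch follows common NILM definitions; the exact NDE form and the 10 W on/off threshold for the event-based F1 are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def nilm_metrics(y_true: np.ndarray, y_pred: np.ndarray, on_threshold: float = 10.0) -> dict:
    """MAE, RMSE, NDE, and on/off F1-SCORE following common NILM definitions."""
    err = y_pred - y_true
    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err**2)))
    # Normalized disaggregation error: total squared error relative to the
    # ground-truth signal energy (one common formulation among several).
    nde = float(np.sqrt(np.sum(err**2) / np.sum(y_true**2)))
    # Event-based F1: an appliance counts as "on" above the power threshold.
    true_on, pred_on = y_true >= on_threshold, y_pred >= on_threshold
    tp = np.sum(true_on & pred_on)
    fp = np.sum(~true_on & pred_on)
    fn = np.sum(true_on & ~pred_on)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"MAE": mae, "RMSE": rmse, "NDE": nde, "F1": float(f1)}

m = nilm_metrics(np.array([0.0, 100.0, 100.0, 0.0]), np.array([0.0, 90.0, 0.0, 20.0]))
```

The Hart85 pattern noted above (good MAE/RMSE but poor F1/NDE) shows why both families of metrics are needed: the regression metrics reward power levels that are numerically close, while F1 penalizes missed or spurious on/off events.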

4. Conclusions

The systematic analysis of the performance of energy disaggregation algorithms under different experimental conditions, considering both execution times and quantitative metrics such as MAE, RMSE, F1-SCORE, and NDE, reveals that there is no universally superior algorithm. The performance of each approach intrinsically depends on the type of appliance, the experimental configuration (sampling rate, harmonic content, and application of power filters), and the characteristics of the model. Traditional algorithms, such as Mean, CO, Hart85, and FHMM, excel at computational efficiency and speed, making them particularly suitable for real-time applications or resource-constrained environments. Mean offers remarkable robustness against experimental variations and maintains low errors for simple loads, although its simplicity limits its accuracy with complex consumption patterns. CO and Hart85 have specific advantages, such as low run times and significant improvements through preprocessing. However, their performance may be affected by the cyclical nature of certain appliances or by sensitivity to the sample rate and filters applied.
On the other hand, algorithms based on deep neural networks (DAE, RNN, Seq2Point, Seq2Seq, WindowGRU) demonstrate a superior ability to reconstruct individual signals and achieve better average metrics in most scenarios, especially in configurations with harmonic content and no filtering. These models excel at event detection and accurate power estimation and are better suited to the complexity and variability of modern appliances. However, their main limitation lies in their high computational cost, with execution times extending to several hours, which restricts their applicability in contexts where speed and efficiency are priorities.
The influence of experimental factors such as the sampling rate, the presence of harmonics, and the application of filters is also decisive, affecting each algorithm differently. While Mean is relatively insensitive to these factors, Hart85 and CO experience more noticeable variations, and neural networks maintain competitive performance even with harmonic content.
The selection of the most appropriate disaggregation algorithm must jointly consider the context of use, the type of loads to be monitored, the signal conditions, and especially the computational constraints of the environment. Traditional methods are preferable for applications that demand rapid response and efficiency, while deep neural network-based approaches are ideal when maximum accuracy is a priority and the necessary resources are available. This holistic view is critical to developing NILM solutions that are accurate, efficient, and adaptable to real-world demands.

Author Contributions

Conceptualization, C.R.-N. and A.A.; methodology, C.R.-N., F.P. and A.A.; software, C.R.-N. and A.A.; validation, C.R.-N., F.P., I.R. and A.A.; formal analysis, C.R.-N. and A.A.; investigation, C.R.-N., F.P., I.R. and A.A.; resources, C.R.-N. and A.A.; data curation, C.R.-N. and A.A.; writing—original draft preparation, C.R.-N., F.P., I.R. and A.A.; writing—review and editing, C.R.-N., F.P., I.R. and A.A.; visualization, C.R.-N., F.P. and I.R.; supervision, A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data will be made available upon request to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Sample rate = 60 s, no harmonics.
| Metric | Algorithm | EF 1 | Mw 2 | TV | IL 3 | VC 4 | ESH 5 | ESHw 6 | Fan | Fridge | Freezer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MAE (W) | FHMM | 354.42 | 185.36 | 25.34 | 52.20 | 238.14 | 221.24 | 226.21 | 15.17 | 17.41 | 22.22 |
| | CO | 646.49 | 239.28 | 18.12 | 44.82 | 223.13 | 397.39 | 566.38 | 16.55 | 34.98 | 27.12 |
| | Mean | 864.16 | 186.88 | 20.46 | 39.05 | 238.39 | 380.81 | 457.31 | 14.44 | 29.44 | 23.86 |
| | Hart85 | 597.70 | 143.10 | 125.59 | 117.37 | 190.18 | 281.99 | 397.07 | 121.80 | 111.55 | 118.88 |
| RMSE (W) | FHMM | 737.06 | 257.72 | 29.50 | 60.51 | 246.90 | 609.25 | 449.71 | 19.52 | 29.82 | 34.01 |
| | CO | 1000.95 | 294.46 | 23.68 | 54.70 | 337.63 | 884.72 | 729.60 | 20.95 | 43.99 | 34.48 |
| | Mean | 896.58 | 235.31 | 20.75 | 39.64 | 247.11 | 591.49 | 507.14 | 15.18 | 31.22 | 25.51 |
| | Hart85 | 1062.44 | 244.08 | 206.36 | 195.05 | 307.78 | 648.09 | 598.32 | 211.81 | 199.71 | 209.23 |
| F1 | FHMM | 0.69 | 0.57 | 0.36 | 0.63 | 0.26 | 0.44 | 0.71 | 0.47 | 0.67 | 0.18 |
| | CO | 0.39 | 0.51 | 0.77 | 0.44 | 0.36 | 0.17 | 0.33 | 0.75 | 0.41 | 0.50 |
| | Mean | 0.51 | 0.63 | 0.95 | 0.63 | 0.26 | 0.18 | 0.54 | 0.81 | 0.51 | 0.54 |
| | Hart85 | 0.00 | 0.48 | 0.43 | 0.55 | 0.00 | 0.00 | 0.24 | 0.37 | 0.58 | 0.40 |
| NDE | FHMM | 0.69 | 0.91 | 0.84 | 1.10 | 1.14 | 1.00 | 0.73 | 0.69 | 0.73 | 0.97 |
| | CO | 0.94 | 1.04 | 0.67 | 0.99 | 1.57 | 1.45 | 1.18 | 0.74 | 1.08 | 0.98 |
| | Mean | 0.84 | 0.83 | 0.59 | 0.72 | 1.15 | 0.97 | 0.82 | 0.54 | 0.77 | 0.73 |
| | Hart85 | 1.00 | 0.86 | 5.86 | 3.53 | 1.43 | 1.06 | 0.97 | 7.49 | 4.90 | 5.95 |
1 Electric furnace. 2 Microwave. 3 Incandescent lamp. 4 Vacuum cleaner. 5 Electric space heater. 6 Electric shower heater.
Table A2. Sample rate = 30 s, no harmonics.
| Metric | Algorithm | EF 1 | Mw 2 | TV | IL 3 | VC 4 | ESH 5 | ESHw 6 | Fan | Fridge | Freezer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MAE (W) | FHMM | 540.98 | 229.12 | 20.21 | 40.98 | 212.69 | 397.22 | 197.95 | 8.12 | 26.50 | 25.00 |
| | CO | 588.44 | 212.42 | 17.21 | 41.30 | 238.70 | 501.21 | 444.41 | 20.08 | 37.16 | 34.05 |
| | Mean | 874.51 | 192.20 | 20.43 | 39.23 | 241.93 | 394.17 | 471.71 | 14.63 | 30.73 | 25.39 |
| | Hart85 | 597.70 | 184.94 | 33.72 | 39.60 | 201.16 | 174.74 | 397.00 | 23.88 | 26.61 | 24.34 |
| RMSE (W) | FHMM | 975.05 | 307.53 | 20.57 | 55.44 | 358.10 | 967.18 | 435.45 | 14.88 | 41.11 | 37.55 |
| | CO | 924.22 | 283.05 | 22.74 | 50.43 | 364.84 | 936.05 | 634.70 | 24.23 | 45.41 | 40.59 |
| | Mean | 908.80 | 238.18 | 20.81 | 40.22 | 251.83 | 622.33 | 517.83 | 15.55 | 32.09 | 26.86 |
| | Hart85 | 1072.45 | 285.51 | 35.30 | 55.59 | 319.34 | 639.58 | 600.55 | 28.48 | 41.45 | 36.17 |
| F1 | FHMM | 0.59 | 0.43 | 0.96 | 0.45 | 0.24 | 0.00 | 0.67 | 0.81 | 0.50 | 0.49 |
| | CO | 0.50 | 0.42 | 0.81 | 0.65 | 0.31 | 0.29 | 0.43 | 0.54 | 0.44 | 0.46 |
| | Mean | 0.48 | 0.60 | 0.96 | 0.61 | 0.24 | 0.14 | 0.50 | 0.78 | 0.48 | 0.50 |
| | Hart85 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.29 | 0.00 | 0.00 | 0.00 |
| NDE | FHMM | 0.91 | 1.08 | 0.58 | 1.00 | 1.62 | 1.51 | 0.69 | 0.52 | 0.99 | 1.04 |
| | CO | 0.86 | 0.99 | 0.64 | 0.91 | 1.65 | 1.46 | 1.01 | 0.85 | 1.10 | 1.12 |
| | Mean | 0.85 | 0.83 | 0.59 | 0.72 | 1.14 | 0.97 | 0.83 | 0.55 | 0.77 | 0.74 |
| | Hart85 | 1.00 | 1.00 | 1.00 | 1.00 | 1.45 | 1.00 | 0.96 | 1.00 | 1.00 | 1.00 |
1 Electric furnace. 2 Microwave. 3 Incandescent lamp. 4 Vacuum cleaner. 5 Electric space heater. 6 Electric shower heater.
Table A3. Sample rate = 15 s, no harmonics.
| Metric | Algorithm | EF 1 | Mw 2 | TV | IL 3 | VC 4 | ESH 5 | ESHw 6 | Fan | Fridge | Freezer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MAE (W) | FHMM | 270.94 | 158.99 | 19.25 | 30.22 | 220.82 | 275.37 | 158.92 | 25.25 | 37.72 | 38.40 |
| | CO | 428.47 | 228.66 | 17.59 | 40.18 | 222.26 | 354.78 | 420.92 | 17.33 | 41.53 | 40.24 |
| | Mean | 905.79 | 196.08 | 20.80 | 41.08 | 249.24 | 394.17 | 491.82 | 15.68 | 31.69 | 26.38 |
| | Hart85 | 588.75 | 184.94 | 33.72 | 39.60 | 83.13 | 424.81 | 490.24 | 23.88 | 26.61 | 24.34 |
| RMSE (W) | FHMM | 700.01 | 262.26 | 25.71 | 48.21 | 377.24 | 805.02 | 412.83 | 28.69 | 51.30 | 47.48 |
| | CO | 755.60 | 319.00 | 22.39 | 52.48 | 351.14 | 746.42 | 634.23 | 21.65 | 53.74 | 47.91 |
| | Mean | 931.39 | 252.93 | 21.00 | 41.20 | 258.85 | 639.07 | 536.62 | 16.18 | 33.75 | 28.53 |
| | Hart85 | 906.99 | 297.93 | 35.42 | 56.31 | 228.76 | 735.33 | 671.09 | 28.83 | 42.75 | 37.43 |
| F1 | FHMM | 0.74 | 0.58 | 0.63 | 0.64 | 0.41 | 0.49 | 0.73 | 0.25 | 0.30 | 0.31 |
| | CO | 0.68 | 0.51 | 0.80 | 0.50 | 0.27 | 0.19 | 0.45 | 0.65 | 0.44 | 0.41 |
| | Mean | 0.47 | 0.58 | 0.95 | 0.59 | 0.23 | 0.14 | 0.47 | 0.75 | 0.47 | 0.47 |
| | Hart85 | 0.60 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.28 | 0.00 | 0.00 | 0.00 |
| NDE | FHMM | 0.64 | 0.88 | 0.73 | 0.86 | 1.65 | 1.23 | 0.64 | 1.00 | 1.20 | 1.27 |
| | CO | 0.69 | 1.07 | 0.63 | 0.93 | 1.54 | 1.14 | 0.99 | 0.75 | 1.26 | 1.28 |
| | Mean | 0.85 | 0.85 | 0.59 | 0.73 | 1.13 | 0.97 | 0.83 | 0.56 | 0.79 | 0.76 |
| | Hart85 | 0.83 | 1.00 | 1.00 | 1.00 | 1.00 | 1.12 | 1.04 | 1.00 | 1.00 | 1.00 |
1 Electric furnace. 2 Microwave. 3 Incandescent lamp. 4 Vacuum cleaner. 5 Electric space heater. 6 Electric shower heater.
Table A4. Sample rate = 1 s, no harmonics.
| Metric | Algorithm | EF 1 | Mw 2 | TV | IL 3 | VC 4 | ESH 5 | ESHw 6 | Fan | Fridge | Freezer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MAE (W) | FHMM | 443.89 | 342.70 | 20.25 | 35.90 | 247.73 | 208.33 | 522.40 | 18.58 | 31.71 | 24.26 |
| | CO | 560.63 | 292.57 | 20.72 | 41.56 | 259.81 | 391.28 | 428.68 | 16.53 | 131.05 | 109.20 |
| | Mean | 910.21 | 217.59 | 20.89 | 41.34 | 250.08 | 398.60 | 496.99 | 15.89 | 31.91 | 26.84 |
| | Hart85 | 423.29 | 184.94 | 33.72 | 39.60 | 83.13 | 174.74 | 426.21 | 23.88 | 26.61 | 24.34 |
| RMSE (W) | FHMM | 944.11 | 588.18 | 26.41 | 53.59 | 407.12 | 733.94 | 787.80 | 24.49 | 36.51 | 43.26 |
| | CO | 935.99 | 555.17 | 25.04 | 55.60 | 415.57 | 816.30 | 656.10 | 23.21 | 238.47 | 189.35 |
| | Mean | 935.22 | 416.91 | 21.06 | 41.42 | 259.79 | 649.81 | 541.39 | 16.38 | 36.66 | 35.86 |
| | Hart85 | 641.18 | 445.66 | 35.46 | 56.46 | 229.72 | 666.32 | 634.90 | 28.94 | 45.08 | 43.28 |
| F1 | FHMM | 0.49 | 0.27 | 0.60 | 0.58 | 0.31 | 0.60 | 0.26 | 0.53 | 0.46 | 0.54 |
| | CO | 0.49 | 0.43 | 0.78 | 0.42 | 0.29 | 0.27 | 0.36 | 0.65 | 0.32 | 0.33 |
| | Mean | 0.46 | 0.56 | 0.95 | 0.58 | 0.21 | 0.12 | 0.46 | 0.74 | 0.46 | 0.46 |
| | Hart85 | 0.81 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 | 0.00 | 0.00 | 0.00 |
| NDE | FHMM | 0.86 | 1.32 | 0.75 | 0.95 | 1.77 | 1.10 | 1.22 | 0.85 | 0.81 | 1.00 |
| | CO | 0.86 | 1.25 | 0.71 | 0.99 | 1.81 | 1.23 | 1.01 | 0.80 | 5.29 | 4.38 |
| | Mean | 0.85 | 0.94 | 0.59 | 0.73 | 1.13 | 0.98 | 0.84 | 0.57 | 0.81 | 0.83 |
| | Hart85 | 0.59 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.98 | 1.00 | 1.00 | 1.00 |
1 Electric furnace. 2 Microwave. 3 Incandescent lamp. 4 Vacuum cleaner. 5 Electric space heater. 6 Electric shower heater.
Table A5. Sample rate = 500 ms, no harmonics.
| Metric | Algorithm | EF 1 | Mw 2 | TV | IL 3 | VC 4 | ESH 5 | ESHw 6 | Fan | Fridge | Freezer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MAE (W) | FHMM | 566.98 | 418.88 | 20.68 | 44.82 | 152.62 | 468.49 | 496.90 | 14.37 | 29.03 | 31.24 |
| | CO | 550.79 | 299.35 | 14.99 | 39.49 | 222.44 | 476.96 | 392.84 | 20.52 | 151.17 | 109.34 |
| | Mean | 910.21 | 217.59 | 20.89 | 41.34 | 250.08 | 398.60 | 496.99 | 15.89 | 31.91 | 26.84 |
| | Hart85 | 423.29 | 184.94 | 33.72 | 39.60 | 83.13 | 174.74 | 426.21 | 23.88 | 26.61 | 24.34 |
| RMSE (W) | FHMM | 1068.42 | 611.31 | 20.83 | 60.42 | 318.94 | 1103.74 | 541.30 | 21.51 | 47.47 | 47.56 |
| | CO | 939.45 | 559.72 | 22.47 | 52.56 | 341.42 | 935.18 | 634.24 | 26.39 | 301.47 | 193.86 |
| | Mean | 935.22 | 416.91 | 21.06 | 41.42 | 259.79 | 649.81 | 541.39 | 16.38 | 36.66 | 35.86 |
| | Hart85 | 641.18 | 445.66 | 35.46 | 56.46 | 229.72 | 666.32 | 634.90 | 28.94 | 45.08 | 43.28 |
| F1 | FHMM | 0.42 | 0.38 | 0.95 | 0.23 | 0.14 | 0.00 | 0.46 | 0.59 | 0.40 | 0.15 |
| | CO | 0.56 | 0.39 | 0.77 | 0.51 | 0.20 | 0.16 | 0.45 | 0.54 | 0.37 | 0.39 |
| | Mean | 0.46 | 0.56 | 0.95 | 0.58 | 0.21 | 0.12 | 0.46 | 0.74 | 0.45 | 0.46 |
| | Hart85 | 0.81 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 | 0.00 | 0.00 | 0.00 |
| NDE | FHMM | 0.98 | 1.37 | 0.59 | 1.07 | 1.39 | 1.66 | 0.84 | 0.74 | 1.05 | 1.10 |
| | CO | 0.86 | 1.26 | 0.63 | 0.93 | 1.49 | 1.40 | 0.98 | 0.91 | 6.69 | 4.48 |
| | Mean | 0.85 | 0.94 | 0.59 | 0.73 | 1.13 | 0.98 | 0.84 | 0.57 | 0.81 | 0.83 |
| | Hart85 | 0.59 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.98 | 1.00 | 1.00 | 1.00 |
1 Electric furnace. 2 Microwave. 3 Incandescent lamp. 4 Vacuum cleaner. 5 Electric space heater. 6 Electric shower heater.
Table A6. Sample rate = 250 ms, no harmonics.
| Metric | Algorithm | EF 1 | Mw 2 | TV | IL 3 | VC 4 | ESH 5 | ESHw 6 | Fan | Fridge | Freezer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MAE (W) | FHMM | 443.89 | 342.70 | 20.33 | 35.87 | 247.73 | 208.33 | 522.40 | 18.60 | 31.67 | 24.34 |
| | CO | 518.94 | 268.71 | 20.92 | 45.51 | 188.73 | 444.67 | 406.76 | 16.34 | 179.36 | 100.08 |
| | Mean | 910.21 | 217.59 | 20.89 | 41.34 | 250.08 | 398.60 | 496.99 | 15.89 | 31.91 | 26.84 |
| | Hart85 | 423.29 | 184.94 | 33.72 | 39.60 | 83.13 | 174.74 | 426.21 | 23.88 | 26.61 | 24.34 |
| RMSE (W) | FHMM | 944.11 | 588.18 | 26.46 | 53.57 | 407.12 | 733.94 | 787.80 | 24.51 | 36.51 | 43.31 |
| | CO | 895.57 | 526.80 | 25.36 | 55.85 | 312.39 | 886.49 | 650.14 | 22.98 | 341.74 | 180.05 |
| | Mean | 935.22 | 416.91 | 21.06 | 41.42 | 259.79 | 649.81 | 541.39 | 16.38 | 36.66 | 35.86 |
| | Hart85 | 641.18 | 445.66 | 35.46 | 56.46 | 229.72 | 666.32 | 634.90 | 28.94 | 45.08 | 43.28 |
| F1 | FHMM | 0.49 | 0.27 | 0.60 | 0.58 | 0.31 | 0.60 | 0.26 | 0.52 | 0.45 | 0.54 |
| | CO | 0.54 | 0.46 | 0.76 | 0.48 | 0.23 | 0.21 | 0.45 | 0.67 | 0.38 | 0.34 |
| | Mean | 0.46 | 0.56 | 0.95 | 0.58 | 0.21 | 0.12 | 0.46 | 0.74 | 0.45 | 0.46 |
| | Hart85 | 0.81 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 | 0.00 | 0.00 | 0.00 |
| NDE | FHMM | 0.86 | 1.32 | 0.75 | 0.95 | 1.77 | 1.10 | 1.22 | 0.85 | 0.81 | 1.00 |
| | CO | 0.82 | 1.18 | 0.72 | 0.99 | 1.36 | 1.33 | 1.00 | 0.79 | 7.58 | 4.16 |
| | Mean | 0.85 | 0.94 | 0.59 | 0.73 | 1.13 | 0.98 | 0.84 | 0.57 | 0.81 | 0.83 |
| | Hart85 | 0.59 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.98 | 1.00 | 1.00 | 1.00 |
1 Electric furnace. 2 Microwave. 3 Incandescent lamp. 4 Vacuum cleaner. 5 Electric space heater. 6 Electric shower heater.
Table A7. Sample rate = 125 ms, no harmonics.
| Metric | Algorithm | EF 1 | Mw 2 | TV | IL 3 | VC 4 | ESH 5 | ESHw 6 | Fan | Fridge | Freezer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MAE (W) | FHMM | 322.33 | 353.97 | 19.23 | 26.42 | 199.36 | 82.46 | 463.42 | 18.31 | 26.97 | 34.53 |
| | CO | 560.15 | 261.43 | 19.80 | 45.39 | 260.42 | 390.16 | 428.06 | 16.14 | 138.91 | 98.95 |
| | Mean | 910.21 | 217.59 | 20.89 | 41.34 | 250.08 | 398.60 | 496.99 | 15.89 | 31.91 | 26.84 |
| | Hart85 | 423.29 | 184.94 | 33.72 | 39.60 | 83.13 | 174.74 | 426.21 | 23.88 | 26.61 | 24.34 |
| RMSE (W) | FHMM | 801.09 | 594.44 | 25.71 | 45.84 | 364.99 | 458.77 | 741.95 | 24.42 | 45.50 | 49.34 |
| | CO | 922.21 | 517.61 | 24.14 | 55.30 | 415.71 | 839.33 | 655.60 | 22.83 | 275.43 | 183.00 |
| | Mean | 935.22 | 416.91 | 21.06 | 41.42 | 259.79 | 649.81 | 541.39 | 16.38 | 36.66 | 35.86 |
| | Hart85 | 641.18 | 445.66 | 35.46 | 56.46 | 229.72 | 666.32 | 634.90 | 28.94 | 45.08 | 43.28 |
| F1 | FHMM | 0.66 | 0.29 | 0.61 | 0.64 | 0.38 | 0.81 | 0.38 | 0.48 | 0.56 | 0.39 |
| | CO | 0.52 | 0.37 | 0.77 | 0.50 | 0.30 | 0.21 | 0.43 | 0.67 | 0.37 | 0.40 |
| | Mean | 0.46 | 0.56 | 0.95 | 0.58 | 0.21 | 0.12 | 0.46 | 0.74 | 0.45 | 0.46 |
| | Hart85 | 0.81 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 | 0.00 | 0.00 | 0.00 |
| NDE | FHMM | 0.73 | 1.33 | 0.73 | 0.81 | 1.59 | 0.69 | 1.15 | 0.84 | 1.01 | 1.14 |
| | CO | 0.84 | 1.16 | 0.68 | 0.98 | 1.81 | 1.26 | 1.01 | 0.79 | 6.11 | 4.23 |
| | Mean | 0.85 | 0.94 | 0.59 | 0.73 | 1.13 | 0.98 | 0.84 | 0.57 | 0.81 | 0.83 |
| | Hart85 | 0.59 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.98 | 1.00 | 1.00 | 1.00 |
1 Electric furnace. 2 Microwave. 3 Incandescent lamp. 4 Vacuum cleaner. 5 Electric space heater. 6 Electric shower heater.
Table A8. Sample rate = 90 s, with harmonics.
| Metric | Algorithm | EF 1 | Mw 2 | TV | IL 3 | VC 4 | ESH 5 | ESHw 6 | Fan | Fridge | Freezer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MAE (W) | FHMM | 463.18 | 233.01 | 15.37 | 32.27 | 196.84 | 559.81 | 227.07 | 18.06 | 35.88 | 26.64 |
| | CO | 456.30 | 249.37 | 17.33 | 30.14 | 214.61 | 170.71 | 425.78 | 18.62 | 36.39 | 28.26 |
| | Mean | 853.69 | 174.59 | 20.41 | 38.08 | 245.06 | 391.05 | 426.06 | 13.53 | 27.97 | 22.55 |
| | Hart85 | 583.62 | 159.26 | 57.76 | 51.44 | 134.76 | 208.15 | 358.44 | 53.29 | 43.50 | 48.99 |
| RMSE (W) | FHMM | 862.95 | 298.61 | 22.07 | 47.27 | 333.08 | 1052.38 | 392.34 | 22.77 | 45.05 | 35.36 |
| | CO | 797.28 | 305.28 | 21.99 | 41.20 | 319.84 | 621.12 | 604.56 | 22.35 | 45.00 | 35.14 |
| | Mean | 881.07 | 226.56 | 20.74 | 39.06 | 254.08 | 605.13 | 480.31 | 14.66 | 30.32 | 24.31 |
| | Hart85 | 1038.53 | 238.02 | 75.32 | 71.34 | 251.77 | 626.67 | 577.46 | 79.29 | 69.23 | 76.12 |
| F1 | FHMM | 0.53 | 0.32 | 0.72 | 0.59 | 0.50 | 0.00 | 0.60 | 0.63 | 0.32 | 0.62 |
| | CO | 0.70 | 0.34 | 0.83 | 0.67 | 0.32 | 0.00 | 0.53 | 0.76 | 0.53 | 0.65 |
| | Mean | 0.49 | 0.70 | 0.94 | 0.63 | 0.30 | 0.13 | 0.60 | 0.81 | 0.56 | 0.60 |
| | Hart85 | 0.00 | 0.48 | 0.39 | 0.53 | 0.00 | 0.00 | 0.22 | 0.32 | 0.59 | 0.44 |
| NDE | FHMM | 0.83 | 1.09 | 0.63 | 0.85 | 1.40 | 1.69 | 0.66 | 0.83 | 1.13 | 1.04 |
| | CO | 0.77 | 1.11 | 0.62 | 0.74 | 1.35 | 1.00 | 1.02 | 0.81 | 1.13 | 1.03 |
| | Mean | 0.85 | 0.83 | 0.59 | 0.70 | 1.07 | 0.97 | 0.81 | 0.53 | 0.76 | 0.72 |
| | Hart85 | 1.00 | 0.87 | 2.13 | 1.28 | 1.06 | 1.01 | 0.97 | 2.87 | 1.74 | 2.24 |
1 Electric furnace. 2 Microwave. 3 Incandescent lamp. 4 Vacuum cleaner. 5 Electric space heater. 6 Electric shower heater.
Table A9. Sample rate = 60 s, with harmonics.
| Metric | Algorithm | EF 1 | Mw 2 | TV | IL 3 | VC 4 | ESH 5 | ESHw 6 | Fan | Fridge | Freezer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MAE (W) | FHMM | 684.08 | 153.69 | 25.42 | 43.44 | 137.02 | 476.60 | 195.71 | 19.02 | 27.58 | 26.01 |
| | CO | 646.49 | 239.28 | 18.12 | 44.82 | 223.13 | 397.39 | 566.38 | 16.55 | 34.98 | 27.12 |
| | Mean | 864.16 | 186.88 | 20.46 | 39.05 | 238.39 | 380.81 | 457.31 | 14.44 | 29.44 | 23.86 |
| | Hart85 | 597.70 | 143.10 | 125.59 | 117.37 | 190.18 | 281.99 | 397.07 | 121.80 | 111.55 | 118.88 |
| RMSE (W) | FHMM | 1018.62 | 234.54 | 28.73 | 56.13 | 284.43 | 1005.85 | 413.97 | 22.62 | 42.16 | 36.80 |
| | CO | 1000.95 | 294.46 | 23.68 | 54.70 | 337.63 | 884.72 | 729.60 | 20.95 | 43.99 | 34.48 |
| | Mean | 896.58 | 235.31 | 20.75 | 39.64 | 247.11 | 591.49 | 507.14 | 15.18 | 31.22 | 25.51 |
| | Hart85 | 1062.44 | 244.08 | 206.36 | 195.05 | 307.78 | 648.09 | 598.32 | 211.81 | 199.71 | 209.23 |
| F1 | FHMM | 0.51 | 0.67 | 0.28 | 0.37 | 0.43 | 0.00 | 0.72 | 0.32 | 0.10 | 0.16 |
| | CO | 0.38 | 0.51 | 0.77 | 0.43 | 0.36 | 0.17 | 0.33 | 0.75 | 0.41 | 0.50 |
| | Mean | 0.51 | 0.63 | 0.95 | 0.63 | 0.26 | 0.18 | 0.54 | 0.81 | 0.51 | 0.54 |
| | Hart85 | 0.00 | 0.48 | 0.43 | 0.55 | 0.00 | 0.00 | 0.24 | 0.37 | 0.58 | 0.40 |
| NDE | FHMM | 0.96 | 0.83 | 0.82 | 1.02 | 1.32 | 1.65 | 0.67 | 0.80 | 1.03 | 1.05 |
| | CO | 0.94 | 1.04 | 0.67 | 0.99 | 1.56 | 1.45 | 1.18 | 0.74 | 1.08 | 0.98 |
| | Mean | 0.84 | 0.83 | 0.59 | 0.72 | 1.15 | 0.97 | 0.82 | 0.54 | 0.77 | 0.73 |
| | Hart85 | 1.00 | 0.86 | 5.85 | 3.53 | 1.43 | 1.06 | 0.97 | 7.49 | 4.90 | 5.95 |
1 Electric furnace. 2 Microwave. 3 Incandescent lamp. 4 Vacuum cleaner. 5 Electric space heater. 6 Electric shower heater.
Table A10. Sample rate = 30 s, with harmonics.
| Metric | Algorithm | EF 1 | Mw 2 | TV | IL 3 | VC 4 | ESH 5 | ESHw 6 | Fan | Fridge | Freezer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MAE (W) | FHMM | 537.34 | 249.81 | 14.51 | 38.16 | 225.38 | 397.22 | 184.59 | 17.05 | 25.56 | 35.89 |
| | CO | 588.44 | 212.42 | 17.21 | 41.30 | 238.70 | 501.21 | 444.41 | 20.08 | 37.16 | 34.05 |
| | Mean | 874.51 | 192.20 | 20.43 | 39.23 | 241.93 | 394.17 | 471.71 | 14.63 | 30.73 | 25.39 |
| | Hart85 | 597.70 | 184.94 | 33.72 | 39.60 | 201.16 | 174.74 | 397.00 | 23.88 | 26.61 | 24.34 |
| RMSE (W) | FHMM | 971.60 | 324.67 | 22.03 | 53.36 | 375.21 | 967.18 | 418.28 | 22.48 | 40.71 | 44.45 |
| | CO | 924.22 | 283.05 | 22.74 | 50.43 | 364.84 | 936.05 | 634.70 | 24.23 | 45.41 | 40.59 |
| | Mean | 908.80 | 238.18 | 20.81 | 40.22 | 251.83 | 622.33 | 517.83 | 15.55 | 32.09 | 26.86 |
| | Hart85 | 1072.45 | 285.51 | 35.30 | 55.59 | 319.34 | 639.58 | 600.55 | 28.48 | 41.45 | 36.17 |
| F1 | FHMM | 0.62 | 0.32 | 0.72 | 0.49 | 0.12 | 0.00 | 0.68 | 0.46 | 0.53 | 0.38 |
| | CO | 0.50 | 0.42 | 0.81 | 0.65 | 0.31 | 0.29 | 0.43 | 0.53 | 0.44 | 0.45 |
| | Mean | 0.48 | 0.60 | 0.96 | 0.61 | 0.24 | 0.14 | 0.50 | 0.78 | 0.48 | 0.50 |
| | Hart85 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.29 | 0.00 | 0.00 | 0.00 |
| NDE | FHMM | 0.91 | 1.14 | 0.62 | 0.96 | 1.70 | 1.51 | 0.67 | 0.79 | 0.98 | 1.23 |
| | CO | 0.86 | 0.99 | 0.64 | 0.91 | 1.65 | 1.46 | 1.01 | 0.85 | 1.10 | 1.12 |
| | Mean | 0.85 | 0.83 | 0.59 | 0.72 | 1.14 | 0.97 | 0.82 | 0.55 | 0.77 | 0.74 |
| | Hart85 | 1.00 | 1.00 | 1.00 | 1.00 | 1.45 | 1.00 | 0.96 | 1.00 | 1.00 | 1.00 |
1 Electric furnace. 2 Microwave. 3 Incandescent lamp. 4 Vacuum cleaner. 5 Electric space heater. 6 Electric shower heater.
Table A11. Sample rate = 15 s, with harmonics.
| Metric | Algorithm | EF 1 | Mw 2 | TV | IL 3 | VC 4 | ESH 5 | ESHw 6 | Fan | Fridge | Freezer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MAE (W) | FHMM | 412.52 | 293.62 | 19.70 | 51.50 | 181.65 | 395.56 | 491.77 | 15.66 | 29.54 | 26.51 |
| | CO | 491.85 | 264.23 | 17.84 | 42.04 | 211.51 | 421.32 | 434.90 | 17.86 | 42.88 | 40.10 |
| | Mean | 905.79 | 196.08 | 20.80 | 41.08 | 249.24 | 394.17 | 491.82 | 15.68 | 31.69 | 26.38 |
| | Hart85 | 588.75 | 184.94 | 33.72 | 39.60 | 83.13 | 424.81 | 490.24 | 23.88 | 26.61 | 24.34 |
| RMSE (W) | FHMM | 870.97 | 353.94 | 26.20 | 64.09 | 344.96 | 991.87 | 536.55 | 16.19 | 45.52 | 28.55 |
| | CO | 860.52 | 349.78 | 21.97 | 54.01 | 325.13 | 847.40 | 638.80 | 22.02 | 55.32 | 48.29 |
| | Mean | 931.39 | 252.93 | 21.00 | 41.20 | 258.85 | 639.07 | 536.62 | 16.18 | 33.75 | 28.53 |
| | Hart85 | 906.99 | 297.93 | 35.42 | 56.31 | 228.76 | 735.33 | 671.09 | 28.83 | 42.75 | 37.43 |
| F1 | FHMM | 0.67 | 0.32 | 0.59 | 0.19 | 0.04 | 0.00 | 0.47 | 0.75 | 0.45 | 0.47 |
| | CO | 0.56 | 0.40 | 0.81 | 0.48 | 0.34 | 0.09 | 0.41 | 0.66 | 0.43 | 0.35 |
| | Mean | 0.47 | 0.58 | 0.95 | 0.59 | 0.23 | 0.14 | 0.47 | 0.75 | 0.47 | 0.47 |
| | Hart85 | 0.60 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.28 | 0.00 | 0.00 | 0.00 |
| NDE | FHMM | 0.80 | 1.19 | 0.74 | 1.14 | 1.51 | 1.51 | 0.83 | 0.56 | 1.06 | 0.76 |
| | CO | 0.79 | 1.17 | 0.62 | 0.96 | 1.42 | 1.29 | 0.99 | 0.76 | 1.29 | 1.29 |
| | Mean | 0.85 | 0.85 | 0.59 | 0.73 | 1.13 | 0.97 | 0.83 | 0.56 | 0.79 | 0.76 |
| | Hart85 | 0.83 | 1.00 | 1.00 | 1.00 | 1.00 | 1.12 | 1.04 | 1.00 | 1.00 | 1.00 |
1 Electric furnace. 2 Microwave. 3 Incandescent lamp. 4 Vacuum cleaner. 5 Electric space heater. 6 Electric shower heater.
Table A12. Sample rate = 1 s, with harmonics.
| Metric | Algorithm | EF 1 | Mw 2 | TV | IL 3 | VC 4 | ESH 5 | ESHw 6 | Fan | Fridge | Freezer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MAE (W) | FHMM | 322.33 | 353.97 | 18.56 | 26.45 | 199.36 | 82.46 | 463.42 | 18.91 | 28.20 | 34.55 |
| | CO | 553.66 | 275.81 | 16.33 | 35.77 | 264.40 | 355.04 | 411.24 | 18.14 | 155.15 | 103.80 |
| | Mean | 910.21 | 217.59 | 20.89 | 41.34 | 250.08 | 398.60 | 496.99 | 15.89 | 31.91 | 26.84 |
| | Hart85 | 423.29 | 184.94 | 33.72 | 39.60 | 83.13 | 174.74 | 426.21 | 23.88 | 26.61 | 24.34 |
| RMSE (W) | FHMM | 801.09 | 594.44 | 25.21 | 45.87 | 364.99 | 458.77 | 741.95 | 24.82 | 46.23 | 49.64 |
| | CO | 914.17 | 534.73 | 23.77 | 49.07 | 419.65 | 785.20 | 647.23 | 24.61 | 296.09 | 184.01 |
| | Mean | 935.22 | 416.91 | 21.06 | 41.42 | 259.79 | 649.81 | 541.39 | 16.38 | 36.66 | 35.86 |
| | Hart85 | 641.18 | 445.66 | 35.46 | 56.46 | 229.72 | 666.32 | 634.90 | 28.94 | 45.08 | 43.28 |
| F1 | FHMM | 0.66 | 0.29 | 0.62 | 0.64 | 0.38 | 0.81 | 0.38 | 0.46 | 0.55 | 0.38 |
| | CO | 0.54 | 0.39 | 0.74 | 0.53 | 0.26 | 0.24 | 0.43 | 0.58 | 0.43 | 0.39 |
| | Mean | 0.46 | 0.56 | 0.95 | 0.58 | 0.21 | 0.12 | 0.46 | 0.74 | 0.45 | 0.46 |
| | Hart85 | 0.81 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 | 0.00 | 0.00 | 0.00 |
| NDE | FHMM | 0.73 | 1.33 | 0.71 | 0.81 | 1.59 | 0.69 | 1.15 | 0.86 | 1.03 | 1.15 |
| | CO | 0.84 | 1.20 | 0.67 | 0.87 | 1.83 | 1.18 | 1.00 | 0.85 | 6.57 | 4.25 |
| | Mean | 0.85 | 0.94 | 0.59 | 0.73 | 1.13 | 0.98 | 0.84 | 0.57 | 0.81 | 0.83 |
| | Hart85 | 0.59 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.98 | 1.00 | 1.00 | 1.00 |
1 Electric furnace. 2 Microwave. 3 Incandescent lamp. 4 Vacuum cleaner. 5 Electric space heater. 6 Electric shower heater.
Table A13. Sample rate = 1 s, 10 W filter, no harmonics.
| Metric | Algorithm | EF 1 | Mw 2 | TV | IL 3 | VC 4 | ESH 5 | ESHw 6 | Fan | Fridge | Freezer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MAE (W) | FHMM | 512.31 | 339.54 | 20.68 | 41.27 | 274.60 | 321.51 | 584.11 | 16.45 | 37.14 | 12.75 |
| | CO | 552.22 | 284.78 | 14.54 | 44.67 | 209.34 | 381.89 | 366.92 | 19.28 | 140.38 | 100.55 |
| | Mean | 910.21 | 217.59 | 20.89 | 41.34 | 250.08 | 398.60 | 496.99 | 15.89 | 31.91 | 26.84 |
| | Hart85 | 423.29 | 184.94 | 33.72 | 39.60 | 83.13 | 174.74 | 426.21 | 23.88 | 26.61 | 24.34 |
| RMSE (W) | FHMM | 1012.74 | 587.03 | 20.83 | 41.33 | 428.77 | 913.30 | 833.08 | 23.05 | 52.62 | 34.53 |
| | CO | 926.43 | 549.83 | 22.16 | 55.96 | 332.86 | 846.11 | 611.52 | 23.88 | 277.65 | 169.94 |
| | Mean | 935.22 | 416.91 | 21.06 | 41.42 | 259.79 | 649.81 | 541.39 | 16.38 | 36.66 | 35.86 |
| | Hart85 | 641.18 | 445.66 | 35.46 | 56.46 | 229.72 | 666.32 | 634.90 | 28.94 | 45.08 | 43.28 |
| F1 | FHMM | 0.39 | 0.26 | 0.95 | 0.58 | 0.35 | 0.45 | 0.06 | 0.54 | 0.45 | 0.74 |
| | CO | 0.55 | 0.50 | 0.78 | 0.46 | 0.25 | 0.09 | 0.52 | 0.61 | 0.42 | 0.35 |
| | Mean | 0.46 | 0.56 | 0.95 | 0.58 | 0.21 | 0.12 | 0.46 | 0.74 | 0.46 | 0.46 |
| | Hart85 | 0.81 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 | 0.00 | 0.00 | 0.00 |
| NDE | FHMM | 0.93 | 1.32 | 0.59 | 0.73 | 1.87 | 1.37 | 1.29 | 0.80 | 1.17 | 0.80 |
| | CO | 0.85 | 1.23 | 0.63 | 0.99 | 1.45 | 1.27 | 0.94 | 0.83 | 6.16 | 3.93 |
| | Mean | 0.85 | 0.94 | 0.59 | 0.73 | 1.13 | 0.98 | 0.84 | 0.57 | 0.81 | 0.83 |
| | Hart85 | 0.59 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.98 | 1.00 | 1.00 | 1.00 |
1 Electric furnace. 2 Microwave. 3 Incandescent lamp. 4 Vacuum cleaner. 5 Electric space heater. 6 Electric shower heater.
Table A14. Sample rate = 1 s, 10 W filter, with harmonics.
| Metric | Algorithm | EF 1 | Mw 2 | TV | IL 3 | VC 4 | ESH 5 | ESHw 6 | Fan | Fridge | Freezer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MAE (W) | FHMM | 440.61 | 340.97 | 20.38 | 35.84 | 248.01 | 209.39 | 519.99 | 18.91 | 31.67 | 24.66 |
| | CO | 544.15 | 298.66 | 15.60 | 38.32 | 235.47 | 421.72 | 446.45 | 19.17 | 173.72 | 116.68 |
| | Mean | 910.21 | 217.59 | 20.89 | 41.34 | 250.08 | 398.60 | 496.99 | 15.89 | 31.91 | 26.84 |
| | Hart85 | 423.29 | 184.94 | 33.72 | 39.60 | 83.13 | 174.74 | 426.21 | 23.88 | 26.61 | 24.34 |
| RMSE (W) | FHMM | 940.44 | 586.92 | 26.50 | 53.54 | 407.34 | 735.84 | 785.98 | 24.71 | 36.47 | 43.54 |
| | CO | 929.99 | 561.18 | 23.02 | 51.34 | 359.32 | 872.96 | 680.65 | 25.37 | 320.92 | 210.87 |
| | Mean | 935.22 | 416.91 | 21.06 | 41.42 | 259.79 | 649.81 | 541.39 | 16.38 | 36.66 | 35.86 |
| | Hart85 | 641.18 | 445.66 | 35.46 | 56.46 | 229.72 | 666.32 | 634.90 | 28.94 | 45.08 | 43.28 |
| F1 | FHMM | 0.50 | 0.27 | 0.60 | 0.58 | 0.31 | 0.60 | 0.26 | 0.51 | 0.46 | 0.54 |
| | CO | 0.53 | 0.43 | 0.76 | 0.54 | 0.26 | 0.17 | 0.42 | 0.59 | 0.35 | 0.33 |
| | Mean | 0.46 | 0.56 | 0.95 | 0.58 | 0.21 | 0.12 | 0.46 | 0.74 | 0.46 | 0.46 |
| | Hart85 | 0.81 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 | 0.00 | 0.00 | 0.00 |
| NDE | FHMM | 0.86 | 1.32 | 0.75 | 0.95 | 1.77 | 1.10 | 1.21 | 0.85 | 0.81 | 1.01 |
| | CO | 0.85 | 1.26 | 0.65 | 0.91 | 1.56 | 1.31 | 1.05 | 0.88 | 7.12 | 4.87 |
| | Mean | 0.85 | 0.94 | 0.59 | 0.73 | 1.13 | 0.98 | 0.84 | 0.57 | 0.81 | 0.83 |
| | Hart85 | 0.59 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.98 | 1.00 | 1.00 | 1.00 |
1 Electric furnace. 2 Microwave. 3 Incandescent lamp. 4 Vacuum cleaner. 5 Electric space heater. 6 Electric shower heater.
Table A15. Sample rate = 1 s, 50 W filter, no harmonics.
| Metric | Algorithm | EF 1 | Mw 2 | TV | IL 3 | VC 4 | ESH 5 | ESHw 6 | Fan | Fridge | Freezer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MAE (W) | FHMM | 228.21 | 288.47 | 24.55 | 52.84 | 249.93 | 415.60 | 272.64 | 15.87 | 30.03 | 29.22 |
| | CO | 535.97 | 283.38 | 15.09 | 42.71 | 175.60 | 441.17 | 433.95 | 20.75 | 151.55 | 115.86 |
| | Mean | 910.21 | 217.59 | 20.89 | 41.34 | 250.08 | 398.60 | 496.99 | 15.89 | 31.91 | 26.84 |
| | Hart85 | 423.29 | 184.94 | 33.72 | 39.60 | 83.13 | 174.74 | 426.21 | 23.88 | 26.61 | 24.34 |
| RMSE (W) | FHMM | 669.25 | 527.95 | 29.23 | 65.59 | 259.68 | 1038.44 | 564.40 | 16.38 | 47.85 | 46.53 |
| | CO | 944.00 | 543.46 | 22.64 | 53.63 | 297.29 | 896.11 | 653.57 | 26.45 | 293.14 | 199.79 |
| | Mean | 935.22 | 416.91 | 21.06 | 41.42 | 259.79 | 649.81 | 541.39 | 16.38 | 36.66 | 35.86 |
| | Hart85 | 641.18 | 445.66 | 35.46 | 56.46 | 229.72 | 666.32 | 634.90 | 28.94 | 45.08 | 43.28 |
| F1 | FHMM | 0.78 | 0.19 | 0.41 | 0.10 | 0.21 | 0.31 | 0.58 | 0.74 | 0.25 | 0.16 |
| | CO | 0.57 | 0.42 | 0.77 | 0.56 | 0.22 | 0.25 | 0.41 | 0.54 | 0.44 | 0.33 |
| | Mean | 0.46 | 0.56 | 0.95 | 0.58 | 0.21 | 0.12 | 0.46 | 0.74 | 0.46 | 0.46 |
| | Hart85 | 0.81 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 | 0.00 | 0.00 | 0.00 |
| NDE | FHMM | 0.61 | 1.19 | 0.82 | 1.16 | 1.13 | 1.56 | 0.87 | 0.57 | 1.06 | 1.08 |
| | CO | 0.86 | 1.22 | 0.64 | 0.95 | 1.29 | 1.35 | 1.01 | 0.91 | 6.50 | 4.62 |
| | Mean | 0.85 | 0.94 | 0.59 | 0.73 | 1.13 | 0.98 | 0.84 | 0.57 | 0.81 | 0.83 |
| | Hart85 | 0.59 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.98 | 1.00 | 1.00 | 1.00 |
1 Electric furnace. 2 Microwave. 3 Incandescent lamp. 4 Vacuum cleaner. 5 Electric space heater. 6 Electric shower heater.
Table A16. Sample rate = 1 s, 50 W filter, with harmonics.

| Metric | Algorithm | EF 1 | Mw 2 | TV | IL 3 | VC 4 | ESH 5 | ESHw 6 | Fan | Fridge | Freezer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MAE (W) | FHMM | 910.22 | 181.69 | 20.68 | 41.34 | 247.16 | 424.07 | 611.58 | 14.48 | 24.08 | 27.05 |
| | CO | 591.90 | 303.03 | 16.26 | 45.97 | 215.34 | 457.24 | 402.52 | 16.07 | 142.39 | 111.57 |
| | Mean | 910.21 | 217.59 | 20.89 | 41.34 | 250.08 | 398.60 | 496.99 | 15.89 | 31.91 | 26.84 |
| | Hart85 | 423.29 | 184.94 | 33.72 | 39.60 | 83.13 | 174.74 | 426.21 | 23.88 | 26.61 | 24.34 |
| RMSE (W) | FHMM | 935.23 | 442.50 | 20.83 | 41.43 | 406.93 | 1050.19 | 852.44 | 21.58 | 43.57 | 36.06 |
| | CO | 995.55 | 558.85 | 21.58 | 55.38 | 333.84 | 881.98 | 649.39 | 22.79 | 286.75 | 198.89 |
| | Mean | 935.22 | 416.91 | 21.06 | 41.42 | 259.79 | 649.81 | 541.39 | 16.38 | 36.66 | 35.86 |
| | Hart85 | 641.18 | 445.66 | 35.46 | 56.46 | 229.72 | 666.32 | 634.90 | 28.94 | 45.08 | 43.28 |
| F1 | FHMM | 0.46 | 0.01 | 0.95 | 0.58 | 0.35 | 0.00 | 0.14 | 0.55 | 0.40 | 0.46 |
| | CO | 0.47 | 0.36 | 0.81 | 0.49 | 0.24 | 0.17 | 0.41 | 0.67 | 0.34 | 0.39 |
| | Mean | 0.46 | 0.56 | 0.95 | 0.58 | 0.21 | 0.12 | 0.46 | 0.74 | 0.46 | 0.46 |
| | Hart85 | 0.81 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 | 0.00 | 0.00 | 0.00 |
| NDE | FHMM | 0.85 | 0.99 | 0.59 | 0.73 | 1.77 | 1.58 | 1.32 | 0.75 | 0.97 | 0.83 |
| | CO | 0.91 | 1.25 | 0.61 | 0.98 | 1.45 | 1.32 | 1.00 | 0.79 | 6.36 | 4.60 |
| | Mean | 0.85 | 0.94 | 0.59 | 0.73 | 1.13 | 0.98 | 0.84 | 0.57 | 0.81 | 0.83 |
| | Hart85 | 0.59 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.98 | 1.00 | 1.00 | 1.00 |

1 Electric furnace. 2 Microwave. 3 Incandescent lamp. 4 Vacuum cleaner. 5 Electric space heater. 6 Electric shower heater.
Table A17. Sample rate = 1 s, 100 W filter, no harmonics.

| Metric | Algorithm | EF 1 | Mw 2 | TV | IL 3 | VC 4 | ESH 5 | ESHw 6 | Fan | Fridge | Freezer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MAE (W) | FHMM | 602.10 | 258.83 | 16.24 | 48.84 | 269.14 | 333.10 | 542.60 | 15.87 | 120.90 | 24.05 |
| | CO | 506.12 | 268.69 | 22.62 | 45.12 | 217.97 | 414.86 | 408.68 | 14.90 | 156.64 | 101.75 |
| | Mean | 910.21 | 217.59 | 20.89 | 41.34 | 250.08 | 398.60 | 496.99 | 15.89 | 31.91 | 26.84 |
| | Hart85 | 423.29 | 184.94 | 33.72 | 39.60 | 83.13 | 174.74 | 426.21 | 23.88 | 26.61 | 24.34 |
| RMSE (W) | FHMM | 1099.57 | 469.57 | 23.35 | 62.88 | 424.46 | 929.29 | 802.23 | 16.38 | 219.35 | 43.54 |
| | CO | 905.62 | 522.58 | 26.65 | 56.44 | 330.19 | 837.05 | 630.21 | 21.65 | 303.74 | 195.80 |
| | Mean | 935.22 | 416.91 | 21.06 | 41.42 | 259.79 | 649.81 | 541.39 | 16.38 | 36.66 | 35.86 |
| | Hart85 | 641.18 | 445.66 | 35.46 | 56.46 | 229.72 | 666.32 | 634.90 | 28.94 | 45.08 | 43.28 |
| F1 | FHMM | 0.24 | 0.30 | 0.70 | 0.33 | 0.35 | 0.39 | 0.21 | 0.74 | 0.46 | 0.56 |
| | CO | 0.56 | 0.44 | 0.76 | 0.47 | 0.23 | 0.22 | 0.48 | 0.70 | 0.34 | 0.39 |
| | Mean | 0.46 | 0.56 | 0.95 | 0.58 | 0.21 | 0.12 | 0.46 | 0.74 | 0.46 | 0.46 |
| | Hart85 | 0.81 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 | 0.00 | 0.00 | 0.00 |
| NDE | FHMM | 1.00 | 1.05 | 0.66 | 1.11 | 1.85 | 1.40 | 1.24 | 0.57 | 4.87 | 1.01 |
| | CO | 0.83 | 1.17 | 0.75 | 1.00 | 1.44 | 1.26 | 0.97 | 0.75 | 6.74 | 4.52 |
| | Mean | 0.85 | 0.94 | 0.59 | 0.73 | 1.13 | 0.98 | 0.84 | 0.57 | 0.81 | 0.83 |
| | Hart85 | 0.59 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.98 | 1.00 | 1.00 | 1.00 |

1 Electric furnace. 2 Microwave. 3 Incandescent lamp. 4 Vacuum cleaner. 5 Electric space heater. 6 Electric shower heater.
Table A18. Sample rate = 1 s, 100 W filter, with harmonics.

| Metric | Algorithm | EF 1 | Mw 2 | TV | IL 3 | VC 4 | ESH 5 | ESHw 6 | Fan | Fridge | Freezer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MAE (W) | FHMM | 568.62 | 418.62 | 20.72 | 42.27 | 152.89 | 466.38 | 488.13 | 16.94 | 31.78 | 27.09 |
| | CO | 525.44 | 279.55 | 22.44 | 38.35 | 221.46 | 454.92 | 382.14 | 16.60 | 173.32 | 120.58 |
| | Mean | 910.21 | 217.59 | 20.89 | 41.34 | 250.08 | 398.60 | 496.99 | 15.89 | 31.91 | 26.84 |
| | Hart85 | 423.29 | 184.94 | 33.72 | 39.60 | 83.13 | 174.74 | 426.21 | 23.88 | 26.61 | 24.34 |
| RMSE (W) | FHMM | 1069.97 | 610.80 | 26.60 | 58.66 | 319.24 | 1101.25 | 529.02 | 23.39 | 36.65 | 36.05 |
| | CO | 917.93 | 539.86 | 26.58 | 53.29 | 346.34 | 903.06 | 619.78 | 23.16 | 328.51 | 214.68 |
| | Mean | 935.22 | 416.91 | 21.06 | 41.42 | 259.79 | 649.81 | 541.39 | 16.38 | 36.66 | 35.86 |
| | Hart85 | 641.18 | 445.66 | 35.46 | 56.46 | 229.72 | 666.32 | 634.90 | 28.94 | 45.08 | 43.28 |
| F1 | FHMM | 0.41 | 0.38 | 0.54 | 0.28 | 0.14 | 0.00 | 0.46 | 0.45 | 0.45 | 0.46 |
| | CO | 0.56 | 0.43 | 0.73 | 0.51 | 0.22 | 0.18 | 0.50 | 0.65 | 0.34 | 0.32 |
| | Mean | 0.46 | 0.56 | 0.95 | 0.58 | 0.21 | 0.12 | 0.46 | 0.74 | 0.45 | 0.46 |
| | Hart85 | 0.81 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 | 0.00 | 0.00 | 0.00 |
| NDE | FHMM | 0.98 | 1.37 | 0.75 | 1.04 | 1.39 | 1.65 | 0.82 | 0.81 | 0.81 | 0.83 |
| | CO | 0.84 | 1.21 | 0.75 | 0.94 | 1.51 | 1.36 | 0.96 | 0.80 | 7.29 | 4.96 |
| | Mean | 0.85 | 0.94 | 0.59 | 0.73 | 1.13 | 0.98 | 0.84 | 0.57 | 0.81 | 0.83 |
| | Hart85 | 0.59 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.98 | 1.00 | 1.00 | 1.00 |

1 Electric furnace. 2 Microwave. 3 Incandescent lamp. 4 Vacuum cleaner. 5 Electric space heater. 6 Electric shower heater.
Table A19. Metrics for DAE. Sample rate = 1 s.

| Metric | EF 1 | Mw 2 | TV | IL 3 | VC 4 | ESH 5 | ESHw 6 | Fan | Fridge | Freezer |
|---|---|---|---|---|---|---|---|---|---|---|
| MAE (W) | 480.54 | 108.51 | 19.61 | 41.87 | 217.37 | 280.54 | 320.89 | 14.59 | 30.19 | 27.55 |
| RMSE (W) | 667.64 | 218.20 | 20.54 | 42.10 | 270.47 | 615.92 | 426.41 | 15.74 | 35.61 | 36.18 |
| F1-SCORE | 0.54 | 0.60 | 0.89 | 0.58 | 0.22 | 0.15 | 0.57 | 0.74 | 0.47 | 0.46 |
| NDE | 0.61 | 0.49 | 0.58 | 0.75 | 1.18 | 0.92 | 0.66 | 0.54 | 0.79 | 0.84 |

1 Electric furnace. 2 Microwave. 3 Incandescent lamp. 4 Vacuum cleaner. 5 Electric space heater. 6 Electric shower heater.
Table A20. Metrics for RNN. Sample rate = 1 s.

| Metric | EF 1 | Mw 2 | TV | IL 3 | VC 4 | ESH 5 | ESHw 6 | Fan | Fridge | Freezer |
|---|---|---|---|---|---|---|---|---|---|---|
| MAE (W) | 174.44 | 97.59 | 19.98 | 41.78 | 107.12 | 318.48 | 391.91 | 13.28 | 30.66 | 14.93 |
| RMSE (W) | 322.96 | 202.92 | 20.78 | 42.00 | 155.26 | 597.30 | 483.58 | 15.63 | 35.79 | 29.69 |
| F1-SCORE | 0.54 | 0.78 | 0.87 | 0.58 | 0.21 | 0.20 | 0.46 | 0.75 | 0.45 | 0.52 |
| NDE | 0.29 | 0.46 | 0.59 | 0.74 | 0.68 | 0.90 | 0.75 | 0.54 | 0.79 | 0.69 |

1 Electric furnace. 2 Microwave. 3 Incandescent lamp. 4 Vacuum cleaner. 5 Electric space heater. 6 Electric shower heater.
Table A21. Metrics for Seq2Point. Sample rate = 1 s.

| Metric | EF 1 | Mw 2 | TV | IL 3 | VC 4 | ESH 5 | ESHw 6 | Fan | Fridge | Freezer |
|---|---|---|---|---|---|---|---|---|---|---|
| MAE (W) | 285.62 | 55.76 | 22.57 | 28.89 | 87.95 | 305.07 | 273.68 | 14.45 | 29.69 | 12.61 |
| RMSE (W) | 479.89 | 156.12 | 23.25 | 39.02 | 147.50 | 610.65 | 359.92 | 18.42 | 35.05 | 27.94 |
| F1-SCORE | 0.57 | 0.85 | 0.80 | 0.66 | 0.31 | 0.16 | 0.54 | 0.69 | 0.45 | 0.52 |
| NDE | 0.44 | 0.35 | 0.66 | 0.69 | 0.64 | 0.92 | 0.56 | 0.64 | 0.78 | 0.65 |

1 Electric furnace. 2 Microwave. 3 Incandescent lamp. 4 Vacuum cleaner. 5 Electric space heater. 6 Electric shower heater.
Table A22. Metrics for Seq2Seq. Sample rate = 1 s.

| Metric | EF 1 | Mw 2 | TV | IL 3 | VC 4 | ESH 5 | ESHw 6 | Fan | Fridge | Freezer |
|---|---|---|---|---|---|---|---|---|---|---|
| MAE (W) | 389.74 | 51.14 | 20.53 | 34.26 | 102.06 | 331.77 | 226.71 | 13.81 | 29.21 | 14.64 |
| RMSE (W) | 547.76 | 123.43 | 20.96 | 38.82 | 153.72 | 618.95 | 302.40 | 17.81 | 34.35 | 29.54 |
| F1-SCORE | 0.49 | 0.80 | 0.89 | 0.58 | 0.23 | 0.12 | 0.50 | 0.69 | 0.45 | 0.51 |
| NDE | 0.50 | 0.28 | 0.59 | 0.69 | 0.67 | 0.93 | 0.47 | 0.62 | 0.76 | 0.68 |

1 Electric furnace. 2 Microwave. 3 Incandescent lamp. 4 Vacuum cleaner. 5 Electric space heater. 6 Electric shower heater.
Table A23. Metrics for WindowGRU. Sample rate = 1 s.

| Metric | EF 1 | Mw 2 | TV | IL 3 | VC 4 | ESH 5 | ESHw 6 | Fan | Fridge | Freezer |
|---|---|---|---|---|---|---|---|---|---|---|
| MAE (W) | 495.63 | 106.98 | 22.27 | 39.01 | 124.24 | 320.05 | 324.59 | 16.34 | 28.72 | 22.68 |
| RMSE (W) | 682.75 | 242.33 | 22.60 | 40.19 | 179.78 | 595.84 | 467.98 | 16.71 | 36.65 | 36.73 |
| F1-SCORE | 0.46 | 0.52 | 0.95 | 0.58 | 0.21 | 0.17 | 0.46 | 0.74 | 0.45 | 0.46 |
| NDE | 0.62 | 0.54 | 0.64 | 0.71 | 0.78 | 0.89 | 0.72 | 0.58 | 0.81 | 0.85 |

1 Electric furnace. 2 Microwave. 3 Incandescent lamp. 4 Vacuum cleaner. 5 Electric space heater. 6 Electric shower heater.
Table A24. Run time of the main algorithms executed in DSUALM10H.

| Algorithm | Configuration | Time (s) |
|---|---|---|
| CO | Without harmonics | 2.61 |
| | With harmonics | 3.28 |
| | Graph (without harmonics) | 4.42 |
| | Graph (with harmonics) | 2.39 |
| | 10 W filter without harmonics | 2.57 |
| | 100 W filter without harmonics | N/A |
| | 10 W filter with harmonics | 3.31 |
| | 100 W filter with harmonics | 3.11 |
| Mean | Without harmonics | 1.31 |
| | Without harmonics + 10 W filter | 0.89 |
| | Without harmonics + 100 W filter | 0.93 |
| | With harmonics | 1.22 |
| | Mean graphic | 0.91 |
| Hart85 | Without harmonics | 8.54 |
| | With harmonics | 8.67 |
| FHMM | Without harmonics | 66.31 |
| | With harmonics | 61.76 |
| | Graphic | 25.67 |
| DAE | With harmonics (window size 99) | 555.63 |
| RNN | With harmonics | 11,892.43 |
| Seq2Point | With harmonics | 2012.42 |
| Seq2Seq | With harmonics | 857.34 |
| WindowGRU | With harmonics | 16,283.86 |
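Wall-clock figures such as those in Table A24 can be collected with a small timing harness. The sketch below is illustrative only: `time_run` and the zero-argument `algorithm` callable are assumed names, not part of NILMTK or the authors' code.

```python
import time

def time_run(algorithm, repeats=3):
    """Return the best wall-clock time (s) over `repeats` calls of a
    zero-argument callable, reducing the effect of background load."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        algorithm()  # e.g. a closure that trains and disaggregates one config
        best = min(best, time.perf_counter() - start)
    return best
```

Taking the best of several repeats is a common way to report a stable lower bound; a single run, as in the table above, remains sensitive to machine state.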

Figure 1. Advancing NILM for energy efficiency.
Figure 2. NILM dataset characteristics.
Figure 3. F1-SCORE per algorithm and appliance (electric furnace (EF), microwave (MW), television (TV), incandescent lamp (Lamp), vacuum cleaner (VAC), electric space heater (ESH), electric shower heater (ESHw), fan, fridge, and freezer).
Figure 4. MAE per algorithm and appliance (electric furnace (EF), microwave (MW), television (TV), incandescent lamp (Lamp), vacuum cleaner (VAC), electric space heater (ESH), electric shower heater (ESHw), fan, fridge, and freezer).
Figure 5. NDE per algorithm and appliance (electric furnace (EF), microwave (MW), television (TV), incandescent lamp (Lamp), vacuum cleaner (VAC), electric space heater (ESH), electric shower heater (ESHw), fan, fridge, and freezer).
Figure 6. RMSE per algorithm and appliance (electric furnace (EF), microwave (MW), television (TV), incandescent lamp (Lamp), vacuum cleaner (VAC), electric space heater (ESH), electric shower heater (ESHw), fan, fridge, and freezer).
Figure 7. Metrics, including harmonics and filters. (a) MAE, (b) RMSE, (c) F1-SCORE, (d) NDE.
Figure 8. Heat map for appliances and algorithms. (a) F1-SCORE, (b) RMSE, (c) NDE, (d) MAE.
Figure 9. Disaggregated data for DSUALM10H. (a) Using Mean, (b) using CO. (Appliances: electric furnace (2), microwave (3), television (4), incandescent lamp (5), vacuum cleaner (6), electric space heater (7), electric shower heater (8), fan (9), fridge (10), and freezer (11)).
Figure 10. Disaggregated data for DSUALM10H using FHMM (a) and Hart85 (b). The appliances are electric furnace (2), microwave (3), television (4), incandescent lamp (5), vacuum cleaner (6), electric space heater (7), electric shower heater (8), fan (9), fridge (10), and freezer (11).
Figure 11. Disaggregated data for DSUALM10H. (a) Using DAE, (b) using RNN. (Appliances: electric furnace (2), microwave (3), television (4), incandescent lamp (5), vacuum cleaner (6), electric space heater (7), electric shower heater (8), fan (9), fridge (10), and freezer (11).)
Figure 12. Disaggregated data for DSUALM10H. (a) Using Seq2Point, (b) using Seq2Seq. (Appliances: electric furnace (2), microwave (3), television (4), incandescent lamp (5), vacuum cleaner (6), electric space heater (7), electric shower heater (8), fan (9), fridge (10), and freezer (11).)
Figure 13. Disaggregated data for DSUALM10H using WindowGRU. (Appliances: electric furnace (2), microwave (3), television (4), incandescent lamp (5), vacuum cleaner (6), electric space heater (7), electric shower heater (8), fan (9), fridge (10), and freezer (11).)
Figure 14. Execution times by algorithm using a dedicated machine running Ubuntu 25.04, Intel Core i5-4210U CPU @ 1.70 GHz with 16 GB RAM and 500 GB SSD.
Figure 15. Standardized comparison of NILMTK-Contrib algorithm metrics.
Table 1. Metric averages based on sample rate, unfiltered, and with harmonic content.

| Sampling Rate | Algorithm | MAE (W) | RMSE (W) | F1-SCORE | NDE |
|---|---|---|---|---|---|
| 250 ms | Mean | 262 | 298 | 0.52 | 0.82 |
| | CO | 220 | 390 | 0.44 | 1.09 |
| | FHMM | 200 | 360 | 0.38 | 0.98 |
| | Hart85 | 290 | 320 | 0.12 | 1.00 |
| 500 ms | Mean | 270 | 310 | 0.51 | 0.83 |
| | CO | 230 | 410 | 0.43 | 1.11 |
| | FHMM | 210 | 370 | 0.39 | 0.99 |
| | Hart85 | 295 | 330 | 0.12 | 1.00 |
| 1 s | Mean | 280 | 323 | 0.51 | 0.85 |
| | CO | 245 | 480 | 0.43 | 1.17 |
| | FHMM | 218 | 373 | 0.40 | 1.00 |
| | Hart85 | 303 | 340 | 0.11 | 1.00 |
| 15 s | Mean | 262 | 307 | 0.48 | 0.87 |
| | CO | 226 | 387 | 0.41 | 1.19 |
| | FHMM | 271 | 325 | 0.12 | 1.01 |
| | Hart85 | 210 | 340 | 0.36 | 1.02 |
| 30 s | Mean | 277 | 328 | 0.48 | 0.88 |
| | CO | 239 | 420 | 0.41 | 1.21 |
| | FHMM | 295 | 353 | 0.13 | 1.02 |
| | Hart85 | 225 | 357 | 0.32 | 1.04 |
| 60 s | Mean | 305 | 355 | 0.46 | 0.83 |
| | CO | 241 | 393 | 0.39 | 1.16 |
| | FHMM | 319 | 388 | 0.11 | 1.04 |
| | Hart85 | 257 | 378 | 0.30 | 1.05 |
| 90 s | Mean | 323 | 374 | 0.44 | 0.86 |
| | CO | 253 | 410 | 0.37 | 1.18 |
| | FHMM | 337 | 406 | 0.10 | 1.06 |
| | Hart85 | 271 | 397 | 0.28 | 1.07 |
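Rows like those in Table 1 are means taken over the per-appliance columns of the appendix tables. A minimal sketch of that aggregation follows; the dictionaries shown are illustrative values, not the study's data.

```python
def average_metrics(per_appliance):
    """Average a list of per-appliance metric dicts into one summary row,
    rounded to two decimals as in Table 1."""
    keys = per_appliance[0].keys()
    n = len(per_appliance)
    return {k: round(sum(d[k] for d in per_appliance) / n, 2) for k in keys}

# Illustrative per-appliance rows (two appliances shown; the study uses ten)
rows = [
    {"MAE": 280.0, "RMSE": 320.0, "F1": 0.50, "NDE": 0.80},
    {"MAE": 300.0, "RMSE": 340.0, "F1": 0.40, "NDE": 0.90},
]
summary = average_metrics(rows)
```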

