1. Introduction
In the age of digital energy transformation, Non-Intrusive Load Monitoring (NILM) has become a cornerstone technology for understanding and optimizing energy consumption. By disaggregating total power usage into appliance-level insights using only aggregate measurements, NILM eliminates the need for costly and invasive submetering, offering a scalable and efficient solution for energy management, demand-side response, and smart grid optimization across residential, commercial, and industrial domains [
1]. Recent real-world applications underscore NILM’s growing relevance. In industrial environments, NILM is being used for anomaly detection and predictive maintenance, enabling manufacturers to monitor machinery health and energy efficiency without installing sensors on every device. For example, active learning models have been developed to reduce the need for labeled data, cutting implementation costs by up to 99% while maintaining high prediction accuracy. In smart homes, NILM is integrated into intelligent energy management systems to track appliance usage, detect faults, and support demand-side optimization. Data-driven NILM approaches using deep learning have shown high accuracy in predicting future consumption and identifying abnormal patterns [
2,
3].
Despite these advances, most NILM research remains confined to high-performance computing environments or simulations, limiting its practical deployment [
4]. This gap between theory and practice poses a significant challenge: how can NILM algorithms be effectively integrated into the resource-constrained edge computing devices that are increasingly prevalent in smart grid and Internet of Things (IoT) ecosystems? Bridging this gap requires evaluating how NILM algorithms perform on such platforms, which offer the potential for real-time, decentralized energy analysis. Edge computing is a compelling fit for NILM: by processing data locally, i.e., closer to where it is generated, it enhances privacy, minimizes reliance on cloud infrastructure, and supports the real-time inference and low-latency decision-making that NILM requires. Studies have shown that edge-based NILM systems can reduce response time by up to 80% compared to cloud-based alternatives, and achieve up to a 36% reduction in computational complexity and a 75% reduction in storage requirements through model optimization. Moreover, edge deployment enables scalable, decentralized energy monitoring, which is critical for smart grids and distributed energy systems [
5,
6]. However, the computational demands of machine learning-based NILM models, particularly those designed for event detection and energy disaggregation, raise critical questions about their feasibility on embedded platforms, given their constraints on computational power, memory, and latency.
This study presents an experimental evaluation of state-of-the-art NILM algorithms for both event detection and energy disaggregation on edge computing devices. Unlike most previous studies, which rely on simulations in high-performance computing environments, this research implements and compares two NILM models, Deep Detection, Feature Extraction and Multi-Label Classification (DeepDFML)–NILM for event detection and Sequence-to-Point (Seq2Point)–Convolutional Neural Network (CNN) for energy disaggregation, on constrained edge platforms, namely the NVIDIA Jetson Nano and Jetson Orin Nano. The goal of this work is to evaluate the execution time, resource usage, and inference performance of these two algorithms when deployed on edge computing devices, and thereby assess the feasibility of implementing them on such systems. Rather than assessing NILM algorithms against each other, this work evaluates the computational performance of the selected algorithms on the selected devices. It provides quantitative metrics on inference time, resource usage (CPU, RAM, GPU), model size, and error, offering insights into the feasibility and trade-offs of deploying NILM systems on low-power hardware. The results highlight the limitations of the Jetson Nano and Orin Nano for real-time applications, establishing a baseline for practical NILM deployments on embedded systems.
The present work is organized as follows:
Section 2 reviews recent and relevant NILM research, with a focus on experimental studies and the limited exploration of embedded system deployment.
Section 3 describes the selected NILM algorithms, DeepDFML-NILM for event detection and Seq2Point CNN for load disaggregation, along with their architectures and reported performance. This section also includes a dedicated
Section 3.3, which details the Laboratory for Innovation and Technology (LIT) [
7] and UK Domestic Appliance-Level Electricity (UKDALE) [
8] datasets used in the experiments.
Section 4 outlines the hardware and software setup, including the use of Docker containers and virtual environments to ensure consistent execution across devices.
Section 5 presents the training and inference procedures for the event detection algorithm, evaluated on the PC, Jetson Nano, and Jetson Orin Nano, with corresponding resource usage and performance metrics. Similarly,
Section 6 covers the training and inference procedures for the load disaggregation algorithm.
Section 7 summarizes the results and draws conclusions from the metrics across devices. Finally,
Section 8 discusses key findings, outlines limitations, and proposes future directions for optimizing NILM models for embedded systems, including GPU optimization, model compression, improved generalization testing, as well as a roadmap for their integration into smart grid and IoT ecosystems. A graphical overview of the present study’s pipeline is depicted in
Figure 1.
2. Related Works
This section synthesizes key contributions in NILM research, focusing on computing paradigms, datasets, toolkits, and the distinction between load disaggregation and load detection. Promising approaches such as hybrid Long Short-Term Memory (LSTM)-CNN models [
11], spectrogram-based CNNs [
12], and lightweight event detection algorithms [
13] are highlighted, while noting persistent challenges in generalization, real-time deployment, and standardized evaluation [
14,
15]. This section also demonstrates that challenges persist in deploying NILM on edge devices due to constraints in memory, processing power, and latency.
NILM has evolved significantly since Hart’s foundational work in the 1980s, which introduced event-based detection and clustering for load disaggregation, framing the task as inferring per-appliance signals from aggregate mains data. From Hart’s seminal work on step-change signatures and combinatorial optimization [
16], research progressed to probabilistic state-space models (i.e., Hidden Markov Models (HMMs) and Factorial Hidden Markov Models (FHMMs)) and, more recently, to deep learning architectures (e.g., CNNs, LSTMs, seq2point, Sequence-to-Sequence (seq2seq), Generative Adversarial Networks (GANs), and Transformers) [
14,
15,
17].
Concerning the main approaches for load disaggregation (estimation), hybrid sequence models are reported to combine local feature extraction with temporal modeling. A compact LSTM-CNN sequence-to-sequence architecture reported high overall accuracy on the Electrical Load Measurements dataset (REFIT) with a notably modest parameter count, though per-appliance macro-F1 can lag for short-duty devices [
11]. End-to-end CNNs operating on time–frequency representations (e.g., Short-Time Fourier Transform (STFT) spectrograms of current derivatives) jointly perform event handling and classification, showing near real-time feasibility in controlled tests and competitive accuracy on Building-Level Fully Labeled Electricity Disaggregation Dataset (BLUED) [
12,
18]. Beyond these exemplars, widely used families include sequence-to-point CNNs [
19], causal 1D CNNs on complex power (WaveNILM) [
20], graph-signal approaches [
21], GAN-enhanced models, and transformer variants surveyed in recent reviews [
14,
15]. Despite accuracy gains, many works still report sensitivity to house- and device-level distribution shifts and provide limited reporting on inference time and computing cost. Compact LSTM-CNN disaggregation models can reach high overall accuracy with relatively small footprints, but may underperform on short, bursty appliances and have primarily been validated on single datasets and homes [
11].
Regarding load detection (event detection), event-based pipelines segment the aggregate signal into candidate transitions and then classify their signatures. Lightweight, on-site methods emphasize simple features and decision logic for low-latency detection of closely spaced events, explicitly targeting embedded/edge deployment [
13]. Event-driven CNN pipelines pair robust start/stop detection with deep transient classification, advancing accuracy while retaining an interpretable detection stage [
22]. Compared to direct disaggregation models, event-first approaches offer a pragmatic route to edge-friendly NILM, but still require downstream assignment and potential disaggregation for energy accounting.
Recent research has focused on enhancing NILM through deep learning, embedded computing, and standardized datasets. For instance, toolkits such as Non-Intrusive Load Monitoring Toolkit (NILMTK) (
https://github.com/nilmtk/nilmtk, accessed on 2 September 2025) [
23] and its extension NILMTK-contrib (
https://github.com/nilmtk/nilmtk-contrib, accessed on 2 September 2025) [
24] have enabled reproducible experimentation and benchmarking across algorithms and datasets. These frameworks support preprocessing, model training, and performance evaluation using metrics such as F1 score, root mean square error (RMSE), and other precision metrics. Furthermore, datasets such as LIT-dataset (
https://pessoal.dainf.ct.utfpr.edu.br/douglasrenaux/LIT_Dataset/index.html, accessed on 2 September 2025), UKDALE-dataset (
https://data.ceda.ac.uk/edc/efficiency/residential/EnergyConsumption/Domestic/UK-DALE-2015, accessed on 2 September 2025), and Reference Energy Disaggregation Dataset (REDD) [
25] have become essential for training and validating NILM models. The LIT dataset, in particular, offers high-frequency sampling and precise event labeling, facilitating advanced feature extraction and classification. Deep learning models, including CNN-based architectures like DeepDFML-NILM (
https://github.com/LucasNolasco/DeepDFML-NILM, accessed on 2 September 2025) [
26], have achieved high accuracy in load detection and multi-label classification. These models often use high-frequency signals and image-like representations of voltage–current trajectories to improve appliance recognition.
In general, public datasets are central to benchmarking and reproducibility: REDD [
25], UKDALE [
8], REFIT [
27], BLUED [
18], and others (e.g., Almanac of Minutely Power Dataset (AMPds and AMPds2), Electricity Consumption and Occupancy Dataset (ECO), Plug-Load Appliance Identification Dataset (PLAID), Building-level Office environment Dataset (BLOND), etc.). They vary in sampling rates (sub-Hz to tens of kHz), duration, number of homes, and device coverage [
14,
15]. Tooling for fair comparison includes NILMTK [
23,
24,
28], which provides parsers, baselines, and evaluation metrics. Reviews emphasize the need for unified test protocols (e.g., unseen-home splits, cross-dataset validation, and standardized thresholds) to enable apples-to-apples comparisons [
14,
15]. It is important to highlight that spectrogram-CNN approaches demonstrate near real-time processing and strong lab results, with encouraging performance on BLUED, but broader, longitudinal in-home trials remain limited [
12,
18].
Furthermore, surveys and individual studies converge on three gaps:
(i) Generalization: models often degrade on unseen homes due to device heterogeneity and usage diversity; domain adaptation and semi-unsupervised learning are active but not definitive [
14,
15,
29].
(ii) Real-time/on-device: few works report end-to-end latency, memory, and energy footprints; promising timings exist (e.g., ∼100 ms per 1 s window on desktop for spectrogram-CNN) but embedded deployment evidence remains scarce [
12].
(iii) Evaluation standards: inconsistent splits, thresholds, and metrics hinder fair comparison; community toolkits help, but standardized scenarios are still needed [
14,
15,
23,
28].
Moreover, several authors agree on the following practical recommendations:
(i) Edge–cloud co-design: Use lightweight, on-device event detection to gate higher-cost inference; escalate to compact seq2point/seq2seq disaggregation locally, and defer to the cloud only for ambiguous segments [
11,
13].
(ii) Evaluation protocol: Adopt NILMTK-based pipelines with unseen-home and cross-dataset tests; report accuracy, latency, and memory (and, where relevant, energy) [
23,
28].
(iii) Data strategy: Mix canonical datasets (REDD, UKDALE, REFIT, BLUED) with augmentation and careful validation on small in-home pilots to quantify out-of-distribution robustness [
8,
14,
15,
18,
25,
27].
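Recommendation (i) above can be illustrated with a short sketch. The gating pattern runs a cheap on-device event detector first, invokes the compact local disaggregator only on event windows, and escalates to the cloud only when the local estimate is ambiguous. All function names, thresholds, and the toy models here are hypothetical placeholders, not part of any of the cited systems:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class WindowResult:
    route: str       # "idle", "edge", or "cloud"
    estimate: float  # estimated appliance power (W), 0.0 when idle

def gated_pipeline(
    windows: List[List[float]],
    detect_event: Callable[[List[float]], bool],
    edge_model: Callable[[List[float]], float],
    confidence: Callable[[List[float], float], float],
    conf_threshold: float = 0.8,
    cloud_model: Optional[Callable[[List[float]], float]] = None,
) -> List[WindowResult]:
    """Run the cheap detector first; only run the disaggregator on
    event windows, and defer to the cloud when confidence is low."""
    results = []
    for w in windows:
        if not detect_event(w):                 # lightweight gate
            results.append(WindowResult("idle", 0.0))
            continue
        est = edge_model(w)                     # compact local model
        if confidence(w, est) >= conf_threshold or cloud_model is None:
            results.append(WindowResult("edge", est))
        else:                                   # ambiguous: escalate
            results.append(WindowResult("cloud", cloud_model(w)))
    return results

# Toy stand-ins: an "event" is any window whose mean exceeds 50 W.
detect = lambda w: sum(w) / len(w) > 50
edge = lambda w: max(w)
conf = lambda w, e: 1.0 if e < 2000 else 0.5   # distrust huge estimates
cloud = lambda w: sum(w) / len(w)

out = gated_pipeline([[0, 1, 2], [100, 120, 110], [2500, 2600, 2400]],
                     detect, edge, conf, cloud_model=cloud)
routes = [r.route for r in out]
```

The design point is that the detector and the confidence check are far cheaper than the disaggregator, so idle periods cost almost nothing on the device.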
Nevertheless, while NILM research has made substantial progress in algorithm development, dataset creation, and toolkit standardization, most NILM implementations remain confined to simulated or controlled environments [
30,
31,
32]. Furthermore, while performance on public datasets is strong, authors consistently report challenges in generalization to unseen homes, latency and memory constraints, and a scarcity of field deployments [
14,
15].
As a consequence, there is a pressing need for real-world implementations, cross-platform benchmarking, and adaptive models that can operate reliably in diverse deployment environments. For example, in [
33], an edge solution based on an EVALSTPM32 microcontroller is developed, which involves a year-long field study to validate NILM systems under dynamic residential conditions. In [
34], the development and validation of an Open Multi Power Meter (OMPM) is presented: a low-cost, scalable, open-source hardware solution compatible with NILMTK. Unlike previous Arduino- and Raspberry Pi-based solutions, it balances accuracy and scalability by building on a single microcontroller architecture with RS485 communication. The OMPM enables simultaneous measurement of multiple electrical parameters (voltage, current, power, frequency, and power factor) across several channels, offering a replicable platform for academic research and practical energy management applications. In [
35], an instrumental prototype architecture of a smart meter with embedded capabilities for high-frequency NILM is presented, while evaluating computational performance on six edge platforms: Raspberry Pi 4 (RP4), Intel Neural Compute Stick 2 (NCS2), NXP i.MX 8M Plus (IMX8), NVIDIA Jetson Nano (NANO), Jetson Xavier NX (XAVIERNX), and Ultra96v1 (ULTRA96V1) AMD-Xilinx MPSoC FPGA. In [
36], lightweight, retrainable neural architectures, such as a Siamese network with a fixed CNN and retrainable backpropagation (BP) layers, are shown to overcome the memory, processing, and latency constraints of edge devices by deploying the pre-trained NILM algorithm on an embedded Linux system with an STM32MP1. This enables real-time, scalable NILM with online adaptation, making it a promising solution for smart homes, IoT, and energy management systems [
36]. Similarly, in [
37], convolutional and LSTM neural networks are implemented to assess computational performance on the NVIDIA Jetson Orin NX and Nano, Google Coral (DevBoard and USB), Intel Neural Compute Stick 2, NXP i.MX8 Plus, and Xilinx Zynq UltraScale+ MPSoC ZCU104. The results show considerable variability among devices, with the ZCU104 and Jetson Orin achieving the lowest latencies across most models without any increase in the error metric.
In summary, event-driven methods deliver robust detection, including of closely spaced transitions, with resource-friendly designs suitable for embedded devices; however, they must still be integrated with downstream disaggregation to provide appliance-level energy estimates [
13,
22].
Several studies have compared NILM algorithms and their computational performance using conventional computing environments. For instance, ref. [
30] evaluates four machine learning models (XGBoost, LSTM, Logistic Regression, and DTW-KNN) on a desktop equipped with an Intel i7-7700K CPU and NVIDIA GTX 1080 GPU, reporting accuracy, training time, and memory usage. Similarly, ref. [
31] analyzes computation times for each stage of its NILM pipeline on a Core i5 (2.3 GHz, 8 GB RAM) workstation, achieving an average processing speed of 19.35 ms per sample. These studies offer insights into algorithmic behavior but do not address edge-level deployment constraints. More recent efforts emphasize edge–cloud collaboration. The study in [
38] proposes a three-tier client–edge–cloud NILM architecture, deploying XGBoost and Seq2Point on a Raspberry Pi 5 (64-bit Arm Cortex-A76, 8 GB RAM) as the edge node and an Intel i7-10875H + RTX 2060 as the cloud, demonstrating that running inference at the edge significantly reduces latency. In addition, ref. [
36] converts a CNN into TensorFlow Lite and deploys it on an STM32MP1 dual-core SoC, achieving inference in 20 ms, while [
37] benchmarks edge-AI platforms (Jetson Orin NX/Nano, Google Coral, Intel NCS2, NXP i.MX8, Xilinx ZCU104) showing wide latency variability and identifying the ZCU104 and Jetson Orin as the fastest. Unlike studies such as [
30,
31], which primarily benchmark algorithms under desktop conditions, in the present work, a systematic experimental evaluation of NILM models is performed directly on embedded edge devices, specifically the NVIDIA Jetson Nano and Jetson Orin Nano, to quantify latency, resource consumption, and accuracy under real-time constraints. Similar to the edge-oriented approaches in [
38], the present study shares the goal of enabling practical NILM deployment under latency and scalability limitations; however, it extends these efforts by providing detailed empirical measurements of processing time, CPU/GPU/RAM utilization, and energy disaggregation accuracy across multiple streaming configurations (fixed and variable stride).
6. Load Disaggregation Training and Inference
This section describes the complete workflow for the load disaggregation model, including both the training process conducted on a personal computer (PC) and the inference phase performed across all target platforms: the PC, Jetson Nano, and Jetson Orin Nano. Each device used its corresponding runtime environment, as detailed in
Table 1 and
Table 2.
Figure 7 shows a zoomed-in view of the load disaggregation pipeline, originally introduced in the overall pipeline in
Figure 1.
6.1. Disaggregation Training
The training of the disaggregation models was based on the NILMTK [
23], NILMTK-contrib [
24] and Seq2Point-nilm [
19,
39], from which different scripts, normalization values, and CNN architectures were obtained.
After preprocessing a training subset from the UKDALE-Dataset, models for the fridge, kettle, dishwasher, and washing machine were trained.
Figure 8 shows the training curves for the different devices.
Figure 8 shows that the validation loss closely follows the training loss. Although a small gap is present in most cases, it remains minor, indicating that the models are not significantly overfitting.
The models generated have a fixed size of 3,623,449 parameters, with an approximate file size of 43.6 MB each. The average resource usage and time required during training are shown in
Table 6.
The training resource summary in
Table 6 shows that all models were trained efficiently, with training times ranging from 13.4 to 20.5 s. CPU usage remained consistent around 10%, while RAM usage was between 72 and 74%, indicating moderate memory demands. GPU usage was higher for some appliances, especially the washing machine, but always stayed under 50%, without overloading the system.
On the other hand,
Table 7 shows the complete training metrics of the models.
Table 7 confirms good training performance across all appliances. The training and validation metrics, MSE, MSLE, and MAE, remain low overall, with only minor increases in validation values, particularly for the dishwasher and washing machine. This supports earlier observations of minimal overfitting, suggesting that the models generalize well to unseen data.
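For reference, the three training metrics named above have the standard definitions sketched below in NumPy; reduction and clipping details may differ slightly from the exact Keras implementations used during training:

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean squared error."""
    return float(np.mean((y_pred - y_true) ** 2))

def mae(y_pred, y_true):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_pred - y_true)))

def msle(y_pred, y_true):
    """Mean squared logarithmic error; log1p keeps zero-power samples finite."""
    return float(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Illustrative values only (watts):
y_true = np.array([0.0, 100.0, 2000.0])
y_pred = np.array([0.0, 110.0, 1900.0])
# MSLE penalizes relative error, so the 100 W miss on the 2000 W sample
# contributes less to MSLE than it does to MSE.
```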
6.2. Disaggregation Inference Evaluation
The disaggregation models were evaluated using an offline strategy, in which aggregated signals were processed on demand. A testing subset of the UKDALE dataset, corresponding to approximately 30 days of energy consumption subsampled at 60 s intervals (44,286 samples), was used.
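As a concrete illustration of how such an offline pass can be organized, the sketch below slices an aggregate series of the same length into overlapping, standardized windows, the usual input format for a Seq2Point model. The window length and normalization constants are placeholders in the style of the Seq2Point-nilm repository, not necessarily the exact values used in the experiments:

```python
import numpy as np

def make_windows(aggregate, window_len, mean, std):
    """Pad the series so every sample gets a centered window, then
    return standardized (n_samples, window_len) model inputs."""
    half = window_len // 2
    padded = np.pad(aggregate, (half, half), mode="constant")
    wins = np.lib.stride_tricks.sliding_window_view(padded, window_len)
    return (wins - mean) / std

# Placeholder values: 599-sample windows and illustrative mains statistics.
agg = np.abs(np.random.default_rng(1).normal(300, 50, 44286))  # synthetic 30-day series
X = make_windows(agg, window_len=599, mean=522.0, std=814.0)
# One input window per aggregate sample:
assert X.shape == (44286, 599)
```

With 60 s sampling, 44,286 samples indeed cover roughly 30 days, and the stride-1 windowing yields exactly one prediction per sample.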
For each device and appliance, the average resource usage was measured as shown in
Table 8 with a CI = 95% and the number of data points (
n) varying across appliances and devices.
In addition, the cumulative energy error (SAE), inference error metrics, and average processing times with
n = 11 and a CI = 95% are summarized in
Table 9.
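Confidence intervals of this kind can be reproduced with a standard Student-t interval over the repeated runs. The sketch below assumes n = 11 runs, for which the two-sided 95% t critical value at 10 degrees of freedom is 2.228; the per-run timings are hypothetical:

```python
import math
from statistics import mean, stdev

def t_confidence_interval(samples, t_crit):
    """Return (mean, half-width) of a two-sided CI for the sample mean."""
    n = len(samples)
    m = mean(samples)
    half = t_crit * stdev(samples) / math.sqrt(n)  # stdev = sample std (n-1)
    return m, half

# Hypothetical per-run inference times (s) on an embedded board, n = 11.
runs = [23.1, 24.0, 22.5, 25.2, 23.8, 24.4, 23.0, 25.9, 22.8, 24.1, 23.6]
T_CRIT_95_DF10 = 2.228  # Student-t, 95% two-sided, 10 degrees of freedom
m, half = t_confidence_interval(runs, T_CRIT_95_DF10)
# Report as mean ± half-width, e.g. "23.9 ± 0.7 s".
```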
The results in
Table 8 and
Table 9 confirm that all devices successfully performed offline disaggregation over 30 days of data, but with clear differences in efficiency, accuracy, and resource usage.
The PC, as the reference device, achieved the fastest inference times (≈2.2 s per appliance) with consistently low CPU (≈10%), moderate GPU usage (≈23%), and low RAM usage (50–58%). Error metrics were also the lowest across all devices, confirming that high-precision inference was maintained. The Orin Nano completed inference in 18–19 s per appliance, showing moderate CPU (≈25%), moderate GPU usage (≈29–30%), and high RAM usage (75–81%). Its error metrics were slightly worse than the PC but closely matched those of the Jetson Nano. The Jetson Nano showed the highest resource usage (CPU ≈ 38%, GPU ≈ 52–56%, RAM ≈ 90–91%) and slower inference times (≈22–25 s per appliance). Error metrics were nearly identical to the Orin Nano, indicating that performance differences are due more to hardware constraints than inference quality. Its high resource utilization suggests the Jetson Nano is operating close to capacity during inference.
The results indicate that the GPU usage presented the highest variability across all devices, with confidence intervals of up to ±5% on the Jetson Nano. This suggests that GPU utilization is more sensitive to fluctuations in workload and background processes than CPU or RAM usage, which remained relatively stable. In terms of processing time, the Jetson Nano exhibited the highest variability (up to ±2.9 s), compared to the Jetson Orin Nano and especially the PC, whose execution times were extremely consistent (only ±0.02–0.06 s). This indicates that Jetson Nano runtime stability is less predictable, likely due to scheduling, memory management, or differences in precision handling.
The results also show that, although the same trained disaggregation model was used across all devices, slight differences appear in metrics such as MSE, MAE, and MSLE. This variation is likely due to differences in inference precision, hardware architecture, or backend libraries used on each platform. Devices like the Jetson Nano and Orin Nano often rely on reduced precision, which can introduce minor numerical discrepancies during inference. However, the SAE (cumulative energy error) remains identical across all platforms, confirming that despite low-level differences, the overall energy estimation is consistent. This suggests the model’s core disaggregation capability remains stable across platforms.
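This behavior follows directly from the definition of SAE commonly used in the NILM literature: it compares total predicted energy against total ground-truth energy, so small per-sample numerical discrepancies that cancel out leave it unchanged even when per-sample metrics differ. A minimal sketch with illustrative numbers:

```python
def sae(pred, truth):
    """Signal Aggregate Error: relative error of total energy."""
    return abs(sum(pred) - sum(truth)) / sum(truth)

def mae(pred, truth):
    """Mean absolute error, for contrast with SAE."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)

truth = [0.0, 100.0, 100.0, 0.0]
pred_a = [0.0, 100.0, 90.0, 0.0]   # one device's output
pred_b = [5.0, 95.0, 95.0, -5.0]   # perturbed samples, same total
# Per-sample error differs ...
assert mae(pred_a, truth) != mae(pred_b, truth)
# ... but the cumulative energy error is identical:
assert sae(pred_a, truth) == sae(pred_b, truth)
```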
Figure 9 shows examples of the aggregated, ground truth, predicted signal, and prediction error for each appliance.
Figure 9 shows that the models are capable of accurately tracking the disaggregated signal throughout the inference period. This indicates that the models have learned the general patterns and behavior of each appliance and can effectively separate their individual consumption from the aggregate signal. The alignment between the predicted and actual values suggests good generalization and reliable performance, even when processing extended periods of real-world data.
The fridge prediction error curve indicates that this model is among the best trained in the study. Errors only rise briefly during state transitions or when other appliances introduce disturbances during steady states. The dishwasher error curve follows a similar pattern, with peaks at state changes and difficulties in capturing low-power states, though the overall error remains low.
In contrast, the kettle and washing machine error curves show much higher error levels. For the kettle, the error is persistent across the entire timeline, with additional peaks where the model incorrectly predicts appliance activity while the ground truth is zero. The washing machine error is also substantial, likely due to the complex, non-steady patterns that are hard to replicate.
Despite the washing machine’s high appliance-level error, its cumulative energy error (SAE) is the lowest among all models in this study. This suggests that achieving low per-appliance error does not always translate into the most accurate overall energy consumption estimates.
Figure 10 shows the results of disaggregating each appliance from the same aggregated signal using the API and dashboard to show their contribution to the total energy consumption.
Figure 10 shows that the washing machine is the appliance with the lowest total consumption error, aligning with the SAE value, while the kettle total consumption error is almost half the predicted value. It is also important to note that all errors are positive, indicating that the predictions add noise rather than underestimating the appliances’ consumption.
7. Results
This section summarizes all key metrics, beginning with the training performance and resulting models for both event detection and load disaggregation as evaluated on the PC, reviewing training durations, model complexity, and resource usage. It then compares inference performance across the hardware platforms (PC, Jetson Nano, and Jetson Orin Nano), highlighting differences in processing time, resource usage, and accuracy.
7.1. Training Performance
Table 10 summarizes the average training time and hardware resource usage for both the detection and disaggregation models on the PC. It includes CPU, RAM, and GPU usage during training, as well as the parameter count and memory size for each model.
Table 10 shows that although the detection model involved a more complex training process (requiring more training time and higher GPU usage), it remains significantly lighter than the disaggregation model in terms of both the number of parameters and memory size.
7.2. Event Detection Results
The event detection algorithm itself demonstrated strong performance across all devices, achieving a detection accuracy of 96.1%, a load classification accuracy of 97.06%, and an event distance error of only 0.78 half-cycles. The performance metrics obtained with the trained model in this work are very similar to those reported in [
26]. The main difference lies in the processing time for a 0.833 s window: while [
26] reported 11.2 ms on a Jetson TX1, the results obtained were as follows: 23.71 ms on the PC, 138.17 ms on the Jetson Orin Nano, and 191.63 ms on the Jetson Nano, as shown in
Table 4.
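A convenient way to compare these per-window figures is the real-time factor, i.e., processing time divided by window duration, where values below 1 mean a single window is processed faster than it arrives. The sketch below uses the times reported above:

```python
WINDOW_S = 0.833  # event detection window length (s)
times_ms = {
    "Jetson TX1 [26]": 11.2,
    "PC": 23.71,
    "Jetson Orin Nano": 138.17,
    "Jetson Nano": 191.63,
}

# Real-time factor per device: processing time / window duration.
rtf = {name: (t_ms / 1000.0) / WINDOW_S for name, t_ms in times_ms.items()}
```

All per-window factors are below 1, so each device processes an isolated window faster than real time; sustained streaming additionally depends on stride and overlap, as discussed below.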
It is important to highlight that the variability of the data used in this work is very low, as shown in
Table 4 and
Table 5, and
Figure 3,
Figure 4 and
Figure 5, which makes the results reliable and consistent. From this analysis, it can be observed that the processing time reported for the Jetson TX1 in [
26] is significantly lower than that obtained even on the PC. Furthermore, while [
26] reported an average GPU utilization of 50.6%, in the experiments conducted in this study, GPU usage remained low as shown in
Table 5, suggesting that the detection algorithm was not directly leveraging GPU acceleration for data processing. This behavior may be attributed to the compatibility of the libraries and packages available in each execution environment, which could prevent effective use of the GPU.
The results of
Table 4 and
Table 5 show that the PC has the lowest CPU and RAM usage and the fastest processing time. Among the embedded platforms, the Jetson Orin Nano outperforms the Jetson Nano, demonstrating lower CPU and GPU usage as well as faster inference. In all cases, CPU and GPU usage remained within reasonable limits, while RAM usage reached its limit, especially on Jetson devices.
To analyze the streaming modes and real-time inference capability,
Table 11 summarizes the processing times presented in
Figure 3,
Figure 4 and
Figure 5 with their respective number of samples (
n) and CI, and compares the inference latency of the detection modes across different platforms. The table also reports the percentage of events detected by each device under Mode 3 based on 180 events tested.
The PC, serving as the reference device, fully meets the requirements for real-time detection. It achieved rapid response times of 0.5 s when processing a 4.16 s window while maintaining 100% detection accuracy under the adaptive stride configuration (Mode 3). It also handled offline inference efficiently, processing the entire 37.7 s window in only 5.5 s, making it well-suited for both real-time and batch processing scenarios.
The Jetson Orin Nano, although offering better performance than the Jetson Nano, is unable to sustain real-time operation. It required approximately 2.8 s to process a 4.16 s window and 33.10 s for the full 37.7 s window. These times are insufficient for stride-based detection, making the device unsuitable for online use. Under adaptive stride (Mode 3), its detection accuracy dropped to only 66%, further confirming its limitations for time-sensitive applications. Moreover, in Mode 2, approximately 2 s of data is accumulated in the buffer during each iteration, clearly illustrating the latency gap between data acquisition and processing.
The Jetson Nano exhibited even greater performance limitations and was unable to sustain real-time operation. It required approximately 3.7 s to process a 4.16 s window and 42.78 s for the full 37.7 s window. These processing times are far from sufficient for stride-based detection, making the device unsuitable for online use. Under adaptive stride, its detection accuracy dropped to only 55%, underscoring its limited applicability in latency-critical scenarios. Furthermore, in Mode 2, nearly 2.8 s of data accumulates in the buffer during each iteration, further highlighting the severe mismatch between data acquisition and processing speed.
In summary, all devices evaluated are acceptable for offline processing tasks. However, the Jetson boards cannot achieve real-time detection in either Mode 2 or Mode 3, as their processing times are too high, leading to potential data buffering or missed detections caused by insufficient overlap.
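The buffering effect described for Mode 2 can be captured with a simple back-of-the-envelope model: if each iteration consumes one stride's worth of new signal but spends longer than the stride processing it, the backlog grows by the difference on every iteration. The sketch below uses the measured processing times from this section; the stride value is an illustrative assumption, not the exact Mode 2 configuration:

```python
def backlog_growth(t_proc, stride):
    """Seconds of unprocessed data added to the buffer per iteration:
    the processor consumes `stride` seconds of signal but spends
    `t_proc` wall-clock seconds doing so."""
    return max(0.0, t_proc - stride)

def is_real_time(t_proc, stride):
    """Sustainable streaming requires processing no slower than arrival."""
    return t_proc <= stride

STRIDE = 0.833  # assumed seconds of new data consumed per iteration
devices = {"PC": 0.5, "Jetson Orin Nano": 2.8, "Jetson Nano": 3.7}  # measured t_proc (s)

for name, t_proc in devices.items():
    if is_real_time(t_proc, STRIDE):
        print(name, "sustains real-time streaming")
    else:
        print(name, f"accumulates {backlog_growth(t_proc, STRIDE):.1f} s per iteration")
```

Under this simplified model, the PC keeps up while both Jetson boards accumulate roughly 2–3 s of backlog per iteration, consistent with the buffering observed in Mode 2.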
7.3. Disaggregation Results
The disaggregation models, for their part, were effective in replicating the temporal patterns of consumption, as seen in
Figure 9 and
Table 9. Nevertheless, the cumulative energy analysis reveals opportunities for improvement, especially in devices with sporadic activations such as the kettle or dishwasher, where the SAE exceeds 0.7. This discrepancy suggests that, while the model may be accurate at the sample level, it could benefit from optimizations focused on predicting total consumption, such as adjusting the loss function or performing targeted retraining.
Table 8 and
Table 9 show that, although all platforms used the same disaggregation models and achieved identical SAE values (indicating similar cumulative energy estimates), there are clear differences in resource usage and inference time. The PC, as the reference device, achieved the best overall efficiency, with the lowest latency of about 2.2 s and minimal CPU (10.4%) and GPU (22.8%) usage across all appliances. In contrast, the Jetson Nano exhibited the highest computational load and the slowest inference time, about 23.8 s across appliances. The Jetson Orin Nano offered balanced performance, achieving a latency of about 18.9 s with moderate resource usage. Consistent GPU use across all devices shows that the algorithm appropriately leveraged hardware acceleration. RAM was one of the most heavily used resources during inference and became the main bottleneck of the entire computation. This was particularly evident on the Jetson Nano, whose very limited memory saturated easily and caused the system to stop functioning.
In conclusion, all devices are suitable for offline disaggregation. However, the Jetson Nano showed significant issues due to its limited RAM, which made the system unstable and prone to crashes.
8. Discussion
This study assessed the execution time, resource usage, and inference performance of two NILM algorithms—DeepDFML-NILM for event detection and Seq2Point CNN for energy disaggregation—on edge devices, namely the Jetson Nano and Jetson Orin Nano, to determine the feasibility of deploying these algorithms on such systems.
The results presented in
Section 5.2 allow us to conclude that both the Jetson Orin Nano and the Jetson Nano are capable of accurate offline event detection. However, neither device can perform real-time detection in either of the two implemented online modes, as both suffer from buffer overflows or from reduced event detection performance caused by the limited overlap resulting from the high per-window processing time. Another important bottleneck is the limited RAM, particularly on the Jetson Nano, which frequently became saturated and caused the device to crash during the experiments. A noteworthy observation is the low GPU utilization throughout our experiments, whereas the reference work [
26] reports an average GPU utilization of 50.6% and a processing time of 11.2 ms for a 0.833 s window, almost half the time of the best-performing device in this study, the PC, which achieves 23.7 ms for the same window size. This suggests that the experiments conducted here do not fully exploit the GPU, possibly due to the specific software versions available for each system. For future work, we propose studying in detail the software dependencies required to take full advantage of the available GPU resources on the Jetson devices, and exploring model compression methods such as the TensorFlow Lite Converter, as done in [
36], to run models efficiently on edge hardware, potentially enabling real-time event detection on these platforms.
On the other hand, the results presented in
Section 6.2 show that both Jetson devices are fully capable of performing appliance-level signal disaggregation efficiently, since data equivalent to 30 days of measurements was processed in under 30 s on both platforms. The only bottleneck, again, is the RAM of the Jetson Nano, which occasionally saturates while the algorithm runs. For real-world deployments, the Jetson Nano is therefore not recommended for either algorithm, as it does not guarantee stability and would require constant monitoring. Future work in this direction includes training models that explicitly optimize the SAE metric, as it is the metric most closely related to accurate appliance-level energy estimation. We also propose evaluating the trained models on different datasets, or on data from households other than the one used for training, where consumption patterns vary significantly, in order to test their generalization ability.
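Training "to optimize the SAE metric", as proposed above, could be sketched as a composite loss that adds an aggregate-energy penalty to the usual sample-wise error. A minimal sketch, where the weight `lam` is a hypothetical hyperparameter and the formulation is one plausible choice rather than the paper's method:

```python
def composite_loss(y_pred, y_true, lam=0.1, eps=1e-8):
    """Sample-wise MAE plus a cumulative-energy (SAE-like) penalty.

    lam trades sample-level fidelity against cumulative-energy accuracy;
    eps guards against division by zero on all-idle targets.
    """
    n = len(y_true)
    mae = sum(abs(p - t) for p, t in zip(y_pred, y_true)) / n
    e_pred, e_true = sum(y_pred), sum(y_true)
    sae_term = abs(e_pred - e_true) / (e_true + eps)
    return mae + lam * sae_term

# A prediction shifted in time has a large MAE but zero aggregate error,
# so the SAE penalty leaves it untouched:
print(composite_loss([0.0, 100.0, 100.0, 0.0],
                     [0.0, 0.0, 100.0, 100.0]))  # → 50.0
```

In a deep-learning framework the same idea would be expressed with tensor operations as a custom loss, so that gradients flow through both terms during training.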
Finally, once models with good generalization capacity have been obtained and optimized for real-time processing on embedded systems such as the Jetson Nano and Jetson Orin Nano, future work will focus on integrating NILM into practical smart grid and IoT ecosystems. To this end, the following roadmap is proposed:
Data and hardware infrastructure: Ensure the availability of high-resolution measurements through smart meters and sensors, together with the selection of edge devices that balance energy consumption, cost, and computational capacity.
Algorithm optimization: Adapt and compress models for efficient deployment on resource-constrained hardware, ensuring low latency and scalability across different types of appliances.
IoT architecture integration: Design an edge-to-cloud architecture that allows local event processing while transmitting aggregated data to the cloud for advanced analytics, employing standard protocols and interoperable APIs.
User- and grid-oriented applications: Provide appliance-level consumption feedback to users, and enable demand-side management services, anomaly detection, and energy-use optimization in households and microgrids.
Security and privacy: Implement encryption and access control to preserve the privacy of energy data.
Scaling and validation: Conduct pilot tests in real environments, evaluate accuracy, latency, and robustness metrics, and finally integrate the results into smart grid platforms for large-scale services such as load forecasting, demand response, and distributed energy trading.