DiffNILM: A Novel Framework for Non-Intrusive Load Monitoring Based on the Conditional Diffusion Model

Non-intrusive Load Monitoring (NILM) is a critical technology that enables detailed analysis of household energy consumption without requiring individual metering of every appliance, and can provide valuable insights into energy usage behavior, facilitate energy conservation, and optimize load management. Deep learning models are currently the state-of-the-art approaches for NILM. In this study, we introduce DiffNILM, a novel energy disaggregation framework that utilizes diffusion probabilistic models to distinguish the power consumption patterns of individual appliances from aggregated power. Starting from random Gaussian noise, the target waveform is iteratively reconstructed via a sampler conditioned on the total active power and encoded temporal features. The proposed method is evaluated on two public datasets, REDD and UKDALE. The results demonstrate that DiffNILM outperforms baseline models on several key metrics on both datasets and shows a remarkable ability to recreate complex load signatures. The study highlights the potential of diffusion models to advance the field of NILM and presents a promising approach for future energy disaggregation research.


Introduction
In recent years, the demand for fine-grained power data has increased, leading to a growing interest in the energy disaggregation technique for obtaining information on appliance-level power consumption. A commonly cited application of this technique is to generate detailed electricity bills, which encourage energy conservation among residents. Additionally, electric power companies can utilize disaggregated power consumption data to calculate Demand Side Response (DSR) resources and evaluate DSR capability. An intuitive way to obtain such data is through Intrusive Load Monitoring (ILM), which involves the direct installation of sensors on target appliances. While ILM yields accurate results, it is generally considered unfeasible for large-scale deployment due to its high cost. On the other hand, Non-Intrusive Load Monitoring (NILM) offers better application prospects from an economic standpoint. NILM can be viewed as a software sensor that identifies the operating states of individual appliances and estimates their power consumption using only power or current data recorded by the mains meter, thereby reducing the overall cost.
The practical implementation of NILM has been facilitated by the development of big data technology in the energy industry. Advanced Metering Infrastructure (AMI) provides real-time load monitoring data, and modern Artificial Intelligence (AI) algorithms can effectively process massive amounts of data.
In our study, energy disaggregation is framed as a generation task, and a highly promising deep generative model, the diffusion model [1], is employed to reconstruct target power profiles. In the last two years, diffusion models have gained significant popularity and have largely replaced Generative Adversarial Networks (GANs) and other generative models due to their ease of training, improved tractability, and flexibility. Diffusion models have demonstrated exceptional performance in various fields, including image generation [2], image segmentation [3], audio synthesis [4] and point cloud reconstruction [5]. However, to the best of our knowledge, no published research has investigated the use of diffusion models for NILM. Therefore, this paper proposes DiffNILM, a diffusion probabilistic model for energy disaggregation. The main contributions of our work are as follows:
• DiffNILM is the first NILM framework adopting the diffusion model. Specifically, we engineer the conditional diffusion model to address the NILM task, where the total active power and embedded time tags are fed to the model as conditional input, and the appliance power waveform is generated step-by-step from Gaussian noise.
• We propose an encoding method for multi-scale temporal features that takes into account the regularity of power consumption behaviors.
• We implement and evaluate the proposed method on two public datasets, REDD and UKDALE. Empirical results demonstrate that DiffNILM outperforms previous models, as evidenced by both classification and regression metrics.

Related Works
The overall framework of NILM was pioneered by Professor Hart in the 1980s, as documented in [6]. This approach was based on the notion that electric appliances exhibit unique features during state transition, which formed the basis for the event-based load monitoring method. However, Hart's original approach only extracted steady-state features, which proved inadequate for appliances with multiple states and relatively low power consumption. To improve Hart's algorithm, researchers discovered that repeatable transient profiles could be observed with high sampling rates, which allowed for the recognition of appliances' transient signatures [7]. Various signal processing techniques, including Fourier Transform [8], Wavelet Transform [9], and Hilbert Transformation [10], were attempted to process transient power, current, and voltage signals.
To improve the identification accuracy, multiple electrical parameters were combined as input. The most prevailing method is the V-I trajectory, which maps current and voltage signals as a 2D image. Both steady-state and transient data can be utilized to generate V-I trajectories. For instance, Wang et al. [11] extracted the V-I trajectory based on steady-state data and developed an approach to quantify ten trajectory features. The research work in [12] utilized instantaneous voltage and current waveforms and proposed an algorithm that demonstrated high precision and strong robustness.
NILM based on traditional machine learning methods is mainly realized by the Hidden Markov Model (HMM) [13][14][15], Conditional Random Field (CRF) [16], and Support Vector Machine (SVM) [17]. These algorithms are supported by explicable mathematical principles, but their performances are often constrained by stringent assumptions (the characterization of load state transitions may not align with the actual operational features of various appliances), leading to limited accuracy and generalization abilities. Efforts have also been made to tackle the issue by framing it as a Combinatorial Optimization (CO) problem [18], but this method has proven to be computationally intractable, since it relies on enumeration.
With the significant progress of Deep Learning (DL), DL-based solutions have brought fresh insights to the practical advancement of artificial intelligence, which have been extensively adopted in various fields, including Computer Vision, Natural Language Processing, Signal Processing, etc. The application of DL-based techniques to energy disaggregation started with Kelly and Knottenbelt's pioneering work in 2015 [19], where they introduced three deep neural network architectures to NILM, surpassing CO and diverse HMM-based algorithms in terms of both accuracy and generalization capability. Since then, DL-based methods have gradually dominated NILM research.
Recurrent Neural Networks (RNNs) are a type of deep learning architecture particularly well-suited for handling sequential data [20]. However, RNNs suffer from the vanishing gradient problem, so Long Short-Term Memory (LSTM) networks [21], which mitigate this issue, have been commonly used in NILM. Convolutional Neural Networks (CNNs) have proven highly effective for image tasks and excel in sequential data analysis as well [22]. Zhang et al. [23] compared Seq2Point and Seq2Seq learning approaches using CNN-based mappings for training.
The models mentioned above have also been optimized to enhance computational efficiency since real-time load disaggregation is crucial for certain use cases, such as DSR and fault detection [24,25]. While accurate approaches have been proposed, there are also light-weight approaches to enable online computation, including a super-state hidden Markov model and a new variant of the Viterbi algorithm in an HMM-based framework for computationally efficient exact inference [26], as well as methods based on Gated Recurrent Unit (GRU), which reduce memory usage and computational complexity [27,28]. In addition, an experimental platform has been developed to realize real-time computation with a calculation time limit of one second [29].
In the past few years, the Attention Mechanism has gained widespread popularity in sequential data processing tasks. The fundamental idea is to direct focus onto the most essential segments of the input sequence by assigning the highest weights to the most relevant parts. Capitalizing on the advantages of the Attention Mechanism, Google introduced the Transformer architecture in 2017 [30], which allows parallel computation, as opposed to RNNs, and demonstrates a significantly superior capability to capture sequential features compared to CNNs. Building upon the Transformer, the research work in [31,32] designed an architecture based on Bidirectional Encoder Representations from Transformers (BERT) for NILM, and proposed comprehensive loss functions that incorporate both regression and classification metrics.
NILM can also be regarded as a generation task aimed at creating synthetic waveforms for individual appliances, so the implementation of deep generative models, which are capable of modeling the underlying distribution of the power data, has been explored. A deep latent generative model for NILM, based on the Variational Recurrent Neural Network (VRNN), has been proposed, which performs sequence-to-many-sequence prediction [33]. The strong generational ability of the Variational Autoencoder (VAE) improves the formation of complex load profiles [34]. Conditional Generative Adversarial Network (cGAN) was used to avoid manually designing loss functions [35]. The work in [36] unified auto-encoder and GAN to realize the source separation of nonlinear power signals. Drawing inspiration from the favorable outcomes attained by several non-autoregressive generative models, the proposed study endeavors to employ the diffusion model, a more advanced approach, to the task of NILM.

Denoising Diffusion Probabilistic Models
With inspiration from non-equilibrium thermodynamics, the basic idea of DDPMs is to destroy the original data by gradually adding Gaussian noise, and then to learn to reconstruct the data through an inference process. The noising and denoising Markov chains are defined as the forward process and reverse process, respectively.
The step-by-step destruction and reconstruction of a power waveform in a diffusion model is illustrated in Figure 1. Random noise is successively added to $x_0$, a segment of clean power waveform of the target appliance, until the discernible features are completely lost. In reverse, we start from a random Gaussian noise $x_T$ and progressively remove extra noise to generate the target distribution. The original data $x_0$ and the whitened latent variables $x_1, x_2, \ldots, x_T$ share the same dimensionality.
Figure 1. Illustration of the forward process and the reverse process in a NILM task. The pink arrows point out the process of forward diffusion, where a clean power pattern is gradually destroyed. The green arrows indicate the process of denoising inference, where the target waveform is recovered.

The Forward Process
Diffusion models can be seen as latent variable models that map the data to a hidden feature space, with the process controlled by a predefined noise schedule $\beta_{1:T}$. According to the defining characteristic of the Markov chain, the distribution of $x_t$ at any arbitrary time step depends solely on its previous state $x_{t-1}$, so $x_t$ is obtained by adding Gaussian noise to $x_{t-1}$ according to:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right) \quad (1)$$

The iterative formula of the forward process is given as:

$$x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\varepsilon, \qquad \alpha_t = 1 - \beta_t, \quad \varepsilon \sim \mathcal{N}(0, \mathbf{I}) \quad (2)$$

In order to spare us from having to do step-by-step iteration, we derived the closed-form expression to directly calculate $x_t$ from $x_0$ using a reparameterization trick:

$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon \quad (3)$$

where $\bar\alpha_t = \prod_{i=1}^{t} \alpha_i$. In many applications of the diffusion process, the parameters $\beta_{1:T}$ are assigned small values following an increasing pattern. For instance, in [1], $\beta_{1:T}$ is defined as a linear function with values ranging from $10^{-4}$ to 0.02 over 1000 time steps. As $T$ grows sufficiently large, $\bar\alpha_T$ converges to zero, and the resulting distribution of the latent variable $x_T$ approaches a standard normal distribution. The diffusion process ceases when the final distribution becomes sufficiently disordered to be considered an isotropic Gaussian distribution.
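The closed-form corruption of Equation (3) can be sketched in a few lines. The schedule values below follow the linear schedule of [1]; everything else is a minimal illustration, not the paper's implementation:

```python
import numpy as np

# Linear noise schedule as in Ho et al. [1]: beta from 1e-4 to 0.02 over T steps
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # cumulative product defining alpha_bar_t

def q_sample(x0, t, rng=np.random.default_rng(0)):
    """Directly sample x_t from x_0 via the reparameterization trick:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

# As t approaches T, alpha_bar_t approaches zero, so x_T is close to pure noise
x0 = np.ones(64)          # a toy "clean" power segment
xT = q_sample(x0, T - 1)
```

With this schedule, $\bar\alpha_T \approx 4 \times 10^{-5}$, which is why the final latent is treated as an isotropic Gaussian.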

The Reverse Process
The reverse process is where the desired output data is generated by tracing the Markov chain backward. Starting from $x_T$, if the distribution of any $x_{t-1}$ can be derived from the prior term $x_t$, the original distribution $x_0$ can be recovered from pure Gaussian noise. Unfortunately, the reverse transfer distribution $q(x_{t-1} \mid x_t)$ cannot be inferred by simple mathematical derivation, so we used a deep learning model with parameters $\theta$ to estimate this reverse distribution, as depicted in Figure 2. Conditioned on $x_0$, the reverse conditional probability can be derived on the basis of the Bayes Rule:

$$q(x_{t-1} \mid x_t, x_0) = q(x_t \mid x_{t-1}, x_0)\,\frac{q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)} \propto \exp\!\left(-\frac{1}{2}\left(\left(\frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar\alpha_{t-1}}\right)x_{t-1}^2 - \left(\frac{2\sqrt{\alpha_t}}{\beta_t}x_t + \frac{2\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_{t-1}}x_0\right)x_{t-1} + C\right)\right) \quad (4)$$

where $C$ is a term not involving $x_{t-1}$. According to the probability density function of the normal distribution, the mean and variance of Equation (4) can be expressed as:

$$\tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t, \qquad \tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,x_t \quad (5)$$

Then, substituting the closed-form expression (3) for $x_0$ into (5), we derive a target mean that depends only on $x_t$:

$$\mu_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\varepsilon_t\right) \quad (6)$$

The above derivations reveal that the variance $\tilde\beta_t$ relies solely on the noise schedule and, thus, can be pre-computed. The term to be approximated ($\varepsilon_t$) appears in $\mu_t$, so we use a neural network to estimate the noise and, consequently, the mean.
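The precomputable posterior variance and the noise-parameterized mean can be evaluated numerically as below. This is a minimal sketch under the linear schedule of [1], with the true noise standing in for the network estimate $\varepsilon_\theta$:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def posterior_variance(t):
    """beta_tilde_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t.
    It depends only on the schedule, so it can be precomputed once."""
    ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
    return (1.0 - ab_prev) / (1.0 - alpha_bars[t]) * betas[t]

def posterior_mean_from_eps(x_t, eps, t):
    """mu_t = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t).
    In practice eps is replaced by the network's estimate eps_theta(x_t, t)."""
    return (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
```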

Training a Diffusion Model
Diffusion models adopt the modeling method of noise prediction, where the neural network takes $x_t$ and the time step $t$ as input to estimate the noise $\varepsilon_\theta(x_t, t)$. The goal of the training process is to narrow the gap between the actual noise and the predicted one by optimizing the negative log-likelihood through its variational lower bound. After a few simplifications that lead to more stable training, the loss term is parameterized as:

$$L = \mathbb{E}_{x_0,\, \varepsilon,\, t}\left[\left\|\varepsilon - \varepsilon_\theta\!\left(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon,\ t\right)\right\|^2\right]$$
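One noise-prediction training iteration can be sketched as follows. We assume a plain MSE objective here (the paper later swaps in a logarithmic norm), and `predict_noise` is a placeholder for the trained network:

```python
import numpy as np

# Illustrative linear noise schedule following [1]
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bars = np.cumprod(1.0 - betas)

def simplified_loss(eps_pred, eps_true):
    """Simplified DDPM objective: mean squared error between the sampled
    noise and the network's estimate eps_theta(x_t, t)."""
    return float(np.mean((eps_pred - eps_true) ** 2))

def training_iteration(x0, predict_noise, rng=np.random.default_rng(0)):
    """One training iteration: sample t and eps, corrupt x0 to x_t with the
    closed-form forward formula, then score the noise prediction."""
    t = rng.integers(0, len(alpha_bars))
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return simplified_loss(predict_noise(x_t, t), eps)
```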

Conditional Diffusion Model as Appliance-Level Data Generator
One of the salient features of NILM, as a generation task, is that, instead of randomly generating power sequences that follow a certain distribution, the generation of each segment of appliance power waveform is conditioned on a segment of aggregated power waveform with the same length. However, the vanilla diffusion model was originally designed for unconditional image generation, which necessitates adaptive modifications to tailor it to the requirements of the NILM task.
Conditional diffusion models have been well-studied in other sequence modeling tasks; for instance, in machine translation the model conditions on the source sentence, and in speech synthesis on the mel-spectrogram. The general goal of such algorithms is to model the probability density $p_\theta(x_0 \mid x_d)$, where $x_d$ contains conditioning features relevant to $x_0$. For diffusion models, the conditional distribution can be written as:

$$p_\theta(x_{0:T} \mid x_d) = p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t, x_d)$$

The proposed model takes two conditional inputs: the total power and the encoded temporal features. Traditional NILM algorithms detect the states of appliances based only on the aggregated power sequence, disregarding the regularity and periodicity of users' energy consumption patterns (for instance, dishwashers are generally used after dinner, and refrigerators operate more frequently during summer). In this study, we present an encoding technique that integrates multi-scale temporal information as supplementary knowledge for energy disaggregation, with reference to the global timestamp representation introduced in [37]. As illustrated in Figure 3, we extract three features from each time tag: hour of day, day of week and month of year, and linearly encode each feature into a value within the interval $[-0.5, 0.5]$. Moreover, a continuous noise level is adopted in this paper, as opposed to a discrete noise level, where we would sample $t \sim \mathrm{Uniform}(\{1, 2, \ldots, T\})$ and look up the corresponding $\alpha_t$ in the predefined linear schedule.
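The three-feature time-tag encoding described above can be sketched as follows. The exact linear maps (dividing by 23, 6 and 11) are our reading of the [-0.5, 0.5] scheme, not a verbatim reproduction of the paper's formula:

```python
from datetime import datetime

def encode_time(ts):
    """Encode a timestamp into three features, each linearly mapped into
    [-0.5, 0.5]: hour of day, day of week and month of year."""
    hour = ts.hour / 23.0 - 0.5           # 0..23        -> -0.5..0.5
    dow = ts.weekday() / 6.0 - 0.5        # Mon=0..Sun=6 -> -0.5..0.5
    month = (ts.month - 1) / 11.0 - 0.5   # Jan..Dec     -> -0.5..0.5
    return (hour, dow, month)
```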
Instead of conditioning on the time step $t$, the proposed diffusion model conditions on the continuous noise level $\sqrt{\bar\alpha}$, which is randomly chosen between two adjacent discrete noise levels:

$$\sqrt{\bar\alpha} \sim \mathrm{Uniform}\left(\sqrt{\bar\alpha_t},\ \sqrt{\bar\alpha_{t-1}}\right)$$

For our task, as depicted in Figure 4, the neural network takes the noisy appliance-level power data $x_t$, the corresponding noise level $\sqrt{\bar\alpha}$, the conditional aggregated power data $x_{aggre}$ and the embedded time tags $x_{time}$ as inputs, and outputs the approximated noise $\varepsilon_\theta(x_t, \sqrt{\bar\alpha}, x_{aggre}, x_{time})$.

Network Architecture
This section details the implementation of a neural network for noise prediction, with an architecture inspired by NU-Wave [38] and DiffWave [4], two diffusion-based neural vocoders.
As revealed in Figure 5, 1D convolutional layers are used to increase the number of channels of the input sequences $x_{\bar\alpha}$, $x_{aggre}$ and $x_{time}$ to $C$, and the Sigmoid Linear Unit (SiLU) activation is adopted:

$$\mathrm{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}$$

Similar to the positional embedding method proposed in the Transformer [30], the sinusoidal encoding formula is applied to embed the noise level $\sqrt{\bar\alpha}$. Then, two shared SiLU-activated Fully Connected (FC) layers and one residual-layer-specific FC layer project the encoded noise level to a $C$-dimensional vector, which is added to the convolved $x_{\bar\alpha}$ as a bias term.
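A sinusoidal noise-level embedding in the style of the Transformer positional encoding might look as follows. The dimension, the $10^4$ frequency base and the scale factor of 5000 are illustrative choices on our part; the paper does not state its constants here:

```python
import numpy as np

def embed_noise_level(noise_level, dim=64):
    """Sinusoidal embedding of the continuous noise level sqrt(alpha_bar),
    analogous to the Transformer positional encoding: a geometric ladder of
    frequencies, with sin/cos pairs concatenated into one vector."""
    half = dim // 2
    freqs = np.exp(-np.log(1e4) * np.arange(half) / half)  # geometric frequencies
    angles = 5000.0 * noise_level * freqs                  # scaled noise level
    return np.concatenate([np.sin(angles), np.cos(angles)])
```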
The main body of the model consists of $N$ identically structured layers connected in a residual manner to enable the direct delivery of input information to the final layers. In each residual layer, Bi-directional Dilated Convolution (Bi-DilConv) is used to process the inputs with an exponentially growing receptive field; in the $i$-th residual layer, the spacing between the kernel points is set to $2^{i \bmod n}$. Gated Units (GU) are applied to activate the summation of the processed noisy signals and conditional signals. Then, the convolved vector is split in two and passed on as the residual output and the skip output, respectively. Finally, we sum all the skip connections and use two convolutional layers to obtain the noise vector $\varepsilon_\theta$ with the same shape as $x_{\bar\alpha}$ and $x_{aggre}$.
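The dilation pattern $2^{i \bmod n}$ is easy to state concretely; a one-line sketch of the schedule for the residual stack:

```python
def dilation_schedule(num_layers, cycle):
    """Dilation factor 2^(i mod n) for the i-th residual layer: the receptive
    field grows exponentially within a cycle and resets every `cycle` layers."""
    return [2 ** (i % cycle) for i in range(num_layers)]
```

For example, six layers with a cycle of three yield dilations 1, 2, 4, 1, 2, 4.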

Training and Sampling Procedures
The training and sampling procedures of the diffusion model are shown in Algorithms 1 and 2. In the training procedure, after extracting data from the dataset, we sample an iteration index $t$ and obtain a corresponding continuous noise level $\sqrt{\bar\alpha}$ to determine the extent of whitening applied to the original waveform. As mentioned in Section 3.3, the deep learning model updates its parameters to minimize the distance between the sampled noise $\varepsilon$ and the predicted noise $\varepsilon_\theta$. Instead of using common loss functions such as the MSE and the $L_1$ norm, we found that a logarithmic norm loss improves convergence speed and leads to improved empirical outcomes. In the sampling algorithm, we adopted a fast sampling method that uses far fewer inference steps. Instead of traversing the reverse process step by step with $t = T, T-1, \ldots, 1$, we define an inference schedule with only $T_{infer}$ noise levels ($T_{infer} \ll T$). The test results demonstrated that the fast sampling trick greatly accelerated the inference procedure without degrading generation quality. In each inference step, we calculate the predicted variance $\tilde\beta_t$ and mean $\mu_\theta(x_{\bar\alpha}, \sqrt{\bar\alpha}, x_d)$ to estimate the previous term.
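The short-schedule ancestral sampling loop can be sketched as below. The three-step schedule is illustrative, and a zero-noise lambda stands in for the trained conditional network $\varepsilon_\theta$; this is not the paper's Algorithm 2 verbatim:

```python
import numpy as np

def fast_sample(eps_model, shape, infer_betas, rng=np.random.default_rng(0)):
    """Ancestral sampling over a short inference schedule (T_infer << T).
    eps_model(x, sqrt_alpha_bar) plays the role of the conditional network."""
    alphas = 1.0 - infer_betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)  # start from pure Gaussian noise x_T
    for t in range(len(infer_betas) - 1, -1, -1):
        eps = eps_model(x, np.sqrt(alpha_bars[t]))
        # Posterior mean mu_t from the noise estimate (Equation (6))
        mean = (x - infer_betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            # Precomputable posterior variance beta_tilde_t (Equation (5))
            beta_tilde = (1.0 - alpha_bars[t - 1]) / (1.0 - alpha_bars[t]) * infer_betas[t]
            x = mean + np.sqrt(beta_tilde) * rng.standard_normal(shape)
        else:
            x = mean  # final step: no noise added
    return x
```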

Experiments
We carried out an experiment to test the proposed model. The workflow, as illustrated in Figure 6, involved pre-processing data, splitting the dataset, training a neural network using the training set, and evaluating its performance on the testing set.

Dataset
This study employed low-frequency active power data from the REDD and UKDALE datasets to train and test the proposed model. REDD is the most widely-used dataset in the domain of NILM, comprising the mains and submeter power data of six residential homes in the United States, recorded over a period of approximately four months. The UKDALE dataset, on the other hand, was published by Imperial College London in 2014, and contains power consumption information from House 1 collected over up to three years, while the data for the other four houses were recorded for several months.
We pre-processed the original power data according to the following procedure:
Step 1: Merge the data of the split-phase mains meters. Two-phase power supply is commonly used in North American households, so, for REDD, we calculated the sum of the two mains channels to obtain the actual aggregated power data.
Step 2: Resample the power data at a fixed interval of 6 s.
Step 3: Fill data gaps shorter than 3 min by forward-filling, and fill those longer than 3 min with zeros.
Step 4: Attach status labels to the datasets. An appliance is classified as being in an 'on' state at a particular time point and assigned a status label of 1, provided that its power consumption falls within the acceptable 'on' power range and its operation time exceeds the minimum duration specified in Table 1. Otherwise, a status label of 0 is assigned.
Step 5: Standardize the power data according to Formula (14) to enhance the accuracy of the model and convergence speed. Following the pre-processing of the power data, overlapping sliding windows were utilized to extract sequences of processable length.
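The gap-filling rule of Step 3 can be sketched as follows. The 6 s sampling period and the 3 min threshold come from the steps above; treating a gap at the very start of a series as a long gap is our assumption:

```python
import numpy as np

def fill_gaps(power, sample_period=6, max_gap=180):
    """Fill NaN gaps in a resampled power series: forward-fill gaps shorter
    than max_gap seconds (3 min), fill longer gaps (and leading gaps, which
    have no value to carry forward) with zeros."""
    limit = max_gap // sample_period  # gap length threshold in samples
    out = np.array(power, dtype=float)
    i, n = 0, len(out)
    while i < n:
        if np.isnan(out[i]):
            j = i
            while j < n and np.isnan(out[j]):
                j += 1                      # find the end of the gap
            if (j - i) < limit and i > 0:
                out[i:j] = out[i - 1]       # short gap: carry last value forward
            else:
                out[i:j] = 0.0              # long or leading gap: zeros
            i = j
        else:
            i += 1
    return out
```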

Evaluation Metrics
The selection of suitable metrics is important in appraising the algorithm's performance. As NILM can be formulated as either a binary classification problem (to detect the on/off states of the target appliance) or a regression problem (to estimate the numeric value of power consumption), the evaluation incorporated both classification and regression metrics to ensure a comprehensive assessment.

Classification Metrics
We used the classification metrics in Equation (16) to evaluate the ability of the algorithm to identify the on/off states, where TP, FP, FN and TN, respectively, represent the number of True Positive, False Positive, False Negative and True Negative results, and P and N, respectively, represent the number of points where the appliance is switched on and off in the ground truth:

$$\mathrm{Accuracy} = \frac{TP + TN}{P + N}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$
While accuracy is an intuitive classification metric, its applicability is restricted in datasets that are unbalanced, where the 'on' states of appliances constitute a small fraction of the entire sequence. In such cases, the F-score index serves as an effective approach to address the imbalance issue. The F-score comprehensively incorporates both precision and recall, and varying weights can be assigned to them by adjusting the β value, thereby enabling an evaluation of the quality of NILM algorithms under diverse application scenarios. Given that precision and recall are usually deemed equally important, the value of β was set to 1, and F-score was calculated as the harmonic average of the two, termed as F1-score.
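The F-score described above reduces to a few lines; with $\beta = 1$ it is the harmonic mean of precision and recall:

```python
def f_score(tp, fp, fn, beta=1.0):
    """F_beta combines precision and recall; beta > 1 weights recall higher,
    beta < 1 weights precision higher, and beta = 1 gives the F1-score."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)
```

For example, with 50 true positives, 10 false positives and 10 false negatives, precision and recall are both 5/6, so F1 = 5/6.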

Regression Metrics
To evaluate the ability of the model to reconstruct the power profiles of the target appliance, two commonly used regression metrics, the Mean Absolute Error (MAE) and the Mean Relative Error (MRE), were adopted:

$$\mathrm{MAE} = \frac{1}{T}\sum_{t=1}^{T}\left|\hat{x}_t - x_t\right|$$

where $\hat{x}_t$ and $x_t$, respectively, represent the appliance's estimated and actual power at time $t$, and $T$ is the total number of points in the sequence.
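Both metrics are straightforward to compute. The MRE normalizer below (the larger of predicted and true power at each point, floored to avoid division by zero) is one common NILM convention and an assumption on our part; the paper's exact formula is not reproduced above:

```python
import numpy as np

def mae(pred, true):
    """Mean Absolute Error in watts: average |x_hat_t - x_t| over the sequence."""
    return float(np.mean(np.abs(pred - true)))

def mre(pred, true):
    """Mean Relative Error: per-point absolute error divided by a reference
    power (here max(pred, true), a common NILM choice, assumed not verbatim)."""
    denom = np.maximum(np.maximum(pred, true), 1e-9)  # guard against division by zero
    return float(np.mean(np.abs(pred - true) / denom))
```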

Implementation Details
The NILM project was conducted on a 64-bit computer equipped with an Intel(R) Core(TM) i7-12700 CPU @ 3.61 GHz, 32 GB memory, and an NVIDIA GeForce RTX 3080Ti GPU. The PyTorch framework was employed to train and test the diffusion model.
During the training phase, the model was trained until convergence at a learning rate of $3 \times 10^{-5}$. To accelerate gradient descent, the Adam optimizer was utilized, with the hyperparameters $\beta_1$ and $\beta_2$ set to 0.5 and 0.999, respectively.
The hyperparameters of the diffusion model are shown in Table 2.
Table 2. Hyperparameters of the model.

Results
DiffNILM was evaluated against four state-of-the-art NILM models, including the bi-directional LSTM [21], CNN [23], BERT4NILM [31] and cGAN [35]. The objectivity of the comparative experiments was ensured by adopting the same data processing method, and all the baseline models were trained to convergence. The performance indicators of the five disaggregation models on REDD and UKDALE datasets are shown in Tables 3 and 4. Output sample curves generated by DiffNILM, BERT4NILM, and cGAN models are displayed in Figures 7 and 8, where two relatively underperforming methods were excluded to avoid clutter.
We first examined the performance of DiffNILM on microwaves and kettles, which are characterized by infrequent usage and relatively short running periods. The results in the tables indicate that the proposed algorithm outperformed the other methods on several indicators, particularly the MAE and MRE. The output signals further reveal that the model effectively captured most of the activations, and the predicted power values aligned well with the ground truth. However, a few exceptional cases were identified where the power signatures were not entirely typical, notably the first activation of the microwave depicted in Figure 8, which exhibited a longer turn-on time than other instances and was subject to relatively strong interference from background noise.
The bold item in each column represents the optimal value for that particular metric among all the models. ↑ indicates that a higher value of the metric is better, while ↓ indicates that a lower value is better.
Washers and dishwashers are household appliances that exhibit infrequent use but extended operation per use. Their consumption patterns are intricate, owing to frequent start-and-stop events and mode switching during operation. In the REDD dataset, washers maintained a constant power level during the 'on' mode, and the waveforms were effectively rebuilt by DiffNILM, despite slightly elevated power values. Washers in the UK have operating patterns distinct from their US counterparts, with evident power oscillations, which the proposed algorithm also reconstructed effectively. Dishwashers present more complex operational characteristics with multiple modes, such as pre-rinse, steam wash and dry, which demand greater model generation capacity. Although DiffNILM's output in the low power consumption mode was not entirely consistent with the ground truth signal, it exhibited good overall power estimation performance.
The refrigerator operates based on automatic temperature regulation, with frequent start and stop events and prominent periodicity. Based on the evaluation metrics and sample waveforms, DiffNILM exhibited satisfactory performance in disaggregating the refrigerator load. The algorithm accurately detected each activation event, and the power prediction accuracy was only compromised when there was significant background power interference. Comparing the results across the two datasets, it is interesting to note that the two generative models performed better on UKDALE than on REDD. A plausible reason is that deep generative models typically require larger amounts of training data: the smaller REDD dataset might fail to meet the data requirements of cGAN and DiffNILM, hindering their performance on this dataset, whereas the larger UKDALE dataset facilitated better performance, reflected in the significant improvement of the generative models' metrics.
Overall, the proposed algorithm outperformed the baseline models on most metrics and yielded better results than the previous methods concerning the mean values of the four metrics on both datasets. Meanwhile, DiffNILM demonstrated a satisfactory fitting effect on the consumption signals of various electrical appliances, and was capable of handling complex load patterns. Nonetheless, due to the unique nature of 'diffusion', the predicted power curve was not always smooth. Additionally, in cases where the background power was complex, the disaggregated curve might experience distortion following the total power, although the impact remained within an acceptable range.

Conclusions
In this paper, we introduce DiffNILM, a novel framework for energy disaggregation that utilizes the diffusion probabilistic model. The key innovation of our approach is the conditional diffusion model which takes both the total active power and embedded time tags as inputs and generates the appliance power waveforms. Additionally, we propose an encoding method for multi-scale temporal features which captures the periodicity of power consumption behaviors. The proposed method was applied and assessed on two open-access datasets, REDD and UKDALE. Averaging across all appliances, DiffNILM displayed an improvement in all four metrics on both datasets. The results also highlight the potential of the proposed DiffNILM algorithm in reconstructing complex load patterns, despite the fact that DiffNILM exhibits certain issues, such as generating power waveforms that are not sufficiently smooth and may experience distortion.
Meanwhile, we would like to clarify that the algorithm was developed with accuracy as the primary objective, and we did not explicitly consider the computational burden of the proposed implementation. Going forward, we are committed to developing a light-weight version of the algorithm that balances both accuracy and computational efficiency. This will enable the approach to be deployed in real-world settings with limited computational resources.
Furthermore, when analyzing the results, the significance of dataset size in achieving optimal performance was noted. However, acquiring large-scale appliance-level data through field sampling in numerous households can be a formidable task. In forthcoming research, we aim to explore a method of synthesizing appliance power signatures as a means of augmenting the existing NILM datasets, which can also be realized with diffusion models.