2.1. Problem Definition
Non-intrusive load disaggregation refers to the process of extracting the power consumption data of individual appliances from the total household energy consumption. Suppose the user has $N$ electrical appliances, each with only two states (on and off), and the energy consumption of each appliance does not vary significantly while it is operating. Then, the total energy measured at time $t$ can be expressed as:

$$P(t)=\sum_{i=1}^{N} s_i(t)\,p_i(t)+\varepsilon(t) \quad (1)$$

Here, $N$ represents the total number of electrical appliances in the user's home. $p_i(t)$ denotes the power consumption of the $i$-th appliance at time $t$, while $s_i(t)\in\{0,1\}$ represents the on/off state of the same appliance at time $t$. $\varepsilon(t)$ stands for the noise present at time $t$. To obtain the power consumption of each appliance at time $t$, power-monitoring modules would need to be installed between each appliance and the power supply, which undoubtedly increases the cost of the smart grid. Non-intrusive load monitoring avoids the need to install sensors at each individual load point, greatly reducing equipment installation costs. This allows for the estimation of $p_i(t)$ based solely on the given total power consumption $P(t)$.
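For illustration, the additive model in Equation (1) can be sketched with a minimal Python example; the number of appliances, their nominal powers, and the noise level below are hypothetical values chosen only to make the construction concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 8                                              # number of time steps (illustrative)
nominal_power = np.array([150.0, 2000.0, 700.0])   # hypothetical p_i for 3 two-state appliances
states = rng.integers(0, 2, size=(3, T))           # s_i(t): on/off states
noise = rng.normal(0.0, 5.0, size=T)               # epsilon(t): measurement noise

# Equation (1): aggregate power is the sum of the active appliances' power plus noise
aggregate = (states * nominal_power[:, None]).sum(axis=0) + noise
print(aggregate.round(1))
```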
Load disaggregation, as an important component of non-intrusive load monitoring (NILM), aims to analyze the power consumption of specific appliances. These consumption patterns often exhibit highly nonlinear characteristics, making traditional linear analysis methods ineffective for accurate parsing and disaggregation. Deep learning offers an efficient and accurate solution through its powerful data processing and pattern recognition capabilities, as well as its innate advantages in modeling nonlinear data. Formally, the problem can be described as finding a function that accurately maps the observed aggregate energy consumption to the unique consumption profiles of individual appliances. The objective of load disaggregation is therefore to infer the energy consumption patterns of individual appliances $p_i(t)$ by analyzing the total energy consumption data $P(t)$.
Equation (1) represents the entire framework of non-intrusive load disaggregation, requiring only the total power $P(t)$ for research. In it, $p_i(t)$ represents the power of each individual appliance, for which a corresponding deep learning model must be established to perform disaggregation; these per-appliance models are then aggregated to form a holistic NILM system. In NILM, the energy consumption data of appliances are typically time series involving multiple time points. Vectors conveniently represent this data structure, where the power consumption at each unit of time (such as minutes or hours) is one component of the vector. This process can be expressed as follows:

$$\hat{\mathbf{p}}_i=f_i(\mathbf{P}),\qquad \mathbf{P}=\left[P(1),P(2),\ldots,P(T)\right]^{\mathsf T},\qquad \hat{\mathbf{p}}_i=\left[\hat p_i(1),\hat p_i(2),\ldots,\hat p_i(T)\right]^{\mathsf T} \quad (2)$$

where $f_i(\cdot)$ is the disaggregation mapping learned for the $i$-th appliance.
Therefore, the process of energy disaggregation involves establishing a mapping relationship between aggregate power consumption and the consumption of individual appliances. This implies the need to develop different models for various appliances since each mapping corresponds only to a specific appliance. Furthermore, the uniqueness of the load characteristics of each appliance further emphasizes the individuality of these models.
Ultimately, the modeling problem for load disaggregation is summarized as follows: for a set of $N$ appliances with known individual power consumption $p_i(t)$ and total measured power consumption $P(t)$, the process of non-intrusive load monitoring can be conceptualized as an optimization problem. Specifically, for a given time point $t$, the task is to find an $N$-dimensional vector $\hat{\mathbf{p}}(t)=\left[\hat p_1(t),\ldots,\hat p_N(t)\right]^{\mathsf T}$ such that the error between the disaggregated power obtained through deep learning and the actual power is minimized, as shown in the following Equation (3):

$$\min_{\hat{\mathbf{p}}(t)}\ \sum_{i=1}^{N}\left(\hat p_i(t)-p_i(t)\right)^2 \quad (3)$$
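As a small worked illustration of the objective in Equation (3), the following sketch evaluates the disaggregation error at a single time point; the predicted and actual powers are hypothetical values.

```python
import numpy as np

# Hypothetical disaggregated (predicted) and actual appliance powers at one time point t,
# for N = 3 appliances (values are illustrative only).
p_hat = np.array([140.0, 0.0, 690.0])   # N-dimensional vector produced by the model
p_true = np.array([150.0, 0.0, 700.0])  # ground-truth appliance powers

# Equation (3): the disaggregation error to be minimized at time t
error = np.sum((p_hat - p_true) ** 2)
print(error)                            # 200.0
```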
2.2. VMD–Nyströmformer–BiTCN Model
This paper proposes the VMD–Nyströmformer–BiTCN load disaggregation model, which employs VMD filtering as the data processing component to preprocess raw data. By incorporating the Nyströmformer attention model as the feature extraction and variable-selection layer, the model performs weighted feature extraction of the power signals. Additionally, the BiTCN network is used as the deep learning network to train the power signals. Residual networks are introduced to enhance features and prevent overfitting, thereby improving the model’s disaggregation accuracy.
VMD generally outperforms EMD in load disaggregation, offering greater adaptability: unlike wavelet filtering, it requires no pre-selection of bases or decomposition levels and instead determines the decomposition parameters automatically through optimization. VMD can also adjust its regularization parameters to balance accuracy and resistance to mode mixing, demonstrating flexibility and controllability when dealing with non-stationary appliance power signals. Meanwhile, the Nyströmformer enhances the operational speed and computational efficiency of attention mechanisms without compromising accuracy. Before the emergence of the TCN, processing sequence data relied on RNNs such as the GRU and LSTM, which are difficult to parallelize and limited by dependencies between time steps, restricting efficiency when handling large-scale data. The TCN overcomes these limitations through convolution operations, enabling parallel computation and suitability for large-scale data processing, especially for load disaggregation tasks. The BiTCN further enhances processing efficiency by expanding the receptive field without increasing the number of convolution layers.
2.2.1. VMD Filtering
In 2014, Konstantin Dragomiretskiy et al. proposed [26] the VMD (variational mode decomposition) method, a novel, adaptive, and completely non-recursive variational signal decomposition technique. This method uses the alternating direction method of multipliers (ADMM) to decompose complex signals into several modal components (intrinsic mode functions, IMFs) arranged from low to high frequencies. It has good time–frequency localization capabilities and better noise robustness.
The model initially chose VMD filtering because, when appliance data are collected at low frequencies, a signal outlier is amplified by the large time gap between consecutive samples, which adversely affects the training outcomes. For example, a spike in current may exist only momentarily; however, with a sampling frequency of 1/6 Hz, this single point may appear in the data as a 12 s fluctuation. Such a fluctuation does not align with the normal operational power characteristics, making filtering necessary.
Compared to other filtering methods, VMD filtering can better suppress mode mixing by introducing regularization terms and variational optimization than EMD (empirical mode decomposition) filtering, resulting in a cleaner power signal after denoising. VMD filtering identifies optimal modal functions through variational optimization, generally providing better decomposition accuracy than EMD. Compared to wavelet filtering, VMD’s adaptability is stronger. VMD does not require pre-selecting different wavelet bases or determining decomposition levels for different appliances. Instead, it automatically determines them through the optimization process. VMD filtering can balance decomposition accuracy and enhance anti-mode mixing ability by adjusting regularization parameters. Furthermore, it offers flexibility, controllability, and effective handling of non-stationary signals, such as the power signals of appliances.
In power signal filtering, the VMD signal filtering process is the process of obtaining the optimal solution to the variational problem. By shifting the modal functions’ spectra, we can obtain the respective computed center frequencies; then, using Gaussian smoothing to demodulate the data signal, we obtain the bandwidth of each IMF. Subsequently, by matching the optimal center frequencies and finite bandwidths of each IMF, we separate the IMFs, partition the signal’s frequency domain, extract the effective parts of the signal, and eventually obtain the optimal solution to the variational problem. The decomposition model is shown by the following equation:
$$\min_{\{u_k\},\{\omega_k\}}\left\{\sum_{k=1}^{K}\left\|\partial_t\!\left[\left(\delta(t)+\frac{j}{\pi t}\right)*u_k(t)\right]e^{-j\omega_k t}\right\|_2^2\right\}\quad \text{s.t.}\quad \sum_{k=1}^{K}u_k(t)=f(t) \quad (4)$$

In the formula, $u_k(t)$ represents the different modal components of the target appliance obtained after VMD decomposition; $\omega_k$ represents the center frequencies of the modal components; $\partial_t$ is the gradient operator with respect to time; $\delta(t)$ is the unit impulse (Dirac) function; $t$ is the sampling time of the appliance power; and $f(t)$ is the original power signal. To find the optimal solution to the variational problem, the constrained variational problem needs to be converted into an unconstrained one. This introduces the Lagrange multiplier $\lambda(t)$ and the quadratic penalty factor $\alpha$. The augmented Lagrangian function expression is given by the following:

$$L\!\left(\{u_k\},\{\omega_k\},\lambda\right)=\alpha\sum_{k=1}^{K}\left\|\partial_t\!\left[\left(\delta(t)+\frac{j}{\pi t}\right)*u_k(t)\right]e^{-j\omega_k t}\right\|_2^2+\left\|f(t)-\sum_{k=1}^{K}u_k(t)\right\|_2^2+\left\langle\lambda(t),\,f(t)-\sum_{k=1}^{K}u_k(t)\right\rangle \quad (5)$$
Then, after converting the parameters from the time domain to the frequency domain and performing a secondary optimization within the non-negative frequency range, the various modal components and their center frequencies can be obtained:

$$\hat u_k^{n+1}(\omega)=\frac{\hat f(\omega)-\sum_{i\neq k}\hat u_i(\omega)+\dfrac{\hat\lambda(\omega)}{2}}{1+2\alpha\left(\omega-\omega_k\right)^2} \quad (6)$$

$$\omega_k^{n+1}=\frac{\int_0^{\infty}\omega\left|\hat u_k(\omega)\right|^2\,d\omega}{\int_0^{\infty}\left|\hat u_k(\omega)\right|^2\,d\omega} \quad (7)$$

The alternating direction method of multipliers (ADMM) is then used to iteratively update the values of $u_k$, $\omega_k$, and $\lambda$ to obtain the optimal solution of Equation (5), thereby decomposing the original load power signal. In the load disaggregation processing of power signals, VMD decomposes the power signals into a series of modal functions with different frequency characteristics. Based on the characteristics of the intrinsic mode functions (IMFs), and through iterative training, the IMFs related to noise can be identified. By setting thresholds, selectively discarding certain IMFs, or weighting the IMFs, the identified noisy IMFs can be removed or suppressed. The remaining IMFs are then reconstructed to obtain the denoised power signal. This approach effectively removes noise while retaining the significant electrical power features in the data.
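The mode-selection and reconstruction step described above can be sketched as follows. The sketch assumes the IMFs have already been computed by a VMD routine (for example, the `VMD` function of the `vmdpy` package), and the correlation threshold is a hypothetical value, not the setting used in this paper.

```python
import numpy as np

def denoise_from_imfs(signal: np.ndarray, imfs: np.ndarray, corr_threshold: float = 0.2) -> np.ndarray:
    """Reconstruct a denoised power signal from VMD modes.

    `imfs` is a (K, T) array of intrinsic mode functions produced by a VMD routine;
    modes whose correlation with the original signal falls below `corr_threshold`
    (a hypothetical value) are treated as noise and dropped.
    """
    keep = []
    for imf in imfs:
        corr = np.corrcoef(signal, imf)[0, 1]
        if np.abs(corr) >= corr_threshold:
            keep.append(imf)
    # Sum the retained modes to rebuild the denoised power signal
    return np.sum(keep, axis=0) if keep else np.zeros_like(signal)
```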
2.2.2. Nyströmformer
The attention mechanism is inspired by the human visual attention mechanism: when people observe objects, they focus on certain parts and ignore unimportant information. In load disaggregation tasks, attention can help the model focus on the most relevant and important power features in the input data. It allows the model to dynamically attend to different parts of the information during prediction, which helps capture long-distance dependencies in the input data and thereby improves the model's understanding of the data. Traditional attention mechanisms apply linear transformations to the input data to obtain the Query $Q$, Key $K$, and Value $V$ matrices used to compute attention scores. These attention scores are then processed through a softmax function to obtain attention weights. These weights represent the importance of each value, indicating which parts the model should focus on. The standard scaled dot-product attention in matrix form is written as follows:

$$\operatorname{Attention}(Q,K,V)=\operatorname{softmax}\!\left(\frac{QK^{\mathsf T}}{\sqrt{d_q}}\right)V \quad (8)$$
In the above formula, the complexity of the model is $O(n^2)$, where $n$ is the length of the input sequence. In non-intrusive load monitoring tasks, the data collected for each appliance are vast. With a complexity of $O(n^2)$, maintaining global feature input as $n$ increases leads to excessively long model training times. Therefore, this paper adopts a novel attention mechanism that maintains the accuracy of traditional attention while reducing appliance training time and increasing the feasibility of edge computing.
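For reference, a minimal PyTorch sketch of the standard attention in Equation (8) makes the quadratic cost visible: the `scores` tensor is $n \times n$. The sizes used below are illustrative only.

```python
import math
import torch

def scaled_dot_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Standard attention of Equation (8); the score matrix has shape (n, n)."""
    d_q = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_q)   # (n, n): O(n^2) time and memory
    return torch.softmax(scores, dim=-1) @ V

# Illustrative sizes: n power samples in the window, head dimension d_q
n, d_q = 512, 64
Q, K, V = (torch.randn(n, d_q) for _ in range(3))
out = scaled_dot_attention(Q, K, V)                      # (n, d_q)
```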
In Equation (8), $QK^{\mathsf T}$ must be computed first before calculating the softmax, which prevents us from using the associative property of matrix multiplication. The $QK^{\mathsf T}$ matrix contains the pairwise inner products of the query and key vectors, resulting in both time and space complexity of $O(n^2)$. To address this issue, the Nyströmformer proposes [27] the following solution to reduce the computational complexity of attention while maintaining precision.
The softmax of the $QK^{\mathsf T}$ matrix in the model cannot be factorized directly. The best approach is to approximate it with a product of smaller matrices that can each be computed separately. The softmax matrix used in self-attention is as follows:

$$S=\operatorname{softmax}\!\left(\frac{QK^{\mathsf T}}{\sqrt{d_q}}\right)=\begin{bmatrix}A_S & B_S\\ F_S & C_S\end{bmatrix} \quad (9)$$

Here, $A_S\in\mathbb{R}^{m\times m}$, $B_S\in\mathbb{R}^{m\times(n-m)}$, $F_S\in\mathbb{R}^{(n-m)\times m}$, and $C_S\in\mathbb{R}^{(n-m)\times(n-m)}$. $A_S$ is a sample matrix obtained by selecting $m$ columns and $m$ rows (landmarks) from $S$. First, perform singular value decomposition (SVD) on the sample matrix $A_S$: let $A_S=U\Lambda V^{\mathsf T}$. The formula is as follows:

$$\hat S=\begin{bmatrix}A_S\\ F_S\end{bmatrix}A_S^{+}\begin{bmatrix}A_S & B_S\end{bmatrix} \quad (10)$$

$A_S^{+}=V\Lambda^{+}U^{\mathsf T}$ is the Moore–Penrose pseudoinverse of $A_S$, and $S$ is approximated by $\hat S$.
Next, for a given query vector $q_i$ and key vector $k_j$, let the landmarks be the segment means

$$\tilde q_j=\frac{1}{p}\sum_{i=(j-1)p+1}^{jp}q_i,\qquad j=1,\ldots,m \quad (11)$$

$$\tilde k_j=\frac{1}{p}\sum_{i=(j-1)p+1}^{jp}k_i,\qquad j=1,\ldots,m \quad (12)$$

where $p=n/m$, $\tilde Q=\left[\tilde q_1;\ldots;\tilde q_m\right]$, and $\tilde K=\left[\tilde k_1;\ldots;\tilde k_m\right]$. We can then construct the following:

$$\tilde F=\operatorname{softmax}\!\left(\frac{Q\tilde K^{\mathsf T}}{\sqrt{d_q}}\right) \quad (13)$$

$$\tilde B=\operatorname{softmax}\!\left(\frac{\tilde Q K^{\mathsf T}}{\sqrt{d_q}}\right) \quad (14)$$

With $\tilde F$ and $\tilde B$ available in hand, the matrix $\hat S$ for the standard Nyström approximation is calculated as follows:

$$\hat S=\tilde F\,\tilde A\,\tilde B=\operatorname{softmax}\!\left(\frac{Q\tilde K^{\mathsf T}}{\sqrt{d_q}}\right)\left(\operatorname{softmax}\!\left(\frac{\tilde Q\tilde K^{\mathsf T}}{\sqrt{d_q}}\right)\right)^{+}\operatorname{softmax}\!\left(\frac{\tilde Q K^{\mathsf T}}{\sqrt{d_q}}\right) \quad (15)$$

where $\tilde A=\left(\operatorname{softmax}\!\left(\frac{\tilde Q\tilde K^{\mathsf T}}{\sqrt{d_q}}\right)\right)^{+}$.
The matrix $\operatorname{softmax}\!\left(Q\tilde K^{\mathsf T}/\sqrt{d_q}\right)$ corresponds to selecting $m$ columns from the $n\times n$ softmax matrix, while the matrix $\operatorname{softmax}\!\left(\tilde Q K^{\mathsf T}/\sqrt{d_q}\right)$ corresponds to selecting $m$ rows from the same matrix. This representation is an application of (10) to softmax matrix approximation in self-attention: $\begin{bmatrix}A_S\\ F_S\end{bmatrix}$ in (10) corresponds to the first matrix in (15), and $\begin{bmatrix}A_S & B_S\end{bmatrix}$ in (10) corresponds to the last matrix in (15).
The model uses an iterative method to approximate the Moore–Penrose pseudoinverse through efficient matrix–matrix multiplications. Ultimately, its attention matrix is equivalent to the following:

$$\hat S=\operatorname{softmax}\!\left(\frac{Q\tilde K^{\mathsf T}}{\sqrt{d_q}}\right)Z^{\star}\operatorname{softmax}\!\left(\frac{\tilde Q K^{\mathsf T}}{\sqrt{d_q}}\right) \quad (16)$$

where $Z^{\star}$ is the iterative approximation of $\left(\operatorname{softmax}\!\left(\tilde Q\tilde K^{\mathsf T}/\sqrt{d_q}\right)\right)^{+}$, and the attention output is obtained as $\hat S V$.
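A minimal PyTorch sketch of Equations (11)–(16) follows. For simplicity it computes an exact pseudoinverse with `torch.linalg.pinv` in place of the iterative approximation $Z^{\star}$ and assumes the sequence length $n$ is divisible by the number of landmarks $m$; it illustrates the approximation rather than reproducing the implementation used in this paper.

```python
import math
import torch

def nystrom_attention(Q, K, V, m: int = 32):
    """Nystrom-approximated softmax attention (Equations (11)-(16), sketched).

    Uses segment means for the m landmarks and an exact pseudoinverse instead of
    the iterative approximation of Z*. Assumes n is divisible by m.
    """
    n, d_q = Q.shape
    p = n // m
    # Equations (11)-(12): landmark queries/keys as segment means
    Q_tilde = Q.reshape(m, p, d_q).mean(dim=1)
    K_tilde = K.reshape(m, p, d_q).mean(dim=1)

    softmax = lambda x: torch.softmax(x / math.sqrt(d_q), dim=-1)
    F_tilde = softmax(Q @ K_tilde.T)                            # (n, m), Eq. (13)
    A_tilde = torch.linalg.pinv(softmax(Q_tilde @ K_tilde.T))   # (m, m), pseudoinverse
    B_tilde = softmax(Q_tilde @ K.T)                            # (m, n), Eq. (14)

    # Equations (15)-(16): approximate softmax(QK^T/sqrt(d_q)) V without forming an n x n matrix
    return F_tilde @ (A_tilde @ (B_tilde @ V))

n, d_q = 512, 64
Q, K, V = (torch.randn(n, d_q) for _ in range(3))
out = nystrom_attention(Q, K, V, m=32)                          # (n, d_q)
```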
In the complexity analysis of the Nyström approximation, the time complexity of several key steps is primarily considered. First, landmark selection uses the segment-means method, which has a time complexity of $O(n)$. This means that the efficiency of landmark selection is linearly related to the size of the input data and can be accomplished through a simple linear scan. This is crucial for large-scale data processing, as it ensures that landmark selection does not become a performance bottleneck.
Next, the iterative approximation of the pseudoinverse requires $O(m^3)$ time in the worst case. Regarding matrix multiplication, the calculations involve first computing $\operatorname{softmax}\!\left(Q\tilde K^{\mathsf T}/\sqrt{d_q}\right)Z^{\star}$ and $\operatorname{softmax}\!\left(\tilde Q K^{\mathsf T}/\sqrt{d_q}\right)V$, followed by the multiplication of these two results. The overall time complexity of this series of operations is $O(nm^2+nmd_q)$. Overall, the analysis indicates that the Nyström approximation method is efficiently designed in terms of time complexity with respect to the input size and the number of landmarks, maintaining good efficiency when handling large datasets. Thus, the total time complexity is $O(n+m^3+nm^2+nmd_q)$, which is linear in the input length $n$ when the number of landmarks $m$ is fixed. The principle can be referenced in the flowchart in Figure 1.
In terms of memory, the cost of storing the landmark matrices $\tilde Q$ and $\tilde K$ is $O(md_q)$, while the cost of storing the four Nyström approximation matrices is $O(nm)$. Therefore, the total memory usage of the model proposed in this paper is $O(md_q+nm)$.
This undoubtedly greatly reduces the training time of the load disaggregation part in NILM. Considering the current sample training time, using Nyströmformer to replace the traditional attention model can reduce computational maintenance costs as well as provide a feasible solution for lightweight models and rapid training of unknown devices in edge computing.
2.2.3. BiTCN Architecture
The main network trained in this paper adopts an extended network of the TCN, known as BiTCN (bidirectional temporal convolutional network). The BiTCN retains the dilated convolution and residual connections of the TCN but replaces the causal convolutions with a bidirectional temporal convolutional neural network structure. This bidirectional transmission of information allows the network to extract power feature information more effectively.
TCN stands for temporal convolutional network, a neural network architecture used for processing time-series data. It utilizes convolution operations to capture local patterns and long-term dependencies in time-series data. The advantage of TCN is its ability to map sequences of any length to output sequences of the same length. This feature makes it particularly suitable for tasks like load disaggregation that deal with large amounts of one-dimensional appliance power data. In a one-dimensional convolution, the receptive field of the later layers becomes increasingly larger, allowing more load features to be extracted. Such a network structure significantly enhances the ability to learn subtle fluctuations in load power, making the model more precise in learning load characteristics.
However, such an ordinary convolutional structure limits the receptive field: expanding it would require an excessive number of network layers, which in turn brings a series of problems such as gradient explosion, overfitting, and inefficiency. To address this, the TCN incorporates dilated convolutions, which expand the receptive field without increasing computational complexity. Through the progressive increase in the dilation factor, the receptive field of each layer grows exponentially with network depth. Dilated convolutions effectively extend the receptive field of the network model and improve its performance. As a result, the network can obtain more data information for analysis and thus better learn the global features of electrical appliances when processing power signals. A larger receptive field allows the model to capture more effective information, enhancing its capacity for data learning and analysis, and generating more accurate predictive results. The structure is illustrated in Figure 2.
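The exponential growth of the receptive field can be sketched as follows; the kernel size and layer count are illustrative, and symmetric (non-causal) padding is used here only to keep the output length unchanged, with the causal variant discussed below.

```python
import torch
import torch.nn as nn

kernel_size, num_layers = 3, 4
layers = []
for i in range(num_layers):
    dilation = 2 ** i                     # dilation factor doubles with depth
    layers.append(nn.Conv1d(1, 1, kernel_size, dilation=dilation,
                            padding=dilation * (kernel_size - 1) // 2))
stack = nn.Sequential(*layers)
y = stack(torch.randn(1, 1, 256))         # output length matches the input length

# Receptive field of the stack: 1 + (k - 1) * (2^L - 1) = 31 samples for k = 3, L = 4
receptive_field = 1 + (kernel_size - 1) * sum(2 ** i for i in range(num_layers))
print(receptive_field)
```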
To tackle the issues of gradient vanishing and gradient explosion, we built deeper neural networks through residual connections. This allows information to be transmitted more directly, which in turn facilitates the training of deeper networks.
Before the emergence of TCN, sequence models were predominantly based on RNN models (such as GRU and LSTM). However, RNNs face computational challenges in parallelization because each time step depends on the output of the previous time step. This dependency restricts the efficiency of training and inference on large-scale data. When dealing with millions of power data points, the learning efficiency of RNNs limits their feasibility for edge computing. In contrast, TCNs can effectively manage large amounts of appliance data for load disaggregation tasks through convolutional operations, enabling efficient parallel computation that is ideal for processing large-scale data.
Besides using one-dimensional fully convolutional layers to ensure that the output length matches the input length, TCNs introduce a causal convolution structure in the temporal domain. Causal convolutions ensure that the network computes the current output value using only input data from past time steps. Building a network with causal convolution layers requires either a large convolution kernel or a very deep network to achieve a sufficiently large receptive field. To capture long-term dependencies more effectively and avoid the computational burden associated with large kernels or deep network structures, this paper introduces the bidirectional temporal convolutional network (BiTCN), which combines forward and backward causal convolutions, as illustrated in Figure 3.
Its structure is capable of simultaneously processing the forward and backward information flow of time-series, thereby enhancing the ability to model complex temporal dependencies. Additionally, the results from the forward and backward convolutions are combined, usually through concatenation or weighted summation. This approach allows for the reduction in the number of network layers without altering the receptive field range when processing large amounts of power data, ensuring that the global characteristics of the power data from appliances are preserved.
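A minimal sketch of one bidirectional dilated causal convolution block is given below: a causal branch processes the forward sequence, a second causal branch processes the time-reversed sequence, and the two are combined with a residual connection. Module names, the simple summation, and the sizes are illustrative assumptions and do not reproduce the exact BiTCN configuration used in this paper.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """Dilated causal 1-D convolution: output at time t depends only on inputs up to t."""
    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # left-padding keeps the convolution causal
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, channels, time)
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

class BiTCNBlock(nn.Module):
    """One bidirectional block: forward causal conv plus a causal conv on the reversed sequence."""
    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        self.forward_conv = CausalConv1d(channels, kernel_size, dilation)
        self.backward_conv = CausalConv1d(channels, kernel_size, dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fwd = self.forward_conv(x)
        bwd = self.backward_conv(x.flip(-1)).flip(-1)    # process the time-reversed sequence
        return torch.relu(fwd + bwd) + x                 # combine both directions, residual connection

block = BiTCNBlock(channels=16, dilation=2)
y = block(torch.randn(8, 16, 512))                       # (batch, channels, time) preserved
```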
2.2.4. Overall Network Model Structure
In the task of non-intrusive load disaggregation, the collected power signals are first processed through variational mode decomposition (VMD), decomposing the signals into multiple modes. Modes identified as noise are discarded, while modes highly correlated with the original signal are retained and summed to obtain the filtered signal. This method effectively filters out noise and singular points caused by external interference in the original signal, thus providing high-quality input data for subsequent deep learning models.
The deep learning network features a Nyströmformer-BiTCN structure. The initial Nyströmformer layer serves as a variable-selection layer to screen features from the appliance power signals. The goal is to select the variables that are most relevant to the response variable, reducing the impact of noise and irrelevant information. This process improves the model’s prediction accuracy and generalization capability in load disaggregation. This also reduces the computational burden during the training and prediction processes, enhancing the computational efficiency of load disaggregation. Identifying the most important features for model prediction allows the model to more effectively extract features related to the appliance’s power state, which enhances the model’s interpretability.
Subsequently, the structure combines bidirectional temporal convolution and dilated convolution within BiTCN layers, reducing the limitations of the receptive field to simultaneously incorporate past and future power data. This enables the load disaggregation model to learn global power characteristics more effectively. The point-wise weighted training approach allows the model to capture fine power fluctuations. Additionally, the inclusion of residual networks helps prevent overfitting, improving the model’s generalization, and making it applicable to different types of electrical appliances. Hence, BiTCN is used as the feature-learning layer to extract and train the power characteristics of appliances. Along with the variable-selection layer, the original data processed through a residual network with a low dropout rate is also input into the BiTCN network.
Implementing a residual network with a slightly lower dropout rate incorporates the original data as an additional feature input to the BiTCN network. Although minimal, the dropout rate acts as a regularization measure to help prevent overfitting. When training electrical equipment, residual connections may cause complex interactions between appliance features across layers. A slight dropout can mitigate these interactions, enhancing model robustness. It also prevents the variable-selection layer alone from overshadowing detail features, alleviating the gradient vanishing problem. Additionally, incorporating dropout further reduces inter-layer dependency, aiding better gradient propagation during training.
Finally, the output from the BiTCN is used as the input for the weighted output layer of the Nyströmformer. As the weighted output layer, the Nyströmformer introduces weight parameters, allowing flexible adjustment of the model's emphasis on the different features of various appliances. This helps the model adapt better to diverse appliance disaggregation tasks, thereby improving its performance and generalization across different types of electric appliances. The structure diagram is shown in Figure 4.
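The overall data flow described in this subsection can be summarized in the following hedged PyTorch skeleton; `nystrom_select`, `bitcn`, and `nystrom_output` are placeholders for the Nyströmformer and BiTCN components sketched earlier, and all hyperparameters (channel width, window length, dropout rate) are illustrative assumptions rather than the settings used in this paper.

```python
import torch
import torch.nn as nn

class VMDNystromformerBiTCN(nn.Module):
    """Skeleton of the overall disaggregation network described in Section 2.2.4.

    `nystrom_select` and `nystrom_output` stand for Nystromformer attention layers
    (variable selection / weighted output); `bitcn` stands for a stack of BiTCN blocks.
    All hyperparameters are illustrative placeholders.
    """
    def __init__(self, channels: int = 16, dropout: float = 0.05):
        super().__init__()
        self.embed = nn.Conv1d(1, channels, kernel_size=1)
        self.nystrom_select = nn.Identity()           # placeholder: Nystromformer variable-selection layer
        self.bitcn = nn.Identity()                    # placeholder: stacked BiTCN blocks
        self.residual_dropout = nn.Dropout(dropout)   # low-dropout residual branch for the raw input
        self.nystrom_output = nn.Identity()           # placeholder: Nystromformer weighted output layer
        self.head = nn.Linear(channels, 1)            # per-time-step appliance power estimate

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, 1, time), VMD-filtered power
        h = self.embed(x)
        selected = self.nystrom_select(h)
        features = self.bitcn(selected + self.residual_dropout(h))   # BiTCN over selected features + residual input
        weighted = self.nystrom_output(features)
        return self.head(weighted.transpose(1, 2)).transpose(1, 2)   # (batch, 1, time)

model = VMDNystromformerBiTCN()
p_hat = model(torch.randn(4, 1, 512))
```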