Non-Intrusive Load Disaggregation Based on a Multi-Scale Attention Residual Network

Abstract: Non-intrusive load disaggregation (NILD) is of great significance to the development of smart grids. Current energy disaggregation methods extract features from power sequences, a process that easily loses load features and makes detection difficult, resulting in a low recognition rate for rarely used electrical appliances. To solve this problem, a non-intrusive sequential energy disaggregation method based on a multi-scale attention residual network is proposed. Multi-scale convolutions are used to learn features, and an attention mechanism is used to enhance the learning of load features. Residual learning further improves the performance of the algorithm, avoids network degradation, and improves the precision of load disaggregation. Experimental results on two benchmark datasets show that the proposed algorithm outperforms existing algorithms in both disaggregation accuracy and on/off state judgment, and that the attention mechanism further improves the disaggregation accuracy for infrequently used electrical appliances.


Introduction
Load disaggregation technology is a key technology in smart grids [1]. Traditional load monitoring adopts intrusive methods, which obtain accurate and reliable data with low noise [2] but are difficult for users to accept due to their high implementation costs. Non-intrusive methods can provide detailed information to residents in a timely manner and have the advantages of low cost and easy implementation. With this technology, the power consumption behaviors of users can be analyzed, and users can be guided toward a reasonable consumption of electricity, reducing their power costs. With the continuous development of power demand-side management [3], big data analysis, and other technologies, non-intrusive load disaggregation is attracting increasing attention.
The microgrid is an important manifestation of the smart grid. With the development of clean energy, such as solar and wind energy, and of energy internet technology, the microgrid has emerged. It is a small power system with distributed power sources, which can realize a highly reliable supply of multiple energy sources and improve the quality of the power supply [4]. As non-intrusive load monitoring (NILM) technology matures, the intelligent dispatching of the microgrid can be realized through automation in the future to improve the effective utilization of power resources, ensure the stable and economic operation of the power system, and avoid unnecessary waste of power resources. Therefore, NILM technology is important.
where σ is the activation function. The right-hand branch is the skip connection x of the residual structure, and the final output is obtained after a summation operation:

$$y = \sigma\big(F(x, W_i) + x\big), \tag{2}$$

where $F(x, W_i)$ is the residual mapping function to be learned and $W_i$ is the weight matrix of the hidden layer. When the input and output dimensions differ, a linear transformation $W \cdot x$ can be applied to the input at the skip connection of the residual structure. When the input and output dimensions are identical, a residual unit can be expressed as

$$x_{l+1} = x_l + F(x_l, W_l). \tag{3}$$

According to Equation (3), for a deeper layer $L$, its relationship with layer $l$ can be expressed as

$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i), \tag{4}$$

where $x_L$ and $x_l$ are the residual unit inputs of layers $L$ and $l$. According to Equation (4), the input of the residual unit in layer $L$ is the sum of the input of a shallow residual unit and all the intermediate residual mappings; propagating through this sum costs far less than propagating through a product of factors. For a loss function $\varepsilon$, the chain rule of back-propagation gives

$$\frac{\partial \varepsilon}{\partial x_l} = \frac{\partial \varepsilon}{\partial x_L} \cdot \frac{\partial x_L}{\partial x_l} = \frac{\partial \varepsilon}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i, W_i)\right). \tag{5}$$

This means that the continuous multiplications generated in the network are replaced by additions, so the vanishing gradient problem is largely solved. The residual structure makes it possible to increase the network depth and extract deeper load characteristics from the data, thus improving the accuracy of disaggregation algorithms. Building on the residual network, and in view of the characteristics of load disaggregation, we replaced the convolutional layer with the multi-scale structure and the attention structure, yielding the proposed MSA-Resnet. Figure 1. The residual structure.
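As a minimal illustration of the additive composition in Equation (4), the sketch below stacks a few identity-mapping residual units in NumPy. The residual mapping `F` here is a toy linear layer with ReLU, a stand-in assumption; the actual network uses multi-scale convolutional blocks as `F`.

```python
import numpy as np

def residual_unit(x, w):
    """One identity-mapping residual unit: x_{l+1} = x_l + F(x_l, W_l).

    F is a toy residual mapping (linear layer + ReLU) for illustration only.
    """
    f = np.maximum(0.0, w @ x)  # F(x, W)
    return x + f                # skip connection adds the input back

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
weights = [rng.standard_normal((4, 4)) * 0.1 for _ in range(3)]

# Apply three residual units iteratively while accumulating the residual sum.
x_L = x
residual_sum = np.zeros_like(x)
for w in weights:
    f = np.maximum(0.0, w @ x_L)
    residual_sum += f
    x_L = x_L + f

# Equation (4): the deep input equals the shallow input plus all residual mappings.
assert np.allclose(x_L, x + residual_sum)
```

The assertion checks the property that makes gradients flow additively: no matter how many units are stacked, the deep feature is the shallow feature plus a sum, never a pure product.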

Attention Mechanism
A convolution kernel, the core of a CNN, is generally regarded as aggregating information spatially and channel-wise on local receptive fields. A CNN is composed of a series of convolution layers, nonlinear layers, and down-sampling layers, among others, and captures required features from global receptive fields [30].
To obtain better network performance, a squeeze-and-excitation attention mechanism is used in the network [31]; its structure is shown in Figure 2. One novelty of the algorithm in this paper is the use of this attention mechanism inside the residual structure to further improve the feature extraction ability of the network, especially for the features of appliances that are used infrequently. Unlike previous structures, this attention mechanism improves performance along the feature-channel dimension. The first 3D matrix in Figure 2 consists of C feature maps, each of size H × W.
According to Figure 2, the spliced feature map of the multi-scale module is passed through a global pooling layer to obtain the attention vector $z_c$, which compresses the spatial dimensions and provides a global receptive field. $z_c$ is a high-dimensional vector containing the low-order global feature information of the feature map, and its dimension is 1 × 1 × C:

$$z_c = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} u_c(i, j). \tag{6}$$

Next, two fully connected layers are applied as the excitation step to model correlations between channels and to output the same number of weights as input feature channels:

$$s_c = \mathrm{Sigmoid}\big(w_2 \cdot \mathrm{Relu}(w_1 \cdot z_c)\big), \tag{7}$$

where the dimensions of $w_1$ and $w_2$ are $\frac{C}{s} \times C$ and $C \times \frac{C}{s}$, respectively, and $s$ is the scaling coefficient. The attention vector $s_c$ is obtained after activation by the Relu and Sigmoid functions. It is a high-dimensional vector of high-order global feature information derived from $z_c$; it represents the variation of load feature information along the channel dimension, and its dimension is also 1 × 1 × C. The Sigmoid activation acts like a gating mechanism, generating a different weight for each feature channel of the attention vector $s_c$. Finally, the original three-dimensional matrix $u_c$ is multiplied by these weights to complete the recalibration:

$$\tilde{x}_c = s_c \cdot u_c. \tag{8}$$

In this way, the importance of each load feature is obtained from the feature map [32]: useful load characteristics are enhanced and less useful ones are suppressed, which improves the disaggregation accuracy for rarely used appliances.
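The squeeze-excitation-scale pipeline described above can be sketched in a few lines of NumPy. The weight matrices `w1` and `w2` are randomly initialized stand-ins for learned parameters, and the input is a single nonnegative feature map; both are assumptions made for illustration.

```python
import numpy as np

def se_attention(u, s=4, seed=1):
    """Squeeze-and-excitation over a feature map u of shape (H, W, C).

    w1 has shape (C/s, C) and w2 has shape (C, C/s); here they are random
    stand-ins for the learned fully connected layers.
    """
    H, W, C = u.shape
    rng = np.random.default_rng(seed)
    w1 = rng.standard_normal((C // s, C)) * 0.1
    w2 = rng.standard_normal((C, C // s)) * 0.1

    z = u.mean(axis=(0, 1))                # squeeze: global average pool -> (C,)
    h = np.maximum(0.0, w1 @ z)            # excitation: first FC + Relu
    s_c = 1.0 / (1.0 + np.exp(-(w2 @ h)))  # second FC + Sigmoid -> per-channel gates
    return u * s_c                         # recalibrate: rescale each channel

u = np.abs(np.random.default_rng(2).standard_normal((8, 1, 16)))
out = se_attention(u)
assert out.shape == u.shape
# Sigmoid gates lie in (0, 1), so each channel is scaled, never amplified here.
assert np.all(out <= u + 1e-12)
```

The key design point is that the gates depend on global (pooled) statistics, so a channel carrying a rarely active appliance's signature can still receive a strong weight.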

Multi-Scale Attention Resnet Based NILD
From the point of view of neural networks, the NILM problem can be interpreted as follows. Assuming that Y(t) is the sum of the active power consumption of all appliances in the household, it can be expressed as

$$Y(t) = \sum_{i=1}^{I} X_i(t) + e(t), \tag{9}$$

where $X_i(t)$ represents the power of electrical appliance $i$ at time $t$, $I$ represents the number of appliances, and $e$ represents the model noise. Given pairs of data (X, Y), a model can be trained to represent the relationship between X and Y. X = f(Y) is a non-linear regression problem, and a neural network is an effective method to learn the function f.
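The additive formulation above can be demonstrated with synthetic data: per-appliance traces are summed with noise to form the mains reading, which is what the disaggregation model later observes. The traces below are hypothetical, not taken from any dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
T, I = 100, 3  # number of time steps, number of appliances

# X[i, t]: hypothetical per-appliance active power traces (in watts).
X = np.abs(rng.standard_normal((I, T))) * 50.0
e = rng.normal(0.0, 1.0, T)    # model noise e(t)
Y = X.sum(axis=0) + e          # aggregate mains reading: Y(t) = sum_i X_i(t) + e(t)

# A disaggregation model learns f such that X = f(Y), using pairs (Y, X).
assert Y.shape == (T,)
assert np.allclose(Y - e, X.sum(axis=0))
```

Only `Y` is observable in the non-intrusive setting; the individual `X[i]` are the regression targets the network must recover.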
The overall network structure of the multi-scale attention residual network (MSA-Resnet) proposed in this paper is shown in Figure 3. (a) The multi-scale module is composed of convolution kernels with sizes of 3 × 1, 5 × 1, and 1 × 1 and a pooling layer [33]. By combining (b) the attention block with the residual structure, (c) the multi-scale attention residual block is formed. The structure of (a) is shown in Figure 4. All convolution kernels of the original residual unit are 3 × 1 in size, which prevents the convolution layers from observing the load data at multiple scales and makes it difficult to obtain richer load features. The multi-scale module first applies a 3 × 1 convolution, followed by four branches. The first branch uses a 1 × 1 convolution to increase the load-feature transformation [34], followed by a 3 × 1 convolution, to obtain a feature map (map1). The second branch applies a 1 × 1 convolution followed by a 5 × 1 convolution to obtain map2. The third branch applies 3 × 1 pooling to obtain map3. The fourth branch uses a 1 × 1 convolution to obtain map4. Finally, these feature maps are concatenated into the input vectors of the attention module. Because actual load power exhibits many different gear positions, switch starts and stops, and operating characteristics, the multi-scale method improves the network's ability to extract load characteristics and increases the diversity of scales in the network, thus improving the accuracy of non-intrusive load disaggregation.
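The four-branch layout can be sketched on a single-channel 1-D power sequence. The random kernels below stand in for learned filters, and channel counts are collapsed to one so the branch structure stays visible; both simplifications are assumptions for illustration.

```python
import numpy as np

def conv1d_same(x, k):
    """1-D convolution with 'same' padding (single channel, stride 1)."""
    return np.convolve(x, k, mode="same")

def multi_scale_module(x, rng):
    """Sketch of the four-branch multi-scale module from Figure 4."""
    k3a, k3b, k5, k1a, k1b, k1c = (rng.standard_normal(n) for n in (3, 3, 5, 1, 1, 1))
    x = conv1d_same(x, k3a)                        # shared leading 3x1 convolution
    map1 = conv1d_same(conv1d_same(x, k1a), k3b)   # branch 1: 1x1 then 3x1
    map2 = conv1d_same(conv1d_same(x, k1b), k5)    # branch 2: 1x1 then 5x1
    pad = np.pad(x, 1, mode="edge")
    map3 = np.array([pad[i:i + 3].max()            # branch 3: 3x1 pooling, stride 1
                     for i in range(len(x))])
    map4 = conv1d_same(x, k1c)                     # branch 4: 1x1 convolution
    return np.stack([map1, map2, map3, map4])      # concatenate the feature maps

rng = np.random.default_rng(0)
power = np.abs(rng.standard_normal(200))  # hypothetical normalized power window
features = multi_scale_module(power, rng)
assert features.shape == (4, 200)
```

Each branch views the same window at a different receptive-field width, which is how the module captures both brief switching spikes and longer operating cycles.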
In the network structure of the MSA-Resnet, nine multi-scale attention residual blocks are used: 40 convolution kernels in the first three blocks, 60 in the fourth to sixth blocks, and 80 in the last three. The first convolution layer and the output of each multi-scale attention residual block are activated by the Leaky-Relu activation function. Relu and Leaky-Relu [35] are shown in Figure 5. The Relu ("rectified linear unit") activation function accelerates the convergence of the network:

$$f(x) = \max(0, x). \tag{10}$$

When the input is positive, the derivative is non-zero, so gradient-based learning proceeds. However, when the input is negative, learning with Relu slows down and neurons can even become inactive, so that their weights are no longer updated. Equation (11) gives the Leaky-Relu activation function, where $\lambda \in (0, 1)$ modifies the data distribution and retains values on the negative axis:

$$f(x) = \begin{cases} x, & x > 0 \\ \lambda x, & x \le 0. \end{cases} \tag{11}$$

As a result, information retention is improved without losing load characteristics, and the gradient is guaranteed not to vanish.
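The two activations can be compared directly; the slope value 0.01 below is a common default and an assumption, as the paper does not state its λ.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: f(x) = max(0, x)."""
    return np.maximum(0.0, x)

def leaky_relu(x, lam=0.01):
    """Leaky-Relu: keeps a small slope lam in (0, 1) on the negative axis."""
    return np.where(x > 0, x, lam * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
assert np.allclose(relu(x), [0.0, 0.0, 0.0, 0.5, 2.0])
assert np.allclose(leaky_relu(x), [-0.02, -0.005, 0.0, 0.5, 2.0])
# Unlike Relu, Leaky-Relu has a nonzero gradient for negative inputs,
# so neurons fed negative values can still update their weights.
```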

Data Sources
The experimental data in this paper are from the public datasets UK-DALE [36] and WikiEnergy [37]. The UK-DALE dataset is a publicly accessible dataset from the UK with a sampling frequency of 1/6 Hz. The WikiEnergy dataset was produced by the Pecan Street company in the US and contains data from nearly 600 households. It includes the total power consumed by each household over a period of time and the power consumed by each individual electrical appliance. The sampling frequency of the dataset is 1/60 Hz [38]. Kettles, air conditioners, fridges, microwaves, washing machines, and dishwashers were chosen as non-intrusive load disaggregation targets for the following reasons: (1) The power consumption of these electrical appliances is a large proportion of the total power consumption; they are representative appliances. (2) Appliances that are used infrequently and consume little power are easily disturbed by noise and are not easily disaggregated. (3) The power consumption of these six devices covers disaggregation modes from simple to complex.

Data Preprocessing
Firstly, the NILM-TK toolkit was used to export the power data of the selected home appliances from the WikiEnergy and UK-DALE databases. We then created aggregate power profiles and used them as the experimental data. Secondly, different evaluation indexes often have different dimensions and widely varying values, which may affect the analysis results. To eliminate the influence of these differences, the data must be standardized. Here, min-max normalization was used to scale the data to [0, 1]:

$$x_t^* = \frac{x_t - X_{\min}}{X_{\max} - X_{\min}}, \tag{12}$$

where $x_t$ is the total unstandardized power at time $t$, $X_{\max}$ and $X_{\min}$ are the maximum and minimum of the total power sequence, and $x_t^*$ is the standardized result at time $t$.
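Min-max normalization is a one-liner; the sketch below applies it to a few hypothetical wattage readings.

```python
import numpy as np

def min_max_normalize(x):
    """Min-max normalization: (x_t - X_min) / (X_max - X_min), mapping to [0, 1]."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min)

total_power = np.array([120.0, 450.0, 3000.0, 80.0, 950.0])  # hypothetical readings in W
scaled = min_max_normalize(total_power)
assert scaled.min() == 0.0 and scaled.max() == 1.0
```

Note that `X_min` and `X_max` must come from the training data and be reused at test time, so that the network sees consistently scaled inputs.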

Sliding Window
Deep learning training relies on a large amount of data. The NILM-TK toolkit was used to process the databases: we selected the desired electrical data and exported it to an Excel file. The first 80% of the processed data was taken as training data, with the total power sequence as the input sequence and the power of each individual appliance as the target sequence; the remaining 20% was taken as testing data. To increase the amount of training data and improve the expressiveness of the data, the sequences were processed with a sliding window [39].
As shown in Figure 6a, an overlapping sliding window [40] was used to process the total power sequence and the target sequence in the training data to increase the number of samples. Assuming the power sequence has length M, a window of length N slides over the original data with a step of 1, yielding M − N + 1 samples. By contrast, as shown in Figure 6b, a non-overlapping sliding window was used on the testing data to save time: for a sequence of length H, ⌊H/N⌋ samples are obtained.
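Both windowing schemes can be written as short NumPy helpers; the sample counts in the assertions match the formulas above.

```python
import numpy as np

def overlap_windows(seq, n):
    """Training: stride-1 overlapping windows -> M - N + 1 samples."""
    m = len(seq)
    return np.stack([seq[i:i + n] for i in range(m - n + 1)])

def non_overlap_windows(seq, n):
    """Testing: disjoint windows -> floor(H / N) samples."""
    h = len(seq)
    return np.stack([seq[i:i + n] for i in range(0, h - n + 1, n)])

seq = np.arange(10.0)
assert overlap_windows(seq, 4).shape == (7, 4)       # M - N + 1 = 10 - 4 + 1
assert non_overlap_windows(seq, 4).shape == (2, 4)   # floor(10 / 4)
```

The same windowing must be applied in parallel to the mains sequence and the per-appliance target sequence so that each input window keeps its aligned label.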

Result
This experiment used the Keras neural network framework on a computer with an AMD 2600 processor and a 1060 6 GB graphics card. After the data was standardized, the sliding window length was set to 200, the learning rate of the network was set to 0.001, and the Adam optimizer was selected as the network optimizer.
Kelly's experiments indicate that the DAE algorithm performs well in NILD, and Zhang C's work also shows the good performance of CNNs in sequence-to-sequence and sequence-to-point load disaggregation. From the WikiEnergy data, we selected the air conditioner, fridge, microwave, washing machine, and dishwasher of Household 25. From the UK-DALE dataset, the kettle, fridge, microwave, washing machine, and dishwasher of Household 5 were selected. To verify the effectiveness and stability of the proposed algorithm, four approaches were compared with the MSA-Resnet: the KNN, the DAE, CNN sequence-to-sequence learning (CNN s-s), and CNN sequence-to-point learning (CNN s-p). Firstly, the WikiEnergy dataset was tested. Figure 7 shows the disaggregation results for the five appliances of WikiEnergy Household 25 alongside the actual power consumption of these appliances, comparing the four baseline methods with the proposed MSA-Resnet. Two evaluation indexes were selected to evaluate the performance of the algorithms: the Mean Absolute Error (MAE) and the Signal Aggregate Error (SAE). The MAE measures the average error between the disaggregated power and the actual power of an individual appliance at each moment:

$$\mathrm{MAE} = \frac{1}{T}\sum_{t=1}^{T} |g_t - p_t|, \tag{13}$$

where $g_t$ represents the actual power consumed by an appliance at time $t$, $p_t$ represents the disaggregated power of the appliance at time $t$, and $T$ represents the number of time points. Equation (14) is the expression of the SAE, where $\hat{e}$ and $e$ represent the predicted and real power consumption over a period of time:

$$\mathrm{SAE} = \frac{|\hat{e} - e|}{e}. \tag{14}$$

This index is helpful for daily electricity reports. Figure 7 describes the disaggregation of Household 25 in the WikiEnergy dataset.
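Both metrics are straightforward to implement; the ground-truth and predicted traces below are hypothetical values chosen to make the arithmetic easy to verify by hand.

```python
import numpy as np

def mae(g, p):
    """Mean Absolute Error between actual (g) and disaggregated (p) power."""
    return np.mean(np.abs(np.asarray(g) - np.asarray(p)))

def sae(g, p):
    """Signal Aggregate Error over a period: |e_hat - e| / e."""
    e, e_hat = np.sum(g), np.sum(p)
    return np.abs(e_hat - e) / e

g = np.array([0.0, 100.0, 100.0, 0.0])   # hypothetical ground-truth power (W)
p = np.array([10.0, 90.0, 110.0, 10.0])  # hypothetical disaggregated power (W)
assert np.isclose(mae(g, p), 10.0)
assert np.isclose(sae(g, p), 0.1)
```

Note how the per-timestep errors partly cancel inside the SAE (the sums differ by only 20 W), which is why a method can have a small SAE yet a poor point-by-point fit, as discussed for the KNN below.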
It can be seen that all of the above algorithms basically achieve effective load disaggregation for the air conditioner. In the load disaggregation diagram of the fridge, the DAE and CNN s-s algorithms fluctuate greatly around the appliance's mean power compared with the other algorithms. The KNN algorithm has the worst disaggregation effect on the last three kinds of appliances, as it cannot effectively disaggregate abrupt change points. For these three infrequently used appliances, the disaggregation of the CNN s-s and CNN s-p algorithms is stable compared with the other two baselines, but the CNN s-p method fluctuates greatly in regions of low power consumption. In summary, judging from the power consumption curves, the MSA-Resnet shows the best performance on every appliance. Table 1 compares the MAE and SAE indexes for Household 25's load disaggregation in the WikiEnergy dataset. It can be seen that the MSA-Resnet has obvious advantages in the disaggregation of the air conditioner, fridge, microwave, washing machine, and dishwasher. On the MAE index, the MSA-Resnet performs better than the other four methods. On the SAE, the MSA-Resnet achieves the lowest value for the fridge, washing machine, and dishwasher, achieving accurate disaggregation of energy over a period of time. Combining Figure 7 and Table 1, it can be inferred that the shallow CNN s-s and CNN s-p have difficulty accurately disaggregating the total power for infrequently used appliances. Compared with the KNN and the MSA-Resnet, the disaggregation errors of the CNN s-s and CNN s-p are larger, because shallow CNN structures cannot extract deeper and more effective load characteristics, and their disaggregation effect is not as good as that of the MSA-Resnet.
There are two reasons for this: firstly, the residual structure deepens the network and better enhances the ability to learn from unbalanced samples; secondly, multi-scale convolutions are strong at handling infrequently used appliances. As can be seen in Figure 7, the overall disaggregation effect of the KNN on the washing machine is not good, yet its disaggregation error is small on the two indicators. To explain this phenomenon, certain intervals were selected for comparative analysis. Figure 8 compares each algorithm on each appliance at a finer scale and reflects the ability of the KNN to detect peak values. As Figure 8b,c show, the KNN cannot accurately disaggregate abrupt change points, but it handles regions with power close to 0 well. After load disaggregation, power thresholds were used to distinguish the on/off states of the appliances and compute the corresponding evaluation indexes. The thresholds of the air conditioner, fridge, microwave, washing machine, and dishwasher were set to 100 W, 50 W, 200 W, 20 W, and 100 W, respectively. The recall, precision, accuracy, and F1 values [41] were used to further evaluate the performance of the different algorithms in judging on/off states.
Recall represents the probability of predicting correctly among the instances with a positive label:

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \tag{15}$$

where True Positive (TP) is the number of states predicted as "on" whose ground-truth state is "on", and False Negative (FN) is the number of states predicted as "off" whose ground-truth state is "on". There are two possibilities for a positive instance: it is predicted as positive (TP) or as negative (FN). Precision is the proportion of samples predicted to be "on" that are indeed "on":

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \tag{16}$$

where False Positive (FP) is the number of states predicted as "on" that are actually "off". Accuracy is the ratio of correctly predicted samples to the total dataset:

$$\mathrm{Accuracy} = \frac{TP + TN}{P + N}, \tag{17}$$

where P is the number of positive samples and N is the number of negative samples. F1 can be expressed as

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \tag{18}$$

Table 2 compares the evaluation indexes for judging the "on"/"off" states of Household 25's appliances. It can be seen from Table 2 that, for Accuracy and F1, the MSA-Resnet achieves the best performance on all appliances. The disaggregation diagrams of the microwave, washing machine, and dishwasher in Figure 8 show that, in the actual power consumption of these three appliances, the proportion of "on" states is significantly lower than for the first two appliances. On such unbalanced data with few positive samples, the "on" states of the washing machine cannot be effectively predicted by the CNN s-s and CNN s-p, whereas the MSA-Resnet produces better results. To demonstrate the effectiveness of the Leaky-Relu function, comparative experiments were conducted under the same conditions on WikiEnergy's Household 25 using the Relu function.
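The four on/off metrics follow mechanically once the power traces are thresholded. The traces below are hypothetical, and the 50 W threshold matches the fridge setting used in the experiments.

```python
import numpy as np

def on_off_metrics(g, p, threshold):
    """Recall, precision, accuracy, and F1 for threshold-derived on/off states."""
    actual = np.asarray(g) > threshold    # ground-truth state
    pred = np.asarray(p) > threshold      # state derived from disaggregated power
    tp = np.sum(pred & actual)            # predicted on, actually on
    fp = np.sum(pred & ~actual)           # predicted on, actually off
    fn = np.sum(~pred & actual)           # predicted off, actually on
    tn = np.sum(~pred & ~actual)          # predicted off, actually off
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, accuracy, f1

g = np.array([0.0, 150.0, 160.0, 0.0, 140.0, 0.0])    # hypothetical actual power
p = np.array([10.0, 155.0, 40.0, 10.0, 145.0, 60.0])  # hypothetical disaggregation
r, pr, acc, f1 = on_off_metrics(g, p, threshold=50.0)
assert np.isclose(r, 2 / 3) and np.isclose(pr, 2 / 3)
assert np.isclose(acc, 2 / 3)
```

On unbalanced data, accuracy alone is misleading (predicting "off" everywhere scores well), which is why recall and F1 are also reported for the rarely used appliances.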
According to the experimental results in Table 3, the algorithm using the Leaky-Relu function is better on both the MAE and SAE indicators. For further verification, we selected five appliances of Household 5 in the UK-DALE dataset for additional experiments. Figure 9 shows the disaggregation results. All of the above algorithms achieve effective disaggregation for the kettle, an appliance that is used often. For the fridge, the KNN and the DAE perform worse than the CNN s-s, the CNN s-p, and the MSA-Resnet. For the microwave, the washing machine, and the dishwasher, which are used infrequently and have low power consumption, the MSA-Resnet achieves better disaggregation results than the other deep learning algorithms, mainly because it better detects peaks and state changes. Table 4 shows Household 5's load disaggregation evaluation indexes for the UK-DALE dataset: the MSA-Resnet does better on the MAE and SAE than the other methods. On the MAE, the MSA-Resnet performs better for the kettle, fridge, washing machine, and dishwasher; it also has smaller SAE values for the kettle, fridge, and washing machine. Table 5 shows the judgement results for the "on"/"off" states of Household 5 in the UK-DALE dataset. The thresholds of the kettle, fridge, microwave, washing machine, and dishwasher were set to 100 W, 50 W, 200 W, 20 W, and 100 W, respectively. Table 5 shows that the recalls of the washing machine and dishwasher under the CNN s-s and CNN s-p are low: the number of positive samples is small, and their ability to predict the "on" state is poor. If judging the electrical state is treated as a classification task, appliances with a high utilization rate obtain better classification results. Figure 10 shows load disaggregation comparisons of the five methods over a period of time.
It can be seen from the figure that, compared with the other algorithms, the MSA-Resnet disaggregates the appliances better, whereas the KNN and the DAE have the worst disaggregation abilities. For the infrequently used washing machine and dishwasher, the MSA-Resnet still fits the power curve well because of its network structure: it uses multi-scale convolutions to obtain rich load characteristics and improves network performance through the attention mechanism and the residual structure. To demonstrate the effectiveness of the Leaky-Relu function, a comparative experiment with the Relu function was also conducted on the UK-DALE dataset; Table 6 shows that Leaky-Relu still performs best.

Conclusions
Load disaggregation is an important part of smart grids. Existing non-intrusive load disaggregation methods based on deep learning have several problems: they easily lose features and have difficulty with detection, they do not identify rarely used electrical appliances well, and their networks degrade easily due to vanishing gradients. The disaggregation performance of traditional methods is also poor. To solve these problems, the MSA-Resnet is proposed for NILD. The residual network deepens the network structure, avoids gradient disappearance, and reduces optimization difficulty. Multi-scale convolutions obtain richer load characteristics and avoid feature homogenization. The attention mechanism enhances the network's ability to learn load characteristics and improves its performance. Given its strong performance on the WikiEnergy and UK-DALE datasets, the MSA-Resnet is shown to be an effective approach to non-intrusive load disaggregation. In future work, we will conduct further experiments on public datasets such as REDD and on real household data from the State Grid to verify the generalization performance of the model.