Nonintrusive Residential Electricity Load Decomposition Based on Transfer Learning

Monitoring electricity consumption in residential buildings is an important way to help reduce energy usage. Nonintrusive load monitoring is a technique to separate the total electrical load of a single household into specific appliance loads. This problem is difficult because the energy consumption of each appliance must be extracted using only the total electrical load. Deep transfer learning is expected to help solve this problem. This paper proposes a deep neural network model based on an attention mechanism. This model improves the traditional sequence-to-sequence model with a time-embedding layer and an attention layer so that it can be better applied in nonintrusive load monitoring. In particular, the improved model abandons the recurrent neural network structure and shortens the training time, which makes it more appropriate for pretraining on large datasets. To verify the validity of the model, we selected three open datasets and compared the proposed model with the current leading model. The results show that transfer learning can effectively improve the prediction ability of the model, and the proposed model performs better than the most advanced available model.


Introduction
Nonintrusive load monitoring (NILM) estimates the power of each electric appliance from the total load power of a household. This approach was first proposed by Professor Hart in the 1980s [1]. NILM is significant for power plants, retailers, and end users. With the rapid development of smart grid and home-energy-management systems, massive residential electricity consumption data and controllable electric appliances provide new opportunities to realize high levels of energy efficiency, energy conservation, and emission reductions. Traditional load decomposition methods identify equipment by seeking changes in signals, such as the voltage, current, and power, and use specific recognition algorithms for load decomposition [2][3][4]. Based on the working conditions of the equipment, signal features can be classified into three types: steady state, transient state, and running mode. These features reappear over time, and this periodic trend constitutes the fundamental principle of load identification [5]. Traditional NILM methods build equipment feature databases by manually extracting features, while deep neural networks can learn automatically from data, thereby avoiding the step of manually extracting features [6,7]. As deep learning has been effectively applied in image and voice identification, researchers have started to explore its application in NILM. However, deep learning requires a large amount of training data to learn the underlying data patterns, and acquiring high-quality NILM training samples in the field is very costly. Usually, a separate metering device is needed to obtain the label data for each appliance. Depending on the acquisition cycle, data can be classified as high-frequency data or low-frequency data. According to ref. [8], the sampling period is macro or micro when it is more than or less than 1 s, respectively.
High-frequency data can accurately record the micro changes when the equipment state is transitioning, but the need for an additional acquisition device leads to a high transmission and storage cost.

Deep Transfer Learning and NILM
For the application of deep-learning technology in NILM, ref. [14] was the first to propose three deep neural-network architectures that can be used to solve the NILM problem, i.e., long short-term memory (LSTM), denoising autoencoders (DA), and Rectangles, and to verify them on the public dataset UK-DALE [15]. Recent studies have shown that the deep neural-network model is better than the traditional combinatorial optimization (CO) model and the factorial hidden Markov model (FHMM), especially in terms of generalizability, which is the weakness of traditional models [16,17]. However, simple recurrent neural networks (RNNs), such as the LSTM, cannot learn from equipment with multiple states and long power-changing intervals, as this leads to a vanishing gradient [18,19]. The sequence-to-sequence (seq2seq) model is a general end-to-end training approach that maps the input sequence into the output sequence through the encoder-decoder structure. This model solves the problem in which traditional neural networks cannot map a sequence into a sequence. Ref. [20] was the first to propose the seq2seq model and use it for text translation, and the effects were remarkable. Similarly, ref. [14] was the first to use the seq2seq model to solve the NILM problem and demonstrate that this model can identify a target appliance from total power signals. This result is significant because it illustrates that complex electrical features can be automatically extracted through deep neural networks, which addresses the difficulty and low efficiency of manual feature extraction. However, the seq2seq model has its limitations. When there is a long input sequence, it cannot memorize distant state information, which lowers the prediction accuracy.
Sustainability 2021, 13, 6546
For example, there are several seconds or minutes between the start-up periods of the different working stages of a washing machine, including washing, dewatering, and drying, which can cause the model to lose information. Generally, a sliding window may be used to solve this problem. A window of a certain length is selected, the input signal is divided into sequences using this window, and each sequence is mapped into an output window. However, when the sliding window is used for prediction, each final output point is the mean of multiple predictions, which smooths the predicted values at the edges of a window and thereby lowers the prediction accuracy. Ref. [21] improved the seq2seq model and proposed the seq2point model. This model changes the output from predicting one sequence to predicting a single point and uses a one-dimensional convolutional neural network (CNN) as the encoder for training. The results showed that the CNN is capable of learning the features of target signals, which significantly increases the ability of this model. If Y(t:t + w − 1) represents the window sequence input into the model at moment t, then the seq2point model only predicts the output at moment t + (w/2). As a result, the model concentrates its representational capacity on the midpoint of the window, thereby fully utilizing the information of the adjacent preceding and following areas to improve the prediction accuracy.
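The averaging step that causes the smoothing described above can be illustrated with a short sketch. It assumes a stride of one sample between windows and that every window predicts a full output sequence of length w; the function name and array layout are illustrative and are not taken from ref. [14] or ref. [21].

```python
import numpy as np

def average_overlapping_windows(window_preds, series_len):
    """Merge per-window sequence predictions into one series.

    window_preds has shape (n_windows, w); window k covers samples k..k+w-1
    (stride 1, an assumption of this sketch). Each output sample is the mean
    of all window predictions that cover it.
    """
    n_windows, w = window_preds.shape
    total = np.zeros(series_len)
    count = np.zeros(series_len)
    for k in range(n_windows):
        total[k:k + w] += window_preds[k]
        count[k:k + w] += 1
    return total / count
```

Because a sample in the interior of the series is covered by up to w windows, a sharp switching event predicted at slightly different positions by different windows is blurred by the mean, which is the effect the seq2point model avoids by predicting only the midpoint.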
Deep neural networks are able to effectively extract general features in their bottom layers, and these features can also be used for other problems of the same kind. This transfer learning approach has been widely applied in image identification, natural language processing, and other areas [11,13]. Currently, the research on the application of transfer learning in NILM is limited. Ref. [22] classified NILM transfer learning into two types, appliance transfer learning (ATL) and cross-domain transfer learning (CTL), and used the seq2point model for training and transfer learning on several public datasets. The results showed that for ATL, the pattern learned from a complex appliance (e.g., washing machine) can be transferred to a simple appliance (e.g., kettle). For CTL, the seq2point model is also capable of transfer learning. When the training and test data are located in similar domains, the model can be applied to test data without fine-tuning. However, if the training and test data are located in different domains, fine-tuning is necessary. The experimental results also showed that fine-tuning the fully connected layer alone achieves a good effect. The major structure of the seq2point model is a one-dimensional convolutional layer, which indicates that a convolutional network is appropriate for transfer learning. This conclusion is consistent with the research on transfer learning in image identification. However, power load data are typical time sequences, in which the numerical value at a certain moment has a close relationship with the values at adjacent moments. Therefore, we present in this paper a new model that increases the prediction ability of the seq2seq model by introducing an attention mechanism.

Attention Model
Attention, also called an attention mechanism, is similar to the manner in which the human brain ignores unimportant information and focuses on important information when processing information. The key is to compute and allocate attention scores to concentrate attention on relevant content and thereby increase its contributions. Ref. [23] proposed the initial attention mechanism, which enables the machine translation model to automatically search for the parts of a source sentence that are relevant to predicting a target word. This approach uses variable context vectors to solve the problem in which the seq2seq model loses information and makes inaccurate predictions when there is a long sequence. Ref. [24] proposed two types of attention mechanisms: a global attention mechanism, which focuses on all source words, and a local attention mechanism, which focuses on a portion of source words. The attention score, which is an important concept in attention models, measures the significance of each character in the input sentence to each character in the target sentence. This score can be obtained through a substantial amount of sample training. With the attention mechanism, the seq2seq model shows a large improvement in translating long sentences [24]. In essence, the models proposed in refs. [14,21] are also seq2seq models. In those models, the decoder only utilizes the final state information of input sequences without considering the influence of their different locations on the output. The model proposed herein is a seq2seq model that contains an attention mechanism and is referred to as the attention model. Figure 1 explains how the attention mechanism works.
Depending on the length of the input and output sequences, the seq2seq model has three forms: many-to-many, many-to-one, and one-to-many [20]. The attention mechanism was first implemented in the many-to-many model [25]. The attention score of this model can be calculated as $h_t^{\top} W h_i$, where $h_t$ is the current decoding state of the decoder, and $h_i$ is the hidden state of the encoder. However, there is no $h_t$ in the many-to-one model. If we deem the many-to-one problem as a special form of the many-to-many problem, then $h_t$ can be replaced with the last hidden state $h_{last}$ of the encoder.

$$\mathrm{score}(h_t, h_i) = h_t^{\top} W h_i \qquad (1)$$

As shown in Formula (1), tensor $W$ is a square matrix whose dimension equals the hidden size of the encoder. The parameters in this matrix are obtained through model training. The final output $y_i$ of the model uses the context vector and the final hidden state of the encoder as the inputs for decoding. The context vector (Formula (2)), which is the core of the model, is the sum of the encoder hidden states $h_i$ weighted by the attention weights $w_{ti}$. In particular, the attention weight (Formula (3)) refers to the contribution made by the values of the input sequence at different locations to the prediction of the current output $y_i$, which indicates that the different input locations may be referenced to varying degrees.
Generally, both the encoder and the decoder use an RNN, which relies on the output of the previous moment for computation. Thus, parallel computing cannot be realized, which leads to a very long period of model training. The model proposed herein replaces the RNN with the embedding layer Time2Vector and the attention layer. Time2Vector is a method for processing time sequences proposed by ref. [26]. It can learn time features and conduct parallel computing, thereby accelerating model training. Because ref. [21] demonstrated that predicting only the midpoint of the output window is more accurate than predicting the entire output sequence, the model herein uses the many-to-one form.
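The many-to-one attention computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes that the attention weights are obtained by a softmax over the scores, as is standard for this family of attention mechanisms, and the function name and array shapes are hypothetical.

```python
import numpy as np

def many_to_one_attention(H, W):
    """Many-to-one attention over encoder states.

    H: (T, d) array of encoder hidden states h_1..h_T.
    W: (d, d) trainable square matrix (here simply passed in).
    Returns the context vector (d,) and the attention weights (T,).
    """
    h_last = H[-1]                      # stands in for the decoder state h_t
    scores = (h_last @ W) @ H.T         # score_i = h_last^T W h_i, shape (T,)
    scores = scores - scores.max()      # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()            # softmax normalization (assumed here)
    context = weights @ H               # weighted sum of hidden states, shape (d,)
    return context, weights
```

In a full model, the context vector and the last hidden state would then be passed to the decoder layers; every operation here is a matrix product, so all time steps are processed in parallel rather than sequentially as in an RNN.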

The seq2point model is composed of five one-dimensional convolutional layers and two fully connected layers [21]. Its specific structure is shown in Figure 2. Because it uses a CNN, which can conduct parallel computing, this model has a high training speed. The attention model proposed herein replaces the RNN with the time embedding layer as the encoder. The concatenate layer connects the embedding layer with the input layer and then sends the result to the attention layer, and the two fully connected layers function as a decoder, as shown in Figure 3. Because the model finally predicts a point rather than a sequence, the activation function of the last layer FC-1 is linear, and the other layers use the rectified linear unit (ReLU). This approach has been demonstrated to provide the optimal solution in previous research [21,27,28].
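To make the encoder side of the attention model more concrete, the following sketch shows a time embedding in the spirit of the Time2Vec formulation of ref. [26] (one linear component plus k sinusoidal components), followed by concatenation with the raw input. The parameter shapes, the number of periodic components, and the choice to apply the embedding to the value at each time step are assumptions of this sketch; the layer in the authors' code may be configured differently.

```python
import numpy as np

def time2vec(x, w_lin, b_lin, w_per, b_per):
    """Embed a window x of shape (T,) into (T, 1 + k) features.

    Column 0 is a linear component; the remaining k columns are periodic
    (sine) components with trainable frequencies w_per and phases b_per.
    """
    x = np.asarray(x, dtype=float)[:, None]                   # (T, 1)
    linear = w_lin * x + b_lin                                # (T, 1)
    periodic = np.sin(x * w_per[None, :] + b_per[None, :])    # (T, k)
    return np.concatenate([linear, periodic], axis=-1)

def encode_window(x, w_lin, b_lin, w_per, b_per):
    """Concatenate the raw input with its embedding, giving (T, 2 + k)."""
    x = np.asarray(x, dtype=float)
    emb = time2vec(x, w_lin, b_lin, w_per, b_per)
    return np.concatenate([x[:, None], emb], axis=-1)
```

The output of `encode_window` plays the role of the concatenate layer's output, which would then be fed to the attention layer.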

Approaches to Transfer Learning
Ref. [12] was the first to study the transferability of deep neural networks. The authors divided the ImageNet dataset containing 1000 categories into two parts, A and B. Then, they trained an eight-layer neural network on each of A and B and conducted a fine-tuning experiment on layers 1-7 to explore the transferability of the network. That study used two approaches to transfer learning. The first approach, AnB, fixes the first n layers of network A, randomly initializes the remaining layers, and then trains them to classify B; the other approach, BnB, fixes the first n layers of network B, randomly initializes the remaining layers, and then trains them to classify B. The experimental results of that study show that (1) the first three layers of the neural network all have general features, which are good for transfer; (2) fine-tuning the deep transfer network can greatly improve the effect, even beyond that of the original network (trained on B alone); and (3) fine-tuning can compensate for the difference between datasets. Fine-tuning a deep network is the process of adapting a network pretrained by others to a new task. It is needed because a pretrained model may not be completely appropriate for the current task, as the training data from different tasks may not follow the same distribution. The advantage of fine-tuning is the saving of time because for new tasks, it is unnecessary to train the network from scratch. Generally, pretrained models are based on large datasets, which effectively expands the training data of the new task so that the model has good generalizability. Few studies have focused on determining whether or not NILM transfer learning can also be applied in other types of networks. In this paper, the appropriateness of the attention model for transfer learning is verified, and this model is compared with the results of the seq2point model.
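The idea of fixing the early layers and retraining only the final layer can be illustrated with a toy NumPy model. This is a deliberately simplified sketch under stated assumptions: a single frozen ReLU layer stands in for the pretrained layers, a linear head stands in for the fully connected layer, and plain gradient descent on the MSE replaces the optimizer; none of the names come from the paper or its code.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def head_mse(X, y, W_frozen, w_head):
    """MSE of the model: frozen feature layer followed by a linear head."""
    return float(np.mean((relu(X @ W_frozen) @ w_head - y) ** 2))

def fine_tune_head(X, y, W_frozen, w_head, lr=0.01, steps=200):
    """Retrain only the head on target data; W_frozen is never updated."""
    feats = relu(X @ W_frozen)          # output of the frozen (pretrained) layer
    w = w_head.copy()                   # leave the caller's weights untouched
    for _ in range(steps):
        err = feats @ w - y
        grad = 2.0 * feats.T @ err / len(y)   # gradient of the MSE w.r.t. the head
        w -= lr * grad
    return w
```

Because only the head receives gradient updates, fine-tuning touches far fewer parameters than training from scratch, which is where the time saving comes from.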

Dataset
The REDD dataset records the electricity consumption data from six houses in North America for 9-23 days [29], including high-frequency data and low-frequency data. For the low-frequency data, the electric meter was sampled every 3 s, and appliances were sampled every 1 s. In total, 20 types of home appliances are covered in this dataset. Because houses 4-6 used many unlabeled appliances, only the data from houses 1-3 were selected for the experiment. The UK-DALE dataset contains data from smart electric meters used by five British houses from 2013 to 2015 [15]. The electric meters and appliances were sampled every 6 s and 1 s, respectively. Houses 1 and 2, which yielded a large amount of data, were selected for the experiment in this study. The REFIT dataset was created from the data from 20 residential buildings in Loughborough, Great Britain, from 2013 to 2015. The electric meters and appliances were sampled every 8 s and 1 s, respectively [30]. This dataset is the largest of the three and is appropriate for pretraining deep learning models. We used four target appliances in all our experiments: refrigerators, washing machines, microwave ovens, and dishwashers. We chose these appliances because each is present in at least two houses in REDD, UK-DALE, and REFIT. This means we can train our model on at least one house and test on a different house for each appliance. These four appliances consume a significant proportion of energy and are typical home appliances with distinct behaviors. In particular, refrigerators run in long cyclic periods, microwave ovens have a short operation time and a great variation in power, and washing machines run under multiple states for long periods of time. Low-frequency active power data were selected from the datasets for the experiment.

Data Preprocessing
To train the deep learning model, the data had to be standardized; otherwise, the efficiency of the gradient descent algorithm would be affected. To compare the training effects of the different datasets, we standardized the aggregate data using a preset mean and variance and used a uniform mean and variance to standardize the data from the same appliance in the different datasets, according to Formula (4):

$$x'_t = \frac{x_t - \bar{x}}{\sigma} \qquad (4)$$

where $x_t$ represents the power value of the electric meter or the appliance at moment t, $\bar{x}$ is the preset mean, and $\sigma$ is the standard deviation of the electric meter or the appliance over all moments. Their values are shown in Table 1. The test sets and training sets were processed in the same way.
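The standardization step can be sketched as follows. The function names are illustrative, and the mean and standard deviation must be supplied by the caller; the values of Table 1 are not reproduced here, so any numbers used with this sketch are placeholders.

```python
import numpy as np

def standardize(power, mean, std):
    """Apply (x_t - mean) / std with preset statistics (not computed per dataset)."""
    return (np.asarray(power, dtype=float) - mean) / std

def destandardize(z, mean, std):
    """Invert the standardization to recover power values in watts."""
    return np.asarray(z, dtype=float) * std + mean
```

Using the same preset statistics for the training set, the test set, and all datasets keeps the model inputs on a common scale, and `destandardize` converts predictions back to watts before error metrics are computed.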
The REFIT dataset, the largest one, could not be read into memory in its entirety. The training was therefore processed block by block, and each round of training needed to traverse all the data blocks. Because the three datasets use different sampling periods, we resampled the appliance data from both REDD and UK-DALE to a period of 8 s, the same as that of REFIT, and then aligned the electric meter samples with the timestamps of the appliance data. As a result, all the electric meter and appliance data had a sampling period of 8 s. Ref. [27] demonstrated that using a sliding window to process data results in a better decomposition effect. Specifically, we divided the time sequence data from the electric meters into overlapping windows with a fixed length, with each window serving as an input sequence. The output was the active power of the appliance to be decomposed (target appliance) at the midpoint of this time window. The shapes of the input tensor X and the output tensor Y of the training model are (batch size, window size, 1) and (batch size, 1), respectively. For the window size and training batch size used in the experiment, see Table 2 below.
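The construction of the input and output tensors can be sketched as follows. The sketch assumes a stride of one sample, no padding at the ends of the series, and an integer midpoint at index window_size // 2; the function name is hypothetical, and the first axis holds all windows rather than a single batch.

```python
import numpy as np

def make_windows(aggregate, appliance, window_size):
    """Build seq2point-style training pairs.

    aggregate, appliance: aligned 1-D power series of equal length.
    Returns X of shape (n, window_size, 1) and Y of shape (n, 1), where
    Y[k] is the appliance power at the midpoint of window k.
    """
    aggregate = np.asarray(aggregate, dtype=float)
    appliance = np.asarray(appliance, dtype=float)
    n = len(aggregate) - window_size + 1
    # Row k of idx holds the sample indices k, k+1, ..., k+window_size-1.
    idx = np.arange(n)[:, None] + np.arange(window_size)[None, :]
    X = aggregate[idx][..., None]
    mid = window_size // 2
    Y = appliance[mid:mid + n][:, None]
    return X, Y
```

Slicing these arrays into batches yields tensors with the shapes (batch size, window size, 1) and (batch size, 1) given above.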

Model Training
To compare the effects of different models, the seq2point model mentioned in ref. [21] was reused in the experiment, and the model proposed herein is named the attention model. For their network structures, see Figures 2 and 3. To ensure fairness in the comparison of the results, neither model used a dropout layer. Both models used ReLU as the activation function, the mean square error (MSE) as the loss function, and the Adam algorithm to optimize the learning process. The hardware environment was as follows: Intel Xeon W2235 CPU (base frequency 3.8 GHz), 128 GB DDR4 memory, and NVIDIA RTX3080 video card (10 GB video memory). The software environment was as follows: Ubuntu 18.04 64-bit operating system, Python 3.7, TensorFlow 2.1, and cuDNN 10.0. The source code can be found on GitHub (https://github.com/eyangs/transferNILM, accessed on 2 June 2021).
To verify the effect of transfer learning, we first trained and tested the models on the REDD and UK-DALE datasets. In REDD, houses 2 and 3 served as the training set, and house 1 was used as the test set; in UK-DALE, house 2 was the training set, while house 1 was the test set. The size of the data in REFIT is far larger than the size of the data in REDD and UK-DALE, so REFIT is suitable for model pretraining. The pretrained model was transferred to REDD and UK-DALE for testing. The division of the REFIT data is shown in Table 3. Transfer learning can be conducted without fine-tuning or with fine-tuning. The former directly uses the pretrained model to predict the target data, while the latter retrains the pretrained model on the target dataset prior to predicting. To deal with the overlong training period, we applied an early-stopping strategy with the specific experimental parameters shown in Table 2. Because the attention model proposed herein does not use an RNN, it can use graphics processing units (GPUs) for parallel computing, which substantially increases the training speed. Its training time is almost the same as that of the CNN in the seq2point model.
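The early-stopping strategy mentioned above can be sketched as a patience rule on a monitored loss. The sketch assumes that the validation loss is the monitored quantity and uses a hypothetical `patience` parameter; the actual settings are those of Table 2, which are not reproduced here.

```python
def early_stopping_epochs(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, last_epoch_run) for a sequence of validation losses.

    Training stops once the loss has failed to improve by more than
    min_delta for `patience` consecutive epochs. Epochs are 0-indexed;
    (-1, -1) is returned for an empty sequence.
    """
    best = float("inf")
    best_epoch = -1
    last_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        last_epoch = epoch
        if loss < best - min_delta:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, last_epoch
```

In practice, the weights from `best_epoch` would be restored after stopping, so that the reported model is the one with the lowest monitored loss.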
The NILM model can be assessed using the MSE and the mean absolute error (MAE), which are frequently used for regression problems, or using the accuracy rate, recall rate, F1-Score, etc., after turning it into a classification problem. As in ref. [22], the MAE and the normalized signal aggregate error (SAE) were selected as assessment indicators. The MAE is the average absolute difference between the actual value and the predicted value over all moments, and the SAE represents the relative difference between the actual and predicted total power consumption. In particular, $r$ is the sum of the actual power consumption, and $\hat{r}$ represents the sum of the predicted power consumption. The specific computation method is shown in Formulas (5) and (6).
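The two indicators can be sketched as follows, assuming the definitions commonly used with the seq2point model: the MAE is $\frac{1}{T}\sum_t |\hat{y}_t - y_t|$ and the SAE is $|\hat{r} - r| / r$. The function names and the toy numbers in the usage note are illustrative only.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between actual and predicted power, per time step."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true)))

def sae(y_true, y_pred):
    """Normalized signal aggregate error: |r_hat - r| / r over the whole period."""
    r = float(np.sum(y_true))
    r_hat = float(np.sum(y_pred))
    return abs(r_hat - r) / r
```

The two indicators can disagree. For an actual series [0, 100, 200, 100], the prediction [100, 0, 100, 200] has the correct total energy and therefore an SAE of zero, even though every time step is wrong by 100 W, whereas the prediction [10, 90, 210, 110] has a small MAE but a nonzero SAE.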

Results and Discussion
The experiment included two parts. In the first part, the pretrained model was not fine-tuned. Instead, the target data were tested directly. Table 4 shows the test results of the two models on the REDD dataset. Predicting with the pretrained model performed poorly and led to considerably larger test errors for both models. This result indicates that the pattern learned by the model from REFIT is not applicable to the REDD data. Table 5 provides the test results of the two models on the UK-DALE dataset. Using the pretrained model significantly reduced the test errors of the two models, indicating that the pattern learned by the model from REFIT can be applied to UK-DALE. The experimental results of this part support the conclusions about CTL in ref. [22]: due to differences in manufacturing standards, models cannot be directly transferred between appliances from different countries (REFIT and REDD are from Britain and America, respectively), whereas they can be directly transferred between appliances in the same country (both REFIT and UK-DALE come from Britain). The attention model proposed herein performed better than the seq2point model in decomposing three of the appliances (all except the dishwasher), and the conclusions before and after transfer are consistent. For the dishwasher, although the seq2point model is better than the attention model according to the MAE, the SAE indicates that the attention model is superior to the seq2point model. Thus, both the MAE and the SAE may have defects as indicators for assessing NILM problems. More effective assessment indicators should be identified in subsequent research. In the second part of the experiment, the pretrained model was fine-tuned on the training sets of UK-DALE and REDD prior to predicting the target data. According to previous experience, fine-tuning generally only needs to retrain the fully connected layer at the end of the neural network.
Table 6 shows the effect of the model on the target test set after fine-tuning the fully connected layer. Compared with the results in Tables 4 and 5, the fine-tuned model performed much better on REDD, while it performed much worse on UK-DALE. This result also supports the conclusions drawn in ref. [22]: fine-tuning can improve the prediction effects for cross-domain data but cause overfitting for non-cross-domain data. M1 was only trained on the target dataset, M2 was pretrained on REFIT without fine-tuning, and M3 was both pretrained and fine-tuned. Overall, regardless of whether or not transfer learning was conducted, the SAE of the attention model was smaller than that of the seq2point model. For the four types of electric appliances in the experiment, the attention model proposed herein performed better for the microwave, refrigerator, and washing machine than the seq2point model, yet it performed much worse for the dishwasher. This may be because the dishwasher has a complex operating principle, which makes it hard for the model to extract the features of its state changes. In addition, it may be related to the data window size selected for training. For example, ref. [31] reduced the data window of the seq2point model from 599 to 100, introduced the short-seq2point model, and demonstrated that the new model can better complete the NILM task. In terms of the transfer ability, the attention model after transfer performed better on REDD than the model without transfer learning for the three appliances other than the washing machine, while the seq2point model only performed well for the microwave. The results on UK-DALE show that the model after transfer worked much better than that without transfer. With respect to indicator improvement, the attention model was slightly better than the seq2point model, indicating that the new model proposed herein handles the NILM problem well and is more appropriate for transfer learning.

Conclusions
In this paper, a deep neural network model based on an attention mechanism was proposed and applied to NILM transfer learning to decompose a residential power load. This model replaces the traditional RNN by introducing an embedding layer of a time vector (Time2Vector) and then uses an attention layer to obtain, through training, a different attention score for each point of the input sequence prior to predicting the output. This design utilizes an attention mechanism to improve the traditional seq2seq model, which increases the training speed, enhances the ability of the model to extract signal features, and improves the accuracy of load decomposition. The experimental results on the three public datasets REFIT, UK-DALE, and REDD show that in most cases, the new model has a better effect than the seq2point model, which is the most commonly used model, especially in terms of its transferability in transfer learning. The conclusion that transfer learning can be applied in NILM is consistent with refs. [22,32]. This result has important practical value, as transfer learning offers a solution when there is a lack of high-quality NILM datasets in the field. This study has several limitations. Due to space limitations, only CTL, and not ATL, was discussed in this paper. Load decomposition here only uses low-frequency data and ignores finer-grained signal features, which may lead to the inaccurate prediction of appliance-related events. In the experiment, we showed that the conclusions from the MAE and SAE assessments are inconsistent, which means that assessment indicators that are more applicable to NILM need to be further developed. In addition, smart-home technology is a promising way to improve NILM technology. For example, smart sockets can record and provide a large amount of labelled data to address the shortage of data for NILM model training. These issues should be addressed in future research.