Non-Intrusive Load Identification Method Based on Improved Long Short Term Memory Network

Abstract: Non-intrusive load monitoring (NILM) is an important research direction and development goal on the distribution side of the smart grid, which can significantly improve the timeliness of demand-side response and users' awareness of their loads. With its rapid development, deep learning has become an effective way to optimize NILM. In this paper, we propose a novel load identification method based on a long short term memory (LSTM) network, into which sequence-to-point (seq2point) learning is introduced. Combining LSTM with seq2point learning brings their respective advantages together, so that the proposed model can accurately identify loads from time series data. We verify the reduction in identification error experimentally on three datasets: UK-DALE, REDD, and REFIT. In terms of mean absolute error (MAE), the results on the three datasets improve by 15%, 14%, and 18%, respectively; in terms of normalized signal aggregate error (SAE), they improve by 21%, 24%, and 30%, respectively. Compared with existing models, the proposed model achieves better accuracy and generalization on the three open source datasets.


Introduction
With the advent of the energy internet era, technologies such as clean power generation, efficient power distribution, and convenient power consumption will bring new changes to the smart grid [1]. In particular, monitoring systems that help consumers manage their energy use and expenditure are currently under development [2]. The information they provide is expected to encourage energy-saving behavior, improve fault detection, enable precise energy incentives, and support demand forecasting [3]. Energy management has therefore become a growing field of energy research. With the emergence of large-scale data in particular, energy management needs to be better understood in principle, so applying machine learning to energy management is of particular interest to both the energy and machine learning communities.
Energy management first requires monitoring all appliances in the home, which means recording instantaneous power readings [4]. The easiest way to obtain these readings is to deploy a smart sensor on each appliance to record its power consumption. These data can be used to retrieve all the information about the energy consumption of a residential building. However, this is an intrusive load-monitoring method, which is difficult to popularize: the installation cost of each device is high and maintenance is difficult. As a new technology on the distribution side, non-intrusive load monitoring (NILM) can monitor the use of the electrical equipment inside a house just by decomposing the total power load measured at its service entrance [5]. With the purpose of reducing energy consumption, NILM can help families understand the specific electricity consumption of their appliances. Previous research has shown that NILM could help household users reduce energy consumption by 15% [6]. It also helps energy suppliers optimize their smart grid operations and propose specific electricity prices based on users' consumption habits. Because of these advantages over previous technologies, NILM is an indispensable part of intelligent development on the power distribution side. Nonetheless, NILM (a.k.a. energy disaggregation) can only access the current and voltage at the entrance. Its accuracy, for example in deducing internal load types and energy consumption, is usually affected by the load characteristics and the load identification algorithm [7]. For this reason, this paper addresses the problem by improving the load identification algorithm, which decomposes the user's overall consumption into the energy use of each appliance.
As a single-channel blind source separation problem, energy disaggregation is a difficult prediction problem, and the individual sources may even be unidentifiable. Artificial intelligence and deep learning offer a promising way to solve it [8]. With the help of big data, machine learning methods, especially deep learning algorithms, have developed rapidly. The main NILM methods fall into two classes of machine learning algorithms: unsupervised learning and supervised learning [9].
In unsupervised learning, the model is trained only on aggregated data, without the need for labeled data for pre-training. The main unsupervised method is the additive factorial hidden Markov model (AFHMM) [10]. To improve performance, the original model has been constrained in different ways, mainly by using domain knowledge. Bonfigli proposed representing the appliance model by a bivariate hidden Markov model whose emitted symbols are the joint active-reactive power signals [11], which makes the extracted load characteristics more accurate. Pattem designed a novel "segmented" application of the Viterbi algorithm for sequence decoding with the AFHMM, which optimizes the process of acquiring device features with low-power models [12]. However, the main disadvantage of these methods is the necessity to manually extract features from observed data and to build a large database of specific data features, which makes them difficult to apply in practice.
In supervised learning, the consumption data of each individual device is required as a training set. The many published household datasets make it possible to apply supervised learning to NILM. Researchers have already applied deep learning to energy disaggregation, including denoising auto-encoders [13], adversarial networks [14], convolutional neural networks (CNN) [15], and recurrent neural networks (RNN) [16]. In particular, CNNs have been shown to extract meaningful latent features for a device, and these features are particularly useful for achieving the best NILM performance. Kelly proposed constructing deep learning models from CNN and RNN, which are currently recognized as among the deep learning models with the best overall performance [17]. However, when the model is disturbed by noise, its recognition accuracy is greatly reduced. Shin proposed a subtask gated network that combines the main regression network with an on/off classification subtask network, showing that combining regression and classification subnetworks is a promising improvement [16]. However, RNN is more suitable than CNN for time series data. Although a CNN can be modified or combined with other algorithms, its disadvantages cannot be avoided.
A neural network that learns a nonlinear regression between a sequence of mains readings and a sequence of appliance readings over the same time window is said to perform sequence-to-sequence (seq2seq) learning. Because the prediction at the midpoint of the window is better and has a smaller error, sequence-to-point (seq2point) learning outperforms seq2seq learning. Zhang proposed a CNN model with seq2point learning [18], whose load recognition accuracy was higher than that of seq2seq, at a lower computational cost.
According to the existing literature, these methods are less well suited than long short term memory (LSTM) networks to processing time series data. LSTM was proposed by Sepp Hochreiter and Jürgen Schmidhuber in 1997 [19]. It is an improved algorithm based on RNN and belongs to the family of feedback neural networks [20]. Most importantly, the forget gate and memory unit determine whether long-term or short-term memory is deleted or retained. To a certain extent, this improvement mitigates the vanishing and exploding gradient problems that RNN suffers from because of its structure [21,22]. This is why we selected LSTM to process electricity consumption data in this paper.
In this paper, we pursue high load identification accuracy while ensuring a good fit between the model and the characteristics of the data. We propose a new method based on an LSTM network with seq2point learning to address the poor accuracy of power decomposition values in NILM. The main idea has three aspects. Firstly, the feature extraction ability of CNN helps to screen data features. Secondly, as a variant of RNN, LSTM achieves higher accuracy when processing power parameters in time series. Thirdly, seq2point learning extracts output features more efficiently than seq2seq learning, with a smaller amount of computation. The method can therefore use the active power in the data to identify the load. We apply this new method to the UK-DALE, REDD, and REFIT datasets.
The main contributions of this paper are as follows. We selected LSTM, which is well suited to time series data, to process the power data of appliances. Because LSTM matches the sequential nature of power data, the accuracy of load identification is improved. The experiments in this paper verify two-state (on/off) classification, which provides a foundation for multi-class classification and generalization optimization in the future.
The paper is organized as follows. Section 2 outlines the methods related to our model. Sections 3 and 4 present the experimental setup and results, respectively. Finally, Section 5 draws the conclusions of this paper.

Methods
In NILM, the power data collected from the main distribution line are time series data. An LSTM model can effectively capture the internal correlations in the power load data and thereby identify the power load. The LSTM deep learning algorithm learns rules and characteristics by training on a large number of data samples; the overall framework is shown in Figure 1.
Energies 2021, 14, x FOR PEER REVIEW 3 of 16

LSTM
The structure of the LSTM model is detailed in Figure 2. Multiple activated neurons are used in the hidden layer to achieve long-term storage and transmission of node information. Among them, the memory module, connected by a recursive structure, performs a special memory function and includes one or more memory cells and compound control gate units. The three compound control gate units are the input gate, the output gate, and the forget gate. All neurons are controlled by these three gates to read, write, and reset information. Compared with the RNN model, the LSTM model used here adds a softmax layer after the output layer and selects the log-likelihood function as the loss function for error backpropagation. The purpose of these changes is to avoid the vanishing gradient problem.
In this paper, we used the structure of the LSTM with reference to [19]. To describe the LSTM model, different subscripts mark different structures: I represents the dimension of the input sample, N represents the dimension of the output, f represents the forget gate, l represents the input gate, c represents the memory unit, o represents the output gate, and n represents the output layer. In the rest of this paper, a represents the input before activation and b represents the output after activation.
Figure 2. Topology of LSTM [19].
The forget gate simulates short-term memory: it controls which information the neuron discards, filtering out unnecessary information, and its activation function determines whether data are retained or discarded. For the input $x_t$, the forget gate output is $b_f^t$, where $t$ is the time, $W_{xf}$ is the weight between the input layer and the forget gate, $W_{hf}$ is the weight between the hidden layer and the forget gate, $\alpha_f$ denotes the bias vector, and $\sigma$ is the activation function.
The input gate controls the update of the neuron information through its output $b_l^t$, where $W_{xl}$ is the weight between the input layer and the input gate, $W_{hl}$ is the weight between the memory unit and the input gate, and $\alpha_l$ denotes the bias vector.
The memory unit is responsible for long-term memory. Its function is to save the useful data information that is preserved by the forget gate. The candidate state of the memory unit at time $t$ is $b_c^t$, where $W_{xc}$ is the weight between the input layer and the tanh gate of the memory unit, $W_{hc}$ is the weight between the tanh gate of the memory unit and the hidden layer, and $\alpha_c$ denotes the bias vector. The state information of the memory unit after the update is $s_t$.
The output gate controls the output information of the neuron through $b_o^t$, where $W_{xo}$ is the weight between the input layer and the output gate and $W_{ho}$ is the weight between the memory unit and the output gate. The output information of the memory unit is $b_n^t$, and the output of the output layer is $h_t$, where $W_{nh}$ is the weight between the memory unit and the output layer.
The softmax layer is intended to address the vanishing gradient problem effectively. In this mechanism, the output $y_t$ of the output layer of the network is obtained directly from the softmax function instead of the sigmoid function.
The output of the softmax function is a probability distribution that gives the probability of each category in the classification result.
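The gate computations described in this subsection can be sketched in NumPy. This is only an illustration under stated assumptions: the update rules are the standard LSTM ones, the recurrent input is taken to be the previous memory unit output, and the output gate bias `alpha_o`, the helper `init_params`, and all layer sizes are hypothetical choices that are not given in the text. The dictionary keys mirror the weight names used above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - np.max(a))  # shift by the maximum for numerical stability
    return e / e.sum()

def lstm_step(x_t, b_prev, s_prev, p):
    """One LSTM time step (assumed standard form, not taken from the paper)."""
    b_f = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ b_prev + p["alpha_f"])  # forget gate
    b_l = sigmoid(p["W_xl"] @ x_t + p["W_hl"] @ b_prev + p["alpha_l"])  # input gate
    b_c = np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ b_prev + p["alpha_c"])  # candidate state
    s_t = b_f * s_prev + b_l * b_c                                      # memory unit state
    b_o = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ b_prev + p["alpha_o"])  # output gate
    b_n = b_o * np.tanh(s_t)                                            # memory unit output
    h_t = p["W_nh"] @ b_n                                               # output layer
    y_t = softmax(h_t)                                                  # class probabilities
    return b_n, s_t, y_t

def init_params(n_in, n_hidden, n_out, seed=0):
    """Hypothetical random initialisation, for illustration only."""
    rng = np.random.default_rng(seed)
    p = {}
    for g in "flco":
        p[f"W_x{g}"] = rng.normal(0, 0.1, (n_hidden, n_in))
        p[f"W_h{g}"] = rng.normal(0, 0.1, (n_hidden, n_hidden))
        p[f"alpha_{g}"] = np.zeros(n_hidden)
    p["W_nh"] = rng.normal(0, 0.1, (n_out, n_hidden))
    return p

# Usage: run three normalized power readings through the cell.
params = init_params(n_in=1, n_hidden=8, n_out=2)
b, s = np.zeros(8), np.zeros(8)
for x in [0.1, 0.5, 0.9]:
    b, s, y = lstm_step(np.array([x]), b, s, params)
```

With two output units, `y` can be read as the probabilities of the appliance being off or on at the current time step.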

Seq2point Learning
Recently, seq2seq learning in neural networks has been used for load identification, with the benefit of allowing the output sequence length to differ from the input sequence length. The idea of seq2seq learning is to train a deep network to create a mapping between input sequences (such as the mains power readings in NILM problems) and output sequences (such as the power readings of a single device). Seq2seq learning converts an original sequence into another sequence through the two steps of encoding and decoding, and it has strong processing capabilities for timing information.
However, applying seq2seq has several difficulties to overcome. For example, long sequences are difficult to process: training is constrained by the limits of graphics processing units (GPUs) and is prone to the vanishing gradient problem. Therefore, the seq2point learning method is introduced to train the neural network to predict only the midpoint element of the window, instead of predicting the entire window of device readings. This idea has been widely used to model the distributions of speech and images [23]. The method assumes that the midpoint element of the device window is a nonlinear regression of the mains window. Seq2point learning thus defines a neural network that maps the input sliding window to the midpoint of the corresponding output window. In the experiments of [18], change points (edges) in the mains signal were among the clearest features that the network used to infer the state of the device. In seq2seq learning, each element of the output signal is predicted many times by overlapping windows, and the average of these predictions is used, which smooths the edges. The prediction at the midpoint of the window is better than in other areas, and for this reason we choose the seq2point learning method in this paper. This learning strategy uses a sliding window method. It allows the neural network to focus its representational ability on the midpoint of the window, rather than on the more difficult outputs at the edges, which is expected to produce a more accurate prediction.
Figure 3 shows the architecture of our new model. Firstly, the preprocessed data are sent to our model as input data, and the salient features are selected through two CNN layers. Secondly, the pooled data are input to the LSTM layer for load identification. Finally, the midpoint value of each sequence is output as the final result.
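The sliding-window pairing used by seq2point learning can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function name `seq2point_pairs` is hypothetical, the mains and appliance series are assumed to be already aligned, and zero-padding at both ends is an assumption made so that every time step receives one midpoint target. The default window of 599 points is the length reported in the training setup.

```python
import numpy as np

def seq2point_pairs(mains, appliance, window=599):
    """Pair each mains window with the appliance reading at the window midpoint."""
    mains = np.asarray(mains, dtype=float)
    appliance = np.asarray(appliance, dtype=float)
    if mains.shape != appliance.shape:
        raise ValueError("mains and appliance must be aligned and equally long")
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a unique midpoint")
    half = window // 2
    # Assumption: zero-pad both ends so window i is centred on mains[i].
    padded = np.pad(mains, half)
    X = np.lib.stride_tricks.sliding_window_view(padded, window)
    y = appliance  # target i is the appliance power at the centre of window i
    return X, y

# Usage with a short toy series and a 5-point window.
mains = np.arange(10.0)
appliance = mains * 0.5
X, y = seq2point_pairs(mains, appliance, window=5)
```

Each row of `X` is one network input and the matching entry of `y` is its single regression target, which is what distinguishes seq2point from predicting the whole output window.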

Datasets
Currently, several open source datasets are used for testing load identification. We chose three: the UK-DALE dataset [24], the REDD dataset [25], and the REFIT dataset [26]. These data come from household buildings in different countries and include a variety of information, such as active power, reactive power, current, and voltage. In this experiment, we analyze the active power. In each house, the data of each electrical appliance are recorded by a corresponding sub-sensor, while the total data are collected by the main sensor. The sampling frequency differs between datasets, so the data must be preprocessed before they can be applied to NILM, to ensure that the sampling frequencies of the collected data are aligned. In this paper, we select five electrical appliances for experimentation: dish washer, fridge, kettle, microwave, and washing machine. We use these appliances to verify two-state (on/off) classification, providing a foundation for multi-class classification of appliances in subsequent studies.
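The alignment procedure is not described in detail. One simple possibility, shown here purely as an assumption, is to average consecutive samples of the faster series so that it matches the slower one, for instance bringing a 1 s series onto a 6 s grid. The function name `downsample_mean` is hypothetical.

```python
import numpy as np

def downsample_mean(power, factor):
    """Average every `factor` consecutive samples (assumed alignment strategy)."""
    power = np.asarray(power, dtype=float)
    n = (power.size // factor) * factor  # drop the incomplete trailing block
    return power[:n].reshape(-1, factor).mean(axis=1)

# Usage: twelve 1 s readings become two 6 s readings.
mains_1s = np.arange(12.0)
mains_6s = downsample_mean(mains_1s, 6)
```

Block averaging preserves the energy in each interval; picking every sixth sample instead would be cheaper but could miss short power spikes.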

UK-DALE
The UK-DALE (UK Domestic Appliance-Level Electricity) dataset contains the electricity data of 5 buildings in the UK from 2013 to 2015. Although more than ten kinds of equipment were measured in this dataset, only houses 1 and 2 contain complete data for the five representative appliances we selected. The sampling periods of the main sensor and the sub-sensors are 1 s and 6 s, respectively.

REDD
The REDD (Reference Energy Disaggregation Data) dataset contains measurement data from 6 buildings in the United States. It covers four of the selected types of electrical appliances but lacks kettle data. The data collection period varies from 3 to 19 days. The sampling periods of the main sensor and the sub-sensors are 1 s and 3 s, respectively, and the dataset also includes some high-frequency voltage and current measurements sampled at 15 kHz.

REFIT
The REFIT dataset contains data from 20 buildings in England recorded from 2013 to 2015. The sampling period of all sensors in this dataset is 8 s. Because it is the largest of the three datasets, more information can be extracted from it in the experiment. Specifically, the data of its 20 households cover a variety of electricity usage behaviors and habits, which enriches the diversity of the data.

Data Preprocessing
We need to preprocess the data before experimenting. If more than one-third of a cycle is missing, we define it as a large piece of missing data and remove it manually first. By contrast, if less than one-third of a cycle is missing, we define it as a small piece of missing data, which is retained and regarded as noise. Such processing improves the accuracy of the experiment. We then normalize the data as
$$\tilde{x}_t = \frac{x_t - \bar{x}}{\sigma}$$
where $x_t$ denotes the value at time $t$, $\bar{x}$ is the mean value of the power data from all electrical appliances, and $\sigma$ is the standard deviation of the power data from all electrical appliances. The normalized value $\tilde{x}_t$ is used as the input to the experimental model. Table 1 specifies the mean and standard deviation values.
Although the three datasets contain data on many houses, we selected only some of them for analysis. We divided the data into a training set, a validation set, and a test set. For the UK-DALE dataset, we selected houses 1, 3, 4, and 5 as the training set and house 2 as the test set, as shown in Table 2, because only houses 1 and 2 completely include the data of the five electrical appliances we need. For a similar reason, in the REDD dataset we used houses 2 to 6 for training and house 1 for testing, as shown in Table 3. For these first two datasets, the validation set is part of the training set, whereas for the REFIT dataset the validation set and training set come from different houses. The settings of the REFIT dataset are shown in Table 4.
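The two preprocessing rules above, the one-third threshold for missing data and the mean and standard deviation normalization, can be sketched as below. The function names are hypothetical, and the mean and standard deviation in the usage lines are made-up illustrative numbers, not the values of Table 1. How a gap of exactly one-third of a cycle is treated is not stated, so the sketch assumes it counts as small.

```python
import numpy as np

def is_large_gap(n_missing, cycle_length):
    """True when more than one third of a cycle is missing (remove the segment)."""
    return n_missing > cycle_length / 3

def normalize(power, mean, std):
    """Subtract the mean and divide by the standard deviation."""
    return (np.asarray(power, dtype=float) - mean) / std

def denormalize(z, mean, std):
    """Invert the normalization to recover power in watts."""
    return np.asarray(z, dtype=float) * std + mean

# Usage with illustrative statistics (assumed, not taken from Table 1).
mean, std = 500.0, 250.0
x = np.array([0.0, 500.0, 1000.0])
z = normalize(x, mean, std)
```

Keeping `denormalize` next to `normalize` makes it easy to convert network outputs back to watts before computing MAE and SAE.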

Evaluation Parameters
In order to comprehensively evaluate the overall performance of the model, this paper uses three evaluation parameters: F-score, MAE, and SAE.
(1) F-score
F-score, also known as the balanced F-score, is a common criterion for classification. In NILM, the four possible classification outcomes are as follows:

1. True positive (TP): the number of times the appliance was correctly detected as running;
2. True negative (TN): the number of times the appliance was correctly detected as stopped;
3. False positive (FP): the number of times the appliance was incorrectly detected as running;
4. False negative (FN): the number of times the appliance was incorrectly detected as stopped.

Precision is the proportion of samples predicted as running in which the appliance was actually running:
$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall is the proportion of actually running samples that were correctly predicted as running:
$$\text{Recall} = \frac{TP}{TP + FN}$$
Hence, the F-score is described by precision and recall:
$$F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
(2) Mean absolute error (MAE)
MAE is another important index, which measures the power error at each time point:
$$\mathrm{MAE} = \frac{1}{T} \sum_{t=1}^{T} \left| \hat{x}_t - x_t \right|$$
where $x_t$ denotes the ground truth and $\hat{x}_t$ denotes the prediction for an appliance at time $t$, and $T$ is the number of time points.
(3) Normalized signal aggregate error (SAE)
SAE is the third index, which assesses the total error in energy over a period:
$$\mathrm{SAE} = \frac{\left| \hat{r} - r \right|}{r} \qquad (18)$$
where $r$ and $\hat{r}$ denote the ground truth and the inferred total energy consumption of an appliance, respectively.
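The three evaluation parameters can be computed with a few lines of NumPy. The sketch below uses the usual definitions of precision, recall, MAE, and SAE; the function names and the 5 W on/off threshold in the usage lines are assumptions for illustration, since the threshold used to decide whether an appliance is running is not given here.

```python
import numpy as np

def f_score(pred_on, true_on):
    """Balanced F-score from boolean on/off sequences."""
    pred_on = np.asarray(pred_on, dtype=bool)
    true_on = np.asarray(true_on, dtype=bool)
    tp = np.sum(pred_on & true_on)
    fp = np.sum(pred_on & ~true_on)
    fn = np.sum(~pred_on & true_on)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))

def mae(pred, truth):
    """Mean absolute power error over all time points."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return float(np.mean(np.abs(pred - truth)))

def sae(pred, truth):
    """Relative error of the total energy over the period."""
    r_hat = float(np.sum(pred))
    r = float(np.sum(truth))
    return abs(r_hat - r) / r

# Usage on a toy four-sample trace (values in watts).
truth = np.array([0.0, 100.0, 100.0, 0.0])
pred = np.array([10.0, 90.0, 120.0, 0.0])
threshold = 5.0  # hypothetical on/off threshold
scores = (f_score(pred > threshold, truth > threshold), mae(pred, truth), sae(pred, truth))
```

In this toy trace the single false positive lowers precision to 2/3 while recall stays at 1, so the three scores illustrate how F-score, MAE, and SAE capture different kinds of error.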

Parameter Setup in Training Process
We use the sliding window method to perform sequence-to-point learning, for which we need to set the input length of the active power signal. In theory, any suitable length can be used. Following the analysis in [27] of the influence of different input lengths on the learning method, a window of 599 data points was selected as the input to the neural network model. The ADAM optimizer is used in this paper because it adapts the learning rate during training [28]. To prevent over-fitting, the early stopping method was adopted. It is well known that once training passes a certain point, the validation loss increases instead of decreasing. The early stopping method finds this point and then stops training, so the performance stays the same while the training time is greatly shortened.
The structural characteristics of LSTM can lead to vanishing or exploding gradients, and deep networks are also prone to over-fitting. We therefore apply dropout with reference to [29]. Dropout is an efficient way to train deep neural networks. Its main idea is to ignore half of the feature detectors in each training batch; specifically, the values of half of the hidden layer nodes are set to 0. Interaction between detectors means that some detectors rely on other detectors to function, and dropout reduces this interaction between feature detectors, i.e., hidden layer nodes. Thus over-fitting can be significantly reduced.
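The effect of dropout on a layer of hidden nodes can be illustrated with a short NumPy sketch. It uses the common "inverted" variant, which rescales the surviving nodes during training so that nothing needs to change at test time; this scaling is an assumption of the sketch and is not stated in the text, and the function name `dropout` is hypothetical.

```python
import numpy as np

def dropout(activations, rate=0.5, training=True, rng=None):
    """Zero a random fraction `rate` of nodes and rescale the rest (inverted dropout)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return activations  # dropout is disabled at test time
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(activations.shape) >= rate  # True for nodes that survive
    return activations * keep / (1.0 - rate)

# Usage: drop about half of 1000 hidden nodes during training.
hidden = np.ones(1000)
train_out = dropout(hidden, rate=0.5, rng=np.random.default_rng(0))
test_out = dropout(hidden, rate=0.5, training=False)
```

Because a different random half of the nodes is silenced in every batch, no node can rely on a particular partner being present, which is the reduced interaction between feature detectors described above.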
In this paper, the networks were run on computers with an Intel i7-8750H CPU (8 GB RAM) and an NVIDIA GTX 1050 GPU (4 GB VRAM). The deep learning model was implemented in Python with TensorFlow-GPU. The specific versions are python-3.6.10 and tensorflow-gpu-1.6.0.

Results and Discussion
Because of the early stopping method, the training time was greatly shortened. The maximum number of training epochs in the experiment was set to 50, and the patience of the early stopping method was 10 epochs. Each experiment was repeated 5 times, and the results were averaged. For example, Figure 4 shows the training loss (tra_loss) and validation loss (val_loss) of the washing machine model against the epoch. A model that was set to train for 50 epochs stopped after 26 epochs. Because the validation loss no longer decreased within the patience range, training was stopped early, which shortens the training time and improves efficiency.
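The patience rule can be made concrete with a small pure-Python sketch. The function name is hypothetical and the loss curve is synthetic, constructed only so that the last improvement happens at epoch 16; it is not the curve of Figure 4.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return the 1-based epoch at which training stops."""
    best = float("inf")
    waited = 0  # epochs since the validation loss last improved
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses)  # patience never exhausted: train for all epochs

# Synthetic curve: the loss falls for 16 epochs, then stays above its best value.
val_losses = [1.0 - 0.05 * i for i in range(16)] + [0.5] * 34
stopped = early_stopping_epoch(val_losses, patience=10)
```

With a budget of 50 epochs and a patience of 10, this synthetic run stops at epoch 26, ten epochs after the last improvement, which mirrors the behaviour reported for the washing machine model.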
In this paper, we select the seq2seq model of [18] as the baseline for comparison. On all three datasets, the experimental results of our model are generally better than those of the baseline, for two reasons: (1) the representation ability of the seq2point model is stronger than that of the seq2seq model; (2) LSTM networks process time series data better than CNN networks.
Next, we present the comparison results for each dataset. Firstly, for the UK-DALE dataset, Figure 5 shows that our model reduced MAE and SAE by about 15% and 21%, respectively. Except for the washing machine, the MAE and SAE of the other four appliances are better with seq2point than with seq2seq. The improvement is larger for low-power and periodic appliances than for the other appliances. We hypothesize that the complex operating conditions of the washing machine, whose fluctuating power curve is clearly more complicated, result in its poor identification. Possible solutions are to increase the number of house samples in the dataset and to include data from appliances of more brands.
Secondly, for the REDD dataset, Figure 6 shows that our model reduced MAE by about 14% and SAE by about 24%, so seq2point again outperformed seq2seq. Unlike the UK-DALE dataset, the identification of all five appliances improved to some extent. The identification of the washing machine improved because of the larger number of houses and the larger number of appliance brands. This supports our conjecture.
Thirdly, Figure 7 shows that the REFIT dataset had the most pronounced improvement in MAE and SAE, which reached about 18% and 30% respectively. The reason is that REFIT is the largest dataset in our experiment and covers many types of electrical appliances. The large volume of monitoring data and the variety of appliance models reduce the chance of special cases in the experiment. Sufficient and abundant data make the iteration results more realistic and reliable. Therefore, the identification results on the REFIT dataset are better than those on the other two datasets.
In order to examine the recognition accuracy, we compare the identified values and actual values of five electrical appliances in the REFIT dataset. The decomposition results are illustrated in Figure 8. The identified curve is clearly much closer to the actual data curve for the dishwasher, kettle and microwave in Figure 8a,c,d respectively. Because the fridge itself draws less power, the y-axis (in watts) of Figure 8b covers a smaller range than those of the other panels. The new network also performs better in fridge identification. Since the power of the washing machine is lower at the end of its operation, the new method does not perform better there in Figure 8e. Indeed, the proposed method performed better during the working period of the washing machine. Thus, the deep learning model proposed in this paper is suitable for non-intrusive load identification, with powerful information extraction and application capabilities.
Additionally, it is very important to apply the new model to houses that did not participate in training, so as to test its recognition ability. This is the only way to verify the generalization performance of the model. Generally, for a deep learning model, more training data from a wider range of sources should lead to better generalization. Figure 9 shows that the F-score of the test set is similar to that of the training set. The identification results on the REFIT dataset are better than those on the other two datasets, which agrees with the earlier observation. When the collected data types are similar, the recognition effect is good. Hence, the model should be applied to data of types similar to those used in training.
In summary, the results show that the model in this paper has strong generalization ability and identifies loads well even in completely unfamiliar households.
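The F-score comparison between seen and unseen houses can be sketched as follows. The example assumes that each appliance reading is converted to an on/off state with a power threshold and that the F-score is the harmonic mean of precision and recall on the "on" state. The threshold of 15 W, the function names and the power values are hypothetical and only illustrate the calculation.

```python
def to_states(power, threshold):
    """Convert power readings (watts) to on/off states using a threshold."""
    return [w > threshold for w in power]


def f_score(predicted_on, actual_on):
    """Harmonic mean of precision and recall for the 'on' state."""
    pairs = list(zip(predicted_on, actual_on))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    if tp == 0:
        return 0.0  # no correct 'on' detections
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Synthetic readings for one appliance in an unseen house (assumption).
actual_power = [0, 5, 200, 210, 0, 190]
predicted_power = [0, 30, 180, 0, 0, 200]

score = f_score(to_states(predicted_power, 15), to_states(actual_power, 15))
print(round(score, 3))  # 0.667
```

Computing this score once on houses used in training and once on held-out houses gives the kind of comparison reported in Figure 9: similar values indicate that the model generalizes.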

Conclusions
In this paper, we propose a new load identification method in which LSTM networks are improved by seq2point learning. Seq2point learning proved better than seq2seq learning in NILM in the experimental analysis of five kinds of appliances in the UK-DALE, REDD and REFIT datasets. The three evaluation metrics, MAE, SAE and F-score, indicate that our model has the best overall performance among the compared models. Our method reduces errors caused by insufficient data utilization. According to the experimental results, MAE and SAE are reduced by about 15% and 21% respectively on the UK-DALE dataset, by about 14% and 24% on the REDD dataset, and by about 18% and 30% on the REFIT dataset. Across these datasets, the identification of the five electrical appliances improved overall, with the washing machine in the UK-DALE dataset as the exception. At the same time, the good generalization ability of the model has also been confirmed. The proposed model, which combines LSTM and seq2point, performs better in load recognition than CNNs applied with seq2seq or seq2point. This result demonstrates the advantage of LSTM in processing time series data. However, in terms of training time and computational burden, LSTM is inferior to RNN. The training time of each loop is about 150 s for the washing machine in the REDD dataset, which is longer than with the CNN model. Although training is slower, timeliness can still be guaranteed.
This paper addresses the limitations of CNNs in processing time series data. Experiments show that the method combining LSTM and seq2point is more effective in identifying traveling load. The method has a certain generalization ability for different houses in the same dataset. In this paper, the electrical appliances in the experiments are all handled by binary classification, which lays the foundation for multi-class classification of electrical appliances in our further work. Although training speed is lost, the time gap is acceptable in practical experiments. We will focus on improving the generalization ability for identifying data from different datasets, so as to make the model universal. We will cross-train on data from other datasets to verify the recognition capability of the new method. Additionally, we will consider multiple categories of electrical appliances and the computing burden in future work. For future data acquired by our own hardware, we should consider the issue of data granularity to make the entire NILM system more practical.
We will further improve and optimize the NILM method to reduce the error and improve the accuracy. In practical applications the relatively long training time of LSTM is acceptable, but reducing it remains a focus of our future work. Besides optimizing the training process, we consider improving the generalization ability equally important, so that the model can identify other data types and even other appliances. A possible direction is to modify the model structure in order to make the non-intrusive method universal and efficient.