A Deep Recurrent Neural Network for Non-Intrusive Load Monitoring Based on Multi-Feature Input Space and Post-Processing

Abstract: Non-intrusive load monitoring (NILM) is a process of estimating the operational states and power consumption of individual appliances, which, if implemented in real-time, can provide actionable feedback in terms of energy usage and personalized recommendations to consumers. Intelligent disaggregation algorithms such as deep neural networks can fulfill this objective if they possess high estimation accuracy and low generalization error. In order to achieve these two goals, this paper presents a disaggregation algorithm based on a deep recurrent neural network using a multi-feature input space and post-processing. First, the mutual information method was used to select the electrical parameters that had the most influence on the power consumption of each target appliance. Second, a multi-feature input space (MFS) based on the selected steady-state parameters was used to train a 4-layered bidirectional long short-term memory (LSTM) model for each target appliance. Finally, a post-processing technique was used at the disaggregation stage to eliminate irrelevant predicted sequences, enhancing the classification and estimation accuracy of the algorithm. A comprehensive evaluation was conducted on the 1-Hz sampled UK-DALE and ECO datasets in a noised scenario with seen and unseen test cases. The performance evaluation showed that the MFS-LSTM algorithm is computationally efficient and scalable, possesses better estimation accuracy in a noised scenario, and generalizes better to unseen loads compared to benchmark algorithms. The presented results prove that the proposed algorithm fulfills practical application requirements and can be deployed in real-time.


Introduction
Energy conservation in residential and commercial buildings through smart electrification has been a hot topic for researchers in recent years [1,2]. Because of the large deployments of smart meters, non-intrusive load monitoring (NILM) has become a very valuable tool to achieve this objective. NILM, or simply an energy disaggregation system, estimates the power consumption of individual household appliances or other electrical apparatus from an aggregated power signal, which is acquired through single-point sensing from a smart meter, using a supervised or unsupervised technique [3,4]. A practical NILM system can provide real-time actionable feedback to consumers that gives them an idea about an individual appliance's operation state, its power consumption, and cumulative energy usage. Studies have shown that appliance-specific feedback urges the consumer to use energy wisely.

In this context, existing DNN-based disaggregation algorithms have shown better performance in terms of scalability and learning feature-rich appliance signatures. However, the practical feasibility of these approaches is still an open problem [11,37]. From an algorithm point of view, high power estimation accuracy and generalization are the two most essential abilities that an algorithm should possess in order to be feasible for practical application. DNN models can be made generalized and highly accurate if they are trained on a large quantity of data and/or by performing hyperparameter optimization [31,38]. However, training machine learning or deep learning models on a huge amount of data with many features does not guarantee the best performance, due to misleading and irrelevant features [39]. Conversely, low time complexity and better performance can be achieved with limited data if it comprises the most effective features [40,41]. Previous DNN-based works used either steady-state active power, reactive power, or both as input features.
Similarly, [42] showed that apart from active power, other electrical features (line current, line voltage, neutral current, and load angle) significantly improve the event classification accuracy for non-linear appliances. This implies that many electrical features, when combined to make a feature space, can substitute for the large amount of training data that DNN models otherwise require. Recent multi-feature input based NILM approaches [24,43,44] selected the steady-state features using experiments and previous knowledge. Although those approaches reported an overall improvement in accuracy, the disaggregation performance on individual loads deteriorated in some cases because of the ineffectiveness of the selected feature(s) on individual appliances. Therefore, there is a need to determine a comprehensive set of features that can aid in disaggregating all types of appliances with high estimation accuracy and low generalization error. For that purpose, the influence of steady-state electrical features on the power consumption of individual appliances should be analyzed to build an intuition about the relevant and most effective features.
At the disaggregation stage, deep neural networks tend to predict irrelevant activations, which do not belong to the target appliance's activations. Kong et al. [45] tackled this problem through post-processing, which included training a separate deep CNN model to classify predicted appliance activations. Their classification model verified whether a predicted sequence belonged to a target appliance activation or not. Following a similar problem domain, this paper also proposes a training-less yet effective post-processing technique (as a part of the disaggregation algorithm), which eliminates irrelevant activations by comparing the lengths of actual and predicted appliance activations.
In this paper, we explicitly focus on determining relevant and effective electrical features that can aid in achieving high estimation accuracy for type-1 and type-2 appliances and generalizing to unseen data. We present a multi-feature space LSTM (MFS-LSTM) algorithm that forms multi-feature input data using the mutual information method and trains deep LSTM models for individual appliances. The mutual information method measures the influence of steady-state electrical measurements on the active power consumption of individual appliances. From that information, the relevant and most influential electrical features are selected to form the multi-feature input data. This paper also proposes an effective post-processing technique that eliminates irrelevant appliance activations during the disaggregation stage and helps to keep the predicted energy close to the ground-truth energy. In addition, as an effort towards a deployable deep learning-based NILM solution, we also design a three-stage practical NILM framework that intends to use pre-trained models (trained with the MFS-LSTM algorithm) to perform online disaggregation.
The rest of the paper is organized as follows. Section 2 introduces the MFS-LSTM algorithm in terms of multi-feature input space and post-processing. The design of a deep learning-based practical NILM framework is also discussed in Section 2. Section 3 presents a case study by providing details of the chosen dataset, model training, testing scenarios, and evaluation metrics. Section 4 presents the results. Section 5 concludes the work presented in this paper.

Proposed Energy Disaggregation Approach (MFS-LSTM Algorithm)
To achieve high estimation accuracy and low generalization error with a limited amount of data, this paper proposes a three-stage disaggregation algorithm based on a deep LSTM network. At the data pre-processing stage (first stage), multi-feature input data based on low-frequency electrical measurements is prepared, aiming to extract more useful information from the limited training data. To prepare the multi-feature input data, the mutual information principle was first used to select the relevant and most effective features. A set of five features was then used to make the multi-feature input data for the deep LSTM network.
At the training stage (second stage), the multi-feature input data were used to train four-layered bidirectional LSTM models for each target appliance. Hyperparameter optimization was performed to tune the parameters that lead to the lowest training error and lowest convergence time for each deep LSTM model. At the third stage (disaggregation stage), a post-processing technique was employed to eliminate irrelevant appliance activations to improve disaggregation performance. Figure 1 shows the detailed architecture of our proposed energy disaggregation algorithm; the grey shaded regions point out the three stages of the algorithm.

Steady-State Signatures as Multi-Feature Input Subspace
Input space construction is the starting point of any machine learning/deep learning modeling. In this paper, a multi-feature input space was used to extract more information from the limited amount of data to improve the accuracy of the proposed deep recurrent neural network (RNN)-based LSTM model.
In the field of NILM, variables including active power (P), apparent power (S), reactive power (Q), voltage (V_rms), current (I_rms), and power factor (cos θ) are available through measurement instruments, and they are constrained by the following relationships:

S = V_rms × I_rms,  P = S cos θ,  Q = √(S² − P²)  (1)

Equation (1) shows that all the variables provide some information that indicates an appliance's state of operation and amount of power consumption. This implies that many electrical features, when combined to make a feature space, have much to offer for energy disaggregation.
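As a quick sanity check, the power-triangle relationships in (1) can be verified numerically. This is a minimal sketch with illustrative readings, not measurements from the datasets:

```python
import math

# Hypothetical steady-state readings for a resistive-inductive load
v_rms = 230.0  # line voltage (V)
i_rms = 2.5    # line current (A)
pf = 0.8       # power factor, cos(theta)

s = v_rms * i_rms           # apparent power S = Vrms * Irms (VA)
p = s * pf                  # active power P = S * cos(theta) (W)
q = math.sqrt(s**2 - p**2)  # reactive power Q = sqrt(S^2 - P^2) (var)

print(round(s, 1), round(p, 1), round(q, 1))  # -> 575.0 460.0 345.0
```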
It would also be meaningful to analyze the influence of each electrical feature on the active power consumption of different type-1 and type-2 appliances, so that insight into relevant and irrelevant features can be gained. For this purpose, the mutual information method is used in this paper. Mutual information measures the information shared between two variables, i.e., how much knowing the value of one variable reduces the uncertainty about the other [46].
Previously in NILM research, the mutual information method has been used for selecting delay parameters for different datasets [47] and for formulating utility-privacy trade-offs [48]. We used the mutual information method to select the relevant and most effective electrical features, which form the feature space for our deep RNN model. First, the mutual information value between each electrical feature and the power consumption of each target appliance was calculated using the formula stated in (2):

I(X; Y) = Σ_x Σ_y p(x, y) log [ p(x, y) / (p(x) p(y)) ]  (2)
where p(x) and p(y) are the marginal probability density functions of x and y, and p(x, y) denotes the joint probability density function of the two variables. Second, the results from (2) were combined in tabular form, where rows represented the power consumption of target appliances and columns represented each electrical feature considered for the analysis. Mutual information values were categorized into three ranges of strong, moderate, and weak influence. Features that had a weak influence on the power consumption of every target appliance were discarded, and those with strong or moderate influence were selected to form the multi-feature input space. This is unlike previous multi-feature NILM algorithms, which drew conclusions about relevant features based on the final results. Details of the mutual information analysis are provided in Section 3.1.
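The selection procedure above can be sketched as follows. This is a minimal illustration using a histogram-based estimate of (2) on synthetic data; the feature values and noise levels are assumptions for demonstration, not the paper's measurements:

```python
import numpy as np

def mutual_information(x, y, bins=32):
    """Histogram estimate of I(X;Y) = sum_xy p(x,y) * log(p(x,y) / (p(x) p(y)))."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)  # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)  # marginal p(y)
    nz = pxy > 0                         # avoid log(0) on empty cells
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

# Synthetic stand-ins for two steady-state features (illustrative only)
rng = np.random.default_rng(0)
n = 5000
current = rng.uniform(0.0, 10.0, n)
voltage = 230.0 + rng.normal(0.0, 2.0, n)           # nearly constant supply voltage
power = 230.0 * current + rng.normal(0.0, 50.0, n)  # appliance power tracks current

scores = {name: mutual_information(v, power)
          for name, v in [("current", current), ("voltage", voltage)]}
top = max(scores.values())
normalized = {name: s / top for name, s in scores.items()}  # normalize to [0, 1]
print(normalized)  # current scores high; voltage scores near zero (weak, discarded)
```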

LSTM-based Deep Recurrent Neural Network
Deep recurrent neural networks (RNNs) are a variation of feed-forward neural networks that are used to process sequential data. Among the family of RNNs, the bidirectional LSTM network uses both previous and future information to predict the current value; therefore, it is a natural fit for the NILM problem. In addition, LSTMs have longer memory and are able to deal with the vanishing gradient problem [27,28]. Figure 2 shows the LSTM architecture in a single RNN unit.

The decision on which information should be stored and which should be discarded is made by the forget layer (forget gate). The output of the forget gate is calculated using the weight and bias values of the current input sample x^(t) and the information from the previous time step a^(t−1). This is represented by the following equation:

Γ_f = σ(W_f [a^(t−1), x^(t)] + b_f)  (3)

where Γ_f is the output of the forget layer, σ is the sigmoid function, W_f is the weight for the forget layer, and b_f is the bias value for the forget layer. Conceptually, the forget gate decides between the two extremes of 0 (totally discard the information) and 1 (totally store the information); applying the sigmoid activation gives a forget-state value between 0 and 1.

After discarding some information from the input sequence, a decision is made on what new information is to be stored at the current time step. This step is completed in two stages through the update gate and the Tanh layer (Tanh gate). The update gate works in the same manner as the forget gate, and it decides which values will be updated using the following relationship:

Γ_u = σ(W_u [a^(t−1), x^(t)] + b_u)  (4)

The Tanh layer creates a vector of new candidate values that could be added to the state. The 'Tanh' function is used in this step instead of the sigmoid function, as shown in (5):

ĉ^(t) = tanh(W_c [a^(t−1), x^(t)] + b_c)  (5)

where ĉ^(t) is the output of the Tanh layer, W_c is the weight for the Tanh layer, and b_c is the bias value for the Tanh layer. The memory cell is updated from c^(t−1) to c^(t) in the third step. The new cell state c^(t) is calculated using the information from the update gate (Γ_u), the forget gate (Γ_f), and the previous cell state (c^(t−1)):

c^(t) = Γ_u ⊙ ĉ^(t) + Γ_f ⊙ c^(t−1)  (6)

where ⊙ denotes element-wise multiplication. The last step is to update the activation a^(t), which is obtained by multiplying the output gate

Γ_o = σ(W_o [a^(t−1), x^(t)] + b_o)  (7)

with the Tanh of the current cell state:

a^(t) = Γ_o ⊙ tanh(c^(t))  (8)
Equations (3) to (8) are used to update the cell state of the LSTM in a single hidden unit. A deep LSTM architecture is a variation of LSTM in which multiple LSTM hidden layers are stacked one after another to form a deep recurrent neural network.
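The forget, update, and output gate equations above can be expressed compactly as a single cell step. The sketch below is a plain NumPy illustration with hypothetical dimensions, not the Keras implementation used in the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, a_prev, c_prev, params):
    """One LSTM cell update following the gate equations in Section 2.2.
    params holds weights W_* of shape (hidden, hidden + input) and biases b_*."""
    z = np.concatenate([a_prev, x_t])                    # [a^(t-1), x^(t)]
    gamma_f = sigmoid(params["Wf"] @ z + params["bf"])   # forget gate
    gamma_u = sigmoid(params["Wu"] @ z + params["bu"])   # update gate
    c_cand = np.tanh(params["Wc"] @ z + params["bc"])    # candidate values
    c_t = gamma_u * c_cand + gamma_f * c_prev            # new cell state
    gamma_o = sigmoid(params["Wo"] @ z + params["bo"])   # output gate
    a_t = gamma_o * np.tanh(c_t)                         # new activation
    return a_t, c_t

# Smoke test with hypothetical sizes (hidden = 4, input = 5 features)
rng = np.random.default_rng(0)
h, d = 4, 5
params = {k: rng.normal(0, 0.1, (h, h + d)) for k in ("Wf", "Wu", "Wc", "Wo")}
params.update({b: np.zeros(h) for b in ("bf", "bu", "bc", "bo")})
a, c = np.zeros(h), np.zeros(h)
a, c = lstm_step(rng.normal(size=d), a, c, params)
print(a.shape, c.shape)
```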

Post-Processing
Most of the type-1 and type-2 appliances have activation durations lasting from a few minutes (in the case of a microwave) to hours (dishwashers, washing machines), depending upon the actions they perform for a task. However, the predicted energy of target appliances contains not only ground-truth activations but some irrelevant activations as well, whose lengths (activation durations) are less than the ground-truth activations, as shown in Figure 3. These irrelevant activations lead to high false-positive cases that compromise the performance of the DNN algorithm.

Based on a visual analysis of irregular activations, an effective and robust post-processing technique is adopted in this paper. Unlike in [45], our proposed technique does not require training a separate DNN model. Instead, it eliminates irrelevant activations during the disaggregation stage by comparing the lengths of ground-truth and predicted appliance activations of both type-1 and type-2 appliances.
Our proposed post-processing algorithm is mainly composed of five steps. Given the ground-truth energy (E_g) and the predicted energy profile (E_p), the first step is to calculate the target appliance ground-truth activations A_g = {a_1g, a_2g, ..., a_ng} and predicted appliance activations A_p = {a_1p, a_2p, ..., a_np}. For that purpose, we use a customized get_activations() function from the NILMTK toolkit [49]. Here, ground-truth energy (E_g) refers to the actual sub-metered power consumption of an appliance containing both ON and OFF events in a given period, and predicted energy (E_p) refers to the power consumption of an appliance as estimated by the algorithm.
At the second step, our post-processing algorithm calculates the length of each ground-truth activation of a target appliance and creates a separate list of these activation lengths. The third step determines the minimum length from the list of ground-truth activation lengths. At the fourth step, the algorithm compares this minimum length with the length of every predicted appliance activation. At the fifth step, the algorithm eliminates all predicted appliance activations whose length is less than the minimum length of the ground-truth activations. In this way, we ensure that the total predicted energy stays close to the ground-truth (sub-metered) energy of a target appliance. Details of the post-processing algorithm are provided in Table 1.

Table 1. Post-processing algorithm.
1: Inputs: target appliance ground-truth activations set A_g and predicted activations set A_p
2: Zero-initialize the updated predicted activation profile (Â_p) and the updated predicted energy profile (Ê_p)
3: Calculate the lengths of each activation from A_g and A_p: l_g = len(A_g) and l_p = len(A_p)
4: Determine the minimum length from the list of actual activation lengths: min{l_g}
5: Set pointer j = 0
6: For j in (0, n):
7:     If len(a_jp) < min{l_g}:
8:         a_jp = 0
9:         Update Â_p and Ê_p
10:    End If
11: End For
12: Return Ê_p
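The five steps above can be sketched in code. The snippet below is a simplified stand-in: get_activations() here is a toy re-implementation (NILMTK's real function has a different interface), and the ON threshold is an assumed value:

```python
import numpy as np

def get_activations(power, threshold=10.0):
    """Split a power series into contiguous ON segments (start, end) —
    a simplified stand-in for NILMTK's get_activations()."""
    on = power > threshold
    edges = np.flatnonzero(np.diff(on.astype(int)))
    bounds = np.concatenate([[0], edges + 1, [len(power)]])
    return [(s, e) for s, e in zip(bounds[:-1], bounds[1:]) if on[s]]

def post_process(pred, gt_activations, threshold=10.0):
    """Zero out predicted activations shorter than the shortest ground-truth one."""
    min_len = min(e - s for s, e in gt_activations)
    cleaned = pred.copy()
    for s, e in get_activations(pred, threshold):
        if e - s < min_len:
            cleaned[s:e] = 0.0
    return cleaned

# Toy example: one genuine 40-sample activation plus a 5-sample spurious blip
gt = np.zeros(100); gt[20:60] = 1000.0
pred = np.zeros(100); pred[20:60] = 950.0; pred[80:85] = 300.0
cleaned = post_process(pred, get_activations(gt))
print(cleaned[80:85].sum(), cleaned[20:60].sum())  # blip removed, activation kept
```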

Real-time Deployable NILM Framework
Existing deep learning-based NILM algorithms have used low-frequency steady-state measurements to disaggregate household appliances. However, those algorithms do not comply with practical application requirements such as generalization, a high number of disaggregated appliances, and high power estimation accuracy in a noised scenario. Therefore, those algorithms are not feasible for real-time deployment. Since this paper proposes a multi-feature input space and post-processing based deep learning algorithm to achieve high estimation accuracy in a noised scenario and low generalization error on unseen data, we attempted to design a real-time deployable NILM framework that incorporates our proposed MFS-LSTM algorithm.
Our proposed real-time deployable NILM framework intends to use pre-trained deep learning models (trained by the MFS-LSTM algorithm) to perform online disaggregation. Figure 4 shows the design of our three-stage deep learning-based practical NILM framework. At the first stage, deep LSTM models for individual appliances will be trained using the MFS-LSTM algorithm and integrated into a cloud-based server. This stage is called the data pre-processing and training stage because it prepares multi-feature input data and trains deep LSTM models according to the MFS-LSTM algorithm shown in Figure 1. In the second stage, the NILM service provider will collect the customer's aggregate measurements (in the form of active power, apparent power, reactive power, current, and power factor) using a Consumer Access Device (CAD) [50] and upload them to the cloud server, where online disaggregation will be performed using the pre-trained deep LSTM models. The third stage refers to the NILM analysis, in which the post-processed disaggregation results along with an energy consumption analysis will be downloaded to the customer.


Datasets and Pre-Processing
Two publicly available datasets, the UK Domestic Appliance-Level Electricity (UK-DALE) dataset [50] and the Electricity Consumption and Occupancy (ECO) dataset [51], were used for training and testing of our proposed algorithm. In UK-DALE, the aggregated mains data comprises active power, apparent power, and voltage measurements, whereas the sub-metered data contains only active power measurements sampled at 1/6 Hz, which we up-sampled to 1 Hz. The ECO dataset contains five aggregate electrical measurements sampled at 1 Hz, whereas its sub-metered data contains only active power measurements, also sampled at 1 Hz.
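The up-sampling of UK-DALE's 1/6-Hz sub-metered data to 1 Hz can be done, for example, with pandas resampling. This is a minimal sketch; the forward-fill choice is our assumption, as the paper does not state its interpolation method:

```python
import pandas as pd

# Hypothetical sub-metered series sampled every 6 s (1/6 Hz), as in UK-DALE
idx = pd.date_range("2020-01-01", periods=5, freq="6s")
submeter = pd.Series([0.0, 0.0, 1200.0, 1180.0, 0.0], index=idx)

# Up-sample to 1 Hz by forward-filling the last known reading
upsampled = submeter.resample("1s").ffill()
print(len(upsampled))  # 4 intervals * 6 s + 1 = 25 samples
```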
Based on the methodology given in Section 2.1, we considered six steady-state electrical measurements (active power, apparent power, reactive power, line voltage, line current, and power factor) for the mutual information analysis, and the results are shown in Figure 5. As the mutual information score ranges over [0, +∞), we normalized it to lie in the range [0, 1]. Mutual information scores were categorized into three ranges: scores above 0.2 were considered high; scores between 0.1 and 0.2 were considered moderate; and scores below 0.1 were considered weak.

There are a couple of interpretations of these scores. The first noticeable factor is that reactive power, current, and power factor have higher scores compared to the other electrical features. This implies that these three features have a high influence on the power consumption of all appliances except the kettle, microwave, and rice cooker. Similarly, active power and apparent power have comparatively less influence on all appliances, with scores ranging between 0.103 and 0.318. Voltage is the only feature that had the weakest influence on all appliances. Therefore, we considered voltage an irrelevant feature and selected only five electrical features to form the multi-feature input space for training the deep LSTM network.

Another useful insight is that all electrical features (except voltage) show influence on all target appliances, which indicates that if these features are combined to form a multi-feature input space, then better disaggregation accuracy can be expected for all disaggregated appliances compared to input data based on only one or two features.


Training and Hyperparameter Optimization
For training the deep LSTM (MFS-LSTM) models, we first split the input data into training, validation, and testing sets. After data pre-processing, we trained our proposed architecture on the multi-feature training data using the Keras library [52] with GPU-based TensorFlow as the backend engine. The GPU used for training was an NVIDIA GeForce GTX 1060 6GB.
We selected the LSTM architecture presented in [28], composed of two hidden layers, as our baseline model. We performed comprehensive hyperparameter tuning to select the hyperparameter values that had the most influence on the learning behavior of the deep LSTM network, with respect to reducing the generalization error and convergence time. In particular, we focused on three parameters: the number of hidden units, the learning rate, and the activation function, to see which combination of values aids in reaching a good local minimum. We trained one model for each target appliance with layer widths ranging from 50 to 250 units for the first hidden layer, while the number of hidden units in the second layer was kept at twice that of the first layer in each trial. This means the number of units in the second hidden layer varied from 100 to 500. Similarly, we varied the learning rate by a factor of 10, starting from 0.1 down to 1 × 10⁻⁶. We also tried three activation functions, depending upon the learning curve responses.

Figure 6a shows the learning behavior with varying layer widths corresponding to the first layer. Because the number of units in each layer was varied simultaneously, the learning response shown in Figure 6a also stands for the second layer width. Increasing the layer width from 50 to 200 units demonstrated a downward trend in loss. Figure 6b shows the influence of the layer width on the training and validation loss. Increasing the number of hidden units (layer width) decreases the training loss, but the network tends to overfit for larger widths. This response led us to use a layer width of 150 units for the first layer and 300 units for the second layer. Figure 6c shows the influence of the learning rate on the training loss.
At higher learning rates, for instance at 0.1 and 0.001, the training and validation loss fluctuated, which revealed that the weights diverged and the network was broken as a result. We ramped the learning rate down by a factor of 10 at a time and achieved an optimal learning response at a learning rate of 1 × 10⁻⁴. Figure 6d shows the impact of the activation function on the learning behavior. The rectified linear unit (ReLU) activation function was found to be optimal in our case. The dropout rate was varied between 0.2 and 0.5, and a rate of 0.3 was found to be the best at reducing overfitting.
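A model along these lines can be sketched in Keras. This is a minimal sketch, not the paper's exact implementation: the window length, the dense head size, and the precise layer arrangement are our assumptions, while the tuned values (150/300 hidden units, dropout 0.3, learning rate 1 × 10⁻⁴, ReLU) follow the text:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Bidirectional, LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam

WINDOW, N_FEATURES = 60, 5  # window length is an assumption; 5 selected features

model = Sequential([
    Input(shape=(WINDOW, N_FEATURES)),
    Bidirectional(LSTM(150, return_sequences=True)),  # first hidden layer: 150 units
    Dropout(0.3),                                     # tuned dropout rate
    Bidirectional(LSTM(300)),                         # second hidden layer: 300 units
    Dropout(0.3),
    Dense(128, activation="relu"),                    # ReLU head (illustrative size)
    Dense(1),                                         # predicted appliance power
])
model.compile(optimizer=Adam(learning_rate=1e-4), loss="mse")
print(model.output_shape)
```

One such model would be trained per target appliance on the multi-feature windows.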

Performance Evaluation Metrics
In this paper, we evaluated our approach considering the noised test data from house-2 and house-5 of the UKDALE dataset, and house-1 and house-2 of the ECO dataset. The percent noise ratio [37,38] was calculated on the actual data using the following equation:

$$ \%\text{-NR} = \frac{\sum_{t=1}^{T} \left| \bar{y}_t - \sum_{k=1}^{K} y_t^k \right|}{\sum_{t=1}^{T} \bar{y}_t} \times 100 $$

where $\bar{y}_t$ is the aggregate power and $y_t^k$ is the sub-metered power of target appliance $k$ at time $t$. Since our approach performs both classification and power estimation of target appliances, both classification and estimation evaluation metrics were used. For classification, we used the state-based precision, recall, and F1-measure metrics, which are defined as:

$$ \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $$

where TP, FP, and FN refer to the total number of true positives, false positives, and false negatives in the data, respectively. For power estimation evaluation, we used the mean absolute error (MAE), signal aggregate error (SAE), and estimation accuracy (EA) metrics, which are defined in (13), (14), and (15), respectively:

$$ \text{MAE} = \frac{1}{T} \sum_{t=1}^{T} \left| \hat{y}_t - y_t \right| \quad (13) $$

where $\hat{y}_t$ refers to the predicted power at time $t$, and $y_t$ refers to the ground-truth power at time $t$.
$$ \text{SAE} = \frac{\left| E_p - E_g \right|}{E_g} \quad (14) $$

where $E_p$ is the total predicted energy and $E_g$ is the total ground-truth energy for each appliance. The estimation accuracy (EA) [53] metric was proposed to calculate the correct value of accuracy and error for power-estimation-based NILM problems. The estimation accuracy is defined as:

$$ \text{EA} = 1 - \frac{\sum_{t=1}^{T} \sum_{k=1}^{K} \left| \hat{y}_t^k - y_t^k \right|}{2 \sum_{t=1}^{T} \sum_{k=1}^{K} y_t^k} \quad (15) $$

where $\hat{y}_t^k$ is the predicted power for appliance $k$ at time $t$, and $y_t^k$ is the ground-truth power for appliance $k$ at time $t$. $K$ refers to the total number of target appliances and $T$ refers to the total time sequence used for testing.
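The metrics above can be sketched in NumPy. This is a minimal illustrative implementation (not the authors' code), using per-sample on/off states for the classification metrics:

```python
import numpy as np

def classification_metrics(pred_on, true_on):
    """State-based precision, recall, and F1 from boolean on/off sequences."""
    tp = np.sum(pred_on & true_on)     # true positives
    fp = np.sum(pred_on & ~true_on)    # false positives
    fn = np.sum(~pred_on & true_on)    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def mae(y_pred, y_true):
    """Mean absolute error, Equation (13)."""
    return np.mean(np.abs(y_pred - y_true))

def sae(y_pred, y_true):
    """Signal aggregate error, Equation (14): |E_p - E_g| / E_g."""
    e_p, e_g = np.sum(y_pred), np.sum(y_true)
    return abs(e_p - e_g) / e_g

def estimation_accuracy(y_pred, y_true):
    """Estimation accuracy, Equation (15): 1 - sum|err| / (2 * sum ground truth)."""
    return 1.0 - np.sum(np.abs(y_pred - y_true)) / (2.0 * np.sum(y_true))
```

For example, a prediction that halves one 100 W sample of a four-sample trace gives MAE = 12.5 W, SAE = 0.25, and EA = 0.875 while keeping perfect state-based precision and recall.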

Results with the UKDALE Dataset
The seen scenario refers to testing on houses whose data was seen during training, using a test period that was held out from training. We tested individual appliance models of the kettle, microwave, dishwasher, fridge, washing machine, rice cooker, electric oven, and television on the last week of data from two houses of the UKDALE dataset. Sub-meter data of six appliances were taken from house-2 of the UKDALE dataset, whereas the electric oven and television data were obtained from house-5. The last week of data was not used during training, which makes it unseen test data. Trained MFS-LSTM models for each target appliance were tested using a noised aggregated signal as input, and the algorithm's task was to predict a clean disaggregated signal for each target appliance. Figure 7 shows the disaggregation results of some of the target appliances. Visual inspection of Figure 7 shows that our proposed MFS-LSTM algorithm successfully predicted the activations and energy consumption sequences of all target appliances in the given period. The proposed algorithm also predicted some irrelevant activations, which were successfully eliminated using our post-processing technique. Elimination of irrelevant activations improved precision and reduced extra predicted energy, which in turn improved the classification and power estimation results of all target appliances. Numerical results of the eight target appliances of UKDALE in the seen scenario are presented in Table 2. With the help of the post-processing technique, the overall F1 score (average score over all target appliances) improved from 0.688 to 0.976 (30% improvement) and the MAE reduced from 23.541 watts to 8.999 watts on the UKDALE dataset. Similarly, the estimation accuracy improved from 0.714 to 0.959. Although a significant improvement in F1 scores and MAE was observed with the use of the post-processing technique, the SAE and EA results slightly decreased for the kettle, microwave, and dishwasher as compared to the results without post-processing.
The reason for the decrease in estimation accuracy and the increase in signal aggregate error is the overall decrease in predicted energy after eliminating irrelevant activations.
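The elimination of short, irrelevant predicted activations can be sketched as follows. This is an illustrative NumPy implementation under assumed settings (a per-appliance on/off power threshold and a minimum plausible activation length in samples), not the authors' exact post-processor:

```python
import numpy as np

def suppress_short_activations(power, threshold, min_length):
    """Zero out predicted activations shorter than min_length samples.

    power      : 1-D array of predicted appliance power
    threshold  : on/off power threshold in watts (assumed setting)
    min_length : minimum plausible activation length for this appliance,
                 in samples (assumed, appliance-specific setting)
    """
    on = power > threshold
    # Pad with False so every on-segment has a detectable start and end.
    padded = np.concatenate(([False], on, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = changes[0::2], changes[1::2]  # segment boundaries
    cleaned = power.copy()
    for s, e in zip(starts, ends):
        if e - s < min_length:       # segment too short to be a real run
            cleaned[s:e] = 0.0
    return cleaned
```

For a 1 Hz series, `min_length` would correspond to the shortest realistic run time of the appliance (e.g., tens of seconds for a kettle), so isolated spikes learned from non-target appliances are removed while genuine activations survive.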

Results with the ECO Dataset
The disaggregation results of seven appliances are shown in Table 3. These results were calculated using 1-month data which was unseen during training. Not all the appliances were present in all six houses of the ECO dataset. Kettle, fridge, and washing machine data were obtained from house-1, whereas dishwasher, electric stove, and television data were retrieved from house-2 of the ECO dataset. Similarly, microwave data were obtained from house-5 of the ECO dataset. Type-2 appliances such as the dishwasher and washing machine are very hard to classify because of the various operational cycles present during their operation. With our proposed MFS-LSTM integrated with post-processing, type-2 appliances have successfully been classified and their power consumption estimation resembles ground-truth consumption according to Figure 8. Although our algorithm was able to classify all target appliance activations, the presence of irrelevant activations in Figure 8 (left) indicates that the deep LSTM model learned some features of non-target appliances during training. This can happen due to the similar-looking activation profiles of type-1 and type-2 appliances. This effect was eliminated with the use of the post-processing technique, whose advantage can easily be realized from the results shown in Tables 2 and 3 for the seen scenario.

Testing in an Unseen Scenario (Unseen Data from UKDALE House-5)
The generalization capability of our network was tested using data from a house that was completely unseen during training. In this test case, we used the entire house-5 data from the UKDALE dataset for disaggregation and made sure that the testing period contained activations from all target appliances. The UKDALE dataset contains 1-s sampled mains data and 6-s sampled sub-metered data; therefore, we up-sampled the ground-truth data to 1-s for comparison.
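The up-sampling step can be sketched as follows. The paper does not state the interpolation method, so holding each 6-s reading constant across six 1-s slots (sample-and-hold) is assumed here:

```python
import numpy as np

def upsample_to_1s(power_6s):
    """Up-sample a 6-second-sampled power series to 1 Hz by sample-and-hold.

    Each 6-s power reading is repeated across six 1-s slots; this preserves
    the average power over every 6-s window.
    """
    return np.repeat(np.asarray(power_6s, dtype=float), 6)
```

Sample-and-hold keeps the per-window energy consistent, which matters because the metrics compare cumulative energy as well as instantaneous power.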
Performance evaluation results of the proposed algorithm with and without post-processing in the unseen scenario are presented in Table 4. In the unseen scenario, the post-processed MFS-LSTM algorithm achieved an overall F1 score of 0.746, which was 54% better than without post-processing. Similarly, the MAE reduced from 26.90 W to 10.33 W, the SAE reduced from 0.782 to 0.438, and the estimation accuracy (EA) improved from 0.609 to 0.781 (a 28% improvement). When the MAE, SAE, and EA scores of the unseen test case were compared with the seen scenario, a visible difference in the overall results was observed. One obvious reason for this difference was the different power consumption patterns of the house-5 appliances; also, the %-NR was higher in house-5 (72%) compared to the house-2 noise ratio of 19%. However, the overall results prove that the proposed algorithm can not only estimate the power consumption of target appliances from a seen house but can also identify appliances from a completely unseen house with unseen appliance activations.

Energy Contributions by Target Appliances
Apart from individual appliance evaluation, it is also necessary to analyze the total energy contribution of each target appliance. In this way, we can understand the overall performance of the algorithm when acting as part of a NILM system. This information is helpful for analyzing how well the algorithm estimates the combined power consumption of all target appliances over a given period, and how closely that estimate tracks the actual aggregated power consumption. Figure 9 shows the energy contributions from all target appliances in both the seen and unseen test cases from the UKDALE and ECO datasets. The first thing to notice in Figure 9 is that the estimated power consumption is less than the actual power consumption in both datasets. This happened because of the elimination of irrelevant activations, which had previously contributed extra predicted energy. Another useful insight is that the difference between estimated and actual power consumption for type-2 appliances (dishwasher, washing machine, electric oven) is relatively higher than for type-1 appliances. This is likely due to the multiple operational states of type-2 appliances, which are very hard to identify and whose power consumption is also very difficult for DNN models to estimate. Energy contributions for all target appliances of the ECO dataset (Figure 9) are higher compared to the UKDALE appliances. This is due to the time span over which the energy consumption of individual appliances was computed: for the UKDALE dataset, 1-week test data was used for evaluation, whereas for the ECO dataset, one-month data was used. Detailed results for energy consumption evaluation in terms of noise ratio, percentage of disaggregated energy, and estimation accuracy are shown in Table 5.
As described in Section 3.3, the noise ratio refers to the energy contribution of non-target appliances. In our test cases, the total energy contributions of all target appliances in the said houses were 80.66%, 27.92%, 16.24%, and 79.49%, respectively. Based on the results presented in Table 5, our algorithm successfully estimated the power consumption of target appliances with an accuracy of 0.891 in UKDALE house-2, 0.886 in UKDALE house-5, 0.900 in ECO house-1, and 0.916 in ECO house-2.
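The noise ratio of Section 3.3 can be sketched directly from aggregate and sub-metered traces. This is an illustrative implementation of the common NILM definition (the paper cites [37,38] for it), not the authors' code:

```python
import numpy as np

def percent_noise_ratio(aggregate, submeters):
    """%-NR: share of aggregate energy not explained by target appliances.

    aggregate : 1-D array of mains power over time
    submeters : 2-D array (appliances x time) of target-appliance power
    """
    # Energy left over after subtracting the sum of all target appliances.
    residual = np.sum(np.abs(aggregate - submeters.sum(axis=0)))
    return 100.0 * residual / np.sum(aggregate)
```

With this definition, a house where target appliances explain 80% of the mains energy has a %-NR of roughly 20%, matching the way house-level predictable power is reported in Table 5.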

Performance Comparison with State-of-the-Art Disaggregation Algorithms
We compared the performance of our proposed MFS-LSTM algorithm with the neural-LSTM [31], the denoising autoencoder (dAE) algorithm [32], the CNN-based sequence-to-sequence algorithm CNN(S-S) [33], and benchmark implementations of the factorial hidden Markov model (FHMM) algorithm and the combinatorial optimization (CO) algorithm [12] from the NILM toolkit [49]. We chose these algorithms for comparison for several reasons. First, the neural-LSTM, dAE, and CNN(S-S) were also evaluated on the UKDALE dataset. Second, these algorithms were validated on individual appliance models, as we did. Third, [31–33] also evaluated their approaches in both seen and unseen scenarios. Lastly, recent NILM works [45,54,55] have used these algorithms (CNN(S-S), CNN(S-P), neural-LSTM) to compare their approaches, which is why these three are referred to as benchmark algorithms in NILM research.
UKDALE house-2 and house-5 data were used to train and test the benchmark algorithms in the seen and unseen test cases. Four months of data were used for training, whereas 10 days of data were used for testing. The min-max scaling method was used to normalize the input data, and individual models of five appliances were prepared for comparison. Hardware and software specifications were the same as described in Section 3.2. Table 6 shows the training and testing times for the above-mentioned disaggregation algorithms in terms of the length of data in days. Many factors affect the training time of an algorithm, including the number of training samples, trainable parameters, hyperparameters, GPU power, and the complexity of the algorithm. Considering these factors, the combinatorial optimization (CO) algorithm has the lowest complexity and is thus the fastest to execute [56], as can also be observed from its training time in Table 6. The FHMM algorithm was the second fastest, followed by the dAE algorithm. The training time results show that the proposed MFS-LSTM algorithm has a faster execution time than the neural-LSTM and CNN(S-S) because of its fewer parameters and relatively simple deep RNN architecture. Figure 10 shows the load disaggregation comparison of the MFS-LSTM with the dAE, CNN(S-S), and neural-LSTM algorithms in the seen scenario. Qualitative comparison of Figure 10 shows that the MFS-LSTM algorithm disaggregated all target appliances and proved better than the dAE, neural-LSTM, and CNN(S-S) algorithms in terms of power and state estimation accuracy. Although all algorithms correctly estimated the operational states of the target appliances, the dAE algorithm showed relatively poor power estimation performance when disaggregating the kettle, fridge, and microwave. The CNN(S-S) performance was better when disaggregating the microwave; however, for all other appliances, its performance was comparable with the MFS-LSTM algorithm.
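The min-max normalization applied to the input data can be sketched as follows; a minimal NumPy version, assuming scaling of each feature to the [0, 1] range:

```python
import numpy as np

def min_max_scale(x):
    """Min-max normalization of a feature vector to the range [0, 1]."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    # Guard against a constant feature, which would divide by zero.
    return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)
```

In practice the minimum and maximum would be computed on the training split only and reused for the test split, so that the test data is scaled consistently with what the model saw during training.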
These findings can be better understood through the quantitative scores of all algorithms in terms of F1 score and estimation accuracy, as shown in Table 7. As shown in Figure 10, the dAE's F1 score was lower for the kettle compared to all other algorithms. The neural-LSTM performed better in terms of F1 score except for the dishwasher and washing machine. The CNN(S-S) performance remained comparable with the MFS-LSTM for all target appliances. The CO and FHMM algorithms showed lower state estimation accuracy compared to all other algorithms. When the overall (average) performance was considered, the MFS-LSTM achieved an overall F1 score of 0.887, which was 5% better than the CNN(S-S), 31% better than the dAE, 43% better than the neural-LSTM, and 200% better than the CO and FHMM algorithms. Considering the MAE scores, the MFS-LSTM achieved the lowest mean absolute error for all target appliances with an overall score of 5.908 watts. Only the CNN(S-S) scores were close to those of the MFS-LSTM; however, the overall MAE score of the MFS-LSTM was two times lower than the CNN(S-S), almost four times lower than the dAE, and six times lower than the neural-LSTM.
Considering SAE scores, our algorithm achieved the lowest SAE scores of 0.043 for the kettle, 0.121 for the fridge, and 0.288 for the dishwasher. The MFS-LSTM algorithm's consistent scores across all target appliances yielded an overall SAE score of 0.306, which was very competitive with the CNN(S-S), neural-LSTM, and dAE. However, the overall score of 0.306 was 71.6% lower than the CO and 92.5% lower than the FHMM algorithm. When estimation accuracy scores were considered, the dAE's power estimation accuracy was higher for the fridge and dishwasher, and lower for the microwave and washing machine. EA scores for the neural-LSTM algorithm were lower for multi-state appliances. However, the MFS-LSTM algorithm achieved an overall estimation accuracy of 0.847 by being consistent in disaggregating all target appliances with high classification and power estimation accuracy.
Table 8 shows the performance evaluation scores of the benchmark algorithms in the unseen scenario. The F1, MAE, SAE, and estimation accuracy scores again prove the effectiveness of the MFS-LSTM algorithm in the unseen scenario compared to the benchmark algorithms. Considering the F1 score, the MFS-LSTM algorithm achieved more than 0.76 for all target appliances except the microwave. The MFS-LSTM achieved an overall score of 0.746, which was 200% better than the neural-LSTM, 27% better than the CNN(S-S), and 22% better than the dAE algorithm. The MAE scores of the MFS-LSTM were lower for all target appliances compared to the benchmark algorithms in the unseen scenario. Our algorithm achieved an overall score of 10.33 watts, which was six times lower than the dAE and CNN(S-S), and seven times lower than the neural-LSTM. The same trend was observed with the SAE scores, in which the MFS-LSTM algorithm achieved the lowest SAE scores for all target appliances except the microwave. An overall SAE score of 0.438 for the MFS-LSTM algorithm was 38% lower than the CNN(S-S), 59% lower than the CO, 80% lower than the FHMM, and 87% lower than the neural-LSTM.
Estimation accuracy scores were also high for the MFS-LSTM with an overall score of 0.781. One noticeable factor is the difference in scores between the MFS-LSTM and all other algorithms in the unseen scenario. The differences shown prove the superiority of the proposed algorithm in the unseen scenario as well. Considering the noised aggregate power signal, our multi-feature input space-based approach together with post-processing can disaggregate target appliances with high power estimation accuracy as compared to state-of-the-art algorithms.
In accordance with the parameters in Table 5, the UKDALE house-2 and house-5 noise ratios were 19.34% and 72.08%, respectively. This implies that the total predictable power was 80.66% and 27.92%, respectively. To assess the percentage of predicted energy (energy contributions by all target appliances), estimation accuracy scores for all disaggregation algorithms are shown in Table 9. The presented results again highlight the proposed algorithm's superior performance, with an estimation accuracy of 0.994 and 0.956 in the seen and unseen test cases, respectively. These results suggest that our proposed algorithm efficiently estimates the power consumption of all target appliances over a given period of time.

Table 9. Evaluation of total energy contributions by target appliances in disaggregation algorithms.

Conclusions
The ultimate goal of a NILM solution is to apply it in real-time, which is possible only if the disaggregation algorithm fulfills practical application requirements. Intelligent and viable disaggregation algorithms such as deep neural networks can fulfill the NILM objective if they have high estimation accuracy and low generalization error. However, high accuracy and low error are subject to the availability of a sufficient amount of training data. Recent applications of deep learning have shown that feature space exploration is a viable substitute for a large amount of data. Therefore, in order to achieve high estimation accuracy and low generalization error in a noised scenario, this paper proposed a multi-feature input space-based LSTM (MFS-LSTM) algorithm integrated with post-processing.
First, the mutual information method was used to select the electrical features that had a strong influence on each target appliance's active power consumption. Based on the mutual information analysis, five electrical features were selected to form the multi-feature input space. After training individual models of nine target appliances using the multi-feature input data, we tested them in a noised scenario on held-out data from a house seen during training and on data from a house unseen during training. The proposed MFS-LSTM algorithm successfully predicted appliance activations, along with some irrelevant activations. To eliminate those sporadic activations, we introduced a post-processing technique at the disaggregation stage. Our post-processor compared the lengths of predicted activations against the lengths of actual activations and eliminated those predicted activations that were shorter than the actual ones. To make our solution deployable in real-time, we also proposed a three-stage NILM framework that uses pre-trained appliance models (trained with the MFS-LSTM algorithm) to perform online disaggregation.
To validate our approach, we compared our MFS-LSTM-based deep RNN model with state-of-the-art disaggregation algorithms and accomplished an overall 66% improvement in F1 score in the seen scenario and a 120% improvement in the unseen scenario. In terms of MAE, we achieved 60% less error in the seen scenario and 69% less error in the unseen scenario. Considering SAE, the MFS-LSTM algorithm achieved overall 15% less error in the seen scenario and 34% less error in the unseen scenario. Similarly, we achieved 5% and 40% improvements in estimation accuracy in the seen and unseen test cases, respectively. The results showed individual appliance performance degradation in the unseen test case due to the high percentage of noise present in the unseen test data and the different power consumption patterns of the unseen target house. However, when the actual measured power and the percentage of its disaggregation were considered, our proposed algorithm was able to disaggregate target appliances with more than 0.88 estimation accuracy in all target houses despite the high amount of noise present. Apart from better estimation accuracy and generalization, we also showed that the proposed algorithm is computationally efficient and that its performance is independent of an increased number of appliances, which makes it more suitable for practical application. Future work relevant to this study will focus on training more appliances using both real and synthetic data so that the trained models generalize to various appliances having different shapes and activation lengths.
Nomenclature
a^(t) — the activation value at time index t
a^(t−1) — the activation value at time index t−1
a_n^a — the nth activation from the ground-truth activations list
a_n^p — the nth activation from the predicted appliance activations list
A_g — the list of ground-truth activations
A_p — the list of predicted appliance activations
Â_p — the updated predicted activation profile
b_c — the bias value for the tanh layer
b_f — the bias value for the forget layer
b_o — the bias value for the output layer
b_u — the bias value for the update layer
c^(t) — the new cell state
ĉ^(t) — the output of the tanh layer
c^(t−1) — the previous cell state
p(x) — the probability density function of variable x
p(y) — the probability density function of variable y
p(x, y) — the joint probability density function of variables x and y
Q — the measured reactive power
ReLU — rectified linear unit
RNN — recurrent neural network
S — the measured apparent power
sin θ — the sine of the angle between the RMS line voltage and line current
SVM — support vector machine
T — the total time sequence used for training/testing
tanh — the hyperbolic tangent function
TP — the accumulated true positives
UKDALE — UK Domestic Appliance-Level Electricity dataset
V_rms — the RMS line voltage
W_c — the weight value for the tanh layer
W_f — the weight value for the forget layer
W_o — the weight value for the output layer
W_u — the weight value for the update layer
x^(t) — the input power sequence at time step t
y_t — the ground-truth power at time step t
y_t^k — the ground-truth power for appliance k at time step t
ŷ_t — the predicted power at time step t
ŷ_t^k — the predicted power for appliance k at time step t