Comparative Evaluation of Non-Intrusive Load Monitoring Methods Using Relevant Features and Transfer Learning

Abstract: Non-Intrusive Load Monitoring (NILM) refers to the analysis of the aggregated current and voltage measurements of Home Electrical Appliances (HEAs) recorded by the house electrical panel. Such methods aim to identify each HEA for a better control of the energy consumption and for future smart grid applications. Here, we are interested in an event-based NILM pipeline, and particularly in the HEAs' recognition step. This paper focuses on the selection of relevant and understandable features for efficiently discriminating distinct HEAs. Our contributions are manifold. First, we introduce a new publicly available annotated dataset of individual HEAs described by a large set of electrical features computed from current and voltage measurements in steady-state conditions. Second, we investigate through a comparative evaluation a large number of new methods resulting from the combination of different feature selection techniques with several classification algorithms. To this end, we also investigate an original feature selection method based on a deep neural network architecture. Then, through a machine learning framework, we study the benefits of these methods for improving HEA identification in a supervised classification scenario. Finally, we introduce new transfer learning results, which confirm the relevance and the robustness of the selected features learned from our proposed dataset when they are transferred to a larger dataset. As a result, the best investigated methods outperform the previous state-of-the-art results and reach a maximum recognition accuracy above 99% on the PLAID evaluation dataset.


Introduction
During the last decades, electricity consumption in the residential sector has increased steadily with worldwide population growth and has become a major ecological issue. Prior studies show that real-time feedback down to the HEA level can help to effectively reduce consumption, with almost 15% of energy savings [1]. For consumers, the main advantages are the control and the understanding of their electricity consumption through transparent access to promptly forwarded information. For utilities, it can improve the load-forecasting accuracy and provide a basic scheme to set up energy management strategies [2]. In this context, Non-Intrusive Load Monitoring (NILM) methods offer an efficient answer, since they provide a breakdown of the residential energy consumption without instrumenting each HEA.
Here, we are interested in event-based NILM systems, where the current and voltage measurements are recorded using a single sensor connected to the house electrical panel [3][4][5][6][7]. An event detection method is used to detect the changes in the aggregated power signals that occur when an HEA changes its operating state [6]. Then, relevant features that meet the additive criterion [8] (which is required for the subsequent steps) are computed to recognize, using a pattern matching method, the electrical signature of the HEA that triggered an event.
NILM encounters several challenges that concern the correct identification of HEAs. Indeed, in a house, a broad range of HEAs can be present, and many of them can exhibit the same electrical behavior [9][10][11]. Hence, discerning the most relevant and informative features is of paramount importance for any NILM application whose performance depends on the uniqueness of the HEA signature [4,9,12]. This is the reason why an HEA must be described by a reduced number of relevant features. To date, prior studies have focused on HEA recognition performance and often involve features that are not physics-related and are difficult to interpret. This is the case for the features provided by deep convolutional neural networks (CNN), which can lack robustness to adversarial examples [13,14] and require the use of attention mechanisms [15]. However, only a few works investigate in detail the role and the meaning of Feature Selection (FS) methods in NILM problems when addressed through a pattern recognition approach [16][17][18][19][20].
Thus, this study aims at filling the gap by investigating different FS techniques to tackle the HEA identification problem in a supervised machine learning scenario when a large number of electrical features are available and when HEAs from distinct manufacturers belonging to the same device category are considered. The goal is to show the benefits provided by FS in terms of HEAs' classification performance and interpretability, and for enhancing the generalization capability by reducing overfitting of the trained models. Transfer learning [21], which deals with the generalization capability of the selected features across different datasets, is also an important NILM challenge that is investigated in this study.
This paper is organized as follows. In Section 2, the addressed problem is formulated. In Section 3, we introduce the two HEAs' datasets considered in this study. In particular, we present a novel dataset of HEAs' current and voltage measurements in steady-state conditions, and we detail the electrical features computed from these measurements. In Section 4, several FS methods are presented. Two novel ones are detailed: a heuristic forward method and a method based on a trained dense neural network. Finally, in Section 5, the selected features are used in combination with several classification algorithms on both considered datasets to demonstrate the importance of selecting a suitable combination of features and classification algorithm. The paper is concluded with future work directions in Section 6.

Problem Statement

Supervised HEA Identification
From the values of a set of features describing a unique HEA signature, we aim at identifying the closest pre-registered HEA, referred to (according to the machine learning literature) as a class. To this end, we use a supervised machine learning framework trained on an annotated dataset. The overall HEAs' recognition problem is illustrated in the flowchart of Figure 1, which also includes references to the paper sections where each method is detailed. During the training step, several features (or descriptors) deduced from voltage and current HEAs' measurements are first computed and used to derive a discriminative model for each HEA. Then, the most discriminant features are automatically selected to obtain the optimal classification accuracy. During the test and operating procedures, the selected features are computed from the observed measurements to predict the class of the corresponding HEA.
Since this work is motivated by HEA recognition for energy consumption estimation and prediction, the same appliance type can be assigned to different classes when the HEA is recorded at different power levels, or when its energy consumption differs sufficiently from one manufacturer to another. Hence, we use a larger and more complicated classification taxonomy than those commonly used for HEA identification in the literature. On the other hand, we only consider loads in steady-state condition because switching on (or off) introduces transient signals or fluctuations which are often not sufficient to accurately characterize an HEA. In fact, transients are also affected by HEA-independent factors such as the time on the voltage waveform where the HEA is switched on/off, the network impedance, supply voltage distortion, the sampling frequency or the switching on/off mechanism [22]. To deal with transient signals and state change detection for multiple HEA recognition, the reader can for example refer to a multivariate statistical approach recently proposed in [6,23].
Figure 1. Flowchart of the general process for supervised HEA identification.

Features Selection for HEA Identification
Let $F = \{f_1, \ldots, f_p\}$ be the overall set of p features. The FS process aims at finding the optimal subset of features $F' \subseteq F$ that maximizes the identification accuracy, with a cardinality $\mathrm{card}(F') = d \le p$ [24]. This process induces several benefits such as avoiding the curse of dimensionality and the overfitting phenomenon [25], improving the classification performance by removing non-discriminating features, and decreasing the computational cost by only collecting the needed features. In contrast to the other dimensionality reduction techniques [24], FS does not change the original meaning of the features and therefore allows further interpretation by a domain expert. Our work consists in evaluating several feature selection techniques applied to an HEA recognition scenario, in terms of recognition accuracy and in terms of robustness, by investigating two distinct datasets and several classification methods.

Transfer Learning
The last challenge addressed in this study concerns transfer learning [21], which consists in using the knowledge extracted from a given dataset to tackle a new problem based on a different dataset with a different setup and classification taxonomy. The investigated datasets can be of different natures since they use different recording protocols, different grid properties and different annotation taxonomies. The main motivation is to show the validity and the generalization capability of the selected features from one dataset to another. To this end, we introduce a novel dataset recorded on a French grid (utility frequency at 50 Hz), which differs from the other existing publicly available datasets such as PLAID [26], which was recorded on the USA grid with a utility frequency of 60 Hz. In addition, our new proposed dataset contains a large variety of HEA types including new ones such as LCD and plasma TVs, coffee makers, and ovens.

HEAs Datasets
To test the performance of an NILM technique, it is important to consider real data. Due to the challenges in the creation of such datasets, mainly related to the required time and the high costs involved, we investigated in this study two datasets that are freely available: the PLAID dataset [26], which is shared online and is used as a common reference, and our novel publicly available dataset (freely accessible at http://dx.doi.org/10.21227/ww76-d733). Both datasets contain individual HEAs measurements which are convenient for extracting features, training models, conducting performance evaluation and performing benchmarking on a common basis. Indeed, some existing datasets include scenarios of multiple simultaneous loads [27]. However, before conducting disaggregation (i.e., decomposing the whole energy consumption of a dwelling into the energy usage of individual HEAs), it is important to build an initial signature database that is key to many NILM techniques.
For both datasets considered in this study, a class of HEA corresponds to a brand of a category of HEA, e.g., the class "Incandescent light bulb-Electrix-soft white", which is distinct from the HEA class "Incandescent light bulb-Philips Duramax". Furthermore, we only consider steady-state conditions, and we extracted for each class of HEAs in both datasets several periods of the current and voltage steady-state waveforms. Indeed, when an HEA is switched on or off, momentary fluctuations of the current and voltage signals occur before they settle to a steady-state value. These fluctuations are called transients and can characterize a given HEA. However, one major drawback of the switching transients is their poor reproducibility, since they can be affected by HEA-independent factors.

PLAID Dataset
This dataset [26] contains current and voltage measurements sampled at 30 kHz from 11 different HEA types present in more than 60 households in Pittsburgh, Pennsylvania, USA. The goal of this dataset is to provide a public library for high-frequency measurements that can be used to assess existing or novel HEA classification algorithms. There are 11 categories of HEAs, which correspond to: air conditioner, compact fluorescent lamp, fridge, hairdryer, laptop, microwave, washing machine, bulb, vacuum, fan, and heater. Each category of HEA is represented by more than ten different instances. For each HEA, three to six measurements are collected for each state transition (i.e., on/off changing state). As mentioned, we only consider the steady-state operations, and we extracted for each class of HEAs in the PLAID dataset several periods of the current and voltage steady-state waveforms such that we have a total of 71 HEAs (with different categories and brands) and n = 36,720 distinct recordings (also called individuals in the statistical terminology).

New Proposed Dataset
We introduce a novel publicly available dataset containing 24 categories of HEAs (e.g., fans, fridges, washers, etc.) from distinct brands (35 types), recorded at several power levels, for a total of 61 considered HEAs. For HEAs with a wide variety of operational programs or adjustable settings such as temperature or intensity, we recorded all power consumption patterns and considered that each HEA power level corresponds to a specific class of HEA, although these power consumption patterns refer to the same device.
[Figure 2]
The HEAs have been recorded in steady-state conditions in a French 50 Hz electrical grid. The measurement setup consists of an AC current probe (E3N Chauvin Arnoux) with a 10 mV/A sensitivity and a differential voltage probe with a 1/100 attenuation. Voltage and current waveforms were captured by an 8-bit resolution digital oscilloscope (RIGOL DS1104Z). The sampling rate was set at $f_{s1}$ = 250 kHz for some of the recordings and $f_{s2}$ = 50 kHz for the others. The electrical assembly and instrumentation are depicted in Figure 3. Each HEA of this dataset is described by 8 periods of the current and voltage steady-state waveforms, resulting in a set of n = 8 × 61 = 488 distinct individuals.

Electrical Features Computed from Current and Voltage Measurements
We describe the different HEAs using 90 features extracted at each voltage period and summarized in Table 1. The details of their computation were introduced in [19], based on the latest IEEE 1459-2010 standard for the definition of single-phase physical components under non-sinusoidal conditions [28,29]. From the voltage v(t) and current i(t) signals, we compute the Fourier coefficients $v_{ak}$, $v_{bk}$, $i_{ak}$ and $i_{bk}$ over one period of the sampled signals, where $x[m] = x(m/f_s)$ is the sampled discrete-time signal, $f_s$ is the sampling frequency, $f_0$ the utility frequency (e.g., 50 Hz in France) and $M = f_s/f_0$ is the number of samples per period. From these Fourier coefficients, the following features can be computed for any harmonic rank k ∈ N:
• The Root Mean Squared (RMS) value of the k-th harmonic component of the voltage $V_k$ and current $I_k$, and their sums $V_H$ and $I_H$;
• The RMS voltage V and current I;
• The k-th harmonic component of the active, reactive, and apparent powers $P_k$, $Q_k$, $S_k$, and their sums $P_H$, $Q_H$, $S_H$;
• The active, reactive, apparent, and distortion powers P, Q, S, and D;
• The voltage and current total harmonic distortion $THD_V$ and $THD_I$;
• The voltage and current distortion powers $D_V$ and $D_I$;
• The non-fundamental apparent power $S_N$;
• The voltage and current crest factors, computed over the samples m ∈ [0, M − 1];
• Finally, the global and harmonic power factors $F_p$ and $F_{pk}$.

Table 1. Summary of the proposed electrical features [19]. The features colored in red meet the additivity criterion given by Equation (13).

Electrical Features Name | Number of Computed Features
Current harmonic distortion $THD_I$ | 1
Distortion power D, $D_I$, $D_V$ (VAD) | 3
Power factor $F_p$ and $F_{pk}$ for k ∈ {1, ..., 15} | 16
Current crest factor $F_{CI}$ | 1
Total | 90

Since the NILM problem consists in the breakdown of an unknown mixture of HEAs into a set of identifiable HEAs signatures possibly belonging to a database, it is important to consider the electrical features that meet the additivity criterion [8] such that:
$$f\left(v, \sum_{n=1}^{N_s} i_n\right) = \sum_{n=1}^{N_s} f(v, i_n) \qquad (13)$$
where f(v, i) is an electrical feature, v is a vector of voltage samples, $i_n$ is the vector of current samples of the n-th HEA acquired during one voltage period and $N_s$ is the number of HEAs that are simultaneously switched on. Thus, when an HEA is connected to (resp. disconnected from) the power network, an "additive" feature is increased (resp. decreased) by an amount equal to that produced by this HEA operating individually. Among the 90 features, p = 34 features meet the additivity criterion and are reported in red in Table 1. This property is required for extracting HEAs features from an aggregated signal and for comparing them with the dataset of individual HEAs signatures. For example, a change detection method as proposed in [6] can be used in order to separate the distinct contribution of each HEA when several ones are switched on simultaneously. Hence, this study only focuses on the additive features computed for a unique observed HEA. Both datasets are stored in an n × p normalized matrix X, where the p = 34 features detailed earlier are in columns and have been normalized to have a zero mean and a unit standard deviation.
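To make the additivity criterion concrete, the following minimal Python sketch computes the Fourier coefficients of one period and the fundamental active and reactive powers, then checks additivity on two synthetic loads fed by the same voltage. The 2/M normalization, the sign convention chosen for $Q_k$, the function names and the synthetic waveforms are assumptions made for illustration; they are not taken from [19].

```python
import math

def fourier_coeffs(x, k):
    """Cosine (a_k) and sine (b_k) coefficients of harmonic k over one period."""
    M = len(x)  # samples per period, M = fs / f0
    a = 2.0 / M * sum(x[m] * math.cos(2 * math.pi * k * m / M) for m in range(M))
    b = 2.0 / M * sum(x[m] * math.sin(2 * math.pi * k * m / M) for m in range(M))
    return a, b

def harmonic_powers(v, i, k):
    """Active and reactive power of harmonic k (sign convention is an assumption)."""
    va, vb = fourier_coeffs(v, k)
    ia, ib = fourier_coeffs(i, k)
    p_k = 0.5 * (va * ia + vb * ib)
    q_k = 0.5 * (va * ib - vb * ia)
    return p_k, q_k

def rms(x):
    """Root mean squared value of one period of samples."""
    return math.sqrt(sum(s * s for s in x) / len(x))

# Synthetic example: M = 1000 samples per period (e.g., fs = 50 kHz, f0 = 50 Hz).
M = 1000
theta = [2 * math.pi * m / M for m in range(M)]
v = [325.0 * math.sin(t) for t in theta]
i_a = [2.0 * math.sin(t - math.pi / 6) for t in theta]            # lagging load
i_b = [1.0 * math.sin(t) + 0.3 * math.sin(3 * t) for t in theta]  # distorted load
i_sum = [a + b for a, b in zip(i_a, i_b)]                         # both loads on

p_a, q_a = harmonic_powers(v, i_a, 1)
p_b, q_b = harmonic_powers(v, i_b, 1)
p_sum, q_sum = harmonic_powers(v, i_sum, 1)

# P_1 is linear in the current, hence additive; the RMS current is not.
additive_p = abs(p_sum - (p_a + p_b)) < 1e-6
additive_rms = abs(rms(i_sum) - (rms(i_a) + rms(i_b))) < 1e-6
```

In this sketch `additive_p` is true because the harmonic powers are linear in the current for a fixed voltage, whereas `additive_rms` is false: the RMS value of a sum of currents is in general not the sum of the individual RMS values.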

Feature Selection
Existing approaches for feature selection can be categorized into filter-based methods and wrapper-based methods [30,31]. Filter-based methods perform FS independently of the classification process. Each feature is individually assigned a relevance score, which is assumed to reflect its usefulness in the classification task. The features are then sorted by descending order of relevance score [32]. Filter-based methods are often computationally faster but are known to be less accurate than other approaches. On the other hand, wrapper-based methods use the classifier of interest to score feature subsets according to the classification accuracy. This allows selecting an optimal subset of features that maximizes the classification accuracy, at the price of a higher computational cost [32].

Investigated Feature Selection Methods
This paper investigates the methods listed below, which are comparatively assessed in the remainder. More details are given on the sequential forward FS method and on our original DNN-based method, while the other methods are briefly reviewed from the literature.

• Principal Component Analysis (PCA) can be used as a filter-based method based on the maximization of the dataset dispersion [33,34].
• Linear Discriminant Analysis (LDA) can be used as a supervised filter-based method maximizing the separation between classes (see Section 5.1).
• Mutual Information (MI) is a filter-based method measuring the amount of information each feature conveys about the class labels [35].
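As an illustration of a filter-based relevance score, the sketch below estimates the mutual information between one feature and the class labels with a simple histogram estimator. The equal-width binning, the number of bins and the toy values are assumptions made for this example; the estimator actually used in [35] may differ.

```python
import math
from collections import Counter

def mutual_information(feature, labels, n_bins=4):
    """Histogram estimate of I(feature; label), in nats."""
    lo, hi = min(feature), max(feature)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width for constant features
    bins = [min(int((x - lo) / width), n_bins - 1) for x in feature]
    n = len(labels)
    joint = Counter(zip(bins, labels))
    count_bin = Counter(bins)
    count_lab = Counter(labels)
    mi = 0.0
    for (b, y), c in joint.items():
        # p(b, y) * log( p(b, y) / (p(b) * p(y)) )
        mi += (c / n) * math.log(c * n / (count_bin[b] * count_lab[y]))
    return mi

# Hypothetical toy data: two classes, one informative and one uninformative feature.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
informative = [0.10, 0.20, 0.15, 0.12, 0.90, 0.95, 0.85, 0.99]
uninformative = [0.10, 0.90, 0.20, 0.80, 0.15, 0.85, 0.12, 0.95]

mi_informative = mutual_information(informative, labels)
mi_uninformative = mutual_information(uninformative, labels)
```

Features are then ranked by descending score: here the informative feature reaches the maximum possible value for two balanced classes (log 2), while the uninformative one scores zero.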

New Proposed Sequential Forward Method
This approach is an iterative wrapper-based FS method that aims at maximizing the classifier accuracy when adding each feature one-by-one [30,[36][37][38][39]. The first iteration starts with an empty set of selected features, and each feature f ∈ F, where F is the set of p features, is tested individually using a given classifier (in this study, we used LDA and K-nearest-neighbor (KNN) presented in Section 5). The feature that maximizes the classification accuracy (Acc) [40] of the training dataset is the first one selected. Then, each remaining feature is added to the set of previously selected features, and the same process is applied to find again the highest accuracy. The iterative feature selection ends once further feature addition yields no accuracy improvement. The proposed method is presented hereafter in Algorithm 1.
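The greedy procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the scoring function is a leave-one-out accuracy of a small Euclidean KNN written from scratch (with k = 3 because the toy set is tiny), and the dataset and all names are hypothetical rather than the authors' implementation.

```python
import math
from collections import Counter

def knn_loo_accuracy(X, y, feats, k=3):
    """Leave-one-out accuracy of a Euclidean k-NN restricted to `feats`."""
    correct = 0
    for q in range(len(X)):
        dists = sorted(
            (math.dist([X[q][f] for f in feats], [X[r][f] for f in feats]), y[r])
            for r in range(len(X)) if r != q
        )
        votes = Counter(label for _, label in dists[:k])
        correct += votes.most_common(1)[0][0] == y[q]
    return correct / len(X)

def sequential_forward(X, y, score):
    """Greedily rank all features; return (ordered features, accuracy curve)."""
    remaining = list(range(len(X[0])))
    selected, curve = [], []
    while remaining:
        # Try each remaining feature on top of the already selected ones.
        best = max(remaining, key=lambda f: score(X, y, selected + [f]))
        selected.append(best)
        remaining.remove(best)
        curve.append(score(X, y, selected))
    return selected, curve

# Hypothetical toy data: feature 0 separates the classes, features 1 and 2 do not.
X = [
    [0.0, 0.10, 5.0], [0.1, 0.90, -4.0], [0.2, 0.20, 3.0], [0.1, 0.80, -5.0],
    [1.0, 0.15, 4.0], [0.9, 0.85, -3.0], [1.1, 0.10, -4.5], [1.0, 0.90, 4.5],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

order, curve = sequential_forward(X, y, knn_loo_accuracy)
best_size = curve.index(max(curve)) + 1   # smallest prefix reaching the maximum
selected_features = order[:best_size]
```

Keeping the whole accuracy curve, rather than stopping at the first non-improving step, makes it possible to plot the accuracy against the number of features and to retain the prefix that reaches the maximum.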

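Algorithm 1 Sequential forward FS algorithm
Input: dataset $X_{n,p}$, whole set of features F, ground-truth labels Y
Output: Set of sorted features $F_p$
Initialization: $F_0 = \emptyset$
for j = 1 to p do
    for each remaining feature $f \in F \setminus F_{j-1}$ do
        Train the classifier on the features $F_{j-1} \cup \{f\}$ and compute its accuracy Acc(f)
    end for
    Select the best remaining feature: $f_j = \arg\max_f \mathrm{Acc}(f)$
    Update feature set: $F_j = F_{j-1} \cup \{f_j\}$
end for

New Proposed DNN-Based Feature Selection Method
This method, inspired by [41], is a filter-based approach that uses our proposed neural architecture. It uses the sum of the trained weights of the first layer neurons as a score of relevance of the input features.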
Deep learning uses a combination of agents (the neurons) to learn high-level non-linear relationships and correlations in the analyzed data. Such methods can tackle complicated problems and have become a promising approach for smart grid applications [42]. Here, we use a dense fully connected DNN architecture [43], where each neuron of the input layer is associated with a feature $f_i$, as presented in Figure 4. Our architecture is made of one input layer with p neurons and 2 hidden blocks, each containing one layer of 128 dense formal neurons including a Batch Normalization (BN) combined with a dropout layer. The output y of each neuron can be expressed as a function of an input vector $x \in \mathbb{R}^N$ such that:
$$y = g\left(w_0 + \sum_{i=1}^{N} w_i x_i\right)$$
where g is the neuron activation function, chosen as the Rectified Linear Unit (ReLU) defined by g(x) = max(0, x), except for the last output layer, which uses the softmax activation function [44]. The $w_i$ coefficients ($w_0$ being the bias) are the synaptic weights, which are learned during the training process. In our implementation, we choose $w_0 = 0$ and we use a BN of the output, which was shown to improve the training performance [45] when applied on the activation of each neuron. Thus, we have $\mathrm{BN}(y) = \frac{y - \mu}{\sigma}$, where µ and σ are the mean and the standard deviation computed from the processed training batch, whose size is equal to 64. Each dropout layer is defined to randomly discard 10% of the connected input layer and is used in order to reduce overfitting [46]. The last layer of our proposed DNN model applies a softmax activation function on each output neuron (each output neuron is associated with a prediction class), returning output values in [0, 1]. Each value corresponds to the probability of predicting the analyzed input as a member of class i. Hence, the final output label corresponds to the class index that maximizes the resulting probability of the last layer, i.e., $\hat{y} = \arg\max_i y_i$, $y_i$ being the output of neuron i of the last layer. The proposed DNN is designed for classification or regression problems; however, we propose here a new feature selection method based on the trained weights of each neuron. Following this idea, we consider the sum of the weights of all the neurons of the first layer linked to the same input feature as a score of relevance for this feature. The DNN is trained on the studied datasets using cross-entropy [43] as a loss function.
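The following sketch illustrates, in plain Python, the neuron output, the softmax decision and the weight-based relevance score described above. The weight matrix is a small hypothetical example rather than trained values, and the optional `use_abs` flag is an assumption added here as a variant (summing absolute weights prevents positive and negative weights from cancelling); the text itself only mentions the sum of the weights.

```python
import math

def neuron(x, w, w0=0.0):
    """Output of one formal neuron with a ReLU activation: max(0, w0 + sum w_i x_i)."""
    return max(0.0, w0 + sum(wi * xi for wi, xi in zip(w, x)))

def softmax(z):
    """Softmax of the last-layer activations (shifted for numerical stability)."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def relevance_scores(W, use_abs=False):
    """Score of input feature i: sum over first-layer neurons j of W[j][i].

    use_abs=True is a variant assumed for illustration, not stated in the text.
    """
    p = len(W[0])
    if use_abs:
        return [sum(abs(row[i]) for row in W) for i in range(p)]
    return [sum(row[i] for row in W) for i in range(p)]

# Hypothetical first-layer weights: 3 neurons (rows) x 4 input features (columns).
W = [
    [0.8,  0.1, -0.2, 0.0],
    [0.6, -0.1,  0.3, 0.1],
    [0.9,  0.0,  0.4, 0.1],
]
scores = relevance_scores(W)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

probs = softmax([2.0, 0.5, -1.0])
predicted_class = probs.index(max(probs))   # arg max over the output neurons
```

In a trained Keras model, the matrix `W` would typically be read from the kernel of the first dense layer; this is mentioned only as a pointer, since the way the authors extract the weights is not detailed in the text.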

Feature Selection Results
In the context of event-based NILM [6], the selection of features meeting the additivity criterion matters. Each FS method, among the five ones previously presented in Section 4.1, is then applied on the two datasets considered in this study (see Section 3.1), where each HEA is represented by a vector of p = 34 features meeting the additivity criterion (see Section 3.2). Datasets are centered and reduced in order to get an (n × p) matrix X with a zero mean and a unit standard deviation. This is obtained by applying the same operation as used for the Batch Normalization described above (subtracting the mean and dividing by the standard deviation of each feature). For the FS methods based on a feature relevance score, we sort the features by descending order of relevance, and then we select the subset located before the highest decrease in the relevance score. The results obtained on our own dataset and the PLAID dataset are reported in Tables 2 and 3. The following observations can be made according to the obtained results:
• For both datasets, some features such as P, $P_1$, $P_H$, Q, $Q_1$, or $Q_H$ are present regardless of the used FS method;
• For both datasets, the features selected by the DNN method for FS and by the PCA method are diversified in terms of harmonic orders;
• For both datasets, the features selected by the MI and LDA methods are related to odd-order harmonics, which describe the power supply structures included in most of the HEAs;
• For the sequential forward FS method, our experiments compare the results provided by the Euclidean-based KNN classifier (where the neighborhood parameter is set to K = 7) and by the LDA classifier. The number of nearest neighbors is set to 7 because it is the closest odd number to the number of instances in a class in the proposed dataset, so that for each neighborhood, there is a majority vote. The selected features are those that reach the maximum accuracy [47] reported in Tables 2 and 3.
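The two preprocessing and selection steps mentioned above (column-wise standardization, then keeping the features ranked before the largest drop in relevance score) can be sketched as follows. The function names and the numerical values are hypothetical and only serve as an illustration; the standardization assumes that no feature is constant.

```python
import math

def zscore_columns(X):
    """Center and reduce each column (feature) of an n x p matrix."""
    n, p = len(X), len(X[0])
    out = [[0.0] * p for _ in range(n)]
    for j in range(p):
        col = [row[j] for row in X]
        mu = sum(col) / n
        sigma = math.sqrt(sum((c - mu) ** 2 for c in col) / n)  # assumes sigma > 0
        for r in range(n):
            out[r][j] = (X[r][j] - mu) / sigma
    return out

def select_before_largest_drop(scores):
    """Indices of the features ranked before the largest drop in sorted scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [scores[i] for i in order]
    drops = [ranked[j] - ranked[j + 1] for j in range(len(ranked) - 1)]
    cut = drops.index(max(drops)) + 1
    return order[:cut]

# Hypothetical relevance scores for six features.
scores = [0.91, 0.15, 0.88, 0.10, 0.80, 0.12]
kept = select_before_largest_drop(scores)

# Hypothetical 3 x 2 feature matrix.
Z = zscore_columns([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
```

With these scores, the largest gap lies between 0.80 and 0.15, so the three highest-ranked features are kept.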
For our dataset, only 12 features are needed to maximize the KNN classifier accuracy and 18 features maximize the accuracy of the LDA classifier (see Figure 5). For the PLAID dataset, 25 features are needed to maximize the KNN classifier accuracy and 33 features maximize the LDA classifier accuracy (see Figure 6). The low accuracy reached by the LDA classifier on the PLAID dataset can be explained by the unbalanced training sets (where one or several classes outnumber the other classes) [48]. Indeed, LDA is known not to perform well in this setting since classification is generally biased towards the majority classes. It is observed that the accuracies reach a plateau less rapidly for the LDA classifier than for the KNN. Indeed, the KNN is known to be affected by the overfitting phenomenon [49,50], which increases the distance between individuals of the same class and decreases the accuracy. The KNN classifier cannot account for feature relevance and is more sensitive to FS than other classifiers such as LDA, which can handle irrelevant features [51].

Investigated Classification Methods
We investigate four classification methods that can be combined with the different feature subsets provided by the FS methods presented in Section 4.

• The KNN method is widely used by the NILM community for HEAs' identification [12,52]. We use the Euclidean distance and K = 7, which corresponds to the closest odd number to the number of instances in a class in the proposed dataset. Hence, the predicted class corresponds to the most represented one in the neighborhood through majority voting.
• The LDA method estimates the optimal linear combination between features using the eigenvectors of the projection matrix $(B + S)^{-1} B$ of dimension p × p, where B and S are the between-class and within-class covariance matrices.
• The proposed DNN classification method uses the same fully connected DNN architecture as presented in Section 4.1.3 for FS. Our implementation is based on TensorFlow/Keras (https://keras.io/). The training is completed with a batch size equal to 64 and a maximal number of 350 epochs (one epoch is reached each time the whole training dataset is processed once). The optimization is completed using the RMSprop algorithm [43] with a learning rate set to $\eta = 10^{-3}$.
• The Random Forest (RF) classification method creates a set of decision trees and aggregates the votes from the decision trees to predict the class of the tested individuals [53]. The number of trees was set to 5 after experimental tuning to get the best results.
Other classification methods such as Support Vector Machine (SVM) [8] or Adaboost [54] were not evaluated in this study because they have a very high computational cost, and our preliminary results did not reveal a significant accuracy improvement in comparison to the four investigated methods.

Classification Test Procedure
The evaluation of the classification performances uses an 8-fold cross validation methodology, which randomly splits the dataset into eight equal partitions, which are individually tested using the seven remaining partitions for training. This process is repeated eight times, until all subsamples have been used for cross validation [55]. The number of folds was arbitrarily set to 8 so that 12.5% of the studied dataset corresponds to the test set and the remaining 87.5% to the training set. The final metrics for classification performances were then computed by merging the results of each partition.
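A minimal sketch of such an 8-fold partition is given below. The shuffling seed, the function name and the way the folds are built are assumptions for illustration; the dataset size of 488 corresponds to the proposed dataset, for which each fold contains 61 individuals (12.5%).

```python
import random

def k_fold_splits(n, k=8, seed=0):
    """Return k (train_indices, test_indices) pairs covering range(n) once each."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)           # random assignment to the folds
    folds = [idx[f::k] for f in range(k)]      # k disjoint partitions
    return [
        ([i for g in range(k) if g != f for i in folds[g]], folds[f])
        for f in range(k)
    ]

splits = k_fold_splits(488)
fold_sizes = [(len(train), len(test)) for train, test in splits]
```

Merging the results of each partition then amounts to accumulating the predictions (or the confusion matrices) of the eight test folds before computing the final metrics.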
Our results are expressed in terms of the classical classification metrics used for NILM, namely Accuracy (Acc), F-Measure ($F_M$), Recall (Rec) and Precision (Pre) [40,47], which are deduced from the computed confusion matrices. Thus, if we denote C a resulting confusion matrix of dimension I × I (I being the considered number of classes), where $C_{ij}$ corresponds to the number of individuals of the true class i (row) predicted as being in the class j (column), the Accuracy and the Precision and Recall of a class i are computed as:
$$\mathrm{Acc} = \frac{1}{n}\sum_{i=1}^{I} C_{ii}, \qquad \mathrm{Pre}_i = \frac{C_{ii}}{\sum_{j=1}^{I} C_{ji}}, \qquad \mathrm{Rec}_i = \frac{C_{ii}}{\sum_{j=1}^{I} C_{ij}}$$
where $n = \sum_i \sum_j C_{ij}$ is the total number of individuals. The F-measure (also called F-score or $F_1$-score) is the harmonic mean of the Precision and Recall, computed as:
$$F_M = \frac{2 \cdot \mathrm{Pre} \cdot \mathrm{Rec}}{\mathrm{Pre} + \mathrm{Rec}}$$
We also present the ratio Acc/# features, which measures the efficiency and allows us to judge the right combination of FS method and classifier. Indeed, the goal is to achieve the highest accuracy for a small number of descriptors. A large ratio means that a high accuracy is reached with a small number of selected features. Each method is evaluated on both datasets.
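These metrics can be derived from a confusion matrix as sketched below. The macro-averaging of the class-wise Precision and Recall and the small 2 × 2 matrix are assumptions made for illustration; the two ratio values reuse the accuracies and feature counts quoted for the PLAID dataset (94.58% with 2 features and 99.13% with 20 features).

```python
def classification_metrics(C):
    """Accuracy, macro-averaged Precision/Recall and F-measure from C[true][pred]."""
    I = len(C)
    n = sum(sum(row) for row in C)
    acc = sum(C[i][i] for i in range(I)) / n
    row_sums = [sum(C[i]) for i in range(I)]                       # true class counts
    col_sums = [sum(C[r][j] for r in range(I)) for j in range(I)]  # predicted counts
    rec = [C[i][i] / row_sums[i] if row_sums[i] else 0.0 for i in range(I)]
    pre = [C[i][i] / col_sums[i] if col_sums[i] else 0.0 for i in range(I)]
    pre_m, rec_m = sum(pre) / I, sum(rec) / I
    f_m = 2 * pre_m * rec_m / (pre_m + rec_m) if pre_m + rec_m else 0.0
    return {"Acc": acc, "Pre": pre_m, "Rec": rec_m, "FM": f_m}

def efficiency_ratio(acc_percent, n_features):
    """Ratio Acc / # features, with the accuracy expressed in percent."""
    return acc_percent / n_features

# Hypothetical confusion matrix for two classes (rows: true, columns: predicted).
C = [[8, 2],
     [1, 9]]
metrics = classification_metrics(C)

ratio_pq = efficiency_ratio(94.58, 2)    # KNN with the (P, Q) features
ratio_mi = efficiency_ratio(99.13, 20)   # KNN with the 20 MI-selected features
```

Rounded to two decimals, the two ratios reproduce the values 47.29 and 4.96 reported for the PLAID dataset.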

Self-Database Results
Sections 5.3.1 and 5.3.2 present the classification success rate as a function of the provided subset of selected features used to describe each HEA. Overall results using our new proposed dataset are summarized in Table 4, for which the details are provided in Table 5. The results obtained using the PLAID dataset are summarized in Table 6, for which the details are presented in Table 7. The confusion matrices of the best methods are also presented in Appendix A. We also compare the results with the active and reactive powers (P, Q), which are the usual features proposed in the NILM literature [56]. The results show the clear improvement provided by the FS methods on each evaluated dataset. In addition, to study how noise influences our training, we have trained the studied classifiers using the selected features and Data Augmentation (DA) [57]. For this, both studied datasets are 100% augmented by adding a white Gaussian noise to the current signals of each HEA class, to obtain a Signal to Noise Ratio (SNR) of 20 dB (defined as $10 \log_{10}\left(\|x\|^2 / \|b\|^2\right)$, with x the original signal and b the noise signal). This leads to new augmented datasets containing noisy current signals, from which we compute the 34 features meeting the additive criterion. Then, for each original dataset, we apply cross-validation for classification evaluation using the features selected in Section 4.2 and the classification methods used in Section 5. The experiment setup also consists of an 8-fold cross-validation experiment, where the original datasets were partitioned into eight. For each of the 8 simulations, 20% of each HEA class of the noisy generated dataset was added to the training set. No noise was added to the test part, which was used to assess the generalization performance in each of the eight simulations. The success rate of a classification method for a specific subset of selected features was obtained by calculating the average scores of the performance metrics used in Section 5. All the results presented in the following tables are ranked in descending order of the best $F_M$ scores.
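The noise-based augmentation can be sketched as follows: a white Gaussian noise is drawn and rescaled so that the SNR, as defined above, is exactly 20 dB for the considered signal. The function name, the seed and the synthetic current waveform are assumptions for illustration only.

```python
import math
import random

def add_white_noise(x, snr_db=20.0, seed=0):
    """Return (noisy signal, noise) with 10*log10(||x||^2 / ||b||^2) = snr_db."""
    rng = random.Random(seed)
    b = [rng.gauss(0.0, 1.0) for _ in x]          # unit-variance white noise
    energy_x = sum(s * s for s in x)
    energy_b = sum(s * s for s in b)
    # Rescale the noise so that its energy matches the target SNR.
    scale = math.sqrt(energy_x / (energy_b * 10 ** (snr_db / 10)))
    b = [scale * s for s in b]
    return [xi + bi for xi, bi in zip(x, b)], b

# Synthetic one-period current waveform with M = 1000 samples.
M = 1000
current = [2.0 * math.sin(2 * math.pi * m / M) for m in range(M)]
noisy_current, noise = add_white_noise(current, snr_db=20.0)

measured_snr = 10 * math.log10(
    sum(s * s for s in current) / sum(s * s for s in noise)
)
```

The additive features are then recomputed from `noisy_current`, and only the training folds receive such noisy individuals, the test folds being left unchanged.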

Proposed Dataset
Table 5 shows that the RF classifier outperforms all the other classifiers. This is consistent with the findings of X. Wu et al. in [53], where an accuracy of 98.0% was found using eight steady-state features (which do not meet the additive criterion). These authors showed the advantages of the RF classifier over KNN. In our experiment, the RF classifier obtains the best recognition rate, equal to 99.18%, with the features selected by the MI method and DA, and an accuracy of 98.15% when combined with the LDA feature selection method without DA. This leads to a ratio (Acc/# feat.) of 7.01 and a very low computational time. It can be observed that the classifier performances are also better when data is artificially augmented during training. The confusion matrix depicted in Figure A1 shows that 20% of the tested individuals that belong to class index 4 "Fan-Coala Level 1" are classified as class index 5 "Fan-Coala Level 2" and 10% of the tested individuals belonging to class index 1 "Electric mixer Moulinex" are classified as class index 7 "Fan-Coala Level 3".
Interestingly, the performance of the DNN method is improved by a suitable choice of relevant descriptors, as confirmed by the usage of the MI and the DNN feature selection methods. Indeed, the DNN classifier usually obtains the best results when using all the considered features. This point is of interest to develop new strategies to improve the training efficiency of DNN when used on a small training dataset. Finally, it can be observed that the results obtained with the DNN classifier are lower than those obtained for the PLAID dataset in Table 7. This can be explained by the small size of the training dataset. Indeed, DNN is known to require a large amount of data to be trained efficiently. It can also be observed that for most of the classifiers, data augmentation by the addition of white Gaussian noise can significantly improve the results. Table 4 depicts the average $F_M$ scores obtained for each FS method considering all the studied classifiers. It can be seen that the odd-order harmonic features selected by the MI method are the ones for which the best average $F_M$ score is reached.
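The data augmentation mentioned above, the addition of white Gaussian noise to training samples, can be illustrated with a minimal sketch. The noise level, the number of noisy copies, and the choice of adding noise directly to feature vectors are assumptions made here for illustration; this section does not specify them. The function name `augment_with_awgn` and the toy feature values are hypothetical.

```python
import math
import random


def augment_with_awgn(samples, labels, snr_db=30.0, copies=2, seed=0):
    """Return the original samples plus `copies` noisy versions of each one.

    The noise is zero-mean white Gaussian. Its standard deviation is set per
    sample so that (mean squared feature value) / (noise variance) matches
    the requested signal-to-noise ratio `snr_db` (an assumed parameter).
    """
    rng = random.Random(seed)
    out_x, out_y = list(samples), list(labels)
    for x, y in zip(samples, labels):
        power = sum(v * v for v in x) / len(x)            # mean squared value
        sigma = math.sqrt(power / (10 ** (snr_db / 10)))  # noise std for target SNR
        for _ in range(copies):
            out_x.append([v + rng.gauss(0.0, sigma) for v in x])
            out_y.append(y)  # a noisy copy keeps the label of its source sample
    return out_x, out_y


# Toy feature vectors (hypothetical values, not taken from either dataset).
X = [[120.0, 35.0, 4.2], [60.0, 10.0, 1.1]]
y = ["fan", "mixer"]
X_aug, y_aug = augment_with_awgn(X, y, snr_db=30.0, copies=3)
print(len(X), "->", len(X_aug), "training samples")
```

Only the training split should be augmented in this way, so that the evaluation still reflects measured data.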

PLAID Dataset
In Table 7, we can notice the improvement brought by FS approaches combined with the KNN classifier. Indeed, the obtained ratios (Acc/# feat.) are all greater, which indicates that the KNN classifier needs a small number of features to reach excellent accuracy. The best ratio (Acc/# feat.) of 47.29 is obtained when using KNN with the P, Q features, which are usually considered for HEAs' identification in the NILM literature, but the reached accuracy is only 94.58%. The best accuracy of 99.19% is reached for only 25 features selected by the KNN-based sequential forward FS method with and without data augmentation (according to the confusion matrix depicted in Figure A2, 10% of the tested individuals belonging to class index 7 "Incandescent light bulb-Electrix-soft white" are classified as class index 8 "Incandescent light bulb-Philips Duramax"). However, an accuracy of 99.13% with a very good ratio (Acc/# feat.) of 4.96 is obtained when considering the subset of 20 features selected by the MI method with the KNN classifier. As we seek to achieve the best identification rates (close to 99% accuracy) for the smallest number of selected features, this combination of classifier and selected features is a good trade-off. This allows the KNN classifier to slightly exceed the performance reached by the RF classifier with the 20 features selected by the MI method (98.91% accuracy).
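The ratio (Acc/# feat.) used in this comparison is simply the accuracy in percent divided by the number of selected features. The short sketch below recomputes it for the three KNN configurations quoted in the text; the helper name `acc_per_feature` is hypothetical, and the value for the 25-feature subset is derived here from the quoted accuracy rather than read from Table 7.

```python
def acc_per_feature(accuracy_pct, n_features):
    """Accuracy (in percent) divided by the number of features used."""
    if n_features <= 0:
        raise ValueError("n_features must be positive")
    return accuracy_pct / n_features


# (configuration, accuracy in %, number of features) as quoted in the text.
configs = [
    ("KNN + P, Q", 94.58, 2),
    ("KNN + KNN-based seq. forward FS", 99.19, 25),
    ("KNN + MI", 99.13, 20),
]

ratios = {name: round(acc_per_feature(acc, n), 2) for name, acc, n in configs}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}")
# KNN + P, Q: 47.29
# KNN + KNN-based seq. forward FS: 3.97
# KNN + MI: 4.96
```

The ratio favors very small subsets, which is why the text reads it together with the accuracy when looking for a trade-off.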
In contrast to our dataset, the high number of individuals (n = 36,720) makes it easier for the DNN to learn the features from the dataset and to determine that a large number of features are reliable and should be used. Our results outperform the previous ones reported in [58] for the PLAID dataset using the VI trajectory features, where an F-measure of $F_M$ ≈ 77% is given. This validates the efficiency of our approach, which consists in carefully selecting features before classification. Combining the RF classifier with features selected by the KNN-based sequential forward FS method also allows us to exceed the performance obtained in [18], where the authors also used an RF classifier applied on the PLAID dataset using an optimal subset of 20 steady-state and transient features (selected through a systematic feature elimination process) and were able to reach an accuracy of 93.2%. The performances obtained by the LDA are the worst ones. As mentioned, this can be explained by the fact that the LDA classifier struggles when the number of individuals is very large, resulting in individuals overlapping between classes. In almost all cases, data augmentation improves or only slightly modifies the classification performance for all methods (except for the KNN classifier). Finally, Table 6 depicts the average $F_M$ scores obtained for each FS method considering all the studied classifiers. The same observation as the one made for the proposed dataset can be made: the odd-order harmonic features selected by the MI method are the ones for which the best average $F_M$ score is reached.
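To make the two kinds of figures quoted in this section concrete, the sketch below computes per-class misclassification percentages (the "x% of class i classified as class j" statements read from the confusion matrices) and a macro-averaged F-measure from a confusion matrix. It assumes that $F_M$ denotes the unweighted mean of the per-class F1 scores; the definition used for the tables is not restated in this section. The 3-class matrix is a toy example, not data from either dataset.

```python
def row_percentages(conf):
    """Row-normalize a confusion matrix (rows = true class) into percentages."""
    return [[100.0 * v / sum(row) if sum(row) else 0.0 for v in row] for row in conf]


def macro_f_measure(conf):
    """Unweighted mean of per-class F1 scores (assumed definition of F_M)."""
    k = len(conf)
    scores = []
    for c in range(k):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                        # true class c, predicted otherwise
        fp = sum(conf[r][c] for r in range(k)) - tp   # predicted c, true class differs
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / k


# Toy confusion matrix with 10 tested individuals per class.
conf = [
    [8, 2, 0],
    [0, 10, 0],
    [1, 0, 9],
]
pct = row_percentages(conf)
f_m = macro_f_measure(conf)
print(f"{pct[0][1]:.0f}% of class 0 individuals are classified as class 1")
print(f"macro F-measure: {f_m:.4f}")
```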

Transfer Learning Results
Now, we evaluate whether the features selected from one dataset can be transferred to another dataset. For this, a cross-learning strategy is adopted for both studied datasets. The goal is to study to what extent the features selected for a particular dataset are invariant across HEAs and can be used to get good classification performance in another dataset. Indeed, as several common features are selected from each dataset separately, we assume that they convey common information that can be transferred from one dataset to another. This approach allows us to reduce the number of training samples from unknown HEAs. First, the subsets of features selected from the proposed dataset are tested using the classifiers presented previously and applied to the PLAID dataset. The results are presented in Table 8. Very good results are obtained with the KNN, the DNN, and the RF classifiers with the different subsets of selected features. The best ones are obtained with the KNN classifier, particularly when combined with the MI feature selection method, which allows us to obtain the best performance with only 20 features. Figure A3 shows the tested individuals that were misclassified, such as the following: 10% of the individuals belonging to class index 7 "Incandescent light bulb-Electrix-soft white" were classified as class index 8 "Incandescent light bulb-Philips Duramax", and 10% of the individuals in class index 39 "Laptop-HP-C24" were classified as class index 40 "Laptop-Apple macbook air". Second, the subsets of features selected from the PLAID dataset were tested using the classifiers presented previously and applied to the proposed dataset. The results are presented in Table 9.
Very good results were obtained with the RF and the LDA classifiers with the different subsets of selected features. An accuracy of 98.77% was obtained when combining the RF classifier with the subset of features selected with the LDA-based sequential forward FS method. The confusion matrix in Figure A4 shows that 50% of the individuals belonging to class index 5 "Fan-Coala Level 2" were classified as class index 4 "Fan-Coala Level 1", and 50% of the individuals that belong to class index 53 "Washing machine-LG state (b)" were classified as class index 58 "Washing machine-LG state (g)". Through cross transfer learning, we show that in both situations, the features selected from another dataset improve the classification rate. There is therefore a transfer of knowledge on the discriminating "power" of the selected features.
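The cross-learning strategy described above can be sketched as follows: a subset of feature indices chosen on a source dataset is reused, unchanged, to train and test a classifier on a target dataset. The sketch assumes that both datasets store the same features in the same column order, and it uses a plain 1-nearest-neighbour rule as a stand-in for the classifiers studied in the paper. All names and numbers are hypothetical toy values, not results from Table 8 or Table 9.

```python
import math


def nn_predict(train_X, train_y, x, feature_idx):
    """1-nearest-neighbour prediction using only the columns in feature_idx."""
    best_label, best_dist = None, math.inf
    for row, label in zip(train_X, train_y):
        dist = math.sqrt(sum((row[j] - x[j]) ** 2 for j in feature_idx))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def accuracy(train_X, train_y, test_X, test_y, feature_idx):
    """Fraction of test individuals assigned to their true class."""
    hits = sum(
        nn_predict(train_X, train_y, x, feature_idx) == t
        for x, t in zip(test_X, test_y)
    )
    return hits / len(test_y)


# Feature indices assumed to have been selected on the *source* dataset.
transferred_subset = [0, 2]

# Toy *target* dataset: columns 0 and 2 separate the classes, column 1 does not.
train_X = [[1.0, 50.0, 0.1], [1.2, -40.0, 0.2], [5.0, 45.0, 2.0], [5.3, -42.0, 2.1]]
train_y = ["a", "a", "b", "b"]
test_X = [[1.1, 44.0, 0.15], [5.1, -39.0, 2.05]]
test_y = ["a", "b"]

acc_all = accuracy(train_X, train_y, test_X, test_y, [0, 1, 2])
acc_transferred = accuracy(train_X, train_y, test_X, test_y, transferred_subset)
print(f"all features: {acc_all:.2f}, transferred subset: {acc_transferred:.2f}")
```

On this toy data the transferred subset removes the uninformative column and the accuracy rises from 0.50 to 1.00, which mirrors, in a very simplified way, the comparison made between the transferred subsets and the full feature set.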

Conclusions
In this paper, we addressed one of the main challenges of the NILM problem, namely HEA identification from electrical measurements. To this end, we covered a broad range of existing supervised FS and classification methods to show the efficiency of this approach for identifying distinct HEAs using a suitable set of relevant features.
As a first contribution, in addition to a novel publicly available dataset, we introduced a comparative evaluation of a large number of methods in an event-based NILM context, involving all the possible combinations of these techniques applied on two distinct annotated HEA datasets, where each HEA signature, made of relevant features, is extracted from HEAs of different categories and manufacturers.
Second, thanks to our proposed data augmentation and feature selection, we have improved the best HEA identification results on the PLAID dataset by obtaining a classification rate above 99%. To our knowledge, this result outperforms the best available results obtained with the PLAID dataset using state-of-the-art methods. Furthermore, validating our solution using two datasets has helped in (i) showing its high performance despite the data collection procedures being different and (ii) demonstrating its capability to give very good results even if HEAs are from different categories and manufacturers.
Moreover, our results show that the number of extracted features can be significantly reduced while still performing HEA recognition efficiently. A cross transfer learning strategy was adopted by using the subsets of features selected by FS approaches applied to our dataset to classify HEAs of the PLAID dataset, and conversely. Very good results were obtained, which confirms that the features selected with one dataset can be transferred to another one.
Several conclusions can be safely drawn from this study. First, the selected electrical features can be justified by the power supply topologies included in an HEA (the front-end circuitry that connects them to the power grid), which affect their current waveforms. Each subset of selected features contains for the most part odd-order harmonics related to power components. The performance of a designed classifier can be improved by the use of an optimal subset of features. Some features, such as $P$, $P_1$, $P_H$, $Q$, $Q_1$, and $Q_H$, are retrieved in most of the subsets of selected features, which shows their importance for HEAs' identification. Secondly, several observations can be made according to the chosen classifier: DNN requires a large amount of training data and works poorly on the small proposed dataset (and works better on the PLAID dataset); LDA is sensitive to unbalanced datasets, and it works less well on PLAID; the KNN method is efficient and sensitive to the choice of descriptors; RF was a state-of-the-art method in automatic classification before the arrival of deep learning, is robust, and gives good results in most cases. In addition, data augmentation improves the classification performance in almost all cases for all methods (except for PLAID with the KNN classifier). Finally, features selected through FS methods in a dataset could be used to correctly identify unknown HEAs from another dataset using, for example, an unsupervised statistical modeling approach [6]. This addresses one of the biggest NILM challenges: generalization.

Figure 4. Diagram of the proposed deep neural network architecture with L hidden layers.

Figure 5. Classification success rate as a function of the number of features on our dataset using KNN and LDA classifiers with the subset of p = 34 features meeting the additive criterion.

Figure 6. Classification success rate as a function of the number of features on the PLAID dataset using KNN and LDA classifiers with the subset of p = 34 features meeting the additive criterion.
where $V_k$ are the covariance matrices built from the corresponding $n_k$ individuals ($n_k$ being the number of individuals of class $k \in \{1, \ldots, K\}$), and $g$ and $g_k$ are the mean over all the individuals of the whole dataset $X$ and the mean over all the individuals in class $k$, respectively. Then, the tested individuals are projected into the discriminative linear space before being assigned to the class whose centroid is the closest in terms of the Euclidean distance.

Figure A1. Confusion matrix from the proposed dataset obtained with the RF classifier using the subset of 20 features selected with the MI FS method.

Figure A2. Confusion matrix from the PLAID dataset obtained with the KNN classifier using the subset of 25 features selected with the KNN-based seq. forw. FS method.

Figure A3. Confusion matrix from the PLAID dataset obtained with the KNN classifier using the subset of 20 features selected with the MI FS method from the proposed dataset.

Figure A4. Confusion matrix from the proposed dataset obtained with the RF classifier using the subset of 33 features selected with the LDA-based seq. forw. FS method from the PLAID dataset.

Table listing the considered categories of HEAs (columns: Category of HEA, Number of HEAs, Number of power levels).

Table 3. Results of the feature selection methods applied on the PLAID dataset.

Table 4. Average $F_M$ scores obtained for the different feature subsets selected for the proposed dataset (61 classes, n = 488) over all considered classifiers. Results are sorted in descending order of F-measure.

Table 5. Performance (in percentage) of the classification methods applied on the proposed dataset using different feature subsets from the additive feature set (61 classes, n = 488). Results are sorted in descending order of F-measure.

Table 6. Average $F_M$ scores obtained for the different feature subsets selected for the PLAID dataset (71 classes, n = 36,720) over all considered classifiers. Results are sorted in descending order of F-measure.

Table 7. Performance (in percentage) of the classification methods applied on the PLAID dataset using different feature subsets from the additive feature set (71 classes, n = 36,720). Results are sorted in descending order of F-measure.

Table 8. Performance (in percentage) of the classification methods applied on the PLAID dataset using the different feature subsets selected in the proposed dataset. Results are sorted in descending order of F-measure.

Table 9. Performance (in percentage) of the classification methods applied on the proposed dataset using the different feature subsets selected in the PLAID dataset. Results are sorted in descending order of F-measure.