Selection of Features Based on Electric Power Quantities for Non-Intrusive Load Monitoring

Non-intrusive load monitoring (NILM) is a process of determining the operating states and the energy consumption of single electric devices using a single energy meter providing aggregate load measurements. Due to the large spread of power electronic-based and nonlinear devices connected to the network, the time signals of both voltage and current are typically non-sinusoidal. The effectiveness of a NILM algorithm strongly depends on determining a set of discriminative features. In this paper, voltage and current signals were combined to define, according to the definitions provided in Standard IEEE 1459, different power quantities that can be used to distinguish different types of appliances. Multi-layer perceptron (MLP) classifiers were trained to solve the appliance detection problem as a multi-class event classification problem, varying the electric features given as input. This allowed the selection of an optimal set of features guaranteeing good classification performance in identifying typical electric loads.


Introduction
Home energy management systems (HEMS) measure the energy consumption data of the household in real time for monitoring and optimizing the energy usage. Traditional intrusive monitoring systems collect fine-grained electricity data, but impose a higher initial cost on homeowners since several smart meters must be plugged into the appliances. Non-intrusive load monitoring (NILM) systems make it possible to reduce hardware and maintenance costs since only one household-level smart meter is needed. As a consequence, the energy consumption at appliance level is not directly available. NILM identifies a set of techniques that can disaggregate the aggregate power usage into the individual appliances that are operating and estimate the electricity consumption of each of them.
NILM has been an active area of research in recent years. One of the earliest studies on NILM systems was developed by Schweppe and Hart. In [1] the authors classify different appliances by applying a clustering algorithm in the active-reactive power feature plane. Since the publication of this algorithm in 1992, the field of load disaggregation has seen a tremendous amount of further research and novel approaches; see [2,3] for recent reviews. Furthermore, once the load has been disaggregated through a proper NILM technique, the disaggregated data can be used to improve the performance of other important functionalities in the energy management system. This holds in particular for load demand forecasting [4], which may play a key role in the active management of modern distribution grids [3,5]. As an example, the method proposed in [6] first disaggregates the overall power into single appliances, so that the consumption of each appliance can be forecast separately, and then forecasts the total power of an aggregated set of houses by aggregating the single predicted powers. These NILM-based methods improve forecasting accuracy with respect to traditional methods that operate directly on the aggregate signal.
The most promising approaches recently presented in the literature for load disaggregation are based on machine learning techniques. A machine learning system classifies the switching events, associating them with a particular appliance by processing significant features extracted from the measured electrical parameters. Solutions based on machine learning range from classic supervised machine-learning algorithms (e.g., support vector machines, artificial neural networks, random forests) to supervised statistical learning methods (e.g., K-nearest neighbours and Bayes classifiers) and unsupervised methods (e.g., hidden Markov models and their variants). Recently, deep learning (DL) methods have also been employed and seem promising for the most challenging problem posed by the consumption profiles of multi-state appliances [7][8][9][10].
Commercial household smart meters usually provide the active and reactive power quantities measured at low frequency, typically on the order of a few values per second. As a consequence, these are the most widely used features and measurement rates reported in the NILM-related literature. Since different appliances may show overlapping power values, performance is often unsatisfactory. To overcome this issue, different features can be considered. Instantaneous samples acquired at high frequency (up to tens of kilohertz) have been proposed to extract information on the transient behavior of the supplied loads, but the drawbacks of these approaches are the increase in computational complexity and the real-time constraints associated with their implementation [3]. On the other hand, considering that the use of nonlinear appliances has increased continuously during the last decades, the frequency content of the involved electric signals can be exploited to improve the accuracy of the load identification process [11]. This implies that the sampling rate used to acquire the voltage and current signals must be sufficiently high to allow the frequency components of interest to be assessed correctly. However, the measurement rate used in the disaggregation algorithm (i.e., the number of measured values per time unit, where each value is often obtained by processing a large number of instantaneous voltage and current samples) can still be on the order of a few values per second, thus greatly improving the feasibility of these approaches with respect to the drawbacks mentioned above for the high-frequency solutions.
Based on these considerations, to take into account the presence of non-sinusoidal power draws, several authors have recently resorted to additional information, extracting detailed features such as the V-I trajectory, harmonics of different orders [12], the harmonic phase [13,14], the percentage total harmonic distortion (THD) [15], the active and non-active current components [16], and spectrograms of high-frequency current data [9]. The choice of the optimal set of features among them is crucial to maximize NILM effectiveness.
The literature contains few contributions in which different features are applied simultaneously. In [17], six different signal signatures, namely the current, the current harmonics, the real and reactive power, geometrical properties of the V-I curve and the instantaneous power, were applied to ten appliances.
Gao et al. [19] apply real and reactive power, harmonics, quantized current and voltage waveforms and the V-I binary image, obtaining better results than those obtained with features belonging to a single category.
In [20] the authors first explored a systematic approach for optimally combining features, analyzing the plug load appliance identification dataset (PLAID) [21] and retaining 58 features. They then performed several training sessions of the algorithm and of the feature ranking, and finally selected 20 features, eliminating all of the voltage harmonics and all of the power-based features from the transient state. The selected features included the total harmonic distortion (THD) of the current, the energy of the detail wavelet coefficients at the 2nd scale, the area enclosed by the V-I trajectory, the current root mean square, the 3rd, 5th, 9th and 11th current harmonic coefficients, the THD of the non-active current and the normalized real power.
In [22] the authors apply a complex hybrid energy disaggregation approach and propose a monitoring system consisting of a composite dual sliding window-based cumulative sum control chart (DSWC) to detect transient events and a multi-layer Hungarian matching process to match load events. The detected events are converted into a bipartite graph optimization problem, and a supervised clustering procedure is used to assign the corresponding category. Input data consist of the rms current and voltage, the active power, the reactive power and the 15th harmonic of the current.
This paper focuses on the choice of a suitable set of features for a non-intrusive load monitoring system, providing the following main contributions:
• different power terms, accounting for possible distortion in the voltage and current signals, have been considered as possible features;
• the power terms have been evaluated in different time intervals after the detected event;
• a systematic approach to rapidly obtain the lowest possible number of features preserving model performance has been explored, applying two feature selection procedures.
The most effective feature sets are given as input to a multi-layer perceptron (MLP) classifier trained to identify the ON/OFF events of each device. The procedure is applied to the Building-Level fUlly-labeled dataset for Electricity Disaggregation (BLUED) [23]. BLUED is a large dataset consisting of real voltage and current measurements for a single-family residence in the United States, sampled at 12 kHz for a whole week. BLUED is publicly available and has been used as a benchmark dataset in several recent papers on NILM [22,24].

Feature Definition
Nowadays, due to the large spread of power electronic-based and nonlinear devices connected to the network, the time signals of both voltage and current, respectively v(t) and i(t), are typically non-sinusoidal. In particular, under steady-state conditions, these signals can be modelled by considering two components:
• the fundamental component, typically denoted with the subscript 1, which is a sinusoidal signal at the system frequency f1:

v1(t) = √2 V1 sin(ωt + α1),   i1(t) = √2 I1 sin(ωt + β1)

where ω = 2π f1, while V1, α1 and I1, β1 are the root mean square (rms) value and the initial phase angle of voltage and current, respectively;
• the remaining term, typically denoted with the subscript H, which is obtained by summing the possible dc component and the sinusoids of all the harmonics:

vH(t) = V0 + √2 Σh≠1 Vh sin(hωt + αh),   iH(t) = I0 + √2 Σh≠1 Ih sin(hωt + βh)

where Vh, αh, Ih, and βh are the rms value and the initial phase angle of the h-th order harmonic component of voltage and current.
Consequently, the time signals of both voltage and current can be obtained as:

v(t) = v1(t) + vH(t),   i(t) = i1(t) + iH(t)

With reference to the analysis of a single-phase system, and according to Standard IEEE 1459 [25], it is possible to combine the two components of both voltage and current signals to define a number of different rms and power quantities. Among all the quantities proposed in the standard [25], the focus of this work is on the 4 power-based quantities shown in Table 1, where θh is the phase displacement between the h-th harmonic components of voltage and current. Parameters related to voltage were excluded because voltage variations are minimal, while connecting and disconnecting loads mainly affects the current draws and power consumption.

Table 1. Definitions of the considered power quantities under non-sinusoidal conditions, according to [25].

Name | Measurement Unit | Definition
Active power P | W | P = Σh Vh Ih cos θh
Fundamental active power P1 | W | P1 = V1 I1 cos θ1
Fundamental reactive power Q1 | var | Q1 = V1 I1 sin θ1
Harmonic active power PH | W | PH = P − P1

According to their definitions, the power terms shown in Table 1 include information on the fundamental components (namely the active, fundamental active and fundamental reactive power) and on the harmonic components (namely the harmonic active power). These quantities will be considered as possible features in the proposed NILM approach and, for brevity, will be referred to as parameters in the following. The variation of these parameters due to an event will be used as the source of information to determine which load caused the event.
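For illustration, a minimal Python sketch of how the quantities in Table 1 can be computed from the rms harmonic components is given below; the function and variable names are illustrative and the dc component is omitted in this simplified version.

```python
import numpy as np

def power_quantities(V, I, alpha, beta):
    """Power terms of Table 1 from rms harmonic components (index 0 = fundamental, dc omitted).

    V, I        : rms values of the voltage and current harmonic components
    alpha, beta : initial phase angles (rad) of the voltage and current harmonics
    """
    V, I = np.asarray(V, float), np.asarray(I, float)
    theta = np.asarray(alpha, float) - np.asarray(beta, float)  # phase displacement per harmonic
    terms = V * I * np.cos(theta)        # active power contribution of each harmonic
    P = terms.sum()                      # active power (W)
    P1 = terms[0]                        # fundamental active power (W)
    Q1 = V[0] * I[0] * np.sin(theta[0])  # fundamental reactive power (var)
    PH = P - P1                          # harmonic active power (W)
    return P, P1, Q1, PH

# Purely sinusoidal example: only P1 and Q1 are non-zero.
print(power_quantities([230.0], [5.0], [0.0], [-0.5]))
```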

Database Creation
When testing the performance of a NILM technique, it is important to consider real data. Due to the challenges in creating such datasets, mainly related to the time required and the high costs involved, researchers typically refer to datasets that have been shared online and can be used as a common reference. This choice also allows comparing the performance of different approaches.

BLUED
Among all the datasets available in the literature, in this work the BLUED dataset [23] has been considered. This dataset was built in 2011 by monitoring a whole house located in Pennsylvania (USA) for 8 days. It contains the samples of the voltage and current signals, and the corresponding timestamps, collected with a sampling frequency of f s = 12 kHz. During the creation of the dataset, all the changes in the state of power consumption higher than 30 W and lasting at least 5 s, for almost 50 appliances, were labelled and recorded. This results in 2355 events recorded from known sources, and an additional 127 events recorded from unknown sources, for a total of 2482 events. In this work, only the events involving appliances with a nominal active power higher than 30 W and with more than 10 recorded events have been considered, resulting in 1585 events. Table 2 reports the list of the considered appliances, together with the identification code in the BLUED dataset (appliance label) and the number of recorded events.

Signal Processing
In order to be able to evaluate all the electrical quantities in Table 1, the time signals stored in the BLUED database have been processed through a standard scheme, described in the following. Before proceeding with the evaluation of the electrical quantities of interest, according to a generic timeline such as the one represented in Figure 1, it is important to focus on a specific section of both voltage and current signals. Given te, the time instant at which an event occurs for the generic device, a 2 s time window, centered in te, is selected. The selected portion of the signal is then divided into 8 observation windows (ow1, ow2, . . . , ow8), each one 15 cycles long (250 ms at the 60 Hz rated frequency). For each observation window, a calculation window is selected by considering 10 cycles starting from the first zero-crossing of the voltage signal. Then, the fast Fourier transform (FFT) is applied to each calculation window of both voltage and current signals. In particular, since the bandwidth of the acquisition system for the BLUED database was limited to 300 Hz, only the frequency components up to the 5th order have been considered to evaluate the four electric parameters by means of Equations (7)-(10). This results in eight values for each one of the four parameters, four for the pre-event period and four for the post-event period, stored in a matrix P ∈ R^(4×8). In order to define a reference value representative of the pre-event scenario for the k-th parameter, the mean of the four pre-event values has been considered:

Pk,ref = (1/4) Σ j=1..4 Pk(j)

The obtained reference values have then been used to evaluate the variation of the k-th parameter due to the occurrence of the event:

∆Pk(m) = Pk(m + 4) − Pk,ref,   m = 1, . . . , 4

where m + 4 ∈ {5, 6, 7, 8} indexes the post-event values of the parameter Pk. Consequently, the matrix containing the variations of all the Pk parameters, ∆P ∈ R^(4×4), has been obtained. These values have then been used to train and test the neural network in charge of identifying the device that caused the event. The time instants corresponding to the ON/OFF events are considered known.
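A possible implementation of this processing chain is sketched below, assuming a 2 s segment of voltage and current samples centred on the event and that the first voltage zero crossing falls within the first few cycles of each observation window; the constants and function names are illustrative, not taken from the original code.

```python
import numpy as np

FS, F1 = 12_000, 60                    # sampling rate (Hz) and rated frequency (Hz)
N_CYC_OW, N_CYC_CW, H_MAX = 15, 10, 5  # observation/calculation window lengths (cycles), max harmonic order

def window_parameters(v, i):
    """Return [P, P1, Q1, PH] for one 10-cycle calculation window, harmonics up to the 5th."""
    n = len(v)
    Vf, If = np.fft.rfft(v) / n * 2, np.fft.rfft(i) / n * 2   # peak amplitude spectra
    bins = [N_CYC_CW * h for h in range(1, H_MAX + 1)]        # fundamental falls on bin 10 (10 cycles/window)
    V, I = np.abs(Vf[bins]) / np.sqrt(2), np.abs(If[bins]) / np.sqrt(2)   # rms values
    theta = np.angle(Vf[bins]) - np.angle(If[bins])           # phase displacement per harmonic
    terms = V * I * np.cos(theta)
    P, P1 = terms.sum(), terms[0]
    Q1 = V[0] * I[0] * np.sin(theta[0])
    return np.array([P, P1, Q1, P - P1])

def delta_features(v_seg, i_seg):
    """v_seg, i_seg: 2 s of samples centred on the event; returns the 4x4 matrix of variations."""
    n_ow = FS * N_CYC_OW // F1                                # samples per observation window (250 ms)
    n_cw = FS * N_CYC_CW // F1                                # samples per calculation window
    P = np.empty((4, 8))
    for w in range(8):
        v_ow = v_seg[w * n_ow:(w + 1) * n_ow]
        i_ow = i_seg[w * n_ow:(w + 1) * n_ow]
        z = np.flatnonzero(np.diff(np.signbit(v_ow)))[0]      # first zero crossing of the voltage
        P[:, w] = window_parameters(v_ow[z:z + n_cw], i_ow[z:z + n_cw])
    P_ref = P[:, :4].mean(axis=1, keepdims=True)              # pre-event reference values
    return P[:, 4:] - P_ref                                   # delta-P, one column per post-event window

# Synthetic check: a purely sinusoidal 2 s segment (no event) gives near-zero variations.
t = np.arange(2 * FS) / FS
v = np.sqrt(2) * 120 * np.sin(2 * np.pi * F1 * t)
i = np.sqrt(2) * 2 * np.sin(2 * np.pi * F1 * t - 0.3)
print(np.round(delta_features(v, i), 6))
```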

The Multilayer Perceptron
Multilayer perceptrons (MLPs) are feedforward artificial neural networks consisting of at least three layers of nodes: an input layer, a hidden layer and an output layer.
The relationship between input and output patterns is described by the following system of algebraic equations:

y = W1 i + b1   (input layer)
hl = f(y)   (hidden layer)
o = W2 hl + b2   (output layer)

where i is the input vector, W1 is the weight matrix of the input layer, b1 is the bias vector of the input layer, y is the input of the hidden layer, hl is the output of the hidden layer, f(·) is the nonlinear activation function of the hidden neurons, W2 is the weight matrix of the output layer, b2 is the bias vector of the output layer, and o is the output vector. Feedforward networks can be trained for classification purposes, i.e., to classify input data according to the specified target classes.
Learning occurs through a supervised procedure: after each input pattern is fed to the network, the error in the network output is evaluated and the connection weights are modified in order to minimize it. In this case, the conjugate gradient algorithm is applied to train the networks [26] and the error function to be minimized with respect to the weight and bias variables is the cross-entropy. For each pair of output value/target value oc/tc, the cross-entropy is calculated as

ce = −tc ln(oc)

The cross-entropy penalizes the classification errors in the different classes, penalizing most heavily the classifications that are confidently wrong. The aggregate cross-entropy performance is the mean of the individual values. Figure 2 shows the structure of the MLP neural network considered in this paper. The variables used as input are the selected features, whereas the output of the network is a binary vector in which the neuron corresponding to the ON or OFF event of an appliance is set equal to 1. Hence, the dimension of the output vector is twice the number of considered appliances (44 in the case study). In order to obtain an unbiased evaluation of the model fit, the available dataset is generally split into three sets. The training set is used to fit the model and determine weights and biases. The validation set is used for parameter selection and should provide an unbiased evaluation of the fit during training: the error of each model is checked on the validation set, and the model achieving the lowest validation error is selected, determining the final choice of features and model. The test set is used, once training is completed, to evaluate the performance of the final model.
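A minimal numpy sketch of the forward pass and of the cross-entropy defined above is given below; the tanh hidden activation, the softmax output layer and the random weights are assumptions made purely for illustration, not details of the original implementation (which is trained with a conjugate gradient algorithm).

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Forward pass of the MLP described above: tanh hidden layer, softmax output (assumed)."""
    y = W1 @ x + b1            # input layer combination
    h = np.tanh(y)             # hidden layer output, f(.) = tanh
    z = W2 @ h + b2            # output layer combination
    o = np.exp(z - z.max())    # softmax, shifted for numerical stability
    return o / o.sum()

def cross_entropy(o, t):
    """Cross-entropy between the network output o and the one-hot target t."""
    return -np.sum(t * np.log(o + 1e-12))

# Toy example: 16 delta-P features in, 50 hidden neurons, 44 ON/OFF classes out.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(50, 16)) * 0.1, np.zeros(50)
W2, b2 = rng.normal(size=(44, 50)) * 0.1, np.zeros(44)
x = rng.normal(size=16)        # one event's feature vector (placeholder)
t = np.eye(44)[3]              # one-hot target: placeholder event class
print(cross_entropy(forward(x, W1, b1, W2, b2), t))
```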

Feature Selection
Feature selection is the procedure performed to reduce the number of input variables when developing a model, in order to facilitate learning and interpretation, reduce the computational cost and often improve the model performance. The addition of unrelated or redundant features affects model accuracy: redundant features increase the model complexity, increasing the risk of overfitting. Moreover, they could introduce spurious correlations between the features and the class labels.
Many feature extraction algorithms map the parameters into a lower-dimensional feature space, whose projection dimensions lack a direct interpretation. In contrast, the two methods chosen in this work retain a subset of the original features, allowing the physical meaning behind the selection to be interpreted.
The first method is based on neighborhood component analysis (NCA) [27], which implements a feature weighting approach to optimize the nearest neighbor classifier performance by maximizing an objective function that evaluates the average leave-one-out classification accuracy over the training data. However, the cost function of NCA is prone to overfitting, and an improved version of NCA with a regularization term was proposed by Yang et al. [28], where the regularization term was selected empirically.
In the second method, the ranking is obtained by applying the minimum redundancy maximum relevance (MRMR) algorithm. The MRMR algorithm [29] minimizes the redundancy of a feature set while maximizing the relevance of the feature set to the target. Redundancy and relevance are expressed by the mutual information between features and by the mutual information between a feature and the target, respectively.
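As an illustration of the MRMR idea, the following sketch ranks features greedily with scikit-learn's mutual information estimators, using a "relevance minus mean redundancy" criterion; it is a simplified stand-in, not the exact algorithm of [29], and all names are illustrative.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_rank(X, y, n_select=None):
    """Greedy MRMR-style ranking: maximize relevance I(f; y) minus mean redundancy with selected features."""
    X, y = np.asarray(X, float), np.asarray(y)
    n_features = X.shape[1]
    n_select = n_select or n_features
    relevance = mutual_info_classif(X, y, random_state=0)         # I(f_i; y)
    selected, remaining = [], list(range(n_features))
    while remaining and len(selected) < n_select:
        scores = []
        for j in remaining:
            if selected:
                redundancy = np.mean([mutual_info_regression(X[:, [k]], X[:, j], random_state=0)[0]
                                      for k in selected])
            else:
                redundancy = 0.0
            scores.append(relevance[j] - redundancy)               # mutual-information-difference criterion
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

# Example usage (hypothetical inputs): ranking = mrmr_rank(delta_features_matrix, event_labels, n_select=8)
```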
It is worth noting that there is no guarantee that the feature selection algorithm will supply a feature ranking applicable to a generic classifier, based on a different algorithm. However, in practice, the significance of features can often be generalized, especially when both algorithms show good performance in classification. The outcome of the feature ranking was used to design a set of tests employing an increasing number of features.

Performance Indexes
The results have been evaluated using the F-score metric, defined as the harmonic mean of precision and recall:

F-score = 2 · (precision · recall) / (precision + recall)

Precision is the ratio between the number of correctly detected appliances (true positives (TP)) and the total number of detected ones (TP + false positives (FP)), and it represents the fraction of correct detections.
The recall is the ratio between the number of correctly detected appliances (TP) and the total number of appliances in the dataset (TP + false negatives (FN)), and it represents the fraction of events that have been detected.
The confusion matrix is useful for visualizing precision and recall for the different classes. A confusion matrix for multiclass classification shows the different classifier answers: the actual classes form the columns, the predicted classes form the rows, and each cell at the intersection of a row and a column reports the number of events of the corresponding actual class assigned to the corresponding predicted class.
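A short sketch of how the per-class precision, recall and F-score can be obtained from a confusion matrix arranged as described above (predictions on the rows, actual classes on the columns); the function name is illustrative.

```python
import numpy as np

def per_class_scores(cm):
    """cm[i, j]: events of actual class j predicted as class i (predictions on rows, actuals on columns)."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=1) - tp          # other actual classes predicted as this class
    fn = cm.sum(axis=0) - tp          # events of this class predicted as something else
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Two-class toy example.
cm = np.array([[50, 2],
               [3, 45]])
print(per_class_scores(cm))
```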

Test and Results
To obtain the results, the data were split into three parts: for each appliance, 80% of the events is used for training, 10% for validation and the remaining 10% for testing. Cross-validation was applied on the training set in order to avoid overfitting; the optimal configuration settings for the methods were obtained from the cross-validation results, and the evaluation on the test set provided the final performance. The composition of the three datasets is reported in Table 3. Figure 3a,b report the cumulative feature weights (CFW) obtained by applying the NCA and MRMR methods to the training set, respectively, cumulating from the most important feature to the least important one.
As can be noticed, the CFW curve for the NCA algorithm shows a clear linear growth for the top eight features of the ranking and saturates after a knee point. As shown in Figure 4, these features include the active power and the fundamental active and reactive power in the 3rd and 4th time instant after the event, and the harmonic active power in the 2nd and 4th time instant after the event. Thus, the dynamics of these parameters also supply useful information, since in some cases the same parameter is selected at different time instants. In order to detect the relevant features, the regularization parameter corresponding to the minimum average loss has been obtained using cross-validation. Figure 5 shows the average loss versus the regularization parameter obtained with this optimization procedure. The feature score proposed by the MRMR algorithm does not suggest any cut on the feature ranking, since the CFW curve keeps growing linearly until the last feature is added. Thus, no useful information is supplied by this algorithm in the present context.

Load Classification Results
Following the NCA indication, several training sessions of the MLP were performed, varying the number of hidden neurons from 10 to 100 and choosing the top 8 features of the ranking as input.
Results are reported in Table 4 for the network with 50 hidden neurons, which shows the best validation error. Figure 6 shows the confusion matrix; since the training, validation and test sets show similar performance, the confusion matrix is built for the entire dataset. In order to verify the soundness of the approach, the MLP classifier was trained varying the number of hidden neurons from 10 to 100. The network with 50 hidden neurons shows the minimum validation error, so it has been assumed as the best model. Figure 7 shows the performance, in terms of cross-entropy, obtained by progressively entering the input features following the ranking provided by the NCA algorithm. As can be noticed, increasing the number of features initially improves the model performance. The major improvement is obtained when the fundamental active power is added. Then, the validation error gradually decreases, and the best results are obtained when 8 features are selected. Further increasing the input dimension does not provide more accurate classifications. Thus, the test confirms the soundness of the choice suggested by the NCA algorithm. The model performance was compared with that obtained when the MLP is fed with the fundamental active and reactive power ∆P1 and ∆Q1, widely applied in the literature, also by the authors [30,31], and with the same features plus the harmonic active power, which includes the information on the non-sinusoidal current. Table 5 reports the results for these tests. Although the obtained performances are slightly worse than those achieved with the NCA top features, the results confirm that relevant information is carried by the fundamental and harmonic active power features at different time instants.
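The model selection loop described above can be sketched as follows. The sketch uses scikit-learn's MLPClassifier, which minimizes cross-entropy but with a different optimizer than the conjugate gradient used in the paper, and placeholder data, so the dataset sizes, the feature ranking and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import log_loss
from sklearn.neural_network import MLPClassifier

# Placeholder data: in the real experiment X_* hold the delta-P features (16 columns),
# y_* the 44 ON/OFF event classes, and 'ranking' the feature ordering returned by NCA.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1268, 16))
y_train = np.r_[np.arange(44), rng.integers(0, 44, size=1268 - 44)]   # placeholder labels covering all classes
X_val, y_val = rng.normal(size=(159, 16)), rng.integers(0, 44, size=159)
ranking = list(range(16))

best = (np.inf, None, None)
for n_feat in range(1, len(ranking) + 1):            # progressively add features following the ranking
    cols = ranking[:n_feat]
    for n_hidden in range(10, 101, 10):              # hidden layer size swept from 10 to 100
        mlp = MLPClassifier(hidden_layer_sizes=(n_hidden,), max_iter=300,
                            random_state=0).fit(X_train[:, cols], y_train)
        ce = log_loss(y_val, mlp.predict_proba(X_val[:, cols]), labels=mlp.classes_)
        if ce < best[0]:
            best = (ce, n_feat, n_hidden)

print("lowest validation cross-entropy %.3f with %d features and %d hidden neurons" % best)
```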
Performance evaluation and comparison of different approaches to NILM remain open issues in the literature, because authors test their techniques on different datasets, with different criteria and metrics. In this paper, in order to make a meaningful comparison, the techniques proposed in [22,24] and [9] have been considered. In all three cases the BLUED dataset is used and the performance is evaluated with the same indices considered in the present paper.
In particular, in [22] the authors apply a complex multivariable hybrid energy disaggregation approach to a subset of the BLUED database consisting of 24 h of monitoring. The study is limited to the following six devices: fridge, air compressor, hair dryer, backyard lights, bathroom upstairs light and bedroom lights, for 107 events, obtaining an F-score equal to 91.8%. Table 6 reports the performance obtained for five individual appliances both in [22] and in the present paper. In [24] random forest algorithms are applied to the active and reactive powers of the devices in the BLUED dataset. The performance for the load classification step is reported in Table 7. Table 7. Performance reported in [24] for random forest algorithms applied to BLUED dataset.

Precision | Recall | F-Score
0.89 | 0.90 | 0.89

In [9] the authors apply a deep neural network, called concatenate convolutional neural network, to high-dimensional spectrograms of high-frequency current data. The data are a subset of the BLUED database (537 samples), downsampled to 2 kHz, and the devices are grouped into six classes: fridge, two groups of lights, two groups of high-power devices and one group with computer and monitor. Table 8 reports the classification performance.

Table 8. Performance reported in [9] for concatenate convolutional neural networks applied to six classes of devices in BLUED dataset.

Precision | Recall | F-Score

Even if all these results are not directly comparable, since the datasets, although selected from the BLUED data, differ in composition and size, they confirm the effectiveness of the proposed approach for appliance classification.

Conclusions
This study demonstrates the effectiveness of NCA as a feature selection method to enhance the classification performance of a multilayer perceptron for non-intrusive load monitoring. Two feature selection methods, namely the neighborhood component analysis and minimum redundancy maximum relevance algorithms, have been applied to a set of sixteen features, comprising the active power, the fundamental active power, the harmonic active power and the fundamental reactive power evaluated in four windows following a switching event. The results reveal that the NCA feature selection method can select an optimal set of features while containing the model complexity. Indeed, the top features of the ranking, namely the fundamental active and reactive power and the harmonic active power, allow the best MLP performance to be obtained while reducing the model complexity. Comparisons with results presented in the literature show that the procedure achieves competitive performance despite the model simplicity.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.