Statistical and Electrical Features Evaluation for Electrical Appliances Energy Disaggregation

Abstract: In this paper we evaluate several well-known and widely used machine learning algorithms for regression in the energy disaggregation task. Specifically, the Non-Intrusive Load Monitoring approach was considered and the K-Nearest-Neighbours, Support Vector Machines, Deep Neural Networks and Random Forest algorithms were evaluated across five datasets using seven different sets of statistical and electrical features. The experimental results demonstrated the importance of selecting both appropriate features and regression algorithms. Analysis at device level showed that linear devices can be disaggregated using statistical features, while for non-linear devices the use of electrical features significantly improves the disaggregation accuracy, as non-linear appliances have non-sinusoidal current draw and thus cannot be well parametrized only by their active power consumption. The best performance in terms of energy disaggregation accuracy was achieved by the Random Forest regression algorithm.


Introduction
With the development of technology and the increasing usage of electrical appliances and automated services, electric energy needs have been growing steadily for the last century, at approximately 3.4% per year in the last decade [1]. Residential and commercial buildings already account for roughly 36% of the total electrical demand in the USA and 25% in the EU, while they are responsible for roughly 43% of carbon dioxide (CO2) emissions [2][3][4]. To assure balance between renewable energies, CO2 emissions, political stability and economic growth, it is essential to focus on sustainable development [5]. To achieve sustainable economic growth, energy consumption in industrial and residential areas must be minimized while accounting for the rising volatility of today's energy production, which includes increasing amounts of renewable energies [6]. With sustainable development in mind, several studies investigated real-time pricing with additional storage systems [7,8] or large-scale energy buffering [6] to reduce electrical energy consumption and peak loads. Other studies indicate that detailed analysis and real-time feedback of energy consumption in residential areas can lead to savings of up to 20% through detection of faulty devices and poor operational strategies, and thus would improve the sustainability of today's consumer households [9,10]. Therefore, in the last few decades extensive research in smart grids, smart systems and demand management has been carried out and different optimization techniques have been developed to reduce residential energy consumption [11][12][13]. To make use of those techniques, accurate and fine-grained monitoring of electrical energy consumption is needed [14].
However, the energy consumption of most households is currently monitored via monthly aggregated measurements, which cannot provide real-time feedback.
To measure the energy consumption of a household or building with high resolution, in the order of seconds or below, smart meters are utilized. According to [15], the largest improvements in terms of energy savings can be made when monitoring energy consumption on device level. Therefore the analysis of energy on device level is performed through energy disaggregation, i.e., the extraction of energy consumption on appliance level based on one or multiple measurements from smart meters. When using only one sensor (smart meter) per household or building, and therefore measuring only the aggregated consumption, the task is referred to as Non-Intrusive Load Monitoring (NILM) [16], in contrast to Intrusive Load Monitoring (ILM) where multiple sensors are used, usually one per device. The goal of NILM is to find the inverse of the aggregation function through a disaggregation algorithm using as input only the aggregated power consumption, which makes NILM a highly under-determined problem and thus impossible to solve analytically [17].
In order to solve the NILM problem different approaches have been proposed in the literature, which can be split into methods with or without Source Separation (SS). Approaches with SS consider the task of energy disaggregation as a single channel source separation problem and extract the corresponding signal of each device from the aggregated samples using a set of conditions and constraints (e.g., sparseness or sum-to-one) [18,19]. Approaches without SS are based on the decomposition of the aggregated signal into a sequence of feature vectors. These feature vectors are then classified to device labels using a machine learning algorithm [8,20,21] or by a predefined set of rules or thresholds [22,23]. As machine learning classification/regression models a wide variety of algorithms have been used, such as Artificial Neural Networks (ANNs) [8], Decision Trees (DTs) [24], Hidden Markov Models (HMMs) [24][25][26][27][28][29], K-Nearest-Neighbours (KNNs) [30], Random Forests (RFs) [20], Support Vector Machines (SVMs) [24] and ensemble classifiers [31].
Another classification of NILM methods is based on the sampling frequency $f_s$ of the smart meter and thus the features that can be extracted from the measured data [32]. In detail, depending on the sampling frequency either macroscopic (e.g., active/reactive power [23,33,34]) or microscopic (e.g., transient energy, harmonics, wavelets [22,35]) features are extracted to disaggregate energy consumption on appliance level for steady state and transient behaviour, respectively. Macroscopic features are extracted at low sampling frequencies in the order of 1/60 Hz to 1 Hz, while microscopic features are extracted at high sampling frequencies from 50 Hz up to 30 kHz [32]. Many researchers have used microscopic features to efficiently detect transient device behaviour and thus improve energy disaggregation [36,37]. However, measuring the power consumption with high sampling frequency has the drawback of higher hardware cost and increased computational load [38]. Therefore most studies focus on disaggregation algorithms using macroscopic features or only active power samples, in combination with low computational cost disaggregation algorithms utilizing sampling rates in the order of seconds and minutes [28,39–44].
Considering the wide range of appliances, with either steady-state behaviour, where appliances are modelled as finite state machines [16,45], or transient behaviour, including non-linear and continuous appliances [18,36,46], investigation of the effect of different features and classification algorithms is essential. In this paper we evaluate the performance of various well-known and widely used classifiers and various features on the NILM energy disaggregation task. Specifically, we present a large scale evaluation of several features with respect to the NILM performance on specific appliance types in combination with several widely used classification algorithms, in order to investigate which feature sets are more appropriate for accurately detecting specific appliance types, e.g., non-linear appliances, and the effect of using appliance specific features on the overall NILM performance. The proposed methodology with appliance specific features is evaluated using several combinations of feature sets and classification algorithms.
The remainder of this paper is organized as follows: In Section 2 the baseline NILM system is presented. In Section 3 the experimental setup is described and in Section 4 the evaluation results are presented. Finally, conclusions are provided in Section 5.

NILM Architecture
NILM energy disaggregation can be formulated as the task of determining the power consumption on device level based on the measurements of one sensor, within time windows (frames or epochs). Specifically, for a set of $M - 1$ known devices each consuming power $p_m$ with $1 \le m \le M$, the aggregated power $P_{agg}$ measured by the sensor will be

$$P_{agg} = f(p_1, p_2, \ldots, p_{M-1}, g) = \sum_{m=1}^{M-1} p_m + g = \sum_{m=1}^{M} p_m \qquad (1)$$

where $g = p_M$ is a 'ghost' power consumption (noise) consumed by one or more unknown devices and $f$ is the aggregation function. In NILM the goal is to find estimates $\hat{p}_m$, $\hat{g}$ of the power consumption of each device $m$ using an estimation method $f^{-1}$ with minimal estimation error and $\hat{p}_M = \hat{g}$, resulting in the total estimated power $\hat{P}$, i.e.,

$$\hat{P} = f^{-1}(P_{agg}) = \{\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_{M-1}, \hat{g}\} \qquad (2)$$

The block diagram of the NILM architecture adopted in the present evaluation is illustrated in Figure 1 and consists of four stages, namely pre-processing, feature extraction, appliance detection and post-processing. In detail, the aggregated power consumption signal calculated from a smart meter is initially pre-processed, i.e., passed through a median filter [47] and then frame blocked in time frames. After pre-processing, feature vectors $\mathbf{v}$ of length $\|\mathbf{v}\|$, one for each frame, are calculated. In the appliance detection stage the feature vectors are processed by a regression algorithm using a set of pre-trained appliance models to estimate the power consumption of each device. The output of the regression algorithm ($P_{reg}$) estimates the corresponding device consumption, and a set of thresholds $T_m$ with $1 \le m \le M$ and $T_g = T_M$, one for each device including the ghost device ($m = M$), is used to decide whether a device is switched on or off. The detection of appliances and estimation of their power consumption is performed for each frame of the aggregated signal.
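To make the formulation concrete, the following minimal Python sketch shows the aggregation of device powers with a ghost load for a single frame, followed by the threshold-based on/off decision. The device names, power values, regression outputs and threshold values are hypothetical and chosen only for illustration; they are not taken from the evaluated datasets.

```python
def aggregate(device_powers, ghost_power=0.0):
    """Aggregation function f: sum of known device powers plus ghost power (W)."""
    return sum(device_powers) + ghost_power


def is_on(p_reg, threshold):
    """On/off decision for one device: compare the regression output with T_m."""
    return p_reg >= threshold


# Hypothetical single-frame example: two known devices and an unknown ghost load.
p_true = {"kettle": 1800.0, "fridge": 90.0}
p_agg = aggregate(p_true.values(), ghost_power=35.0)

# Hypothetical regression outputs and per-device thresholds T_m (W).
p_reg = {"kettle": 1765.0, "fridge": 12.0}
thresholds = {"kettle": 25.0, "fridge": 25.0}
states = {name: is_on(p_reg[name], thresholds[name]) for name in p_reg}

print(p_agg, states)
```

In this toy frame the fridge estimate falls below its threshold, so the device is treated as switched off even though the ground truth says otherwise, which is the kind of error the accuracy metric later penalizes.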
A post-processing stage refines the power estimates from the regression model by mapping them to a priori known device states using a Look-Up Table (LUT), i.e., if the distance of the regression output to any state in the device model is larger than 25 W, the regression output is mapped to the closest device state. In order to define the number of states per device, the K-Means algorithm was used for initialisation, followed by Expectation-Maximization (EM) clustering to calculate the power consumption for each state of each device and form the LUT for the post-processing stage [48].
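The LUT mapping can be sketched as below. This follows one literal reading of the rule above (estimates further than the tolerance from the nearest state are replaced by that state, estimates within the tolerance are kept), which is an assumption about the intended behaviour. The function name `snap_to_state` and the example state values are hypothetical, and the state list stands in for the output of the K-Means/EM clustering.

```python
def snap_to_state(p_reg, states, tol=25.0):
    """Map a regression output (W) to the closest known device state.

    Assumed reading of the rule in the text: estimates lying more than
    `tol` watts from the nearest state are replaced by that state;
    estimates already within `tol` are left unchanged.
    """
    nearest = min(states, key=lambda s: abs(s - p_reg))
    return nearest if abs(nearest - p_reg) > tol else p_reg


# Hypothetical look-up table for a two-state device: off and on (W).
fridge_lut = [0.0, 90.0]
print(snap_to_state(40.0, fridge_lut))   # 40 W from the off state, 50 W from the on state
print(snap_to_state(100.0, fridge_lut))  # within 25 W of the 90 W state
```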

Experimental Setup
The NILM architecture presented in Section 2 was evaluated using a number of publicly available datasets and a number of well-known machine learning algorithms for regression.

Datasets
To evaluate performance, five different datasets of the ECO [47] database were used. The ECO database was chosen as it contains power consumption measurements per device as well as the aggregated consumption. The ECO-3 dataset was excluded as it contains only the aggregated signal and not the power consumption per device. Furthermore, the aggregated consumption measurements include not only the active power, but also the line currents ($I_x$), line voltages ($V_x$) and load angles ($\phi_x$) for all three phases ($x \in \{1, 2, 3\}$).
The evaluated datasets and their characteristics are tabulated in Table 1 with the number of appliances denoted in the column #App. In the same column, the number in brackets is the number of appliances after excluding devices with power consumption below 25 W, which were added to the power of the ghost device, similarly to the experimental setup followed in [40,49]. The next three columns in Table 1 list the sampling period $T_s$, the duration $T$ of the aggregated signal used and the appliance type for each evaluated dataset. The appliance type categorization is based on their operation as described in [50,51], i.e., one-state devices have only an on/off status (e.g., resistive lamps, kettles or fridges without significant power spikes), multi-state devices have several discrete power consumption states (e.g., washing machines with different washing cycles) and non-linear loads (e.g., electronic appliances) have various states and stronger power variation. Considering their electrical layout, all one- and multi-state appliances consist of a series of resistors, inductors and capacitors and thus can further be classified into resistive, inductive and capacitive devices. Non-linear appliances can include additional active components (e.g., semiconductors) and non-linear passive elements (e.g., diodes). To evaluate NILM performance in close to real conditions, the aggregated signal including the ghost power from unknown devices was used as proposed in [52], instead of creating an artificial aggregated signal by adding the corresponding power consumption from each device. The aggregated signal includes real power samples, raw current samples, raw voltage samples and load angles, depending on the chosen feature set.

Pre-Processing and Feature Ranking
During pre-processing the aggregated signal was processed by a median filter of 5 samples as proposed in [47] and was then frame blocked in frames of 10 samples with overlap between successive frames equal to 50% (i.e., 5 samples). For every frame a feature vector $\mathbf{v} \in \mathbb{R}^D$ consisting of 15 statistical features (mean value, minimum and maximum values, Root-Mean-Square (RMS) value, median value, percentiles 25% and 75%, variance, standard deviation (std), skewness, kurtosis, range, energy, peak-to-RMS ratio (Peak2Rms) and Zero Crossings (ZC)) and four electrical features (line current, neutral current, line voltage and load angle) was calculated, resulting in a feature vector of dimensionality equal to $D = 19$. In order to calculate the statistical importance of the 19 features the ReliefF feature ranking algorithm [53] was used. The ReliefF algorithm was chosen as it can deal with noisy data (in our task mainly coming from the ghost power) and is appropriate for feature ranking estimation of multi-class datasets [54], such as the multiple devices of the NILM task. The average ReliefF ranking scores across the five evaluated ECO datasets are shown in Figure 2.
As can be seen in Figure 2, statistical and electrical features can be divided into two groups based on their ReliefF scores. The first group includes eight statistical and three electrical features with high ranking score (≥0.04), namely the Min/Max, Mean, Energy, RMS, Percentiles 75/25 (Per75/Per25) and Median, and the load angles ($\phi_{1,2,3}$), the line currents ($I_{1,2,3}$) and the neutral current ($I_N$), respectively. The second group includes features with lower ranking score (<0.04), namely the Zero Crossing rate, Peak2Rms, Range, Standard Deviation, Skewness, Kurtosis and Variance from the statistical features and the line voltages ($V_{1,2,3}$) from the electrical features.
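The pre-processing and feature extraction steps described above can be sketched in plain Python as follows. This is an illustrative sketch only: the truncated window at the signal edges, the definition of energy as the sum of squared samples, and the choice to compute only a subset of the statistical features are assumptions, and the function names and the power trace are hypothetical.

```python
import math
import statistics


def median_filter(x, k=5):
    """Sliding median of odd width k; edges use a truncated window (assumption)."""
    h = k // 2
    return [statistics.median(x[max(0, i - h):i + h + 1]) for i in range(len(x))]


def frames(x, size=10, hop=5):
    """Split x into frames of `size` samples with hop `hop` (50% overlap)."""
    return [x[i:i + size] for i in range(0, len(x) - size + 1, hop)]


def features(frame):
    """A subset of the statistical features named in the text."""
    energy = sum(v * v for v in frame)  # assumed definition: sum of squares
    return {
        "mean": statistics.fmean(frame),
        "min": min(frame),
        "max": max(frame),
        "median": statistics.median(frame),
        "rms": math.sqrt(energy / len(frame)),
        "energy": energy,
        "range": max(frame) - min(frame),
        "std": statistics.pstdev(frame),
    }


# Hypothetical active-power trace (W) with a one-sample spike.
power = [100.0] * 12 + [900.0] + [100.0] * 12
vectors = [features(f) for f in frames(median_filter(power))]
print(len(vectors), vectors[0]["max"])
```

The single-sample spike is suppressed by the 5-sample median filter, so every frame of this trace reports the same steady-state statistics.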
For the electrical features it must be mentioned that the neutral current and the load angles are given by the sum of the line currents and the phase-shift between line currents and voltages, respectively, therefore they carry complementary information which affects their ranking scores.

The outcome of the feature ranking was used to design a set of seven experimental protocols, with the first four including only statistical features and the last three employing the additional electrical features. The chosen features for each experiment are tabulated in Table 2, where the first experiment considers only the mean value of active power samples and is thus regarded as the baseline system, while for the following experiments features with decreasing ranking score were added while keeping similar pairs of features together (e.g., Min/Max). There is one exceptional case (in protocols 5 and 7) for the electrical features, where the load angles are added after the line currents/voltages. This is due to two reasons, namely that the combination of line currents and voltages contains the same information as the load angles and that load angles can only be computed if line currents and voltages are available.
For the regression stage four different well-known and widely used machine learning algorithms were employed, namely feed-forward Deep Neural Networks (DNNs), k-Nearest Neighbours (KNNs), Random Forests (RFs) and Support Vector Machines (SVMs). The free parameters of each regression algorithm were empirically optimized by grid search on a bootstrap training subset including data from the ECO-1/2/4/5/6 datasets with ideal aggregated data (without ghost power), as shown in Table 3. The best performance corresponding to the optimal values of each regression model is shown in bold. A "one vs. all" approach was chosen, with the output of each regression model being the prediction of the power of the $m$-th appliance. In order to avoid overlap between training and test data, each of the evaluated datasets was equally split into two subsets, one for training each regression model and one for evaluating its performance.
As can be seen in Table 3, the parameter-optimized regression models are a DNN feed-forward architecture with 3 hidden layers and 32 sigmoid nodes per layer, a KNN with K = 5 nearest neighbours, an RF with 32 trees per forest and an SVM with Radial Basis Function (RBF) kernel and optimized kernel parameters γ = 12.8, C = 1.45. The DNN regression model achieved accuracy equal to 88.71% and outperformed all other evaluated regression models on the bootstrap training set.
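The "one vs. all" regression setup can be illustrated with a minimal KNN regressor using K = 5, written in plain Python. This is a sketch under stated assumptions rather than the implementation used in the evaluation: the unweighted mean over neighbours, the Euclidean distance, the single-feature vectors and all training values are hypothetical.

```python
import math


def knn_predict(train_x, train_y, query, k=5):
    """Predict a device's power as the mean target of the k nearest frames."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


# Hypothetical training frames: feature vector = [mean aggregated power (W)].
train_x = [[50.0], [60.0], [70.0],
           [1850.0], [1860.0], [1870.0], [1880.0], [1890.0]]

# One target list per device, i.e., one regression model per appliance.
train_y = {
    "kettle": [0.0, 0.0, 0.0, 1800.0, 1800.0, 1800.0, 1800.0, 1800.0],
    "fridge": [50.0, 60.0, 70.0, 50.0, 60.0, 70.0, 80.0, 90.0],
}

query = [1870.0]
estimates = {dev: knn_predict(train_x, y, query) for dev, y in train_y.items()}
print(estimates)
```

Each device gets its own target vector while sharing the same feature vectors, which mirrors the idea of training one regression model per appliance on the aggregated signal.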

Experimental Results
The NILM architecture presented in Section 2 was evaluated according to the experimental setup described in Section 3 with the optimized parameters shown in Table 3. The performance was evaluated in terms of power estimation accuracy ($E_{ACC}$), as proposed in [55] and defined in Equation (3):

$$E_{ACC} = 1 - \frac{\sum_{t=1}^{T} \sum_{m=1}^{M} \left| \hat{p}_m^t - p_m^t \right|}{2 \sum_{t=1}^{T} \sum_{m=1}^{M} \left| p_m^t \right|} \qquad (3)$$

The estimation accuracy takes into account the estimated power $\hat{p}_m$ and the ground-truth power consumption $p_m$ for each device $m$, where $T$ is the number of frames and $M$ is the number of disaggregated devices including the ghost power. For evaluating estimation accuracy on device level, Equation (3) is modified by eliminating the summation over the $M$ appliances, resulting in Equation (4), which measures the estimation accuracy on device level ($E_{ACC}^m$):

$$E_{ACC}^{m} = 1 - \frac{\sum_{t=1}^{T} \left| \hat{p}_m^t - p_m^t \right|}{2 \sum_{t=1}^{T} \left| p_m^t \right|} \qquad (4)$$
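As a worked illustration, the estimation accuracy of [55] is commonly written as one minus the total absolute estimation error normalized by twice the total ground-truth power. Assuming that definition, a short Python sketch for the overall and the per-device accuracy is given below; the function names and the two-frame, two-device example values are hypothetical.

```python
def estimation_accuracy(p_hat, p_true):
    """E_ACC over T frames and M devices; inputs are lists of per-frame lists."""
    err = sum(abs(e - t) for fe, ft in zip(p_hat, p_true) for e, t in zip(fe, ft))
    total = sum(abs(t) for ft in p_true for t in ft)
    return 1.0 - err / (2.0 * total)


def device_accuracy(p_hat, p_true, m):
    """Device-level accuracy for device index m (no summation over devices)."""
    err = sum(abs(fe[m] - ft[m]) for fe, ft in zip(p_hat, p_true))
    total = sum(abs(ft[m]) for ft in p_true)
    return 1.0 - err / (2.0 * total)


# Hypothetical ground truth and estimates (W): rows are frames, columns devices.
p_true = [[100.0, 0.0], [100.0, 50.0]]
p_hat = [[90.0, 0.0], [100.0, 60.0]]

overall = estimation_accuracy(p_hat, p_true)
per_device = [device_accuracy(p_hat, p_true, m) for m in range(2)]
print(overall, per_device)
```

The same absolute error of 10 W weighs more heavily on the low-power device, which is why per-device accuracies can differ strongly from the overall figure.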
The evaluation results for different experimental protocols and different regression models are tabulated in Table 4. As can be seen in Table 4, adding further statistical and electrical features improves the energy disaggregation performance across all evaluated datasets, with the RF regression model outperforming all other regression algorithms. In detail, for the RF regression model the greatest absolute improvement was observed for the ECO-6 dataset (9.32%), followed by the ECO-1 dataset (7.80%), while the lowest absolute improvement was found for the ECO-3 dataset (5.22%). Moreover, for almost all of the evaluated datasets and regression algorithms the best energy disaggregation performance was achieved when using additional electrical features, i.e., in protocols 5-7. In order to evaluate the appropriateness of each regression algorithm in the seven experimental protocols, the average performance across the five datasets for each of the regression models was calculated. The results are tabulated in Table 5. As can be seen in Table 5, protocol five shows the best average performance for the DNN, SVM and RF regression models, while the KNN model shows a slightly higher performance in protocol seven, followed by protocol five. Moreover, as also shown in Table 4, the RF regression model outperforms the other models (87.6% with 6.5% performance increase), followed by KNN and DNN with similar performance (∼83.8% with 2.5% performance increase), while SVM achieved the lowest performance (79.6% with 1.9% performance increase).
Further analysis of the evaluation results was conducted on appliance type level, with the appliance types as described in Section 3 and tabulated in Table 1. The results for per device improvement using the best performing classifier (RF) are tabulated in Table 6. The first experimental protocol uses only the mean value of the active power as feature and thus is considered here as the baseline system, against which all performance improvements in Table 6 have been calculated, with the corresponding protocol denoted in brackets. Moreover, appliances that are not operating during the testing period are marked in red and were excluded from further investigation.

Table 6. Maximum improvement (%) with respect to the baseline (first) protocol for the best performing regression model (RF). The protocol with the best performance is given in brackets. Devices that are not operating during the testing period are marked in red.

As can be seen in Table 6, high improvements of performance do not necessarily appear in experimental protocols with the highest number of features. To investigate the relation between appliance types and features on the energy disaggregation task we consider two types of linear appliances, either with a purely resistive equivalent circuit diagram or complex loads with inductive/capacitive behaviour. Therefore three appliance categories are formed, namely one/multi-state appliances with resistive behaviour, one/multi-state appliances that can be modelled as complex loads (mainly inductive) and non-linear appliances. This appliance categorization is illustrated in Table 7.

Table 7. Impact of employing temporal contextual information for three different device categories.
After examining the results from Table 6 in light of Table 7, it can be seen that for resistive one/multi-state appliances (e.g., kettle, coffee machine or lamp), where the reactive power is zero (Q = 0), the best performing experimental protocol is protocol five, in which the line current is included in the feature vector as an electrical feature together with the statistical features. For this appliance type, adding the line voltage or the load angle as an additional feature is not beneficial, since the load angle, i.e., the shift between current and voltage, is always zero and thus does not contribute significant information to their parametrized power signature. In contrast, one/multi-state appliances with strong inductive behaviour (e.g., fridges or freezers) benefit from adding the load angle as a feature, as they consume a significant amount of reactive power, and thus achieved their best performance with experimental protocol seven. Non-linear devices, in turn, cannot be described in terms of the active and reactive power consumption including the corresponding load angle, since the current flowing through them is non-sinusoidal, as illustrated in Table 7. Thus their power consumption must be described through different techniques, for example the Fryze power theory [56], where time domain analysis of active and non-active currents is used and the reactive power is split into a reactive component caused by the time domain shift between current and voltage and a component caused by the non-linearity of the device. For such appliances (e.g., entertainment, laptop or TV) the best performing experimental protocol was protocol six, where line current and line voltage are added as features. This corresponds to a time domain description as suggested in [56] and does not include the load angle, since in non-linear appliances the load angle does not carry any device-dependent information.
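To illustrate why the load angle is uninformative for non-linear loads, the following sketch applies the standard Fryze decomposition, in which the active current is the part of the current proportional to the voltage, $i_a(t) = (P / V_{rms}^2)\, v(t)$, and the non-active current is the remainder. The waveforms, amplitudes and the third-harmonic load are hypothetical examples, and the code is a simplified single-phase sketch rather than the procedure of [56].

```python
import math


def fryze(v, i):
    """Split current samples i into active and non-active parts (Fryze).

    v, i: equally long lists of voltage and current samples covering an
    integer number of mains periods.
    """
    n = len(v)
    p = sum(a * b for a, b in zip(v, i)) / n      # active power
    v_rms_sq = sum(a * a for a in v) / n          # squared RMS voltage
    i_active = [p / v_rms_sq * a for a in v]      # proportional to v
    i_nonactive = [b - a for a, b in zip(i_active, i)]
    return i_active, i_nonactive


def rms(x):
    return math.sqrt(sum(a * a for a in x) / len(x))


N = 200  # samples per mains period (hypothetical)
w = [2 * math.pi * k / N for k in range(N)]
v = [325.0 * math.sin(a) for a in w]

# Hypothetical resistive load: current proportional to voltage.
i_res = [a / 50.0 for a in v]
# Hypothetical non-linear load: in-phase fundamental plus a third harmonic.
i_nl = [6.5 * math.sin(a) + 2.0 * math.sin(3 * a) for a in w]

res_nonactive = rms(fryze(v, i_res)[1])
nl_nonactive = rms(fryze(v, i_nl)[1])
print(res_nonactive, nl_nonactive)
```

Both example loads have a fundamental current in phase with the voltage, yet only the non-linear one shows a non-active current, which is consistent with the observation that time domain current and voltage features separate such devices better than the load angle.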

As regards performance on dataset level, the maximum overall performance can be achieved when detecting each device using its own set of optimal features (i.e., the best performing experimental protocol), as tabulated in Table 6. In addition to selecting appliance-driven features, the disaggregation results can be improved when employing the post-processing step from Section 2, where the power estimates from the regression stage $P_{reg}$ are mapped to the appliance states determined through the appliance model during the pre-processing. The per dataset results when choosing the optimal set of features individually for each device and utilizing the post-processing are tabulated in Table 8, with the best performing datasets shown in bold.

Table 8. Maximum performance (%) per dataset for all classifiers using appliance-driven features with ('Post') and without post-processing ('App'), compared to the best performing protocol without post-processing and with uniform appliance features for all appliances ('Base').

As can be seen in Table 8, employing the optimal set of features for each device results in further improvement of the disaggregation accuracy, varying from 0.1% to 3.2% depending on the dataset and the regression model. The maximum average performance increase for the best performing classifier (RF) is 0.8%, with an overall average disaggregation accuracy of 89.0%. When further employing the post-processing as described in Section 2, another performance increase between 0.5% and 1.2% can be observed when utilizing DNNs, RFs or SVMs. However, no performance increase was observed when using KNNs. The performance increase when using DNNs, RFs or SVMs is mainly due to one/multi-state linear appliances, which can be modelled as finite-state machines and benefit from the post-processing step where power estimates are mapped to discrete power states of the corresponding appliance.
In terms of absolute improvement RF still outperforms all other classifiers when applying the LUT post-processing with overall disaggregation accuracy equal to 89.5%.

Conclusions
In this paper the performance of different classifiers in combination with different sets of features for energy disaggregation in non-intrusive load monitoring was investigated. The evaluation results showed that the selection of features is of significant importance, with the electrical features being more discriminative than the statistical ones. It was also shown that the optimal choice of features strongly depends on the device type and its electrical characteristics, with non-linear devices being better disaggregated when using line current and line voltage, while linear devices were disaggregated well with statistical features only. After evaluating energy disaggregation across several datasets, random forest (RF) was found to be the best performing regression algorithm, outperforming all other evaluated machine learning algorithms by an absolute performance increase of approximately 6.5%. Moreover, it was shown that, when using device dependent features and device state mapping as post-processing, further improvement in energy disaggregation accuracy can be achieved (up to 1.3%), resulting in a maximum disaggregation performance of 89.5%. Energy disaggregation, and especially non-intrusive load monitoring, is a very challenging task. The use of detectors which have been designed, adapted or fine-tuned to the specifications of each appliance is a direction which will result in further improvement of disaggregation performance and, in combination with the recent evolution of deep learning [43,57,58] and the use of big data for training device models, e.g., in the order of years [57], is expected to contribute to even more accurate disaggregation methodologies. Moreover, multimodal information other than energy data acquired from smart meters, such as weather, occupancy or socio-economic events, could support the disaggregation of the energy consumption of households and buildings [59].
Acknowledgments: This work was supported by the UA Doctoral Training Alliance (https://www.unialliance.ac.uk/) for Energy in the United Kingdom.

Conflicts of Interest: The authors declare no conflict of interest.