Energy Disaggregation Using Two-Stage Fusion of Binary Device Detectors

A data-driven methodology to improve the energy disaggregation accuracy during Non-Intrusive Load Monitoring is proposed. In detail, the method uses a two-stage classification scheme, with the first stage consisting of classification models processing the aggregated signal in parallel and each of them producing a binary device detection score, and the second stage consisting of fusion regression models for estimating the power consumption for each of the electrical appliances. The accuracy of the proposed approach was tested on three datasets that are available online, namely ECO (Electricity Consumption & Occupancy), REDD (Reference Energy Disaggregation Data Set), and iAWE (Indian Dataset for Ambient Water and Energy), using four different classifiers. The presented approach improves the estimation accuracy by up to 4.1% with respect to a basic energy disaggregation architecture, while the improvement on device level was up to 10.1%. Analysis on device level showed significant improvement of power consumption estimation accuracy, especially for continuous and nonlinear appliances, across all evaluated datasets.


Introduction
Between 25% and 40% of global energy consumption and the corresponding carbon dioxide emissions come from residential buildings [1][2][3][4]. It is estimated that the average number of electrical devices used in houses will rise over the next two decades [4]. In parallel, climate change and urbanization are affecting the energy load of urban buildings, with energy load demand growing two times faster than the expansion of urbanization [5]. Studies have shown that roughly 20% of the energy consumed by households is due to faulty equipment or poor operational strategies [6][7][8]. Therefore, to detect faulty device operation and improve operation strategies, optimization techniques for device detection and load scheduling have been developed to find optimal and suboptimal operational strategies [9]. Additionally, significant progress in smart grids, smart systems, and smart devices has been made in the last few decades with regard to optimized energy generation and distribution [9,10]. Accordingly, energy management and the deployment of Information and Communication Technologies (ICT) in residential buildings have increased as well, in order to reduce households' energy consumption without decreasing living quality or violating consumers' personal rights and privacy [11,12]. In general, the amount of information gathered about consumer behavior is increasing progressively. In particular, energy usage is monitored to reduce overall energy consumption and peak loads, while also aiming to improve the well-being of consumers [13].
Studies have shown that smart energy management, smart grids, fine-grained energy monitoring, and load forecasting at household level are indispensable for achieving a significant decrease in energy consumption [14,15]. However, energy monitoring is currently done mostly via an aggregated measure of energy on monthly bills, which does not offer detailed information about energy consumption. Therefore, to measure energy consumption accurately, smart meters are used, usually with a sampling frequency of 1 Hz or more. Smart meters are devices that measure the energy consumption of electrical appliances based on voltage and current measurements. The energy consumption is usually calculated every second or more frequently, e.g., at up to 30,000 samples per second [16]. The more frequently energy consumption is calculated, the more detailed the captured information; however, increasing the sampling frequency linearly increases the data to be stored, processed, or transmitted, which in turn increases the hardware cost exponentially [17]. Therefore, most recent studies focus on low sampling frequency data, as the majority of commercial smart meters collect data at 0.1 Hz up to 1 Hz to minimize hardware cost and to address transmission and data-storage capacity limitations [18,19]. Energy savings can be achieved at device level by detecting faulty device operation and inefficient operating strategies [7]. Knowledge about the appliances' consumption can lead to a reduction of total consumption through increased awareness of energy consumption [20]. Recent studies have shown that households are usually bad at estimating individual power consumption (e.g., overrating the consumption of small appliances and underrating the amount of energy used for heating) [21]. This means that energy consumption must either be measured at device level, which has the disadvantage of increased cost due to wiring and data acquisition [19], or the aggregated energy (consumed energy measured centrally for each household) must be split to appliance level automatically, which is called energy disaggregation.
Energy disaggregation, as defined in [22], is the Non-Intrusive Load Monitoring (NILM) task of determining the energy consumption of each individual appliance in a house by processing measurements of the current and voltage of the household's overall load. The term non-intrusive marks the distinction from Intrusive Load Monitoring (ILM) methods, which use several measurements and smart meters, and sets the focus on determining per-device consumption. In other words, NILM extracts electrical energy consumption at appliance level from one central measurement by identifying the switch-on times $t_{on}$ and switch-off times $t_{off}$ of appliances in the aggregated energy signal in order to find the corresponding consumption per appliance [23].
Several methods for solving the NILM energy disaggregation challenge can be found in the literature. These methods can broadly be classified into methods that use Source Separation (SS) algorithms and approaches that do not. Common to all NILM approaches is that they use measurements of the aggregated energy consumption of a household with a sampling frequency $f_s$ in the order of one sample per second up to a few tens of kHz [16]. NILM methods may use macroscopic signal parameters (e.g., active/reactive power [24,25]) or microscopic ones (e.g., transient energy and harmonics [26][27][28]), depending on the sampling rate $f_s$, to split the aggregated signal to appliance level [29]. Appliance identification methods not using SS algorithms are mainly based on supervised methods and the extraction of features, which are used either for training a Machine Learning (ML) algorithm (e.g., Support Vector Machines (SVM) [30], Artificial Neural Network (ANN) [31], Decision Tree (DT) [32], K-Nearest Neighbours (KNN) [33]) or for defining a set of rules or thresholds [28]. Appliance identification methods using SS algorithms are based on single-channel source separation and solve the task with optimization criteria. Approaches using source separation extract the characteristic power consumption pattern of every appliance from the aggregated signal using an optimization algorithm with constraints [19,34,35]. Commonly reported SS algorithms in the NILM task are Independent Component Analysis (ICA) [36], Non-Negative Matrix Factorization (NMF) [37], and Sparse Component Analysis (SCA) [38]. Source Separation-based NILM approaches are unsupervised; however, a priori knowledge is needed as only the aggregated signal measurements are used, thus making them semi-unsupervised [19], in contrast to the NILM approaches without SS algorithms, which are supervised.
Furthermore, advances in machine learning have led to a number of recently proposed deep learning approaches using big datasets, like the Almanac of Minutely Power Dataset (AMPds) [39]. Methodologies using Convolutional Neural Networks (CNNs) [40][41][42], Recurrent Neural Networks (RNNs) [43,44] and Long Short-Term Memory (LSTM) architectures [44,45], denoising autoencoders (dAEs) [46], and Gated Recurrent Units (GRUs) [40] can be found in the literature. Additional questions regarding consumer privacy and real-time capability arise with high-frequency measurements of energy consumption; these have been discussed in [47,48] for security-relevant issues and in [17] and [49] for low-cost disaggregation and real-time capability.
There is still no established approach for solving the NILM problem, and the literature reports multiple solutions with and without source separation. There are numerous electrical devices which have steady-state behavior [22] and are typically modeled as finite state machines [22,50], as well as electrical devices with non-steady behavior, which have nonlinear and/or continuous characteristics [51,52]. The identification of such appliances when working in parallel or showing strongly time-dependent behavior [53] is still an unsolved problem, especially for nonlinear and continuous devices. In this paper a two-stage fusion approach is proposed, aiming at representing different device combinations and their time-varying behavior more accurately. The proposed methodology is based on supervised learning and utilizes low frequency data as well as steady-state features, similar to [54][55][56].
The remainder of this article is organized as follows. Section 2 presents the proposed two-stage fusion methodology. In Sections 3 and 4, the experimental set-up and the experimental results are given, respectively. Finally, conclusions are provided in Section 5.

Two-Stage Fusion Methodology
The NILM energy disaggregation task can be described as the problem of estimating the power consumption of each electrical appliance from the measurements acquired by one central smart meter, within time windows (frames or epochs). In detail, given a set of $M-1$ known appliances, each consuming power $p_m$ with $1 \le m \le M$, the aggregated power $P_{agg}$ measured by the central smart meter will be

$$P_{agg} = f(p_1, p_2, \ldots, p_{M-1}, p_g) = \sum_{m=1}^{M-1} p_m + p_g = \sum_{m=1}^{M} p_m \tag{1}$$

where $p_g = p_M$ is a "ghost" power consumption, which is usually consumed by one or more unknown appliances. In NILM, the aim is to calculate estimations $\hat{P} = \{\hat{p}_m\}$, $1 \le m \le M$, of the power consumption of each electrical appliance $m$ using an estimation method $f^{-1}$ with minimal estimation error and $\hat{p}_M = \hat{p}_g$, i.e.,

$$\hat{P} = \{\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_{M-1}, \hat{p}_g\} = f^{-1}(P_{agg}) \quad \text{s.t.} \quad \arg\min_{f^{-1}} \left\{ \left( P_{agg} - \sum_{m=1}^{M} \hat{p}_m \right)^2 \right\} \tag{2}$$

As Equation (2) is practically impossible to solve analytically, most energy disaggregation methodologies are based on segmentation of the aggregated signal into frames and estimation of the power consumption on device level within each frame using a machine learning based model, which can either be one model per device following the "one vs. all" approach [57] or a multi-class device identification model [58]. The architecture of the baseline one-stage NILM approach based on regression estimators of power consumption is presented in Figure 1.
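The additive model described above can be illustrated with a minimal sketch: assuming the aggregate is the sum of the known appliances plus ghost power, the ghost component is simply the residual of the aggregate after subtracting the known appliances. The function name `ghost_power` and the toy readings are hypothetical and are not taken from the evaluated datasets.

```python
def ghost_power(p_agg, known):
    """Residual power per sample: aggregate minus the sum of all known appliances."""
    return [agg - sum(dev[t] for dev in known) for t, agg in enumerate(p_agg)]

# Toy readings in watts (illustrative values only).
p_agg = [150.0, 1250.0, 1180.0]   # central smart meter
fridge = [100.0, 100.0, 0.0]      # known appliance 1
kettle = [0.0, 1100.0, 1100.0]    # known appliance 2

p_ghost = ghost_power(p_agg, [fridge, kettle])
print(p_ghost)
```

In this sketch the residual collects everything the sub-meters do not explain, which is the role the text assigns to the ghost device.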
Specifically, the one-stage NILM methodology consists of preprocessing, feature extraction, and a regression model for estimating the appliances' power consumption $\hat{P}$. During preprocessing the aggregated signal is first filtered in order to remove peaks, as proposed in [59], and frame-blocked into time frames $h_t$ of length $L$; a feature vector $v_t \in \mathbb{R}^K$ is then calculated for each frame $h_t$, where $1 \le t \le T$ and $T$ is the last frame of the aggregated signal. Finally, a regression model is used to estimate the power consumption values $\hat{P} = \{\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_{M-1}, \hat{p}_g\}$ for each of the $M$ devices. The estimation of each device's power consumption can be done either using one regression model per device in parallel or using one regression model with $M$ output estimations.
In this work, the one-stage NILM methodology is extended to two stages. The first stage consists of classifiers (device detectors) that process the aggregated signal in parallel, each producing a binary device-specific detection score, while the second stage consists of regression fusion models that estimate the power consumption of each appliance using as input the stage I results concatenated with the feature vector. The architecture of the proposed two-stage methodology is presented in Figure 2.
In detail, during stage I the feature vectors are initially processed by a set of $M$ classification models $C = \{c_1, c_2, \ldots, c_{M-1}, c_g\}$, one for each of the $M-1$ known devices and one for the unknown ghost power, according to the "one vs. all" approach. The output before the last layer of stage I, $\hat{P}' = \{\hat{p}'_1, \hat{p}'_2, \ldots, \hat{p}'_{M-1}, \hat{p}'_g\}$, is the classification score for each of the $M$ devices:

$$\hat{p}'_m = c_m(v_t) \tag{3}$$

where $c_m$ is the classification model for the $m$th device and $v_t$ is the feature vector as calculated in the feature extraction stage. The predicted class is the one with the highest score $\hat{p}'_m$. To get the binary decision at the end of stage I, a threshold $\Theta$ is applied to transform the initial classification scores $\hat{p}'_m$ to their binary representation, thus labeling whether a device is working (1) or not (0):

$$d'_m = \begin{cases} 1, & \hat{p}'_m > \Theta \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

Subsequently, the initial binary estimations $D' = \{d'_1, d'_2, \ldots, d'_{M-1}, d'_g\}$ with $D' \in \mathbb{R}^M$ from stage I are concatenated with the feature vector $v_t$ into a new feature vector $V_t = \{D' \mid v_t\} \in \mathbb{R}^{K+M}$, so as to estimate the power consumption of the $M$ appliances. Specifically, in the second stage $M$ fusion models $R = \{r_1, r_2, \ldots, r_{M-1}, r_g\}$ with $R \in \mathbb{R}^M$ receive as input the new feature vector $V_t$ and give a numerical estimation (regression) of the power consumption of each of the $M$ devices:

$$\hat{p}_m = r_m(V_t) \quad \text{s.t.} \quad \hat{p}_m \in [0, \ldots, \max(h_t)] \tag{5}$$

The initial binary estimates of device operation $D'$ from the first stage are used by the regression models of the second stage to model any power consumption correlations between the different appliances, i.e., the devices that are likely to work simultaneously within the time frame $h_t$. Additionally, the restriction in Equation (5) ensures that the predicted power consumption of each single device $\hat{p}_m$ at frame instance $t$ cannot exceed the aggregated power consumption within that frame.
Energies 2020, 13, 2148
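The data flow of the two stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the detectors and regressors are hypothetical stand-in functions rather than trained models, the comparison with the threshold is assumed to be strict, and clipping is one assumed way of enforcing the constraint that an estimate lies between zero and the frame maximum.

```python
def two_stage_estimate(v_t, h_t, detectors, regressors, theta):
    """Estimate per-device power for one frame with a detect-then-fuse scheme."""
    # Stage I: one detection score per device, all computed from the same features.
    scores = [c(v_t) for c in detectors]
    # Threshold the scores into binary flags (1.0 = working, 0.0 = not working).
    flags = [1.0 if s > theta else 0.0 for s in scores]
    # Fusion input: binary flags concatenated with the original feature vector.
    fused = flags + list(v_t)
    # Stage II: one regressor per device. Clipping (an assumption of this sketch)
    # keeps every estimate between 0 and the maximum aggregate power in the frame.
    cap = max(h_t)
    return [min(max(r(fused), 0.0), cap) for r in regressors]

# Hypothetical stand-ins for trained models, for illustration only.
detectors = [lambda v: v[0], lambda v: v[1]]
regressors = [lambda V: 200.0 * V[0], lambda V: 50.0 * V[1] - 5.0]

v_t = [120.0, 10.0]    # toy 2-dimensional feature vector
h_t = [100.0, 140.0]   # toy frame of aggregate power samples (W)
estimates = two_stage_estimate(v_t, h_t, detectors, regressors, theta=25.0)
print(estimates)
```

With these toy inputs the first regressor overshoots and is capped at the frame maximum, while the second produces a negative value and is raised to zero, which shows the effect of the constraint on the stage II outputs.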
The proposed methodology combines binary device estimates from a first classification stage with a second regression fusion stage, thus any complementary information from the first stage will be captured and learned by the fusion model.Moreover, with the existence of ghost power in the first level, the output of the binary classifiers will be used as a feature for the detection of unknown devices, which offers advantage to the present methodology in real set-up evaluations where unknown devices exist quite often.

Experimental Set-up
A detailed description of the databases used to evaluate the one-stage and the proposed two-stage fusion methodology as well as the description of the parameterization of the machine learning algorithms are provided in this section.

Evaluation Data
To evaluate the proposed methodology presented in Section 2, the data collections Electricity Consumption & Occupancy (ECO) [59], Reference Energy Disaggregation Data Set (REDD) [60], and Indian Dataset for Ambient Water and Energy (iAWE) [61], which are freely accessible online, were used, as they contain low frequency samples of the aggregated data as well as individual power measurements for each device. The three databases consist of several datasets with different monitored houses in each. For the present evaluation, houses 1, 2, and 4-6 of the ECO database were used, while the ECO-3 dataset was not used because it does not include the power consumption signals of each appliance but only the aggregated signal. Further, house 5 of the REDD database was excluded, as its measurement duration is significantly shorter than that of the other datasets in the REDD database [62]. The datasets used in the present evaluation are shown in Table 1, with column "#App" tabulating the total number of appliances (App) in each dataset and, in brackets, the number of devices with power consumption above 25 W, with the remaining ones considered as the "ghost device", in alignment with the experimental protocol introduced in [57,58]. The remaining columns of Table 1 list the sampling period $T_s$, the duration $T$, and the device types included in every dataset. The REDD database was utilized in full, ignoring the gaps in the measurements as in [63]. For the ECO and iAWE databases, one week of energy consumption recordings was used so that the size of the training data is similar to that of the REDD dataset. Specifically, we used the week from 05/07/2012 until 11/07/2012 for the ECO database and the week from 08/06/2013 until 14/06/2013 for the iAWE database. These weeks were chosen with the intention of having as many appliances as possible in the selected time interval of the aggregated signal. Note that in [59,64], where the ECO and the iAWE databases were also used, the selected time interval was not specified.
The classification of device types is based on their operation as described in [65,66], i.e., one-state electrical appliances have only on/off status (for example resistive lamps, kettles, or fridges without significant power spikes), multi-state devices have a number of discrete power consumption states (e.g., washing machines with numerous washing cycles), nonlinear devices (e.g., electronics) have no fixed states, and electrical appliances with a continuous power consumption pattern are controlled by power electronics (e.g., air conditioners) and usually have an exponential decay signature. The device signatures may present an amplitude peak at the beginning of the signature, as, for example, in the case of refrigerators. An example power signature for each of the four device categories was extracted from the REDD database and is illustrated in Figure 3.
As can be seen from Table 1, the evaluated datasets vary in terms of number of appliances, monitoring duration, and appliance type, and therefore represent the various characteristics of present-day households [59,60]. All evaluated datasets have a low sampling rate in the order of seconds, and only the active power samples of the aggregated signal are utilized, offering a good trade-off between computational load and real-time operation [64].

Parameterization and Feature Selection
During preprocessing, a median filter of five samples was applied to the aggregated signal for smoothing, as proposed in [59]. The preprocessed signal was then segmented into overlapping frames of length $L = 10$ samples with a time shift of 5 samples between successive frames. The optimal number of samples per frame was determined through grid search on a bootstrap dataset with ideal aggregated data (without ghost power), consisting of one dataset from each database (ECO-2, REDD-2, and iAWE), similar to [67,68].
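The preprocessing step can be sketched with the standard library alone. The window of five samples, the frame length of 10, and the shift of 5 come from the text; the shrinking window at the signal edges and the dropping of an incomplete final frame are assumptions of this sketch, since the boundary handling is not specified.

```python
from statistics import median

def median_filter(signal, width=5):
    """Sliding median; the window shrinks at the signal edges (an assumption)."""
    half = width // 2
    out = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        out.append(median(signal[lo:hi]))
    return out

def frame_signal(signal, length=10, shift=5):
    """Overlapping frames of `length` samples; an incomplete last frame is dropped."""
    return [signal[s:s + length] for s in range(0, len(signal) - length + 1, shift)]

# Toy aggregate signal: flat 100 W with a single-sample spike.
aggregate = [100.0] * 25
aggregate[12] = 3000.0

smoothed = median_filter(aggregate)
frames = frame_signal(smoothed)
print(len(frames), frames[1])
```

A single-sample spike never forms the majority of any five-sample window, so the median filter removes it before framing, and the shift of 5 gives a 50% overlap between successive frames.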
All devices with constant power consumption of less than 25 W were removed from the datasets and added to the ghost power, while the aggregated data was not modified. This ensures that training and testing were done with real measurements of the aggregated data and not with an artificial dataset created by summing the consumption of all appliances [69]. The set of binary classifiers $C$ was trained, one for every device $m$ and separately for each dataset, according to the "one vs. all" approach, with the threshold set to $\Theta = 25$ W for all appliances. During the training phase the set of features $v_t$ was determined from a time window of active power samples $h_t$, and the Min/Max, Mean, Energy, RMS, Percentiles25/75, Median, Zero Crossing rate, Peak2Rms, Range, Standard Deviation, Skewness, Kurtosis, and Variance values were extracted according to their statistical importance as determined by the ReliefF algorithm [70], resulting in a $K = 15$ dimensional feature vector similar to [71,72]. Specifically, Mean, Energy, and RMS were used to model steady-state behavior, while Min/Max, Percentile75/25, Median, Zero Crossing rate, Peak2Rms, Range, Standard Deviation, Skewness, Kurtosis, and Variance were used to model the transient behavior and the variation within the frames [73]. As all databases are sampled with relatively low sampling frequencies, the feature vector only contains steady-state features.
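A minimal sketch of the 15 statistical features for one frame is given below. The feature names follow the text, but the exact definitions are assumptions of this sketch: energy as the sum of squares, percentiles with linear interpolation, population moments for variance, skewness, and kurtosis, and a zero crossing rate computed on the mean-removed frame (active power itself does not change sign). The ordering of the vector is also arbitrary.

```python
import math
from statistics import mean, median, pvariance

def percentile(sorted_x, q):
    """Percentile q (0-100) of a pre-sorted list, with linear interpolation."""
    pos = (len(sorted_x) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_x[lo] + (sorted_x[hi] - sorted_x[lo]) * (pos - lo)

def frame_features(frame):
    """Return a 15-dimensional statistical feature vector for one frame (len >= 2)."""
    n = len(frame)
    mu = mean(frame)
    var = pvariance(frame, mu)
    std = math.sqrt(var)
    energy = sum(x * x for x in frame)
    rms = math.sqrt(energy / n)
    s = sorted(frame)
    centred = [x - mu for x in frame]
    # Sign changes of the mean-removed signal between neighbouring samples.
    crossings = sum(1 for a, b in zip(centred, centred[1:]) if a * b < 0)
    # Standardised third and fourth moments; defined as 0 for a constant frame.
    skew = sum(c ** 3 for c in centred) / (n * std ** 3) if std > 0 else 0.0
    kurt = sum(c ** 4 for c in centred) / (n * var ** 2) if std > 0 else 0.0
    peak = max(abs(x) for x in frame)
    return [
        s[0], s[-1], mu, energy, rms,                 # Min, Max, Mean, Energy, RMS
        percentile(s, 25), percentile(s, 75),         # Percentiles 25/75
        median(frame), crossings / (n - 1),           # Median, Zero Crossing rate
        peak / rms if rms > 0 else 0.0,               # Peak2Rms
        s[-1] - s[0], std, skew, kurt, var,           # Range, Std, Skewness, Kurtosis, Variance
    ]

frame = [0.0] * 5 + [100.0] * 5   # toy frame containing one switch-on event
features = frame_features(frame)
print(len(features))
```

Counting Min and Max as well as the two percentiles separately gives 15 values, which matches the dimensionality $K = 15$ stated in the text.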
Similarly, the regression fusion models were trained using the intermediate binary scores from the first stage, $D'$, as well as the original feature vector $v_t$. In detail, $D'$ and $v_t$ were concatenated into a single feature vector and used to train the set of fusion regression models $R$, one for each of the $M$ devices. Both the one-stage architecture (Figure 1) and the proposed two-stage fusion architecture (Figure 2) were trained on the first half of each dataset and tested on the second half, thus without overlap between training and test subsets.
For building the models of the one-stage and two-stage architectures, Deep Neural Networks (DNNs), K-Nearest-Neighbors (KNNs), Decision Trees (DTs) in a Random Forest (RF) implementation, and Support Vector Machines (SVM) were used. A short description and the free parameters of the evaluated classifiers are tabulated in Table 2. The values of the adjustable parameters of the evaluated regression algorithms were fine-tuned empirically by performing grid search on a bootstrap subset of the training data, composed of the ECO-1/2/4/5/6 datasets, which did not include any ghost power. The performance was evaluated in terms of appliance power estimation accuracy ($E_{ACC}$), as proposed in [60] and defined in Equation (6):

$$E_{ACC} = 1 - \frac{\sum_{t=1}^{T} \sum_{m=1}^{M} \left| \hat{p}_m^t - p_m^t \right|}{2 \sum_{t=1}^{T} \sum_{m=1}^{M} \left| p_m^t \right|} \tag{6}$$

where $\hat{p}_m$ is the estimated power, $p_m$ the ground-truth power consumption of the $m$th device, $T$ denotes the total number of frames, and $M$ is the number of electrical appliances including the ghost power. The results of the free parameter optimization of the regression models with respect to the power estimation accuracy $E_{ACC}$ at the end of the one-stage architecture, $\hat{p}_m$, are shown in Table 2. As shown in Table 2, the optimized parameters (in bold) of the regression models are a DNN model with 3 hidden layers and 32 sigmoid nodes per layer, a KNN with K = 5 nearest neighbors, an RF with 32 trees per forest, and an SVM with a Radial Basis Function (RBF) kernel with gamma equal to 12.8 and C equal to 1.45. The DNN model achieved an accuracy of 88.7% and outperformed all other evaluated regression models on the bootstrap subset of the training data.
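The accuracy metric can be computed as in the following sketch. It assumes the form of the estimation accuracy introduced with the REDD dataset [60]: one minus the total absolute estimation error, divided by twice the total ground-truth power, summed over all frames and appliances. The function name and the toy values are illustrative.

```python
def estimation_accuracy(p_true, p_est):
    """Overall estimation accuracy; inputs are lists (appliances) of lists (frames)."""
    abs_err = sum(
        abs(est - ref)
        for dev_true, dev_est in zip(p_true, p_est)
        for ref, est in zip(dev_true, dev_est)
    )
    total = sum(abs(ref) for dev_true in p_true for ref in dev_true)
    return 1.0 - abs_err / (2.0 * total)

# Two appliances, three frames each (toy values in watts).
p_true = [[100.0, 100.0, 0.0], [0.0, 50.0, 50.0]]
p_est = [[90.0, 110.0, 0.0], [10.0, 50.0, 40.0]]

e_acc = estimation_accuracy(p_true, p_est)
print(round(100 * e_acc, 1))
```

The factor of two in the denominator reflects that power assigned to the wrong appliance is counted twice, once as missing from one device and once as excess in another.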

Experimental Results
The NILM methodology described in Section 2 was tested based on the experimental protocol presented in Section 3, using the parameter optimization results of Table 2. To evaluate NILM accuracy on electrical appliance level, Equation (6) was modified by removing the sum across the $M$ appliances, thus resulting in

$$E_{ACC}^{m} = 1 - \frac{\sum_{t=1}^{T} \left| \hat{p}_m^t - p_m^t \right|}{2 \sum_{t=1}^{T} \left| p_m^t \right|} \tag{7}$$

The experimental results in terms of $E_{ACC}$ (%) for all evaluated datasets, all evaluated classification algorithms, and both the one-stage and the proposed two-stage architecture are tabulated in Table 3. The best performing energy disaggregation scores per dataset are indicated in bold for both the one-stage and the two-stage results. As shown in Table 3, the best performing classifier across the tested datasets when using the one-stage architecture is RF, which outperforms all other classifiers except on the iAWE dataset, where the SVM classifier achieves significantly higher energy disaggregation performance. Furthermore, the results in Table 3 show that the two-stage fusion methodology improves the overall $E_{ACC}$ performance across all evaluated datasets. In terms of average improvement per dataset, $E_{ACC}$ increases between 0.6% and 4.1% depending on the dataset and the classifier. The most significant improvements in terms of relative performance were observed when using DNN as classifier, where performance was improved by 4.1% (REDD-2 dataset). The improvement in terms of absolute $E_{ACC}$ values, i.e., the average increase in estimation accuracy when considering the best experiment for the first stage as the baseline performance, ranges between 0.6% and 3.4% when using SVM and RF as classifiers, and the results were statistically significant when comparing the accuracy scores on frame level of the one-stage and the two-stage fusion architectures. In detail, RF outperformed SVM in ten out of eleven datasets, the exception being the iAWE database, which is probably due to its significantly higher proportion of continuous appliances. This is in line with results in the literature reporting high accuracies for SVM in the case of appliances with strongly time-varying behavior [73,74]. The evaluation results demonstrate the validity of the proposed method, as it offered improved performance when tested on several highly dissimilar datasets (with respect to the sampling rate $f_s$ and the number and type of devices), as presented in Section 3 and shown in Table 1.
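A per-device variant of the metric follows directly from dropping the sum over appliances. The sketch below assumes the same accuracy form as in [60] and assumes that every device consumes some energy in the evaluated period, so the denominator is non-zero; the device names and values are illustrative.

```python
def per_device_accuracy(p_true, p_est):
    """Estimation accuracy per device; inputs map device name -> list of frame powers."""
    result = {}
    for name, ref in p_true.items():
        abs_err = sum(abs(e - r) for r, e in zip(ref, p_est[name]))
        total = sum(abs(r) for r in ref)  # assumed non-zero for every device
        result[name] = 1.0 - abs_err / (2.0 * total)
    return result

# Toy ground truth and estimates in watts.
p_true = {"fridge": [100.0, 100.0, 0.0], "tv": [0.0, 50.0, 50.0]}
p_est = {"fridge": [90.0, 110.0, 0.0], "tv": [10.0, 50.0, 40.0]}

acc = per_device_accuracy(p_true, p_est)
print({name: round(100 * value, 1) for name, value in acc.items()})
```

Because each device is normalized by its own consumption, the same absolute error weighs more heavily on a low-power device than on a high-power one, which is worth keeping in mind when comparing device-level scores.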
In a next step we analyzed the energy disaggregation performance on device level for one dataset from each database. Table 4 tabulates the $E_{ACC}$ on device level for the ECO-2, REDD-2, and iAWE datasets. The three datasets were chosen according to the characteristics shown in Table 1. Specifically, datasets were chosen which have roughly the same number of appliances (<10) and a similar collection of appliances, thus containing appliances of the same type.
Table 4. $E_{ACC}$ (%) of the one-stage (I) and the proposed two-stage fusion (II) architecture using the best performing classifier (RF) selected from the per-dataset results. The superior method is given in bold, while in the column "category" appliances with a significant power spike are marked as "PS".
As can be seen in Table 4, there is a relation between performance improvement and appliance category: one/multi-state devices without a significant power peak in their signature show no performance improvement, whereas nonlinear and continuous appliances as well as one-state appliances with a significant power peak show significant performance improvement. Depending on the dataset, the performance increase is up to 0.4% for one/multi-state devices without power spikes, up to 7.4% for devices with a power spike, up to 10.1% for nonlinear devices, and up to 4.9% for continuous devices. In detail, the highest performance increase in the three tested datasets was observed for nonlinear appliances, namely the TV (10.1%) and the Entertainment (7.7%) in the ECO-2 dataset. A significant increase in performance was also observed for devices with power spikes (PS) in their signature, like the Fridge, the Freezer, and the A/C, with maximum improvements of 7.4%, 3.9%, and 4.9%, respectively. The lowest or no performance improvement was observed for one-state appliances without power spikes, e.g., resistive lamps or the disposal.

In order to directly compare the proposed methodology with other approaches proposed in the literature, we additionally tested our method on five selected loads from the REDD-2 dataset, namely the refrigerator, lighting, dishwasher, microwave, and furnace. These loads were used in [55] because they carry a large percentage of the overall consumed energy, and they have been used in other publications [67,75]. Furthermore, the disaggregation results were evaluated both in a noisy (with ghost data) and a noiseless (with synthetic data) setup as in [75] for both the one-stage and the proposed two-stage fusion architecture. The results are tabulated in Table 5. Table 5 shows that the presented two-stage fusion model outperforms the baseline one-stage system in both the noisy and the noiseless setup, with 93.4% (2.7% improvement) and 95.7% (2.5% improvement), respectively. Moreover, the largest improvements can be observed for the appliances with significant power spikes and nonlinear behavior, i.e., the fridge and the light with 13.0% (6.7%) and 2.8% (3.7%), respectively.

For the purpose of comparison with previously published NILM approaches, the summary of methods using the same databases and the E_ACC performance metric presented in [76] was used. Furthermore, the summary of results of [76] was updated by incorporating very recent results found in the literature utilizing deep learning. However, many of the latest published deep learning approaches utilize databases with even lower sampling frequency and longer monitoring duration (e.g., AMPds [39] or UK-DALE [77]) as in [41,42,44,78], or utilize different accuracy metrics (e.g., normalized RMSE in [45]), making direct comparison impossible. The results are tabulated in Table 6.
Table 6 shows that the two-stage fusion methodology achieves higher accuracy than all other published methods evaluated on the REDD datasets 1-4 and 6. As regards the experimental setup using five appliances of the REDD-2 dataset (initially proposed in [55]), the proposed fusion architecture performs better than all reported NILM methods, except the method of Makonin et al. [75] utilizing HMM sparsity, which achieved 1.4% higher accuracy than our proposed fusion methodology in the noisy set-up; however, the energy data used in [75] were manually modified to time align data acquired from two different smart meter devices, while we have used the original data from the REDD-2 dataset without any modification. Moreover, for the approach presented in [75], the performance on the full REDD dataset with all 18 appliances across all houses (1, 2, 3, 4, and 6) has not been reported in the literature, and thus direct comparison with our approach is possible only using the REDD-2 dataset with five devices. The results presented in [40] are not directly comparable with our approach (which performs 8.8% better), as a modified training/test setup has been used there. To compare our performance with the one reported in [45], we calculated the normalized RMSE used in [45]. Our proposed methodology has a normalized RMSE equal to 0.24, which is 0.11 better than the score reported in [45].

Considering the results from Tables 3 and 4, the proposed two-stage fusion methodology demonstrated improvements both in average and per device performance across all evaluated datasets with all evaluated classifiers, demonstrating the validity of the methodology. As regards the effect of different datasets when using the same classifier, the improvement in terms of E_ACC varies between 0.6% and 4.1%, as can be seen in Table 3. The main reasons are the different number of devices in each dataset and the distribution of appliance types, i.e., how many appliances of a specific type (e.g., one-state or nonlinear) can be found in each dataset. Considering the results in Table 3 in combination with the database categorization in Table 1, it can be seen that datasets with a small number of appliances (e.g., ECO-1 or REDD-2) have a slightly higher improvement in estimation accuracy, of approximately 1.0-4.1%, while datasets with a larger number of appliances (e.g., REDD-1 or REDD-3) show improvements of up to 1.6%. Moreover, the datasets including a significant number of continuous or nonlinear appliances (e.g., ECO-2 or iAWE) benefit more from the two-stage fusion architecture.

Continuous or nonlinear devices may be highly correlated with the daily routine of the users/consumers and may also have dependencies between them, e.g., the Entertainment appliances, which in the general case are interconnected with the TV. For electrical appliances that depend on other devices or on the residents' everyday routine, the a priori information that devices operate together or follow similar everyday routine patterns (e.g., most of the time working or not working at the same time) can boost the estimation of the power consumption of those devices. For such appliances, power consumption estimation can be improved by the proposed two-stage fusion methodology, in which estimates of the operation of other devices (identified at the first stage of the proposed architecture) are utilized. In addition, energy consumption estimation for appliances presenting power spikes, i.e., peaks that appear during the switching on of electrical motors, e.g., in fridges or freezers, was found to be improved by the fusion stage of the proposed NILM architecture, given that the existence of a power spike in a frame changes the total amount of energy to be disaggregated. Therefore, it is beneficial to have an initial estimate of which appliances are likely to be working (calculated from the first stage in the two-stage architecture), in order to discriminate power spikes from appliances with constant high power consumption.
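The argument that co-operating devices carry a priori information about each other can be made concrete with a small calculation. The sketch below, which uses hypothetical on/off sequences and a hypothetical helper name, estimates how often one device is on given that another is on or off. A large gap between the two rates indicates that a first-stage detection of one device is informative for the other.

```python
def conditional_on_rate(states_a, states_b):
    """Estimate P(B on | A on) and P(B on | A off) from binary sequences."""
    b_when_a_on = [b for a, b in zip(states_a, states_b) if a == 1]
    b_when_a_off = [b for a, b in zip(states_a, states_b) if a == 0]
    p_on = sum(b_when_a_on) / len(b_when_a_on) if b_when_a_on else float("nan")
    p_off = sum(b_when_a_off) / len(b_when_a_off) if b_when_a_off else float("nan")
    return p_on, p_off


# Hypothetical frame-wise on/off states (1 = on), not taken from any dataset.
tv            = [0, 0, 1, 1, 1, 0, 1, 0]
entertainment = [0, 0, 1, 1, 0, 0, 1, 0]

p_on, p_off = conditional_on_rate(tv, entertainment)
print(p_on, p_off)  # 0.75 0.0
```

In this toy case the entertainment device is on in three of the four frames where the TV is on and never when the TV is off, so knowing the TV state narrows down the entertainment state considerably.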

Table 6. Comparison of E_ACC (%) values for recently proposed NILM methodologies (methods marked with an asterisk are not directly comparable because of a dataset transferability set-up used in [40] and a slight change in the accuracy metric in [4,5]).

It was shown in Tables 3-6 that the two-stage fusion methodology improved the estimation accuracy across all datasets. In particular, Table 4 showed that the two-stage fusion methodology yields a higher performance increase for appliances with power spikes as well as for nonlinear and continuous appliances. In Table 5, the results were compared to state-of-the-art literature for five selected appliances for both the one-stage and the proposed two-stage architecture, while a comparison of average estimation accuracy scores was presented in Table 6, showing the improvement of the method when using the complete dataset.

Conclusions
In this paper, a two-stage fusion energy disaggregation approach for non-intrusive load monitoring was presented. The fusion approach combines multiple classifiers, each producing a binary detection score in the first stage of the architecture, and then fuses these initial binary estimates in a second stage to enhance the energy disaggregation accuracy. The proposed architecture was evaluated on three different databases using four different classification algorithms and proved to increase the power estimation accuracy for all evaluated databases and classifiers, with Random Forests outperforming all other classifiers. Specifically, the proposed two-stage fusion methodology achieved an improvement of up to 3.4% among the evaluated datasets, and on device level the estimation accuracy was improved by up to 10.1% when compared to the best performing baseline non-intrusive load monitoring setup. As regards different appliance types, the two-stage methodology significantly improved the power consumption estimation accuracy of continuous and nonlinear devices as well as of appliances with high power spikes. The proposed two-stage fusion methodology demonstrated robust performance across several datasets with different characteristics and types of devices and estimated well the ghost power produced by unknown devices, which is common in households, demonstrating its appropriateness for real-life setups.

Non-intrusive load monitoring is a difficult task, especially when considering nonlinear and continuous appliances. With the growing use of smart meters, large amounts of energy data covering several continuous years of recordings are anticipated to be collected in the coming years, based on which deep learning approaches will be used to develop device identification and energy consumption models. Another future direction is the incorporation of temporal information into the device models to further improve disaggregation accuracy, especially in the case of appliances with strongly time-varying behavior.
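The two-stage data flow summarized above can be sketched in a few lines. This is only an illustration under stated assumptions: plain least-squares models stand in for the Random Forest classifiers and regressors evaluated in the paper, the data are synthetic, the polynomial frame features are hypothetical, and feeding the second stage both the features and the detection scores is one possible fusion choice rather than the confirmed design.

```python
import numpy as np


def with_bias(X):
    """Append a constant column so the linear models have an intercept."""
    return np.hstack([X, np.ones((X.shape[0], 1))])


def fit_linear(X, Y):
    """Least-squares fit (stand-in for the classifiers/regressors of the paper)."""
    W, *_ = np.linalg.lstsq(with_bias(X), Y, rcond=None)
    return W


def predict_linear(X, W):
    return with_bias(X) @ W


rng = np.random.default_rng(0)
n_frames = 500
rated_power = np.array([120.0, 60.0, 900.0])  # hypothetical appliances (W)
n_devices = rated_power.size

# Synthetic ground truth: on/off states and per-device power per frame.
states = (rng.random((n_frames, n_devices)) < 0.4).astype(float)
power = states * (rated_power + rng.normal(0.0, 5.0, (n_frames, n_devices)))
# Aggregate signal including unmodelled ("ghost") consumption as noise.
aggregate = power.sum(axis=1) + rng.normal(0.0, 10.0, n_frames)

# Hypothetical frame features derived from the aggregate signal (in kW).
a = aggregate / 1000.0
features = np.column_stack([a, a**2, a**3])

# One-stage baseline: regress device power directly on the features.
est_base = predict_linear(features, fit_linear(features, power))

# Stage I: one detector per device, producing a score clipped to [0, 1].
W_det = fit_linear(features, states)
scores = np.clip(predict_linear(features, W_det), 0.0, 1.0)

# Stage II: fusion regression on the features plus all detection scores.
fused_inputs = np.hstack([features, scores])
est_fused = predict_linear(fused_inputs, fit_linear(fused_inputs, power))

mse_base = float(np.mean((est_base - power) ** 2))
mse_fused = float(np.mean((est_fused - power) ** 2))
print(f"training MSE one-stage: {mse_base:.1f}, two-stage: {mse_fused:.1f}")
```

On the training data the fused least-squares model cannot fit worse than the baseline, because its inputs are a superset of the baseline inputs; a held-out split would be needed to judge any real gain.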

Figure 1. Block diagram of the baseline NILM architecture consisting of preprocessing, feature extraction, and regression estimation of power consumption.


Figure 2. Block diagram of the proposed two-stage energy disaggregation methodology. In detail, during stage I the feature vectors are initially processed by a set of classification models, one for each of the known devices and one for the unknown …


Figure 3. Examples of appliance signatures for (a) one-state appliance with significant peak (refrigerator), (b) multi-state appliance without significant peak (dishwasher), (c) nonlinear appliance (laptop), and (d) continuous appliance with decay (air-conditioning) from the REDD database.


Table 1. Overview of the evaluated datasets.

Table 3. Performance of energy disaggregation in terms of E_ACC (%) for different datasets using the one-stage (I) and the proposed two-stage fusion methodology (II).

Table 4. Per device performance E m

Table 5. Performance evaluation E_ACC (%) for five selected appliances from the REDD-2 dataset for both one-stage (I) and proposed two-stage fusion (II) methodology.