Power Profile and Thresholding Assisted Multi-Label NILM Classification

Abstract: Next-generation power systems aim at optimising the energy consumption of household appliances by utilising computationally intelligent techniques, referred to as load monitoring. Non-intrusive load monitoring (NILM) is considered to be one of the most cost-effective methods for load classification. The objective is to segregate the energy consumption of individual appliances from their aggregated energy consumption. The extracted energy consumption of individual devices can then be used to achieve demand-side management and energy saving through optimal load management strategies. Machine learning (ML) has been popularly used to solve many complex problems, including NILM. With the availability of energy consumption datasets, various ML algorithms have been effectively trained and tested. However, most of the current methodologies for NILM employ neural networks only for a limited operational output level of appliances and their combinations (i.e., only for a small number of classes). On the contrary, this work depicts a more practical scenario where over a hundred different combinations were considered and labelled for the training and testing of various machine learning algorithms. Moreover, two novel concepts, i.e., thresholding/occurrence per million (OPM) along with power windowing, were utilised, which significantly improved the performance of the trained algorithms. All the trained algorithms were thoroughly evaluated using various performance parameters. The results demonstrate the effectiveness of the thresholding and OPM concepts in classifying concurrently operating appliances using ML.


Introduction
There has been a greater focus on utilising renewable energy resources since the Kyoto Protocol in order to reduce the effects of greenhouse gases and global warming and to reduce our carbon footprint. The integration of unpredictable renewable energy into the power grid acts as the driving force behind the evolution of the existing grid system into the smart grid, characterised by bi-directional power flow, control and two-way communication. This provides an opportunity to achieve enhanced energy efficiency through user participation. To implement this, next-generation power systems intend to exploit artificial intelligence and machine learning to design sustainable energy systems. These systems are likely to work with smart grids to optimise energy consumption [1]. The concept of smart building energy management systems (SBEMs) has become popular and is consistent with the smart grid concept [2]. The purpose of an SBEM is to optimise the energy consumption of a building. Home appliances can be controlled efficiently by monitoring the cost of a consumer's energy usage, yielding a better return on investment for the utility provider. To realise this, we need to obtain the energy consumption of individual appliances. Some modern appliances can communicate with smart meters so that per-appliance energy consumption can be obtained automatically; however, these appliances are expensive. Additionally, existing appliances that do not have this capability require alternative methods/strategies to classify the energy consumption of individual appliances. There are two types of strategies: intrusive and non-intrusive. Intrusive load monitoring (ILM) requires additional hardware and/or a complex network to be installed to measure the energy consumption of individual devices, which introduces complexity and additional system cost, making it infeasible in many circumstances.
NILM, on the other hand, extracts appliance-level information from the aggregated energy consumption measured at a single point. Therefore, NILM presents an effective solution that relies on computationally intelligent techniques such as machine learning [3]. This paper focuses on this second technique, commonly known as non-intrusive load monitoring (NILM). Typical time-synchronised power profiles of a few appliances with respect to the main meter are highlighted in Figure 1. Randomly operating concurrent appliances generate multiple levels of power output. The operating ON and OFF states of appliances are labelled as sequences of 1s and 0s, respectively. The length of the code is determined by the number of appliances in the household.
To perform the segregation of the power consumption of appliances, one of the fundamental tasks is to obtain and understand the features of the appliances. These features can represent various types of appliance data such as on/off trends, voltage and current, power consumption (real, reactive and apparent) and its temporal variations. There are various types of appliances, each with its own power profile. It is easy to segregate two devices with very different power profiles; however, if several appliances have approximately similar power profiles, then segregation becomes a difficult task. Moreover, devices that consume very low power are also a problem for classification since such low-level power can often be regarded as noise. One of the ways to tackle this is to observe the temporal variations of the power profiles for long periods of time [2].
Most of the recent works on NILM focus on utilising complex learning algorithms such as deep neural networks (DNNs) and convolutional neural networks (CNNs) [1][2][3]. Though these algorithms are effective, they require substantial computational resources and considerable training time. Since the research is moving towards the green machine learning paradigm, we require learning algorithms that are simpler and can be trained on relatively smaller datasets than DNNs and CNNs require. In this paper, we obtain satisfactory results from less complex algorithms. Moreover, several machine learning models are trained for a comprehensive comparative analysis. To tackle the classification problems discussed above and reduce the amount of data required to train the model, we introduce two novel concepts, i.e., thresholding (TH) and occurrence per million (OPM), which are detailed in Section 3 of this work. These concepts have two-fold advantages: (i) simplicity of training and (ii) effectively addressing the non-uniform distribution of practical datasets. The proposed model is robust against a large variety of appliances, as opposed to classifying only a few devices as normally showcased in the research. The ML models considered in this study utilise the concepts of TH/OPM and power windowing for training and are thoroughly tested and verified on a publicly available dataset known as the Reference Energy Disaggregation Dataset (REDD) [4] (http://redd.csail.mit.edu/, last accessed: 13 November 2021). The trained models are comprehensively evaluated using various well-known performance metrics. The main contributions of this work are enumerated as follows:
• Without using a large number of datasets or conventional DNN and CNN approaches, we have trained, tested and validated several classifiers based on a real-world dataset (REDD) that can achieve accuracies up to 98%.

• As opposed to recent research works that consider only a few appliances and their combinations, we have considered over a hundred combinations of various appliances to train the ML classifiers. This reflects a more practical scenario.

• We introduce the concepts of TH and OPM to tackle the non-uniformity of the dataset. Various randomly selected values of TH and OPM are used to demonstrate the effect of their usage on various performance parameters. The results show a significant performance improvement with the utilisation of these concepts.

• A comprehensive comparative study of various machine learning classifiers is presented.

Literature Review
There has been increasing interest in devising energy-efficient techniques for load monitoring. NILM has been described as one of the techniques that can be leveraged to provide energy-efficient solutions [5]. The energy segregation is obtained via the signatures of each appliance, which include but are not limited to power consumption, on/off status and temporal variations. The goal of NILM is to study and understand these signatures. This represents a significant challenge, since appliances from different manufacturers produce different signatures for the same function. Moreover, a single appliance can have different operating modes in which the power consumption can be very different. Since the power consumption can vary, it is often confused with noise [6]. Various studies have been presented in the literature to circumvent these challenges and apply NILM techniques effectively. Hidden Markov Models (HMMs) and Factorial Hidden Markov Models (FHMMs) [7][8][9][10] are also widely utilised [2,11]. However, one of the problems with these techniques is the requirement of pre-existing knowledge about the number of appliances, which is assumed to be fixed. This may be impractical for many scenarios.
More recently, researchers have produced well-performing approaches based on deep learning algorithms (DLAs). DLAs and recurrent neural networks have been applied for appliance disaggregation [12]. Energy segregation techniques based on CNNs have been proposed as well [13,14]. Another research work based on long short-term memory (LSTM) and the recurrent neural network (RNN) has been proposed in [2], which improves the accuracy of segregation.
A few research works have also investigated combinations of DNNs for segregating appliances. One such approach has been presented in [15], which combines CNNs with autoencoders and improves performance compared to a conventional CNN. The research presented in [16] also utilises a CNN for NILM but with a pre-processed input referred to as a differential input. The differential power utilisation of devices is obtained after pre-processing the data, which becomes the input to the CNN. This improves the classification performance.
Another interesting study has been presented in [17], which distributes the learning network into two parts: the main network and sub-network. The main network uses regression and the sub-network performs the classification of the appliances.
Most of the research works discussed above, and others presented in the literature, typically rely on large datasets to train the ML algorithms. Moreover, most of these solutions predominantly rely on DNN and CNN techniques. Therefore, the computational complexity is large and such solutions can become difficult to implement practically. It is also evident from the literature review that most of the research works consider only a few appliances and their combinations when segregating the aggregated power consumption. However, this does not represent a practical scenario; more appliances and their combinations need to be considered, as they exist in the real world. It should be noted that many of these works focus on classifying a particular appliance working at various power levels, terming this multi-classification. However, the work proposed in this study takes a more practical approach, where all home appliances are working simultaneously and, irrespective of their power levels, these appliances are classified.
It should be noted here that most research works apply either data augmentation or data pre-processing techniques to clean and uniformly distribute the data. These techniques consume more computational resources and complicate the learning process. To cater for this, we introduce two concepts: TH and OPM. These concepts eliminate the requirement for data augmentation and complex pre-processing techniques and also require fewer computational resources for a similar data size during learning.

Training Data Set
The Reference Energy Disaggregation Dataset, commonly known as REDD, is a widely accepted and utilised dataset for NILM techniques [1]. Most of the articles discussed in this work consider the REDD dataset as well. Since it was one of the first openly accessible datasets, it has matured and presents a wide range of information. Information on appliances (labels, power usage, etc.) for several houses is available. The dataset provides appliance-level power consumption along with aggregated power consumption. This dataset is purposely built for the development and evaluation of energy segregation techniques. It is effective because it provides insights into both instantaneous and temporal variations in loads and their power profiles. This dataset is exploited for the training and testing of the algorithms utilised in this paper.

Multiclassification Problem
We approach per-appliance power segregation as a multiclassification problem. Each appliance and each combination of appliances has its own label, and the ML classifiers train on this information to provide classification labels. This work makes use of aggregated power consumption (reflecting the on/off states of different appliances). As opposed to binary classification, this technique requires a sample to be mapped to one of many possible outputs. In particular, this work applies a variety of ML algorithms and performs the multiclassification with a high degree of accuracy.
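As an illustrative sketch (not the paper's actual pipeline), the idea of treating each on/off combination as one class and training a classifier on aggregate power can be written as follows; the three appliance power draws are hypothetical values chosen for the example:

```python
# Illustrative sketch: appliance on/off combinations become class labels and a
# classifier learns to map aggregate power back to the combination.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Hypothetical appliance draws in watts: fridge=200, microwave=1500, light=60.
appliance_power = np.array([200.0, 1500.0, 60.0])

# Random on/off states; each unique state vector is one class, e.g. "101".
states = rng.integers(0, 2, size=(500, 3))
labels = ["".join(map(str, s)) for s in states]

# Aggregate power is the sum of active appliances' draws plus measurement noise.
aggregate = states @ appliance_power + rng.normal(0.0, 5.0, size=500)

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(aggregate.reshape(-1, 1), labels)

# A reading near 200 + 1500 + 60 = 1760 W should map to all three appliances on.
print(clf.predict(np.array([[1760.0]])))
```

Because the combination sums are separated by at least 60 W while the noise is small, even a simple nearest-neighbour classifier recovers the combination reliably in this toy setting.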

Simulation Setup
We have used the REDD dataset for House 1 and time-synchronised the aggregate power of the house with the respective appliances. Once the dataset was time-synchronised, the labels were generated based on the on/off state of each appliance. We marked the label as "0" if the appliance was off and "1" otherwise. The concatenated appliance states were then generated as classes of uniquely operating appliances in the dataset. Our experimental setup consisted of a six-core Intel® Core™ i7-8750H mobile processor with 32 GB RAM running the Microsoft® Windows® 10 operating system. All the processing was conducted in Python (version 3.8.8) using the sklearn and tensorflow modules for machine learning. The key statistical attributes of the House 1 REDD dataset are listed in Table 1.
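The labelling step described above can be sketched as follows; the column names and readings are illustrative placeholders, not the actual REDD schema:

```python
# Minimal sketch of generating concatenated on/off class labels from
# time-synchronised per-appliance power readings (toy data, assumed columns).
import pandas as pd

df = pd.DataFrame({
    "refrigerator": [190.0, 0.0, 195.0],
    "microwave":    [0.0, 1480.0, 1500.0],
    "lighting":     [55.0, 0.0, 60.0],
})

# Mark an appliance "1" if it draws any power at that instant, "0" otherwise.
on_off = (df > 0).astype(int)

# Concatenate the per-appliance states into one class label per time step.
df["label"] = on_off.astype(str).agg("".join, axis=1)
print(df["label"].tolist())  # ['101', '010', '111']
```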

Dataset Optimisation
In order to efficiently perform machine learning classification based on multiclassification, the dataset needs to be well tuned to achieve high prediction accuracy. To do so, it is important to understand the operation of appliances through their power profiles and the gathered data. To achieve better prediction accuracy with the machine learning algorithms, we have introduced a hybrid approach to labelling the data.
The first approach defines the operating power window for each appliance. For instance, we selected the operating power window of the microwave as 20 watts to 1650 watts in the REDD House 1 dataset. The lower limit corresponds to the microwave being in standby mode, or to the interior light that activates when the microwave door is open, generating a small power spike in the time domain. However, once the heating cycle started, the power consumption of the microwave increased according to the settings selected by the user and was well above the standby power consumption. The power window was selected through a survey of the power profiles of various appliances reported in the literature [38][39][40][41][42][43] and by analysing the time-domain appliance usage patterns in the House 1 REDD dataset.
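The power-window idea can be sketched as follows; the window bounds below are illustrative placeholders rather than the values in Table 3 (only the microwave bounds mirror those quoted above):

```python
# Sketch of power windowing: readings outside an appliance's operating window
# are treated as "off" at that instant, removing spikes and outliers.
import pandas as pd

# Hypothetical operating windows in watts: (lower, upper).
power_windows = {"microwave": (20.0, 1650.0), "refrigerator": (50.0, 400.0)}

readings = pd.DataFrame({
    "microwave":    [5.0, 1500.0, 1700.0],   # below window, heating, outlier
    "refrigerator": [120.0, 30.0, 200.0],
})

def apply_window(series, bounds):
    lo, hi = bounds
    # 1 only when the reading falls inside the operating window, else 0 (off).
    return series.between(lo, hi).astype(int)

states = readings.apply(lambda col: apply_window(col, power_windows[col.name]))
print(states.values.tolist())  # [[0, 1], [1, 0], [0, 1]]
```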
Multiclassification labels were generated for the microwave at each particular time utilising the power profile operating window. The same strategy was followed for all the other appliances in the REDD House 1 dataset. Table 3 provides the power window bins for House 1 in the REDD database, where the channel number indicates the measured power at a particular channel of the power meter. Power profile values that do not fall within the power window are considered as not operating, or turned off, at that particular instance. This strategy removed power spikes and outliers in the dataset for a particular appliance in the time domain.

The second approach we have introduced is to identify the unique combinations of appliances that seldom operate in the dataset. The operation of appliances is random, as it depends on the resident's time spent at home during a particular day, weather conditions and variations in appliance usage due to the occupant's personal choice or some random event. The overall power consumed by a particular residence is the sum of the power consumed by all the appliances. We noticed that there are certain unique combinations of devices that seldom operate simultaneously. Such seldom-operating unique combinations make the classification of appliances based on aggregate power difficult due to a lack of training data. In order to correctly classify the appliances based on aggregate power and to increase the overall classification accuracy, it is paramount to identify seldom-occurring concurrent unique combinations of appliances and remove them from the dataset. To implement this, we have used a threshold, or occurrence per million (OPM), frequency of such unique combinations. Threshold values from 5 to 50, in steps of 5, were chosen to remove unique combinations of seldom-operating appliances that fall below the set threshold value or frequency.
For instance, the House 1 dataset for REDD has 406,748 instances of operations, which are synced in the time domain with the main meter. Selecting a threshold value of 5 equates to the unique instances of the events being at least 12 per million for this particular dataset. Any unique combination of appliances that falls below the threshold of 12 per million, or in the case of the House 1 REDD dataset below 5 instances, was dropped from the dataset. This procedure resulted in the feature reduction of unique labels, which makes the classification more effective at the cost of removing a particular set of events that is less probable to happen in future.
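The OPM/thresholding step described above can be sketched as follows; the label counts are toy values, not the REDD data:

```python
# Sketch of OPM thresholding: unique appliance-combination labels whose
# occurrence count falls below the threshold are dropped from the dataset.
import pandas as pd

labels = pd.Series(["101"] * 300 + ["010"] * 90 + ["111"] * 4)  # toy counts

threshold = 5  # absolute count; as OPM this is threshold / len(labels) * 1e6
counts = labels.value_counts()
kept = labels[labels.map(counts) >= threshold]

print(sorted(kept.unique()))   # ['010', '101']
print(len(labels) - len(kept)) # 4 rows removed
```

For the real House 1 data, a threshold of 5 over 406,748 instances corresponds to roughly 12 occurrences per million, as stated above.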
The above strategy resulted in the generation of 22 different datasets for House 1 in the REDD dataset: 10 datasets for different OPMs without any power window; an additional 10 datasets in which the power window was applied along with the different OPMs; one dataset containing only power window multiclassification labels; and one dataset in which neither the power window nor OPMs were applied when generating the classification labels.

Results
The data preprocessing mainly comprised the thresholding and windowing concepts, which have already been described in the methodology section. All the segments of the data obtained from House 1 were extracted from the aggregate power reading. These segments of aggregated values were used for training and testing purposes. The concept of window size was introduced to depict a practical operational window for every appliance, since each appliance has a specific power profile. We removed all the data sequences where the power profile readings were too large or too small for the given window, as these practically represent noise. Moreover, the concept of thresholding was used to cater for the non-uniformity of the data (i.e., imbalanced data). Therefore, different values of thresholding were used to evaluate the performance of the ML algorithms.

Performance Metrics
Data distribution plays a significant role in the training process, specifically for a multiclassification problem [44][45][46]. The accuracy of the trained algorithms, in one way or another, also depends on the data distribution. In general, we prefer a machine learning algorithm to have high weighted precision and recall scores when evaluating performance. However, there is a tradeoff between the weighted precision and recall performance metrics: tuning a machine learning algorithm for high weighted precision often results in a lower weighted recall score, and vice versa. A better approach is to consider the weighted F1-score, which combines precision and recall by calculating their harmonic mean. However, for class-imbalanced data, even the weighted F1-score can be misleading, as it assigns weights to the classes based on their sample size in the dataset when calculating the weighted precision, recall and resulting F1-score, thus favouring the majority class. The weighted F1-score can be mathematically represented by Equation (1):
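The equation referenced as Equation (1) did not survive extraction; the following is a reconstruction of the standard weighted F1-score definition, consistent with the description above, where $C$ is the number of classes, $n_i$ the number of samples of class $i$, $N$ the total number of samples, and $P_i$, $R_i$ the per-class precision and recall:

```latex
F1_{\mathrm{weighted}} = \sum_{i=1}^{C} \frac{n_i}{N}\, F1_i,
\qquad
F1_i = \frac{2\, P_i R_i}{P_i + R_i}
\tag{1}
```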

Micro and Macro Averaging
In order to improve the prediction of class-imbalanced data for multi-label/multiclassification, it is better to investigate the micro and macro performance metrics available in the Python sklearn library. The micro performance metrics in sklearn compute the precision, recall and resulting F1-score by considering the total true positives, false negatives and false positives, without considering the proportion of predictions for each label in the dataset, as denoted by Equation (2).
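Equation (2) is missing from the extracted text; the standard micro-averaged definitions, consistent with the description above (with $TP_i$, $FP_i$, $FN_i$ the true positives, false positives and false negatives of class $i$), are:

```latex
P_{\mathrm{micro}} = \frac{\sum_{i=1}^{C} TP_i}{\sum_{i=1}^{C} (TP_i + FP_i)},
\quad
R_{\mathrm{micro}} = \frac{\sum_{i=1}^{C} TP_i}{\sum_{i=1}^{C} (TP_i + FN_i)},
\quad
F1_{\mathrm{micro}} = \frac{2\, P_{\mathrm{micro}} R_{\mathrm{micro}}}{P_{\mathrm{micro}} + R_{\mathrm{micro}}}
\tag{2}
```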
On the other hand, the macro performance metric in sklearn computes the precision, recall and resulting F1-score for each label separately and returns the unweighted average of each metric, regardless of the proportion of each label in the dataset, as denoted by Equation (3). In a nutshell, the macro-averaging performance metric computes a given metric (i.e., accuracy, precision, recall and F1-score) independently of the sample size of each class and then takes the average, while micro-averaging aggregates the counts of all classes and computes the metric globally.
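Equation (3) is likewise missing from the extracted text; the standard macro-averaged definitions, in which each class contributes equally irrespective of its sample size, are:

```latex
P_{\mathrm{macro}} = \frac{1}{C}\sum_{i=1}^{C} P_i,
\quad
R_{\mathrm{macro}} = \frac{1}{C}\sum_{i=1}^{C} R_i,
\quad
F1_{\mathrm{macro}} = \frac{1}{C}\sum_{i=1}^{C} F1_i
\tag{3}
```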
A simple weighted accuracy does not cater for the frequency of a particular class and therefore does not represent the true performance of the algorithm. Let us consider an example where 10 data samples are considered and a total of 3 classes exist in the data. It is assumed that 1 class appears 6 times in the data, whereas the frequencies of the other 2 are 3 and 1, respectively. If a machine learning algorithm is trained on such data, the algorithm will already be biased, with 1 class appearing 60% of the time.
Since the macro performance metric does not include the label proportions of the dataset to compute the performance metric, it is a better performance evaluation metric for class-imbalanced multi-label/multiclassification datasets.
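The difference between the two averaging schemes can be demonstrated with sklearn, which this work also uses; the toy labels below are illustrative:

```python
# Micro vs macro F1 on class-imbalanced labels: a single error on the minority
# class barely moves the micro score but pulls the macro score down noticeably.
from sklearn.metrics import f1_score

# 8 samples of the majority class "A", 2 of the minority class "B".
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["A", "B"]  # one minority sample misclassified

micro = f1_score(y_true, y_pred, average="micro")
macro = f1_score(y_true, y_pred, average="macro")
print(round(micro, 3), round(macro, 3))  # 0.9 0.804
```

Here micro-F1 equals the overall accuracy (9/10), while macro-F1 averages the per-class F1 scores (16/17 for "A", 2/3 for "B"), exposing the weaker minority-class performance.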

Accuracy
A comprehensive performance evaluation has been performed for the trained ML algorithms. Performance parameters such as accuracy, precision, F1-score and processing time have been considered. It should be noted that, since this work introduces the concepts of windowing and thresholding, the impact of these concepts on the various parameters has been detailed as well. Moreover, as discussed in the preceding section, due to the presence of class imbalance (different numbers of input instances for each class), considering simple weighted accuracy does not give a correct evaluation of the algorithms. In this regard, weighted accuracies and macro accuracies for the various machine learning algorithms have been computed for the class-imbalanced REDD House 1 dataset, as shown in Figures 2 and 3. It can be seen that there is a large difference between the two accuracies, which reflects the stated facts: while the weighted accuracy is more than 90% for some algorithms, it drops to only 37% when macro accuracy is considered. Ideally, these two values should be as close to each other as possible, but this is not the case because the REDD House 1 dataset is class-imbalanced. To improve the macro accuracy, we demonstrate the impact of utilising the concepts of OPMs/thresholding and power windowing on the trained algorithms by re-evaluating the macro accuracy on the REDD House 1 dataset. The tabulated results in Table 4 present the quantitative advantages of employing OPMs/thresholding and windowing on the evaluated metrics of the various machine learning algorithms. It should be noted that OPMs/thresholding and power windowing greatly improved the results of all the machine learning algorithms, irrespective of their working principles. Figure 4 provides the variation of macro precision with respect to OPMs/thresholding and the power windowing of the considered algorithms.
It can be observed from Table 4 that the KNN-City Block algorithm achieves the best accuracy as compared to the rest of the algorithms. However, the performance of a few of the other algorithms, such as KNN, RF and ET, remains very close to the best-performing algorithm. This result demonstrates that, even without deep learning techniques, significant performance can be achieved.
Another important aspect of this result is the impact of OPM, windowing and thresholding on the accuracy of the trained algorithms. It can be clearly seen that, with the increase in thresholding, the accuracy curves with OPMs and windowing increase significantly for all the implemented algorithms. The three curves, from bottom to top, represent the accuracy without OPM and windowing concepts (which we call the baseline accuracy), the accuracy with OPMs but without windowing and, lastly, the accuracy with both OPM and windowing concepts, respectively. The percentage increase in the accuracy of all the implemented algorithms as a result of applying OPMs with windowing is shown in Table 5. Similar trends are observed for precision and F1-score.

Per Appliance Performance Evaluation
To evaluate the performance of the trained algorithm on individual appliances, the accuracy, precision and F1-score of five different appliances are detailed in Table 6. It can be seen that the trained algorithms can classify individual appliances with high confidence.

Processing Time
Another parameter of interest is processing time. One of the important aspects of this study is achieving the desired accuracies without implementing computationally complex algorithms such as DNNs. Moreover, as a result of applying OPM and windowing, the processing time decreases. Figure 5 highlights the variations in processing time for the four best algorithms. It can be seen from the results that the processing times for CART, LDA, RF and ET decreased substantially with the increase in thresholding. The percentage decreases in processing time from the baseline are 50%, 47%, 61% and 67%, respectively. For both versions of KNN, the processing times do not vary significantly with thresholding but improve substantially with respect to the baseline. The percentage decreases in the processing times of all the algorithms are detailed in Table 7.

Class Variations
The samples obtained from the REDD dataset contain a huge number of appliance combinations. However, we believe that appliances can be classified even with a smaller number of combinations; more appliance combinations make the learning process more complex. Therefore, another advantage of using the OPM and windowing concepts is a decrease in the number of appliance combinations. The decrease in the number of unique class combinations reduces the complexity of the machine learning and leads to improvements in the respective accuracy and processing time, as shown in the previous sets of results. Figure 6 highlights the impact of OPM and power windowing on the number of classes required to train the ML algorithms.
It can be seen that, for most of the classifiers, the number of classes required to segregate the appliances decreases drastically as we increase the thresholding. It is interesting to observe that this trend is valid for both curves representing OPM with and without windowing. The percentage decrease in the number of classes for all the algorithms is presented in Table 8.

Training Samples
Another important aspect of utilising OPM and windowing is that, without an appreciable decrease in the data size/number of training samples, an improvement in performance metrics can be realised. Figure 7 illustrates the variation of the number of samples required to train the ML algorithm as a result of applying OPMs/thresholding and power windowing data optimisation techniques. It should be noted that at the expense of less than a 1% decrease in data size, there is an overall improvement in accuracy and processing time of more than 100% and 40%, respectively, for the best-performing machine learning algorithms.

Performance Comparison
The performance of the proposed methodology (the best-performing algorithm) is compared with recent NILM approaches. Different parameters such as accuracy, precision and F1-score have been compared. It can be observed from Table 9 that the proposed method performs better than the approaches in other relevant research works. This implies that, by applying the concepts of OPM and TH, the NILM system can accurately classify different appliances. It should be noted that not all research works report every parameter considered; therefore, only the reported parameters have been presented in the table.

Conclusions
In this paper, we investigated the problem of classifying multiple appliances operating at a particular time through their aggregate power. For this purpose, various ML algorithms were employed. This research demonstrated that, without applying computationally complex algorithms such as DNN, CNN or RNN, we can still achieve an accuracy of close to 99% while utilising only a moderate number of samples. Moreover, in order to handle imbalanced data while improving metrics such as accuracy, processing time and the number of appliance combinations, the concepts of OPM and windowing were utilised. The results demonstrate that, with the application of OPM and windowing, the accuracy increased by more than 100% for all the algorithms and the processing time decreased by 40% for LDA, NB, RF and ET. This is significant not only for NILM approaches but also for the development of the futuristic approach of green ML, where a low energy footprint of computational resources is expected in the prediction of load demand and appliance usage patterns.
In the future, we intend to consider the temporal variations of power profiles and train the ML algorithms accordingly. Moreover, we also aim to demonstrate a comparative study of segregating the appliances of different houses representing a different sample/data distribution.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: