An Enhanced Ensemble Approach for Non-Intrusive Energy Use Monitoring Based on Multidimensional Heterogeneity

Acting as a virtual sensor network for household appliance energy use monitoring, non-intrusive load monitoring is emerging as the technical basis for refined electricity analysis as well as home energy management. Aiming for robust and reliable monitoring, the ensemble approach is promising for load disaggregation, but the obstacles of design difficulty and computational inefficiency remain. To address this, an ensemble design integrated with multi-heterogeneity is proposed for non-intrusive energy use disaggregation in this paper. Firstly, the idea of utilizing a heterogeneous design is presented, and the corresponding ensemble framework for load disaggregation is established. Then, a sparse coding model is allocated to the individual classifiers, and the combined classifier is diversified by introducing different distance and similarity measures that do not consider sparsity, forming mutually heterogeneous classifiers. Lastly, a multiple-evaluations-based decision process is fine-tuned following the interactions of multi-heterogeneous committees and deployed as the decision maker. Through verifications on both a low-voltage network simulator and a field measurement dataset, the proposed approach is demonstrated to be effective in enhancing load disaggregation performance robustly. By appropriately introducing the heterogeneous design into the ensemble approach, load monitoring improvements are observed with reduced computational burden, which encourages further research into valid ensemble strategies for practical non-intrusive load monitoring implementations.


Introduction
Knowing the refined electricity behaviors of household energy consumption is important to residents, by means of which energy consciousness can be awakened and energy conservation schemes can be customized [1]. Meanwhile, it is also important to power utilities, where understanding the load components helps to model power system operations and schedule the demand response better [2]. Furthermore, it is also meaningful to the development of the entire power industry, e.g., it is the technological base of tracking household energy carbon emissions [3]. Therefore, insights into household electricity usage are emerging as a vital link in the energy consumption chain and are attracting more and more attention in both academic and industrial fields.
A straightforward way to realize refined electricity monitoring is to install smart sockets for target appliances and form a sensor network for household electricity monitoring [4]. The exploration enthusiasm for such a project had lasted for a period of time, but it decreased due to the high financial costs associated with too many sockets [5]. Besides, this method of electricity monitoring is strictly confined to socketed appliances, and it is not friendly to residents due to intrusive installations [6]. Hence, the socket-based sensor dictionary learning was proven to be an outstanding formulation, naturally applicable to NILM [32]. Additionally, further inspired by the sparse coding principle of dictionary learning, transform learning was also explored in [33] and was found to be well-adapted to NILM formulation.
In recent years, practical experience from world-leading artificial intelligence competitions has shown that the ensemble method is one of the most powerful approaches in machine learning. Researchers have therefore noticed the value of ensemble methods in NILM studies and conducted some explorations, although these remain limited. In [34], a multiscale wavelet packet tree is applied to collect comprehensive energy consumption features, and an ensemble bagging tree is adopted as a classifier, where the performance is compared with various machine learning schemes. In [35], an event detection and disaggregation framework based on an ensemble approach is proposed, whose disaggregation target is the water heating operation. Both of the above works focus on event-based load monitoring. Our team has established a general ensemble framework based on bagging in [36] for the load disaggregation of steady-state data, and demonstrated its performance robustness and model flexibility in diverse NILM scenarios.
However, to our knowledge, the current ensemble strategies applied to NILM all follow the evaluation criterion used in the individual classifiers, including the probabilistic quantitative scoring method proposed in [36]. Such an implementation requires the individual classifiers to be reliable and differentiated, but bias can hardly be avoided since the combined classifier and individual classifiers are homogeneous. If the classifiers are chosen in an inappropriate way, e.g., by overemphasizing a specific electrical feature, errors may be generated and finally cause false decomposition. Since the combined classifier needs the information from individual classifiers for decision making, such disadvantages always exist in an ensemble decision system with homogeneous classifiers, whether explicitly or implicitly. Based on this observation, the idea of utilizing a heterogeneous design for ensemble-approach-based NILM is proposed and investigated in this paper. Firstly, the multidimensional heterogeneity for an NILM-oriented ensemble method is discussed. Since the individual classifiers can be naturally distinct in a traditional ensemble framework, our research focuses on two further aspects: the heterogeneity between the combined classifier and individual classifiers, and the heterogeneity among the independent evaluation committees of the combined classifier. Then, an implementation design is illustrated, where the individual classifiers are established based on dictionary learning, while sparsity is not considered in the combined classifier. Meanwhile, multiple committees with distinct similarity measures are employed and coordinated in the decision-making stage, providing valid disaggregation evaluations from multiple perspectives.
Through verifications on both a simulation platform and a field measurement dataset, the proposed idea and strategy are proven to be effective in enhancing NILM performance.
The major contribution of this paper is the presentation of a multidimensional-heterogeneity-enhanced ensemble approach for NILM. By introducing heterogeneity, the obstacles of ensemble application, including design difficulty and computational inefficiency, are overcome. In addition to providing an effective method to improve NILM performance, this study also encourages further exploration of the ensemble method in NILM for robust and reliable disaggregation. Furthermore, it prompts deeper consideration of the nature of NILM problems as well as the rationality and completeness of disaggregation models. In support of the contribution, the following aspects are highlighted:

•
Based on the properties of NILM problems, the heterogeneous evaluation design is utilized in an ensemble model.

•
Dictionary learning is deployed for basic load disaggregation, while the sparsity measures are featured in individual classifiers.

•
The combined classifier is free of sparsity measures, but composed of multiple decision committees with different similarity measures.

•
Verifications on both a simulation platform and a field measurement dataset show the effectiveness of our work.

Methodology
The task of non-intrusive load monitoring is to disaggregate the detailed appliances' states via integral electrical measurements. In other words, it is to distinguish the components of monitoring signals, which can be formulated as:

$$x = \sum_{i \in \Omega} x_i \qquad (1)$$

where $x \in \mathbb{R}^{S \times 1}$ is the target signal of length $S$, $x_i \in \mathbb{R}^{S \times 1}$ is the selected electrical signature of the $i$th appliance, also of length $S$, and $\Omega$ stands for the candidate appliance set.
Considering the background noise and signature fluctuations, Equation (1) can hardly be followed in practical applications. Therefore, an error term is usually considered in load disaggregation problems, i.e.:

$$x = \sum_{i \in \Omega} x_i + e_0 \qquad (2)$$

where $e_0 \in \mathbb{R}^{S \times 1}$ is the error term for the decomposition, also of length $S$.
Since background noise always exists in daily power consumption, and appliances' operation states are highly dependent on manufacturing standards and electrical aging, the error term provided in Equation (2) not only exists but also plays a key role in load decomposition. The objective of the multidimensional-heterogeneity-enhanced ensemble model is to evaluate the error term from diverse perspectives and avoid the bias caused by certain evaluation approaches.
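The role of the error term can be illustrated with a small numerical sketch. It assumes the additive form implied by the symbol definitions (aggregate signal equals the sum of the active appliance signatures plus $e_0$); the appliance names, signature values, and noise level are hypothetical and are not taken from the paper's datasets.

```python
import numpy as np

# Hypothetical steady-state signatures of length S = 4
# (for example: real power, reactive power, and two harmonic magnitudes).
signatures = {
    "kettle": np.array([1800.0, 20.0, 5.0, 1.0]),
    "lamp": np.array([60.0, 5.0, 2.0, 0.5]),
    "fridge": np.array([120.0, 80.0, 3.0, 0.8]),
}
active = ["kettle", "lamp"]  # appliances assumed to be ON

rng = np.random.default_rng(0)
e0 = rng.normal(scale=2.0, size=4)  # background noise and signature fluctuation

clean_sum = sum(signatures[name] for name in active)  # noise-free aggregate
x = clean_sum + e0                                    # measured aggregate with error term

# The error term is what remains after subtracting the correct appliance combination.
residual = x - clean_sum
print(residual)
```

Any disaggregation method only observes `x`, so different evaluation criteria amount to different ways of judging how plausible a given residual is.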

Ensemble Method Framework
Following the load disaggregation formulation provided in Equation (2), the corresponding ensemble method framework is established in Figure 1, while the proposed idea of multidimensional heterogeneity enhancement is highlighted in color.
As seen in Figure 1, the NILM-oriented ensemble framework follows the bagging strategy. The proposed multidimensional heterogeneity is integrated in the architecture with the following considerations:

1.
First: Dimensional heterogeneity in the individual classifiers. The basic idea of bagging is to combine several weak classifiers into a strong classifier. For an effective combination, the weak classifiers should be distinct from each other, so the individual classifiers can already be heterogeneous by the definition of the ensemble method. We therefore do not discuss this point in detail in the following sections; it is still illustrated in Figure 1 for completeness of the description.

2.
Second: Dimensional heterogeneity between the combined classifier and individual classifiers. The individual classifiers act as the basic appliance disaggregation tool in ensemble-method-based NILM, and the combined classifier acts as the ultimate decision maker. Therefore, if these classifiers are homogeneous, the disaggregation results may be biased, following the features of the applied algorithms. Hence, we introduce heterogeneous evaluation for the combined classifier to assess the candidate solutions from diverse perspectives.

3.
Third: Dimensional heterogeneity in the multiple committees established for the combined classifier. The combination strategy is essential for the ensemble method and depends mainly on the design of the combined classifier. In order to create a valid combined classifier, we split the decision maker into multiple committees and introduce heterogeneity into these committees as well. By evaluating the candidate solutions from multiple perspectives (which are also distinct from those of the individual classifiers), a more reliable result may be provided.
For a better understanding and also the verification of the proposed idea, the detailed designs and implementations are illustrated in the following sections. As mentioned above, we focus on the newly proposed schemes, i.e., the heterogeneity designs for the last two dimensions.
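The decision flow of the framework can be sketched as follows. This is only a structural illustration under simplifying assumptions: the type aliases `Classifier` and `Committee`, the function `ensemble_decide`, and the toy classifiers and committees are hypothetical names introduced here, and the committees return generic scores (higher is better) rather than the scoring rules used in the paper.

```python
from typing import Callable, List, Sequence, Tuple

import numpy as np

Classifier = Callable[[np.ndarray], np.ndarray]        # target -> candidate estimate
Committee = Callable[[np.ndarray, np.ndarray], float]  # (target, candidate) -> score


def ensemble_decide(
    x: np.ndarray,
    classifiers: Sequence[Classifier],
    committees: Sequence[Committee],
) -> Tuple[int, List[float]]:
    """Let each individual classifier propose one candidate, let every
    committee score every candidate, and pick the highest total score."""
    candidates = [clf(x) for clf in classifiers]
    totals = [sum(c(x, cand) for c in committees) for cand in candidates]
    return int(np.argmax(totals)), totals


# Toy example: two "classifiers" that reconstruct the aggregate with different bias.
x = np.array([100.0, 50.0])
classifiers = [lambda t: 0.9 * t, lambda t: 1.5 * t]
committees = [
    lambda t, c: -float(np.linalg.norm(t - c)),   # closer in Euclidean sense is better
    lambda t, c: -float(np.sum(np.abs(t - c))),   # closer in Manhattan sense is better
]
best_index, totals = ensemble_decide(x, classifiers, committees)
print(best_index, totals)
```

The key design point is that the committees never reuse the objective of the individual classifiers, which is where the second and third dimensions of heterogeneity enter.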

Heterogeneous Design for Combined Classifier and Individual Classifiers
Aiming for heterogeneity, the individual classifiers and combined classifier should follow different objective models. Since we will design multiple committees for the combined classifier, the most commonly used model in Equation (1) is reserved for the combined classifier. The individual classifiers are formulated via dictionary learning, in which sparsity is explicitly considered. Therefore, whether sparsity is considered or not is the featured heterogeneity between the combined classifier and individual classifiers.

Dictionary Learning Model for Individual Classifiers
The dictionary learning model tries to establish a dictionary for the target signal in Equation (1) and decompose the signal with as few dictionary atoms as possible. The basic formulation is:

$$x = D\alpha \qquad (3)$$

where the dictionary is defined as $D = [d_1, d_2, \ldots, d_N] \in \mathbb{R}^{S \times N}$, whose column $d_k \in \mathbb{R}^{S \times 1}$ is defined as an atom. One dictionary contains $N$ atoms. $\alpha \in \mathbb{R}^{N \times 1}$ is defined as the sparsity parameter, i.e., the coefficient vector over the atoms. For a well-established model, the sparsity, $\alpha$, has an important role. On the one hand, the dictionary, $D$, is established by an alternating optimization over both the dictionary and the sparsity. On the other hand, once the dictionary is determined, sparsity becomes a key factor for problem solving.
Therefore, based on the principles of dictionary learning, the dictionary, $D$, must be determined first. The problem is defined as:

$$\min_{D,\,\alpha} \; \|x - D\alpha\|_F^2 + \lambda\, g_{\bullet}(\alpha) \qquad (4)$$

where $\|\cdot\|_F$ is the F-norm calculation, measuring the differences between the target and the fitting in the physical sense. $\lambda$ is the regularization parameter, indicating the proportion of sparsity in the optimization objective. $g_{\bullet}(\cdot)$ is the unified sparsity measurement function, revealing the sparsity calculation in the objective. Since both the dictionary, $D$, and the sparsity, $\alpha$, are unknown variables to be solved in the model, the K-SVD algorithm is utilized to solve the alternating problem [31]. After completing the training stage, we have a feasible $D$ for the NILM problem in a specific house. Hence, the load disaggregation problem under normal operations is a straightforward optimization with little computational burden:

$$\min_{\alpha} \; \|x - D\alpha\|_F^2 + \lambda\, g_{\bullet}(\alpha) \qquad (5)$$
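Once the dictionary is fixed, the disaggregation step reduces to a sparse coding problem. The sketch below is one possible instance and not the authors' implementation: it assumes the sparsity function is the $\ell_1$ norm and the fidelity term is $\tfrac{1}{2}\|x - D\alpha\|_2^2$, and it uses the iterative shrinkage-thresholding algorithm (ISTA) as a stand-in solver. The function names `soft_threshold` and `sparse_code_ista` are hypothetical, and dictionary training with K-SVD is not shown.

```python
import numpy as np


def soft_threshold(v: np.ndarray, t: float) -> np.ndarray:
    """Element-wise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def sparse_code_ista(x: np.ndarray, D: np.ndarray, lam: float, n_iter: int = 500) -> np.ndarray:
    """Minimize 0.5 * ||x - D a||_2^2 + lam * ||a||_1 over a with ISTA."""
    lipschitz = np.linalg.norm(D, 2) ** 2  # squared spectral norm of D
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)           # gradient of the fidelity term
        a = soft_threshold(a - grad / lipschitz, lam / lipschitz)
    return a


# Toy check with an identity dictionary, where the minimizer is soft_threshold(x, lam).
D = np.eye(3)
x = np.array([3.0, 0.05, 0.0])
alpha = sparse_code_ista(x, D, lam=0.1)
print(alpha)
```

With a learned dictionary, the nonzero entries of `alpha` indicate which atoms (appliance signatures or states) are used to explain the aggregate measurement.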


Heterogeneous Design for the Combined Classifier
As seen from Equations (4) and (5), the role of sparsity may vary in the load disaggregation problem, but it is always considered in the model. However, in the original problem in Equation (1), sparsity is not an inherent requirement. Therefore, in order to introduce the heterogeneous evaluation system, the design of the combined classifier considers the physical properties only, while sparsity is ignored entirely. The key to this idea is illustrated in Figure 2.
The architecture shown in Figure 2 provides a design sample for the heterogeneity between individual classifiers and the combined classifier. The core idea is that the evaluation criteria of individual classifiers follow sparsity measures, while those of the combined classifier follow similarity measures. The sparsity measures are calculated based on Equations (4) and (5), and diverse individual classifiers can be obtained by allocating a different regularization parameter, λ, to each. The similarity measures, in which sparsity is not considered, should provide an effective and justified evaluation for the candidate solutions. Therefore, a multi-committee decision-making system is designed for the combined classifier, where different committees hold different similarity measures.
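How different values of λ yield different individual classifiers can be seen in a small sketch. It assumes an $\ell_1$ sparsity term and, purely for illustration, an orthonormal dictionary, for which the sparse code has the closed form of soft-thresholding $D^{\top}x$; the function name `sparse_code_orthonormal` and all numbers are hypothetical.

```python
import numpy as np


def sparse_code_orthonormal(x: np.ndarray, D: np.ndarray, lam: float) -> np.ndarray:
    """Closed-form minimizer of 0.5 * ||x - D a||_2^2 + lam * ||a||_1
    when the columns of D are orthonormal."""
    c = D.T @ x
    return np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)


D = np.eye(4)                          # trivially orthonormal toy dictionary
x = np.array([5.0, 1.2, 0.4, 0.05])    # toy aggregate measurement

lambdas = [0.1, 0.5, 2.0]              # one regularization parameter per individual classifier
candidates = {lam: sparse_code_orthonormal(x, D, lam) for lam in lambdas}
active_counts = {lam: int(np.count_nonzero(a)) for lam, a in candidates.items()}
print(active_counts)
```

A larger λ keeps fewer atoms active, so each individual classifier proposes a candidate with a different trade-off between fitting error and sparsity; the combined classifier then judges these candidates without any sparsity term.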

Heterogeneous Design for Decision-Making Committees of the Combined Classifier
Following the heterogeneous design idea for the combined classifier discussed above, diverse similarity measures should be selected for evaluation committees of the combined classifier. Among dozens of similarity measures, three commonly used measures, i.e., Euclidean distance, Manhattan distance, and cosine similarity, are selected considering the physical features of NILM. The design of the combined classifier is illustrated in Figure 3, where the physical meanings of heterogeneous committees are visualized. The basic ideas for choosing these three measures are listed below, while the rationality is demonstrated by case results:

•
Euclidean distance is the most commonly used measure to evaluate the absolute distance between two points in multidimensional space. Therefore, Euclidean distance would provide an overall assessment of the differences between the estimation and target in NILM.

•
Manhattan distance measures the total sum of absolute distance on each coordinate axis for a multidimensional system. Hence, Manhattan distance focuses on the fitting differences for each electric feature, paying more attention to the details.

•
Cosine similarity utilizes the cosine value of the angle between two vectors in multidimensional space to quantify the differences. Compared with distance measures, it is more interested in the direction as opposed to the distance or length. This measure would highlight the electric feature relevance of appliances in NILM.
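The different viewpoints of the three measures can be made concrete with a short sketch using their standard definitions. The vectors below are arbitrary illustrative points, not measured load signatures.

```python
import numpy as np


def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line distance between two points."""
    return float(np.linalg.norm(a - b))


def manhattan(a: np.ndarray, b: np.ndarray) -> float:
    """Sum of absolute differences along each coordinate axis."""
    return float(np.sum(np.abs(a - b)))


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


target = np.array([1.0, 1.0, 1.0])
scaled = np.array([2.0, 2.0, 2.0])   # same direction, different length
skewed = np.array([1.0, 1.0, 2.0])   # one feature deviates

for name, est in [("scaled", scaled), ("skewed", skewed)]:
    print(name, euclidean(target, est), manhattan(target, est), cosine_similarity(target, est))
```

The `scaled` estimate is far from the target in both distance measures yet has a cosine similarity of one, whereas the `skewed` estimate is closer in distance but points in a different direction. This is why the committees can disagree and why combining them gives a more balanced evaluation.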

Multidimensional Space Mapping and Standardization
A vital design in similarity analysis for a multidimensional problem is how to unify the measurements of diverse dimensions. From the view of NILM, it is essentially a trade-off problem of multi-objective fitting. This problem is quite similar to parameter tuning in many system designs, which seems insignificant but actually matters.
In practice, the unification of multiple dimensions already exists in the individual classifiers, where the dictionary-learning-formulated disaggregation approach utilizes different regularization parameters to coordinate diverse electric features (Equation (6)). In that formulation, $\mathrm{norm}(\cdot)$ is the normalization function, $P$ is the target signal of real power, and $D_P$ is the dictionary for the normalized real power analysis. $D_*$ is the dictionary for the normalized electric feature $*$, and $\lambda_*$ is the regularization parameter for the electric feature $*$. $LS$ denotes the load signature features apart from real power $P$, including reactive power, $Q$, and different orders of harmonics, $H$.
For designs with heterogeneity, the mapping and standardization for the combined classifier follow another strategy. All electric features are considered equally important, and the target is mapped to a reference point with all dimensions equal to unity. Consistently, the estimation is also standardized by selecting the target values as a base. The calculations are as follows:

$$\begin{aligned}
\bar{P}_{tar} &= \frac{P_{tar}}{P_{tar}} = 1, & \bar{P}_{est} &= \frac{P_{est}}{P_{tar}} \\
\bar{Q}_{tar} &= \frac{Q_{tar}}{Q_{tar}} = 1, & \bar{Q}_{est} &= \frac{Q_{est}}{Q_{tar}} \\
\bar{H}_{tar} &= \frac{H_{tar}}{H_{tar}} = 1, & \bar{H}_{est} &= \frac{H_{est}}{H_{tar}}
\end{aligned} \qquad (7)$$

where $P_{tar}$, $Q_{tar}$, and $H_{tar}$ are, respectively, the measured values of real power, reactive power, and harmonics, indicating the target. $P_{est}$, $Q_{est}$, and $H_{est}$ are, respectively, the estimated values of real power, reactive power, and harmonics through load disaggregation. The above variables are all related to the original electric feature space. Meanwhile, $\bar{P}_{tar}$, $\bar{Q}_{tar}$, and $\bar{H}_{tar}$ are, respectively, the standardized targets of real power, reactive power, and harmonics, and $\bar{P}_{est}$, $\bar{Q}_{est}$, and $\bar{H}_{est}$ are, respectively, the standardized estimations of real power, reactive power, and harmonics. These variables lie in a unified space and are therefore comparable. Hence, by the above detailed designs, the combined classifier is completely heterogeneous with respect to the individual classifiers, which conforms to the proposals of this article.
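The mapping to the unified space can be sketched in a few lines. The sketch assumes that standardization divides each feature by its measured target value, so that the target becomes the all-ones reference point; the function name `standardize` and the feature values are hypothetical.

```python
import numpy as np


def standardize(target, estimate):
    """Map the target to the all-ones reference point and scale the
    estimate by the same per-feature base (the target values)."""
    target = np.asarray(target, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    if np.any(target == 0.0):
        raise ValueError("target features must be nonzero to serve as a base")
    return np.ones_like(target), estimate / target


# Hypothetical feature vectors ordered as [P, Q, H].
target = [1200.0, 300.0, 15.0]     # measured aggregate features
estimate = [1140.0, 330.0, 15.0]   # features reconstructed from a candidate solution

target_std, estimate_std = standardize(target, estimate)
print(target_std, estimate_std)
```

After this step a 5% error in real power and a 5% error in a harmonic component contribute equally, regardless of their very different physical magnitudes. A zero-valued target feature would need separate handling, which this sketch only flags with an exception.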

Similarity Evaluation and Scoring
With comparable multidimensional objects, it is possible to evaluate and score from different views of similarity. Following the physical meanings of the selected measures shown in Figure 3, each of the three committees scores every candidate. The score of the third committee, based on cosine similarity, is calculated as:

$$\mathrm{Score}_3 = 100 \times \frac{\bar{P}_{tar}\bar{P}_{est} + \bar{Q}_{tar}\bar{Q}_{est} + \cdots + \bar{H}_{tar}\bar{H}_{est}}{\sqrt{\bar{P}_{tar}^2 + \bar{Q}_{tar}^2 + \cdots + \bar{H}_{tar}^2} \times \sqrt{\bar{P}_{est}^2 + \bar{Q}_{est}^2 + \cdots + \bar{H}_{est}^2}} \qquad (10)$$

where the barred quantities are the standardized target and estimation values. $\mathrm{Score}_1$ is the score given to the candidate by the first committee of the combined classifier, following the Euclidean distance. $\mathrm{Score}_2$ is the score given by the second committee, following the Manhattan distance. $\mathrm{Score}_3$ is the score given by the third committee, following the cosine similarity. $r_{pe}$ and $r_{pm}$ are, respectively, the regulation parameters for the scoring of the first and second committees. By comparing the sum of scores, the optimal solution is determined from all candidates. Since the candidates are generated following weighted standardization and sparsity evaluation, and selected by unified standardization and disparate measures, the decision process is totally heterogeneous. Therefore, the idea of establishing an ensemble-method-based NILM model with multidimensional heterogeneity is realized by the above implementations.
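The committee-based decision can be sketched as below. Only the cosine-similarity score follows the form stated in the text (100 times the cosine of the angle between the standardized estimate and the all-ones target). The distance-to-score mapping `100 * max(0, 1 - d / r)` for the first two committees is an assumption made here for illustration, as are the default values of `r_pe` and `r_pm` and the candidate vectors.

```python
import numpy as np


def committee_scores(est_std: np.ndarray, r_pe: float = 1.0, r_pm: float = 1.0):
    """Score one standardized candidate against the all-ones reference point."""
    ref = np.ones_like(est_std)
    d_euclid = float(np.linalg.norm(est_std - ref))
    d_manhattan = float(np.sum(np.abs(est_std - ref)))
    # Assumed mapping from distance to score; the paper's exact form may differ.
    score1 = 100.0 * max(0.0, 1.0 - d_euclid / r_pe)
    score2 = 100.0 * max(0.0, 1.0 - d_manhattan / r_pm)
    # Cosine-similarity score, scaled to a 0-100 range for nonnegative features.
    score3 = 100.0 * float(est_std @ ref) / float(np.linalg.norm(est_std) * np.linalg.norm(ref))
    return score1, score2, score3


# Hypothetical standardized candidates produced by two individual classifiers.
candidates = {
    "A": np.array([0.95, 1.10, 1.00]),
    "B": np.array([0.70, 1.00, 1.30]),
}
totals = {name: sum(committee_scores(est)) for name, est in candidates.items()}
best = max(totals, key=totals.get)
print(totals, best)
```

Summing the three scores gives each committee an equal vote; a weighted sum would be a natural variation if one perspective were known to be more reliable for a given house.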

Results and Discussions
The proposed approach is tested and discussed in this section. Firstly, the evaluation metrics for NILM are presented. Then, results and discussions are provided based on simulation studies and field measurements analysis, respectively.

Evaluation Metrics for NILM
The most commonly used metrics evaluating the performance of NILM, including precision, sensitivity, and F-measure, are utilized in this section to verify the effectiveness of the proposed approach. The calculations are as follows:

$$P_s = \frac{TP_s}{TP_s + FP_s}, \qquad S_s = \frac{TP_s}{TP_s + FN_s}, \qquad F_s = \frac{2 \times P_s \times S_s}{P_s + S_s}$$

where $P_s$, $S_s$, and $F_s$ are, respectively, the precision metric, sensitivity metric, and F-measure metric for an appliance, $s$. $TP_s$ is the true positive disaggregation, indicating the number of detections that are correctly detected as the appliance, $s$. $FP_s$ is the false positive disaggregation, indicating the number of detections that are incorrectly detected as $s$. $FN_s$ is the false negative disaggregation, indicating the number of detections related to $s$ that are incorrectly detected as other appliances. Let all the electrical appliances in the target house form a set, $\Omega_s$. Then, the average values of all appliance metrics are utilized for the evaluation of overall NILM performance, i.e.:

$$\mathrm{Pre} = \frac{1}{N_s}\sum_{s \in \Omega_s} P_s, \qquad \mathrm{Sen} = \frac{1}{N_s}\sum_{s \in \Omega_s} S_s, \qquad \text{F-mea} = \frac{1}{N_s}\sum_{s \in \Omega_s} F_s$$

where Pre, Sen, and F-mea are, respectively, the average metric values of precision, sensitivity, and F-measure for an appliance set, $\Omega_s$. $N_s$ is the total number of electrical appliances in the appliance set, $\Omega_s$.
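A minimal sketch of how these metrics are computed per appliance and then averaged is given below, using the standard definitions of precision, sensitivity, and F-measure. The appliance labels are borrowed from the paper for readability, but the detection counts are invented for illustration and do not correspond to any reported table.

```python
def appliance_metrics(tp: int, fp: int, fn: int):
    """Precision, sensitivity, and F-measure for one appliance."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + sensitivity
    f_measure = 2.0 * precision * sensitivity / denom if denom else 0.0
    return precision, sensitivity, f_measure


# Illustrative (TP, FP, FN) counts per appliance; not results from the paper.
counts = {"HEA": (90, 10, 10), "RFR": (40, 10, 40)}

per_appliance = {s: appliance_metrics(*c) for s, c in counts.items()}
n_s = len(per_appliance)
pre, sen, f_mea = (sum(m[i] for m in per_appliance.values()) / n_s for i in range(3))
print(per_appliance)
print(pre, sen, f_mea)
```

Because the averages are unweighted, an appliance that operates rarely influences the overall score as much as one that operates often, which is worth keeping in mind when reading the per-appliance tables.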

Studies on Low-Voltage Network Simulator
For a comprehensive investigation of the proposed idea and approach, a simulation platform, named the low-voltage network simulator (LVNS) [37], is employed in our work. Since the validation of NILM studies is one of the original motivations for developing the LVNS, it is appropriate for our extensive explorations.
A North American house, with almost twenty appliances, is simulated. The detailed information of the appliance set is shown in Table 1. As seen, all types of commonly used appliances, including ON-OFF, multi-state, repetitive mode, and transient mode, are considered in our work. Such a setup contributes to the demonstration of the validity and rationality of our study.
The proposed heterogeneity-enhanced ensemble approach is denoted as PHA in the following discussions. Since the individual classifier is established based on the sparse coding approach, the conventional dictionary learning approach is compared, and denoted as CDA [32]. In addition, the former proposal of ensemble-method-based NILM in [36], framed by the probability model, is also compared and denoted as EPA. Besides, in order to investigate the insights of our proposed approach, the detailed performance of individual classifiers is also recorded and analyzed. In this subsection, the individual classifiers are formed based on the feature selection bagging strategy [36], and denoted as ICA1, ICA2, ICA3, and ICA4, respectively.

Overall NILM Performance and Comparisons
The average NILM performances by diverse approaches are shown in Table 2. As seen, applying the ensemble strategy improves NILM performance, whether by EPA or PHA. Although the enhancement is slight, it is an important contribution to data-driven NILM research because such an improvement is achieved on a given dataset and with the same basic disaggregation algorithm.
Comparing PHA with EPA, we find that the average performance of the two ensemble-method-based NILM approaches is quite similar. Strictly speaking, the proposed approach in this paper is slightly weaker than the probability-model-framed approach, though the margin is very small. Such results are acceptable given the computational burdens of the two approaches. In order to include the true solution, EPA conducts multiple optimization calculations for each individual classifier, which requires high computational power. In contrast, the proposed approach requires only one calculation for each individual classifier, which is desirable for practical implementation on smart meters. The calculation burdens and statistical computation time are shown in Table 3. As seen, the number of optimizations for one ensemble decision by PHA is one-fifth of that by EPA. Note that the multiple optimizations by EPA are sequentially executed, so parallel computing methods are not applicable. Besides, practical NILM applications are usually deployed on smart meters, for which a high computational burden is not appropriate. By decreasing the number of optimizations, the decision time of PHA is correspondingly reduced to one-third. Therefore, the proposed approach shows superiority when considering both disaggregation performance and computational efficiency.
Besides, PHA does not always perform worse than EPA. In the simulations, six days are randomly selected, and we use the metrics results of EPA as references, while the relative performances of PHA are visualized in Figure 4.
As seen, the two approaches perform similarly on day one and day five. From day two to day four, EPA outperforms PHA. However, there is still one day, i.e., day six, on which PHA outperforms EPA. Additionally, the biggest change for all metrics also happens on day six, with an increase of more than 5% in the precision metric. Such results indicate that the proposed ensemble strategy is indeed effective and does contribute to the enhancement of NILM.
The detailed statistical disaggregation metrics for all appliances are also recorded, as shown in Table 4. The load disaggregation performances on different appliances vary across the approaches. For example, PHA stably outperforms CDA and EPA for appliances HEA and LAP, while it shows some degradation for appliances RFR and CRTTV.
Generally speaking, the proposed approach guarantees a reliable disaggregation for all appliances.
In order to reveal the effectiveness of the ensemble design for the proposed method, extensive results and discussions are provided in this subsection, focusing on the performance comparison between the individual classifiers and the combined classifier. The general results are shown in Table 5, while the detailed appliance results are shown in Table 6. As seen in Table 5, the proposed ensemble strategy improves NILM performance compared with the individual classifiers. The maximum enhancements of precision, sensitivity, and F-measure are 6%, 11%, and 10%, respectively, while the minimum improvements are around 2%, 4%, and 4%, respectively. In general, the proposed approach shows a robust enhancement via the ensemble strategy. As seen in Table 6, when the individual classifiers produce the same results, as for WSH, the combined classifier also produces the same results. Although the proposed approach shows a degraded performance for RFR, it successfully combines the individual classifiers for most of the other appliances, demonstrating the effectiveness of the proposed study.
Note that the heterogeneous design also plays an important role in the improved performance in our study. To clarify this, an additional test is conducted. The combined classifier is redesigned with an additional decision-making committee; this fourth committee uses an evaluation criterion similar to that of the individual classifiers but without the sparsity consideration. By doing so, the evaluation heterogeneity between the individual classifiers and the combined classifier discussed in Section 2.3.1 is no longer strictly satisfied. This additional design is denoted as WHA, and the disaggregation comparisons are given in Table 7. As seen in Table 7, by ignoring the heterogeneity, the ensemble strategy is no longer effective in NILM enhancement.
The performance is worse not only than that of our proposed method but also than that of the conventional dictionary learning approach, where electric features are considered all at once. Therefore, the NILM performance is highly dependent on the ensemble strategy, and our heterogeneous design is demonstrated to be effective in improving disaggregation results.
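For readers who wish to reproduce comparisons of this kind, the following minimal Python sketch shows how precision, sensitivity, and F-measure can be computed from per-sample on/off decisions, and how one approach can be expressed relative to a reference approach. It assumes the standard binary-classification definitions of these metrics; the function names and the toy state sequences are hypothetical and are not taken from the authors' implementation or data.

```python
from typing import Sequence, Tuple


def disaggregation_metrics(truth: Sequence[int], pred: Sequence[int]) -> Tuple[float, float, float]:
    """Return (precision, sensitivity, F-measure) for binary on/off states."""
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have the same length")
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + sensitivity
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0
    return precision, sensitivity, f_measure


def relative_change(value: float, reference: float) -> float:
    """Relative change of a metric with respect to a reference approach."""
    if reference == 0:
        raise ValueError("reference metric must be non-zero")
    return (value - reference) / reference


# Hypothetical on/off states of one appliance over eight samples.
truth = [1, 1, 0, 0, 1, 0, 1, 0]
pred_reference = [1, 0, 0, 1, 1, 0, 1, 0]  # stands in for a reference approach
pred_candidate = [1, 1, 0, 1, 1, 0, 1, 0]  # stands in for a compared approach

ref = disaggregation_metrics(truth, pred_reference)
cand = disaggregation_metrics(truth, pred_candidate)
for name, r, c in zip(("precision", "sensitivity", "F-measure"), ref, cand):
    print(f"{name}: reference={r:.3f}, candidate={c:.3f}, relative change={relative_change(c, r):+.1%}")
```

Reporting the relative change against a fixed reference mirrors the day-by-day comparison described for Figure 4, where the EPA results serve as the baseline.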

Studies on Field Measurement Dataset
The above comprehensive investigations on the simulation platform have verified the efficiency of the proposed approach. In order to further demonstrate the practical applicability of the study, a well-known public dataset, REDD [38], collected via field measurements from real houses in North America, is used for testing in the following discussions. Specifically, House 1 is selected for verification. The electrical appliances in this house are listed in Table 8.
The proposed heterogeneity-enhanced ensemble approach is still denoted as PHA in this subsection, and the conventional approaches used for comparison are again CDA [32] and EPA [36]. Since the data of House 1 are low-frequency without harmonic information, the individual classifiers are generated based on the original bootstrap sampling strategy and denoted as ICA1, ICA2, ICA3, and ICA4, respectively, in this subsection.
Table 9 provides the general results of REDD-based NILM for the different approaches. As seen, applying the ensemble strategy improves NILM performance on field measurements in all cases. However, the ensemble design matters for the specific results. The probability-model-framed strategy in [36] achieves a higher precision, while its enhancement of sensitivity is limited, resulting in only a slight improvement in the F-measure metric. As for the approach proposed in this article, though the enhancement in precision is not as high, there is a remarkable increase in the sensitivity metric, leading to very satisfying progress in F-measure.
Table 10 provides the detailed appliance disaggregation results for the different approaches. As seen, the NILM enhancement by the proposed approach is mainly due to the increase in the sensitivity metric for most appliances. From Tables 9 and 10, it is observed that the heterogeneity-enhanced ensemble NILM approach is effective in load disaggregation under a field environment, even when lacking sufficient data. For further investigation, the detailed results of the individual classifiers are compared in Table 11. Because the field measurement data are limited, the bagging strategy cannot generate highly differentiated individual classifiers. However, this deficiency is addressed by embedding the heterogeneous evaluation method into the ensemble framework. Therefore, by enhancing REDD-based NILM performance, the proposed study is verified to be an effective solution for energy monitoring.
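The bootstrap sampling used to generate the four individual classifiers can be illustrated with the short Python sketch below. It is a generic bagging-style resampling under stated assumptions: the function name, the placeholder training windows, and the fixed seed are hypothetical, and the sketch does not reproduce the authors' dictionary learning or their actual data preparation.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def bootstrap_subsets(samples: Sequence[T], n_subsets: int, seed: int = 0) -> List[List[T]]:
    """Draw n_subsets resamples with replacement, each as large as the original set."""
    rng = random.Random(seed)  # fixed seed only to make the illustration repeatable
    return [[rng.choice(samples) for _ in samples] for _ in range(n_subsets)]


# Hypothetical training windows standing in for low-frequency measurement segments.
windows = [f"window_{i}" for i in range(10)]
subsets = bootstrap_subsets(windows, n_subsets=4)

for name, subset in zip(["ICA1", "ICA2", "ICA3", "ICA4"], subsets):
    unique_share = len(set(subset)) / len(windows)
    print(f"{name}: {len(subset)} draws, share of distinct windows = {unique_share:.2f}")
```

Because every resample is drawn from the same small pool, the subsets overlap heavily when data are scarce, which is consistent with the observation that bagging alone cannot generate highly differentiated individual classifiers on limited field data.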

Conclusions
In this paper, ensemble-method-based NILM is further investigated in terms of calculation accuracy and efficiency. For the effective utilization of the ensemble strategy in NILM, a multidimensional heterogeneity design is embedded into the NILM-oriented ensemble model. Firstly, the individual classifiers are made mutually heterogeneous by following the bagging strategy. Then, the heterogeneity between the individual classifiers and the combined classifier is designed by applying diverse measure calculations from two perspectives: whether or not the evaluation considers sparsity, and whether or not weighted standardization is applied. Lastly, the combined classifier is also split into multiple heterogeneous decision-making committees, whose similarity evaluations are distinct from each other. Through verifications on a simulator platform and a field measurement dataset, the proposed approach is demonstrated to enhance NILM performance with limited computational cost. Moreover, the heterogeneity design is effective in reinforcing the diversity requirement of the ensemble method, which shows potential for expanding ensemble-approach-based NILM applications.
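The idea of heterogeneous decision-making committees can be sketched in a few lines of Python. The example below is only an illustration under assumptions: the three measures (Euclidean distance, Manhattan distance, cosine similarity), the plain majority vote, and all feature values are hypothetical placeholders chosen for clarity, and they are not the specific measures or the multiple-evaluations-based decision process defined in this paper.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
Score = Callable[[Vector, Vector], float]  # higher score means a better match


def euclidean_score(a: Vector, b: Vector) -> float:
    """Negated Euclidean distance, so that larger is better."""
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_score(a: Vector, b: Vector) -> float:
    """Negated Manhattan distance, so that larger is better."""
    return -sum(abs(x - y) for x, y in zip(a, b))


def cosine_score(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def committee_decision(observed: Vector, candidates: Dict[str, Vector], committees: List[Score]) -> str:
    """Each committee votes for its best-matching candidate; the most-voted candidate wins."""
    votes = Counter()
    for score in committees:
        best = max(candidates, key=lambda name: score(observed, candidates[name]))
        votes[best] += 1
    return votes.most_common(1)[0][0]


# Hypothetical observed aggregate features and reconstructions from three individual classifiers.
observed = [100.0, 20.0, 5.0]
candidates = {
    "ICA1": [90.0, 25.0, 5.0],
    "ICA2": [102.0, 19.0, 6.0],
    "ICA3": [50.0, 10.0, 2.5],  # same direction as observed, but half the magnitude
}
committees = [euclidean_score, manhattan_score, cosine_score]
print(committee_decision(observed, candidates, committees))
```

In this toy case the two distance-based committees prefer ICA2, while the cosine committee prefers ICA3 because it ignores magnitude. This disagreement is what makes evaluations that are distinct from each other informative when they are combined.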

Data Availability Statement:
The data presented in this study involve simulation data and a public dataset. The simulation platform is available in reference [37] and the public dataset is openly available in reference [38].