Semi-Supervised Domain Adaptation for Multi-Label Classification on Nonintrusive Load Monitoring

Nonintrusive load monitoring (NILM) is a technology that analyzes the load consumption and usage of individual appliances from the total load. NILM is becoming increasingly important because residential and commercial power consumption accounts for about 60% of global energy consumption. Deep neural network-based NILM studies have increased rapidly as hardware computation costs have decreased. A significant amount of labeled data is required to train deep neural networks. However, installing smart meters on every appliance in all households for data collection incurs costs that grow geometrically. Therefore, it is important to detect whether an appliance is in use from the total load alone, without installing separate smart meters. In other words, domain adaptation research, which can interpret the huge complexity of data and generalize information from various environments, has become a major challenge for NILM. In this research, we optimize domain adaptation by employing techniques such as robust knowledge distillation based on the teacher–student structure, reduced complexity of feature distribution based on gkMMD, TCN-based feature extraction, and pseudo-labeling-based domain stabilization. In the experiments, we down-sample the UK-DALE and REDD datasets to match real environments, and then verify the proposed model in various cases and discuss the results.


Introduction
Understanding energy usage in buildings has been considered an important issue because residential and commercial power consumption accounts for about 60% of global energy consumption [1]. Optimized energy usage management has advantages for both suppliers and consumers of energy. From the supplier's point of view, planned consumption may be encouraged according to the frequency and pattern of use of home appliances. In addition, comprehensive information about device-specific operations makes it easy for consumers to develop plans that reduce costs [2]. One way to obtain an electricity usage profile is to install a submeter on each appliance and record instantaneous power readings; in reality, however, applying this method to all devices is difficult due to cost and maintenance burdens. Therefore, nonintrusive load monitoring (NILM) aims to disaggregate energy consumption by device. The NILM method, which does not depend on submeters, has shown significant efficiency in commercial and residential energy utilization and remains an important task [3].
NILM is inherently difficult because it must analyze the simultaneous switching and noise generation of multiple devices without attaching multiple submeters [4][5][6]. To solve this problem, many techniques such as dynamic time warping (DTW), matrix factorization, neuro-fuzzy modeling, and graph signal processing (GSP) have been proposed, and both supervised and unsupervised learning-based techniques have been studied [7][8][9]. Hart [10] first introduced unsupervised learning methods to decompose electrical loads through clustering. However, because clustering-based methods have no training data and struggle to predict power loads accurately, various techniques such as hidden Markov models (HMMs) were subsequently proposed. In recent years, the number of research studies on deep neural networks (DNNs) has increased rapidly with the advancement of high-end hardware devices, and the availability of data for supervised learning has increased [11]. Long short-term memory (LSTM), a representative supervised learning technology, treats NILM as a prediction problem based on time-series data. Refs. [12,13] proposed methods for learning models by controlling data with various sampling-based windows. Nolasco et al. [14] included multi-label procedures to increase the recognition rate for multiple loads by marking active loads at any given time and developed architectures based on convolutional neural networks (CNNs), resulting in outstanding performance in signal detection and feature extraction. However, existing supervised learning methods for NILM still have two problems. First, there is a fundamental problem in assuming that the power usage data of real devices has a distribution similar to that of the training data.
It is impossible to ensure the same performance in actual situations because devices of the same type have different energy consumption depending on products and brands, noise form, intensity, physical environment, etc. [15]. To overcome this problem, training data containing all domain information must be acquired, but it is practically impossible since collecting the energy consumption of each device from different houses requires huge costs. Another problem is that, even assuming that neural network models are trained on all the data for different environments, extracting critical information is very difficult because of the vast amount of complex data [16][17][18]. Therefore, identifying suitable techniques that can handle the large complexity of data and generalize various domains of information is the main challenge in NILM.
To solve these problems, we consider domain adaptation [19,20]. Domain adaptation is a form of transfer learning that adapts a trained model to another domain's dataset on the same task. This concept can easily be applied to the NILM system. Many researchers have proposed domain adaptation systems to generalize information across domains [21,22]. Liu et al. [21] conducted a regression study to refine energy consumption by applying the most typical domain adaptation method to NILM. Since only the basic concept of domain adaptation has been applied to NILM, there is room for development in various directions. Ref. [22] proposed a method that incorporates the mean teacher method into domain adaptation, performing regression on the source and target domains using one model. However, this method did not show good performance in domain generalization due to its shallow model structure. To the best of our knowledge, there are no papers on classification tasks in domain adaptation studies for NILM. In this paper, we perform classification for appliance usage detection in NILM by incorporating powerful feature information distillation based on the teacher–student structure and pseudo-labeling (PL) into domain adaptation.
The main contributions of this paper are as follows:
1. We conduct the first classification study in the domain adaptation field for NILM;
2. We show performance improvements by incorporating robust feature information distillation techniques based on the teacher–student structure into domain adaptation;
3. The decision boundaries are refined through PL-based domain stabilization.
The remainder of this paper is organized as follows. Section 2 shows a brief review of related studies of NILM and domain adaptation. Section 3 introduces the proposed method. Section 4 presents the experimental setup, case study, and discussions. Finally, Section 5 concludes the paper.

Nonintrusive Load Monitoring
Consider a building with m appliances, each with k operating power modes, over the time interval [1, . . . , T]. Let x_i = (x_i(1), . . . , x_i(T)) denote the energy consumption of the i-th device. The energy usage of the i-th device at sample time n can be formulated as follows:

x_i(n) = Σ_{j=1}^{k} ψ_j^i · U_j^i(n) + ε_i(n), (1)

where ψ_k^i is the electricity consumed in a particular operating mode, ε_i(n) denotes the measurement background noise, and U_k^i(n) ∈ {0, 1} is the On/Off operating status of the i-th appliance at time n. The operating status satisfies the equality constraint Σ_{j=1}^{k} U_j^i(n) = 1 since each appliance operates in a single mode. At time n, the final aggregate energy of the house is expressed as follows:

x(n) = Σ_{i=1}^{m} x_i(n), (2)

The goal of the NILM algorithm is to disaggregate the measured electricity usage x to generate appliance-specific energy consumption profiles [23,24]. Therefore, the final challenge is to reduce the difference between the actual measurements of the devices and the disaggregated energy consumption [25].
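As a concrete illustration, the additive signal model above can be simulated in a few lines (a minimal NumPy sketch; the appliance count, per-mode powers `psi`, and noise level are illustrative values, not taken from the datasets):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 8            # number of time steps
m = 2            # number of appliances
# Hypothetical per-mode power draws (watts): rows = appliances, cols = modes (off, on)
psi = np.array([[0.0, 120.0],    # appliance 0: e.g., a refrigerator
                [0.0, 800.0]])   # appliance 1: e.g., a microwave

# One-hot operating status U[i, k, n]: exactly one mode active per appliance per step
modes = rng.integers(0, 2, size=(m, T))       # sampled mode index per step
U = np.eye(2)[modes].transpose(0, 2, 1)       # shape (m, k, T)
assert np.all(U.sum(axis=1) == 1)             # single-mode constraint: sum_k U_k^i(n) = 1

noise = rng.normal(0.0, 1.0, size=(m, T))     # background measurement noise eps_i(n)
x_i = np.einsum("ik,ikn->in", psi, U) + noise # per-appliance consumption x_i(n), Eq. (1)
x = x_i.sum(axis=0)                           # aggregate household signal x(n), Eq. (2)
print(x.shape)                                # (8,)
```

The NILM task is then the inverse problem: recover the per-appliance statuses U (or profiles x_i) from the aggregate x alone.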
Elafoudi et al. [7] detected edges within a time window and used DTW to identify unique load signatures. Lin et al. [8] proposed a hybrid classification technique that combines fuzzy c-means and clustering piloting with neuro-fuzzy classification to distinguish devices that have similar load signatures. He et al. [9] handled NILM as a single-channel blind source separation problem to perform low-complexity classification of active power measurements. Based on this idea, they proposed a GSP-based NILM system to handle the large training overhead and computational cost of the conventional graph-based method.

Domain Adaptation
Domain adaptation is an area of transfer learning [26]. In general transfer learning, the task or the domain can change from source to target; in domain adaptation, however, the premise is that only the domain changes [19,27]. The aim is to generalize a classification or regression model trained on the source domain so that it applies to target domains with different distributions, since distribution disagreement between training and real data yields poor model performance. Ganin et al. [19] proposed a multitask learning model with a class classifier and a domain classifier. The model was trained to classify class labels only, not domain labels. For this, they introduced the gradient reversal layer (GRL) into the domain classifier. The GRL multiplies the gradient by a negative constant on the backward pass, making the model remove domain information from its feature extractor. With the advancement of deep neural networks (DNNs), domain adaptation has achieved outstanding performance in various fields [11,14,[28][29][30][31][32][33]. In [34], domain adversarial training of neural networks (DANN), inspired by the generative adversarial network (GAN), laid the foundation for applying adversarial learning methodologies to domain adaptation and accomplished excellent performance. In addition, domain adaptation algorithms based on the maximum mean discrepancy (MMD) between source and target have been widely studied [35][36][37][38]. In [39], Long et al. proposed a joint MMD to align the joint distributions. Deep domain confusion (DDC) [34] proposed a technique for using pre-trained networks by adding MMD-based adaptation layers.
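The GRL mechanism described above can be sketched in a few lines of PyTorch (a minimal sketch; the class name and the λ handling are illustrative, not from the cited works):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; multiplies the gradient by -lambda on backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed (negated, scaled) gradient flows back to the feature extractor.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# The domain classifier placed after grad_reverse sees ordinary gradients,
# while the feature extractor receives reversed ones, pushing it toward
# domain-invariant features.
feat = torch.randn(4, 16, requires_grad=True)
out = grad_reverse(feat, lambd=1.0).sum()
out.backward()
print(torch.allclose(feat.grad, -torch.ones_like(feat)))  # True
```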
Although domain adaptation is used in various fields as described above, the application of domain adaptation to NILM has not been researched much and requires advancement. In [40], Wan proposed a domain adaptation algorithm for optical character recognition (OCR), which was extended to the NILM field and produced prominent results. Recently, Wang and Mao [41] proposed applying a model-agnostic meta-learning (MAML)-based domain adaptation algorithm to NILM, inspired by the pre-trained models heavily studied in the NLP field, and outperformed state-of-the-art deep learning-based methods.

Semi-Supervised Domain Adaptation for Multi-Label Classification on Non-Intrusive Load Monitoring
Various deep learning models are applied to the NILM field. However, the task of segmenting the use of different devices in many houses is still a relatively new concept. To solve this problem, we propose the semi-supervised domain adaptation for multi-label classification on non-intrusive load monitoring. The overall diagram is shown in Figure 1.
Several hypotheses are proposed in this work to apply semi-supervised domain adaptation to NILM. The first hypothesis is that the distributions of the source and target domains are different. Most NILM systems are based on this hypothesis. We also use labels on the source for domain adaptation, not on the target. Second, even if the distributions of source domain data and target domain data differ in NILM, it is assumed that the same device has domain-independent common characteristics regardless of the domain. This is because motor devices draw a lagging current with slow current flow, which results in a low power factor, whereas capacitive devices generate a leading current with fast current flow, which results in a high power factor. The power factor is the ratio of active power to apparent power, regardless of the magnitude of power consumption.
In other words, if two different houses use the same electronic devices (e.g., refrigerator, TV, etc.) from different manufacturers, it is assumed that there is a common usage pattern even if the power consumption is different.
The proposed method consists of three main steps, shown in Figure 2. In the knowledge distillation stage, high-level knowledge is distilled into the student network (SN) by a temporal convolutional network (TCN) [42]-based teacher network (TN) [43] trained using labeled source data. Domain-dependent features vary depending on the domain, and domain-independent features are constant regardless of the domain. In the next step, we perform a robust domain adaptation that extracts only domain-independent features to adapt source and target data to the neural networks regardless of domain. Appliance usage detection classifies devices from source domain data. Additionally, the domain classifier is trained with a GRL to prevent classification of the source and target domains. As a result, the feature extractor can extract robust domain-independent features that enable device usage classification regardless of domain. In the domain stabilization step in Figure 2, we stabilize the domain through PL-based fine-tuning. First, domain-independent features of the target data are extracted from the feature extractor and then pseudo-labeled based on the source domain labels in appliance usage detection. Since not all target data can be pseudo-labeled, it is only partially pseudo-labeled. Therefore, the target data consist of pseudo-labeled data and unlabeled data. Secondary domain adaptation is performed based on the enhanced target domain data and the domain-independent features extracted through robust distillation. The network performance is stabilized and improved through the advantages of low-density separation between classes and entropy regularization. Details of each part of the proposed framework are given in the subsections. Figure 2 shows a step-by-step flowchart of the proposed method.


Network Architecture
The goal of this section is to build a semi-supervised domain adaptation model that can estimate the target domain label Y_t using labeled source data (X_s, Y_s) and target data X_t. As shown in Figure 2, the model includes three parts: knowledge distillation, robust domain adaptation, and domain stabilization. Details of the network structure are as follows: (1) Knowledge distillation: knowledge is distilled using a TCN feature-extraction-based teacher–student network to obtain robust domain-independent features of the source data. The TCN is a structure that extends CNNs to time-series data modeling. It provides better performance than typical time-series deep learning models such as LSTM because it has a much longer and more effective memory without a gate mechanism. The TCN consists of several residual blocks, and each block contains a dilated causal convolution operation O. For input x ∈ R^n and filter f: {0, 1, . . . , k − 1} → R, the dilated causal convolution at element s is defined as follows:

O(s) = (x *_d f)(s) = Σ_{i=0}^{k−1} f(i) · x_{s − d·i}, (3)

where d is the dilation factor, *_d is the d-dilated convolution, k is the filter size, and x_{s − d·i} refers to the past values. However, as the network depth increases, performance decreases rapidly due to overfitting. ResNet's key concept, residual mapping, can solve this problem. The TCN residual block includes two layers of dilated causal convolution based on the ReLU activation function, weight normalization, and dropout. A 1 × 1 convolution layer in the TCN ensures that the input and output are the same size. The output of the transformation T of the time-series data in the TCN residual block is added to the identity mapping of the input x and expressed as follows:

o = Activation(x + T(x, θ)), (4)

where θ denotes the set of parameters of the network. It has already been demonstrated that this concept of residual blocks improves network performance by learning modifications to the identity mapping rather than the overall transformation.
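The TCN residual block described above can be sketched in PyTorch as follows (a minimal sketch; the channel sizes, kernel size, and dropout rate are illustrative, not the paper's configuration):

```python
import torch
import torch.nn as nn

class TCNResidualBlock(nn.Module):
    """One TCN residual block: two dilated causal conv layers with weight
    normalization, ReLU, and dropout, plus a 1x1 conv on the skip path
    when the channel counts differ."""
    def __init__(self, c_in, c_out, k=3, dilation=1, dropout=0.2):
        super().__init__()
        self.pad = (k - 1) * dilation  # left-only padding keeps the conv causal
        self.conv1 = nn.utils.weight_norm(nn.Conv1d(c_in, c_out, k, dilation=dilation))
        self.conv2 = nn.utils.weight_norm(nn.Conv1d(c_out, c_out, k, dilation=dilation))
        self.drop = nn.Dropout(dropout)
        self.downsample = nn.Conv1d(c_in, c_out, 1) if c_in != c_out else nn.Identity()

    def forward(self, x):                          # x: (batch, channels, time)
        y = nn.functional.pad(x, (self.pad, 0))    # pad only the past
        y = self.drop(torch.relu(self.conv1(y)))
        y = nn.functional.pad(y, (self.pad, 0))
        y = self.drop(torch.relu(self.conv2(y)))
        return torch.relu(y + self.downsample(x))  # residual mapping, Eq. (4)

block = TCNResidualBlock(c_in=1, c_out=8, k=3, dilation=2)
out = block(torch.randn(4, 1, 15))                 # e.g., a 15-sample window
print(out.shape)  # torch.Size([4, 8, 15])
```

Because the padding is applied only on the left, each output step depends only on present and past inputs, which is what makes the convolution causal.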
Based on this, it is possible to build a deep TCN network by stacking multiple TCN residual blocks.
Assuming that x_I is the input of the I-th block, the network forward propagation from the I-th block to the (I + n)-th block can be formulated as follows:

x_{I+n} = x_I + Σ_{i=I}^{I+n−1} T(x_i, θ_i), (5)

where x_I is the input of the I-th block and θ_i is the parameter set of the i-th block. Therefore, the feature extractor FE_te(x_s, θ_f^te) of the TN is defined as follows:

f_te = FE_te(x_s, θ_f^te) = x_s^1 + Σ_{i=1}^{k} T(x_s^i, θ_f^te_i), (6)

where k is the number of layers, x_s is the source data, θ_f^te is the parameter set of the TN, x_s^i is the i-th block input of the source data, and θ_f^te_i is the parameter set of the i-th block in the TN. Additionally, the feature extractor FE_st(x_s, θ_f^st) of the SN can be defined as follows:

f_st = FE_st(x_s, θ_f^st) = x_s^1 + Σ_{i=1}^{l} T(x_s^i, θ_f^st_i), (7)

where l is the number of layers, θ_f^st is the parameter set of the SN, and θ_f^st_i is the parameter set of the i-th block in the SN. Based on f_te extracted from Equation (6), the TN must extract soft-label information for transferring knowledge to the SN through appliance usage detection, which consists of a fully connected layer. The output ŷ_te of the TN is defined as follows:

ŷ_te,i = exp(AUD_te(f_te, θ_te)_i / T) / Σ_{j=1}^{K} exp(AUD_te(f_te, θ_te)_j / T), (8)

where te refers to the TN, ŷ_te is the predicted classification label of x_s in the TN, T is a temperature parameter, and the Softmax with T is a Softmax function with a temperature parameter. θ_te is the parameter set of AUD_te, AUD_te(f_te, θ_te)_i is the i-th element of the output vector of AUD_te, and K is the number of elements of the output vector. The temperature parameter maximizes the benefit of the soft-label values for knowledge distillation by preventing information loss in the Softmax output. The estimated soft label ŷ_te is compared with the soft prediction ŷ_st^sp of the SN and is used as a distillation loss in network training. ŷ_st^sp is obtained as follows:

ŷ_st^sp = Softmax_T(AUD_st(f_st^s, θ_st)), (9)

where st refers to the SN, ŷ_st^sp is the predicted classification label of x_s in the SN and the soft prediction value of the SN, θ_st is the parameter set of AUD_st, and AUD_st(f_st^s, θ_st)_i is the i-th element of the output vector of AUD_st. The classification performance of the SN should be evaluated along with knowledge distillation.
The performance can be evaluated by comparing the hard prediction ŷ_st^hp of the SN with the ground truth y_s of the source domain data. ŷ_st^hp is obtained as follows:

ŷ_st^hp = Softmax(AUD_st(f_st^s, θ_st)), (10)

where ŷ_st^hp is the predicted classification label of x_s in the SN and is used as the hard prediction value of the SN. In Equation (10), the temperature parameter is not used.
The difference between the source and target feature distributions is measured by the maximum mean discrepancy (MMD):

MMD(f_st^s, f_st^t) = || (1/n_s) Σ_{i=1}^{n_s} ϕ(f_st^s_i) − (1/n_t) Σ_{j=1}^{n_t} ϕ(f_st^t_j) ||_H, (11)

where ϕ is a feature space mapping function that maps the original feature space into the reproducing kernel Hilbert space H. Further descriptions of the kernel are given in the following subsection. The domain classifier DC(f, θ_dc) learns by setting the ground truth values of the source domain data and the target domain data to dc_s = 0 and dc_t = 1, respectively, to separate the domain-independent features from the feature extractor. DC(f, θ_dc) has an output d̂c_s for source domain data and an output d̂c_t for target domain data. The two outputs are defined as follows:

d̂c_s = Softmax(DC(f_st^s, θ_dc)), (12)
d̂c_t = Softmax(DC(f_st^t, θ_dc)), (13)

where f_st^s is the source domain feature, f_st^t is the target domain feature, and θ_dc is the parameter set of DC. d̂c_s and d̂c_t take values between 0 and 1. DC can obtain domain-independent features from FE_st by learning such that the source and target domains cannot be classified. Appliance usage detection uses the AUD_st of the SN. AUD_st verifies classification performance using the domain-independent features of the source data as input. The prediction of device usage detection can be obtained using Equation (10). In network inference, the prediction for the target domain may be obtained using Equation (14).
ŷ_t = Softmax(AUD_st(f_st^t, θ_st)), (14)

where ŷ_t is the prediction of the target data. Detection performance for the target domain data is evaluated by comparing ŷ_t with the ground truth y_t of the target domain data. Based on Equation (14), the PL is generated as the prediction value with the highest probability among the Softmax values. However, if that probability is lower than the threshold, the data are not pseudo-labeled. The threshold is obtained experimentally. Domain stabilization consists of three parts: feature extraction, domain classification, and appliance usage detection. The domain classifier uses Equations (12) and (13), and the appliance usage detection performs AUD_st((f_st^s, f_st^tl), θ_st).
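The confidence-thresholded pseudo-labeling step can be sketched as follows (a minimal NumPy sketch; the probabilities and the 0.9 threshold are illustrative, since the paper determines the threshold experimentally):

```python
import numpy as np

def pseudo_label(probs, threshold=0.9):
    """Keep only target samples whose maximum Softmax probability exceeds the
    threshold; return their indices and hard pseudo-labels."""
    conf = probs.max(axis=1)
    keep = conf >= threshold
    return np.flatnonzero(keep), probs[keep].argmax(axis=1)

# Hypothetical Softmax outputs for four unlabeled target windows
probs = np.array([[0.95, 0.05],
                  [0.60, 0.40],
                  [0.08, 0.92],
                  [0.55, 0.45]])
idx, labels = pseudo_label(probs, threshold=0.9)
print(idx.tolist(), labels.tolist())   # [0, 2] [0, 1]
```

The remaining low-confidence windows stay unlabeled, so the target set is split into pseudo-labeled and unlabeled parts exactly as described above.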

Network Losses
We carefully design the network losses to obtain domain-independent features from the feature distributions. We divide the network loss into the following four components: knowledge distillation loss, feature distribution difference loss, domain classification loss, and appliance usage detection loss.
(1) Knowledge distillation loss: As shown in Figure 1, the knowledge distillation phase loss is the sum of the distillation loss L_ds and the student loss L_st. L_ds includes the difference between the classification results of the TN and the SN in the loss. L_ds is defined as follows:

L_ds = α · L_ce(ŷ_te, ŷ_st^sp), (15)

where L_ce is the cross-entropy loss and α is a weighting coefficient. The cross-entropy loss is computed over the teacher and student outputs. If the classification results of the teacher and the student are the same and distillation is good, L_ds takes a small value. Additionally, L_st denotes the cross-entropy loss of the classification of the SN. L_st is defined as follows:

L_st = L_ce(ŷ_st^hp, y_s), (16)

Even in a network with relatively fewer parameters than the TN, L_st is also reduced as L_ds becomes smaller, so the SN shows good feature extraction and classification performance.
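The combined distillation and student loss can be sketched in PyTorch (a minimal sketch assuming a single-label Softmax head; `T` and `alpha` are illustrative hyperparameters, and the T² rescaling is the common Hinton-style convention rather than a detail stated here):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    """Weighted sum of (i) cross-entropy between temperature-softened teacher
    and student outputs (L_ds) and (ii) ordinary cross-entropy against the
    hard source labels (L_st)."""
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    l_ds = -(soft_teacher * log_soft_student).sum(dim=1).mean() * (T * T)
    l_st = F.cross_entropy(student_logits, targets)
    return alpha * l_ds + (1.0 - alpha) * l_st

s = torch.randn(8, 5)            # student logits, 5 appliance classes
t = torch.randn(8, 5)            # teacher logits
y = torch.randint(0, 5, (8,))    # hard source labels
loss = distillation_loss(s, t, y)
print(loss.item() > 0)  # True
```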
(2) Feature distribution difference loss: As shown in Figure 1, the feature distribution difference loss is the MMD loss [44] L_f. L_f estimates the difference between the feature distribution of the source domain data X_s and the feature distribution of the target domain data X_t through the MMD. L_f is generally defined as follows:

L_f = || (1/n_s) Σ_{i=1}^{n_s} ϕ(f_st^s_i) − (1/n_t) Σ_{j=1}^{n_t} ϕ(f_st^t_j) ||²_H, (17)

For the mapping function ϕ of Equation (17), we use the kernel trick because computing all the moments directly requires too many computational resources. We utilize the Gaussian kernel, as shown in Equation (18):

gk(x, x′) = exp(−||x − x′||² / (2σ²)), (18)
where gk is the Gaussian kernel and σ is its bandwidth. In Equation (18), the Taylor expansion of the exponential function develops as in Equation (19):

e^z = Σ_{n=0}^{∞} z^n / n!, (19)

Since Equation (19) contains all the moments of x, we use the Gaussian kernel.
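A biased empirical estimate of the Gaussian-kernel MMD between two feature batches can be sketched as follows (a minimal NumPy sketch; the bandwidth `sigma`, batch sizes, and feature dimension are illustrative):

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """gk(a, b) = exp(-||a - b||^2 / (2 sigma^2)); its Taylor expansion
    implicitly compares all moments of the two feature distributions."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(xs, xt, sigma=1.0):
    """Biased squared-MMD estimate between source and target feature batches."""
    k_ss = gaussian_kernel(xs, xs, sigma).mean()
    k_tt = gaussian_kernel(xt, xt, sigma).mean()
    k_st = gaussian_kernel(xs, xt, sigma).mean()
    return k_ss + k_tt - 2.0 * k_st

rng = np.random.default_rng(0)
same = mmd2(rng.normal(0, 1, (64, 8)), rng.normal(0, 1, (64, 8)), sigma=2.0)
shifted = mmd2(rng.normal(0, 1, (64, 8)), rng.normal(3, 1, (64, 8)), sigma=2.0)
print(same < shifted)  # True: shifted distributions yield a larger MMD
```

Minimizing such an estimate over the extracted features drives the source and target feature distributions together, which is the role of L_f.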
Applying the kernel trick, the inner product in H is replaced by the kernel, as in Equation (20):

⟨ϕ(x), ϕ(x′)⟩_H = gk(x, x′), (20)

When Equation (17) is expanded using Equation (20), L_f is re-formulated as shown in Equation (21):

L_f = (1/n_s²) Σ_{i,j} gk(f_st^s_i, f_st^s_j) + (1/n_t²) Σ_{i,j} gk(f_st^t_i, f_st^t_j) − (2/(n_s·n_t)) Σ_{i,j} gk(f_st^s_i, f_st^t_j), (21)
(3) Domain classification loss: As shown in Figure 1, the domain classification loss L_dc relates FE_st and DC. DC(f, θ_dc) is modeled so that the source domain and the target domain cannot be distinguished. To minimize the distribution difference between f_st^s and f_st^t, the loss of DC(f, θ_dc) should be maximized. Using d̂c_s and d̂c_t of DC(f, θ_dc), the binary-classifier-based cross-entropy loss L_dc can be obtained as in Equation (22):

L_dc = −(1/sn) Σ_{i=1}^{sn} [ log(1 − d̂c_s,i) + log(d̂c_t,i) ], (22)

where sn is the number of samples in a mini-batch.
(4) Appliance usage detection loss: As shown in Figure 1, the appliance usage detection loss uses L_st in the domain adaptation phase and L_aud in the robust domain adaptation phase. Since both losses are applied to the same AUD_st, the same loss equation is formulated, as in Equations (23) and (24).
Each neural network is trained by differentiating its loss with respect to the corresponding weights, as shown by the dotted lines in Figure 1.

Training Strategy
According to the network losses discussed above, the final optimization objective can be expressed as follows:

(θ*_f^st, θ*_st, θ*_dc) = arg min_{θ_f^st, θ_st, θ_dc} (L_ds + L_st + L_f + L_dc), (25)

Assuming that θ_f^te and θ_te belong to a pre-trained high-performance network, they are not further trained to reduce the network loss. When learning L_dc of Equation (22), we apply the gradient reversal layer (GRL) so that training proceeds in a direction that fails to classify the domains. The pseudo-code of the proposed model is summarized in Algorithm 1.

Algorithm 1: Parameter optimization procedure of the proposed method.

Input: The source domain data (x_s, y_s) and the target domain data (x_t), with M total samples each.
Output: The optimized parameters (θ*_f^st, θ*_st, θ*_dc)
# Knowledge distillation phase
for m = 0 to epochs do
    for n = 0 to minibatches do
        # Forward propagation

Dataset
Two publicly available NILM datasets, UK-DALE [45] and REDD [46], were used for performance evaluation. UK-DALE collected smart meter data from five UK buildings, with a sampling resolution of 1 s for total home consumption and 6 s for the corresponding device-level consumption. The data were recorded over periods ranging from 39 to 600 days. REDD was collected from six actual buildings in the United States. The measurement period is between 3 and 19 days, consisting of appliance-level energy consumption data sampled every 3 s and total measurements sampled every 1 s. This article analyzes the use of the following five representative household appliances: dishwasher (DW), refrigerator (FG), kettle (KT), microwave (MV), and washing machine (WM). Since REDD does not have kettle data, NILM on REDD uses four appliances, excluding the kettle. The selected appliances exhibit various power patterns and power levels.

Data Preprocessing
The power consumption in both datasets is downsampled to 1 min and then preprocessed for missing values using linear interpolation. Each appliance is classified as ON (1) if its power consumption (over 15 min) is greater than an experimentally set threshold and as OFF (0) otherwise. Figures 3 and 4 show the power usage of each home appliance in UK-DALE and REDD, respectively, and the corresponding thresholds for determining ON events. Each threshold was experimentally determined so as to include all ON states. However, since the FG operates continuously, its threshold was determined based on the state in which the motor is running. Table 1 shows the exact threshold value of each appliance and the number of ON events determined accordingly. The split ratio of training, validation, and test data is 6:2:2. A sliding window of around 15 min is used based on the ON event. A sliding window W with stride length l_s runs forward over the sequence to obtain an input sample x = (x_1, x_2, . . . , x_W). For each i-th window, the network outputs y_i = (y_i^DW, y_i^FG, y_i^KT, y_i^MV, y_i^WM).
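The preprocessing pipeline above (1-min downsampling, linear interpolation, thresholded ON/OFF labels over 15-min windows) can be sketched for a single appliance as follows (a minimal pandas sketch; the synthetic readings and the 650 W threshold are illustrative):

```python
import numpy as np
import pandas as pd

def preprocess(series, threshold, window=15, stride=15):
    """Downsample to 1-min means, interpolate gaps, then label each
    non-overlapping 15-min window ON (1) if its mean power exceeds the
    appliance threshold, else OFF (0)."""
    power = series.resample("1min").mean().interpolate(method="linear")
    values = power.to_numpy()
    n = (len(values) - window) // stride + 1
    X = np.stack([values[i * stride : i * stride + window] for i in range(n)])
    y = (X.mean(axis=1) > threshold).astype(int)
    return X, y

# Hypothetical 6-second readings for one appliance over one hour:
# 30 min off, then 30 min drawing 1400 W
idx = pd.date_range("2022-01-01", periods=600, freq="6s")
watts = pd.Series(np.r_[np.zeros(300), np.full(300, 1400.0)], index=idx)
X, y = preprocess(watts, threshold=650)
print(X.shape, y.tolist())  # (4, 15) [0, 0, 1, 1]
```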

Implementation Configuration
To obtain an input sample, W is set to 15, and l_s is set to 15 so that the data are non-overlapping. In the TN, there are 3.2 times more parameters in the feature extractor and 1.6 times more parameters in the fully connected layer compared to the SN. The epochs in the robust domain adaptation and domain stabilization phases are not set separately because the early stopping parameter automatically controls learning. The basic structure of the SN is cited from [20]. The TN is experimentally determined to have a structure approximately twice as large as the SN. The mini-batch size is set to the maximum value applicable in the experimental environment. A decaying learning rate is used to determine the optimal value by repeatedly reducing it by one-third. The parameters of the proposed model are listed in Table 2.

Table 1. Threshold value (W) and number of ON events for each appliance and house.

Appliance | U1 Threshold | U1 Events | U2 Threshold | U2 Events | R1 Threshold | R1 Events | R3 Threshold | R3 Events
DW | 2000 | 4431 | 1800 | 3236 | 1000 | 6712 | 650 | 2934
FG | 250 | 2441 | 400 | 5291 | 400 | 2944 | 350 | 3344
KT | 2200 | 4495 | 2000 | 1694 | - | - | - | -
MV | 1400 | 1242 | 1200 | 4218 | 1200 | 4809 | 1600 | 1327
WM | 1800 | 4980 | 1500 | 1524 | 2500 | 4796 | 2200 | 5764

All experimental models were implemented and executed in Python 3.6 [47] with the PyTorch framework [48], and training and inference used an NVIDIA RTX 2070 SUPER.

Ablation Study Methods
Our model consists of the following four main techniques: TCN, gkMMD, teacher–student (TS) structure, and PL. We conduct an ablation study over five method variants to investigate how the individual components influence the performance improvements of the proposed model.

Implementation Configuration
To obtain an input sample, the window size W is set to 15 and the stride l_s is set to 15 so that samples do not overlap. In the TN, there are 3.2 times more parameters in the feature extractor and 1.6 times more parameters in the fully connected layer than in the SN. Separate epoch counts are not set for the robust domain adaptation and domain stabilization phases because the early stopping parameter controls training automatically. The basic structure of the SN follows [20]. The TN is experimentally determined to have a structure approximately twice as large as the SN. The mini-batch size is set to the maximum value supported by the experimental environment. A decaying learning rate is used, with the optimal value found by repeatedly reducing the rate by one-third. The parameters of the proposed model are listed in Table 2.
All experimental models were implemented and executed in Python 3.6 [47] with the PyTorch framework [48]; training and inference used an NVIDIA RTX 2070 SUPER GPU.
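The windowing and learning-rate settings above can be sketched as follows. The function names are ours, and the decay schedule shape (how often the one-third reduction is applied) is an assumption; only W = 15, l_s = 15, and the one-third reduction come from the text:

```python
import numpy as np

def make_windows(series, window=15, stride=15):
    """Slice a 1-D load series into fixed-length input samples; with
    stride == window (W = 15, l_s = 15) the samples do not overlap."""
    n = (len(series) - window) // stride + 1
    return np.stack([series[i * stride : i * stride + window] for i in range(n)])

def decayed_lr(initial_lr, n_decays):
    """Reduce the learning rate by one-third per decay step
    (lr <- (2/3) * lr); the step count per decay is our assumption."""
    return initial_lr * (2.0 / 3.0) ** n_decays

samples = make_windows(np.arange(60.0))   # 60 readings -> 4 samples of length 15
lr = decayed_lr(1e-3, 2)                  # learning rate after two reductions
```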

Evaluation Metrics
Performance evaluation uses the F1-score, a standard metric. The F1-score is derived as shown in Equation (26):

F1 = 2TP / (2TP + FP + FN),    (26)

where TP is the number of true positives, FP the number of false positives, and FN the number of false negatives.
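As a concrete reference, Equation (26) can be computed for one appliance's ON/OFF predictions as follows (a minimal sketch; the function and variable names are ours):

```python
def f1_score(y_true, y_pred):
    """F1 = 2TP / (2TP + FP + FN) for one appliance's binary ON/OFF labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# 2 true positives, 1 false positive, 1 false negative -> F1 = 4/6
score = f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```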
To the best of our knowledge, there is no low-sampling-rate classification study in the domain adaptation field for NILM. Therefore, we did not conduct a one-to-one comparison with other studies.

Case Studies and Discussions
In this section, we conduct experiments on two cases. In the first case, one house is designated as the source domain and another house as the target domain within the same dataset. In the second case, the source domain and the target domain are taken from different datasets. Tables 3-5 show the F1 scores of domain adaptation for six methods. The 'Improvement' row shows how much the proposed method improves over the baseline. In addition, the ablation experiments indicate how much each method contributes to overall performance.

Domain Adaptation within the Same Dataset
In this subsection, experiments are carried out on the first case described above. In Table 3, U1 denotes House1 of UK-DALE, U2 denotes House2 of UK-DALE, R1 denotes House1 of REDD, and R3 denotes House3 of REDD. There are no results for KT since REDD does not include a kettle, and none for DW in R3 since DW is not used in that house.
Relative to the baseline, TCN-DA was the method with the greatest influence on performance apart from our full method, showing an average improvement of 3.38%. Next, TS-DA showed an improvement of 2.45%. In the case of gkMMD-DA, performance improved slightly or was slightly reduced. Table 4 shows the F1 scores for TCN and gkMMD; gkMMD generally helps improve performance when used with networks containing residual blocks. PL-DA yielded an average stabilization gain of 0.51% because it fine-tunes the network in a direction that stabilizes the domain. Our method achieved a significant improvement of 6.03% on average over the baseline.
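The PL-DA fine-tuning step can be illustrated with a simple confidence-thresholded pseudo-labeling rule for multi-label outputs. The thresholds (0.9 / 0.1) and all names below are illustrative assumptions, not the paper's exact values:

```python
import numpy as np

def pseudo_label(probs, hi=0.9, lo=0.1):
    """Assign pseudo-labels to confident multi-label predictions.

    probs: (batch, n_appliances) sigmoid outputs on unlabeled target data.
    Returns hard 0/1 labels and a mask marking entries confident enough
    (in the low-density region between classes) to be used when
    fine-tuning the student network."""
    labels = (probs >= 0.5).astype(np.float32)
    mask = (probs >= hi) | (probs <= lo)    # keep only confident entries
    return labels, mask

probs = np.array([[0.95, 0.40],
                  [0.05, 0.70]])
labels, mask = pseudo_label(probs)          # only 0.95 and 0.05 pass the mask
```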

Domain Adaptation between Different Datasets
In this subsection, experiments are performed on the second case described above. In Table 5, UK-DALE → REDD uses UK-DALE as the source domain and REDD as the target domain, and REDD → UK-DALE uses REDD as the source and UK-DALE as the target. In this case, the average performance improves by 5.74% even though the change in domain characteristics is greater than in the first case. Although the domains differ, the same type of appliance exhibits almost the same power-usage pattern, so domain adaptation performs well. This confirms that, in the NILM field, it may not be necessary to train a new neural network for every household and living area. Our method shows better results than the baseline.
Experiments show that domain adaptation within the same dataset performs well with the proposed method, and performance improvements are also observed for domain adaptation between different datasets. It is a significant result that our method, without training an individual model for each household, achieves a performance improvement of 5-6% through a single training run. There are several main reasons for the improved accuracy. (1) Rich domain-independent feature information is extracted through teacher-student-based knowledge distillation. (2) Using TCN residual blocks and gkMMD together effectively reduces the distribution mismatch between the two domains. (3) PL stabilizes the network's decision boundaries.
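As a rough sketch of the distribution-matching idea in reason (2), a Gaussian-kernel MMD between source and target feature batches can be estimated as below. This is the generic biased MMD² estimator with an RBF kernel, not the paper's exact gkMMD implementation, and the bandwidth `sigma` is an assumption:

```python
import numpy as np

def gaussian_mmd2(x, y, sigma=1.0):
    """Biased estimate of squared MMD between feature batches x and y
    using a Gaussian (RBF) kernel with bandwidth sigma."""
    def kernel(a, b):
        # pairwise squared distances via broadcasting: (n, m)
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(64, 8))   # source-domain features
tgt = rng.normal(2.0, 1.0, size=(64, 8))   # shifted target-domain features
gap = gaussian_mmd2(src, tgt)              # large when distributions differ
same = gaussian_mmd2(src, src)             # ~0 for identical batches
```

Minimizing such a term over the extracted features pushes the two domains' feature distributions together, which is the role gkMMD plays alongside the TCN residual blocks.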

Discussions
The proposed model can automatically track the use of individual appliances from the total load. We outline applications of our method for elderly households living alone and for public electricity management institutions.
For elderly people living alone, the risk of dying unattended is generally very high, and this is one of the critical problems to be solved at the government level. By analyzing appliance usage patterns, a household risk detection system can be developed through in-household anomaly detection. Efficient energy management is also an essential issue for public electricity management institutions. An energy management system that adjusts the power generation ratio can be developed by identifying and managing energy-inefficient customers based on appliance usage patterns and power consumption.
There are several limitations to the proposed method. (1) Domain adaptation is difficult to apply if the appliances in the source and target data differ. (2) The difference in power usage between households is so large that data imbalance is severe. (3) Although performance is improved by reducing the distribution difference between source and target features, there is no clear theoretical basis showing that reducing this difference yields domain-independent features; the justification is largely experimental. In future work, we aim to address the second limitation, data imbalance, which is one of the most fundamental problems in neural network training. We plan to explore GAN-based sampling methods to resolve the imbalance, or networks that learn well despite imbalanced data.

Conclusions
We developed a novel methodology that combines robust knowledge transfer and network stabilization for NILM, improving on previous work and generalizing across domains. The proposed method improves the detection of appliance usage on unlabeled target-domain data using a network trained only on labeled source data. Teacher-student-based knowledge distillation is adopted to transfer quality features from the source domain. PL is utilized for domain stabilization through low-density separation between classes and an entropy regularization effect. gkMMD is employed to reduce the distribution difference between domain-independent features. Combining these techniques, the proposed domain adaptation method improves performance by accounting for the distribution of robust domain-independent features.
To validate the proposed method, we used the UK-DALE and REDD datasets. For data preprocessing, training, validation, and test sets were constructed by experimentally setting thresholds for distinguishing ON events for each appliance. An ablation study over five methods was performed for the performance test. Within the same dataset, domain adaptation with the proposed method improved the F1 score over the baseline by 6.04% on average. Across different datasets, the proposed method improved the F1 score over the baseline by 5.74% on average. While performance did not improve as much for problems with much larger changes in domain characteristics, maintaining the existing performance is itself a meaningful achievement.