# Semi-Supervised Domain Adaptation for Multi-Label Classification on Nonintrusive Load Monitoring

## Abstract


## 1. Introduction

- We conduct the first classification study in the domain adaptation field for NILM;
- We show performance improvements by incorporating robust feature information distillation techniques based on the teacher–student structure into domain adaptation;
- The decision boundaries are refined through PL-based domain stabilization.

## 2. Related Work

#### 2.1. Nonintrusive Load Monitoring

#### 2.2. Domain Adaptation

## 3. Semi-Supervised Domain Adaptation for Multi-Label Classification on Nonintrusive Load Monitoring

#### 3.1. Network Architecture

- (1) Knowledge distillation: knowledge is distilled through a TCN-based teacher–student network so that robust, domain-independent features of the source data are transferred to the student. The TCN extends CNNs to time-series modeling and often outperforms typical recurrent models such as LSTM because its dilated convolutions provide a much longer effective memory without a gate mechanism. The TCN consists of several residual blocks, each built on a dilated causal convolution operation $O$. For input $x\in\mathbb{R}^{n}$ and filter $ft:\{0,1,\dots,k-1\}\to\mathbb{R}$, $O$ at point $s$ is defined by Equation (3).
$$O(s)=\left(x\ast_{d}ft\right)(s)=\sum_{i=0}^{k-1}ft(i)\cdot x_{s-d\cdot i}\tag{3}$$
Each residual block adds its transformation $T$ to its input:
$$R(s)=T(x,\theta)+x\tag{4}$$
Stacking $n$ residual blocks therefore gives
$$x_{I+n}=x_{I}+\sum_{i=I}^{I+n-1}T(x_{i},\theta_{i})\tag{5}$$
The teacher feature extractor with $k$ blocks and the student feature extractor with $l$ blocks are
$$FE_{te}\left(x_{s},\theta_{f\_te}\right)=x_{s}+\sum_{i=0}^{k-1}T\left(x_{s\_i},\theta_{f\_te\_i}\right)\tag{6}$$
$$FE_{st}\left(x_{s},\theta_{f\_st}\right)=x_{s}+\sum_{i=0}^{l-1}T\left(x_{s\_i},\theta_{f\_st\_i}\right)\tag{7}$$
The teacher's softened prediction and the student's soft and hard predictions are obtained with a temperature softmax:
$$\widehat{y}_{te}=Softmax_{withT}\left(AUD_{te}\left(f_{te},\theta_{te}\right)_{i}\right)=\frac{e^{AUD_{te}\left(f_{te},\theta_{te}\right)_{i}/T}}{\sum_{j=1}^{K}e^{AUD_{te}\left(f_{te},\theta_{te}\right)_{j}/T}}\tag{8}$$
$$\widehat{y}_{st\_sp}=Softmax_{withT}\left(AUD_{st}\left(f_{st\_s},\theta_{st}\right)_{i}\right)=\frac{e^{AUD_{st}\left(f_{st\_s},\theta_{st}\right)_{i}/T}}{\sum_{j=1}^{K}e^{AUD_{st}\left(f_{st\_s},\theta_{st}\right)_{j}/T}}\tag{9}$$
$$\widehat{y}_{st\_hp}=Softmax\left(AUD_{st}\left(f_{st\_s},\theta_{st}\right)_{i}\right)=\frac{e^{AUD_{st}\left(f_{st\_s},\theta_{st}\right)_{i}}}{\sum_{j=1}^{K}e^{AUD_{st}\left(f_{st\_s},\theta_{st}\right)_{j}}}\tag{10}$$
- (2) Robust domain adaptation: domain adaptation is performed on the robust features obtained through knowledge distillation so that domain-independent features are learned. Domain adaptation consists of three stages: feature extraction, domain classification, and appliance usage detection. First, the feature extractor $FE_{st}$ of the SN is used; $FE_{st}\left(x_{s},\theta_{f\_st}\right)$ for the source data and $FE_{st}\left(x_{t},\theta_{f\_st}\right)$ for the target data share one parameter set. A model trained only on source data represents target data poorly. To adapt $FE_{st}$ to the target domain, the model measures the feature distribution difference between the two domains with MMD and minimizes it. The MMD distance is obtained as follows:
$$MMD\left(X_{s},X_{t}\right)=\left\Vert \frac{1}{n_{s}}\sum_{i=1}^{n_{s}}\phi\left(x_{s}^{i}\right)-\frac{1}{n_{t}}\sum_{j=1}^{n_{t}}\phi\left(x_{t}^{j}\right)\right\Vert_{\mathcal{H}}\tag{11}$$
The domain classifier predicts the domain of the source and target features,
$$\widehat{dc}_{s}=Softmax\left(DC\left(f_{st\_s},\theta_{dc}\right)\right)\tag{12}$$
$$\widehat{dc}_{t}=Softmax\left(DC\left(f_{st\_t},\theta_{dc}\right)\right)\tag{13}$$
and appliance usage detection predicts the target labels:
$$\widehat{y}_{t}=Softmax\left(AUD_{st}\left(f_{st\_t},\theta_{st}\right)\right)\tag{14}$$
- (3) Domain stabilization: the target domain data is pseudo-labeled with $AUD_{st}$ to augment the training data, which stabilizes the domain and improves network performance. First, the feature $f_{st\_t}$ of the target data $x_{t}$ is fed to $AUD_{st}$. Once $Softmax\left(AUD_{st}\left(f_{st\_t},\theta_{st}\right)\right)$ is obtained through Equation (14), the class with the highest probability is taken as the pseudo-label (PL). However, if that probability is lower than a threshold, the sample is not pseudo-labeled; the threshold is set experimentally. Like the previous phase, domain stabilization consists of three steps: feature extraction, domain classification, and appliance usage detection. Appliance usage detection now uses three kinds of data: source data $(X_{s}, Y_{s})$, pseudo-labeled target data $(X_{t}, Y_{tl})$, and unlabeled target data $X_{t}$. Feature extraction outputs $f_{st\_s}$, $f_{st\_tl}$, and $f_{st\_t}$ through $FE_{st}$. The domain classifier is unchanged and classifies $f_{st\_s}$, $f_{st\_tl}$, and $f_{st\_t}$ as inputs, as in Equations (12) and (13). Appliance usage detection performs $AUD_{st}\left(f_{st\_s},f_{st\_tl};\theta_{st}\right)$.
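
The thresholded pseudo-labeling step described above can be sketched as follows. This is a minimal NumPy illustration; the function names and the 0.9 threshold are ours, and the single-label form is shown for clarity (in the paper's multi-label setting the same confidence gating applies per class).

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def pseudo_label(logits, threshold=0.9):
    """Assign the argmax class as a pseudo-label to each sample whose
    top softmax probability exceeds the threshold; mark the rest -1
    (left unlabeled, as in the domain stabilization phase)."""
    probs = softmax(logits)
    conf = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    labels[conf < threshold] = -1  # below threshold: not pseudo-labeled
    return labels

logits = np.array([[4.0, 0.1, 0.2],   # confident prediction
                   [1.0, 1.1, 0.9]])  # ambiguous prediction
print(pseudo_label(logits))
```

Only the confidently predicted samples enter the pseudo-labeled set $(X_{t}, Y_{tl})$; the ambiguous ones remain unlabeled target data.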

#### 3.2. Network Losses

- (1) Knowledge distillation loss: as shown in Figure 1, the knowledge distillation phase loss is the sum of the distillation loss ${\mathcal{L}}_{ds}$ and the student loss ${\mathcal{L}}_{st}$. ${\mathcal{L}}_{ds}$ incorporates the difference between the classification results of the TN and the SN and is defined as follows:
$$\mathcal{L}_{ds}=2\alpha T^{2}\mathcal{L}_{ce}\left(Softmax_{withT}\left(AUD_{te}\left(f_{te},\theta_{te}\right)_{i}\right),\;Softmax_{withT}\left(AUD_{st}\left(f_{st\_s},\theta_{st}\right)_{i}\right)\right)=2\alpha T^{2}\mathcal{L}_{ce}\left(\widehat{y}_{te},\widehat{y}_{st\_sp}\right)\tag{15}$$
The student loss compares the student's hard prediction with the source labels:
$$\mathcal{L}_{st}=\left(1-\alpha\right)\mathcal{L}_{ce}\left(Softmax\left(AUD_{st}\left(f_{st\_s},\theta_{st}\right)_{i}\right),y_{s}\right)=\left(1-\alpha\right)\mathcal{L}_{ce}\left(\widehat{y}_{st\_hp},y_{s}\right)\tag{16}$$
Even though the SN has far fewer parameters than the TN, ${\mathcal{L}}_{st}$ also decreases as ${\mathcal{L}}_{ds}$ becomes smaller, so the student achieves good feature extraction and classification performance.
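
The combined objective of Equations (15) and (16) can be sketched in NumPy as follows; the $\alpha=0.5$ and $T=4$ defaults are illustrative, not values reported by the paper.

```python
import numpy as np

def softmax_with_T(logits, T=1.0):
    # Temperature-scaled softmax, as in Eqs. (8)-(10).
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(target, pred, eps=1e-12):
    return -np.mean(np.sum(target * np.log(pred + eps), axis=-1))

def kd_loss(teacher_logits, student_logits, y_onehot, alpha=0.5, T=4.0):
    # Distillation term, Eq. (15): softened teacher vs. softened student.
    l_ds = 2 * alpha * T**2 * cross_entropy(
        softmax_with_T(teacher_logits, T), softmax_with_T(student_logits, T))
    # Student term, Eq. (16): hard source labels vs. ordinary softmax.
    l_st = (1 - alpha) * cross_entropy(y_onehot, softmax_with_T(student_logits))
    return l_ds + l_st
```

When the student's logits agree with the teacher's, the distillation term shrinks toward the teacher distribution's entropy, which is the behavior the phase relies on.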

- (2) Feature distribution difference loss: as shown in Figure 1, the feature distribution difference loss is the MMD loss [44] ${\mathcal{L}}_{f}$. ${\mathcal{L}}_{f}$ estimates the difference between the feature distribution of the source domain data $X_{s}$ and that of the target domain data $X_{t}$ through MMD and is generally defined as follows:
$$\begin{array}{l}\mathcal{L}_{f}\left(f_{st\_s},f_{st\_t}\right)=MMD^{2}\left(f_{st\_s},f_{st\_t}\right)=\left\Vert \mathbb{E}_{X_{s}\sim f_{st\_s}}\phi\left(X_{s}\right)-\mathbb{E}_{X_{t}\sim f_{st\_t}}\phi\left(X_{t}\right)\right\Vert_{\mathcal{H}}^{2}\\ =\left\langle \mathbb{E}_{X_{s}\sim f_{st\_s}}\phi\left(X_{s}\right),\mathbb{E}_{X^{\prime}_{s}\sim f_{st\_s}}\phi\left(X^{\prime}_{s}\right)\right\rangle_{\mathcal{H}}+\left\langle \mathbb{E}_{X_{t}\sim f_{st\_t}}\phi\left(X_{t}\right),\mathbb{E}_{X^{\prime}_{t}\sim f_{st\_t}}\phi\left(X^{\prime}_{t}\right)\right\rangle_{\mathcal{H}}\\ \quad-2\left\langle \mathbb{E}_{X_{s}\sim f_{st\_s}}\phi\left(X_{s}\right),\mathbb{E}_{X_{t}\sim f_{st\_t}}\phi\left(X_{t}\right)\right\rangle_{\mathcal{H}}\end{array}\tag{17}$$
For the mapping function $\phi$ of Equation (17), we use the kernel trick because computing all the moments directly would require too many computational resources. We use the Gaussian kernel shown in Equation (18),
$$gk\left(x,y\right)=\exp\left(-\frac{\Vert x-y\Vert^{2}}{2\sigma^{2}}\right)\tag{18}$$
because its exponential expansion in Equation (19) contains all the moments of $x$:
$$e^{x}=1+x+\frac{1}{2!}x^{2}+\frac{1}{3!}x^{3}+\cdots\tag{19}$$
The kernel $gk\left(x,y\right)$ corresponds to the inner product in Equation (20).
$$gk\left(x,y\right)=\left\langle \phi\left(x\right),\phi\left(y\right)\right\rangle_{\mathcal{H}}\tag{20}$$
When Equation (17) is rearranged using Equation (20), ${\mathcal{L}}_{f}$ is re-formulated as Equation (21).
$$\mathcal{L}_{f}\left(f_{st\_s},f_{st\_t}\right)=\mathbb{E}_{X_{s},X^{\prime}_{s}\sim f_{st\_s}}\,gk\left(X_{s},X^{\prime}_{s}\right)+\mathbb{E}_{X_{t},X^{\prime}_{t}\sim f_{st\_t}}\,gk\left(X_{t},X^{\prime}_{t}\right)-2\,\mathbb{E}_{X_{s}\sim f_{st\_s},X_{t}\sim f_{st\_t}}\,gk\left(X_{s},X_{t}\right)\tag{21}$$
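
Equation (21) can be sketched as a biased empirical estimator with the Gaussian kernel of Equation (18); this is a minimal NumPy illustration, with function names and the bandwidth $\sigma=1$ chosen by us.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # gk(x, y) = exp(-||x - y||^2 / (2 sigma^2)), Eq. (18),
    # evaluated for all pairs of rows of x and y.
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma**2))

def mmd2(xs, xt, sigma=1.0):
    """Biased empirical estimate of squared MMD, Eq. (21):
    mean within-source kernel + mean within-target kernel
    - 2 * mean cross-domain kernel."""
    k_ss = gaussian_kernel(xs, xs, sigma).mean()
    k_tt = gaussian_kernel(xt, xt, sigma).mean()
    k_st = gaussian_kernel(xs, xt, sigma).mean()
    return k_ss + k_tt - 2 * k_st
```

Identical feature batches give an estimate of zero, while a shifted target batch yields a clearly positive value, which is the signal the feature extractor is trained to minimize.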

- (3) Domain classification loss: as shown in Figure 1, the domain classification loss ${\mathcal{L}}_{dc}$ relates $FE_{st}$ and $DC$. $DC\left(f,\theta_{dc}\right)$ is trained so that the source domain and the target domain cannot be distinguished. To minimize the distribution difference between $f_{st\_s}$ and $f_{st\_t}$, the loss of $DC\left(f,\theta_{dc}\right)$ should be maximized. Using ${\widehat{dc}}_{s}$ and ${\widehat{dc}}_{t}$ from $DC\left(f,\theta_{dc}\right)$, the binary cross-entropy-based ${\mathcal{L}}_{dc}$ is obtained as Equation (22).
$$\mathcal{L}_{dc}\left(x_{s},x_{t};\theta_{f\_st},\theta_{dc}\right)=-\sum_{i=1}^{sn}\left[\log\left(1-\widehat{dc}_{s}^{\,i}\right)+\log\left(\widehat{dc}_{t}^{\,i}\right)\right]\tag{22}$$
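
A minimal sketch of the Equation (22) objective; the function and argument names are ours for illustration.

```python
import numpy as np

def domain_classification_loss(dc_s, dc_t, eps=1e-12):
    """Eq. (22)-style binary cross-entropy over a minibatch.

    dc_s / dc_t: the classifier's probabilities that source / target
    features belong to the target domain. The domain classifier
    minimizes this loss; the feature extractor is driven to maximize
    it, which makes the two domains indistinguishable."""
    return -np.sum(np.log(1.0 - dc_s + eps) + np.log(dc_t + eps))

# A confident classifier yields a small loss; a confused one a large loss.
confident = domain_classification_loss(np.array([0.1]), np.array([0.9]))
confused = domain_classification_loss(np.array([0.5]), np.array([0.5]))
```

The adversarial pressure on $FE_{st}$ therefore pushes the classifier from the `confident` regime toward the `confused` one.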

- (4) Appliance usage detection loss: as shown in Figure 1, appliance usage detection uses ${\mathcal{L}}_{st}$ in the domain adaptation phase and ${\mathcal{L}}_{aud}$ in the robust domain adaptation phase. Since both losses are applied to the same $AUD_{st}$, the same loss form is used, as in Equations (23) and (24).
$$\mathcal{L}_{st}=\mathcal{L}_{ce}\left(Softmax\left(AUD_{st}\left(f_{st\_s},\theta_{st}\right)_{i}\right),y_{s}\right)\tag{23}$$
$$\mathcal{L}_{aud}=\mathcal{L}_{ce}\left(Softmax\left(AUD_{st}\left(f_{st\_s},\theta_{st}\right)_{i}\right),y_{s}\right)+\mathcal{L}_{ce}\left(Softmax\left(AUD_{st}\left(f_{st\_tl},\theta_{st}\right)_{i}\right),y_{tl}\right)\tag{24}$$
Each neural network is trained by differentiating the loss with respect to its own weights, as shown by the dotted lines in Figure 1.

#### 3.3. Training Strategy

**Algorithm 1.** Parameter optimization procedure of the proposed method.

    Input:  source domain data (x_s, y_s) and target domain data (x_t), each with M samples
    Output: optimized parameters (θ*_f_st, θ*_st, θ*_dc)

    # Knowledge distillation phase
    for m = 0 to epochs do
      for each minibatch do
        # Forward propagation
        Teacher:  f_te ← FE_te(x_s, θ_f_te);  ŷ_te ← AUD_te(f_te, θ_te)
        Student:  f_st_s ← FE_st(x_s, θ_f_st);  ŷ_st_sp ← AUD_st(f_st_s, θ_st);  ŷ_st_hp ← AUD_st(f_st_s, θ_st)
        L_ds ← 2αT² · L_ce(ŷ_te, ŷ_st_sp);   L_st ← (1 − α) · L_ce(ŷ_st_hp, y_s)
        L ← L_ds + L_st
        # Back propagation
        θ_f_st, θ_st ← Adam(∇_θ L; θ_f_st, θ_st)
      end for
    end for

    # Domain adaptation phase
    for m = 0 to epochs do
      for each minibatch do
        # Forward propagation
        Source:  f_st_s ← FE_st(x_s, θ_f_st);  dĉ_s ← DC(f_st_s, θ_dc);  ŷ_st_hp ← AUD_st(f_st_s, θ_st)
        Target:  f_st_t ← FE_st(x_t, θ_f_st);  dĉ_t ← DC(f_st_t, θ_dc)
        L_f  ← E[gk(X_s, X′_s)] + E[gk(X_t, X′_t)] − 2·E[gk(X_s, X_t)]
        L_dc ← −Σ_i [log(1 − dĉ_s^i) + log(dĉ_t^i)]
        L_st ← L_ce(Softmax(AUD_st(f_st_s, θ_st)), y_s)
        L ← L_f + L_dc + L_st
        # Back propagation
        θ_f_st, θ_st, θ_dc ← Adam(∇_θ L; θ_f_st, θ_st, θ_dc)
      end for
    end for

    # Robust domain adaptation phase
    # Pseudo labeling
    f_st_t ← FE_st(x_t, θ_f_st);  y_tl ← AUD_st(f_st_t, θ_st)
    for m = 0 to epochs do
      for each minibatch do
        # Forward propagation
        Source:  f_st_s ← FE_st(x_s, θ_f_st);  dĉ_s ← DC(f_st_s, θ_dc);  ŷ_st_hp ← AUD_st(f_st_s, θ_st)
        Target:  f_st_t ← FE_st(x_t, θ_f_st);  dĉ_t ← DC(f_st_t, θ_dc)
        Pseudo target:  f_st_tl ← FE_st(x_t, θ_f_st);  ŷ_st_tl ← AUD_st(f_st_tl, θ_st)
        L_f  ← E[gk(X_s, X′_s)] + E[gk(X_t, X′_t)] − 2·E[gk(X_s, X_t)]
        L_dc ← −Σ_i [log(1 − dĉ_s^i) + log(dĉ_t^i)]
        L_aud ← L_ce(Softmax(AUD_st(f_st_s, θ_st)), y_s) + L_ce(Softmax(AUD_st(f_st_tl, θ_st)), y_tl)
        L ← L_f + L_dc + L_aud
        # Back propagation
        θ_f_st, θ_st, θ_dc ← Adam(∇_θ L; θ_f_st, θ_st, θ_dc)
      end for
    end for
    return (θ*_f_st, θ*_st, θ*_dc)
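
The per-minibatch objective in each phase of Algorithm 1 is a plain sum of the corresponding loss terms. The sketch below makes that combination explicit; the phase names and dictionary layout are ours, and the loss terms are assumed to be precomputed scalars.

```python
def total_loss(phase, losses):
    """Combine per-minibatch loss terms as in Algorithm 1.

    losses: dict of precomputed scalar loss terms, keyed
    'ds', 'st', 'f', 'dc', 'aud' (illustrative names)."""
    if phase == "knowledge_distillation":
        return losses["ds"] + losses["st"]                  # L = L_ds + L_st
    if phase == "domain_adaptation":
        return losses["f"] + losses["dc"] + losses["st"]    # L = L_f + L_dc + L_st
    if phase == "robust_domain_adaptation":
        return losses["f"] + losses["dc"] + losses["aud"]   # L = L_f + L_dc + L_aud
    raise ValueError(f"unknown phase: {phase}")

terms = {"ds": 0.4, "st": 0.2, "f": 0.1, "dc": 0.7, "aud": 0.3}
print(total_loss("robust_domain_adaptation", terms))
```

Note that only the parameters listed in each phase's back-propagation step receive gradients from the combined loss.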

## 4. Experiments

#### 4.1. Data Preparation

#### 4.1.1. Dataset

#### 4.1.2. Data Preprocessing

#### 4.2. Experimental Setup

#### 4.2.1. Implementation Configuration

#### 4.2.2. Ablation Study Methods

- Baseline: Typical domain adaptation method with BiLSTM-based feature extractors;
- TCN-DA: Domain adaptation method with TCN-based feature extractor;
- gkMMD-DA: Domain adaptation method with Gaussian kernel trick-based MMD Loss in baseline;
- TS-DA: Domain adaptation method that extracts features through robust knowledge distillation with the teacher–student structure. The SN feature extractor uses BiLSTM, as in the baseline, and the TN feature extractor uses a BiLSTM four times the size of the student's;
- PL-DA: Domain adaptation method that adds pseudo-labeling-based domain stabilization to the baseline.

#### 4.2.3. Evaluation Metrics

#### 4.3. Case Studies and Discussions

#### 4.3.1. Domain Adaptation within the Same Dataset

#### 4.3.2. Domain Adaptation between Different Datasets

#### 4.3.3. Discussions

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

- Gherheș, V.; Fărcașiu, M.A. Sustainable Behavior among Romanian Students: A Perspective on Electricity Consumption in Households. Sustainability
**2021**, 13, 9357. [Google Scholar] [CrossRef] - Somchai, B.; Boonyang, P. Non-intrusive appliances load monitoring (nilm) for energy conservation in household with low sampling rate. Procedia Comput. Sci.
**2016**, 86, 172–175. [Google Scholar] - Nur Farahin, E.; Md Pauzi, A.; Yusri, H.M. RETRACTED: A review disaggregation method in Non-intrusive Appliance Load Monitoring. Renew. Sustain. Energy Rev.
**2016**, 66, 163–173. [Google Scholar] - Shikha, S.; Angshul, M. Deep sparse coding for non–intrusive load monitoring. IEEE Trans. Smart Grid
**2017**, 9, 4669–4678. [Google Scholar] - Cominola, A.; Giuliani, M.; Piga, D.; Castelletti, A.; Rizzoli, A.E. A hybrid signature-based iterative disaggregation algorithm for non-intrusive load monitoring. Appl. Energy
**2017**, 185, 331–344. [Google Scholar] [CrossRef] - Shi, X.; Ming, H.; Shakkottai, S.; Xie, L.; Yao, J. Nonintrusive load monitoring in residential households with low-resolution data. Appl. Energy
**2019**, 252, 113283. [Google Scholar] [CrossRef] - Georgia, E.; Lina, S.; Vladimir, S. Power Disaggregation of Domestic Smart Meter Readings Using Dynamic Time warping. In Proceedings of the 2014 6th International Symposium on Communications, Control and Signal Processing (ISCCSP), Athens, Greece, 21–23 May 2014; IEEE: Manhattan, NY, USA, 2014; pp. 36–39. [Google Scholar]
- Yu-Hsiu, L.; Men-Shen, T. Non-intrusive load monitoring by novel neuro-fuzzy classification considering uncertainties. IEEE Trans. Smart Grid
**2014**, 5, 2376–2384. [Google Scholar] - Kanghang, H.; He, K.; Stankovic, L.; Liao, J.; Stankovic, V. Non-intrusive load disaggregation using graph signal processing. IEEE Trans. Smart Grid
**2016**, 9, 1739–1747. [Google Scholar] - Hart, G.W. Nonintrusive appliance load monitoring. Proc. IEEE
**1992**, 80, 1870–1891. [Google Scholar] [CrossRef] - Yang, Y.; Zhong, J.; Li, W.; Gulliver, T.A.; Li, S. Semisupervised multilabel deep learning based nonintrusive load monitoring in smart grids. IEEE Trans. Ind. Inform.
**2019**, 16, 6892–6902. [Google Scholar] [CrossRef] - Sagar, V.; Shikha, S.; Angshul, M. Multi-label LSTM autoencoder for non-intrusive appliance load monitoring. Electr. Power Syst. Res.
**2021**, 199, 107414. [Google Scholar] - Hyeontaek, H.; Sanggil, K. Nonintrusive Load Monitoring using a LSTM with Feedback Structure. IEEE Trans. Instrum. Meas.
**2022**, 71, 1–11. [Google Scholar] - Da Silva Nolasco, L.; Lazzaretti, A.E.; Mulinari, B.M. DeepDFML-NILM: A New CNN-Based Architecture for Detection, Feature Extraction and Multi-Label Classification in NILM Signals. IEEE Sens. J.
**2021**, 22, 501–509. [Google Scholar] [CrossRef] - Christoforos, N.; Dimitris, V. On time series representations for multi-label NILM. Neural Comput. Appl.
**2020**, 32, 17275–17290. [Google Scholar] - Patrick, H.; Calatroni, A.; Rumsch, A.; Paice, A. Review on deep neural networks applied to low-frequency nilm. Energies
**2021**, 14, 2390. [Google Scholar] - Kong, W.; Dong, Z.Y.; Hill, D.J.; Luo, F.; Xu, Y. Improving nonintrusive load monitoring efficiency via a hybrid programing method. IEEE Trans. Ind. Inform.
**2016**, 12, 2148–2157. [Google Scholar] [CrossRef] - Basu, K.; Debusschere, V.; Douzal-Chouakria, A.; Bacha, S. Time series distance-based methods for non-intrusive load monitoring in residential buildings. Energy Build.
**2015**, 96, 109–117. [Google Scholar] [CrossRef] - Yaroslav, G.; Lempitsky, V. Unsupervised Domain Adaptation by Backpropagation. In Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 1180–1189. [Google Scholar]
- Long, M.; Zhu, H.; Wang, J.; Jordan, M.I. Unsupervised domain adaptation with residual transfer networks. Adv. Neural Inf. Processing Syst.
**2016**, 29, 136–144. [Google Scholar] - Liu, Y.; Zhong, L.; Qiu, J.; Lu, J.; Wang, W. Unsupervised domain adaptation for nonintrusive load monitoring via adversarial and joint adaptation network. IEEE Trans. Ind. Inform.
**2021**, 18, 266–277. [Google Scholar] [CrossRef] - Lin, J.; Ma, J.; Zhu, J.; Liang, H. Deep Domain Adaptation for Non-Intrusive Load Monitoring Based on a Knowledge Transfer Learning Network. IEEE Trans. Smart Grid
**2021**, 13, 280–292. [Google Scholar] [CrossRef] - Suzuki, K.; Inagaki, S.; Suzuki, T.; Nakamura, H.; Ito, K. Nonintrusive Appliance Load Monitoring Based on Integer Programming. In Proceedings of the 2008 SICE Annual Conference, Tokyo, Japan, 20–22 August 2008; IEEE: Manhattan, NY, USA, 2008; pp. 2742–2747. [Google Scholar]
- Michael, B.; Jürgen, V. Nonintrusive appliance load monitoring based on an optical sensor. In Proceedings of the 2003 IEEE Bologna Power Tech Conference Proceedings, Bologna, Italy, 23–26 June 2003; IEEE: Manhattan, NY, USA, 2003; Volume 4, p. 8. [Google Scholar]
- Arend, B.J.; Xiaohua, X.; Jiangfeng, Z. Active Power Residential Non-Intrusive Appliance Load Monitoring System. In Proceedings of the AFRICON 2009, Nairobi, Kenya, 23–25 September 2009; IEEE: Manhattan, NY, USA, 2009; pp. 1–6. [Google Scholar]
- Pan, S.J.; Tsang, I.W.; Kwok, J.T.; Yang, Q. Domain adaptation via transfer component analysis. IEEE Trans. Neural Netw.
**2010**, 22, 199–210. [Google Scholar] [CrossRef] [PubMed][Green Version] - Mei, W.; Weihong, D. Deep visual domain adaptation: A survey. Neurocomputing
**2018**, 312, 135–153. [Google Scholar] - Isobe, T.; Jia, X.; Chen, S.; He, J.; Shi, Y.; Liu, J.; Lu, H.; Wang, S. Multi-Target Domain Adaptation with Collaborative Consistency Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8187–8196. [Google Scholar]
- Yuang, L.; Wei, Z.; Jun, W. Source-Free Domain Adaptation for Semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1215–1224. [Google Scholar]
- Guoqiang, W.; Lan, C.; Zeng, W.; Chen, Z. Metaalign: Coordinating Domain Alignment and Classification for Unsupervised Domain Adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16643–16653. [Google Scholar]
- Zechen, B.; Wang, Z.; Wang, J.; Hu, D.; Ding, E. Unsupervised Multi-Source Domain Adaptation for Person Re-Identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12914–12923. [Google Scholar]
- Jingjing, L.; Jing, M.; Su, H.; Lu, K.; Zhu, L.; Shen, H.T. Faster domain adaptation networks. IEEE Trans. Knowl. Data Eng.
**2021**, 1. [Google Scholar] [CrossRef] - Dongdong, W.; Han, T.; Chu, F.; Zuo, M.J. Weighted domain adaptation networks for machinery fault diagnosis. Mech. Syst. Signal Processing
**2021**, 158, 107744. [Google Scholar] - Tzeng, E.; Hoffman, J.; Zhang, N.; Saenko, K.; Darrell, T. Deep domain confusion: Maximizing for domain invariance. arXiv
**2014**, arXiv:1412.3474. [Google Scholar] - Hao, W.; Wang, W.; Zhang, C.; Xu, F. Cross-Domain Metric Learning Based on Information Theory. In Proceedings of the AAAI Conference on Artificial Intelligence, Québec City, QC, Canada, 27–31 July 2014. [Google Scholar]
- Juntao, H.; Hongsheng, Q. Unsupervised Domain Adaptation with Multi-kernel MMD. In Proceedings of the 2021 40th Chinese Control Conference (CCC), Shanghai, China, 26–28 July 2021; IEEE: Manhattan, NY, USA, 2021; pp. 8576–8581. [Google Scholar]
- Zhang, W.; Zhang, X.; Lan, L.; Luo, Z. Maximum mean and covariance discrepancy for unsupervised domain adaptation. Neural Processing Lett.
**2020**, 51, 347–366. [Google Scholar] [CrossRef] - Wen, Z.; Wu, W. Discriminative Joint Probability Maximum Mean Discrepancy (DJP-MMD) for Domain Adaptation. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; IEEE: Manhattan, NY, USA, 2020; pp. 1–8. [Google Scholar]
- Mingsheng, L.; Zhu, H.; Wang, J.; Jordan, M.I. Deep Transfer Learning with Joint Adaptation Networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 2208–2217. [Google Scholar]
- Wan, N.; Zhang, C.; Chen, Q.; Li, H.; Liu, X.; Wei, X. MDDA: A Multi-scene Recognition Model with Multi-dimensional Domain Adaptation. In Proceedings of the 2021 IEEE 4th International Conference on Electronics Technology (ICET), Chengdu, China, 7–10 May 2021; IEEE: Manhattan, NY, USA, 2021; pp. 1245–1250. [Google Scholar]
- Wang, L.; Mao, S.; Wilamowski, B.M.; Nelms, R.M. Pre-trained models for non-intrusive appliance load monitoring. IEEE Trans. Green Commun. Netw.
**2021**, 6, 56–68. [Google Scholar] [CrossRef] - Shaojie, B.; Zico, K.J.; Koltun, V.K. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv
**2018**, arXiv:1803.01271. [Google Scholar] - Geoffrey, H.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv
**2015**, arXiv:1503.02531. [Google Scholar] - Xin, Y.; Chaofeng, H.; Lifeng, S. Two-Stream Federated Learning: Reduce the Communication Costs. In Proceedings of the 2018 IEEE Visual Communications and Image Processing (VCIP), Taichung, Taiwan, 9–12 December 2018; IEEE: Manhattan, NY, USA, 2018; pp. 1–4. [Google Scholar]
- Kelly, J.K.; Knottenbelt, W. The UK-DALE dataset, domestic appliance-level electricity demand and whole-house demand from five UK homes. Sci. Data
**2015**, 2, 150007. [Google Scholar] [CrossRef][Green Version] - Zico, K.J.; Johnson, M.J. Redd: A public data set for energy disaggregation research. In Proceedings of the Workshop on Data Mining Applications in Sustainability (SIGKDD), San Diego, CA, USA, 21 August 2011; pp. 59–62. [Google Scholar]
- Linge, S.; Langtangen, H.P. Programming for Computations-Python: A Gentle Introduction to Numerical Simulations with Python 3.6; Springer Nature: Cham, Switzerland, 2020. [Google Scholar]
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Processing Syst.
**2019**, 32, 8024–8035. [Google Scholar]

**Figure 1.** A detailed overall configuration diagram of the proposed semi-supervised domain adaptation for multi-label classification on nonintrusive load monitoring.

| Appliance | UK-DALE House 1 Threshold | UK-DALE House 1 ON Events | UK-DALE House 2 Threshold | UK-DALE House 2 ON Events | REDD House 1 Threshold | REDD House 1 ON Events | REDD House 3 Threshold | REDD House 3 ON Events |
|---|---|---|---|---|---|---|---|---|
| DW | 2000 | 4431 | 1800 | 3236 | 1000 | 6712 | 650 | 2934 |
| FG | 250 | 2441 | 400 | 5291 | 400 | 2944 | 350 | 3344 |
| KT | 2200 | 4495 | 2000 | 1694 | - | - | - | - |
| MV | 1400 | 1242 | 1200 | 4218 | 1200 | 4809 | 1600 | 1327 |
| WM | 1800 | 4980 | 1500 | 1524 | 2500 | 4796 | 2200 | 5764 |

| Parameter Description | Value |
|---|---|
| Number of TCN blocks | 8 (TN), 5 (SN) |
| Number of filters in each TCN block | 128 (TN), 64 (SN) |
| Filter size | 3 |
| Number of fully connected layers | 5 (TN), 3 (SN), 2 (Domain Classifier) |
| Dilation factor | ${2}^{i}$ for block $i$ |
| Activation function | ReLU |
| Dropout probability | 0.1 |
| Number of maximum epochs | 200 |
| Number of minimum early stopping epochs | 4 |
| Mini-batch size | 512 |
| Learning rate | 3 × 10^{−3} |

| Appliance | Method | UK-DALE $({\mathit{U}}^{1}\to {\mathit{U}}^{2})$ | UK-DALE $({\mathit{U}}^{2}\to {\mathit{U}}^{1})$ | REDD $({\mathit{R}}^{1}\to {\mathit{R}}^{3})$ | REDD $({\mathit{R}}^{3}\to {\mathit{R}}^{1})$ |
|---|---|---|---|---|---|
| DW | Baseline | 0.781 | 0.805 | $-$ | $-$ |
| | TCN-DA | 0.832 | 0.827 | $-$ | $-$ |
| | gkMMD-DA | 0.778 | 0.793 | $-$ | $-$ |
| | TS-DA | 0.812 | 0.826 | $-$ | $-$ |
| | PL-DA | 0.787 | 0.811 | $-$ | $-$ |
| | Ours | 0.822 | 0.832 | $-$ | $-$ |
| | Improvement | 5.25% | 3.35% | $-$ | $-$ |
| FG | Baseline | 0.833 | 0.834 | 0.817 | 0.818 |
| | TCN-DA | 0.842 | 0.841 | 0.829 | 0.840 |
| | gkMMD-DA | 0.837 | 0.836 | 0.819 | 0.819 |
| | TS-DA | 0.850 | 0.853 | 0.824 | 0.827 |
| | PL-DA | 0.834 | 0.845 | 0.818 | 0.819 |
| | Ours | 0.875 | 0.872 | 0.843 | 0.852 |
| | Improvement | 5.04% | 4.56% | 3.18% | 4.16% |
| KT | Baseline | 0.761 | 0.832 | $-$ | $-$ |
| | TCN-DA | 0.811 | 0.839 | $-$ | $-$ |
| | gkMMD-DA | 0.753 | 0.820 | $-$ | $-$ |
| | TS-DA | 0.807 | 0.835 | $-$ | $-$ |
| | PL-DA | 0.770 | 0.833 | $-$ | $-$ |
| | Ours | 0.817 | 0.868 | $-$ | $-$ |
| | Improvement | 7.36% | 4.33% | $-$ | $-$ |
| MV | Baseline | 0.742 | 0.791 | 0.793 | 0.790 |
| | TCN-DA | 0.751 | 0.798 | 0.806 | 0.721 |
| | gkMMD-DA | 0.746 | 0.795 | 0.797 | 0.774 |
| | TS-DA | 0.753 | 0.803 | 0.804 | 0.798 |
| | PL-DA | 0.744 | 0.796 | 0.794 | 0.793 |
| | Ours | 0.774 | 0.812 | 0.814 | 0.818 |
| | Improvement | 4.31% | 2.65% | 2.65% | 3.54% |
| WM | Baseline | 0.615 | 0.611 | 0.841 | 0.782 |
| | TCN-DA | 0.725 | 0.708 | 0.844 | 0.799 |
| | gkMMD-DA | 0.623 | 0.625 | 0.842 | 0.786 |
| | TS-DA | 0.668 | 0.653 | 0.832 | 0.783 |
| | PL-DA | 0.623 | 0.615 | 0.843 | 0.783 |
| | Ours | 0.736 | 0.713 | 0.870 | 0.832 |
| | Improvement | 19.67% | 16.69% | 3.45% | 6.39% |

| Appliance | UK-DALE $({\mathit{U}}^{1}\to {\mathit{U}}^{2})$ | UK-DALE $({\mathit{U}}^{2}\to {\mathit{U}}^{1})$ | REDD $({\mathit{R}}^{1}\to {\mathit{R}}^{3})$ | REDD $({\mathit{R}}^{3}\to {\mathit{R}}^{1})$ |
|---|---|---|---|---|
| DW | 0.823 | 0.828 | $-$ | $-$ |
| FG | 0.857 | 0.854 | 0.834 | 0.847 |
| KT | 0.813 | 0.841 | $-$ | $-$ |
| MV | 0.762 | 0.805 | 0.809 | 0.764 |
| WM | 0.730 | 0.709 | 0.852 | 0.815 |

| Appliance | Method | $\mathbf{UK\text{-}DALE}\to \mathbf{REDD}$ | $\mathbf{REDD}\to \mathbf{UK\text{-}DALE}$ |
|---|---|---|---|
| DW | Baseline | 0.741 | 0.712 |
| | TCN-DA | 0.779 | 0.737 |
| | gkMMD-DA | 0.736 | 0.713 |
| | TS-DA | 0.770 | 0.745 |
| | PL-DA | 0.747 | 0.714 |
| | Ours | 0.778 | 0.747 |
| | Improvement | 4.99% | 4.92% |
| FG | Baseline | 0.786 | 0.764 |
| | TCN-DA | 0.794 | 0.787 |
| | gkMMD-DA | 0.787 | 0.769 |
| | TS-DA | 0.800 | 0.772 |
| | PL-DA | 0.787 | 0.770 |
| | Ours | 0.821 | 0.797 |
| | Improvement | 4.45% | 4.32% |
| MV | Baseline | 0.719 | 0.739 |
| | TCN-DA | 0.726 | 0.716 |
| | gkMMD-DA | 0.719 | 0.746 |
| | TS-DA | 0.729 | 0.749 |
| | PL-DA | 0.717 | 0.743 |
| | Ours | 0.742 | 0.763 |
| | Improvement | 3.2% | 3.25% |
| WM | Baseline | 0.563 | 0.758 |
| | TCN-DA | 0.669 | 0.773 |
| | gkMMD-DA | 0.573 | 0.766 |
| | TS-DA | 0.610 | 0.758 |
| | PL-DA | 0.568 | 0.763 |
| | Ours | 0.672 | 0.769 |
| | Improvement | 19.36% | 1.45% |


© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Hur, C.-H.; Lee, H.-E.; Kim, Y.-J.; Kang, S.-G. Semi-Supervised Domain Adaptation for Multi-Label Classification on Nonintrusive Load Monitoring. *Sensors* **2022**, *22*, 5838.
https://doi.org/10.3390/s22155838
