Non-Intrusive Load Monitoring Based on Deep Pairwise-Supervised Hashing to Detect Unidentified Appliances

Abstract: Non-intrusive load monitoring (NILM) is a fast-developing technique for recognizing appliance operation in power system monitoring. At present, most NILM algorithms rely on the assumption that all fluctuations in the data stream are triggered by identified appliances. Therefore, recognizing unidentified appliances with NILM is still an open challenge. To pursue a scalable solution to energy monitoring for contemporary unidentified appliances, we propose a voltage-current (V-I) trajectory enabled deep pairwise-supervised hashing (DPSH) method for NILM. DPSH performs simultaneous feature learning and hash-code learning with deep neural networks, and it shows higher identification accuracy than a benchmark method. DPSH generates different hash codes to distinguish identified appliances. For unidentified appliances, it generates completely new codes that differ from the codes of the identified appliances. Experiments on public datasets show that our method obtains a better F1-score than the benchmark method and achieves state-of-the-art performance in the identification of unidentified appliances. The method also maintains high sustainability, since it can identify further unidentified appliances after retraining. DPSH is therefore resilient against appliance changes in the house.


Introduction
Information and Communication Technologies (ICT) and Intelligent Data Analytical Technologies (IDAT) have become the new trend in the development of various industries [1][2][3]. Following this trend, ICT and IDAT are increasingly implemented in multiple industries [4][5][6]. Load monitoring is one application of ICT and IDAT in the power system: it disaggregates the whole electricity consumption signal into the signals of individual appliances in a residential, commercial, or industrial building. Load monitoring can identify appliances and report consumers' consumption patterns to improve consumer behavior [7]. Furthermore, knowing the detailed electricity consumption patterns of customers helps energy suppliers to plan and operate power system networks efficiently.
Traditional load monitoring is intrusive: a sensor with a communication function is installed on each monitored device within the total load, and the power consumption information is then received through the network for real-time monitoring. This method requires a large number of sensors, which increases installation and maintenance costs. In contrast, non-intrusive load monitoring (NILM) installs a smart meter at the user's entrance to obtain the total current and terminal voltage. NILM applies digital signal processing to the collected data and then uses algorithms to analyze and extract the power consumption information of the various indoor appliances. The advantages of this method are low installation cost, little interference to users, and flexible application. Therefore, non-intrusive load monitoring technology has received widespread attention from scholars in recent years.
NILM was first proposed by Hart for residential load decomposition [8]. The operating states of appliances are divided into steady and transient. Therefore, the load monitoring methods can perform load decomposition based on steady or transient characteristics. The transient characteristics mainly include the change of the current or voltage waveform at the moment when the appliance starts. The duration of transient characteristics is short and unique, which can improve the recognition between loads. However, the transient feature extraction needs complex hardware, and the transient process of the load is affected by conditions such as grid voltage fluctuations, and the aging of electrical equipment. Steady-state load characteristics such as current harmonics [9], power harmonics [10,11], and current waveforms [12][13][14] have been successively applied to NILM. Steady-state characteristics are generally obtained by index quantification. They are less affected by noise, but the probability of similarity of the single steady-state characteristics of the load increases when the number of loads rises. In order to distinguish multiple appliances, a new load characteristic that is V-I trajectory has been developed for NILM in recent years. The V-I trajectory is plotted based on the steady-state voltage and current, and it is used to express appliances' electrical characteristics. The V-I trajectory in conjunction with many popular classification algorithms can offer better or generally comparable overall precision of prediction, robustness, and reliability [15]. In short, the V-I trajectory has advantages as a currently popular feature.
Based on different load characteristics, a variety of load identification algorithms have been proposed in NILM [16,17]. With the development of machine learning, the system results from the learning process can deliver the optimal predictive performance for appliance loads. Therefore, machine learning techniques have become a popular choice for NILM, since they showed significant disaggregation performance; in particular, Factorial Hidden Markov models (FHMMs) [18][19][20], Neural Networks (NN) [21][22][23][24], graph-based signal processing [25], Support Vector Machines (SVM) [26], k-Nearest Neighbours [26], and Decision Trees [27] have been successfully employed for NILM. Specifically, Reference [28] proposed a NILM algorithm based on features of the V-I trajectory. Ten V-I trajectory features were quantified based on physical significance, which accurately represented those appliances that had multiple built-in modes with distinct power consumption profiles, and the support vector machine multi-classification algorithm was employed for load identification. Reference [29] proposed a NILM algorithm based on the joint use of active and reactive power in the Additive Factorial Hidden Markov Models framework. In particular, in the proposed approach, the appliance model was represented by a bivariate Hidden Markov Model whose emitted symbols are the joint active-reactive power signals. The disaggregation was performed by means of an alternative formulation of the Additive Factorial Approximate Maximum a Posteriori (AFAMAP) algorithm for dealing with the bivariate HMM models. Reference [30] proposed an experimental design process for the application of energy disaggregation using multi-label classification. 
That work took the current (I), real power (P), reactive power (Q), and power factor (PF) sampled at one-minute intervals as electrical parameters, and employed RAndom k-labELsets (RAkEL) with a Decision Tree as the multi-label classification algorithm, together with a suitable model parameter configuration.
However, it is worth noting that most classification algorithms described in the literature cannot recognize unidentified appliances in the consumer environment. In these algorithms, an unidentified appliance is assigned the label and power consumption of the identified appliance that has the most similar features. This leads to confusion between identified appliances and unidentified appliances, and it reduces the accuracy of appliance identification. As a result, the household power consumption that is fed back to consumers and the power department is inaccurate.
Considering that the V-I trajectory is an image feature, its application enables NILM to be transformed into image retrieval. With the explosive growth of data in real applications like image retrieval, approximate nearest neighbor (ANN) search [31] has become a hot research topic in recent years. Due to its fast query speed and low memory cost, hashing [32] has become one of the most popular and effective techniques among existing ANN techniques. Existing hashing methods can be divided into data-independent methods and data-dependent methods. In data-independent methods, the hash function is typically randomly generated. It is independent of any training data. The representative data-independent methods include locality-sensitive hashing (LSH) [33] and its variants. Data-dependent methods try to learn the hash function from some training data, and they are also called learning to hash (L2H) [34] methods. Compared with data-independent methods, L2H methods can achieve comparable or better accuracy with shorter hash codes. Representative learning to hash methods include fast supervised hashing (FastH) [35], supervised discrete hashing (SDH) [36], column-sampling based discrete supervised hashing (COSDISH) [37], and column generation hashing (CGHash) [38].
Therefore, this paper proposes a V-I trajectory enabled deep pairwise-supervised hashing (DPSH) method for NILM. It performs simultaneous feature learning and hash-code learning. DPSH encodes the V-I trajectory images of identified appliances into compact binary hash codes. According to the different coding results, we can identify the various identified appliances in the environment, and DPSH can detect previously unidentified appliances in an automated way. When there is an unidentified appliance, DPSH encodes the V-I trajectory images of this appliance into brand new hash codes, which differ from those of the identified appliances. Hence, our proposed method can provide a scalable solution to energy monitoring for contemporary unidentified appliances.
The main contributions of this paper can be summarized as follows: Firstly, to the best of our knowledge, DPSH which can perform simultaneous feature learning and hash-code learning for applications with pairwise labels is first applied to NILM. This method transfers appliance identification to approximate nearest neighbor search, and improves the identification accuracy of identified appliances.
Secondly, the majority of the NILM approaches are sensitive to the replacement and addition of appliances in the house, and thus require regular retraining. In this paper, the focus lies on creating a classification algorithm that is able to detect unidentified appliances. Therefore, the algorithm can be resilient against the replacement and addition of appliances in the house. If an unidentified appliance is detected, labeling and retraining are requested to restore the identified environment and then identify the next unidentified appliance.
Thirdly, this paper also reflects the retraining results of our proposed method after identifying the unidentified appliance. The results show that the identification accuracy of DPSH can be restored to a high level through retraining, and when the next unidentified appliance appears, DPSH can still recognize it. In other words, DPSH maintains high sustainability. Experiments on public datasets show that DPSH can outperform the benchmark method to achieve state-of-the-art performance in NILM. This paper is organized as follows. Section 2 defines some symbols and issues in DPSH method. Section 3 explains the model and learning process of DPSH method as well as how it can be used for load disaggregation. Section 4 introduces benchmark datasets, the input of network, performance metrics, and selection of code length. The experimental results on publicly available datasets are presented in Section 5 to evaluate the performance of the proposed DPSH method for NILM. Moreover, the conclusions are given in Section 6.

Notation and Problem Definition
We convert the classification of electrical appliances to approximate nearest neighbor search of V-I trajectories in this paper.

Notation
Lowercase letters like $z$ denote vectors, and uppercase letters like $Z$ denote matrices. $Z^T$ denotes the transpose of $Z$. The Euclidean norm of a vector is denoted as $\|\cdot\|_2$. $\mathrm{sgn}(\cdot)$ denotes the element-wise sign function: it returns 1 if the element is positive and −1 otherwise.

Problem Definition
Suppose we have $n$ V-I trajectory images $X = \{x_i\}_{i=1}^{n}$, where $x_i$ is the $i$-th element in set $X$. Besides the set of V-I trajectories, the training set of supervised hashing with pairwise labels also contains a set of pairwise labels $S = \{s_{ij}\}_{n \times n}$ with $s_{ij} \in \{0, 1\}$. $s_{ij} = 1$ denotes that V-I trajectory $x_i$ is similar to V-I trajectory $x_j$; otherwise, they are dissimilar. Here, the pairwise labels typically refer to semantic labels provided with manual effort.
The goal of supervised hashing with pairwise labels is to classify V-I trajectories correctly by learning a hash function $h(x)$. We obtain a binary code $b_i \in \{-1, 1\}^c$ for every trajectory $x_i$, where $c$ is the code length. All binary codes are collected in the set $B = \{b_i\}_{i=1}^{n}$. The similarity in $S$ should be preserved in the binary codes $B$. If $s_{ij} = 1$, the Hamming distance between the binary codes $b_i$ and $b_j$ should be as small as possible. If $s_{ij} = 0$, the binary codes $b_i$ and $b_j$ should have a large Hamming distance.
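The notation above can be made concrete with a small sketch. The helper names (`sgn`, `hamming_distance`, `inner_product`) and the toy vectors are illustrative assumptions, not part of the original method description. The sketch also checks the standard identity for codes in {−1, 1}^c, namely that the Hamming distance equals (c − b_i^T b_j) / 2, which is why a large inner product corresponds to a small Hamming distance.

```python
def sgn(values):
    """Element-wise sign as defined in the Notation section: 1 if positive, else -1."""
    return [1 if v > 0 else -1 for v in values]


def hamming_distance(b_i, b_j):
    """Number of positions at which two {-1, +1} codes differ."""
    if len(b_i) != len(b_j):
        raise ValueError("codes must have the same length")
    return sum(1 for p, q in zip(b_i, b_j) if p != q)


def inner_product(b_i, b_j):
    """Plain dot product of two equal-length codes."""
    return sum(p * q for p, q in zip(b_i, b_j))


# Hypothetical real-valued outputs for two trajectories (code length c = 4).
b_1 = sgn([0.7, -1.2, 0.1, -0.4])
b_2 = sgn([0.9, -0.3, -0.8, -0.5])

c = len(b_1)
d = hamming_distance(b_1, b_2)

# For {-1, +1} codes the two similarity measures are tied together:
# d = (c - <b_1, b_2>) / 2
assert d == (c - inner_product(b_1, b_2)) // 2
```

Note that, following the definition in the text, `sgn` maps zero to −1, which differs from sign functions that return 0 for a zero input.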

Deep Pairwise-Supervised Hashing
In this section, we introduce the DPSH model based on the V-I trajectory in detail. This section contains the model composition and learning algorithm.

Model Composition
The workflow of the proposed method, which is able to detect unidentified appliances, is shown in Figure 1. In the training phase, a hash function that encodes samples of the same appliance into the same binary hash code is learned from the V-I trajectory images by training the feature learning part. The V-I trajectory images must be paired, and each pair is labeled as a must-link or a cannot-link depending on whether the two images belong to the same class. The transformation does not depend on the appliance label. On the transformed input, DPSH determines the final encoding results of each identified appliance by minimizing the pairwise loss. In the test phase, a V-I trajectory image is encoded into a single binary hash code. Next, we calculate the Hamming distance (d) to all representation codes and select the minimum. If this distance is equal to zero, the V-I trajectory is classified as one of the identified appliances. Otherwise, the trajectory is labeled as "unidentified".
The DPSH model has an end-to-end deep learning architecture, shown in Figure 2, which contains two essential parts: the feature learning part and the objective function part. The two parts give feedback to each other during the training procedure. The feature learning part aims to learn a deep neural network that extracts multiple features from V-I trajectory images, and these features are then encoded into compact binary hash codes. The goal of the objective function part is to learn how to encode the features so that the codes reflect the supervised information (similarity) between two V-I trajectory images. The combination of the two parts achieves the purpose of load identification.
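The test-phase decision described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `identify`, the dictionary layout of the representation codes, the appliance labels, and the `max_distance` parameter are all hypothetical. With `max_distance=0` the sketch reproduces the rule in the text, where only a zero Hamming distance counts as an identified appliance.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length codes differ."""
    return sum(1 for p, q in zip(a, b) if p != q)


def identify(code, representation_codes, max_distance=0):
    """Return (label, d) for the nearest representation code.

    representation_codes: dict mapping an appliance label to a list of codes
    (hypothetical layout; an appliance may own more than one code).
    If the minimum distance d exceeds max_distance, the label is "unidentified".
    """
    best_label, best_d = None, None
    for label, codes in representation_codes.items():
        for rep in codes:
            d = hamming_distance(code, rep)
            if best_d is None or d < best_d:
                best_label, best_d = label, d
    if best_d is not None and best_d <= max_distance:
        return best_label, best_d
    return "unidentified", best_d


# Toy 4-bit codes for two identified appliances (illustrative values only).
known = {
    "fridge": [[1, -1, 1, -1]],
    "microwave": [[-1, -1, 1, 1]],
}

seen = identify([1, -1, 1, -1], known)
unseen = identify([1, 1, -1, -1], known)
```

Here `seen` resolves to the fridge at distance 0, while `unseen` has a minimum distance of 2 and is therefore flagged as unidentified, which is the trigger for labeling and retraining.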

Feature Learning Part
A convolutional neural network (CNN) is a type of neural network (NN) that is often used in computer vision because it is highly suitable for classifying images. We adopt a classical CNN model to extract the features of V-I trajectory images in this part. The feature learning part contains the Alexnet [39] model as a component, which has eleven layers and is good at extracting the features of images. The structure of the DPSH model is symmetrical. In other words, there are two Alexnets (top Alexnet and bottom Alexnet) in the feature learning part. These two Alexnets have the same structure and share weights.
That is to say, the inputs of the model are pairs of V-I trajectories and the pairwise labels between the trajectories. The feature learning part is an indispensable foundation for NILM.

[Figure labels: Pairs of V-I trajectory images; Feature learning; Hash codes; Final trajectory codes; Trajectory codes; Calculate Hamming distance to all representation codes and select minimum d]

The network structure and parameters of the Alexnet model are listed in Table 1. More specifically, there are 5 convolutional layers (Conv 1-5), 3 max-pooling layers (MaxP 1-3), and 3 fully connected layers (FC 1-3) in the Alexnet model. The role of the convolutional layers is to extract local features of the trajectories. The advance of Alexnet is to imitate the effect of a large receptive field by using multiple convolution kernels in sequence. Alexnet uses convolution kernels and pooling kernels to deepen the network structure continuously and improve performance. The max-pooling layers reduce the size of the images, and the fully connected layers reassemble the extracted local features into a complete representation through the weight matrix. The parameters of the Alexnet model are mainly concentrated in the three fully connected layers. As shown in Table 1, the size of the convolution kernels in each segment gradually decreases, and the size of the pooling kernels in each segment is the same.

Table 1. Network structure and parameters of the Alexnet model (columns: Layer, Size of Filter, Number of Channels, Stride).

Objective Function Part
Given the binary codes $B = \{b_i\}_{i=1}^{n}$ for all the V-I trajectory images, we can define the likelihood of the pairwise labels $S = \{s_{ij}\}_{n \times n}$ as follows:

$$p(s_{ij} \mid B) = \begin{cases} \sigma(\Omega_{ij}), & s_{ij} = 1 \\ 1 - \sigma(\Omega_{ij}), & s_{ij} = 0 \end{cases} \tag{1}$$

where $\sigma(\Omega_{ij}) = \frac{1}{1 + e^{-\Omega_{ij}}}$ and $\Omega_{ij} = \frac{1}{2} b_i^T b_j$. $\Omega_{ij}$ is half of the inner product of two codes. When the two codes are the same, the inner product is the largest. In contrast, when the two codes are completely different, the inner product is the smallest, so $\Omega_{ij}$ indicates the similarity of two codes. It is worth noting that $b_i \in \{-1, 1\}^c$.

By taking the negative log-likelihood of the pairwise labels observed in $S$, we get the following optimization problem:

$$\min_{B} J_1 = -\log p(S \mid B) = -\sum_{s_{ij} \in S} \left( s_{ij}\,\Omega_{ij} - \log\left(1 + e^{\Omega_{ij}}\right) \right) \tag{2}$$

The optimization problem in (2) fully reflects the goal of supervised hashing with pairwise labels. It makes the Hamming distance between two similar V-I trajectory images as small as possible, and the Hamming distance between two dissimilar V-I trajectory images as large as possible. Therefore, the purpose of distinguishing various appliances through the V-I trajectories is achieved.

However, the problem in (2) is a discrete optimization problem that is hard to solve. It can be approached by directly relaxing $B = \{b_i\}_{i=1}^{n}$ from discrete to continuous, but satisfactory performance still cannot be obtained in this way. Therefore, we adopt a strategy that solves the problem in (2) in a discrete way. In other words, there is no need to give up the accuracy of $B$ by converting it. We reformulate the problem in (2) as the following equivalent problem:

$$\min_{B, U} J_2 = -\sum_{s_{ij} \in S} \left( s_{ij}\,\Theta_{ij} - \log\left(1 + e^{\Theta_{ij}}\right) \right) \quad \text{s.t. } u_i = b_i,\; u_i \in \mathbb{R}^{c \times 1},\; b_i \in \{-1, 1\}^c,\; i = 1, \ldots, n \tag{3}$$

where $\Theta_{ij} = \frac{1}{2} u_i^T u_j$ and $U = \{u_i\}_{i=1}^{n}$. To optimize the above problem, we move the equality constraints in (3) to a regularization term, so the optimization problem (3) is reformulated as:

$$\min_{B, U} J_3 = -\sum_{s_{ij} \in S} \left( s_{ij}\,\Theta_{ij} - \log\left(1 + e^{\Theta_{ij}}\right) \right) + \eta \sum_{i=1}^{n} \left\| b_i - u_i \right\|_2^2 \tag{4}$$

where $\eta$ is the weight of the regularization term and is a hyper-parameter.
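To illustrate how the negative log-likelihood of the pairwise labels behaves, the following sketch evaluates it for toy codes. It assumes the standard DPSH form, in which each pair contributes −(s_ij Ω_ij − log(1 + e^{Ω_ij})) with Ω_ij equal to half of the inner product of the two codes. The function name `pairwise_nll`, the dictionary layout of the labels, and the toy codes are illustrative assumptions.

```python
import math


def pairwise_nll(codes, labels):
    """Negative log-likelihood of pairwise labels given the codes.

    codes:  list of equal-length vectors (here entries in {-1, +1}).
    labels: dict mapping an index pair (i, j) to s_ij in {0, 1}
            (hypothetical layout chosen for this sketch).
    """
    loss = 0.0
    for (i, j), s_ij in labels.items():
        omega = 0.5 * sum(p * q for p, q in zip(codes[i], codes[j]))
        # log(1 + e^omega), written in a numerically stable form
        log_term = max(omega, 0.0) + math.log1p(math.exp(-abs(omega)))
        loss -= s_ij * omega - log_term
    return loss


# Trajectories 0 and 1 share a code; trajectory 2 has the opposite code.
codes = [[1, 1, -1, -1], [1, 1, -1, -1], [-1, -1, 1, 1]]

consistent = {(0, 1): 1, (0, 2): 0}    # labels agree with the codes
inconsistent = {(0, 1): 0, (0, 2): 1}  # labels contradict the codes

loss_good = pairwise_nll(codes, consistent)
loss_bad = pairwise_nll(codes, inconsistent)
```

The loss is small when similar pairs share codes and dissimilar pairs have opposite codes, and it grows when the codes contradict the labels, which is the behavior the objective relies on.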

DPSH Model
The DPSH model uses an end-to-end framework, which integrates the above feature learning part and objective function part. The end-to-end framework is expressed as

$$u_i = W^T \phi(x_i; \theta) + v$$

where $\theta$ denotes the network parameters of the Alexnet model in the feature learning part, and $\phi(x_i; \theta)$ is the output of the Alexnet model with network parameters $\theta$ when the V-I trajectory image $x_i$ is used as its input. $W \in \mathbb{R}^{1000 \times c}$ is a weight matrix, and $v \in \mathbb{R}^{c \times 1}$ denotes a bias vector. That is to say, our DPSH model integrates the feature learning part and the objective function part into an end-to-end framework through a weight matrix and a bias vector. This is similar to the function of a fully connected layer. After integrating the two parts, the final problem becomes

$$\min_{B, W, v, \theta} J = -\sum_{s_{ij} \in S} \left( s_{ij}\,\Theta_{ij} - \log\left(1 + e^{\Theta_{ij}}\right) \right) + \eta \sum_{i=1}^{n} \left\| b_i - \left( W^T \phi(x_i; \theta) + v \right) \right\|_2^2$$

where $\Theta_{ij} = \frac{1}{2} u_i^T u_j$. Therefore, we get an end-to-end DPSH model, which performs feature learning and hash-code learning simultaneously in the same framework. The method can generate different hash codes to distinguish identified appliances and a completely new kind of hash code for an unidentified appliance.
In conclusion, DPSH contains three key components. The first component is a deep neural network to learn image features from pixels. The second component is a hash function to map the learned image features to hash codes, and the third component is a loss function to measure the quality of hash codes guided by pairwise labels.

Learning Algorithm
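In the DPSH model, the known quantities include the pairwise labels $S$ and the regularization weight $\eta$. The other parameters, $W$, $v$, $\theta$, and $B$, need to be learned. In this paper, we adopt a minibatch-based strategy for learning. That is to say, in each iteration we sample a mini-batch of V-I trajectory images from the whole training set and perform learning based on these sampled images. We also use alternating learning: we optimize one parameter while the other parameters are fixed. The hash codes $b_i$ can be directly optimized as follows:

$$b_i = \mathrm{sgn}(u_i) = \mathrm{sgn}\left( W^T \phi(x_i; \theta) + v \right)$$

where $\mathrm{sgn}(\cdot)$ extracts the sign of the input. We adopt the back-propagation (BP) algorithm to calculate the gradient in order to update the other parameters $W$, $v$, and $\theta$. In particular, we calculate the derivative of the loss function with regard to $u_i$ according to the following formula:

$$\frac{\partial J}{\partial u_i} = \frac{1}{2} \sum_{j: s_{ij} \in S} (a_{ij} - s_{ij})\, u_j + \frac{1}{2} \sum_{j: s_{ji} \in S} (a_{ji} - s_{ji})\, u_j + 2\eta\,(u_i - b_i)$$

where $a_{ij} = \sigma\left(\frac{1}{2} u_i^T u_j\right)$. Then, we calculate the gradients and use the back-propagation algorithm to update the parameters $W$, $v$, and $\theta$ respectively:

$$\frac{\partial J}{\partial W} = \phi(x_i; \theta) \left( \frac{\partial J}{\partial u_i} \right)^T, \qquad \frac{\partial J}{\partial v} = \frac{\partial J}{\partial u_i}, \qquad \frac{\partial J}{\partial \phi(x_i; \theta)} = W \frac{\partial J}{\partial u_i}$$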
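The alternating scheme can be sketched on toy data without a neural network. The sketch treats the real-valued outputs u_i directly as the free variables, which is a simplification for illustration: in the actual model the gradient with respect to u_i would be back-propagated into W, v, and θ. The objective and gradient follow the standard DPSH form, and all names (`objective`, `grad_u`, `lr`) and numbers are assumptions of this sketch.

```python
import math


def sgn(vec):
    """Element-wise sign: +1 for positive entries, -1 otherwise."""
    return [1.0 if x > 0 else -1.0 for x in vec]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def objective(U, B, S, eta):
    """Pairwise negative log-likelihood plus eta * sum_i ||b_i - u_i||^2."""
    loss = 0.0
    for (i, j), s_ij in S.items():
        theta_ij = 0.5 * dot(U[i], U[j])
        loss -= s_ij * theta_ij - math.log1p(math.exp(theta_ij))
    for u, b in zip(U, B):
        loss += eta * sum((b_k - u_k) ** 2 for u_k, b_k in zip(u, b))
    return loss


def grad_u(i, U, B, S, eta):
    """Gradient of the objective with respect to u_i, with B held fixed."""
    c = len(U[i])
    grad = [2.0 * eta * (U[i][k] - B[i][k]) for k in range(c)]
    for (p, q), s_pq in S.items():
        a_pq = 1.0 / (1.0 + math.exp(-0.5 * dot(U[p], U[q])))
        if p == i:
            for k in range(c):
                grad[k] += 0.5 * (a_pq - s_pq) * U[q][k]
        if q == i:
            for k in range(c):
                grad[k] += 0.5 * (a_pq - s_pq) * U[p][k]
    return grad


# Toy outputs for three trajectories (c = 2) and their pairwise labels.
U = [[0.5, -0.2], [0.3, 0.8], [-0.6, 0.1]]
S = {(0, 1): 1, (0, 2): 0, (1, 2): 0}
eta = 0.1

# Step 1: with the outputs fixed, update the codes: b_i = sgn(u_i).
B = [sgn(u) for u in U]

# Step 2: with the codes fixed, take one gradient step on the outputs.
lr = 0.05  # hypothetical learning rate for this toy example
grads = [grad_u(i, U, B, S, eta) for i in range(len(U))]
U_new = [[u_k - lr * g_k for u_k, g_k in zip(u, g)] for u, g in zip(U, grads)]
```

Repeating the two steps alternates between a closed-form code update and a gradient update, which mirrors the structure of the learning algorithm described above.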

Tool and Environment
All the experiments are implemented using Python 3.6 on a standard PC with an Intel Core i7-6700MQ CPU running at 3.40 GHz and with 16.0 GB of RAM. The CNN architecture is constructed with PyTorch, and the pre-trained CNN model is transferred to DPSH.

Benchmark Datasets
The performance of the proposed algorithm is validated on the Reference Energy Disaggregation Data Set (REDD) and the Plug-Level Appliance Identification Dataset (PLAID).

REDD Dataset
REDD is a freely available data set containing detailed power usage information from several homes. It is aimed at furthering research on energy disaggregation [40]. The data contain power consumption from real homes over several months, both for the whole house and for each individual circuit in the house. All data in REDD are recorded with UTC time stamps. For each monitored house, REDD records the AC waveform itself so that both real and reactive power can be computed easily. The dataset includes low-frequency power data for each house, and high-frequency voltage and current data for house 3 and house 5. The list of appliances in the REDD dataset is given in Table 2.

PLAID Dataset
The Plug-Level Appliance Identification Dataset is a public and crowd-sourced dataset for load identification research. The PLAID dataset includes short voltage and current measurements for different residential appliances, each lasting on the order of a few seconds. The goal of PLAID is to provide a public library of high-resolution appliance measurements that can be integrated into existing or novel appliance identification algorithms [41]. PLAID currently includes current and voltage measurements sampled at 30 kHz from 11 different appliance types present in more than 60 households in Pittsburgh, Pennsylvania, USA. Data collection took place during the summer of 2013 and winter of 2014. Measurements with significant noise in the voltage due to measurement errors were removed [42]. The list of appliances in the PLAID dataset is given in Table 3.

Our paper adopts the event detection method and the trajectory extraction method proposed in Reference [28]. An event is defined as the state-switching process of an appliance within a certain period of time. In this paper, an event is detected by comparing the variation in power during that period with two predetermined thresholds.
Event detection is summarized by (12). The stride is represented by $R$ in (12), and $R$ is set to 1 s. The aggregated apparent power at $t$ s is $P_t$. The difference between two adjacent aggregated apparent powers is denoted by $\Delta P_t$ ($\Delta P_t = P_{t+1} - P_t$). The event begins when $|\Delta P_t| \geq P_{on1}$, and we continue to calculate $|\Delta P_{t+1}|, |\Delta P_{t+2}|, \ldots$ until $|\Delta P_{t+TR}| < P_{on1}$ and $|\Delta P_{t+TR+1}| < P_{on1}$. If $|P_{t+TR} - P_t| \geq P_{on2}$, the appliance has a state transition during $t \sim t + TR$ s. In other words, the complete event begins at $t$ s and finishes at $t + TR$ s, where $T$ is the number of strides that represents the duration of the event. $P_{on1}$ and $P_{on2}$ are set to 30 W and 100 W respectively in this paper. Power fluctuations below $P_{on1}$ are considered to be caused by noise, and a power difference before and after the event that is greater than $P_{on2}$ is considered a complete state-switching process. Event detection is an essential step for V-I trajectory extraction. The voltage and current data must be processed before plotting the trajectory.
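The two-threshold rule described above can be sketched as a simple scan over the aggregated power series. This is one possible reading of the description, not the exact procedure of Reference [28]: the function name `detect_events`, the return format, and the handling of the end of the series are assumptions. The series is assumed to be sampled with stride R = 1 s, so indices correspond to seconds.

```python
def detect_events(power, p_on1=30.0, p_on2=100.0):
    """Return a list of (start, end) index pairs for detected events.

    power: aggregated apparent power samples at a 1 s stride.
    An event starts at t when |dP_t| >= p_on1 and is extended until two
    consecutive differences fall below p_on1. It is kept only if the power
    changed by at least p_on2 between start and end.
    """
    delta = [power[t + 1] - power[t] for t in range(len(power) - 1)]
    events = []
    t = 0
    while t < len(delta):
        if abs(delta[t]) >= p_on1:
            end = t + 1
            while end < len(delta):
                settled_now = abs(delta[end]) < p_on1
                settled_next = end + 1 >= len(delta) or abs(delta[end + 1]) < p_on1
                if settled_now and settled_next:
                    break
                end += 1
            end = min(end, len(power) - 1)
            if abs(power[end] - power[t]) >= p_on2:
                events.append((t, end))
            t = end
        else:
            t += 1
    return events


# An appliance switching on between t = 2 s and t = 4 s (toy values).
switch_on = [50, 52, 51, 120, 250, 252, 251, 250, 249]
# A short spike that returns to the baseline: treated as noise.
spike = [50, 85, 50, 50, 50]

events_on = detect_events(switch_on)
events_spike = detect_events(spike)
```

In the first series the power rises from 51 to 250, which exceeds the second threshold, so one event spanning indices 2 to 4 is reported. In the second series the first threshold is crossed but the net change is zero, so no event is reported.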
We extract the same number of voltage and current waveforms before and after the event. Four kinds of waveforms (the voltage waveforms before the event, the current waveforms before the event, the voltage waveforms after the event, and the current waveforms after the event) are interpolated and averaged separately. Then, the voltage waveforms before and after the event are averaged. The current waveform before and after the event takes the difference. Therefore, we can plot the V-I trajectory as a delta-form signature, which makes use of the difference between two consecutive snapshots and meets the feature-additive criterion [12]. We extracted 10 types of appliances' V-I trajectories from the REDD dataset to verify the effectiveness of our proposed method. According to the above method, the V-I trajectories for different appliances from REDD database are shown in Figure 3. Then, we directly use the raw images as input in DPSH model. We extract 4400 V-I trajectory images to train the proposed model. Each image belongs to one of the 10 classes. For these classes, the number of images of each class is at least 100, and we randomly select 1100 V-I trajectory images for the test set.
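The delta-form signature described above can be sketched as follows. The sketch assumes the waveforms have already been cut into cycles and interpolated to a common number of samples; the function names, the argument layout, and the sign convention (current after the event minus current before it) are illustrative assumptions.

```python
def average_cycles(cycles):
    """Point-wise mean of equal-length waveform cycles."""
    n = len(cycles[0])
    if any(len(c) != n for c in cycles):
        raise ValueError("all cycles must have the same number of samples")
    return [sum(c[k] for c in cycles) / len(cycles) for k in range(n)]


def delta_vi_trajectory(v_before, i_before, v_after, i_after):
    """Build a delta-form V-I trajectory around one event.

    Each argument is a list of cycles (lists of samples) taken before or
    after the event. The voltage is the mean of the averaged before/after
    voltages, and the current is the difference of the averaged currents
    (after minus before, an assumed sign convention).
    """
    v_b, i_b = average_cycles(v_before), average_cycles(i_before)
    v_a, i_a = average_cycles(v_after), average_cycles(i_after)
    voltage = [(b + a) / 2.0 for b, a in zip(v_b, v_a)]
    current = [a - b for b, a in zip(i_b, i_a)]
    return voltage, current


# Toy four-sample cycles; two cycles on each side of the event.
v_before = [[0, 1, 0, -1], [0, 1, 0, -1]]
i_before = [[0, 1, 0, -1], [0, 1, 0, -1]]
v_after = [[0, 3, 0, -3], [0, 3, 0, -3]]
i_after = [[0, 4, 0, -4], [0, 4, 0, -4]]

voltage, current = delta_vi_trajectory(v_before, i_before, v_after, i_after)
```

Plotting `current` against `voltage` gives the trajectory of the appliance that switched during the event, because the current drawn by appliances that were already running cancels in the difference.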
PLAID is a data set composed directly of the voltage and current data of many appliances. It requires the same data processing to extract the trajectory. However, the PLAID dataset differs from the REDD dataset: PLAID includes current and voltage measurements sampled from different appliance types present in more than 60 households. We need a method to reduce the fluctuation of the image shape, so we convert every V-I trajectory image into a binary V-I image (an $n \times n$ matrix) by meshing the V-I trajectory. Each cell of the mesh is assigned a binary value that denotes whether or not it is traversed by the trajectory. The binary V-I image reduces the volatility of data generated by different appliance types present in different households. We choose 6 binary V-I trajectory images to test the DPSH algorithm, as shown in Figure 4. We select 757 V-I trajectory images to train the proposed model. Each image belongs to one of the 6 classes, and the number of images differs between classes. The class with the fewest images contains 26 images, and we randomly select 226 V-I trajectory images for the test set.

All of the V-I trajectory images are resized to 224 × 224. Then we directly use the raw image pixels as input to the DPSH model. Please note that there are two Alexnets (top Alexnet and bottom Alexnet). These two Alexnets have the same structure and share the same weights. That is to say, both the input and the loss function are based on pairs of images. The extracted V-I trajectories are input into the Alexnets in pairs. The features of the V-I trajectories, which are extracted after 5 convolutional layers, 3 max-pooling layers, and 3 fully connected layers, form feature vectors of 1000 elements. In the training period, the DPSH model learns a hashing function that converts the 1000-dimensional feature vectors into trajectory codes. The code length is a hyper-parameter that can be set as needed.
The trajectory codes must ensure that the Hamming distance between two similar V-I trajectory images is as small as possible, while the Hamming distance between two dissimilar V-I trajectory images is as large as possible. In other words, the trajectory codes reflect pairwise similarity. Therefore, we can identify the category of a V-I trajectory from these codes in the testing period. Two V-I trajectories belong to the same category if the Hamming distance between their trajectory codes is close to 0. When there are unidentified appliances in the consumer's environment, the proposed model produces coding results that have never been seen in the training period. At this point, we can retrain the model quickly to accommodate the addition of the new appliance. The retrained model can then continue to detect other new appliances.
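The meshing step used for PLAID can be sketched as follows. This is a simplified illustration: it marks only the cells that contain a sampled point, whereas a full implementation would also mark cells crossed between consecutive samples. The function name `binary_vi_image`, the min-max normalization, and the row orientation are assumptions of this sketch.

```python
def binary_vi_image(voltage, current, n=16):
    """Convert a V-I trajectory into an n x n binary image.

    Each axis is min-max scaled onto n cells. A cell is set to 1 if at
    least one (voltage, current) sample falls into it.
    """
    if len(voltage) != len(current):
        raise ValueError("voltage and current must have the same length")
    v_min, v_max = min(voltage), max(voltage)
    i_min, i_max = min(current), max(current)

    def to_cell(x, lo, hi):
        if hi == lo:
            return 0  # degenerate axis: everything falls into the first cell
        return min(int((x - lo) / (hi - lo) * n), n - 1)

    image = [[0] * n for _ in range(n)]
    for v, i in zip(voltage, current):
        col = to_cell(v, v_min, v_max)
        row = n - 1 - to_cell(i, i_min, i_max)  # larger current towards the top
        image[row][col] = 1
    return image


# A straight-line trajectory (resistive-like load) on a 2 x 2 mesh.
image = binary_vi_image([-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0], n=2)
```

Because each axis is rescaled to its own range, two appliances of the same type with different current magnitudes map to similar binary images, which is the reduction in volatility that the text refers to.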

Performance Metrics
In this paper, the problem of NILM is considered as an approximate nearest neighbor search task, which should assign each V-I trajectory image to one of the predefined classes. We utilize mean average precision (MAP), a performance metric that illustrates the performance in many multi-class classification tasks [43]. The ranks of data samples are estimated in the calculation of average precision (AP). In the discrete form of AP for class $C_k$, $x_i$ represents the $i$-th query V-I image, $|C_k|$ denotes the cardinality of set $C_k$, and $\mathrm{rank}(s; S)$ is the rank of $s$ in set $S$. A smoothed pair-wise rank function $\mathrm{rank}(x_i; x_j)$, defined in [42], can be used to estimate the ranking relation between two samples $x_i$ and $x_j$. For $M$-class classification, we generally use the mean of the APs of all classes to evaluate the overall performance.
We adopt three classification metrics to evaluate the effect of classification. They are shown in the following equations based on True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). These metrics analyze how well the algorithm can identify changes in the appliance's status.

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$F_1\text{-score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

The $F_1$-score is a measure of the test's accuracy and is obtained as the harmonic mean of the Precision and Recall. It calculates the percentage of energy correctly assigned to each appliance in the dataset. A higher $F_1$-score indicates a better identification of the appliance. To obtain the final test result, the average $F_1$-score is taken:

$$\bar{F}_1 = \frac{1}{N} \sum_{i=1}^{N} F_{1,i}$$

where $N$ is the number of different appliances, and $F_{1,i}$ is the $F_1$-score when appliance $i$ is used as the hold-out appliance.
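The three metrics and their average can be computed from the confusion counts with a few lines of code. The sketch uses the standard definitions of Precision, Recall, and the F1-score; the function names, the zero-division convention (returning 0.0), and the example counts are assumptions for illustration only.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, Recall, and F1-score from confusion counts.

    Returns 0.0 for a metric whose denominator is zero (a convention
    chosen for this sketch).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


def average_f1(f1_scores):
    """Mean F1-score over the N hold-out appliances."""
    if not f1_scores:
        raise ValueError("at least one F1-score is required")
    return sum(f1_scores) / len(f1_scores)


# Made-up counts for two hold-out appliances.
p_a, r_a, f1_a = precision_recall_f1(tp=8, fp=2, fn=2)
p_b, r_b, f1_b = precision_recall_f1(tp=9, fp=1, fn=3)
mean_f1 = average_f1([f1_a, f1_b])
```

True negatives do not appear in any of the three formulas, so the metrics are not inflated by the many time steps in which an appliance is correctly reported as inactive.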

Selection of Code Length
We adopt the REDD dataset and the PLAID dataset to verify the performance of DPSH. In the experiment on each dataset, both training and testing are done on all of the appliances. The number of V-I trajectory images produced by each kind of appliance makes up a different proportion of the total, because the various appliances are switched on and off at different frequencies. We adopt a pre-trained Alexnet model to reduce the training time. Besides, we set the mini-batch size to 64 and tune the learning rate within $[10^{-6}, 10^{-2}]$. For the DPSH method, the hyper-parameter $\eta$ is set to 50 by using a validation strategy. The most important parameter is the length of the output hash code.
The code length determines the recognition accuracy and space complexity of the DPSH algorithm. Therefore, the code length is set to 12, 24, 32, and 48 bits [35][36][37] so that we can verify the effect of this parameter on the experimental results. As shown in Figure 5, the DPSH method based on the Alexnet model shows high accuracy in terms of MAP for all four code lengths.

It can be seen that the performance of DPSH on REDD is better than that on PLAID. This is because we assume that there is only one V-I trajectory for each appliance when processing the REDD dataset, whereas the V-I trajectory shape of each appliance varies in the PLAID dataset. In other words, we assume that all 10 appliances in the REDD dataset are single-state appliances (each appliance corresponds to one V-I trajectory). In the PLAID dataset, there are many types of V-I trajectories for each appliance. This may reflect a normal multi-state appliance, or it may be caused by measurement errors, and parts of the trajectories of different appliances may coincide. This is the reason why the results on REDD and PLAID are different. However, DPSH maintains high accuracy on both datasets, as shown in Figure 5. The minimum MAP on the PLAID dataset is 0.9169, obtained when the code length is set to 12 bits.
According to the above analysis, the MAP of DPSH on the two datasets is high and stable. The average MAP for the four code lengths is 0.9581, 0.9616, 0.9627, and 0.9627, respectively. The best value, 0.9627, is reached by both the 32-bit and the 48-bit codes. Because a longer code increases memory cost, we choose 32 bits as the code length for the subsequent analysis.
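The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the average MAP values are those reported in the text, while the function name `pick_code_length` and the tie tolerance `tol` are hypothetical and not part of the original method.

```python
# Average MAP per code length in bits, as reported in the text.
avg_map = {12: 0.9581, 24: 0.9616, 32: 0.9627, 48: 0.9627}


def pick_code_length(scores, tol=1e-4):
    """Return the shortest code length whose MAP is within `tol` of the best.

    `tol` is an assumed tie tolerance, not a value from the paper.
    """
    best = max(scores.values())
    candidates = [bits for bits, score in scores.items() if best - score <= tol]
    return min(candidates)


chosen = pick_code_length(avg_map)

# Storage needed for one hash code, in bytes (8 bits per byte).
bytes_per_code = {bits: bits / 8 for bits in avg_map}

print(chosen, bytes_per_code[chosen], bytes_per_code[48])
```

With these values the rule selects the 32-bit code, which needs 4 bytes per stored code instead of 6 bytes for the 48-bit code.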

Results and Discussion
To assess how well the DPSH model identifies both identified and unidentified appliances, we designed two scenarios. Scenario 1 simulates a normal household electricity environment in which all appliances are identified. Scenario 2 simulates a household environment in which an unidentified appliance exists.

Recognition of Identified Appliances
We train and test on the two datasets to verify whether the proposed algorithm can distinguish identified appliances. The number of appliances used for training and testing is the same. Table 4 lists the Precision, Recall, and F1-score achieved for every appliance by DPSH with a 32-bit code length; these are the best results of our proposed method on the two datasets. The results show that the proposed algorithm is highly effective in appliance identification: it accurately identifies all appliances in REDD and also identifies the appliances in PLAID well, which indicates strong adaptability to different datasets. The 32-bit encoding results of DPSH on the PLAID dataset are shown in Figure 6. DPSH makes the Hamming distance between codes of different labels as large as possible, but one appliance label does not necessarily correspond to a single code. This reflects the complexity of the PLAID dataset, and the multiple codes per appliance increase the stability of the method: the algorithm maintains high identification accuracy even when the V-I trajectory images of an appliance deviate because of fluctuations or the appliance has multiple states.
Figure 6. The 32-bit encoding results of the DPSH method on the PLAID dataset.
The benchmark method in Reference [44] is based on a siamese neural network and the DBSCAN clustering method (SN-DBSCAN). Figure 7 shows the Precision and Recall of the proposed DPSH and of SN-DBSCAN [44] on the REDD and PLAID datasets. The appliances in Figure 7 correspond to Tables 2 and 3. Figure 7a,b shows the results on REDD. The Precision and Recall of 32-bit DPSH are stable. The Precision of SN-DBSCAN deviates slightly, whereas its Recall fluctuates strongly and drops to 0.556 for the third appliance. As shown in Figure 7c,d, the results on PLAID are similar. The Precision and Recall of 32-bit DPSH fluctuate only slightly and remain at very high values. Owing to the complexity of PLAID, they vary somewhat on the second and third appliances; the lowest Recall, on the second appliance, is still 0.882. The Precision and Recall of SN-DBSCAN are not only lower on the second and third appliances but also deviate more on the other appliances. DPSH is therefore more stable than SN-DBSCAN on the complex PLAID dataset and improves on SN-DBSCAN even in the presence of noise or measurement error. In Figure 7, DPSH is consistently more accurate than SN-DBSCAN, and it maintains high Precision and Recall on both REDD and PLAID. The radar chart in Figure 8 shows the F1-score of each appliance on the two datasets; the area enclosed by each colored line is proportional to the F_average of the corresponding algorithm.
Figure 8 shows that DPSH achieves a better F1-score than SN-DBSCAN for every appliance. The F_average of DPSH and SN-DBSCAN is 0.984 and 0.932 on REDD, and 0.969 and 0.815 on PLAID, respectively. The more complex the test environment, the larger the difference in F_average between DPSH and SN-DBSCAN. Compared with SN-DBSCAN, the proposed method significantly improves accuracy and generalizes well when both methods are tested on the same datasets.
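For readers who want to reproduce the indicators, the following Python sketch shows how a per-appliance F1-score and an averaged score could be computed from Precision and Recall. It assumes that F_average is the unweighted mean of the per-appliance F1-scores, which this section does not state explicitly. The function names and the input pairs are hypothetical and are not taken from Table 4.

```python
def f1_score(precision, recall):
    """Harmonic mean of Precision and Recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f_average(per_appliance):
    """Unweighted mean of per-appliance F1 (assumed definition of F_average)."""
    scores = [f1_score(p, r) for p, r in per_appliance]
    return sum(scores) / len(scores)


# Hypothetical (precision, recall) pairs, one per appliance.
example = [(1.0, 1.0), (0.95, 0.90), (0.80, 0.60)]
print(round(f_average(example), 3))
```

Because the F1-score is a harmonic mean, an appliance with a low Recall pulls its score down even when its Precision is high, which is why large Recall fluctuations translate into a smaller F_average.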

Recognition of Unidentified Appliance
To assess how well the method identifies unidentified appliances, we run the experiment on both datasets. In the REDD dataset, we choose Ap10 as the unidentified appliance and call it Un10. All other appliances are identified, so training is done on 9 appliances and testing on 10. Similarly, in the PLAID dataset we choose Ap5 as the unidentified appliance and call it Un5, so training is done on 5 appliances and testing on 6. Because the goal of our experiments is to detect unidentified appliances in user environments, Un10 and Un5 are kept fixed as the unidentified appliances in the two datasets. This is called leave-one-appliance-out validation. In REDD each appliance corresponds to one V-I trajectory, whereas in PLAID each appliance has more than one V-I trajectory and the trajectories may resemble each other, so we designed separate experiments on the two datasets. We validate whether (1) the selected appliances are properly separated by different codes with high Hamming distance, and (2) the trajectory images of the unidentified appliance are classified as "unidentified".
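The hold-out protocol can be illustrated with a short Python sketch. It assumes the data are available as (image, label) pairs that are already split into training and test sets. The function name `hold_out_appliance`, the image identifiers, and the toy labels are hypothetical.

```python
def hold_out_appliance(train, test, held_out, unknown_label="unidentified"):
    """Drop `held_out` from the training pairs and relabel it in the test pairs."""
    train_known = [(x, y) for x, y in train if y != held_out]
    test_relabeled = [(x, unknown_label if y == held_out else y) for x, y in test]
    return train_known, test_relabeled


# Hypothetical (image_id, label) pairs standing in for V-I trajectory images.
train = [("img0", "Ap1"), ("img1", "Ap2"), ("img2", "Ap5")]
test = [("img3", "Ap1"), ("img4", "Ap5")]

train_known, test_relabeled = hold_out_appliance(train, test, held_out="Ap5")
print(sorted({label for _, label in train_known}))
print(test_relabeled)
```

The model never sees the held-out appliance during training, and a test sample of that appliance counts as correct only when it is labeled "unidentified".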
The 32-bit encoding results of DPSH on PLAID when the last appliance is selected as unidentified are shown in Figure 9. Compared with Figure 6, the codes of the unidentified appliance differ from those of the identified appliances and take several forms. Therefore, the proposed method can detect the unidentified appliance. Note that the codes of the same labels differ between Figures 6 and 9 because the samples used to learn the hash function are different (in Figure 9, the samples for the fourth appliance are not used).
Figure 9. The 32-bit encoding results of the DPSH method with the unidentified appliance.
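One way to turn such codes into a decision is a nearest-code lookup in Hamming space with a rejection threshold. The sketch below is an assumption for illustration only: this section does not specify the decision rule, and the threshold `max_distance`, the function names, and the example codes are hypothetical. Codes are stored as Python integers, and the database must not be empty.

```python
def hamming(a, b):
    """Number of differing bits between two hash codes stored as integers."""
    return bin(a ^ b).count("1")


def classify(query, database, max_distance):
    """Return the label of the nearest stored code, or "unidentified".

    `database` is a non-empty list of (code, label) pairs; one label may own
    several codes. `max_distance` is an assumed rejection threshold.
    """
    label, dist = min(
        ((lab, hamming(query, code)) for code, lab in database),
        key=lambda pair: pair[1],
    )
    return label if dist <= max_distance else "unidentified"


# Hypothetical 32-bit codes; "Ap1" has two codes, as one label may map to several.
database = [(0xFFFF0000, "Ap1"), (0xFFFF00F0, "Ap1"), (0x0000FFFF, "Ap2")]

print(classify(0xFFFF0001, database, max_distance=4))
print(classify(0x0F0F0F0F, database, max_distance=4))
```

In this toy example the first query lies one bit away from a stored "Ap1" code and is accepted, while the second query is at least 16 bits away from every stored code and is rejected as "unidentified".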
To show which appliances are confused with each other, we compute a confusion matrix: Figure 10a,b reports, for each appliance type (row index), the proportion of labels that were correctly predicted or confused with other appliances (column index). The values in the matrix are percentages, and the colors represent the recall value per row (thus per appliance). The recognition accuracy of the appliances used for training remains very high. Because of measurement errors or shapes similar to those of other appliances, the unidentified appliance is encoded into a variety of discrete hash codes, so it may be confused with other appliances. On the REDD dataset, only a small number of samples of identified appliances are labeled as unidentified, and all samples of the unidentified appliance are marked correctly. On PLAID, owing to the complexity of the environment, identified and unidentified appliances are occasionally confused, but the number of such cases is small. The experimental results on the two datasets thus show that DPSH also identifies unidentified appliances with high accuracy.
In the method of Reference [44], the training appliances are used to learn to form clusters. The samples of the held-out appliance neither belong to any cluster nor form a cluster of their own, because the siamese neural network is not trained on them; they are scattered and receive the label "unidentified". We train and test SN-DBSCAN on the same datasets. The experimental results of the two methods on REDD and PLAID with the unidentified appliance are shown in Tables 5 and 6. For nearly all appliances, the recognition accuracy of DPSH is higher than that of SN-DBSCAN. At the same time, DPSH maintains a high recognition capability for identified appliances.
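The row-wise percentages and the per-row recall of a confusion matrix can be computed as in the following Python sketch. The counts are hypothetical and do not reproduce Figure 10; rows are true labels and columns are predicted labels, with the last row and column standing for the unidentified appliance.

```python
def row_percentages(counts):
    """Convert raw confusion counts to percentages of each row total."""
    result = []
    for row in counts:
        total = sum(row)
        result.append([100.0 * c / total if total else 0.0 for c in row])
    return result


def recall_per_row(counts):
    """Recall of each class: diagonal entry divided by its row total."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(counts)]


# Hypothetical counts for the labels Ap1, Ap2, and unidentified.
labels = ["Ap1", "Ap2", "unidentified"]
counts = [
    [48, 0, 2],   # two Ap1 samples labeled as unidentified
    [1, 45, 4],   # Ap2 confused with Ap1 once and with unidentified four times
    [0, 0, 20],   # all unidentified samples marked correctly
]

for name, row in zip(labels, row_percentages(counts)):
    print(name, [round(v, 1) for v in row])
print([round(r, 2) for r in recall_per_row(counts)])
```

Normalizing by row makes appliances with different numbers of test images comparable, and the diagonal of the percentage matrix is the per-appliance recall expressed in percent.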
The F1-score jointly considers Precision and Recall and therefore reflects appliance identification more fully, so we focus on this indicator next. The radar chart in Figure 11 shows the F1-score of each appliance in the experiment for both datasets. In the REDD experiment, the F1-scores of DPSH on Ap1 and Ap2 are slightly lower than those of SN-DBSCAN. On the other appliances, and especially on the unidentified appliance, the F1-scores of DPSH are significantly higher, and the F_average values of DPSH and SN-DBSCAN are 0.942 and 0.860, respectively. In the PLAID experiment, DPSH identifies the unidentified appliance better, but its F1-scores on Ap2 and Ap6 are only slightly higher than those of SN-DBSCAN. The F_average values of the two algorithms are 0.883 and 0.726, respectively. When an unidentified appliance appears in the environment, SN-DBSCAN can still identify it, but its recognition accuracy for both identified and unidentified appliances decreases significantly. Compared with SN-DBSCAN, the proposed approach improves F_average by 0.082 on REDD and by 0.157 on PLAID in absolute terms. In conclusion, DPSH recognizes both the identified appliances and the unidentified appliance well.
To further illustrate the effectiveness and practicality of DPSH, we conduct retraining experiments. We label the V-I trajectory images of the unidentified appliances that each algorithm identified in the above experiments and add them to the respective training set. The new training sets are used to retrain the DPSH and SN-DBSCAN models. The resulting performance indicators are shown in Tables 7 and 8. The F1-scores again show that DPSH recognizes the appliances very well.
The F_average values of DPSH and SN-DBSCAN after retraining are 0.984 and 0.918 on REDD, and 0.967 and 0.763 on PLAID. The accuracy of DPSH is thus restored to a very high level by retraining, even though the number of unidentified-appliance samples used for retraining is relatively small. In contrast, the accuracy of SN-DBSCAN depends strongly on the amount of training data and remains unsatisfactory after retraining. As further unidentified appliances appear and the number of retraining rounds increases, the accuracy of SN-DBSCAN is expected to decline further, whereas DPSH is expected to keep identifying all appliances accurately. We therefore conclude that DPSH is practical and sustainable: after retraining, it is able to detect other unidentified appliances.
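The averaged scores reported in this section can be put side by side with a few lines of Python. The numbers are the ones given in the text; the setting names and the dictionary layout are only an illustrative choice.

```python
# F_average values reported in the text, as (DPSH, SN-DBSCAN) pairs.
reported = {
    ("REDD", "all identified"): (0.984, 0.932),
    ("PLAID", "all identified"): (0.969, 0.815),
    ("REDD", "one unidentified"): (0.942, 0.860),
    ("PLAID", "one unidentified"): (0.883, 0.726),
    ("REDD", "after retraining"): (0.984, 0.918),
    ("PLAID", "after retraining"): (0.967, 0.763),
}

# Absolute gap in F_average between DPSH and SN-DBSCAN for each setting.
gaps = {key: round(dpsh - sn, 3) for key, (dpsh, sn) in reported.items()}

for (dataset, setting), gap in gaps.items():
    sn = reported[(dataset, setting)][1]
    print(f"{dataset:5s} {setting:17s} gap={gap:+.3f} relative={gap / sn:+.1%}")
```

The gaps of 0.082 and 0.157 for the experiments with one unidentified appliance are absolute differences in F_average. In every setting the gap is larger on PLAID than on REDD.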

Conclusions
This paper has proposed a voltage-current trajectory enabled DPSH method for NILM. Our main aim is to distinguish different appliance loads, including unidentified appliances, by encoding their V-I trajectory images. The DPSH model has an end-to-end deep learning architecture consisting of a feature learning part and an objective function part. The feature learning part learns a deep neural network that extracts multiple features from the original images, which are then encoded into compact binary hash codes. The objective function part learns how to encode the features so that the codes reflect the similarity between query images and database images. Experiments on real datasets have shown that DPSH improves accuracy and outperforms the benchmark method, achieving state-of-the-art performance in NILM. The F_average of DPSH on the REDD and PLAID datasets is 0.984 and 0.969, respectively. DPSH also addresses the problem that the accuracy of a recognition algorithm drops considerably when an unidentified appliance is added to the environment: it can identify the unidentified appliance while preserving accuracy on the identified appliances. The F_average values of the two experiments with an unidentified appliance are 0.942 and 0.883, respectively. After images of the unidentified appliance are added to the training set and the model is retrained, the F_average values increase to 0.984 and 0.967, respectively.