Building a Graph Signal Processing Model Using Dynamic Time Warping for Load Disaggregation

Building on recent unsupervised non-intrusive load monitoring (NILM) algorithms that use graph Laplacian regularization (GLR) and achieve state-of-the-art performance, in this paper we propose a novel unsupervised approach that designs the underlying graph to model the correlation within time-series smart meter measurements. We propose a variable-length data segmentation approach to extract potential events, assign all measurements associated with an identified event to a graph node, employ dynamic time warping to define the adjacency matrix of the graph, and propose a robust cluster labeling approach. Our simulation results on four different datasets show up to 10% improvement in classification performance over competing approaches.


Introduction
The ongoing smart metering deployments worldwide [1,2] aim to maximize benefits of the smart grid by providing household aggregate energy consumption in real time, facilitating accurate billing, demand response [3], improved network maintenance, fault detection [4,5], and improved energy efficiency feedback.
Ambitious energy efficiency goals and the availability of smart meter data have re-ignited research on Non-Intrusive Load Monitoring (NILM) [6], i.e., identifying individual loads from the aggregated measurements (e.g., voltage, current, and active and reactive power) of smart meters [7-9]. However, despite the fact that the NILM problem has been researched for four decades [6], it remains a challenge, especially at sampling rates in the order of seconds or minutes, typical of widespread smart meters. Indeed, sensor measurement noise, multiple appliances working at the same time, unknown appliances, multi-state appliances, base-load, and load fluctuation all act as "disaggregation noise" [10].
An appliance's operation can be modeled as a finite state machine, and disaggregation can then be performed based on a state transition model learnt during training [6]. The most popular examples of this are state-based NILM methods, including various Hidden Markov Model (HMM)-based NILM methods and their extensions [6,8,11-13]. Event-based NILM methods, on the other hand, are based on detecting the event of an appliance being switched on or off, or transiting from one operating state to another (e.g., in the case of a washing machine). Event-based NILM methods usually share the following three steps:
(1) Event detection: detect the time instant of a state change in the time-series aggregate data.
(2) Feature construction and extraction: compute and extract electrical features, for example the active power, over the detected event duration.
(3) Classification and pattern matching: classify the events into pre-defined labels (i.e., appliances) based on the extracted features.
Graph signal processing (GSP)-based NILM approaches have recently been proposed (reviewed in Section 2); however, these approaches showed high computational complexity (since each sample of the aggregated measurements is associated with one graph node) and poor performance when multiple appliances are operating at the same time.
To address the above limitations of current designs, in this paper, we propose a novel unsupervised method for NILM disaggregation using Dynamic Time Warping (DTW). The proposed approach segments aggregated power data into windows and associates them with graph nodes. Then, the DTW distance, instead of the Euclidean distance used in [7,9,25,34-37], is used to learn the underlying graph for clustering the different data segments. Simulation results demonstrate significant performance improvement over prior methods. Specifically, the key contributions are:
• a data segmentation approach to identify variable-length segments of potential events;
• DTW-based graph learning based on the segmented data;
• a robust cluster labeling approach for efficient appliance labeling and post-processing; and
• an overall unsupervised NILM method, comprising graph-based signal pre-processing, data segmentation, graph design, and labeling, that outperforms competing approaches.
The rest of this paper is organized as follows. Section 2 provides background on disaggregation methods and GSP in NILM. Section 3 describes the proposed unsupervised approach, with its four steps presented as subsections. Section 4 introduces the benchmarks used for comparison, and the results are presented in Section 5. The last section concludes the paper and highlights future work.

Background and Prior Work
In this section, we first introduce the notation and review relevant concepts of GSP. Then, we review prior work on GSP tools for NILM.

Notation and Background
All matrices are denoted by uppercase bold letters, such as X. Element in the ith row and jth column of the matrix X is denoted by x i,j . Vectors are denoted by lowercase bold letter, such as x with the ith element denoted as x i .
A vector of measurements x is represented by a graph, defined as G = (V, A), where V is the set of nodes and A is a weighted adjacency matrix. Each node v_i ∈ V corresponds to an element x_i ∈ x. Each entry a_{i,j} of A represents the weight of the edge between graph nodes v_i and v_j, carrying information on the similarity, i.e., correlation, between the elements x_i and x_j. a_{i,j} is usually defined by the Gaussian kernel weighting function:

a_{i,j} = \exp\left( - d(i,j)^2 / \sigma^2 \right),   (1)

where d(i, j) is the distance between x_i and x_j, e.g., the Euclidean distance as in [7,9,25,34,35], and σ is a scaling factor. The combinatorial graph Laplacian matrix is defined as:

L = D - A,   (2)

where D is the diagonal degree matrix with d_{k,k} = \sum_{j=1}^{N} a_{j,k}, and N is the length of x. Note that, since G is an undirected graph, A, as defined in (1), is symmetric (a_{i,j} = a_{j,i}) and real, and L is a positive semi-definite matrix of dimension N × N.
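For concreteness, the following minimal NumPy sketch builds A, D, and L from a small measurement vector using the Gaussian kernel form of Equation (1); the function name, the example values of x, and the choice of σ are illustrative only, not taken from the paper.

```python
import numpy as np

def build_graph(x, sigma=1.0):
    """Build adjacency, degree, and combinatorial Laplacian matrices for a
    measurement vector x using the Gaussian kernel of Equation (1)."""
    x = np.asarray(x, dtype=float)
    # Pairwise distances d(i, j) = |x_i - x_j|
    d = np.abs(x[:, None] - x[None, :])
    # Gaussian kernel weights; self-loops on the diagonal are zeroed out
    A = np.exp(-(d ** 2) / sigma ** 2)
    np.fill_diagonal(A, 0.0)
    D = np.diag(A.sum(axis=0))   # degree matrix, d_kk = sum_j a_jk
    L = D - A                    # combinatorial graph Laplacian
    return A, D, L

A, D, L = build_graph([100.0, 102.0, 1500.0, 1498.0], sigma=50.0)
```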
Next, we define a graph signal y as a mapping from V to a set of complex numbers. For example, in the classification task, y is a set of classification labels representing classes to which the samples in x belong to; that is, y i ∈ y is the class of the sample x i ∈ x.
GSP-based representation methods, including graph Laplacian regularization (GLR) and graph total variation (GTV) minimization, have been used in the past to classify time-series signals [32,39-41]. The main idea is to represent classification labels as a piece-wise smooth signal on a graph and then apply a graph signal smoothness prior. In particular, the classification labels, y, form a discrete graph signal that resides on G, where the classification label y_i ∈ y, corresponding to the observed feature vector x_i, indexes vertex v_i.
Let n be the number of samples in the N-length vector x for which the class membership is known; that is, these n samples form a training set, and all classification labels y i , i ≤ n are known and fixed. All unknown labels to be determined, i > n, are initialized to zero. If the graph is designed such that vertices that correspond to observations with similar characteristics (and are consequently in the same class) are connected with high-value weighted edges, then the graph signal y is expected to change slowly from vertex to vertex, that is, it is piece-wise smooth.
Then, one can restore the unclassified labels (y_i, i > n) by finding the smoothest graph signal, i.e., by minimizing either the GTV, defined as:

GTV(y) = \left\| y - \frac{1}{|\lambda_{max}|} A y \right\|_2^2,   (3)

where λ_max is the largest-magnitude eigenvalue of A, or the graph global smoothness, often referred to as GLR, whose quadratic form is given by:

y^T L y = \frac{1}{2} \sum_{i,j} a_{i,j} (y_i - y_j)^2.   (4)

Since, for undirected graphs, the Laplacian representation is often desirable, due to many useful properties of the graph Laplacian matrix [42,43], in this paper we use GLR minimization. Note that GLR minimization leads to real-valued solutions that are then, for binary classification, 'quantized' so that y ∈ {0, 1}, where y_i = 1 means that x_i belongs to Class 1. Multi-class classification can then be performed in a one-against-all fashion, one class at a time. The method described above provides a powerful, scalable, and flexible data mining and signal processing approach, particularly suited for classification tasks when training periods are short and insufficient to build comprehensive class models [40,41].
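A minimal sketch of the GLR-based label restoration described above, assuming the standard partitioned closed-form solution with the known labels held fixed (the pseudo-inverse guards against singular sub-matrices); the function and variable names are ours, not the paper's.

```python
import numpy as np

def glr_classify(L, known_idx, known_labels, threshold=0.5):
    """Restore unknown binary labels by minimizing y^T L y (Equation (4))
    with the known labels fixed, then quantize the result to {0, 1}."""
    N = L.shape[0]
    y = np.zeros(N)
    y[known_idx] = known_labels
    unknown_idx = np.setdiff1d(np.arange(N), known_idx)
    # Partition L and solve the first-order optimality condition:
    # L_uu y_u = -L_uk y_k
    L_uu = L[np.ix_(unknown_idx, unknown_idx)]
    L_uk = L[np.ix_(unknown_idx, known_idx)]
    y[unknown_idx] = -np.linalg.pinv(L_uu) @ (L_uk @ y[known_idx])
    return (y > threshold).astype(int)

# Example usage (with L from the previous sketch):
# labels = glr_classify(L, known_idx=[0], known_labels=[1])
```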

Related Work
GSP-based methods have recently been proposed for tackling the event-based load disaggregation problem [7,9,20,25,34-37]. In [7,25], after the events (i.e., load state changes) are detected by thresholding, the feature of each event is represented using a power edge, that is, the change of the aggregated power value when the detected event starts or ends. This way, each event is associated with one graph node and classified by minimizing Equation (4) [7]. Binary classifiers are run multiple times, once for each appliance. Zhao et al. [25] first applied GLR minimization to identify all events that belong to the cluster of a particular event type/class. Then, the clustered events are removed and the procedure is repeated until all events are assigned to a cluster. Zhao et al. [37] provided a GSP-filter-based pre-processing and a graph-based edge matching post-processing to further improve the performance of event-based NILM methods. In [34,35], unlike the classification/clustering labels defined as graph signals in [7,25], the authors directly used active power or power edges as graph signals.
The main issue with the above approaches [7,25,35,36] is that they heavily depend on successful edge detection to isolate the power change events, and thus assign a single graph node to each detected event. However, due to the noisy nature of real aggregate data, where many unknown loads are present, edge detection often fails. For example, if a true event is missed (i.e., a false negative) or a false event is detected (i.e., a false positive), the GSP classifier will fail. This situation is, to a certain extent, mitigated in [9], where a semi-supervised approach of high complexity is proposed, relying on time embedding, graph sparsification with loopy belief propagation, and finally GLR or manifold regularization for classification.
To mitigate the effect of event detection failure, while keeping the complexity low, in this paper we propose a novel graph generation approach for NILM, where each graph node is associated with a continuous time-series of power measurements belonging to the same steady state, instead of with a single event sample. Each graph node is thus associated with a feature vector comprising power values and time information. Instead of the Euclidean distance used in prior work [7,9,25,34-36], and due to the variable length of the event segments, a distance based on dynamic time warping (DTW) is used to estimate the level of correlation between segments and then construct the graph. DTW is used for NILM in [19,26] to compare the distance between appliance "templates" and extracted segments; however, the approach of Liao et al. [19] requires a database that contains good-quality templates.
In a related paper [44], GSP-based principal component analysis (PCA) is applied to microseismic data analysis, and the graph weights are calculated based on the sum of Euclidean distances. This scheme is included as a benchmark.

Methodology
The main limitation of the prior GSP-based NILM approaches reviewed in the previous section is that they either ignore the fluctuation of power values within the identified event, that is, rely on only one value measured when the appliance is switching on/off or changing state [7,25], or suffer from high complexity [9,34-36]. This is in contrast to state-based approaches, which track the changes of measured power but are difficult to train [8,11,12]. To address these limitations, we propose an improved GSP-based NILM design, described next.
The proposed NILM system comprises a pre-processing block, data segmentation, clustering, and finally cluster labeling. Each of these blocks is described in the following subsections. We use P_i to denote the active power measurement at time instant i, and ∆P_i = P_i − P_{i−1}, for i > 1.

Pre-Processing
The smart meter measurements are accompanied by unavoidable measurement noise. Furthermore, transient power values act as noise in the steady-state analysis. Pre-processing has been commonly used to reduce these effects (see [25] and references therein). It is especially important in the proposed scheme, as the scheme relies on all collected measurements, instead of just an "edge", i.e., the value of a significant power change corresponding to an appliance state change. We propose a two-stage pre-processing approach: first, median filtering for sharpening edges and improving data segmentation (discussed in the next subsection), and, second, bilateral filtering for denoising the signal and removing signal fluctuations to improve the clustering step (Section 3.3).
We sharpen signal edges via median filtering so that ∆P_i values corresponding to appliance state changes stand out more clearly. That is, if |∆P_i| > Thr, the ith sample potentially belongs to the start/end of an event, and therefore to an appliance state change. We update such samples using:

\Delta P_i \leftarrow \mathrm{median}\left( \Delta P_{i-k}, \ldots, \Delta P_{i+k} \right),   (5)

where k defines the averaging window. Next, graph-based bilateral (GB) filtering, as in [37], smooths signal fluctuations. For a vector signal x, with x_i = ∆P_i, we design the underlying graph by setting the adjacency matrix A using Equation (1), based on the Gaussian kernel function with the scaling factor σ set heuristically:

a_{i,j} = \exp\left( - (x_i - x_j)^2 / \sigma^2 \right).   (6)

Then, the diagonal degree matrix D is given by d_{k,k} = \sum_{j=1}^{N} a_{j,k}. From the bilateral filter operator D^{-1}A, we obtain an optimization problem to restore the filtered, clean signal:

\hat{x}^* = \arg\min_{\hat{x}} \; \|\hat{x} - x\|_2^2 + \alpha \, \hat{x}^T (I - D^{-1}A) \, \hat{x},   (7)

where α is a parameter that trades off the two terms. Equation (7) can be solved by setting the first derivative of the cost function with respect to the filter input to zero, and the closed-form solution is [32]:

\hat{x}^* = \left( I + \alpha (I - D^{-1}A) \right)^{-1} x,   (8)

where \hat{x}^* is the output signal and I is the identity matrix. GB filtering is applied as in Equation (8), where x is the vector of consecutive ∆P_i values given by Equation (5). As in [37], we set α to 1, which was experimentally determined to provide the best results for the NILM problem at low sampling rates. The performance gain due to pre-processing is experimentally evaluated in Section 5.3.
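The following sketch illustrates one plausible reading of the two-stage pre-processing, following our reconstruction of Equations (5)-(8) above: median filtering of the power differences flagged as potential edges, then GB smoothing via the closed form of Equation (8). The thresholds, window length, and scaling factor are illustrative, and the exact filters used in the paper may differ.

```python
import numpy as np
from scipy.signal import medfilt

def preprocess(P, thr=20.0, k=5, sigma=10.0, alpha=1.0):
    """Two-stage pre-processing sketch (illustrative parameter values).
    Intended for short windows only: the dense solve is O(N^3)."""
    P = np.asarray(P, dtype=float)
    dP = np.diff(P, prepend=P[0])
    # Stage 1: median-filter samples flagged as potential edges (|dP| > thr);
    # kernel_size k must be odd for scipy's medfilt
    med = medfilt(dP, kernel_size=k)
    dP = np.where(np.abs(dP) > thr, med, dP)
    # Stage 2: GB filtering on the vector of consecutive dP values
    d = np.abs(dP[:, None] - dP[None, :])
    A = np.exp(-(d ** 2) / sigma ** 2)
    np.fill_diagonal(A, 0.0)
    Dinv = np.diag(1.0 / np.maximum(A.sum(axis=0), 1e-12))
    I = np.eye(len(dP))
    # Closed form of Equation (8): (I + alpha*(I - D^{-1}A))^{-1} x
    return np.linalg.solve(I + alpha * (I - Dinv @ A), dP)
```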

Signal Segmentation
In the following, we refer to p_i and ∆p_i as the filtered aggregate power and the filtered power change value, respectively. Once the measured power signal is denoised and sharpened, the next step is to segment the signal and associate each segment with one graph node.
The idea is to group aggregate power samples that belong to the same steady state. Data segmentation is achieved by comparing the absolute value of the aggregate power change, |∆p_i|, with ∆p_i = p_{i+1} − p_i, against a preset threshold Thr_s. If |∆p_i| > Thr_s, p_{i+1} becomes the first sample of a new segment; otherwise, p_{i+1} belongs to the same segment as the previous sample.
We use S to denote the set of all segments and S_i to denote the ith segment, consisting of consecutive samples of aggregated power corresponding to the ith estimated steady state. The preset threshold Thr_s should be larger than the envisaged maximum variation within the same steady state, but smaller than the minimum magnitude of a state change.
The proposed segmentation approach is given in Algorithm 1. Starting from i = 1 and j = 1, the algorithm assigns p_i to S_j and compares |∆p_i| with Thr_s: p_{i+1} is assigned to S_j if |∆p_i| < Thr_s, and to S_{j+1} otherwise, in which case j is increased by one. Then, i is incremented and the above steps are repeated until all samples of the aggregate power measurement p are assigned to a segment.
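A minimal Python sketch of the thresholded segmentation described in Algorithm 1; the function name and the example power values and threshold are ours, chosen only to illustrate the behaviour.

```python
def segment(p, thr_s):
    """Variable-length segmentation: start a new segment whenever the
    filtered power change between consecutive samples exceeds thr_s."""
    segments = [[p[0]]]
    for i in range(len(p) - 1):
        dp = p[i + 1] - p[i]
        if abs(dp) > thr_s:
            segments.append([p[i + 1]])    # p_{i+1} starts a new segment
        else:
            segments[-1].append(p[i + 1])  # same steady state as p_i
    return segments

S = segment([100, 101, 1500, 1502, 1499, 102, 100], thr_s=50)
# -> [[100, 101], [1500, 1502, 1499], [102, 100]]
```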

GSP-Based Clustering
In this section, we introduce the proposed unsupervised GSP (UGSP) clustering methods, which differ in the way segments are defined and, consequently, in the way the underlying graph is designed. The first approach segments the data as in Section 3.2 and then resorts to DTW to build the graph. The PCA-based alternatives reduce each variable-length segment to a fixed-length representation, while the remaining methods avoid the segmentation of Section 3.2 altogether and instead rely on comparing fixed-length windows. Following the notation introduced in Section 2, a graph G = (V, A) is defined by graph nodes V and adjacency matrix A.

UGSP-DTW Method
Each node v_i ∈ V is assigned a segment S_i ∈ S, obtained as described in Section 3.2. Since S_i and S_j are vectors of generally different lengths, we use the DTW distance to measure their similarity. DTW is a commonly used metric for measuring the similarity between two sequences of possibly different lengths [19]; it is very popular in speech recognition and, lately, in data mining. Given two sequences, DTW performs a non-linear mapping of one onto the other by finding, via dynamic programming, the warping path that minimizes the cumulative distance between them. We thus design the graph adjacency matrix as:

a_{i,j} = \exp\left( - \mathrm{DTWdist}(S_i, S_j)^2 / \sigma^2 \right),   (9)

where σ is the scaling factor, iteratively tuned as in [45] based on the graph Fourier transform, and DTWdist(a, b) is the DTW distance between sequences a and b.
In particular, the DTW distance between two sequences p = [p_1, . . . , p_n] and q = [q_1, . . . , q_m] is calculated as follows. As part of the initialization, D(i, 0) = D(0, j) = ∞ for all i, j > 0 and D(0, 0) = 0. As a boundary constraint, the mapping path starts at (1, 1) and ends at (n, m). Let d(i, j) = |p_i − q_j| be the distance between points p_i and q_j, where 1 ≤ i ≤ n and 1 ≤ j ≤ m. Then, D(i, j), the accumulated DTW distance between the sub-sequences [p_1, . . . , p_i] and [q_1, . . . , q_j], is given by

D(i, j) = d(i, j) + \min\{ D(i-1, j),\, D(i, j-1),\, D(i-1, j-1) \},

and D(n, m) is the final DTW distance between p and q. More details can be found in [46].
Similar to He et al. [7], we define the graph signal s as the classification label of each data segment S_i. Starting from the first sample of the graph signal s, we adopt binary GSP-based classification to find all samples s_i, i > 1, that belong to the same cluster as s_1. That is, we first set s_1 = 1 and all remaining samples of the graph signal to zero. Then, GLR is adopted to estimate the graph signal with minimum variation with respect to the underlying graph. The optimization problem is:

s^* = \arg\min_{s} \; s^T L s \quad \text{s.t.} \quad s_1 = 1,   (10)

where L is the combinatorial Laplacian matrix of the graph, as defined in Equation (2). The closed-form solution to the optimization problem in (10) is [47,48]:

s^*_{2:N+1} = - L^{\#}_{2:N+1,\,2:N+1} \, L_{2:N+1,\,1} \, s_1,   (11)

where L^{\#}_{2:N+1,2:N+1} is the pseudo-inverse of L_{2:N+1,2:N+1}. Note that, for a matrix L, the subscript x_1:x_2, y_1:y_2 denotes the sub-matrix of L containing rows x_1 to x_2 and columns y_1 to y_2. If s^*_i > T_s, we set s_i = 1, i.e., we assign it to the same cluster as the starting sample; otherwise, s_i is set to zero. Similar to He et al. [7], we set T_s = 0.5. Then, we remove all clustered segments, including the first one, and repeat the above steps until all segments are assigned to a cluster.
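For reference, a direct implementation sketch of the DTW recurrence above, together with the resulting adjacency weight of Equation (9); the example segments and the scaling factor are arbitrary rather than tuned as in [45].

```python
import numpy as np

def dtw_dist(p, q):
    """Accumulated DTW distance between two power segments of possibly
    different lengths, using |p_i - q_j| as the local distance."""
    n, m = len(p), len(q)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(p[i - 1] - q[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Adjacency weight between two segments, as in Equation (9)
a_ij = np.exp(-dtw_dist([100, 1500, 1500], [98, 1490, 1495, 1500]) ** 2 / 500.0 ** 2)
```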

UGSP_PCA-FIX and UGSP_GPCA-FIX
An alternative approach is to reduce the dimensionality of the segments, leading to shorter fixed-length segments, and to estimate the correlation between the newly generated segments using the Euclidean distance. Principal component analysis (PCA) is a commonly used statistical procedure that applies an orthogonal transformation to convert a set of possibly correlated variables into a set of linearly uncorrelated variables, called principal components, ordered so that the leading components capture the maximum sample variance [49]. After data segmentation (Section 3.2), PCA is performed on each segment of samples to obtain the principal components. Since the components of all segments are of the same length, the graph adjacency matrix is defined using the sum of Euclidean distances, akin to Equation (6), i.e., by replacing the DTW distance in Equation (9) with the Euclidean distance. The remaining clustering and labeling steps are the same as in the proposed scheme.
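The paper does not spell out how a one-dimensional segment is arranged for PCA; the following loose sketch assumes that segments are first resampled to a common length and PCA is then applied across segments, which reproduces the idea of fixed-length principal-component features but may differ from the exact construction used here.

```python
import numpy as np

def fixed_length_components(segments, length=10, n_components=3):
    """Resample every variable-length segment to `length` samples, centre
    the data, and keep the leading principal-component scores as the
    per-segment feature vectors (assumed construction, for illustration)."""
    R = np.array([np.interp(np.linspace(0, 1, length),
                            np.linspace(0, 1, len(s)), s) for s in segments])
    R = R - R.mean(axis=0)                      # centre across segments
    _, _, Vt = np.linalg.svd(R, full_matrices=False)
    return R @ Vt[:n_components].T              # principal component scores
```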
We also provide results for Graph Principal Component Analysis (GPCA), inspired by Taylor et al. [44]. In GPCA, for each data segment, we build a graph G = (V, A), where each graph node v_i ∈ V is associated with a sample inside the data segment. The adjacency matrix A and the graph Laplacian matrix L are defined using Equation (1) with the Euclidean distance measure and Equation (2), respectively. As with PCA, GPCA finds the new coordinates (i.e., principal components) of each data segment. The solution of the generalized eigenvalue decomposition problem in Equation (12), where ψ_k and λ_k are the kth eigenvector and eigenvalue of the graph Laplacian matrix, respectively, provides the new parameterized data with the same dimension [44,50]. This is similar to traditional PCA, but has the advantage of providing a more general model of the raw data. The GPCA method shares the remaining steps with the PCA method.

UGSP-FIX and UGSP_OPT-FIX Methods
The segmentation approach proposed in Algorithm 1 relies on a manually set threshold Thr_s. To avoid this, the UGSP-FIX and UGSP_OPT-FIX methods use fixed-length segments of T samples. The adjacency matrix A is then defined by applying the Gaussian kernel of Equation (1) to the weighted sample-by-sample Euclidean distance between the segments centred at samples i and j:

a_{i,j} = \exp\left( - \Big( \sum_{t=1}^{T} \omega_t \, |p_{i-l+t} - p_{j-l+t}| \Big)^2 / \sigma^2 \right),   (13)

where l = (T+1)/2, T, the length of the segment, is assumed to be an odd number, and σ is the scaling factor, tuned as in [45]. ω_t, with \sum_{t=1}^{T} \omega_t = 1, is the weight of the tth sample, enabling different levels of importance to be assigned to different samples within a segment. Note that the exponent in a_{i,j} is the averaged weighted sample-by-sample Euclidean distance between Segments i and j. In the low-complexity UGSP-FIX method, we set ω_t = 1 for t = (T+1)/2 and zero otherwise. In the UGSP_OPT-FIX method, we optimize the ω_t's via a full-search algorithm for t = 1, . . . , T [51], which is feasible since T is kept small.
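A short sketch of the fixed-length adjacency construction, following our reconstruction in Equation (13); the default weights implement the UGSP-FIX choice (centre sample only), and the parameter values and function name are illustrative assumptions.

```python
import numpy as np

def fix_adjacency(x, T=7, sigma=100.0, weights=None):
    """Adjacency weights for fixed-length windows of T samples centred at
    each node: Gaussian kernel of the weighted sample-by-sample distance."""
    x = np.asarray(x, dtype=float)
    l = (T + 1) // 2
    if weights is None:                      # UGSP-FIX: centre sample only
        weights = np.zeros(T)
        weights[l - 1] = 1.0
    N = len(x)
    A = np.zeros((N, N))
    for i in range(l - 1, N - l + 1):
        for j in range(l - 1, N - l + 1):
            if i == j:
                continue
            wi = x[i - l + 1:i + l]          # T samples centred at i
            wj = x[j - l + 1:j + l]          # T samples centred at j
            d = np.sum(weights * np.abs(wi - wj))
            A[i, j] = np.exp(-d ** 2 / sigma ** 2)
    return A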
The remaining GSP-based classification steps are the same as in Method UGSP-DTW.

Cluster Labeling
After all data segments are grouped into clusters, the clusters need to be assigned correct labels, that is, appliances or disaggregated loads. For each cluster C_m, we first calculate the average length H_m of all segments within the cluster. If H_m is less than Thr_H, we set the corresponding cluster C_m as noise, belonging to the non-classified segments, labeled L_1. Thr_H is a heuristically set threshold used to identify noisy clusters with short data segments.
Next, we compare P_{C_m}, the mean of the active power values of all measurements within cluster C_m, with Thr_l and Thr_h, the lower- and upper-bound thresholds. Clusters with P_{C_m} larger than Thr_h or smaller than Thr_l are also labeled as L_1. For the remaining clusters, P_{C_m} is compared with P_{L_n} for n ∈ [1, K], where K is the total number of labels and, for each preset label L_n, P_{L_n} is the expected power value of the appliance labeled L_n, which can be estimated using manufacturer information or expert knowledge. A label L_n, n = 1, . . . , K, denotes either the case when an individual appliance is in the ON state with no other appliance in the ON state (P_{L_n} is then the estimated expected wattage of that appliance), or the case when a group of appliances is in the ON state simultaneously (P_{L_n} is then the sum of the estimated expected wattages of the group). For multi-state appliances, each state is treated as an individual appliance in the ON state.
Cluster m is assigned the label L_n for which |P_{C_m} − P_{L_n}| is minimal over all n ∈ [1, K]. Clusters for which |P_{C_m} − P_{L_n}| is larger than the preset threshold Thr_d are assigned to an unknown appliance, which is also labeled as L_1. The pseudocode of the cluster labeling method is shown in Algorithm 2, where C denotes the set of all clusters.
Algorithm 2: Cluster labeling.
After all clusters are assigned a label, the coherence of multi-state appliances is considered to further refine the labeling results. Any cluster with only a single observed state that was assigned to a multi-state appliance label (which cannot appear alone) is reassigned to the second-best single-state label, but only if, in Step 11 of Algorithm 2, |P_{C_m} − P_{L_r}| is the second-smallest difference and lies within the threshold Thr_d. The average duration H_m of cluster C_m is also used as additional information to help improve the labeling accuracy: if H_m is much longer (beyond a pre-defined time threshold) than the expected working duration of the corresponding appliance, the second-minimal difference in Step 11 is used to correct the label.
The parameters of the proposed method are set as follows: Thr_H = 5 is a heuristically set threshold to avoid picking up too-short clusters caused by noise; Thr_d = 200 W is the maximum difference allowed between the average power of a cluster and the expected power of a label; Thr_l is slightly larger (by 10 W) than the base-load; and Thr_h is set to double the maximum expected power of any appliance. The parameters are set to trade off performance and complexity; changing these parameters by 10-20% does not influence the accuracy of the results.
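Algorithm 2 is not reproduced above, so the following is only a hedged sketch of the labeling logic described in this subsection (noise rejection by segment length and power bounds, then assignment to the label with the nearest expected power within Thr_d); the data structures are ours, label 0 stands in for L_1, and the refinement rules for multi-state appliances are omitted.

```python
import numpy as np

def label_clusters(clusters, expected_power, thr_H=5, thr_l=60.0,
                   thr_h=6000.0, thr_d=200.0):
    """clusters: dict m -> list of segments (each a list of power values).
    expected_power: dict label index -> expected wattage P_{L_n}.
    Threshold values are illustrative only."""
    labels = {}
    for m, segments in clusters.items():
        H = np.mean([len(s) for s in segments])   # average segment length
        P = np.mean(np.concatenate([np.asarray(s, float) for s in segments]))
        if H < thr_H or P < thr_l or P > thr_h:
            labels[m] = 0                          # noise / unknown (L_1)
            continue
        n_best = min(expected_power, key=lambda n: abs(P - expected_power[n]))
        labels[m] = n_best if abs(P - expected_power[n_best]) <= thr_d else 0
    return labels
```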

Benchmarks
This section briefly describes three benchmarks. The first scheme is a natural extension of the proposed UGSP-DTW scheme to the semi-supervised learning setting. The second scheme uses an alternative clustering approach. The final scheme [37] shares the same GSP-based clustering method but a different segmentation approach, and is shown in [37] to be superior to state-of-the-art state-based and event-based unsupervised approaches.

Semi-Supervised GSP
A semi-supervised GSP approach, SGSP-DTW, shares the same pre-processing and data segmentation steps as the proposed UGSP-DTW method. After data segmentation (Section 3.2), graph-based binary classification, similar to He et al. [7], is performed, instead of clustering (see Section 3.3) with the adjacency matrix given by (9).

DBSCAN
Density-based spatial clustering of applications with noise (DBSCAN) is a commonly used data clustering algorithm that does not require the number of clusters to be specified and is usually applied to spatial point-set clustering. As in [45], DBSCAN is performed instead of GSP-based clustering as a benchmark. Graph nodes are regarded as points in the point-set data, and the DTW distance is used to represent the distance between the nodes. Note that DBSCAN relies heavily on two parameters that need to be pre-set: the neighborhood radius and the minimum number of neighboring samples required to form a dense region. As confirmed by the simulation results presented in the next section, DBSCAN is ineffective for appliances with a low frequency of usage.
We also include a comparison with the best unsupervised methods reported in [52], where DBSCAN is evaluated with a Euclidean distance measure. This helps us further understand the importance of the DTW distance metric. Furthermore, in [52], several unsupervised NILM pre-processing methods are evaluated with DBSCAN clustering, including median filtering, edge sharpening, graph bilateral filtering and their various combinations. We denote this method as DBSCAN-Euclidean method [52] and report the results only for the appliances and datasets considered in [52].

Event-Based GSP
We also compare the proposed method with the event-based unsupervised GSP-based NILM of Zhao et al. [25], but apply the pre-processing steps of Section 3.1 on the raw data.

Results and Discussion
In this section, we present our experimental results. First, we introduce the datasets and performance measures. Section 5.3 then evaluates the performance gain due to pre-processing. The following subsection compares the five segmentation and clustering methods introduced in Section 3.3. The last subsection compares the proposed DTW-based unsupervised GSP method with the benchmarks described in Section 4.

Datasets
We used two open-source datasets: REDD dataset that contains measurements at 1 Hz sampling rate [53] and REFIT dataset at 1/8 Hz sampling rate [54]. The REDD dataset contains data from US houses and is widely used for evaluation of various NILM approaches, containing few unknown appliances and relatively low "disaggregation noise" [10]. The REFIT dataset is more realistic, containing continuous recording of aggregate and submetered appliances over two years from UK houses as well as many unknown appliances and therefore a high level of "disaggregation noise" [10].
Both datasets contain a single time-series aggregate power signal (measured in Watts [W]). We used two houses from each dataset, selected to include typical appliances and various noise levels [10]. Appliance-level plug readings were used to determine ground truth.
In the results tables, we use BGFI to label bathroom GFI, DW dishwasher, R refrigerator, KO kitchen outlet, MW microwave, WD wash dryer, S stove, F-F fridge freezer, WM washing machine, T toaster, and K kettle. We note that, besides these appliances, each house contains a number of unknown appliances that are considered to be noise. Generally, REFIT houses are "noisier", in the sense of containing many unknown appliances. All results are provided for one month worth of data.

Performance Metrics
We used the F-measure (F_M) over all samples to evaluate NILM classification performance, defined as: PR = TP/(TP + FP), RE = TP/(TP + FN), F_M = 2 · (PR · RE)/(PR + RE), where TP (true positives) is the number of samples for which the NILM method correctly decided that the appliance was on, FP (false positives) counts samples for which the NILM method estimates the appliance to be on while it is actually off, and FN (false negatives) counts samples that are detected as off by the NILM algorithm while the appliance is in fact on. PR (precision) captures the correctness of detection: the higher the PR, the fewer the FPs. On the other hand, a high RE (recall) means a low number of FNs, which implies that a higher percentage of appliance ON states are detected correctly. F_M ∈ [0, 1] balances PR and RE.
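A small helper computing these metrics from per-sample ON/OFF decisions (the function and argument names are ours):

```python
def f_measure(est_on, true_on):
    """Precision, recall, and F-measure over all samples, where est_on and
    true_on are boolean ON/OFF sequences of equal length."""
    tp = sum(e and t for e, t in zip(est_on, true_on))
    fp = sum(e and not t for e, t in zip(est_on, true_on))
    fn = sum(t and not e for e, t in zip(est_on, true_on))
    pr = tp / (tp + fp) if tp + fp else 0.0
    re = tp / (tp + fn) if tp + fn else 0.0
    return 2 * pr * re / (pr + re) if pr + re else 0.0
```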

Pre-Processing Gain Evaluation
The proposed pre-processing (Section 3.1) introduces additional complexity and delay. Hence, in this subsection, we compare the proposed schemes with and without pre-processing to evaluate the gain obtained via the proposed pre-filtering. We use R-SGSP-DTW and R-UGSP-DTW (where R stands for raw measurements) to denote the DTW-based semi-supervised and unsupervised GSP methods, respectively, applied to the raw measurements without pre-processing. Table 1 shows the results for REFIT House 2. For both the semi-supervised and the proposed unsupervised load disaggregation methods, pre-processing provides a significant improvement of over 0.3 in F-measure for all appliances. The gain is smallest for KO due to the relatively low noise level for this appliance. Due to the clear performance gains of pre-processing, in the following sections we present the results of the proposed scheme and the benchmarks, all with the pre-processing method of Section 3.1 (see [37] for a detailed study of various pre-processing methods for NILM, including graph bilateral filtering).
Table 1. F-measure for two GSP-based methods with and without pre-processing applied to REFIT House 2.

Comparison between Segmentation Methods
To quantify the gains from the proposed data segmentation, in this section we compare the five proposed methods, UGSP-DTW, UGSP-FIX, UGSP_OPT-FIX, UGSP_PCA-FIX, and UGSP_GPCA-FIX, introduced in Section 3.3.
The parameters of the proposed method are set as follows: Thr_H = 5 W is the threshold used to isolate noise clusters with short data segments, since we observe that 5 W and lower generally represents appliances on stand-by, and not in active use, i.e., in the ON state; Thr_d = 200 W is the maximum distance from the average power of a cluster to the state value to be associated with that cluster label; Thr_l is set slightly above the estimated base-load; and Thr_h is set to twice the maximum power of any appliance in the house. Table 2 shows the F-measure results of the five segmentation approaches for House 1 in the REDD dataset. For the fixed-segment methods, T = 7, so that six neighboring samples are included. First, comparing UGSP_OPT-FIX and UGSP-FIX, we can see that optimizing the weights provides negligible performance gain across all six appliances. Secondly, exploiting correlation and reducing dimensionality via the PCA- and GPCA-based methods provides an additional performance gain over both fixed-segmentation methods. However, the PCA- and GPCA-based methods often do not capture well the structure of the raw data, leading to a large difference between two data segments with the same label. Finally, there is a noticeable and significant performance gain of the proposed variable-length segmentation UGSP-DTW over both the fixed-segmentation methods and the PCA- and GPCA-based methods, indicating that DTW preserves key information in the segments that PCA and GPCA remove during the transformation.

Comparison against NILM Benchmarks
In this section, we evaluate the robustness of the proposed UGSP-DTW NILM method against the benchmark algorithms described in Section 4, namely SGSP-DTW (the DTW-based semi-supervised GSP method, Section 4.1), UGSP_PCA-FIX (the PCA-based unsupervised GSP method, Section 3.3.2), UGSP_GPCA-FIX (the GPCA-based unsupervised GSP method, Section 3.3.2), DBSCAN-DTW (the DTW-based DBSCAN method, Section 4.2), and the UGSP method of Zhao et al. [25]. The results are given in Tables 3 and 4 for the two REDD houses, and in Tables 5 and 6 for the two houses from the REFIT dataset.
Tables 3-6 consistently show that the proposed UGSP-DTW method provides the best disaggregation accuracy for most appliances in all four houses. For the two REDD houses, which generally have lower disaggregation noise [10], the proposed method is superior to the other methods for all appliances except R in House 2, where DBSCAN shows the best results. DBSCAN is a density-based clustering method that is highly dependent on its preset parameters and on the size of each cluster. Thus, for appliances with a low frequency of use, such as the kettle, dishwasher, and microwave, DBSCAN is not effective, whereas for appliances such as the refrigerator, which is always on, DBSCAN shows high performance, as expected. We can also see that the best-performing DBSCAN-Euclidean method of Khazaei et al. [52] performs generally worse than the DBSCAN-DTW-based approach, demonstrating the improvement due to the DTW distance measure.
Table 6. Comparison of F-measure for REFIT House 6.
While it may be expected that the semi-supervised SGSP-DTW outperforms all unsupervised methods, the results in the tables show that it is outperformed by the proposed UGSP-DTW and by DBSCAN-DTW. This is due to the post-processing cluster labeling step of the proposed Algorithm 2, used in both UGSP-DTW and DBSCAN-DTW. The unsupervised state-of-the-art methods of Zhao et al. [25] and Khazaei et al. [52], which rely on a fixed Euclidean distance, perform worse than the proposed method, except for a slight improvement for DW in REFIT House 2.
Overall, DTW-based approaches outperform other methods, demonstrating the value of the proposed variable-length segmentation and DTW distance measure.

Conclusions and Future Work
In this paper, we propose a DTW-based unsupervised GSP method for NILM. The proposed method is based on pre-processing and segmenting smart meter data and then clustering the segments using GSP-based clustering. We demonstrate the value of variable-length segmentation over fixed-length and transformation-based segmentation, and the value of the proposed cluster labeling method, through rigorous experimentation on four houses from the REDD and REFIT datasets and comparison with four state-of-the-art benchmarks.
Future work includes providing an efficient real-time implementation integrating the approach into smart home decision support systems for demand response as well as designing advanced energy feedback mechanisms. Another interesting area of research is assessing the quality of the NILM output without relying on ground-truth. Future work also includes enlarging the set of features used for power disaggregation. These features could be reactive power, voltage/current, and other measurements, such as date, time, and weather.