Electricity Pattern Analysis by Clustering Domestic Load Profiles Using Discrete Wavelet Transform

Abstract: Energy demand has grown explosively in recent years, leading to increased attention to energy efficiency (EE) research. Demand response (DR) programs were designed to help power management entities meet energy balance and change end-user electricity usage. Advanced real-time meters (RTM) collect large amounts of fine-granular electricity consumption data, which contain valuable information. Understanding the energy consumption patterns of different end users can support demand side management (DSM). This study proposes clustering algorithms to segment consumers and obtain representative load patterns based on diurnal load profiles. First, the proposed method uses the discrete wavelet transform (DWT) to extract features from daily electricity consumption data. Second, the extracted features are reconstructed using a statistical method, combined with Pearson's correlation coefficient and principal component analysis (PCA), for dimensionality reduction. Lastly, three clustering algorithms are employed to segment the daily load curves, and the most appropriate algorithm is selected. We evaluated our method on the Manhattan dataset, and the results indicate that clustering algorithms combined with the discrete wavelet transform improve clustering performance. Additionally, we discuss the clustering results and the load pattern analysis of the dataset with respect to electricity usage patterns.


Introduction
Smart grid technologies and applications capable of adaptive, resilient, and sustainable self-healing, with foresight for prediction under different uncertainties, improve the reliability of the power system [1]. Furthermore, the smart grid allows bidirectional communication that supports demand response (DR) programs [2]. Demand response technologies are widely applied and are constantly improving. The most common DR programs can be categorized into two classes: price-based programs and incentive-based programs. Price-based programs include time of use (ToU), real time pricing (RTP), and critical peak pricing (CPP), which aim to motivate end-users to change their consumption behavior [3]. On the other hand, incentive-based programs reach a consensus with consumers to reduce electricity consumption. Examples of these schemes are direct-load control (DLC), interruptible/curtailable service (I/C), demand bidding/buyback (DB), etc. [4]. The variety of end-user consumption behaviors requires utility companies to design reasonable strategies. Therefore, it is necessary to analyze end-users' consumption data to acquire the load patterns.
Advanced metering infrastructure (AMI) and smart meters have been adopted to automatically collect energy consumption data at fine-granular intervals, usually 1 h, 30 min, or even 30 s [5]. Most countries have vigorously deployed smart meters because of the potential value of consumption data [6]. The massive amount of data sampled by smart meters can be used for research, typically load forecasting, customer segmentation, pricing/incentive mechanisms, and scheduling and control [7]. However, the collected load consumption data lack labels, hence the need for clustering techniques to segment the electricity consumption data. In addition, with high-time-resolution smart meters installed in households, the massive data volume increases the complexity of clustering methods, a problem known as the "curse of dimensionality" [8]. This is an obstacle to implementing clustering algorithms, because most of them become intractable on high-dimensional input. To deal with the curse of dimensionality, the load consumption data need preprocessing, i.e., dimensionality reduction.
This study proposes a clustering approach to segment residential customers' daily power data, using the discrete wavelet transform to extract features and reducing dimensionality with statistical methods and principal component analysis (PCA). The dataset, named the Multifamily Residential Electricity Dataset (MFRED), contains 10-s resolution daily power data for 26 apartment groups, collected over 365 days in Manhattan, New York, in 2019 [9]. First, data cleansing and a multi-level one-dimensional (1D) discrete wavelet transform were applied to the 8640-value daily load curves. Second, we reduced the dimensionality of the extracted features. Finally, clustering algorithms were implemented and evaluated. The main contributions of this work are: (1) a proposed method to vastly reduce the daily load profile dimensionality and thereby accelerate clustering, and (2) three cluster validity indices (CVI) showing that our feature extraction method outperforms clustering on the original data, especially for hierarchical clustering.
The paper is structured as follows: Section 2 briefly discusses the related works. Section 3 describes the MFRED data. Section 4 explains the methodology in the study. Analysis and results are presented in Section 5, with conclusions in Section 6.

Related Works
Clustering is an unsupervised learning technique that groups similar data without attached labels [10]. Clustering algorithms can be classified into partitioning algorithms, hierarchical algorithms, density-based algorithms, and grid-based algorithms [11]. The authors of [12] implemented an improved K-means clustering method on load curves and verified that it performed better than the original K-means algorithm. The authors in [13] used modified fuzzy c-means (FCM) to extract representative load profiles of the customers. Ordering points to identify the clustering structure (OPTICS) is one of the density-based clustering models, used to analyze consumer bid-offers in [14]. Gaussian mixture model (GMM) clustering is widely used to segment households' load profiles for demand response [15].
Additionally, most clustering algorithms cannot properly process high-dimensional data [16]. Most of the aforementioned works extracted consumption load patterns from hourly, 30-min, or 15-min load data. However, advanced high-frequency smart meters can record load data at intervals of 1 min, 30 s, and even 1 s, producing large-scale consumption data that increase computational complexity. Most clustering algorithms assign points to clusters based on distance calculations, so high-dimensional data increase the cost of each iteration, resulting in longer run times. Hence, there are numerous studies on dimensionality reduction for load curve clustering, using feature extraction, feature construction, and feature selection. In [17], the authors developed electricity price schemes based on demand patterns, using k-means combined with PCA. In [18], the authors proposed singular value decomposition to extract features before k-means clustering and evaluated the error sum of squares (SSE) index against direct clustering. In [19], the authors used a fused load curve k-means algorithm, based on the "Haar" discrete wavelet transform for dimensionality reduction, to obtain the load patterns of consumers from China and the United States, evaluating clustering performance with four CVI [20]. Xiao et al. [21] proposed a fusion clustering algorithm to obtain consumption characteristics, using load curve clustering based on the discrete wavelet transform (CC-DWT). In this study, we implemented clustering to segment 10-s interval daily electricity consumption data, using a multi-level discrete wavelet transform, the Pearson correlation coefficient, and PCA to preprocess the daily load profiles. The clustering evaluation shows that our proposed method outperformed conventional approaches that do not reduce dimensionality.

Data
In this study, we used the Multifamily Residential Electricity Dataset (MFRED) [9], which covers 390 apartments from 1 January to 31 December 2019. This dataset was collected by real-time metering and contains 246 million data points from residential buildings in Manhattan, New York, USA. The resolution is one sample per 10 s, providing 8640 data points in each daily profile. During the one-year period, some meters went offline for various reasons, so some electricity data were not recorded in MFRED.
In MFRED, the percentages of building stock built prior to 1940, between 1940 and 1980, and after 1980 were 79%, 7%, and 14%, respectively. The corresponding ratios for the entire Manhattan building stock were 86%, 6%, and 8%, which means the residential structure in our research is very similar to that of the whole of Manhattan. In addition, to protect privacy, the 390 apartments' data were aggregated into 26 groups, called apartment groups (AG), so that each AG averages over 15 apartments and is therefore more representative. Hence, the dataset records the average real power (kW), reactive power (kVAR), and consumption (kWh) over 15 apartments, for 26 apartment groups, every 10 s for 365 days. Here, we used the real power channel for our research. Figure 1 shows the distribution of daily energy consumption, and the black dashed line represents the mean electricity consumption (8.21 kWh).

Methodology
Our proposed method consists of four major stages: data cleansing, feature extraction, dimensionality reduction, and clustering. Daily real power data are obtained from MFRED, and the data are cleansed of missing values. A multi-level discrete wavelet transform is then applied to extract features. In the dimensionality reduction stage, we implement two methods to decrease the dimension: a statistical method combined with Pearson correlation, and PCA. Finally, clustering algorithms are applied to segment the daily load curves using the selected features. The proposed method is shown in Figure 2.


Data Cleansing
Real and reactive power data were recorded in MFRED; the real power data are retained for clustering. The primary issues with the real power data are missing and anomalous values. Missing values are filled by averaging the previous and following 10-s values. However, tens of thousands of consecutive samples were lost because all meters broke down for an extended period on 9 July 2019, from 14:30 to 21:30 UTC. Therefore, this day was excluded due to the large amount of missing data. Anomalous values may be caused by the data collection accuracy of the real-time meters (RTM) and were detected using the five-number summary: the minimum, the maximum, the sample median, and the first and third quartiles. Each single outlier was replaced by the average of the maximum and itself. After data cleansing, the reconstructed subset consisted of 8640 ten-second interval real power values (kW) for 364 days and 26 AGs. Thus, the resulting input data matrix has dimension 9464 × 8640 (364 days × 26 AGs). Figure 3 illustrates the 8640-value diurnal load curves from different AGs.
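The single-gap filling step described above can be sketched in pandas. This is an illustrative sketch, not the authors' code; the toy series stands in for one AG's 10-s real-power trace, and for an isolated gap linear interpolation is equivalent to averaging the previous and next values.

```python
import numpy as np
import pandas as pd

def fill_gaps(series: pd.Series) -> pd.Series:
    """Fill isolated missing 10-s samples; for a single-point gap this
    equals the mean of the previous and next observed values."""
    return series.interpolate(method="linear", limit_area="inside")

# Toy 10-s real-power trace (kW) with one missing sample.
power = pd.Series([0.40, 0.42, np.nan, 0.46, 0.44])
cleaned = fill_gaps(power)
print(round(cleaned.iloc[2], 2))  # 0.44, the mean of 0.42 and 0.46
```

`limit_area="inside"` ensures only interior gaps are filled, so leading or trailing missing samples (e.g., from a meter that was offline at midnight) are left for separate handling.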


Discrete Wavelet Transform
Wavelet transforms comprise the continuous wavelet transform (CWT) and the discrete wavelet transform (DWT). The discrete wavelet transform is widely used in waveform processing, including feature extraction in electroencephalography (EEG) [22], electromyography (EMG) [23], time-series load curves [24,25], etc. The DWT decomposes a signal into various sets by passing it through a low-pass filter and a high-pass filter. The dyadic wavelet basis and the DWT coefficients are given by Equations (1) and (2), respectively, as follows:

ψ_{j,k}(n) = 2^{j/2} ψ(2^j n − k) (1)

c_{j,k} = Σ_n x(n) ψ_{j,k}(n) (2)

where k is the signal (translation) index and j is the scale index. The detailed coefficients are obtained from the high-pass filter, while the approximation coefficients are extracted from the low-pass filter and can be further decomposed through another high-pass and low-pass filter pair. Figure 4 shows the decomposition of the 3-level 1-D discrete wavelet transform used in our research.

To extract features from the daily load curves, we implemented the three-level 1-D Daubechies 4 (db4) discrete wavelet transform. Three-level means the one-level 1-D DWT is repeated three times, each time on the previous approximation coefficients. The Daubechies wavelet is preferred for feature extraction over the Haar wavelet, which is the special case of Daubechies denoted db1. The Haar wavelet is the simplest and earliest wavelet transform, decomposing discrete data with a two-tap filter. The filter length of eight in the db4 wavelet captures more detail, at slightly higher computational cost [19]. Thus, we employed db4 to compute the detailed and approximation coefficients. The three detailed coefficient sets and the one approximation coefficient set are denoted cD1, cD2, cD3, and cA3, respectively. Figure 5 shows the four components of the daily load curve under the three-level 1-D db4 discrete wavelet transform. The cA3 coefficient curve reflects variation similar to the original load curve, while the cD3, cD2, and cD1 components are very close to 0 and contain the detailed information of the daily load curve. For each daily power curve, the numbers of detailed coefficients (cD1, cD2, cD3) and approximation coefficients (cA3) were 4323, 2165, 1086, and 1086, respectively.
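The decomposition above can be reproduced with the PyWavelets package (an assumption; the paper does not name its software). The simulated load curve below is a stand-in for a real 8640-point MFRED daily profile; the coefficient lengths match those reported in the text.

```python
import numpy as np
import pywt

# Simulated 8640-point daily load curve (10-s sampling over 24 h).
rng = np.random.default_rng(0)
t = np.linspace(0, 24, 8640)
load = 0.5 + 0.3 * np.sin(2 * np.pi * t / 24) + 0.05 * rng.standard_normal(8640)

# Three-level 1-D db4 decomposition returns [cA3, cD3, cD2, cD1].
cA3, cD3, cD2, cD1 = pywt.wavedec(load, "db4", level=3)
print(len(cD1), len(cD2), len(cD3), len(cA3))  # 4323 2165 1086 1086
```

With the default symmetric signal extension, each level roughly halves the length while adding (filter length − 1)/2 boundary samples, which is why 8640 points yield 4323, 2165, and 1086 coefficients at successive levels.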

Dimensionality Reduction
This phase aims to reduce the dimensionality of the extracted features (cA3, cD3, cD2, cD1). First, we used the statistical method to obtain the statistical variables (mean, std, min, 25%, 50%, 75%, max) from each daily coefficient set (cA3_mean, cA3_std, cA3_min, cA3_25%, cA3_50%, cA3_75%, cA3_max, cD3_mean, etc.). This gave 28 features extracted from the approximation and detailed coefficients. Second, we calculated Pearson's correlation coefficient, which measures the correlation between each pair of features. The correlation coefficient values lie between −1 and 1; a value close to 1 represents a high positive correlation, while a value close to −1 represents a high negative correlation [26]. Highly correlated features can be replaced by other features with similar characteristics. The correlation coefficient is calculated from Equation (3), as follows:

r = Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ) / sqrt( Σ_{i=1}^{n} (X_i − X̄)² · Σ_{i=1}^{n} (Y_i − Ȳ)² ) (3)

where n is the number of samples, X_i and Y_i are the data values, and X̄ and Ȳ are the mean values of X and Y. The correlation heatmap that represents the coefficient matrix is shown in Figure 6. According to the correlation heatmap, coefficients close to 1 or −1 imply redundant features. To reduce the dimension, we removed one feature from each pair whose absolute correlation coefficient exceeded 0.95. Figure 7 shows the correlation heatmap after eliminating the highly correlated features.
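The pairwise-correlation filter can be sketched as follows. This is an illustrative sketch on a toy feature table (the column names are hypothetical, not the paper's 28 DWT statistics); it drops one feature from every pair whose absolute Pearson correlation exceeds 0.95, as in the text.

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one feature from each pair with |Pearson r| above threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Toy feature table: 'b' is an affine copy of 'a', so it is redundant.
rng = np.random.default_rng(1)
a = rng.standard_normal(100)
features = pd.DataFrame({"a": a, "b": 2 * a + 0.001,
                         "c": rng.standard_normal(100)})
reduced = drop_correlated(features)
print(list(reduced.columns))  # ['a', 'c']
```

Applied to the 28 statistical features, this kind of filter yields the 16 features the paper carries forward to PCA.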

PCA is one of the most widely used techniques for dimensionality reduction of large data sets [27]. Given high-dimensional data, PCA maps the data into k dimensions (k < the original dimension) using mutually uncorrelated principal components while preserving most of the original information. The data are z-score normalized in the PCA process, and the feature vector of principal components is obtained from the covariance matrix, its eigenvectors, and its eigenvalues. In our study, we applied PCA to reduce the dimensionality to 3 components while still preserving 99 percent of the variability. Thus, the input dimension is reduced from 8640 to 28 using the DWT combined with the statistical method, and to 16 after correlation analysis. Finally, applying the PCA transform yields 3 component features.
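The z-score normalization followed by a 3-component PCA can be sketched with scikit-learn. The 16-column matrix below is synthetic (three latent factors plus small noise), standing in for the post-correlation feature table; it is not the MFRED data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic 16-feature matrix driven by 3 underlying factors.
rng = np.random.default_rng(2)
latent = rng.standard_normal((500, 3))
mixing = rng.standard_normal((3, 16))
X = latent @ mixing + 0.01 * rng.standard_normal((500, 16))

# z-score normalize, then project onto 3 principal components.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_std)
print(X_pca.shape, pca.explained_variance_ratio_.sum())
```

Because the synthetic data truly have three factors, the retained variance here is close to 100%; on real features one would inspect `explained_variance_ratio_` to confirm the 99% level the paper reports.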

Clustering Method
Clustering load curves into groups is essential to identify load patterns [28]. For comparison purposes, we implemented the following three different clustering methods: k-means, hierarchical, and fuzzy c-means clustering.

K-Means Clustering
K-means is the most popular hard clustering algorithm; its goal is to partition n data points into k clusters, with the data in each group close to its centroid [29]. The K-means method proceeds as follows. First, determine the cluster number K and initialize K centroids. Second, allocate each sample to the nearest centroid according to distance. Third, compute the new K centroids as the means of the points in each cluster. Then, repeat the second and third steps until the centroids no longer change.
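The steps above can be sketched with scikit-learn's `KMeans`, which implements exactly this assign-and-update loop. The three well-separated blobs are synthetic stand-ins for the 3-D PCA features, not the paper's data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated 3-D blobs standing in for the PCA features.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 3))
               for c in ([0, 0, 0], [5, 5, 5], [-5, 5, 0])])

# n_init restarts guard against poor random centroid initialization.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(len(np.unique(km.labels_)))  # 3
```

`km.cluster_centers_` holds the final centroids, i.e., the representative load patterns once the input is real load-curve features.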
Determining the number of clusters is one of the major challenges in clustering. The elbow method finds an appropriate number of clusters by computing a score over a range of values of K [30]. In our study, we determined this parameter by analyzing two metrics: the distortion and the Calinski-Harabasz score. The distortion score computes the within-cluster sum of squares (WCSS) to select K [31] and decreases as K increases. It is computed using Equation (4), as follows:

Distortion = Σ_{h=1}^{K} Σ_{x_i ∈ c_h} ‖x_i − µ_h‖² (4)

where K is the number of clusters, c_h is cluster h, µ_h is the hth cluster center, and ‖x_i − µ_h‖² is the squared Euclidean distance between data point x_i and its centroid µ_h.
We applied the Yellowbrick package to visualize the elbow method [32]. Figure 8 illustrates the WCSS value for different K. Applying the elbow method for 1 ≤ K ≤ 10, the distortion score drops rapidly as K increases up to K = 3 and decreases only gradually thereafter. We also employed the Calinski-Harabasz analysis method, which calculates the ratio of between-cluster dispersion to within-cluster dispersion over all clusters, as follows:

CH = [ Σ_{h=1}^{K} n_h ‖c_h − c‖² / (K − 1) ] / [ Σ_{h=1}^{K} Σ_{x ∈ c_h} ‖x − c_h‖² / (N − K) ] (5)

where N is the total number of data points, K is the number of clusters, n_h and c_h are the number of points and the centroid of the hth cluster, respectively, and c is the centroid of all data points. A higher value of Σ_{h=1}^{K} n_h ‖c_h − c‖² means the cluster centroids are well separated, while a lower value of Σ_{h=1}^{K} Σ_{x ∈ c_h} ‖x − c_h‖² indicates that the points of each cluster are well centered. Therefore, the larger the CH index, the more distinct the clusters. Figure 9 shows the scores as K varies, with the maximum at K = 3. Combining the distortion and Calinski-Harabasz scores confirms that K = 3 is the optimal choice.
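The model-selection loop behind Figures 8 and 9 can be sketched as follows, again on synthetic three-blob data rather than the paper's features. The distortion is available as `inertia_` on a fitted `KMeans`; here the Calinski-Harabasz score is used to pick K.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 3))
               for c in ([0, 0, 0], [5, 5, 5], [-5, 5, 0])])

scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # km.inertia_ is the distortion (WCSS) used by the elbow method.
    scores[k] = calinski_harabasz_score(X, km.labels_)

best_k = max(scores, key=scores.get)  # K with the largest CH index
print(best_k)  # 3
```

On data with three genuine groups the CH curve peaks at K = 3, mirroring the selection the paper makes from Figure 9.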

Hierarchical Clustering
Hierarchical clustering algorithms are formed by iteratively dividing the groups using bottom-up or top-down methods, called agglomerative and divisive hierarchical clustering, respectively [33]. In this study, we employed agglomerative hierarchical clustering to segment load curves based on the preprocessed features. Agglomerative clustering builds up clusters starting from each object as a single cluster and then uses a distance metric to merge the two most similar clusters [34], repeating until all objects are merged into a single cluster. We used "Ward" linkage to compute the distance between the new cluster and the remaining clusters, minimizing the variance of the merged clusters [35]. The Ward linkage criterion can be expressed as follows:

d(X_i, X_j) = (n_i n_j / (n_i + n_j)) ‖c(X_i) − c(X_j)‖² (6)

where c(X_i) is the centroid of cluster i and n_i denotes the number of points in cluster i.
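Agglomerative clustering with Ward linkage is available directly in scikit-learn; the sketch below applies it to the same kind of synthetic three-blob data used above (a stand-in for the 3-D PCA features).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 3))
               for c in ([0, 0, 0], [5, 5, 5], [-5, 5, 0])])

# Bottom-up merging; Ward linkage merges the pair whose union
# increases the within-cluster variance the least.
hc = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)
print(len(np.unique(hc.labels_)))  # 3
```

Unlike k-means, no centroids are initialized, so the result is deterministic for a given dataset; the price is the higher cost of computing pairwise merges.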


Figure 10 depicts the Ward-linkage truncated dendrogram, which presents a tree structure to visualize the clusters and the number of samples belonging to each cluster. Ward's method dendrogram displays the clustering structure of the data. The numerical values in Figure 10 are the distances between different cluster centers, calculated by Equation (6). The black dashed line represents the distance threshold, which is 50. In addition, we combined the Calinski-Harabasz index with the dendrogram to determine the optimal number of clusters (Table 1). According to the results, when K changes from 2 to 3, the Calinski-Harabasz index increases rapidly, and it increases only gradually thereafter. The Calinski-Harabasz index and the dendrogram both indicate that three is the optimal value of K.
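The dendrogram cut at a distance threshold can be sketched with SciPy. This is illustrative: the blob data and the margin of the threshold are synthetic, but cutting the Ward tree at a fixed height (the dashed line in Figure 10) is exactly what `fcluster` with the `distance` criterion does.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 3))
               for c in ([0, 0, 0], [5, 5, 5], [-5, 5, 0])])

# Full Ward-linkage merge tree; Z also feeds scipy's dendrogram().
Z = linkage(X, method="ward")

# Cut the tree at a distance threshold, like the dashed line at 50.
labels = fcluster(Z, t=50, criterion="distance")
print(len(np.unique(labels)))  # 3
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` with `truncate_mode="lastp"` reproduces a truncated tree view like Figure 10, with merge heights annotated.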


Fuzzy c-Means Clustering
The fuzzy c-means (FCM) algorithm is one of the soft clustering algorithms, also known as "soft K-means," in which each data object can belong to multiple clusters. The fuzzy c-means algorithm has been widely used in many applications, such as consumer behavior and market segmentation [36]. FCM aims to minimize the following objective function:

J_m = Σ_{i=1}^{N} Σ_{j=1}^{C} u_ij^m ‖x_i − c_j‖², (7)

where m is the fuzziness parameter in the range [1, +∞), u_ij is the degree of membership of x_i in cluster j, and c_j is the centroid of cluster j. The membership degrees and cluster centers are updated iteratively until the change in the membership matrix is smaller than the error tolerance. The cluster center c_j and membership degree u_ij are obtained as follows:

c_j = Σ_{i=1}^{N} u_ij^m x_i / Σ_{i=1}^{N} u_ij^m, (8)

u_ij = 1 / Σ_{k=1}^{C} (‖x_i − c_j‖ / ‖x_i − c_k‖)^{2/(m−1)}. (9)

The algorithm comprises the following steps:
Step 1: Determine the number of clusters, the fuzziness parameter m, and the error tolerance ε.
Step 2: Initialize the membership matrix U^(0).
Step 3: At step k, compute the centroids c_j^(k) with Equation (8).
Step 4: Update the membership matrix U^(k+1) with Equation (9).
Step 5: If ‖U^(k+1) − U^(k)‖ < ε, stop; otherwise, return to Step 3.
The main advantages of FCM are its suitability for overlapping data, its scalability, its simplicity, and its accuracy. However, the time complexity of fuzzy c-means is higher than that of k-means. In our study, we selected the fuzziness index m and the error tolerance ε by grid search. The optimal fuzziness index was determined as m = 1.25, and the error as ε = 1 × 10⁻⁵. Figure 11 shows the clustering result based on the three principal components. The points in Cluster 1 and Cluster 2 are relatively compact, whereas Cluster 3 is more dispersed.
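Steps 1 through 5 can be sketched as a compact NumPy implementation. This is an illustrative sketch under the paper's stated settings (m = 1.25, ε = 1 × 10⁻⁵), not the authors' implementation; the toy data and function name are assumptions.

```python
# Minimal fuzzy c-means iteration (Steps 1-5) in NumPy.
import numpy as np

def fuzzy_c_means(X, n_clusters, m=1.25, eps=1e-5, max_iter=300, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Step 2: random membership matrix whose rows sum to 1.
    U = rng.random((n, n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        Um = U ** m
        # Step 3: centroids (Equation (8)).
        C = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Step 4: new memberships (Equation (9)).
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)  # guard against division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        # Step 5: stop once the membership matrix has converged.
        if np.linalg.norm(U_new - U) < eps:
            return C, U_new
        U = U_new
    return C, U

# Toy usage: two clearly separated 1-D groups.
X = np.array([[0.1], [0.2], [0.15], [5.0], [5.2], [4.9]])
C, U = fuzzy_c_means(X, n_clusters=2)
labels = U.argmax(axis=1)  # hard labels from the soft memberships
print(labels)
```

Taking the arg-max of each membership row converts the soft assignment into hard cluster labels, which is how the cluster proportions reported later can be computed.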

Experimental Results and Analysis
In the clustering phase, we employed three different clustering algorithms to segment the daily load curves. Considering the diversity of clustering performance evaluations, we selected three internal clustering criteria for validation, namely the silhouette coefficient, the Calinski-Harabasz index, and the Davies-Bouldin index (DBI) [37]. The Calinski-Harabasz index was described in Section 4. The silhouette coefficient combines cohesion and separation. Cohesion indicates the similarity of points within the same cluster; separation, in contrast, indicates how dissimilar an object is to the objects in other clusters. Specifically, the silhouette coefficient of object i is calculated as follows:

s(i) = (b(i) − a(i)) / max{a(i), b(i)},

where a(i) (cohesion) is the mean distance between sample i and all other points in the same cluster, and b(i) is the minimum of the mean distances between object i and all objects in each of the other clusters, i.e.,

a(i) = (1 / (|C_i| − 1)) Σ_{j ∈ C_i, j≠i} d(i, j),  b(i) = min_{k≠i} (1 / |C_k|) Σ_{j ∈ C_k} d(i, j).

The silhouette value lies in the range [−1, 1]. A value close to 1 means that the model is suitable, whereas a negative value indicates incorrect clustering; higher values of the silhouette coefficient imply that the model clustered the data well. The Davies-Bouldin index measures the average similarity between clusters, where the similarity compares the distance between clusters with the size of the clusters themselves. For a given set of clusters C = {c_1, c_2, . . . , c_k}, where c_j is the cluster most similar to c_i, the Davies-Bouldin index is defined as follows:

DBI = (1/k) Σ_{i=1}^{k} max_{j≠i} (s_i + s_j) / d_ij,

where k is the number of clusters, s_i is the average distance between all objects in cluster i and the centroid of cluster i, and d_ij is the distance between the ith and jth cluster centroids. A smaller value of the Davies-Bouldin index implies that the clusters are properly separated. We compared our proposed method with the original clustering algorithms applied without dimensionality reduction.
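All three internal validity indices are available in scikit-learn, so a validation pass like the one described above can be sketched as follows (synthetic data stands in for the Manhattan load curves; the clustering call is illustrative):

```python
# Compute the three internal clustering validity indices on a k-means result.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

rng = np.random.default_rng(0)
# Three well-separated synthetic groups in 3-D feature space.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(80, 3)) for c in (0, 2, 4)])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)        # in [-1, 1], higher is better
ch = calinski_harabasz_score(X, labels)  # higher is better
dbi = davies_bouldin_score(X, labels)    # lower is better
print(round(sil, 3), round(ch, 1), round(dbi, 3))
```

Because the three indices reward different properties (cohesion/separation, variance ratio, inter-cluster similarity), reporting all three, as Table 2 does, gives a more balanced comparison than any single score.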
Table 2 compares the three clustering results by means of the calculated cluster validity indices. Clustering method names that include 'Original' denote the daily load data without dimensionality reduction. N denotes that the daily load data were normalized by min-max normalization to rescale the data to the range of 0 to 1. Generally, normalizing the data before clustering removes the scale differences between variables, so that no single variable dominates the distance computation. Equation (14) presents the formula for min-max normalization:

x' = (x − x_min) / (x_max − x_min). (14)

According to the evaluation indices, our wavelet-based preprocessing method slightly improves the clustering performance compared with the original method. In particular, the performance of wavelet-based hierarchical clustering is better than that of hierarchical clustering without dimensionality reduction. Comparing the three wavelet-based clustering algorithms, k-means and FCM performed similarly, but the silhouette coefficient and Davies-Bouldin index of FCM were better than those of k-means. For hierarchical clustering, the silhouette coefficient is the best, but the other two indices are worse than those of k-means and FCM. In addition, the proposed method significantly reduces the computation time through dimensionality reduction. Based on this comparison, we adopted the wavelet-based fuzzy c-means method. Of the three clusters, the first, second, and third represent 66.54%, 26.84%, and 6.62% of the daily load curves, respectively. Figure 12 shows the load patterns of the three clusters and their daily load curves. Clusters 1 and 3 represent the lowest and highest power consumption, respectively. In each cluster figure, the bold red line represents the representative load pattern, while the other curves represent the daily power usage within the cluster. Cluster 1 contains 6297 daily load curves with stable power consumption; its average power usage and average peak power were 0.187 kW and 0.438 kW, respectively.
Cluster 2 contains 2540 daily load curves; its average power usage and average peak power were 0.517 kW and 1.056 kW, respectively. Cluster 3 is composed of 627 daily load curves; it is the highest power usage group and has the highest variability, with an average power usage of 1.212 kW and an average peak power of 2.209 kW. Figure 13 illustrates the box-and-whisker plot of the average daily power usage of the three clusters. A boxplot presents the data distribution through a five-number summary: minimum, first quartile, median, third quartile, and maximum. There are some outliers (the data points in Figure 13) in Cluster 2, while in Cluster 3 many outliers fall beyond the maximum value. As the power usage increases from Cluster 1 to Cluster 3, the variation in power also increases: the standard deviation is 0.0829 kW for Cluster 1, 0.1329 kW for Cluster 2, and 0.3193 kW for Cluster 3. Figure 14 shows the average power load pattern of the three clusters in the four seasons. The three clusters exhibit similar power usage characteristics in all four seasons, i.e., the average power usage reaches its valley and peak at roughly the same time every season, around 4 a.m. and 8 p.m., respectively. Moreover, households generally use air conditioners to control the indoor temperature during the summer; therefore, summer electricity usage is higher. Winter consumption in the three clusters is lower than in summer, suggesting that most apartments have installed heating systems whose consumption is not captured in the electricity data. Looking at all four seasons, electricity demand is stably required between
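The five-number summary behind a boxplot like Figure 13 is straightforward to compute. A small illustrative example on synthetic daily averages (not the Manhattan data), using the conventional 1.5 × IQR fences for flagging outliers:

```python
# Five-number summary and outlier fences for a boxplot.
import numpy as np

daily_avg_kw = np.array([0.10, 0.15, 0.18, 0.19, 0.20, 0.22, 0.25, 0.30, 0.45])

minimum, q1, median, q3, maximum = np.percentile(daily_avg_kw,
                                                 [0, 25, 50, 75, 100])

# Points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are usually drawn as outliers.
iqr = q3 - q1
outliers = daily_avg_kw[(daily_avg_kw < q1 - 1.5 * iqr) |
                        (daily_avg_kw > q3 + 1.5 * iqr)]
print(median, list(outliers))  # 0.2 [0.45]
```

Applying this per cluster reproduces the kind of distributional comparison the boxplot makes visually.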

Conclusions
High-frequency smart meters have been broadly deployed for collecting electricity data. Our proposed method applies the discrete wavelet transform to convert time-domain data to the frequency domain. We extracted detail and approximation signals using a statistical approach, then used Pearson's correlation coefficients to filter out highly correlated features. To further reduce the dimensionality, we applied PCA to preserve three features. These remaining three features were then used by the clustering algorithms. Our study aimed to obtain representative load patterns from high-time-resolution daily load curves in Manhattan. Our method reduces the large dimensionality to increase clustering efficiency. In addition, it slightly improves the clustering result, as measured by the silhouette coefficient, Calinski-Harabasz index, and Davies-Bouldin index, compared with clustering without the discrete wavelet transform. From the representative load patterns, utility policymakers could design a reasonable demand response scheme to maintain power system stability, help the utility maximize its profit, and even reduce consumers' electricity fees. Based on Figure 14, policymakers could design three different advanced time-of-use tariffs according to the electricity consumption volume and the representative load curves of the three clusters. In each cluster, the electricity demand increases noticeably from 4 p.m. to 8 p.m., which could affect power system stability; an appropriate DR scheme, such as load shifting/shedding, is therefore significant during this period. For future work, we suggest exploring sub-clusters within the obtained clusters to derive more detailed load patterns based on our method.
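For concreteness, the preprocessing pipeline summarized here (DWT, statistical features, Pearson correlation filter, PCA) can be sketched end to end. This sketch uses a hand-rolled Haar DWT on synthetic curves; the wavelet, decomposition level, statistics, and correlation threshold are assumptions, not the authors' settings.

```python
# Illustrative pipeline: Haar DWT -> statistical features -> correlation
# filter -> PCA to three components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
curves = rng.random((200, 96))  # synthetic daily curves, 15-min resolution

def haar_step(s):
    # One level of the Haar DWT: approximation (low-pass) and detail
    # (high-pass) coefficients.
    return (s[0::2] + s[1::2]) / np.sqrt(2), (s[0::2] - s[1::2]) / np.sqrt(2)

def dwt_features(signal, level=3):
    # Decompose, then summarize each coefficient band statistically.
    bands, approx = [], np.asarray(signal, dtype=float)
    for _ in range(level):
        approx, detail = haar_step(approx)
        bands.append(detail)
    bands.append(approx)
    return np.array([f(b) for b in bands for f in (np.mean, np.std)])

F = np.vstack([dwt_features(c) for c in curves])

# Pearson correlation filter: drop one feature of each pair with |r| > 0.95.
r = np.corrcoef(F, rowvar=False)
keep = [i for i in range(F.shape[1])
        if not any(abs(r[i, j]) > 0.95 for j in range(i))]

# PCA keeps three components, matching the dimensionality used in the paper.
X = PCA(n_components=3).fit_transform(F[:, keep])
print(X.shape)  # (200, 3)
```

The reduced matrix X would then be fed to k-means, hierarchical clustering, or FCM as described in the earlier sections.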