Electric Load Data Compression and Classification Based on Deep Stacked Auto-Encoders

With the development of advanced metering infrastructure (AMI), electrical data are collected frequently by smart meters. Consequently, the load data volume and length increase dramatically, which aggravates the data storage and transmission burdens in smart grids. On the other hand, for event detection or market-based demand response applications, load service entities (LSEs) want smart meter readings to be classified into specific and meaningful types. Considering these challenges, a stacked auto-encoder (SAE)-based load data mining approach is proposed. First, an innovative framework for smart meter data flow is established. On the user side, the SAEs are utilized to compress load data in a distributed way. Then, centralized classification is performed at the remote data center by a softmax classifier. Through the layer-wise feature extraction of the SAE, the sparse and lengthy raw data are expressed in compact forms and then classified based on features. A global fine-tuning strategy based on a well-defined labeled subset is embedded to improve the extracted features and the classification accuracy. Case studies in China and Ireland demonstrate that the proposed method achieves lower error at satisfactory compression ratios (CRs) than benchmark compressors. It also significantly improves the classification accuracy on both appliance and house level datasets.


Introduction
With the wide integration of distributed generations, a two-way power flow occurs frequently in modern power systems [1]. A similar bi-directional flow also appears in the information transmission of power systems due to advanced metering infrastructure (AMI) and some control systems [2,3]. Smart meters, the basic terminal equipment of AMI, have gained increasing popularity worldwide. By the end of 2016, smart meters installed in the UK and the US reached 2.9 million [4] and 70 million [5], respectively. An unusual volume of collected smart meter readings will have to be transmitted simultaneously through limited-bandwidth environments. With the rapidly increasing volume of smart meter data, transmitting and storing load data in a large service area is a realistic challenge for load service entities (LSEs). For instance, the volume of load profile data at a granularity of 1 s and double-precision floating point for the 40 million households in Germany amounts to 25 TB per day [6]. It is practically unnecessary to keep the complete raw load data, considering its obvious redundancy. This can be explained in two aspects: (i) Load data is sparse in itself, i.e., for the majority of users,
With the development of smart grid and electricity market, new challenges and requirements arise for load data mining.
(1) High quality smart meters collect data frequently and the historical measurements accumulate constantly, so the data volume increases dramatically [5]. Besides, for some special issues, wider profiles are utilized [9,30], e.g., the weekly, monthly, or even annual load profiles. These developments require the load data to be compressed in higher ratios. (2) With the development of DR, user segmentation has received unprecedented attention recently [11]. Users should be classified into more specific and meaningful types. In non-intrusive load identification, the readings are investigated to match the household appliances [24,31], e.g., refrigerator and air conditioner.
In order to address the above challenges, this paper aims to achieve higher CRs for load data compression and better accuracies for load classification. Specifically, the concept and model of deep learning in Artificial Intelligence (AI) are introduced. Load data are compressed in a layer-wise way by deep Stacked Auto-Encoders (SAEs), and the probability characteristics of load classification are obtained with a softmax regression method. The main contributions of the proposed method can be presented as follows:
(A) An innovative framework for smart grid data flow. Specifically, encoders are installed or integrated at the user side to compress the collected smart meter readings to smaller volumes. The compressed data are transmitted to the remote data center for storage. The corresponding decoders are preconfigured in the data center and can recover the compressed data if needed. In this way, both the transmission bandwidth and the storage space are reduced without affecting further applications and value-added services.
(B) An efficient and adaptive Artificial Neural Network (ANN)-based compressor. Stacked Auto-Encoders (SAEs) are introduced to compress smart meter data. The SAE-based nonlinear compressor keeps the information loss satisfactorily small when load data are compressed to extremely small sizes. By adjusting the number of neurons, the output data of the SAE can be compressed to any configurable size and the model can adapt to various input dimensions. Compared with the state-of-the-art linear methods (PCA, DWT, and SVD), the proposed SAE model achieves higher CRs while keeping the information loss satisfactorily small.
(C) A labeled data driven improvement for classification. Specifically, massive unlabeled data are classified by a softmax classifier in the data center based on features extracted by distributed SAEs. After pre-training, a follow-up training is implemented to fine-tune all the weights based on a well-defined labeled subset, which greatly modifies the high order features and improves the classification accuracy. Numerical results demonstrate that the proposed ANN-based method has a significantly better fit for both building level user segmentation and appliance level abnormal event detection.
The rest of this paper is organized as follows. Section 2 gives the framework of the proposed method. Section 3 provides some preprocessing techniques to make the raw data fit for ANN. Section 4 introduces the basic Auto-Encoder (AE) to encode and decode smart meter data. Section 5 elaborates the softmax regression for load profile classification. Based on AE and softmax classifier, Section 6 provides a deep ANN stacked by AEs for load data mining. Real smart meter datasets in China and Ireland are utilized for case studies in Section 7 to demonstrate the feasibility of the proposed approach. Finally, the conclusions are presented in Section 8.

Framework of the Proposed ANN-Based Load Data Compression and Classification Model
In this section, the framework of the ANN-based load data compression and classification model is elaborated. With a distributed hierarchical structure, the proposed model consists of three parts: an encoding network, a decoding network, and a classification network. The deployment of the three parts is shown in Figure 1. Encoding networks are installed or integrated at the user side to compress the collected smart meter readings to smaller volumes. The compressed data are transmitted to the remote data center for storage. In this way, both the transmission bandwidth and the storage space are greatly reduced. The corresponding decoding network is preconfigured in the data center and can reconstruct the compressed data to its original length if needed. The compressed or reconstructed data can also be input into the classification network for identifying the labels or categories of the corresponding users or profiles. Some further applications are designed based on the estimated labels, e.g., power billing, fault detection, or demand response. Messages about these applications are then sent to the corresponding users. In this paper, for a higher accuracy, loads are classified based on the compressed data. The proposed framework is easy to realize in practice, as the required hardware consists of programmable chips such as the Field Programmable Gate Array (FPGA) [32], Very Large Scale Integration (VLSI) [33], and Digital Signal Processing (DSP) [34].
Specifically, AE is utilized as the basic functional unit to encode and decode massive load data. Softmax regression is the recommended method for load classification. Inspired by the revolutionary progress in deep learning, multiple AEs are stacked to constitute the compressor in order to provide better compression capability [35,36]. A fine-tuning strategy based on labeled load data is utilized to train the stacked AEs for a layer-wise compression of the redundant smart meter data [37].

Preprocessing of Load Data
In this section, some necessary preprocessing techniques utilized to make the raw data fit for ANN input are illustrated.
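Suppose a dataset with K n-dimensional load curves is obtained. Each curve can be denoted as $X_i = (X_{1i}, X_{2i}, \ldots, X_{ni})^T$. First, a min-max data normalization is implemented [38].
$$x_{i,j} = \frac{X_{i,j} - \min_{1 \le d \le n} X_{d,j}}{\max_{1 \le d \le n} X_{d,j} - \min_{1 \le d \le n} X_{d,j}} \tag{1}$$
Here, $x_{i,j}$ is the normalized i-th dimension value of profile j, and the minimum and maximum are taken over the n data points of profile j.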
After min-max normalization, a scaled profile within [0, 1] is obtained for each sample, which can be denoted as $x_i$ (i = 1, 2, ..., K). Considering that many different measurements may belong to the same user, an enumerated search strategy is utilized to select the representative profile for each user. Specifically, for each profile, the distance to all the other profiles belonging to the same user is evaluated.
Based on the results, the profile with the minimum distance to the others is selected as the representative load profile for this user.
Based on the selected representative profiles, abnormal values can be dealt with in the following way. The raw representative profile is selected based on the dataset excluding profiles with missing values. Data points that lie outside a certain threshold distance from the representative profile are considered outliers and are excluded from the dataset. The missing values are assigned the corresponding values of the representative profile. The final representative profile is selected based on the updated dataset.
In general, we obtain smart meter measurements from different users. Several load curves are collected from each user. First, these obtained curves are normalized to profiles without amplitude differences. For each user, the representative profile is selected, and the sleek representative profiles constitute the final dataset.
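The normalization and representative-profile selection described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names are hypothetical, the distance is taken to be Euclidean (the metric is not shown in the text), and the guard for flat curves is an added convenience.

```python
import numpy as np

def min_max_normalize(curves):
    """Scale each load curve (one row per curve) to [0, 1] using its own min and max."""
    curves = np.asarray(curves, dtype=float)
    lo = curves.min(axis=1, keepdims=True)
    hi = curves.max(axis=1, keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)  # assumption: flat curves map to all zeros
    return (curves - lo) / span

def representative_profile(profiles):
    """Return the index of the profile with the smallest summed distance to the others.

    Euclidean distance is an assumption made for this sketch.
    """
    profiles = np.asarray(profiles, dtype=float)
    diff = profiles[:, None, :] - profiles[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))  # pairwise distance matrix
    return int(dist.sum(axis=1).argmin())

# Three daily curves of one hypothetical user (4 points each, arbitrary units).
raw = np.array([[1.0, 2.0, 3.0, 2.0],
                [2.0, 4.0, 6.0, 4.0],
                [5.0, 1.0, 1.0, 5.0]])
norm = min_max_normalize(raw)
rep = representative_profile(norm)
```

The first two curves differ only in amplitude, so they coincide after normalization, and one of them is chosen as the representative while the third curve stands apart.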

The Basic Auto-Encoder for Load Data Compression
In this section, AE is illustrated as the basic functional unit to compress load data, including its topology, cost function, sparsity constraint, and training algorithm.
The principle of data compression by ANN can be described as follows. For an ANN-based compressor, the number of units on the first layer equals the dimension of the input data, and the last layer outputs the processed data. To ensure recoverability, the number of units on the last layer of a feasible compressor should be the same as on the input layer. Such symmetrical ANNs are able to learn a function F(x) ≈ x, in other words, to output a value, vector, or matrix approximating the input. Once the networks are successfully trained, the compressed data can be obtained on the hidden layers. Provided that the hidden layers have fewer units than the input and output layers, the input data are expressed with reduced dimensions and data compression is achieved.
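As a rough illustration of this principle, the sketch below passes one normalized profile through a symmetric network with a narrow hidden layer. The layer sizes (96 readings, 12 hidden units), the random placeholder weights, and the function names are assumptions for illustration only; a real compressor would use trained weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden = 96, 12  # hypothetical sizes: 96 readings, 12 features

# Untrained placeholder weights; training would make decode(encode(x)) close to x.
W_enc = rng.normal(scale=0.1, size=(n_hidden, n_in))
b_enc = np.zeros(n_hidden)
W_dec = rng.normal(scale=0.1, size=(n_in, n_hidden))
b_dec = np.zeros(n_in)

def encode(x):
    """User side: map a profile to its compressed hidden-layer representation."""
    return sigmoid(W_enc @ x + b_enc)

def decode(code):
    """Data center side: map the compressed representation back to full length."""
    return sigmoid(W_dec @ code + b_dec)

x = rng.random(n_in)   # a normalized load profile in [0, 1]
code = encode(x)       # what would be transmitted and stored
x_hat = decode(code)   # reconstruction with the original length
```

Only `code` needs to be transmitted and stored, which is where the reduction in bandwidth and storage comes from.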
As mentioned above, an ANN with a symmetrical structure can be utilized for data compression, and such a special ANN is named an AE [35]. Figure 2 gives an AE with 4 input/output units and 2 hidden layers, where $a_i^l$ is the activation (meaning output value) of unit i on layer l, and $z_i^l$ is the weighted sum of inputs to unit i on layer l including the bias term. $W_{ij}^l$ is the weight associated with the connection between unit j on layer l and unit i on layer l+1. $b_i^l$ is the bias associated with unit i on layer l. $W^l$ and $b^l$ are the vector formats of $W_{ij}^l$ and $b_i^l$. Like other ANNs, to train an AE is to determine the weights of units. The optimization problem can be formulated as [36]:
$$\min_{W,b} J(W,b) = \frac{1}{m}\sum_{k=1}^{m}\frac{1}{2}\sum_{i=1}^{n}\left(a_i^{L}(x_k) - x_{ik}\right)^2 + \frac{\lambda}{2}\sum_{l=1}^{L-1}\sum_{j=1}^{n_l}\sum_{i=1}^{n_{l+1}}\left(W_{ij}^{l}\right)^2 + \gamma\sum_{l=2}^{L-1}\sum_{i=1}^{n_l}\mathrm{KL}\left(\delta \,\|\, \delta_i^{l}\right) \tag{3}$$
where m is the total number of load profiles, n is the number of data points in each profile, L is the number of layers, and $n_l$ is the total number of units on layer l except for the bias unit. λ and γ are the weights of the corresponding terms. The first term in Equation (3) quantifies the error between the input and output load data. The second is a weight decay term used to prevent over-fitting. As mentioned above, the number of units on hidden layers determines the dimension of the compressed data; however, adjusting the hidden unit configuration is relatively inconvenient. A sparsity constraint is therefore introduced to compress the data volume adaptively, which is presented as the third term in Equation (3). For an ANN using the sigmoid activation function, a unit is active when its output approaches 1 and inactive when its output approaches 0. The average activation of unit i on hidden layer l, denoted as $\delta_i^l$, is imposed to approximate a small value, denoted as δ. The constraint of $\delta_i^l$ approaching δ can be achieved by the Kullback-Leibler divergence (KL-divergence):
$$\mathrm{KL}\left(\delta \,\|\, \delta_i^{l}\right) = \delta\log\frac{\delta}{\delta_i^{l}} + (1-\delta)\log\frac{1-\delta}{1-\delta_i^{l}}$$
KL-divergence is a standard function used to quantify the difference between two distributions [35]. When $\delta_i^l$ approaches δ, $\mathrm{KL}(\delta \| \delta_i^l)$ approximates 0; when $\delta_i^l$ diverges from δ, $\mathrm{KL}(\delta \| \delta_i^l)$ increases rapidly. Thus, minimizing this KL-divergence has the effect of forcing $\delta_i^l$ to approach δ.
Due to non-convexity, the minimum of the cost function cannot be obtained in closed form. Iterative back propagation (BP) gradient algorithms are utilized to search for the solutions [39], e.g., the Newton method, the gradient descent method, and the Limited-memory Broyden Fletcher Goldfarb Shanno (L-BFGS) algorithm [40]. The iterative ANN training is time-consuming compared with traditional methods such as PCA, DWT, and SVD, which may be considered a main limitation of ANN-based load data compression.
The expression ability of any regression model is determined by the number of integrated variables. Therefore, for an ANN-based compressor, which is essentially a non-linear regression, the data compression effect is determined by the number of connecting weights. In order to compress the obtained load data to the desired sizes, the number of units on hidden layers must be decreased. If a very high CR is expected, the number of hidden units must be extremely small; in order to maintain the expression ability of the network (data recovery effect), the number of layers should be increased accordingly. Such networks with multiple hidden layers are called deep ANNs [35]. In other words, deep ANNs actually use more longitudinal variables (layers) to reduce the horizontal variables (units). Due to the vanishing gradient [37], the training of deep networks is challenging for ANN-based data compression. In this paper, the utilized deep ANN is considered as a special network stacked by simple AEs, and a promising deep learning algorithm called wake-sleep [35] is introduced to solve the training problem, which will be illustrated in Section 6.
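The three terms of the AE cost function (reconstruction error, weight decay, and KL-divergence sparsity penalty) can be made concrete with a short NumPy sketch for an AE with a single hidden layer. The normalizing constants and the default values of `lam`, `gamma`, and `delta` are assumptions chosen for illustration, not the settings used in the case studies; the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def kl_bernoulli(delta, delta_hat):
    """KL divergence between Bernoulli(delta) and Bernoulli(delta_hat), elementwise."""
    return (delta * np.log(delta / delta_hat)
            + (1 - delta) * np.log((1 - delta) / (1 - delta_hat)))

def sparse_ae_cost(X, W1, b1, W2, b2, lam=1e-4, gamma=3.0, delta=0.05):
    """Cost of a one-hidden-layer AE on X (m profiles as rows, n points as columns)."""
    m = X.shape[0]
    hidden = sigmoid(X @ W1.T + b1)        # activations of the hidden layer
    output = sigmoid(hidden @ W2.T + b2)   # reconstruction of the input
    error = 0.5 * np.sum((output - X) ** 2) / m
    decay = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    delta_hat = hidden.mean(axis=0)        # average activation per hidden unit
    sparsity = gamma * np.sum(kl_bernoulli(delta, delta_hat))
    return error + decay + sparsity

rng = np.random.default_rng(0)
X = rng.random((20, 8))                    # 20 synthetic profiles with 8 points each
W1, b1 = rng.normal(scale=0.1, size=(3, 8)), np.zeros(3)
W2, b2 = rng.normal(scale=0.1, size=(8, 3)), np.zeros(8)
cost = sparse_ae_cost(X, W1, b1, W2, b2)
```

With untrained weights the hidden activations sit near 0.5, far from the target `delta`, so the sparsity term dominates the cost; training would push the average activations toward `delta`.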

The Softmax Load Data Classifier
This section introduces a softmax load classifier. As shown in Figure 3, the softmax classifier is a simple ANN with only two layers, the input load data layer and the output label layer.
In this paper, the softmax regression is a supervised method used for multi-class load classification. To describe this in a probabilistic way, we want to estimate the probabilities of classifying the input load profile into each possible class and find the class with the maximum probability. Given an input $x_i = (x_{1i}, x_{2i}, \ldots, x_{ni})^T$, the label $y_i$ can take k different values. As shown in Equation (4), the softmax classifier outputs a k-dimensional vector (whose elements sum to 1) giving the k estimated probabilities [41].
$$h_\theta(x_i) = \begin{bmatrix} p(y_i = 1 \mid x_i; \theta) \\ p(y_i = 2 \mid x_i; \theta) \\ \vdots \\ p(y_i = k \mid x_i; \theta) \end{bmatrix} = \frac{1}{\sum_{l=1}^{k} e^{\theta_l^T x_i}} \begin{bmatrix} e^{\theta_1^T x_i} \\ e^{\theta_2^T x_i} \\ \vdots \\ e^{\theta_k^T x_i} \end{bmatrix} \tag{4}$$
where $\theta_1, \theta_2, \ldots, \theta_k \in \mathbb{R}^{n \times 1}$ are the classifier parameters to be trained. The probability of classifying $x_i$ into category j is:
$$p(y_i = j \mid x_i; \theta) = \frac{e^{\theta_j^T x_i}}{\sum_{l=1}^{k} e^{\theta_l^T x_i}}$$
It is worth noting that for supervised learning based on labeled data, the softmax output of the specific label dimension is imposed to be 1 and the other dimensions are 0.
An indicator function ind{·}, which equals 1 when its argument is true and 0 otherwise, is adopted to formulate the cost function of softmax regression.
Considering the factor of weight decay, the cost function and optimization problem of softmax regression can be expressed as [42]
$$\min_\theta J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} \mathrm{ind}\{y_i = j\} \log p(y_i = j \mid x_i; \theta) + \frac{\lambda}{2}\sum_{i=1}^{k}\sum_{j=1}^{n} \theta_{ij}^2$$
where $\theta_{ij}$ is the weight of the softmax classifier connecting unit j on the input layer and unit i on the output layer. In a similar way to the AE, the softmax classifier can be trained by iterative BP algorithms.
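A minimal NumPy sketch of the softmax classifier may help make the probability output and the cost concrete. The function names, the toy data, and the value of `lam` are assumptions for illustration; subtracting the row maximum before exponentiation is a standard numerical safeguard that leaves the probabilities unchanged.

```python
import numpy as np

def softmax_probs(theta, X):
    """Class probabilities; theta is (k, n), X is (m, n). Returns an (m, k) array."""
    scores = X @ theta.T
    scores = scores - scores.max(axis=1, keepdims=True)  # stabilizes exp, same result
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def softmax_cost(theta, X, y, lam=1e-4):
    """Mean negative log-likelihood of the true labels plus weight decay."""
    m = X.shape[0]
    p = softmax_probs(theta, X)
    log_lik = np.log(p[np.arange(m), y])  # the indicator keeps only the true class
    return -log_lik.mean() + 0.5 * lam * np.sum(theta ** 2)

rng = np.random.default_rng(0)
X = rng.random((5, 4))            # 5 synthetic feature vectors of length 4
y = np.array([0, 2, 1, 2, 0])     # labels drawn from k = 3 classes
theta = np.zeros((3, 4))          # untrained classifier
probs = softmax_probs(theta, X)
cost = softmax_cost(theta, X, y)
labels = probs.argmax(axis=1)     # class with the maximum probability
```

With all-zero parameters every class receives probability 1/3, so the cost equals log 3; training lowers it by raising the probability of the labeled class.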

Load Data Mining Based on Stacked Auto-Encoders
In this section, a deep ANN stacked by AEs is proposed for load data mining. As mentioned in the previous section, the proposed deep SAE is significantly better suited to extracting features and expressing data than the basic AE.
The proposed network consists of L layers in which the outputs of each layer are connected to the inputs of the layer above. The lower part (layer 1 to L-1) is stacked by AEs, and the top layer (layer L) is a softmax classifier. As a typical deep ANN, the proposed network is difficult to train by the iterative BP algorithms due to gradient diffusion and the possibility of being trapped in local optima [37]. Therefore, a wake-sleep training principle of deep learning is introduced to train the network in the following two steps.

Wake Procedure: A Greedy Layer-Wise Pre-Training
In this procedure, a greedy layer-wise training is performed, which consists of a layer-wise unsupervised AE self-learning and a supervised softmax regression [43,44].
Figure 4 illustrates the layer-wise training procedure. Specifically, in the stacked AE network, the decoding layers of the AEs are implicit. As shown in Figure 4, if we supplement a decoding layer behind layer 2 to constitute a complete symmetric AE, we can train it through a self-learning algorithm. Extending the same idea to layers l and (l + 1), the SAE can be trained in a layer-wise way. Finally, a supervised softmax regression is performed on layer L. The detailed pseudo-code of the wake step is given in Table 1.
Table 1. Procedure of the greedy layer-wise training and fine-tuning.

Start the training process

Wake (I) Unsupervised layer-wise self-learning
Step1: Obtain a labeled dataset Ω(x, y). Randomly initialize the SAE and set the iteration counter l = 1 and the training set I = Ω(x).
Step2: Connect a virtual decoding layer to layer (l+1) to form a symmetric AE, denoted as AE-l.
Step3: Train AE-l using BP with both its input and its output imposed equal to I. Denote the weights of layer l as {W^l′, b^l′}.
Step5: If l = L, then jump to Step6; otherwise return to Step2.

Wake (II) Supervised softmax regression
Step6: Randomly initialize the softmax classifier weights θ. Input the activations of layer (L−1) to the classifier: z^L = a^(L−1). Impose the label vector on the classifier output: a^L = y.
Step7: Train the classifier using BP to obtain the optimal weights, denoted as θ′.

Sleep (III) Global fine-tuning on the whole ANN
Step1: Initialize the whole ANN with the weights obtained in the wake step, {W′, b′, θ′}. Input the original dataset to the whole ANN: z^1 = Ω(x). Impose the label vector on the output: a^L = y.
Step2: Train the SAE and the softmax classifier as a whole ANN using BP. Obtain the final optimal weights of all the layers, denoted as {W*, b*, θ*}.

Terminate the training process
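The unsupervised part of the wake step in Table 1 can be sketched as follows. This is a simplified illustration rather than the training setup of the paper: it uses plain gradient descent on the squared reconstruction error only (no weight decay or sparsity term), the learning rate, iteration count, layer sizes, and synthetic data are assumptions, and the update of the training set to the activations of the newly trained layer is the usual greedy layer-wise step, assumed here because it is not listed in the table.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_ae(I, n_hidden, iters=200, lr=0.5, seed=0):
    """Train one AE on I (m x n) by gradient descent on the squared error.

    Returns the encoder weights, the encoder bias, and the loss history.
    """
    rng = np.random.default_rng(seed)
    m, n = I.shape
    W1 = rng.normal(scale=0.1, size=(n, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(n_hidden, n))
    b2 = np.zeros(n)
    losses = []
    for _ in range(iters):
        h = sigmoid(I @ W1 + b1)                 # encoding layer
        out = sigmoid(h @ W2 + b2)               # virtual decoding layer
        losses.append(0.5 * np.sum((out - I) ** 2) / m)
        d_out = (out - I) * out * (1 - out) / m  # error signal at the output
        d_h = (d_out @ W2.T) * h * (1 - h)       # error signal at the hidden layer
        W2 -= lr * (h.T @ d_out)
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * (I.T @ d_h)
        b1 -= lr * d_h.sum(axis=0)
    return W1, b1, losses

def greedy_pretrain(X, hidden_sizes):
    """Train AEs one layer at a time; each AE sees the previous layer's activations."""
    I, weights = X, []
    for size in hidden_sizes:
        W, b, _ = train_ae(I, size)
        weights.append((W, b))
        I = sigmoid(I @ W + b)  # assumed update: activations become the next training set
    return weights, I

# Synthetic normalized profiles: one daily shape plus small noise (illustrative only).
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 16)
X = np.clip(0.5 + 0.4 * np.sin(2 * np.pi * t) + 0.05 * rng.normal(size=(30, 16)), 0, 1)
weights, codes = greedy_pretrain(X, [8, 4])
_, _, losses = train_ae(X, 8)
```

The returned `weights` would serve as the initial setting for the supervised softmax regression and the subsequent global fine-tuning, and `codes` holds the compressed features of the top AE layer.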
Sleep Procedure: A Global Fine-Tuning
During the wake step, an initial setting of the weights is obtained by greedy layer-wise training. The next step is to further modify all the weights to minimize the error of the network as a whole. This is called fine-tuning.
Figure 5 shows the fine-tuning procedure. In this step, we treat all layers, including the AEs and the classifier, as a single integrated model. A supervised learning algorithm is utilized to fine-tune all the weights simultaneously based on some labeled load data. As the data is propagated from input to output, the error is transmitted down from the top. The pseudo-code of the sleep algorithm can also be found in Table 1.
Ordinary BP ANNs are generally initialized randomly. For the SAE, the initialization is done by a greedy layer-wise training, which makes the initial setting closer to the overall optimum and helps to escape the local optima trap. Actually, the greedy layer-wise training is a pre-training, whose results are modified by fine-tuning in the next step [37]. The fine-tuning based on labeled load data is the most significant step for load data deep learning. The advantages of SAE for load data mining and the necessity of fine-tuning are demonstrated in the following case studies.

Case Studies
In this section, experiments on realistic datasets are used to demonstrate the feasibility of the proposed ANN-based approach. As shown in Table 2, three cases are utilized. The data of case 1 are obtained through the Electric Data Acquire System (EDAS) of State Grid Zhejiang Elec. Power Corp. (SGZEPC) in China. Case 2 is provided by the Sustainable Energy Authority of Ireland (SEAI). Case 3 contains accumulated measurements in our lab. The main parameters are given in Table 3 [35]. Due to the page limitation, the sensitivity analysis of these parameters is omitted here.

Case Studies
In this section, experiments on realistic datasets are used to demonstrate the feasibility of the proposed ANN based approach.Shown in   With the increasing of hidden units, the minimum cost decreases but the decreasing speed slows down.As can be seen, the gap between AE(96/75) and AE(96/65) is obviously smaller than the that between AE(96/15) and AE(96/25).As hidden units correspond to features, this indicates that an evident redundancy exists in load data.For instance, encoded in 45 features, the load data can be decoded well with a small cost below 1.4.Figure 7 also indicates a proper maximum iteration, over which, more BP iterations produce little effects.In Figure 7, if 15 features are used, 150 iterations are enough; when 45 features are used, 500 iterations are appropriate.Based on the different reconstruction results of AE(96/20,50), AE(96/5,900), and AE(96/20,900), we can preliminarily conclude that the load data compression performance of AE is determined by the number of iterations and features (number of hidden units).

Performance of SAE
Figure 7 further illustrates the relationship among the cost, the number of features, and the number of BP iterations. The results can be summarized as follows. As the number of hidden units increases, the minimum cost decreases, but at a diminishing rate: the gap between AE(96/75) and AE(96/65) is clearly smaller than that between AE(96/15) and AE(96/25). Since hidden units correspond to features, this indicates that evident redundancy exists in load data. For instance, when encoded in 45 features, the load data can be decoded well with a small cost below 1.4. Figure 7 also indicates a proper maximum number of iterations, beyond which further BP iterations have little effect: if 15 features are used, 150 iterations are enough; when 45 features are used, 500 iterations are appropriate.
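The relationship among cost, hidden units, and iterations can be explored with a small experiment. The following is a minimal NumPy sketch, not the authors' implementation: it trains a one-hidden-layer AE by batch gradient descent on synthetic 96-point profiles and records the cost at every iteration for several hidden-layer sizes. The data generator `make_profiles`, the learning rate, the sigmoid encoder with linear decoder, and the half mean squared error cost are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_profiles(n=200, d=96, seed=1):
    """Synthetic single-peak daily profiles (hypothetical stand-in for load data)."""
    rng = np.random.default_rng(seed)
    t = np.arange(d)
    centers = rng.uniform(20, 76, size=(n, 1))
    widths = rng.uniform(4, 10, size=(n, 1))
    X = np.exp(-0.5 * ((t - centers) / widths) ** 2)
    return X + 0.02 * rng.standard_normal((n, d))

def train_ae(X, n_hidden, n_iter, lr=0.02, seed=0):
    """Train AE(d/n_hidden, n_iter) by back-propagation; return the cost per iteration."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, (n_hidden, d))
    b2 = np.zeros(d)
    costs = []
    for _ in range(n_iter):
        H = sigmoid(X @ W1 + b1)      # encoder: features on the hidden layer
        Y = H @ W2 + b2               # linear decoder: reconstructed profile
        E = Y - X
        costs.append(0.5 * np.mean(np.sum(E ** 2, axis=1)))
        # Gradients of the cost with respect to each parameter
        dY = E / n
        dW2 = H.T @ dY
        db2 = dY.sum(axis=0)
        dZ = (dY @ W2.T) * H * (1.0 - H)
        dW1 = X.T @ dZ
        db1 = dZ.sum(axis=0)
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    return costs

X = make_profiles()
final_cost = {h: train_ae(X, h, n_iter=200)[-1] for h in (5, 15, 45)}
for h, c in final_cost.items():
    print(f"AE(96/{h},200): final cost = {c:.4f}")
```

Plotting the returned cost lists against the iteration index for each hidden-layer size gives curves of the kind summarized in Figure 7; the exact values depend on the data and hyperparameters and will not match the paper's.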

Performance of SAE
Figure 8 displays the features of three representative profiles (L1, L2, and L3) obtained on different hidden layers of SAE (96/48/24/12,500). The dataset is case 1. Visually, the 96 load points recovered from 6 features are close to the original data. The recovery performance is much better than that of the basic AE in Figure 6.
It can be seen from Figure 8 that the hidden layers express the data as pulse signals with fewer dimensions. Specifically, different pulse shapes (triangle or trapezoid), pulse numbers, and pulse permutations can distinguish different types of input load. These are high-order descriptions of load data based on extracted features.
Another message from Figure 8 is that subtle differences can also be observed in the high-order features. For example, on hidden layer 4, where loads are expressed very concisely in 6 features, the different electricity consumption characteristics of the three profiles are displayed intuitively. Specifically, L1 and L2 are exactly the same in 4 features (features 1, 3, 4, and 6), while L1 and L3 are consistent in only 2 features (features 2 and 5). From the perspective of the original data, L1 and L2 both have a single-peak shape, while L3 presents dual peaks.
If L1, L2, and L3 are representative profiles of three users and this is a user segmentation problem, then users 1 and 2 are more likely to be put into the same class, while user 3 should be isolated. If the three profiles are metered from the same user for anomaly detection, then L3 should be identified as an event profile of abnormal behavior. The high-order features thus capture the essential characteristics of the input and are of great significance for understanding the data.
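The comparison of profiles by their high-order features can be illustrated with a short sketch. The 6-feature encodings below are hypothetical values chosen only to reproduce the agreement pattern described for L1, L2, and L3; they are not read from Figure 8, and the helper `shared_features` and its tolerance are assumptions.

```python
import numpy as np

def shared_features(f_a, f_b, tol=1e-6):
    """Return the 1-based indices of features on which two encodings agree."""
    f_a = np.asarray(f_a, dtype=float)
    f_b = np.asarray(f_b, dtype=float)
    return [int(i) + 1 for i in np.flatnonzero(np.abs(f_a - f_b) <= tol)]

# Hypothetical encodings on the top hidden layer (6 features each)
L1 = [0.9, 0.1, 0.8, 0.0, 0.5, 0.2]
L2 = [0.9, 0.4, 0.8, 0.0, 0.7, 0.2]
L3 = [0.3, 0.1, 0.1, 0.6, 0.5, 0.9]

print("L1 vs L2:", shared_features(L1, L2))  # [1, 3, 4, 6]
print("L1 vs L3:", shared_features(L1, L3))  # [2, 5]
```

A larger set of shared features suggests that two profiles belong to the same class, which is the intuition behind grouping users 1 and 2 and isolating user 3.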
It is worth noting that an obvious information distortion exists in the shaded circular area. Although the objective of the SAE is to restore the input as closely as possible, error arises in the layer-wise self-learning due to local optima or other factors, and it gradually grows during the greedy layer-wise training from the lowest to the top AE. Because of the multiple layers, the error of an SAE is usually larger than that of a single AE. It is not hard to imagine that some dual-peak profiles may be smoothed into single-peak ones and misclassified on the top layers (refer to the red profile in Figure 6a,d). As mentioned above, well-defined labeled data can be utilized to fine-tune the network and help decrease the error. In case 1, the labels are determined by profile shape: 1 for single-peak and 2 for dual-peak. Figure 9 displays the compression costs of different SAEs with and without fine-tuning.
As can be seen from Figure 9, the errors of all three networks decrease after fine-tuning based on well-defined labeled data. It is also demonstrated that a relatively appropriate depth (number of layers) exists for a given dataset. In case 1, the SAE with two hidden layers and fine-tuning achieves the minimum cost, which is 2.32. Increasing the depth from 2 to 3 produces no improvement and even has some negative effect.
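The way reconstruction error can build up across layers during greedy layer-wise training can be examined with a small experiment. The sketch below is an illustration under stated assumptions, not the authors' code: each AE uses a sigmoid encoder and a linear decoder, is trained by plain batch gradient descent, and receives the codes of the AE below it as input. The synthetic profiles, the layer sizes, and the learning rate are hypothetical, and the supervised fine-tuning step is deliberately omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_ae(X, n_hidden, n_iter=300, lr=0.02, seed=0):
    """Train one AE (sigmoid encoder, linear decoder) by batch gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    ae = {"W1": rng.normal(0.0, 0.1, (d, n_hidden)), "b1": np.zeros(n_hidden),
          "W2": rng.normal(0.0, 0.1, (n_hidden, d)), "b2": np.zeros(d)}
    for _ in range(n_iter):
        H = sigmoid(X @ ae["W1"] + ae["b1"])
        dY = (H @ ae["W2"] + ae["b2"] - X) / n      # gradient at the output
        dZ = (dY @ ae["W2"].T) * H * (1.0 - H)      # gradient at the hidden layer
        ae["W2"] -= lr * (H.T @ dY)
        ae["b2"] -= lr * dY.sum(axis=0)
        ae["W1"] -= lr * (X.T @ dZ)
        ae["b1"] -= lr * dZ.sum(axis=0)
    return ae

def encode(ae, X):
    return sigmoid(X @ ae["W1"] + ae["b1"])

def decode(ae, H):
    return H @ ae["W2"] + ae["b2"]

def pretrain_sae(X, layer_sizes, n_iter=300, lr=0.02):
    """Greedy layer-wise pre-training: each AE is trained on the codes of the one below."""
    aes, inp = [], X
    for i, h in enumerate(layer_sizes):
        ae = train_ae(inp, h, n_iter=n_iter, lr=lr, seed=i)
        aes.append(ae)
        inp = encode(ae, inp)
    return aes

def sae_reconstruct(aes, X):
    """Encode through all layers, then decode back through them in reverse order."""
    H = X
    for ae in aes:
        H = encode(ae, H)
    for ae in reversed(aes):
        H = decode(ae, H)
    return H

# Hypothetical single-peak profiles with 96 points each
rng = np.random.default_rng(42)
t = np.arange(96)
centers = rng.uniform(20, 76, size=(150, 1))
X = np.exp(-0.5 * ((t - centers) / 8.0) ** 2)

aes = pretrain_sae(X, layer_sizes=(48, 24, 12))
for depth in (1, 2, 3):
    err = np.mean((sae_reconstruct(aes[:depth], X) - X) ** 2)
    print(f"depth {depth}: reconstruction MSE = {err:.4f}")
```

Comparing the printed errors for different depths shows how much each additional pre-trained layer adds to the reconstruction error before any fine-tuning is applied; the numbers depend entirely on the assumed data and settings.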
Lossy compression is essentially a tradeoff between data volume and error. From this perspective, two indexes are introduced: the CR and the root mean square error (RMSE) [7]. CR is the ratio between the sizes of the original and compressed data, and RMSE is a widely used measure of the reconstruction error. For ANN-based load data compression, CR is determined by the number of output units, e.g., AE(96/48) gives a CR of 2.
In terms of CR and RMSE, the compression performance of SAE (with fine-tuning) is compared with that of several other methods on case 1, including PCA [24], DWT [25,26], and k-SVD [7,21]. Figure 10 displays the results. As can be seen, the RMSE of SAE grows very slowly with increasing CR. SAE achieves lower RMSE values than the linear methods at given CRs. If the load data is compressed to 1/16 of its original size, the corresponding RMSE of SAE is only 0.092, which is much lower than that of k-SVD (0.16), DWT (0.24), and PCA (0.27).
To obtain a higher CR, fewer units should be deployed on the output layer, but the RMSE can still be controlled by adding hidden layers and fine-tuning the whole network based on high-quality labeled data. This is a significant advantage of the ANN-based compressor.
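The two indexes, CR and RMSE, can be computed as in the following sketch. It assumes that CR counts only the per-profile values (input length divided by code length) and that RMSE is taken over all points of all profiles; the paper may normalize differently. A PCA compressor built with `numpy.linalg.svd` serves as a simple linear baseline, and the random data are purely hypothetical.

```python
import numpy as np

def compression_ratio(n_original, n_compressed):
    """CR = size of the original data / size of the compressed data."""
    return n_original / n_compressed

def rmse(X, X_hat):
    """Root mean square error between original and reconstructed data."""
    X = np.asarray(X, dtype=float)
    X_hat = np.asarray(X_hat, dtype=float)
    return float(np.sqrt(np.mean((X - X_hat) ** 2)))

def pca_compress_decompress(X, k):
    """Project centered profiles onto the top-k principal directions and reconstruct."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    codes = (X - mu) @ Vt[:k].T       # k coefficients per profile
    return codes @ Vt[:k] + mu

rng = np.random.default_rng(0)
X = rng.random((200, 96))             # hypothetical data: 200 profiles, 96 points

for k in (48, 24, 12, 6):
    cr = compression_ratio(X.shape[1], k)
    err = rmse(X, pca_compress_decompress(X, k))
    print(f"CR = {cr:.0f}, RMSE = {err:.4f}")
```

With 96-point profiles, 48 output units give a CR of 2 and 6 output units give a CR of 16, which corresponds to the 1/16 compression level mentioned in the comparison.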
Accuracy of the Proposed Classifier
Cases 2 and 3 in Table 2 are utilized to show the effects of using SAE and softmax regression for load data classification.
The confusion matrix [9] of case 2 is illustrated in Table 4. Variables ss and sr denote the numbers of SMEs correctly predicted as SMEs and incorrectly predicted as residents, respectively. Similarly, all the other variables in Table 4 can be defined. Then, the classification accuracy index $a_c$ is introduced to evaluate the performance of the classifier. $a_c$ is formulated as the proportion of samples that are correctly classified in the whole set:

$$a_c = \frac{ss + rr}{sr + rs + ss + rr} \times 100\% \qquad (10)$$

In a similar way, the $a_c$ index of case 3 can also be defined. Table 5 gives a numerical comparison of the proposed method and some state-of-the-art load data classification models. The following results can be summarized:
(a) As indicated by the different $a_c$ values of softmax and "AE+softmax," extracted features help improve the classification accuracy.
(b) Fine-tuning is necessary for SAE training; it improves the accuracy by over 10 percentage points.
(c) Comparing the $a_c$ values of AE(x/25,500) and SAE(x/40/15,500) shows that extracting features in a layer-wise way helps improve the classification, provided that fine-tuning is adopted.
(d) With the fine-tuned SAE, more than 97% of fault events are correctly detected and over 90% of SMEs and residents are correctly classified. Such a high accuracy may not be achieved by common "linear compressor + classifier" models.
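The accuracy index, defined as the proportion of correctly classified samples in the whole set, can be computed directly from the four entries of a two-class confusion matrix. The sketch below follows that definition; the function name `accuracy_index` and the counts in the example are hypothetical and are not taken from Table 4.

```python
def accuracy_index(ss, sr, rs, rr):
    """Classification accuracy a_c in percent for a two-class confusion matrix.

    ss: SMEs predicted as SMEs          sr: SMEs predicted as residents
    rs: residents predicted as SMEs     rr: residents predicted as residents
    """
    total = ss + sr + rs + rr
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return 100.0 * (ss + rr) / total

# Hypothetical counts for illustration only
a_c = accuracy_index(ss=90, sr=10, rs=5, rr=95)
print(f"a_c = {a_c:.1f}%")  # 185 of 200 samples are correct: 92.5%
```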

Conclusions
In this paper, an ANN-based smart meter data mining approach is proposed. First, to satisfy the bandwidth constraint for signal transmission and the space requirement for data storage, an innovative framework for smart meter data flow is established. Specifically, SAEs are installed on the user side to compress the meter readings. The corresponding decoding network is preconfigured in the data center and can recover the compressed data if needed. Compared to existing linear models such as PCA, DWT, and SVD, the proposed SAE compressor achieves higher CRs with satisfactory errors. For advanced demand-side applications, a labeled-data-driven strategy is devised to improve the accuracy of unlabeled data classification. Specifically, massive unlabeled data are classified by a softmax classifier in the data center based on features extracted by remote SAEs. A follow-up training step is implemented to fine-tune the weights after pre-training. The numerical results demonstrate that fine-tuning based on a well-defined labeled dataset greatly improves the extracted high-order features and the classification accuracy.
Moreover, the iterative gradient descent calculation is the most time-consuming step, and the massive sample data and their intermediate calculation quantities are the main memory overhead of the proposed method. Both occur in the training step, which can be completed in the manufacturing or debugging stage of smart meters. Thus, by integrating pre-trained networks into smart meters, the proposed method can feasibly cope with the CPU time and memory burdens.
The proposed method is well suited to both building-level user segmentation and appliance-level event detection. It is a feasible tool for dealing with big data on the demand side and has convincing application potential in power billing, demand response, and load forecasting.

Figure 1. Framework of the proposed Artificial Neural Network (ANN)-based model.

Figure 7. Relationship of the cost, features, and iterations.


Figure 9. The compression costs of different SAEs on case 1.

Figure 10. Comparison of SAE and linear algorithms.

7.3. Accuracy of the Proposed Classifier
Case 2 and 3 in Table

$$a_c = \frac{ss + rr}{sr + rs + ss + rr} \times 100\% \qquad (10)$$

Indicated by the different $a_c$ values of softmax and "AE+softmax," extracted features help improve the classification accuracy.
(b) Fine-tuning is necessary for SAE training, which greatly improves the accuracy by over 10 percentages.
(c) Compared $a_c$ values of AE(x/25,500) and SAE(x/40/15,500), features extracted in layer-wise

Table 1. Procedure of the greedy layer-wise training and fine-tuning.

As shown in Table 2, three cases are utilized. Data of case 1 are obtained through the Electric Data Acquire System (EDAS) of State Grid Zhejiang Elec. Power Corp. (SGZEPC) in China. Case 2 is provided by the Sustainable Energy Authority of Ireland (SEAI). Case 3

Table 4. Confusion matrix of case 2.

Table 5. The classification accuracy of different algorithms.