1. Introduction
The electrical loss includes non-technical loss and technical loss. Technical loss is an unavoidable loss in the process of power transmission, which is determined by power loads and the parameters of power supply equipment. Non-technical loss is caused by incorrect measurement, electricity theft, and non-payment by consumers [1]. In recent years, the U.S. has lost USD 6 billion every year due to electricity theft, according to a report by Forbes magazine [2]. Therefore, the detection of electricity theft is of great significance for reducing non-technical loss.
The existing methods for electricity theft detection can be divided into supervised classification and unsupervised regression. Unsupervised regression methods detect electricity theft by comparing the deviation between the actual value and the predicted value of the power load [3]. This kind of method does not need a labeled data set to train the model, but it is difficult to set the detection threshold and the detection accuracy is low [4,5]. Supervised classification methods mainly include traditional data mining models such as the support vector machine (SVM), multi-layer perceptron (MLP), Bayesian network, and extreme gradient boosting tree (XGBoost) [6,7,8,9,10], as well as newer deep learning technologies such as the deep belief network and convolutional neural network (CNN) [11,12,13]. Specifically, the SVM is well suited to binary classification; for n types of stealing power curves, it needs to train n SVMs, which consumes a lot of computing time on data sets with a large number of samples [14,15]. The Bayesian network is sensitive to the form of the input data, and it needs to assume a prior distribution for the samples, so an inaccurate prior model may lead to poor detection accuracy [16]. The MLP has a powerful non-linear mapping ability and can, in theory, fit arbitrary continuous functions; however, it suffers from over-fitting [17]. XGBoost improves performance by combining multiple classifiers, but it has so many parameters that parameter tuning is difficult [18,19]. In general, these traditional data mining methods are easy to implement and are suitable for electricity theft detection with small samples. However, their feature extraction ability is low and their detection accuracy is limited. In contrast, deep neural networks not only have a strong ability to extract features, but can also map complex nonlinear relationships, which gives them higher detection accuracy than traditional methods [20,21].
A sufficient number of stealing power curves in the data set is the basis for ensuring that deep neural networks have strong generalization ability. However, it is difficult to detect electricity theft due to the strong concealment of thieves and limited audit resources. In practical engineering, the number of stealing power curves found is limited, which is not enough to train deep neural networks. Therefore, it is necessary to use the limited stealing power curves for data augmentation, so as to improve the accuracy of detection. In reference [22], random oversampling (ROS) is proposed to reproduce the samples. Although the number of samples is increased, the classifier is prone to over-fitting, since the new samples lack diversity. To solve this problem, the synthetic minority over-sampling technique (SMOTE) is proposed in references [23,24,25]. However, SMOTE does not take into account the probability distribution characteristics of the stealing power curves, so the improvement in accuracy is limited. Reference [26] uses the conditional generative adversarial network (CGAN) to model the stealing power curves, which achieves higher accuracy than the traditional oversampling methods, but its parameters are difficult to tune and its training process is unstable.
The conditional variational auto-encoder (CVAE) is a novel deep generative network that uses output vectors to reconstruct input features. At present, the CVAE has been widely used in different fields such as image augmentation, dimensionality reduction, and data generation [27,28,29], and has shown good performance, but its application to data augmentation for stealing power curves is still in its infancy. In theory, the CVAE effectively extracts the latent features of stealing power curves by using an encoder with strong learning ability, and reconstructs the stealing power curves with its decoder, which can provide enough data for the deep neural network. However, it is necessary to redesign the structure of the CVAE to make it suitable for generating stealing power curves, since the existing structures are only suitable for processing 2-dimensional data, such as images and videos.
To improve the accuracy of electricity theft detection, a data augmentation method for stealing power curves based on conditional variational auto-encoder is proposed in this paper. The key contributions of this paper can be summarized as follows:
The proposed CVAE has strong generalization ability and can generate, through unsupervised learning, many stealing power curves similar to those in the test set. As long as Gaussian noises are input to the decoder of the CVAE, any number of stealing power curve samples can be generated to train the deep neural network.
Compared with ROS and SMOTE, the samples generated by CVAE not only have diversity, but also capture the probability distribution characteristics of stealing power curves. In addition, the training process of CVAE is more stable than that of CGAN and can generate new samples with higher quality.
After data augmentation for the training set by CVAE, the detection accuracy of deep neural network can be significantly improved, and it is suitable for different classifiers.
The rest of this paper is organized as follows: Section 2 proposes the conditional variational auto-encoder for data augmentation. Section 3 introduces the process of electricity theft detection based on CNN. The simulation and results are shown in Section 4. The conclusions are described in Section 5.
2. Conditional Variational Auto-Encoder for Data Augmentation
2.1. Conditional Variational Auto-Encoder
Formally, the variational auto-encoder learns the data distribution $p(X)$ of stealing power curves according to the historical data $X = \{x_1, x_2, \ldots, x_N\}$. Typically, this data distribution of stealing power curves can be decomposed as follows [30]:

$$p(X) = \prod_{i=1}^{N} p(x_i) \tag{1}$$

where $\prod$ is the capital pi that denotes the product of all values in the range of the series.

In order to avoid numerical problems, the log function is applied to obtain the following result:

$$\log p(X) = \sum_{i=1}^{N} \log p(x_i) \tag{2}$$

Each data point of the stealing power curves includes a latent variable $z$ that explains the generative process. Equation (1) can be rewritten for a single point as:

$$p(x) = \int p(x \mid z)\, p(z)\, dz \tag{3}$$

where $\int$ denotes the sign for definite integrals.
The generation procedure for stealing power curves includes several steps. First, the prior probability $p(z)$ is sampled to obtain the latent variable $z$. Then, the stealing power curve $x$ is generated according to the conditional probability $p(x \mid z)$. Unfortunately, the true posterior probability $p(z \mid x)$ is not available, so direct inference is very difficult. Since the posterior probability is often very complex, a simple distribution $q(z \mid x)$ with learnable parameters is needed to approximate it.
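To make the intractability concrete, the marginal likelihood of Equation (3) can in principle be estimated by naive Monte Carlo sampling from the prior. The following one-dimensional sketch uses a hypothetical Gaussian decoder (not from the paper) purely for illustration; in practice the decoder is a deep network and such brute-force sampling is far too inefficient, which is why the variational approximation below is needed.

```python
import numpy as np

# Hypothetical 1-D illustration of Equation (3): p(x) = integral of p(x|z) p(z) dz.
# Assume a standard normal prior p(z) and a Gaussian "decoder" p(x|z) = N(x; z, 0.5^2).
rng = np.random.default_rng(0)

def likelihood(x, z, sigma=0.5):
    # Gaussian density p(x | z) with mean z and standard deviation sigma
    return np.exp(-0.5 * ((x - z) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

z = rng.standard_normal(100_000)   # samples from the prior p(z)
p_x = likelihood(1.0, z).mean()    # Monte Carlo estimate of p(x = 1.0)
```

For this toy model the exact marginal is $x \sim \mathcal{N}(0, 1.25)$, so the estimate converges to about 0.239; with a high-dimensional $z$ and a neural decoder, the number of samples required explodes, motivating the variational lower bound.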
The distribution $q(z \mid x)$ needs to be estimated, because it is impossible to directly sample the distribution of the stealing power curves. Therefore, the Kullback–Leibler divergence can be combined with the variational lower bound:

$$\log p(x) = D_{KL}\big(q(z \mid x) \,\|\, p(z \mid x)\big) + \mathcal{L}(x) \tag{4}$$

where $D_{KL}\big(q(z \mid x) \,\|\, p(z \mid x)\big)$ is the Kullback–Leibler divergence between the distribution $q(z \mid x)$ and the distribution $p(z \mid x)$. $q(z \mid x)$ is the probability distribution to be learned and $p(z)$ is the prior distribution of the latent variables. Obviously, this Kullback–Leibler divergence is greater than or equal to 0, so the term $\mathcal{L}(x)$ acts as a lower bound of the log-likelihood:

$$\log p(x) \geq \mathcal{L}(x) \tag{5}$$
In this case, the term $\mathcal{L}(x)$ can be written as:

$$\mathcal{L}(x) = -D_{KL}\big(q(z \mid x) \,\|\, p(z)\big) + \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big] \tag{6}$$

where the first term $D_{KL}\big(q(z \mid x) \,\|\, p(z)\big)$ constrains the function $q(z \mid x)$ to the shape of the prior $p(z)$. The second term $\mathbb{E}_{q(z \mid x)}[\log p(x \mid z)]$ reconstructs the input data with the given latent variable $z$ that follows $q(z \mid x)$. With this optimization goal $\mathcal{L}$, the model can be parameterized as follows:

$$q(z \mid x) = f(x; \phi), \qquad p(x \mid z) = g(z; \theta) \tag{7}$$

where $f$ and $g$ are deep neural networks with the sets of parameters $\phi$ and $\theta$, respectively. A more detailed derivation of the VAE can be found in [27].
The stealing power curve may have different shapes due to different attack methods, such as physical attacks, data attacks, and communication attacks. In order to make the variational auto-encoder generate the stealing power curves of a specified attack method, the labels $y$ should be added in the training stage of the variational auto-encoder. Formally, the conditional distribution $p(x \mid y)$ can be used to replace the original distribution $p(x)$. The term $\mathcal{L}(x \mid y)$ of the CVAE can be written as [31]:

$$\mathcal{L}(x \mid y) = -D_{KL}\big(q(z \mid x, y) \,\|\, p(z \mid y)\big) + \mathbb{E}_{q(z \mid x, y)}\big[\log p(x \mid z, y)\big] \tag{8}$$
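With a Gaussian encoder and a standard normal prior, the KL term of this objective has a well-known closed form, and the reconstruction term reduces to a squared error for a Gaussian decoder. The following sketch (variable names and shapes are illustrative, not from the paper) shows how the negative of the objective is typically computed during training:

```python
import numpy as np

def cvae_loss(x, x_recon, mu, log_var):
    """Negative of the objective in Equation (8), assuming a Gaussian
    encoder q(z|x,y) = N(mu, exp(log_var)), a standard normal prior,
    and a Gaussian decoder (squared-error reconstruction)."""
    # Closed-form KL( N(mu, exp(log_var)) || N(0, I) )
    kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))
    # Reconstruction term (negative log-likelihood up to a constant)
    recon = 0.5 * np.sum((x - x_recon) ** 2)
    return kl + recon

# A perfect reconstruction with a posterior equal to the prior gives zero loss.
x = np.ones(3)
loss_zero = cvae_loss(x, x, np.zeros(3), np.zeros(3))
```

Minimizing this loss simultaneously pulls $q(z \mid x, y)$ toward the prior and makes the decoder reproduce the input curve, which is exactly the trade-off expressed by the two terms of Equation (8).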
2.2. Data Augmentation for Stealing Power Curves
The main advantage of the CVAE is that it does not need to assume the probability distribution of the stealing power curves, and only a few samples are needed to train the model, which can then generate samples similar to the original stealing power curves. A summary of this process for generating stealing power curves is represented in Figure 1.
Step 1: The input data of the CVAE are the stealing power curves and their labels. Before inputting the data into the CVAE, it is necessary to normalize the stealing power curves, otherwise the CVAE may not converge. In this paper, the min-max normalization method is used to transform the input data into values between 0 and 1.
Step 2: A deep convolutional network with a strong feature extraction ability is used to construct the encoder, which maps the input data to low-dimensional latent variables. Then, the mean and variance of the encoder output are calculated, and they are used to generate the corresponding Gaussian noises as the input data of the decoder.
Step 3: The Gaussian noises are fed to the decoder, composed of a deep transposed convolutional network, to generate new stealing power curves. Then, the output data of the decoder and the actual data are used to calculate the loss function, which updates the weights of the encoder and decoder by the back-propagation method.
Step 4: After training the CVAE, Gaussian noises are fed to the decoder to generate stealing power curves under the specified attack model. Furthermore, the generated stealing power curves and the original samples from the training set are used to train a classifier (e.g., a CNN), which distinguishes whether an unknown sample is a stealing power curve or a normal power curve.
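The reparameterization at the heart of Steps 2 and 3 can be sketched framework-free. All layer sizes and weights below are hypothetical placeholders (random and untrained) standing in for the convolutional encoder and transposed-convolutional decoder described above; the point is only the mean/variance sampling path:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sizes: a 48-point daily load curve, a 2-D latent space.
n_features, n_latent = 48, 2

# Placeholder "encoder" and "decoder" weights (random, untrained).
W_enc_mu = rng.standard_normal((n_features, n_latent)) * 0.1
W_enc_lv = rng.standard_normal((n_features, n_latent)) * 0.1
W_dec = rng.standard_normal((n_latent, n_features)) * 0.1

def encode(x):
    # Step 2: map the (normalized) curve to the mean and log-variance of q(z|x).
    return x @ W_enc_mu, x @ W_enc_lv

def reparameterize(mu, log_var):
    # Step 2: sample z = mu + sigma * eps with eps ~ N(0, I), so that
    # gradients can flow through mu and log_var during training.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def decode(z):
    # Step 3: reconstruct a curve from the latent sample (the sigmoid keeps
    # outputs in (0, 1), matching the min-max normalized inputs).
    return 1.0 / (1.0 + np.exp(-(z @ W_dec)))

x = rng.random(n_features)              # a min-max normalized load curve
mu, log_var = encode(x)
x_new = decode(reparameterize(mu, log_var))
```

After training, Step 4 corresponds to calling `decode` on fresh Gaussian noise alone, which is why any number of new curves can be generated without touching the encoder.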
3. Electricity Theft Detection Based on Data Augmentation
3.1. Attack Models for Generation of Stealing Power Curves
In previous works, most of the stealing power curves are obtained by simulation, because it is difficult to detect electricity theft due to the strong concealment of thieves and limited audit resources. In this paper, different attack models (e.g., physical attacks, communication attacks, and data attacks) are utilized to obtain samples with labels [2,13].
Table 1 shows the stealing power curves under different attack models. In Table 1, some types of attack models cause a denial of service, in which case the meter stops reporting consumer information. This is the case for attack models such as altering the routing table, dropping packets, and disconnecting the meter [32]. The more problematic attack models are those that generate fake consumption records imitating a user with a legitimately low power profile. This is the case for session hijacking and for other attack models that permit privileged access to the power meter, such as firmware misconfiguration [33,34].
In type 1, the normal power curve is multiplied by a random number in the range of 0.1 to 0.8 to obtain the stealing power curve. In type 2, the records of consumption are replaced by zeroes during a random period of every day. In type 3, every point of the normal power curve is multiplied by its own random number in the range of 0.1 to 0.8. In type 4, each record of consumption is the product of the mean of the normal power curve and a random noise in the range of 0.1 to 0.8. In type 5, the records of consumption in the low-electrovalence period and the high-electrovalence period are exchanged.
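The five attack types can be written down directly. The sketch below assumes `curve` is a NumPy array holding one day's readings; the random ranges follow the description above, while the low/high-tariff split for type 5 is a hypothetical placeholder (first vs. second half of the day), since the paper does not specify the tariff periods:

```python
import numpy as np

rng = np.random.default_rng(7)

def attack(curve, attack_type):
    """Generate a stealing power curve from a normal daily curve,
    following the five attack types described above."""
    x = curve.copy()
    n = len(x)
    if attack_type == 1:      # scale the whole curve by one random factor
        x *= rng.uniform(0.1, 0.8)
    elif attack_type == 2:    # zero out a random period of the day
        start = rng.integers(0, n - 1)
        end = rng.integers(start + 1, n)
        x[start:end] = 0.0
    elif attack_type == 3:    # scale each point by its own random factor
        x *= rng.uniform(0.1, 0.8, size=n)
    elif attack_type == 4:    # replace readings with the scaled mean consumption
        x = curve.mean() * rng.uniform(0.1, 0.8, size=n)
    elif attack_type == 5:    # swap the two (assumed) tariff-period halves
        half = n // 2
        x = np.concatenate([x[half:], x[:half]])
    return x
```

Applying these transformations to normal curves yields the labeled stealing power curve samples that Table 1 summarizes.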
3.2. Electricity Theft Detection Based on CNN
The input variables of the classifier for electricity theft detection are the power curves, and the output variables are the types of power curves shown in Table 2.
As one of the representative algorithms of deep learning, the CNN has been widely used in image classification, fault diagnosis, and time series prediction due to its powerful feature extraction ability, and it has achieved remarkable results [35,36]. Compared with traditional classification methods, the CNN can not only map more complex nonlinear relationships, but also has good generalization ability. Therefore, this paper selects the CNN as the classifier for electricity theft detection.
The CNN is composed of convolutional layers, pooling layers, flatten layers, dropout layers, and fully connected layers. Specifically, the convolutional layers and pooling layers are responsible for extracting the features of the stealing power curves. Their mathematical formulas are as follows:

$$c_i = \sigma(w_i * x_i + b_i), \qquad p_i = \mathrm{maxpool}(c_i) \tag{9}$$

where $x_i$ denotes the input data of the $i$-th convolutional layer and $c_i$ denotes the output data of the $i$-th convolutional layer. $p_i$ denotes the output data of the $i$-th max-pooling layer. $\sigma$ is the activation function. $b_i$ and $w_i$ denote the offset vector and the weights of the $i$-th convolutional layer, respectively, and $*$ denotes the convolution operation.
To alleviate over-fitting, the dropout layer makes some neurons lose efficacy with a certain probability. The flatten layer is used as the bridge between the pooling layer and the fully connected layer, playing the role of format transformation for the data. The mathematical formula of a fully connected layer is as follows:

$$y_i = \sigma(W_i u_i + d_i) \tag{10}$$

where $\sigma$ is the activation function. $d_i$ and $W_i$ denote the offset vector and the weights of the $i$-th fully connected layer, respectively. $y_i$ and $u_i$ denote the output and input data of the $i$-th fully connected layer, respectively.
According to the characteristics of the power curves, the optimal structure and parameters of the CNN are obtained after many experiments, as shown in Figure 2. In order to process the data conveniently, a 0 element is added at the end of the power curve, and the resulting 1 × 49 vector is reshaped into a 7 × 7 × 1 tensor as the input data of the convolutional layer. Then, two convolutional layers and max-pooling layers are used to extract the key features of the power curves. The numbers of filters in the two convolutional layers are 16 and 32, respectively. The convolutional size and pooling size are both 2 × 2. There is a dropout layer behind the pooling layer, which makes neurons lose efficacy with a probability of 0.25. After the flatten layer, there are two fully connected layers with 15 and 6 neurons, respectively. Activation functions are mathematical equations that determine the output of a neural network; the function attached to each neuron determines whether it should be activated, based on whether that neuron's input is relevant for the model's prediction. Common activation functions include the Sigmoid, Tanh, Softmax, and ReLU functions. Specifically, the Sigmoid function is usually used to normalize the output of the last layer of a neural network for forecasting tasks, while the Softmax function is usually used as the classifier of a neural network for multi-class classification. Previous works show that the Tanh function suffers from vanishing gradients and is computationally expensive [11]. Therefore, the last layer uses the Softmax function as its activation function, and the remaining layers use the ReLU function. The loss function is the categorical cross-entropy, and the optimizer is the Adadelta algorithm.
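The shape arithmetic of this architecture can be checked with a short helper. This is a sketch under two assumptions not stated in the paper: the 49-point vector is laid out as 7 × 7 (since 49 = 7 × 7), and the 2 × 2 convolutions use 'same' padding (with 'valid' padding the spatial size would collapse before the second stage):

```python
def pool_out(size, stride=2):
    # Non-overlapping 2x2 max pooling: floor division halves the side length.
    return size // stride

side, channels = 7, 1
for filters in (16, 32):
    # A 2x2 convolution with (assumed) 'same' padding keeps the spatial
    # size; the subsequent 2x2 max pooling then halves it.
    side = pool_out(side)
    channels = filters

# Size of the feature vector entering the fully connected layers.
flattened = side * side * channels
```

Under these assumptions the curve shrinks 7 → 3 → 1 spatially while the channels grow 1 → 16 → 32, so the flatten layer hands a 32-dimensional vector to the dense layers.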
3.3. The Process of the Proposed Methods
Summarizing the above analysis, the process of electricity theft detection based on data augmentation is shown in Figure 3. The specific steps are as follows:
Step 1: After importing the data set, it is divided into a training set, a validation set, and a test set. The one-hot encoding method is used to represent the seven types of power curves, and the min-max normalization method is used to normalize the raw data.
Step 2: In the encoding stage, the stealing power curves are mapped into latent variables by the encoder. In the decoding stage, new stealing power curves are obtained by feeding Gaussian noises to the decoder. Then, the loss function is calculated to update the weights of the network. After the training of the CVAE, a large number of Gaussian noises are fed to the decoder of the CVAE to generate new samples for training the CNN.
Step 3: The samples generated by the CVAE and the original samples from the training set are used to train the CNN. In the training process, the features of the input variables are extracted by the convolutional and pooling layers, and the labels output by the fully connected layer are used to calculate the loss function. Finally, the back-propagation algorithm is used to update the weights of the CNN. After training, the CNN is used to distinguish whether an unknown sample is a stealing power curve or a normal power curve.
Step 4: For a multi-class classification problem, it is too simplistic to evaluate the performance of the model by accuracy alone. In this paper, the Macro F1 and G-mean are used to evaluate the performance of the CNN on the test set [37,38].
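The two evaluation metrics of Step 4 can be computed directly from a confusion of predictions. The sketch below uses the common definitions: Macro F1 is the unweighted mean of per-class F1 scores, and G-mean is taken here as the geometric mean of the per-class recalls (the paper's exact G-mean variant is not stated, so this definition is an assumption):

```python
import numpy as np

def macro_f1_and_gmean(y_true, y_pred, n_classes):
    """Macro F1 (unweighted mean of per-class F1) and G-mean
    (geometric mean of per-class recalls, one common definition)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1s, recalls = [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        recalls.append(recall)
    return float(np.mean(f1s)), float(np.prod(recalls)) ** (1 / n_classes)

# Sanity check: perfect predictions give 1.0 for both metrics.
y_true = [0, 1, 2, 0, 1, 2]
macro_f1, g_mean = macro_f1_and_gmean(y_true, y_true, 3)
```

Because the G-mean multiplies the per-class recalls, a classifier that ignores any single attack type is penalized heavily, which is why these metrics are preferred over plain accuracy for the imbalanced theft-detection setting.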
6. Conclusions
Due to the strong concealment of electricity theft and the limitation of inspection resources, the number of electricity theft samples held by the power department is insufficient, which limits the accuracy of electricity theft detection. Therefore, a data augmentation method for electricity theft detection based on a conditional variational auto-encoder is proposed. The following conclusions are drawn from the simulations:
- (1).
The training process of the CVAE is very stable, and its convergence speed is fast. The generated stealing power curves have shape and distribution characteristics similar to those of the original stealing power curves.
- (2).
After data augmentation by the CVAE, the accuracy, Macro F1, and G-mean of the CNN are improved by 7.00%, 6.65%, and 6.01%, respectively, compared with the original training set. Compared with existing data augmentation methods (e.g., ROS, SMOTE, and CGAN), the accuracy, Macro F1, and G-mean values of the CNN are the largest, which indicates that the new samples generated by the CVAE yield the strongest improvement in detection performance.
- (3).
Compared with the original training set, the training set augmented by the CVAE improves the comprehensive detection performance of classifiers such as the CNN, MLP, SVM, and XGBoost, which indicates that the CVAE is suitable for different classifiers.
In future work, other generative networks (e.g., flow-based networks) can be tried to model the stealing power curves. In addition, a capsule network can be used to distinguish the stealing curves from the normal curves.