Data Augmentation for Electricity Theft Detection Using Conditional Variational Auto-Encoder

Abstract: Due to the strong concealment of electricity theft and the limited inspection resources, the number of electricity theft samples available to power utilities is insufficient, which limits the accuracy of electricity theft detection. Therefore, a data augmentation method for electricity theft detection based on the conditional variational auto-encoder (CVAE) is proposed. First, the stealing power curves are mapped into low-dimensional latent variables by an encoder composed of convolutional layers, and new stealing power curves are reconstructed by a decoder composed of deconvolutional layers. Then, five typical attack models are proposed, and a convolutional neural network (CNN) is constructed as a classifier according to the data characteristics of the stealing power curves. Finally, the effectiveness and adaptability of the proposed method are verified on a smart meter data set from London. The simulation results show that the CVAE can capture both the shapes and the distribution characteristics of the samples, and the generated stealing power curves improve the performance of the classifier more than traditional augmentation methods such as random oversampling (ROS), the synthetic minority over-sampling technique (SMOTE), and the conditional generative adversarial network (CGAN). Moreover, the method is suitable for different classifiers.


Introduction
The electrical loss includes non-technical loss and technical loss. Technical loss is an unavoidable loss in the process of power transmission, which is determined by power loads and parameters of power supply equipment. Non-technical loss is caused by wrong measurement, electricity theft, and non-payment by consumers [1]. In recent years, the U.S. has lost USD 6 billion every year due to electricity theft, according to a report by Forbes magazine [2]. Therefore, the detection of electricity theft is of great significance to reduce non-technical loss.
The existing methods for electricity theft detection can be divided into supervised classification and unsupervised regression. Unsupervised regression methods determine electricity theft by comparing the deviation between the actual and predicted values of the power load [3]. This kind of method does not need a labeled data set to train the model, but it is difficult to set the threshold and the detection accuracy is low [4,5]. Supervised classification methods mainly include traditional data mining models such as the support vector machine (SVM), multi-layer perceptron (MLP), Bayesian network, and extreme gradient boosting tree (XGBoost) [6][7][8][9][10], as well as newer deep learning techniques such as the deep belief network and convolutional neural network (CNN) [11][12][13]. Specifically, SVM is well suited to binary classification; for n types of stealing power curves, it needs to combine multiple binary classifiers. The main contributions of this paper are as follows:
1. The proposed CVAE has a strong generalization ability and can generate many stealing power curves similar to those in the test set through unsupervised learning. As long as Gaussian noises are input to the decoder of the CVAE, any number of stealing power curve samples can be generated to train the deep neural network.
2. Compared with ROS and SMOTE, the samples generated by the CVAE not only have diversity, but also capture the probability distribution characteristics of the stealing power curves. In addition, the training process of the CVAE is more stable than that of the CGAN, and it can generate new samples of higher quality.
3. After data augmentation of the training set by the CVAE, the detection accuracy of the deep neural network can be significantly improved, and the method is suitable for different classifiers.
The rest of this paper is organized as follows: Section 2 proposes the conditional variational auto-encoder for data augmentation. Section 3 introduces the process of electricity theft detection based on CNN. The simulation and results are shown in Section 4. The conclusions are described in Section 5.

Conditional Variational Auto-Encoder
Formally, the variational auto-encoder (VAE) learns the data distribution p_θ(X) of the stealing power curves from the historical data X = {x_1, x_2, ..., x_n}. Typically, this data distribution can be decomposed as follows [30]:

p_θ(X) = ∏_{i=1}^{n} p_θ(x_i), (1)

where ∏ denotes the product over all terms of the series. To avoid numerical problems, the logarithm is applied, giving:

log p_θ(X) = ∑_{i=1}^{n} log p_θ(x_i). (2)

Each data point of the stealing power curves is associated with a latent variable z that explains the generative process. Equation (1) can be rewritten for a single point as:

p_θ(x) = ∫ p_θ(x|z) p_θ(z) dz, (3)

where ∫ denotes a definite integral over the latent space. The generation procedure for stealing power curves involves two steps. First, the prior p_θ*(z) is sampled to obtain the latent variable z. Then, the stealing power curve x is generated according to the conditional distribution p_θ*(x|z). Unfortunately, the prior p_θ*(z) and the conditional p_θ*(x|z) are not available. To estimate them, the posterior

p_θ(z|x) = p_θ(x|z) p_θ(z) / p_θ(x) (4)

needs to be known; hence, exact inference is very difficult. Since the posterior is often very complex, a simpler distribution q_φ(z|x) with parameters φ is used to approximate it. The log-likelihood log p_θ(x_i) needs to be estimated, because it is impossible to directly sample the distribution of the stealing power curves. Therefore, the Kullback-Leibler divergence can be combined with the variational lower bound:

log p_θ(x) = D_KL(q_φ(z|x) ǁ p_θ(z|x)) + L(θ, φ; x), (5)

where D_KL(q_φ(z|x) ǁ p_θ(z|x)) is the Kullback-Leibler divergence between the distribution q_φ(z|x) and the posterior p_θ(z|x); q_φ is the probability distribution to be learned, and p_θ(z) is the prior distribution of the latent variables. Since this Kullback-Leibler divergence is non-negative, the term L(θ, φ; x) acts as a lower bound of the log-likelihood:

log p_θ(x) ≥ L(θ, φ; x). (6)

In this case, the term L(θ, φ; x) can be written as:

L(θ, φ; x) = −D_KL(q_φ(z|x) ǁ p_θ(z)) + E_{q_φ(z|x)}[log p_θ(x|z)], (7)

where the first term −D_KL(q_φ(z|x) ǁ p_θ(z)) constrains the function q_φ(z|x) to the shape of the prior p_θ(z), and the second term E_{q_φ(z|x)}[log p_θ(x|z)] reconstructs the input data from the given latent variable z through p_θ(x|z).
With this optimization goal L, the model can be parameterized as:

q_φ(z|x) = N(z; μ, σ²I), with μ = f(x), σ = g(x), (8)

where f and g are deep neural networks with their respective sets of parameters. A more detailed derivation of the VAE can be found in [27]. A stealing power curve may have different shapes depending on the attack method, such as a physical attack, data attack, or communication attack. In order to make the variational auto-encoder generate stealing power curves for a specified attack method, labels should be added in the training stage. The conditional distribution p_θ(x|y) then replaces the original distribution p_θ(x), and the term L(θ, φ; x, y) of the CVAE can be written as [31]:

L(θ, φ; x, y) = −D_KL(q_φ(z|x, y) ǁ p_θ(z|y)) + E_{q_φ(z|x,y)}[log p_θ(x|z, y)]. (9)
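For a Gaussian q_φ(z|x) = N(μ, σ²I) and a standard normal prior, the KL term in L(θ, φ; x) has a well-known closed form, D_KL = −½ ∑(1 + log σ² − μ² − σ²). The two loss terms can be sketched in NumPy as follows (function names and the MSE reconstruction surrogate are illustrative, not from the paper):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2 I) || N(0, I) ), summed over latent dims."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

def reconstruction_mse(x, x_hat):
    """Simple surrogate for the expected log-likelihood term: mean squared error."""
    return np.mean((x - x_hat) ** 2)

def cvae_loss(x, x_hat, mu, log_var):
    """Negative ELBO: reconstruction error plus the KL regularization term."""
    return reconstruction_mse(x, x_hat) + kl_to_standard_normal(mu, log_var)
```

Note that the KL term vanishes exactly when the approximate posterior already equals the prior (μ = 0, σ = 1), which is what Equation (7) rewards.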

Data Augmentation for Stealing Power Curves
The main advantage of the CVAE is that it does not need to assume the probability distribution of the stealing power curves, and only a few samples are needed to train the model, which can then generate samples similar to the original stealing power curves. A summary of this process for generating stealing power curves is represented in Figure 1.

Step 1: The input data of the CVAE are the stealing power curves and their labels. Before inputting the data into the CVAE, it is necessary to normalize the stealing power curves; otherwise, the CVAE may not converge. In this paper, the min-max normalization method is used to transform the input data into values between 0 and 1.
Step 2: A deep convolutional network with a strong feature extraction ability is used to construct the encoder, which maps the input data to low-dimensional latent variables. Then, the mean and variance of the encoder output are calculated and used to generate the corresponding Gaussian noises as the input data of the decoder.
Step 3: The Gaussian noises are fed to the decoder, composed of a deep transposed convolutional network, to generate new stealing power curves. Then, the output data of the decoder and the actual data are used to calculate the loss function, which updates the weights of the encoder and decoder by the back-propagation method.
Step 4: After training the CVAE, Gaussian noises are fed to the decoder to generate stealing power curves under the specified attack model. Furthermore, the generated stealing power curves and the original samples from the training set are used to train a classifier (e.g., CNN), which distinguishes whether an unknown sample is a stealing power curve or a normal power curve.
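Steps 2 and 3 hinge on the reparameterization trick: the encoder outputs a mean and a log-variance, and the latent sample is formed as z = μ + σ·ε with ε ~ N(0, I), so that gradients can flow through the sampling step. The numerical skeleton of Steps 1, 2, and 4 can be sketched in NumPy as follows (the encoder and decoder networks are abstracted away as callables; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def min_max_normalize(x):
    """Step 1: scale a power curve into the [0, 1] range."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)

def reparameterize(mu, log_var):
    """Step 2: sample z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def generate(decoder, label, n_samples, latent_dim=32):
    """Step 4: feed Gaussian noises (plus the attack label) to a trained decoder."""
    z = rng.standard_normal((n_samples, latent_dim))
    return decoder(z, label)
```

With a trained decoder in place of the stub, `generate` can produce any number of labeled stealing power curves for the classifier's training set.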

Attack Models for Generation of Stealing Power Curves
In previous works, most of the stealing power curves are obtained by simulation, because electricity theft is difficult to detect due to the strong concealment of thieves and limited audit resources. In this paper, different attack models (e.g., physical attack, communication attack, and data attack) are utilized to obtain labeled samples [2,13]. Table 1 shows the stealing power curves under different attack models. Some types of attack models cause a denial of service, in which case the meter stops reporting consumer information; this is the case for attack models such as altering the routing table, dropping packets, and disconnecting the meter [32]. The more problematic attack models are those that generate fake consumption records imitating a user with a legitimately low power profile; this is the case for session hijacking and other attacks that permit privileged access to the power meter and its firmware [33,34].

Table 1. Attack models for generating stealing power curves.
Type 1: The whole normal power curve is multiplied by a single random number in the range of 0.1 to 0.8 to obtain the stealing power curve.
Type 2: The consumption records are replaced by zeroes during a random period of every day.
Type 3: Every point of the normal power curve is multiplied by an independent random number in the range of 0.1 to 0.8.
Type 4: Each consumption record is the product of the mean of the normal power curve and a random noise in the range of 0.1 to 0.8.
Type 5: The consumption records of the low-electrovalence (off-peak tariff) period and the high-electrovalence (peak tariff) period are exchanged.
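Following the descriptions in Table 1, the five attack types can be sketched as NumPy transformations of a normal daily curve x (48 half-hourly readings). This is an illustrative sketch; in particular, the positions of the off-peak and peak tariff windows in type 5 are assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(42)

def type1(x):
    """Whole curve scaled by one random factor in [0.1, 0.8)."""
    return rng.uniform(0.1, 0.8) * x

def type2(x):
    """Records zeroed out during a random contiguous period of the day."""
    y = x.copy()
    start = int(rng.integers(0, len(x) - 1))
    end = int(rng.integers(start + 1, len(x) + 1))
    y[start:end] = 0.0
    return y

def type3(x):
    """Each point scaled by its own random factor in [0.1, 0.8)."""
    return rng.uniform(0.1, 0.8, size=len(x)) * x

def type4(x):
    """Mean consumption multiplied by pointwise random noise in [0.1, 0.8)."""
    return x.mean() * rng.uniform(0.1, 0.8, size=len(x))

def type5(x, low=slice(0, 16), high=slice(32, 48)):
    """Exchange the off-peak and peak tariff windows (window positions assumed)."""
    y = x.copy()
    y[low], y[high] = x[high].copy(), x[low].copy()
    return y
```

Applying these functions to randomly selected normal curves yields labeled stealing power curves for each of the five classes.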

Electricity Theft Detection Based on CNN
The input variables of the classifier for electricity theft detection are the power curves, and the output variables are the types of power curves shown in Table 2. As one of the representative algorithms of deep learning technologies, CNN has been widely used in image classification, fault diagnosis, and time series prediction due to its powerful feature extraction ability and has achieved remarkable results [35,36]. Compared with the traditional classification methods, CNN can not only map more complex nonlinear relationships, but also has good generalization ability. Therefore, this paper selects CNN as the classifier for electricity theft detection.
CNN is composed of convolutional layers, pooling layers, flatten layers, dropout layers, and fully connected layers. Specifically, the convolutional layers and pooling layers are responsible for extracting the features of the stealing power curves. Their mathematical formulas are as follows:

y_i = f_i(w_i ⊗ x_i + b_i),

y'_i = maxpool(y_i),

where x_i denotes the input data of the i-th convolutional layer and y_i denotes its output data; y'_i denotes the output data of the i-th max-pooling layer; f_i is the activation function; b_i and w_i denote the offset vector and the weights of the i-th convolutional layer, respectively; and ⊗ denotes the convolution operation.
To alleviate over-fitting, the dropout layer makes some neurons lose efficacy with a certain probability. The flatten layer is used as the bridge between the pooling layer and the fully connected layer, playing the role of format transformation for the data. The mathematical formula of a fully connected layer is as follows:

y_i = g_i(w_i y_{i−1} + b_i),

where g_i is the activation function; b_i and w_i denote the offset vector and the weights of the i-th fully connected layer, respectively; and y_i and y_{i−1} denote the output and input data of the i-th fully connected layer, respectively. According to the characteristics of the power curves, the optimal structure and parameters of the CNN were obtained after many experiments, as shown in Figure 2.
In order to process the data conveniently, a 0 element is added at the end of the power curve, and the 1 × 49 vector is reshaped into a 7 × 7 tensor as the input data of the convolutional layer. Then, two convolutional layers and max-pooling layers are used to extract the key features of the power curves. The numbers of filters in the two convolutional layers are 16 and 32, respectively. The convolutional kernel size and pooling size are both 2 × 2. There is a dropout layer behind the pooling layer, which makes neurons lose efficacy with a probability of 0.25. After the flatten layer, there are two fully connected layers with 15 and 6 neurons, respectively.
Activation functions are mathematical equations that determine the output of a neural network. A function is attached to each neuron in the network and determines whether that neuron should be activated, based on whether its input is relevant to the model's prediction. Common activation functions include the Sigmoid function, Tanh function, Softmax function, and ReLU function. Specifically, the Sigmoid function is usually used to normalize the output of the last layer of a neural network for forecasting tasks, while the Softmax function is usually used as the classifier of a neural network for multi-class classification. As for the Tanh function, previous works show that it suffers from vanishing gradients and is computationally expensive [11]. Therefore, the last layer uses the Softmax function as its activation function, and the remaining layers use the ReLU function. The loss function is the categorical cross-entropy, and the optimizer is the Adadelta algorithm.
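The layer-by-layer tensor shapes implied by this architecture can be checked with a small NumPy forward pass implementing y_i = f_i(w_i ⊗ x_i + b_i) followed by max pooling. This is an illustrative sketch, not the paper's implementation: the weights are random, 'same' padding with stride-1 convolution is assumed, and pooling is non-overlapping 2 × 2:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda t: np.maximum(t, 0.0)

def conv2d(x, n_filters, k=2):
    """'Same'-padded, stride-1 2-D convolution: y = relu(w (*) x + b)."""
    h, w_, c = x.shape
    xp = np.pad(x, ((0, k - 1), (0, k - 1), (0, 0)))
    w = rng.standard_normal((k, k, c, n_filters)) * 0.1
    b = np.zeros(n_filters)
    y = np.zeros((h, w_, n_filters))
    for i in range(h):
        for j in range(w_):
            y[i, j] = np.tensordot(xp[i:i + k, j:j + k], w, axes=3) + b
    return relu(y)

def maxpool(x, k=2):
    """Non-overlapping k x k max pooling (odd borders are truncated)."""
    h, w_, c = x.shape
    x = x[: h // k * k, : w_ // k * k]
    return x.reshape(h // k, k, w_ // k, k, c).max(axis=(1, 3))

curve = rng.random(48)                       # one half-hourly power curve
x = np.append(curve, 0.0).reshape(7, 7, 1)   # pad to 49 points, reshape to 7x7x1
x = maxpool(conv2d(x, 16))                   # -> (3, 3, 16)
x = maxpool(conv2d(x, 32))                   # -> (1, 1, 32)
features = x.reshape(-1)                     # flattened input to the dense layers
```

The shape trace confirms that two conv/pool stages reduce the 7 × 7 input to a compact feature vector before the fully connected layers.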

The Process of the Proposed Methods
Summarizing the above analysis, the process of electricity theft detection based on data augmentation is shown in Figure 3. The specific steps are as follows: Step 1: After importing the data set, it is divided into a training set, a validation set, and a test set. One-hot encoding is used to represent the seven types of power curves, and the min-max normalization method is used to normalize the raw data.
Step 2: In the coding stage, the stealing power curves are mapped into latent variables by encoder. In the decoding stage, the new stealing power curves are obtained by feeding Gaussian noises to decoder. Then, the loss function is calculated to update the weights of the network. After the training of CVAE, a large amount of Gaussian noises are fed to the decoder of CVAE to generate new samples for training CNN.
Step 3: The samples generated by CVAE and the original samples from training set are used to train CNN. In the training process, the features of input variables are extracted by convolutional layers and pooling layers, and the labels output by a fully connected layer are used to calculate the loss function. Finally, the back-propagation algorithm is used to update the weights of CNN. After training CNN, it will be used to distinguish whether the unknown sample is a stealing power curve or a normal power curve.
Step 4: For a multi-class classification problem, evaluating the model by accuracy alone is too simplistic. In this paper, the Macro F1 score and G-mean are also used to evaluate the performance of the CNN on the test set [37,38].
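Macro F1 is the unweighted average of the per-class F1 scores, and the G-mean is the geometric mean of the per-class recalls; both penalize a classifier that neglects minority classes, which plain accuracy hides. A small NumPy sketch under those standard definitions:

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return float(np.mean(scores))

def g_mean(y_true, y_pred, n_classes):
    """Geometric mean of per-class recalls."""
    recalls = []
    for c in range(n_classes):
        mask = y_true == c
        recalls.append(np.mean(y_pred[mask] == c) if mask.any() else 0.0)
    return float(np.prod(recalls) ** (1.0 / n_classes))
```

Because the G-mean is a product of recalls, a single class with zero recall drives the whole score to zero, which is exactly the imbalance sensitivity desired here.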


Data Description
To illustrate the effectiveness of the proposed methods, the data set of smart meters from London is used for simulation and analysis [39]. In this dataset, the time resolution of the power curve is 30 min, which means that each power curve has 48 points. Some samples are randomly selected to generate the stealing power curves based on the attack models proposed in Section 3.1. For example, in order to generate the stealing power curves in type 1, a normal curve is randomly selected from the data set. Then, this normal power curve is multiplied by a random number in the range of 0.1 to 0.8 to get the stealing power curve in type 1. In this case, the power curve with a label can be obtained through the attack models. Furthermore, the CVAE model is used to expand the number of training samples to twice the original number as shown in Table 3. Specifically, the samples in the validation set and test set do not change after data augmentation.
Except for SVM, the algorithms in this paper (e.g., CVAE, CNN, MLP, XGBoost, ROS, SMOTE, and CGAN) were run in Spyder (Python 3.7) with Keras 2.2.4 and TensorFlow 1.12.0, on a computer with 16 GB of memory and a 3.8 GHz Intel Core(TM) i5-7400 CPU.

Figure 4 shows the structure and parameters of the CVAE.
The input data of the CVAE are vectors of 1 × 48 scale. A 0 is added at the end of these vectors, turning them into vectors of 1 × 49 scale; the reshape function then transforms these vectors into matrices of 7 × 7 × 1 scale. The encoder includes four convolutional layers, one flatten layer, and three fully connected layers. Specifically, the first convolutional layer includes one filter, and the remaining three convolutional layers have 16 filters each. The kernel size of the first two layers is 2, and that of the last two layers is 3. The activation function of all convolutional layers is the ReLU function, and every convolutional layer is followed by a batch normalization layer. The flatten layer is used as the bridge between the convolutional layers and the fully connected layers, playing the role of format transformation for the data. To calculate the KL divergence loss and sample the latent variable, the encoder ends with two fully connected layers of 32 neurons each, for the mean and the variance. The input data of the decoder are Gaussian noises of 1 × 32 scale. Two fully connected layers, three deconvolutional layers, and one convolutional layer constitute the decoder. Specifically, the numbers of neurons in the fully connected layers are 64 and 256, respectively. The deconvolutional layers all have 16 filters with a kernel size of 3, and the final convolutional layer has 1 filter with a kernel size of 2. The activation functions are all ReLU functions. In addition, the optimizer is the RMSprop algorithm. In order to observe the training stability of the CVAE, Figure 5 visualizes its evolution process.
Obviously, the loss function decreases rapidly as the number of iterations increases. When the number of iterations exceeds 40, the value of the loss function levels off, which indicates that the CVAE has entered the convergence state. The training process of the CVAE is very stable, unlike the loss function of the CGAN, which fluctuates violently and is difficult to converge.

Performance of CVAE
Energies 2020, 13, x FOR PEER REVIEW 9 of 14 that CVAE has entered the convergence state. The training process of CVAE is very stable, not like the loss function of CGAN which fluctuates violently and is difficult to converge. After training CVAE, the Gaussian noises of 1 × 32 scales are used as input variables of the decoder, and a large number of new stealing power curves are obtained. Then, some new stealing power curves are selected to verify the effectiveness of the power curves generated by CVAE. Next, the Euclidean distance of each power curve in the test set and the new power curve generated by CVAE is calculated, and the power curve in the test set with the minimum Euclidean distance is selected. Finally, Figure 6 visualizes the shapes of the generated power curves and the real power curves. It can be seen from Figure 6 that the stealing power curves generated by CVAE are very close to those from the test set. The stealing power curves in the test set do not participate in the training process of the CVAE, which indicates that CVAE has a strong generalization ability, and the stealing power curves generated by CVAE are very in line with the actual scene.
In addition to comparing the shape similarity of stealing electricity curves, the validity of CVAE can be further verified by the probability density function (PDF). It can be seen from the Figure 7 that the probability distribution functions of the stealing power curves generated by CVAE are very close to those from the test set, which indicates that CVAE can not only learn the shape of the stealing power curves, but also capture the distribution characteristics of historical data to generate high-quality samples. After training CVAE, the Gaussian noises of 1 × 32 scales are used as input variables of the decoder, and a large number of new stealing power curves are obtained. Then, some new stealing power curves are selected to verify the effectiveness of the power curves generated by CVAE. Next, the Euclidean distance of each power curve in the test set and the new power curve generated by CVAE is calculated, and the power curve in the test set with the minimum Euclidean distance is selected. Finally, Figure 6 visualizes the shapes of the generated power curves and the real power curves.

Loss function
Energies 2020, 13, x FOR PEER REVIEW 9 of 14 that CVAE has entered the convergence state. The training process of CVAE is very stable, not like the loss function of CGAN which fluctuates violently and is difficult to converge. After training CVAE, the Gaussian noises of 1 × 32 scales are used as input variables of the decoder, and a large number of new stealing power curves are obtained. Then, some new stealing power curves are selected to verify the effectiveness of the power curves generated by CVAE. Next, the Euclidean distance of each power curve in the test set and the new power curve generated by CVAE is calculated, and the power curve in the test set with the minimum Euclidean distance is selected. Finally, Figure 6 visualizes the shapes of the generated power curves and the real power curves. It can be seen from Figure 6 that the stealing power curves generated by CVAE are very close to those from the test set. The stealing power curves in the test set do not participate in the training process of the CVAE, which indicates that CVAE has a strong generalization ability, and the stealing power curves generated by CVAE are very in line with the actual scene.
In addition to comparing the shape similarity of stealing electricity curves, the validity of CVAE can be further verified by the probability density function (PDF). It can be seen from the Figure 7 that the probability distribution functions of the stealing power curves generated by CVAE are very close to those from the test set, which indicates that CVAE can not only learn the shape of the stealing power curves, but also capture the distribution characteristics of historical data to generate high-quality samples. It can be seen from Figure 6 that the stealing power curves generated by CVAE are very close to those from the test set. The stealing power curves in the test set do not participate in the training process of the CVAE, which indicates that CVAE has a strong generalization ability, and the stealing power curves generated by CVAE are very in line with the actual scene.


Performance Comparison of Different Methods for Data Augmentation
To illustrate the effectiveness of the stealing power curves generated by CVAE, ROS, SMOTE, and CGAN are used as baselines. Each method is used to expand the samples of the training set to train a classifier (e.g., CNN), and the results of the classifier on the test set are shown in Table 4.
As can be seen from Table 4, the detection performance of CNN is significantly improved after data augmentation by each method. Specifically, after data augmentation by ROS, the accuracy, Macro F1, and G-mean of CNN are improved by 2.00%, 1.64%, and 1.69%, respectively, compared with the original training set. After data augmentation by SMOTE, they are improved by 3.50%, 2.90%, and 3.33%, respectively; after data augmentation by CGAN, by 4.46%, 4.46%, and 4.68%, respectively; and after data augmentation by CVAE, by 7.00%, 6.65%, and 6.01%, respectively. Therefore, compared with the existing methods for data augmentation, the proposed CVAE can expand the training set according to the actual shape and distribution characteristics of the stealing power curves, and it yields the largest improvement in CNN performance.
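For reference, the three reported metrics can be computed from a confusion matrix as sketched below; the example matrix is hypothetical, not taken from Table 4. G-mean is the geometric mean of the per-class recalls, and Macro F1 is the unweighted mean of the per-class F1 scores.

```python
import numpy as np

def metrics(cm):
    """Accuracy, Macro F1, and G-mean from a confusion matrix (rows = true class)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    recall = tp / cm.sum(axis=1)       # per-class recall
    precision = tp / cm.sum(axis=0)    # per-class precision
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()
    macro_f1 = f1.mean()
    g_mean = recall.prod() ** (1.0 / len(recall))
    return accuracy, macro_f1, g_mean

cm = [[90, 10],    # hypothetical two-class confusion matrix:
      [20, 80]]    # rows = true labels, columns = predicted labels
acc, mf1, gm = metrics(cm)
```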


Adaptability Analysis of CVAE
To verify the adaptability of CVAE to different classifiers, CVAE is used to expand the samples of the training set, and the performance of different classifiers (e.g., CNN, MLP, SVM, and XGBoost) after data augmentation is then tested. After many experiments, their optimal parameters are found as follows. For MLP, the number of neurons in the input layer is 48, the numbers of neurons in the two hidden layers are 24 and 12, respectively, and the number of neurons in the output layer is equal to the number of categories; the optimizer is root mean square prop (RMSprop), the loss function is cross-entropy, and a dropout layer with a rate of 0.25 is inserted between the fully connected layers to alleviate over-fitting. For SVM, the fitcecoc function from MATLAB 2018a is used to classify the stealing power curves. For XGBoost, min_child_weight is 2, subsample is 0.8, max_depth is 6, eta is 0.1, and gamma is 0.2. The results of the different classifiers on the test set are shown in Table 5. As can be seen from Table 5, the performance of each classifier is greatly improved after data augmentation by CVAE. Specifically, the accuracy, Macro F1, and G-mean of CNN are improved by 7.00%, 6.65%, and 7.01%, respectively; those of MLP by 3.75%, 4.35%, and 4.65%; those of SVM by 4.50%, 4.40%, and 4.26%; and those of XGBoost by 3.00%, 2.73%, and 1.96%. In general, CVAE can effectively improve the accuracy of electricity theft detection through the unsupervised generation of new samples, and it is suitable for different classifiers.
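The MLP architecture described above (48 → 24 → 12 → output) can be sketched as a plain forward pass. This is an illustration only: the weights are random, the output size of 6 classes is an assumption (one normal class plus the five attack models), and training details such as RMSprop, cross-entropy loss, and the 0.25 dropout rate are noted in comments rather than implemented.

```python
import numpy as np

rng = np.random.default_rng(2)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Layer sizes from the paper: 48 inputs (half-hourly readings of one day),
# hidden layers of 24 and 12 neurons, output = number of categories
# (6 assumed here: normal consumption plus five attack models).
sizes = [48, 24, 12, 6]
weights = [rng.standard_normal((m, n)) * 0.1
           for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(x):
    """Forward pass only. In training, the paper uses RMSprop with a
    cross-entropy loss, and a dropout layer (rate 0.25) between the
    fully connected layers to alleviate over-fitting."""
    for W in weights[:-1]:
        x = relu(x @ W)
    return softmax(x @ weights[-1])

probs = mlp_forward(rng.random((5, 48)))   # class probabilities for 5 curves
```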

Discussion
The objective of this paper is to propose a new method based on CVAE to improve the accuracy of electricity theft detection. The effectiveness of the proposed CVAE has been tested on the smart meter dataset from the Low Carbon London project, and the simulation results show that the accuracy of electricity theft detection can be significantly enhanced after data augmentation by CVAE. The training process of the CVAE model requires labeled power curves. However, the labels of the stealing power curves are difficult to obtain in some cases, which makes it impossible to train the CVAE model. In that case, the traditional VAE can be used to model the different types of stealing power curves: if the data set contains n different kinds of stealing power curves, n VAE models have to be trained, whereas if each stealing power curve has a label, only one CVAE model is needed.

Conclusions
Due to the strong concealment of electricity theft and the limitation of inspection resources, the number of electricity theft samples held by the power department is insufficient, which limits the accuracy of electricity theft detection. Therefore, a data augmentation method for electricity theft detection based on a conditional variational auto-encoder is proposed. The following conclusions are drawn from the simulations: (1) The training process of CVAE is very stable, and its convergence speed is fast. The generated stealing power curves have shapes and distribution characteristics similar to those of the original stealing power curves.