Noise Reduction Power Stealing Detection Model Based on Self-Balanced Data Set

Abstract: In recent years, power theft incidents of various types have occurred frequently, and the training of power-stealing detection models is susceptible to imbalanced data sets and data noise, which leads to detection errors. Therefore, a power-stealing detection model is proposed that is based on an improved conditional generative adversarial network (CWGAN), a stacked convolutional noise reduction autoencoder (SCDAE), and the Light Gradient Boosting Machine (LightGBM). The model performs generative adversarial operations on the original imbalanced power consumption data to balance the electricity data and avoid the interference of the imbalanced data set with classifier training. In addition, noise reduction autoencoders are stacked in a convolutional manner to reduce the dimensionality of the power consumption data, extract data features, and reduce the impact of random noise. Finally, LightGBM is used for power theft detection. The experiments show that CWGAN can effectively balance the distribution of power consumption data. A comparison of detection indicators between the proposed model and various advanced power-stealing detection models on the same data set shows that the proposed model is superior to the other models in the detection of power theft.


Introduction
Power safety is of great significance to social production and citizens' daily life. In recent years, power theft incidents of various types have occurred frequently, causing huge economic losses to the state and power supply companies and disrupting supply for legitimate power consumers. In addition, illegal cross-connection of cables by power thieves keeps transformers at the end of the power grid overloaded for long periods, which directly affects the stability of normal power supply and reasonable power allocation by power supply companies, and also creates serious safety risks. New methods of stealing electricity continue to emerge [1], and private modification of metering equipment has become more professional [2]. The introduction of the national smart grid has brought convenience to power system control, but it has also caused the amount of consumer electricity data to grow exponentially; the annual data volume of large cities already exceeds 10 billion. The professionalization of power theft and the explosion of power consumption data have increased the difficulty of power theft investigation and put forward higher requirements for current automatic power theft detection methods.
To address these problems, many scholars have used machine learning algorithms [3] to analyze users' daily power consumption patterns and build classification models, including decision trees [4], random forest (RF) [5], support vector machines (SVM) [6], and neural networks (NN) [7]. R. Punmiya et al. [8] propose a gradient boosting theft detector (GBTD) based on the latest three gradient
Theft detection is essentially a binary classification of users. To achieve this classification, the data must first be analyzed and data features extracted. However, because of the particular nature of power-stealing behavior, the number of users who steal power is usually much smaller than the number of ordinary users, which results in an extremely imbalanced data set. The extracted features will then be biased towards normal user data [25] and the accuracy will be insufficient, so the imbalance of the original data must be resolved first. The model's data preprocessor divides the daily power consumption vector into a weekly data matrix and uses a conditional generative adversarial network based on the Wasserstein criterion (CWGAN) to balance the power theft data. During training, the data preprocessor extracts the original training set data and trains the adversarial network according to a certain sampling rate. Through the generative adversarial process of CWGAN, effective power-stealing data are generated and mixed with the original training set to form an enhanced data set D. Subsequent models are trained on the enhanced data set.
Because power theft data are large in volume and high in dimension, it is necessary to reduce the dimensionality of the original data, extract features, and optimize the way neurons are connected so that the model can be trained quickly with a limited amount of data. To this end, a stacked convolutional noise reduction autoencoder (SCDAE) for power feature extraction is proposed. It takes the 147 × 7-dimensional enhanced data set as training input and, after a normalized whitening operation and multi-layer noise reduction encoding, outputs typical features that can effectively reconstruct the original data. Compared with the general SDAE, the SCDAE model has more stable and abstract feature extraction capabilities [20,26] and provides better input data for subsequent classifiers.
Finally, LightGBM is used to classify users who steal electricity. The algorithm does not need one-hot feature encoding [27]. Instead, it discretizes the power consumption features extracted by SCDAE directly into bins to form a histogram, which both reduces memory usage and improves classification efficiency. At the same time, it grows subtrees leaf-wise with a depth limit. For the same number of splits, this reduces overfitting while achieving higher model accuracy.

Stealing Data Set Balance Processing
In theft detection, a certain amount of data is required to achieve a good classification training effect. However, because of the particular nature of electricity theft, an analysis of an electricity consumption data set [28] found that the sample of users who steal electricity is much smaller than the sample of normal users: more than 3000 of the more than 40,000 users steal electricity. The power-stealing data, which has fewer samples, is called the positive class, and the normal data, which has more samples, is called the negative class. In addition, the randomness of normal users' electricity consumption behavior causes data noise. As a result, a large number of negative samples lie on the category boundaries and in the overlapping areas of the data set, making it difficult to distinguish power-stealing samples from normal samples.
Energies 2020, 13, 1763
For noisy, imbalanced data sets, training a general classifier often produces a "bias" phenomenon: the classification is biased toward negative samples, and positive samples are difficult to discriminate effectively. Therefore, this paper not only improves feature extraction at the algorithm level but also optimizes the classifier's training data at the source, balancing the distribution of the two classes and improving the detection rate on the test set.
In an imbalanced user electricity consumption data set, the positive power-stealing samples are relatively few, yet the information they contain is often the most critical for effective power-stealing detection. Resampling the power-stealing samples makes it possible to fully extract the data characteristics of power-stealing users. In addition, training a classifier usually demands a very large amount of data, and it is generally difficult to obtain enough positive power-stealing samples to form a training set and a test set. It is therefore necessary to generate positive samples in the classifier training stage to balance the data set. A GAN can learn the original data distribution through its internal generative adversarial mechanism and generate samples from it. This paper takes the conditional generative adversarial network (CGAN) proposed by Mirza [23] as the basis and introduces a conditional label variable. The Wasserstein distance [24] is used instead of the KL (Kullback-Leibler) divergence to evaluate the difference between the generated data distribution and the original data distribution, and an objective function for training the network that matches the characteristics of the power-stealing data is set. Finally, the two classes of samples are processed differently: positive power-stealing samples are generated by the network and negative samples are undersampled, which balances the data set.
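The two-sided treatment described above can be illustrated with a small sketch. It only does the bookkeeping: it undersamples the negative class with a rate alpha and computes how many synthetic positives are needed, assuming (as in Algorithm 1 later in the paper) that n_2 − n_1 samples are generated. The function names and the toy sample counts are hypothetical, and no GAN is trained here.

```python
import random

def undersample(negatives, alpha, seed=0):
    """Keep a random fraction alpha (0 < alpha < 1) of the negative samples."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    rng = random.Random(seed)
    return rng.sample(negatives, int(alpha * len(negatives)))

def synthetic_needed(n_pos, n_neg):
    """Number of generated positives required to match the negative class."""
    return max(n_neg - n_pos, 0)

# Toy data (hypothetical sizes): 30 power-stealing users, 400 normal users.
positives = [("theft", i) for i in range(30)]
negatives = [("normal", i) for i in range(400)]

# Training set for the generative network: all positives plus sampled negatives.
gan_training_set = positives + undersample(negatives, alpha=0.25)
n_generate = synthetic_needed(len(positives), len(negatives))
print(len(gan_training_set), n_generate)  # 130 370
```

The value of alpha is a free choice in this sketch; the paper only constrains it to lie between 0 and 1.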

Design of the Conditional Wasserstein GAN (CWGAN) for Power-Stealing Data
The classic generative adversarial network consists of a generative model and a discriminant model. The generative model, denoted G, takes random data x as input and generates G(x). Through training, the distribution Φ_g of G(x) comes to approximate the true sample distribution Φ_r, and the discriminant model D evaluates the degree of difference between the two distributions. During training, the generative model G and the discriminant model D are updated alternately until the discriminant model can hardly distinguish real data from data produced by the generative model. The overall objective function is given by Equation (1).
In the formula, E denotes the expectation. The initial parameters of the discriminant function D(x) can be arbitrary, and the target discriminant model is learned from data samples. The optimal discriminant model satisfies Equation (2).
When the optimal discriminant model exists, the objective of the optimal generative model is
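The displayed equations referenced in this passage are not present in this copy of the text. Assuming they follow the standard GAN formulation of Goodfellow et al. (an assumption; the paper's own notation may differ), they would read as below. Here z denotes the random input to G, drawn from a prior Φ_z, where the text above writes x for the same quantity.

```latex
\begin{align}
\min_G \max_D V(D, G) &= \mathbb{E}_{x \sim \Phi_r}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim \Phi_z}\big[\log\big(1 - D(G(z))\big)\big] \\
D^{*}(x) &= \frac{\Phi_r(x)}{\Phi_r(x) + \Phi_g(x)} \\
C(G) &= -\log 4 + 2\,\mathrm{JS}\big(\Phi_r \,\|\, \Phi_g\big)
\end{align}
```

The last line shows why the JS divergence appears in the discussion that follows: with the optimal discriminator fixed, minimizing the generator objective is equivalent to minimizing the JS divergence between Φ_r and Φ_g.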
To generate power-stealing data with the generative adversarial network architecture, the following notation is introduced: the original data set is denoted D_t, the original power consumption data is denoted P_k, and the power consumption data produced by the generative model is denoted P̃_k. The loss function in the training process of the generative model is expressed in matrix form.
Because the power consumption distribution of power-stealing users is discontinuous, there exists an optimal discriminator D* between Φ_g and Φ_r that achieves the target classification with 100% probability and whose gradient is 0 on the sampled data set [29]. As the neural network approaches D* during gradient-based learning, the gradient vanishes and learning becomes difficult to continue. In this case, the KL divergence used to evaluate the closeness of the two distributions tends to infinity, and the JS divergence is a constant. Therefore, the Wasserstein distance is used instead of the KL divergence to measure the difference between Φ_g(x) and Φ_r(x) when generating power theft data.
where C is the minimum radius of the neighborhood containing the support sets of Φ_g and Φ_r, and ε represents noise. Arjovsky proved that, when φ_θ denotes the probability distribution of g_θ(x; θ), where g_θ is the generator function, W(Φ_r, Φ_g) is continuous whenever g_θ is continuous with respect to θ. The Kantorovich-Rubinstein (K-R) duality shows that W(Φ_r, Φ_g) can be written in a dual form. The neural network parameter w is updated continuously through back propagation, which yields the objective function of the discriminant model (Equation (7)).
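For reference, the standard definitions behind this passage are given below as a sketch. They are the usual Wasserstein-1 distance, its Kantorovich-Rubinstein dual, and the resulting critic objective from the WGAN literature, not a recovery of the paper's exact Equations (5) to (7), whose numbering and notation are assumed. The critic ϕ(x, w) stands in for the 1-Lipschitz function f.

```latex
\begin{align}
W(\Phi_r, \Phi_g) &= \inf_{\gamma \in \Pi(\Phi_r, \Phi_g)}
  \mathbb{E}_{(x, y) \sim \gamma}\big[\lVert x - y \rVert\big] \\
W(\Phi_r, \Phi_g) &= \sup_{\lVert f \rVert_{L} \le 1}
  \mathbb{E}_{x \sim \Phi_r}\big[f(x)\big] - \mathbb{E}_{x \sim \Phi_g}\big[f(x)\big] \\
\max_{w}\; & \mathbb{E}_{x \sim \Phi_r}\big[\phi(x, w)\big]
  - \mathbb{E}_{z \sim \Phi_z}\big[\phi(G(z), w)\big]
\end{align}
```

Unlike the KL and JS divergences, the first expression stays finite and varies smoothly even when the supports of Φ_r and Φ_g do not overlap, which is the property the text relies on.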
ϕ(x, w) is used to approximate f(x) in the objective function of the discriminant model. To make Equation (7) satisfy the Lipschitz continuity assumption, every updated weight of the neural network must be kept within a fixed range, generally −0.01 to 0.01. In addition, because of the high randomness and distribution uncertainty of power-stealing data, training often has difficulty converging. To this end, user classification labels (normal user or power-stealing user) are introduced during training to form a conditional generative adversarial network, which turns the classic GAN from unconstrained unsupervised learning into supervised learning that converges relatively easily. In this case, the objective function of the classic generative adversarial network becomes a conditional one.
Let the original data obey the distribution Φ_O and the generated data obey the distribution Φ_G. Based on the above formulas, the objective function of the generative model of the power-stealing adversarial network is obtained. Based on the above theory, the classic GAN is optimized: objective functions for the generative and discriminant networks that are compatible with power-stealing data are constructed, and the CWGAN balancing algorithm for power-stealing data sets can be designed using the characteristics of user power consumption data.
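The weight-range constraint and the label conditioning can be illustrated with a deliberately small sketch. It replaces the neural critic with a linear function, which is an assumption made purely for brevity, appends a one-hot label to every sample, takes gradient-ascent steps on the critic objective, and clips every weight to [−0.01, 0.01] after each step. All names and the toy data are hypothetical.

```python
import random

CLIP = 0.01  # range stated in the text: weights kept in [-0.01, 0.01]

def with_label(sample, label):
    """Append a one-hot user label (hypothetical order: [normal, theft])."""
    one_hot = [0.0, 1.0] if label == "theft" else [1.0, 0.0]
    return list(sample) + one_hot

def critic(x, w):
    """Linear stand-in for the critic phi(x, w)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def column_means(batch):
    return [sum(col) / len(batch) for col in zip(*batch)]

def critic_step(real, fake, w, lr=0.05):
    """One ascent step on E[phi(real)] - E[phi(fake)], then clip the weights."""
    # For a linear critic the gradient is mean(real) - mean(fake).
    grad = [r - f for r, f in zip(column_means(real), column_means(fake))]
    w = [wi + lr * g for wi, g in zip(w, grad)]
    return [max(-CLIP, min(CLIP, wi)) for wi in w]

rng = random.Random(0)
# Toy weekly rows: "real" values are low, "generated" values are spread out.
real = [with_label([rng.uniform(0.0, 0.3) for _ in range(7)], "theft") for _ in range(16)]
fake = [with_label([rng.uniform(0.0, 1.0) for _ in range(7)], "theft") for _ in range(16)]

w = [0.0] * 9  # 7 daily values + 2 label entries
for _ in range(20):
    w = critic_step(real, fake, w)

# Estimated critic gap between real and generated batches after training.
gap = sum(critic(x, w) for x in real) / len(real) - sum(critic(x, w) for x in fake) / len(fake)
```

Clipping is the simplest way to bound the critic's Lipschitz constant; in a real network the same clamp would be applied to every layer's parameters after each optimizer step.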

Feature Extraction of Electricity Data
Because user electricity consumption data are collected over time, different time divisions yield different data statistics. If the time window is too short, the data features are often vague and the type of user is difficult to determine. If it is too long, data storage and computation consume a lot of resources. Effective feature extraction therefore provides the basis for an efficient classifier. Feature extraction methods represented by autoencoders [30] are increasingly used in power theft detection because of their strong generalization capabilities. Since power consumption differs greatly between users and is highly random, the data of both normal users and power-stealing users contain a lot of noise. Convolution and denoising operations are therefore introduced here to design a stacked convolutional noise reduction autoencoder (SCDAE) for feature extraction from electricity data.

Electricity Data SCDAE Design
The ordinary autoencoder (AE) is a three-layer neural network in which the input data is reconstructed from the hidden layer h. Because it can reconstruct the original data, the hidden layer h can be considered to hold identifiable information. The stacked autoencoder (SAE) sets multiple hidden layers h_i (i = 1, 2, ..., n) to extract intrinsic data features at multiple levels while reconstructing the input data, and the last hidden layer h_n contains the final features of the data. The encoder is given by Equation (10), where σ_e is the activation function of the encoder; x_i = h_{i−1}, and when i = 1, x_i is the daily electricity consumption vector of a single user divided by three weeks; and w_i and b_i are the weights and biases of the encoder network.
Because users' electricity consumption behavior is random, the noise present in the consumption data of both power-stealing and normal users degrades feature extraction. To this end, a stacked noise reduction autoencoder (SDAE) is constructed: noise is added to the original power consumption data, and the original data is reconstructed from the noised data, which improves the generalization of the features extracted by the SAE, as shown in Equation (11). There, the noised data follows a distribution determined by the original data x_i and the parameter η.
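Equations (10) and (11) are not present in this copy of the text. Under the definitions given in the paragraph above, and assuming the usual stacked and denoising autoencoder forms, they would look roughly as follows. The symbol q_D for the corruption distribution and the tilde on the noised input are assumed notation.

```latex
\begin{align}
h_i &= \sigma_e\big(w_i x_i + b_i\big), \qquad x_i = h_{i-1} \\
\tilde{x}_i &\sim q_D\big(\tilde{x}_i \mid x_i, \eta\big), \qquad
h_i = \sigma_e\big(w_i \tilde{x}_i + b_i\big)
\end{align}
```

The only difference between the two lines is the input: the denoising variant encodes the corrupted sample while the reconstruction target remains the clean x_i.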
Since the input is 147 × 7-dimensional power consumption data, a fully connected network would take too long to train and require too much training data. To this end, the convolution operation is introduced into the SAE to form a stacked convolutional noise reduction autoencoder (SCDAE), whose encoder is given by Equation (12), where ⊗ is the convolution operator. To preserve as much of the internal information of the power consumption data as possible, the pooling layer of the classic CNN is omitted in SCDAE. At the same time, to prevent overfitting, a random neuron hiding operation is introduced in SCDAE. It improves network performance by blocking the co-adaptation of neurons: a Bernoulli function with probability p is placed before each neuron to disable some neurons.
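The random neuron hiding step can be sketched in a few lines. The sketch assumes that p is the probability of disabling a neuron (the text does not say whether p is the drop or the keep probability) and, for simplicity, does not rescale the surviving activations. The function name and the toy layer are hypothetical.

```python
import random

def bernoulli_mask(activations, p, rng):
    """Disable each activation independently with probability p."""
    return [0.0 if rng.random() < p else a for a in activations]

rng = random.Random(42)
hidden = [0.5] * 1000                      # toy hidden-layer activations
masked = bernoulli_mask(hidden, p=0.2, rng=rng)
dropped = masked.count(0.0)                # roughly 20% of the units
```

The mask is redrawn on every training pass, so no neuron can rely on a fixed set of partners; at test time the mask would simply be switched off.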
Note that, because of the convolution operations, the typical symmetric encoder-decoder structure is no longer symmetrical when the encoder is trained, and undersampling must be continued in the decoder. In the final feature decoder, ⊗ is the undersampling operator, σ_d is the decoder activation function, w_i and b_i are the weights and offsets of the decoder convolution network, and χ_i is the original power data reconstructed by SCDAE. In the SCDAE training loss function, Ω is the regularization term that prevents the model from overfitting.
Each CDAE layer in SCDAE propagates features forward and gradients backward in a convolutional manner, and the model is difficult to converge when trained directly as a whole SCDAE. To this end, the SCDAE is split into n CDAE layers that are trained step by step, and the feature vector obtained by the convolution of the upper CDAE is used as the input of the lower CDAE. The training process is shown in Figure 2. Applying Equation (15) to the power consumption data, the SCDAE for power feature extraction is obtained after layer-wise training and convergence according to the flow in Figure 2, and it can be used directly for power feature extraction.

Theft Detection Based on LightGBM Classification
LightGBM is trained on the features that SCDAE extracts from the enhanced power data set, together with the original user labels. LightGBM is a classification tool optimized on the basis of GBDT. It trains weak classifiers (decision trees) and selects among them by feature contribution. By constructing a histogram of width k to traverse the input data, the variance information gain is estimated according to Equation (16) to find the optimal split point [31]:

$$\tilde{V}_j(d) = \frac{1}{n}\left(\frac{\left(\sum_{x_i \in A_l} g_i + \frac{1-a}{b}\sum_{x_i \in B_l} g_i\right)^2}{n_l^j(d)} + \frac{\left(\sum_{x_i \in A_r} g_i + \frac{1-a}{b}\sum_{x_i \in B_r} g_i\right)^2}{n_r^j(d)}\right) \qquad (16)$$

In the formula, A and B are the feature data sets sampled at fixed percentages according to gradient contribution, O is the feature data set on a fixed node of the decision tree, and a and b are constants. Here g_i is the gradient of sample x_i, the subscripts l and r denote the subsets falling to the left and right of split point d for feature j, and n_l^j(d) and n_r^j(d) are the corresponding sample counts. A leaf-wise growth strategy with a depth limit is used for acceleration. An upper bound on the approximation error ε(d) of the classification model can then be proved, where n is the dimension of the data set and σ is the probability constant.
For the public third-party LightGBM library, the following parameters need to be set: core parameters such as a learning rate of 0.1 and 31 leaves per decision tree; control parameters such as a minimum of 15 data points per leaf and GOSS retention ratios of 0.2 and 0.1 for large- and small-gradient samples; and input/output parameters such as a maximum of 255 bins per feature and a minimum of 5 data points per bin. With these settings, the power consumption feature data are classified into normal use and power theft [32].
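As an illustration, the settings quoted above can be written as a LightGBM parameter dictionary. The mapping of each value onto a parameter name is an assumption based on the library's documented parameters, not something the paper states. The binary objective and the choice to enable GOSS are likewise assumptions, and no depth limit is set because the text gives no value for it.

```python
# Hypothetical mapping of the quoted settings onto LightGBM parameter names.
params = {
    "objective": "binary",     # assumption: theft vs. normal is a binary task
    "boosting": "goss",        # assumption: GOSS enabled, since its ratios are given
    "learning_rate": 0.1,      # core parameter
    "num_leaves": 31,          # core parameter: leaves per tree
    "min_data_in_leaf": 15,    # control parameter
    "top_rate": 0.2,           # GOSS: share of large-gradient samples kept
    "other_rate": 0.1,         # GOSS: share of small-gradient samples kept
    "max_bin": 255,            # IO parameter: histogram bins per feature
    "min_data_in_bin": 5,      # IO parameter: minimum samples per bin
}

# Typical use, assuming the lightgbm package and prepared arrays X_train, y_train:
#   import lightgbm as lgb
#   booster = lgb.train(params, lgb.Dataset(X_train, label=y_train))
```

Recent LightGBM releases also expose GOSS through a separate data_sample_strategy parameter, so the exact key used to switch it on may depend on the installed version.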
The training and detection process of the theft detection model is shown in Figure 3. Fifty percent, 20%, and 30% of the original data are randomly selected to form the training set, validation set, and test set, respectively. Training uses the original training and validation sets and consists of three parts: CWGAN is trained first and generates an enhanced data set, which is then used to train SCDAE and LightGBM. Testing consists mainly of two parts: the feature extractor obtained by cropping the trained SCDAE, and the LightGBM classifier.
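A minimal sketch of the 50/20/30 random split is shown below. It works on user indices only and uses the 42,372 customers reported for the data set in the experiments section. The function name, the fixed seed, and the choice to give any rounding remainder to the test set are assumptions; the paper does not say whether the split is stratified by class.

```python
import random

def split_indices(n, fractions=(0.5, 0.2, 0.3), seed=0):
    """Shuffle 0..n-1 and cut it into train, validation, and test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    # Whatever remains after the first two cuts goes to the test set.
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_indices(42372)
print(len(train_idx), len(val_idx), len(test_idx))  # 21186 8474 12712
```

Only the training portion would be passed on to the generative network, so the validation and test sets contain real users exclusively.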

Experimental Results and Analysis
The CWGAN-SCDAE-LightGBM model (CSL model) proposed in this paper is evaluated on the public electricity consumption data set published by the State Grid Corporation of China. This data set [28] contains the electricity consumption data of 42,372 electricity customers over 1035 days (1 January 2014 to 31 October 2016). The experiments were performed under Python 3.6 using the public LightGBM framework.

Evaluation Index
In classification detection, the performance of a classifier on an imbalanced data set cannot be evaluated by accuracy alone. The imbalanced data set used here places requirements on both the sensitivity and the specificity of detection, so multiple indicators are needed. Typical evaluation indicators include recall, specificity, precision, F-score, and accuracy.
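The indicators listed above can all be computed from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions and also includes the Matthews correlation coefficient, which the paper uses later for Figure 6. The function name and the example counts are hypothetical, and the power-stealing class is treated as positive.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Standard indicators from confusion-matrix counts (theft = positive)."""
    recall = tp / (tp + fn)                    # sensitivity
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"recall": recall, "specificity": specificity, "precision": precision,
            "accuracy": accuracy, "f1": f1, "mcc": mcc}

# Hypothetical counts for an imbalanced test set of 1000 users (100 thieves).
m = binary_metrics(tp=80, fp=10, tn=890, fn=20)
```

With these toy counts the accuracy is 0.97 although one in five thieves is missed (recall 0.8), which is why accuracy alone is not a sufficient indicator on imbalanced data.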

Data Set Balance Verification
Based on the data analysis and the adversarial generation network design in the second section, the algorithm for balancing power-stealing data is as follows:
• Step 1: Count the number of power-stealing users n_1 and the number of normal users n_2 in the original data set D_0, and set the undersampling rate α (0 < α < 1). Randomly undersample the normal users and mix them with the power-stealing users of the original data set to form a new data set D_1 for training the CWGAN network;
• Step 2: Train the CWGAN network;
• Step 3: Use the trained CWGAN network to generate new power-stealing data, forming a new power-stealing sample set D_2. Finally, merge D_2 with the original data set D_0 to produce the final balanced power consumption data set.
The pseudo code of Algorithm 1 is as follows:

Algorithm 1 CWGAN algorithm for generating power-stealing data
Input: imbalanced data set D_0, sampling rate α (α < 1), numbers of iterations epoch1 and epoch2
/* Construct training data set */
Step 1: Calculate n_1 and n_2 from D_0
Step 2: Undersample the negative samples to form D, which contains n_1 + αn_2 samples
/* Train the CWGAN model */
Step 3: For each training iteration: train the CWGAN model on data set D
End for
/* Generate positive samples with the generative adversarial network */
Step 4: Draw n = n_2 − n_1 random noise vectors from U(0,1) and pair each with the power-stealing label y_0 to form {{p_1, y_0}, {p_2, y_0}, {p_3, y_0}, ..., {p_n, y_0}}. Feeding these to the generative network G produces a set D_2 of n positive samples.
Step 5: Mix the generated positive sample set D_2 with the original data set D_0 to obtain a balanced data set D_3
Step 6: Output data set D_3
Output: balanced data set

The Adam optimizer is used in the CWGAN for power-stealing data. One-hot encoding is used to encode the user type, which is added as the condition variable; its dimension is two. The generative model G uses a classic convolutional neural network structure, and the discriminant model D uses a neural network with a single hidden layer whose activation function is ReLU. Since the original data is the daily power consumption of a single user for 3 years, each sample has dimension 1 × 1035 and is partitioned by week into a 147 × 7 matrix. The generative model takes random noise uniformly distributed on (0,1) and the label variable y as input, and its final output is a 147 × 7-dimensional power-stealing data sample of the same shape as that of a real power-stealing user. The discriminant model D also takes 147 × 7-dimensional power consumption data as input and outputs the probability that the data is real, with dimension 1; its output layer activation function is the sigmoid function.
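The weekly partitioning can be sketched as follows. Since 147 × 7 = 1029, six of the 1035 daily readings do not fit into the matrix; the paper does not say which days are left out, so this sketch assumes the trailing incomplete week is dropped. The function name and the placeholder data are hypothetical.

```python
def to_weekly_matrix(daily, days_per_week=7):
    """Cut a daily series into complete weeks, dropping any trailing remainder."""
    n_weeks = len(daily) // days_per_week
    return [daily[w * days_per_week:(w + 1) * days_per_week] for w in range(n_weeks)]

daily = [float(day) for day in range(1035)]   # placeholder for one user's readings
weekly = to_weekly_matrix(daily)
print(len(weekly), len(weekly[0]))  # 147 7
```

Arranging the series this way puts the same weekday in the same column, which is what lets a convolutional network pick up weekly consumption patterns.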
The results of processing the same imbalanced data set with the power-stealing CWGAN, CGAN, and SMOTE methods are shown in Figures 4 and 5. The abscissa of Figure 4 is the sequence of data numbers. Figure 4a shows the original power consumption data, and Figure 4b shows the new power-stealing data generated by the SMOTE method, which applies the K-nearest neighbor algorithm in a local area. The generation and aggregation of these power-stealing data clearly differ from the original data distribution. Figure 4c shows data generated by the classic CGAN. Because the original data is an imbalanced data set, the generated data further exacerbates the imbalance of the original data. Figure 4d shows the data generated by CWGAN. The distribution of the generated data is similar to the original data distribution, and the amounts of the two classes of data are balanced to a certain degree. CWGAN is superior to the other methods in both quantity and quality, which helps ensure that the power theft model is well trained. Figure 5 shows the average daily power consumption of the original data and the generated data over 365 days. The generative model approximates the original data well, and the generated data also effectively filters out the abnormalities caused by the default values in the original data.
The loss value (Loss) and Matthews correlation coefficient (MCC) obtained when SCDAE + LightGBM is trained on the CWGAN-preprocessed data set and directly on the original data set are shown in Figure 6. The MCC is a typical indicator for evaluating classification on an imbalanced data set; a value of 1 means a completely accurate prediction. As shown in Figure 6a, when the CWGAN-balanced data set is not used, an inflection point appears in the test loss as the number of training iterations increases, that is, the model overfits. After CWGAN is used to balance the data set, as shown in Figure 6b, overfitting is effectively alleviated. Similarly, the MCC values are shown in Figure 6c,d; they improve significantly after the CWGAN-balanced data set is used.

SCDAE Feature Extraction Verification
Feature extraction uses multiple noise reduction autoencoders stacked in a convolutional manner. Features are extracted from user electricity data, and the stacked structure can be determined layer by layer from bottom to top. The goal of training SCDAE is to minimize the overall reconstruction loss function $E_{SCDAE}$. If the reconstruction loss function of the i-th CDAE is $L_i$, the overall reconstruction loss function of SCDAE can be defined as $E_{SCDAE} = 1 - \prod_i (1 - L_i)$, and $E_{SCDAE}$ can be set to a small value (25%). The number of stacked layers of SCDAE is determined by adjusting the dimension of the output features of each hidden layer. The relationship between the $L_i$ of each CDAE on the validation set and the output feature dimension of the corresponding hidden layer is shown in Figure 7. In Figure 7a, the turning point of the slope of the loss value is (116, 0.097), that is, once the output feature dimension of the first hidden layer reaches 116, the loss value hardly changes as the feature dimension increases further. Similarly, the output feature dimensions of the second to fourth autoencoders are taken as 73, 36 and 17, respectively. Through joint fine-tuning, $E_{SCDAE}$ can be brought close to 25%, and SCDAE is therefore determined to be a four-layer stacked structure.

To determine the epoch value for feature extraction training, all labeled samples need to be trained. An epoch value that is too small or too large can cause underfitting or overfitting, respectively. As shown in Figure 8, after 70 epochs both the AUC score and the F1 score decreased slightly, indicating that SCDAE overfitting occurred. The AUC score and F1 score at the 50th epoch reached 0.9738 and 0.8773, respectively, so the epoch value of the experiment was set to 50.
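As a small illustration of the overall SCDAE reconstruction loss defined in this section, the sketch below evaluates $E_{SCDAE} = 1 - \prod_i (1 - L_i)$ for a four-layer stack. Only the first per-layer loss (0.097) comes from Figure 7a; the losses of the other three layers are hypothetical placeholders, since their values are not given in the text, and the function name is likewise an assumption.

```python
def overall_reconstruction_loss(layer_losses):
    """E_SCDAE = 1 - prod_i (1 - L_i) for per-layer reconstruction losses L_i."""
    retained = 1.0
    for loss in layer_losses:
        # Each layer keeps a fraction (1 - L_i) of what reaches it.
        retained *= 1.0 - loss
    return 1.0 - retained


# L_1 = 0.097 is the turning point reported for the first hidden layer.
# The remaining three values are hypothetical and only for illustration.
layer_losses = [0.097, 0.07, 0.07, 0.07]
e_scdae = overall_reconstruction_loss(layer_losses)

target = 0.25  # the small target value for E_SCDAE mentioned in the text
print(f"E_SCDAE = {e_scdae:.4f}, target = {target}")
```

Because the per-layer factors multiply, every additional layer increases the overall loss, which is why the stack depth and the hidden-layer dimensions have to be tuned jointly against the 25% target.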
Figures 9 and 10 compare the theft detection model CSL (CWGAN-SCDAE-LightGBM) designed in this paper with several existing advanced power theft detection models, in terms of the ROC curve and of the test accuracy over a certain number of training iterations. The comparison of the ROC curves shows that the ROC curve of CSL is closest to the upper left corner and has the largest area under the curve, so its detection effect is the best.
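The ROC comparison above relies on two quantities: the curve itself and the area under it. The following sketch shows one common way to compute both from labels and classifier scores, using a threshold sweep and the trapezoidal rule. The function names and the six sample scores are hypothetical, and this is not the evaluation code used in the paper.

```python
def roc_points(labels, scores):
    """Return ROC points (FPR, TPR) by sweeping the threshold from high to low.

    labels: 1 for power theft, 0 for normal use. Both classes must be present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores for six users (three thieves, three normal users).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = area_under_curve(roc_points(labels, scores))
print(f"AUC = {auc:.4f}")
```

A curve that hugs the upper left corner reaches a high true positive rate at a low false positive rate, which is exactly what a larger area under the curve measures.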
Figure 10 shows that, as the number of training iterations increases, the detection algorithms trained on the CWGAN-enhanced data set reach a higher accuracy at an early stage than those trained without it. In addition, once SCDAE training tends to converge, the CSL model has a clear advantage over the other detection algorithms in terms of test accuracy. Most of the other methods level off at about 85% after the same number of iterations, whereas the CSL model finally stabilizes at more than 90% on the test set, with a maximum of 97.6%.

To sum up, this paper proposes two improvements for the detection of power theft: handling unbalanced data sets with CWGAN and extracting power consumption features with SCDAE. After training converges, these improvements effectively improve the ROC performance and the accuracy on the test set, and raise the accuracy of power theft detection to more than 90%.

Conclusions
The new CSL power-stealing detection model proposed in this paper deals with unbalanced data sets through CWGAN. The generated power-stealing data is mixed with the original data to form an enhanced data set for subsequent feature extractor training. Experiments show that this not only makes the model converge quickly, but also yields a higher MCC value at the same epoch, and the final MCC value of the model is increased by 0.1 to 0.8 compared to the case without the data balancing operation.

In addition, in view of the interference noise in the user electricity data set, the power-stealing feature extractor SCDAE is proposed, which combines convolution with noise reduction autoencoders. On the one hand, the noise in the data set is filtered by the noise reduction autoencoder to avoid the adverse impact of noisy data. On the other hand, the noise reduction autoencoders are stacked by convolution to extract more typical features of electricity theft, laying a good foundation for subsequent classification and detection. Finally, experiments comparing the training and test results of different power-stealing detection models on the same data set show that the CSL model improves the typical indicator, accuracy, from about 85% to more than 90% compared to common power-stealing detection models, which is an obvious advantage.

The current work still has certain limitations, chiefly the need to adjust a large number of parameters when using the LightGBM library. The LightGBM model parameters play an important role in the final performance of the power-stealing detection model. In this paper, some parameters were set only by manual tuning, so LightGBM's advantages in classifying power-stealing features have not been fully exploited. Adaptive parameter adjustment methods can be added in the future to approximate the optimal model parameters.