Detection of Electricity Theft Behavior Based on Improved Synthetic Minority Oversampling Technique and Random Forest Classifier

Abstract: Effective detection of electricity theft is essential to maintaining power system reliability. With the development of smart grids, traditional electricity theft detection technologies have become ineffective at dealing with the increasingly complex data on the users' side. To improve the auditing efficiency of grid enterprises, a new electricity theft detection method based on an improved synthetic minority oversampling technique (SMOTE) and an improved random forest (RF) classifier is proposed in this paper. The data of normal and electricity theft users were classified as positive data (PD) and negative data (ND), respectively. In practice, the number of ND is far smaller than that of PD, which makes the dataset composed of these two types of data unbalanced. An improved SMOTE based on the K-means clustering algorithm (K-SMOTE) is first presented to balance the dataset: the cluster centers of the ND are determined by the K-means method, and the ND are then interpolated by SMOTE on the basis of the cluster centers to balance the entire dataset. Finally, the RF classifier is trained with the balanced dataset, and the optimal number of decision trees in the RF is decided according to the convergence of the out-of-bag data error (OOB error). Electricity theft behaviors on the user side are detected by the trained RF classifier.


Introduction
Power losses are usually divided into technical losses (TLs) and nontechnical losses (NTLs) [1]. NTLs refer to the power loss during the transformation, transmission, and distribution process and are mainly caused by electricity theft on the user side [2]. In most countries, electricity theft losses (ETLs) account for the predominant part of the overall electricity losses [3] and mainly take place in medium- and low-voltage power grids. ETLs can cause serious problems, such as loss of revenue for power suppliers, reduced stability, security, and reliability of power grids, and unnecessary resource consumption. In India, ETLs were valued at about US$4.5 billion [4], a figure that is still rising year by year. ETLs are reported to reach up to 40% of the total electricity losses in countries such as Brazil, Malaysia, and Lebanon [5]. ETLs of some provinces in China reached about 200 million kWh, with an overall cost of 100 million yuan. As reported in [6], the losses due to electricity theft reached about 100 million Canadian dollars every year, with a power loss that could have supplied 77,000 families for one year. The annual income loss caused by electricity theft in the United States accounted for 0.5% to 3.5% of the total income [7,8]. Therefore, research on advancing electricity theft detection techniques has become essential due to its significance for energy saving and consumption reduction [9].
In the past, conventional methods such as electric meter packaging, professional electric meters, and bidirectional metering were adopted to deal with electricity theft [10,11]. Today, electricity theft detection methods rely on classifying the data collected by the smart meter measurement system. Classification of electricity theft and normal behaviors is conducted through data analysis [12].
The modern methods for electricity theft detection mainly include state-based analysis, game theory, and classification [13].
State-based detection schemes employ specific devices to provide high detection accuracy. A novel hybrid intrusion detection system framework incorporating power information and sensor placement was developed in [14] to detect malicious activities such as consumer attacks. In [15], an integrated intrusion detection solution (AMIDS) was presented to identify malicious energy theft attempts in advanced metering infrastructures. AMIDS makes use of different information sources to gather a sufficient amount of evidence about an ongoing attack before marking an activity as malicious energy theft. In [16], state estimation was used to determine electricity theft users: when there was a difference between the voltage of the state estimation and the voltage of the measured node, a breadth-first search was conducted from the root node of the distribution network, and the magnitudes of the differences at the same depth were compared to locate electricity theft users. In [17], in order to detect and localize the occurrence of theft in grid-tied microgrids (MGs), a stochastic Petri net (SPN) with a low sampling rate was used to first detect the random occurrence of theft and then localize it. The detection was based on determining the accurate line losses through singular value decomposition (SVD), which led to the recognition of theft in grid-tied MGs. State-based detection schemes, however, bring additional investment in monitoring systems, including equipment costs, system implementation costs, software costs, and operation/training costs. Reference [18] investigated energy theft detection in microgrids, considering a realistic model for the microgrid's power system and the protection of users' privacy. It proposed two energy theft detection algorithms capable of successfully identifying energy thieves. One, the centralized state estimation algorithm based on the Kalman filter (SEK), employed a centralized Kalman filter.
However, SEK could not protect users' privacy and did not have good numerical stability in large systems with high measurement errors. The other, the privacy-preserving bias estimation algorithm (PPBE), was based on two loosely coupled filters and could preserve users' privacy by hiding their energy measurements from the system operator, other users, and eavesdroppers.
Another approach for theft detection is based on game theory. Reference [19] formulated the problem of theft detection as a game between an illegitimate user and the distributor: the distributor wants to maximize the probability of theft detection, while illegitimate users want to minimize the likelihood of being caught by changing the probability density functions (PDFs) of their electricity usage.
Classification-based methods include expert systems and machine learning. Expert systems are computer models trained by human experts to deal with complex problems and draw the same conclusions as the experts would [20]. Expert systems for electricity theft detection based on specific decision rules were initially used. With the rapid development of artificial intelligence technology, machine learning enables computers to learn decision rules from training data; therefore, in recent years, machine learning has become the main research direction of electricity theft detection [21]. Reference [22] explored the possibilities of implementing machine-learning techniques for the detection of nontechnical losses in customers. The analysis was based on work done in collaboration with an international energy distribution company, and it reported how success in detecting nontechnical losses can help the company better control the energy provided to its customers, avoiding misuse and hence improving the sustainability of the service the company provides. Reference [23] provided a novel knowledge-embedded sample model and a deep semi-supervised learning algorithm to detect NTLs using smart meter data. It first analyzed the characteristics of realistic NTLs and designed a knowledge-embedded sample model referring to the principle of electricity measurement; it then proposed an autoencoder-based semi-supervised learning model.
In [24], fuzzy logic and an expert system were combined to integrate human expert knowledge into the decision-making process to identify electricity theft behavior. A grid-based local outlier algorithm was proposed in [25] to achieve unsupervised learning of the abnormal behavior of power users. This method mapped variable features onto a two-dimensional plane by factor analysis (FA) and principal component analysis (PCA); the dimensionality of the data and the computational cost of the outlier factor algorithm were reduced by the grid technique. In [26], an electricity theft detection method based on a probabilistic neural network was employed to detect two types of illegal consumption.
In [27], clustering analysis was first carried out to reduce the amount of data to be analyzed, and the suspected users were then found through a neural network. In [28], the extreme learning machine (ELM) was used to identify the weights between the hidden and output layers, and electricity theft was detected through the measured meter data. In [29], a five-layer neural network was trained with power data from 20,000 customers and achieved considerable accuracy. The SVM-FIS method was proposed in [30], which reduced the computational complexity and improved the detection accuracy by combining a fuzzy inference system (FIS) with a support vector machine (SVM). In [31], a data-based method was proposed to detect sources of electricity theft and other commercial losses; prototypes of typical consumption behavior were extracted by clustering the data collected from smart meters.
For an unbalanced dataset, intelligent algorithms tend to favor positive data (PD) in the training process and ignore the important information contained in the few negative data (ND), which may reduce the detection accuracy [32]. Therefore, correcting the unbalance of the dataset plays an important role in improving the efficiency and accuracy of the algorithm. Data-oriented methods mostly rely on existing, validated cases of fraud for either training or validation. However, since frauds are scarce, it is difficult to obtain such samples unless other fraud detection methods, such as unsupervised detection or a manual inspection campaign, are used [33].
The theory of unbalanced data processing has been widely used in the fields of network fraud identification, network intrusion detection, medical diagnosis, and text classification. However, it is still rarely used in electricity theft detection. Reference [34] introduced the consumption pattern-based energy theft detector (CPBETD), a new algorithm for detecting energy theft in advanced metering infrastructure (AMI). CPBETD relies on the predictability of customers' normal and malicious usage patterns, and it addresses the problem of imbalanced data and zero-day attacks by generating a synthetic attack dataset, benefiting from the fact that theft patterns are predictable. In [35], a methodology was proposed to improve the performance and evaluation of supervised classification algorithms in the context of NTL detection with imbalanced data. The main contributions of that work lay in two aspects: (1) the strategies considered to counteract the effects of imbalanced classes, and (2) an extensive list of performance metrics detailed and tested in the experiments.
A comprehensive detection method for NTLs in unbalanced power data was proposed in [36], which contained three detection models (Boolean rules, fuzzy logic, and a support vector machine). Reference [37] proposed two undersampling methods for the classification of unbalanced data: the easy ensemble (EE) algorithm and the balance cascade (BC) algorithm; both exhibited high computational and implementation complexity. In [38], a one-sided selection (OSS) method was proposed for dealing with unbalanced data. In [39], a KNN-near miss method based on K-nearest neighbor (KNN) undersampling was proposed. In [40], an oversampling method called the synthetic minority oversampling technique (SMOTE) was adopted, which achieved excellent results in the processing of unbalanced data and effectively avoided the problems of simple random oversampling. However, the algorithm selects neighbors somewhat blindly, does not consider the distribution of the data when generating new data, and suffers from strong marginality.
Reference [41] reported that, compared with a single strong decision tree, weak decision trees have high computational efficiency. In addition, considering the weight sparsity of weak classifiers, the recognition rate of the ensemble can be further improved [42]. In [43], decision trees were used for NTL detection, and the algorithms were tested with a real database of the Endesa Company. In addition, the random forest (RF) classifier can save resources and computational time because its multiple decision trees run in parallel; moreover, each decision tree achieves random selection of data and attributes without overfitting [44].
In summary, considering the shortcomings of existing electricity theft detection methods and the unbalance of user data, a method for detecting electricity theft behaviors based on the improved SMOTE and a random forest classifier is proposed in this paper. The main contributions of this paper can be listed as follows.
(1) Considering the high unbalance of the power user-side dataset and the shortcomings of existing methods, a new K-SMOTE method was proposed to deal with the unbalanced initial dataset. The proposed method can reduce the loss of detection accuracy caused by unbalanced data. (2) Considering the limitation of setting the number of decision trees in the RF algorithm, an improved random forest classifier was applied to detect electricity theft behaviors. The efficiency of electricity theft detection can be greatly improved because multiple decision trees run in parallel. The improved RF algorithm and the K-SMOTE oversampling algorithm were then combined to establish an electricity theft detection system that accounts for the unbalance of the users' electricity dataset. (3) The detection method of this paper achieves higher detection accuracy and better stability compared with existing methods.
This paper is organized as follows: Section 2 proposes the K-SMOTE. Section 3 describes the proposed detection method for electricity theft behaviors. In Section 4, simulation results are presented to verify the feasibility and superiority of the proposed method. Section 5 summarizes the main conclusion of this study. In addition, the nomenclature table is shown in Appendix A.

Proposed Algorithm
The detection of electricity theft behaviors is a binary classification problem that calls for distinguishing normal users from electricity theft users. If the electricity data of the user side are fed directly to a classifier, unbalanced data may make the classifier more prone to PD and ignore the important information contained in the ND, which may degrade the performance of the classifier substantially.
As shown in Figure 1, the triangles and circles represent two kinds of data. The solid box represents the actual decision boundary of the two datasets, while the dotted box represents the decision boundary that the classification algorithm may learn. The number of triangle data points in Figure 1a is smaller than the number of circle data points, so Figure 1a represents an unbalanced dataset, while Figure 1b shows a balanced one. It can be seen that the decision boundary learned by the classification algorithm may differ considerably from the real decision boundary if the dataset is unbalanced. In the actual power consumption environment, the number of users stealing electricity is far smaller than the number of normal users, so the users' electricity dataset is unbalanced. Unbalanced user data make the classification algorithm more prone to normal user samples, thereby ignoring the important information contained in the small number of electricity theft samples; the decision boundary of the classifier then deviates from the actual decision boundary, seriously degrading the performance of the classifier. Therefore, an appropriate method is needed to balance the dataset. The traditional SMOTE method easily causes data marginalization problems: if there are many PD between some ND, the artificial data generated around these ND will blur the boundary between the PD and the ND.
In the field of electricity theft detection, the low detection accuracy caused by the unbalance of the power consumption dataset on the user side needs to be addressed. An unbalanced data processing method based on K-means clustering and SMOTE, named K-SMOTE, is proposed in this paper to solve this problem.

SMOTE
SMOTE is a classic oversampling algorithm normally used to solve data unbalance problems [45]. Compared with the random oversampling approach, SMOTE is better at preventing overfitting [40]; it adds ND to achieve a balanced distribution with the PD. The basic idea is to perform linear interpolation between the existing ND and their neighbors. The specific steps of SMOTE are as follows: (1) For a sample x_i in the minority class sample set X, calculate the Euclidean distance from this sample to all other samples in the set and obtain its k nearest neighbors, denoted as y_j (j = 1, 2, ..., k). (2) Set the sampling rate according to the data unbalance ratio to determine the sampling magnification. For each sample x_i, n neighbors are randomly selected from its k nearest neighbors, and new data are constructed as follows:

x_new = x_i + rand(0, 1) × (y_j − x_i), (1)

where j = 1, 2, ..., n and rand(0, 1) represents a random number between 0 and 1.

The new data synthesized by SMOTE are shown in Figure 2, where x is the core sample currently used to construct new data, x1, x2, x3, x4 are the four nearest neighbors of x, and r1, r2, r3, r4 are the synthetic new data.
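The interpolation step above can be sketched in a few lines of Python (a minimal illustration with hypothetical function and parameter names, not the paper's implementation):

```python
import numpy as np

def smote(minority, n_new, k=5, seed=None):
    """Generate n_new synthetic samples: pick a minority sample x_i,
    pick one of its k nearest neighbors y_j, and interpolate
    x_new = x_i + rand(0, 1) * (y_j - x_i)."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x_i = minority[i]
        # Euclidean distances from x_i to every minority sample
        dist = np.linalg.norm(minority - x_i, axis=1)
        neighbors = np.argsort(dist)[1:k + 1]   # skip x_i itself
        y_j = minority[rng.choice(neighbors)]
        synthetic.append(x_i + rng.random() * (y_j - x_i))
    return np.array(synthetic)
```

Because every synthetic point lies on a segment between two existing ND, the oversampled set stays inside the original minority region, which is also why plain SMOTE can blur the class boundary when ND are surrounded by PD.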


K-Means Clustering Algorithm
K-means clustering is a widely used algorithm that takes the distance between data points and cluster centers as the optimization objective [46]. The algorithm maximizes the similarity of elements within a cluster while minimizing the similarity between clusters. K-means selects the desired number of cluster centers, K, minimizes the variance of each cluster through continuous iteration and recalculation of the cluster centers, and takes relatively compact and mutually independent clusters as the ultimate goal.
The basic idea of K-means is to determine the number of initial cluster centers, K, and randomly select K data points as the centers of the initial clusters in the given dataset D. Then, for each remaining data point in D, calculate the Euclidean distance to each cluster center, assign the point to the cluster of the nearest center, and recalculate the cluster centers. The clustering process converges when the cluster centers no longer change or the number of iterations reaches a preset threshold.
The specific steps are as follows: (1) For a dataset D, denoted as {x_1, x_2, x_3, ..., x_n}, randomly select k initial cluster centers μ_1, μ_2, ..., μ_k ∈ D. (2) Calculate the Euclidean distance using Equation (2), that is, calculate the distance d(x_i, μ_j) between x_i and each cluster center, find the minimum d, and assign x_i to the same cluster as the nearest center μ_j:

d(x_i, μ_j) = ||x_i − μ_j||. (2)

(3) After all data points have been assigned, the new cluster center of each class is recalculated by Equation (3):

μ_j = (1/N_j) Σ_{x_i ∈ C_j} x_i, (3)

where N_j represents the number of data points in class j and C_j denotes the j-th cluster. (4) If the cluster to which any data point belongs still changes as the iterations proceed, return to Step 2; otherwise, go to Step 5. (5) Output the clustering results.
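Steps (1)–(5) can be implemented directly (an illustrative NumPy sketch; all names are ours):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=None):
    """Plain K-means following steps (1)-(5): assign each point to
    the nearest center (Equation (2)), then recompute each center
    as its cluster mean (Equation (3)) until assignments settle."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Step 1: pick k distinct data points as initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 2: Euclidean distance of every point to every center
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)
        if (new_labels == labels).all():      # Step 4: converged
            break
        labels = new_labels
        # Step 3: recompute each center as the mean of its cluster
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels                    # Step 5
```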

K-SMOTE
This paper combined the K-means algorithm and SMOTE to balance the electricity data on the user side. The specific steps are as follows: (1) Let M represent the unbalanced electricity dataset on the user side, P the PD, and N the ND. T is the minority (ND) training set, S is the majority (PD) training set, and T and S constitute the total training set O. K is the number of initial clusters, μ_i is a cluster center, and X_new is the corresponding set of new interpolated data points.
(2) Determine the number of initial clusters K.
(3) For T, the K-means algorithm was used to perform clustering and record the cluster centers: T was divided into K clusters with cluster centers {μ_1, μ_2, μ_3, ..., μ_K}. (4) SMOTE was used on T to achieve data interpolation based on the cluster centers {μ_1, μ_2, μ_3, ..., μ_K}, and the interpolated dataset X_new was obtained. (5) T, S, and X_new were combined to form the new training set O'.
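One way to read steps (1)–(5) in code is sketched below (an interpretation, not the paper's implementation; scikit-learn's KMeans stands in for the clustering step, and all names and parameters are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def k_smote(T, n_new, k_clusters=3, seed=None):
    """Cluster the ND training set T with K-means, then interpolate
    each synthetic sample between a randomly chosen ND sample and
    the center of its own cluster, so the new data stay inside the
    minority regions instead of drifting toward the class boundary
    as in plain SMOTE."""
    rng = np.random.default_rng(seed)
    T = np.asarray(T, dtype=float)
    km = KMeans(n_clusters=k_clusters, n_init=10, random_state=0).fit(T)
    X_new = []
    for _ in range(n_new):
        i = rng.integers(len(T))
        mu = km.cluster_centers_[km.labels_[i]]  # center of x_i's cluster
        X_new.append(T[i] + rng.random() * (mu - T[i]))
    return np.array(X_new)
```

Since each cluster center is a mean of existing ND, every interpolated point stays within the bounding region of the minority data.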

Random Forest Classification Based on K-SMOTE
RF, a statistical learning algorithm proposed by Breiman in 2001 [47], is essentially a combined classifier containing multiple decision trees [48]. It mainly uses the bagging method to generate bootstrap training datasets and the classification and regression tree (CART) algorithm to generate pruning-free decision trees. As a machine learning classification and prediction algorithm, random forest features the following advantages.
(1) Compared with existing classification algorithms, its average accuracy is at a high level [49].
(2) It can process input data with high-dimensional characteristics without dimensionality reduction [50]. (3) An unbiased estimate of the generalization error can be obtained internally during the generation process. (4) It is robust to missing-value problems. (5) Each decision tree in the random forest operates independently, enabling parallel operation of multiple decision trees and saving resources and computational time. (6) Randomness is reflected in the random selection of data and attributes; even though the trees are not pruned, there is no overfitting.
The electricity data on the grid user side include various types, such as voltage, current, power consumption, and user classification. Electricity theft users need to be detected quickly and accurately so that the power department or relevant stakeholders can be promptly notified to take proper action.

On the other hand, the RF classifier on its own handles unbalanced datasets poorly, so in this paper it was combined with K-SMOTE to detect electricity theft.

Decision Tree
Random forest is a combined classifier composed of several decision trees. A decision tree can be regarded as a tree model with three kinds of nodes: root, intermediate, and leaf nodes. Each node represents an attribute of the object, each branch from a node represents a possible attribute value, and each leaf node corresponds to the value of the object represented by the path from the root to that leaf. The path from the root to a leaf node represents a rule, and the whole tree represents a set of rules determined by the training dataset. A decision tree has only a single output: starting from the root node, only a unique leaf node can be reached; in other words, each rule is essentially unique. The classification idea of the decision tree is a data mining process achieved by analyzing data with a series of generated rules.
The concept learning system (CLS), iterative dichotomiser 3 (ID3), C4.5, CART, and other node-splitting algorithms can be used to generate decision trees [51]. This paper selected the CART node-splitting algorithm because it can handle both continuous and discrete variables.
The principle of the CART node-splitting algorithm is as follows. Information entropy (IE) is the most commonly used indicator of the purity of a sample set. Assume that the proportion of the k-th class of samples in set D is p_k (k = 1, 2, ..., r); then the information entropy of D, Ent(D), is defined as:

Ent(D) = −Σ_{k=1}^{r} p_k log2 p_k. (4)

The smaller the value of Ent(D), the higher the purity of D.
The CART decision tree uses the Gini-index to select the partitioning attributes. Using the same notation as in Equation (4), the purity of dataset D can be measured by the Gini value, calculated as:

Gini(D) = 1 − Σ_{k=1}^{r} p_k^2. (5)

Intuitively, Gini(D) reflects the probability that two samples randomly selected from dataset D have inconsistent class labels. Therefore, the smaller Gini(D), the higher the purity of dataset D.
Assume that the discrete attribute a has V possible values {a^1, a^2, ..., a^V}. If attribute a is used to partition dataset D, there will be V branch nodes, where the v-th node contains all the data in D taking value a^v on attribute a, denoted as D^v. The Gini value of D^v can be calculated according to Equation (5). Considering that different branch nodes contain different numbers of samples, each branch node is given a weight |D^v|/|D|; that is, the more samples a branch node has, the greater its influence. The Gini-index of attribute a is then defined as:

Gini_index(D, a) = Σ_{v=1}^{V} (|D^v|/|D|) Gini(D^v). (6)

In the candidate attribute set A, the attribute that minimizes the Gini-index after division is selected as the optimal division attribute a*:

a* = argmin_{a ∈ A} Gini_index(D, a). (7)
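The Gini value and Gini-index described above amount to a few lines of code (an illustrative Python sketch; function names are ours):

```python
import numpy as np

def gini(labels):
    """Gini value of a label array: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def gini_index(labels, attribute):
    """Gini-index of a discrete attribute: the |D^v|/|D|-weighted
    Gini value of each branch subset D^v."""
    labels, attribute = np.asarray(labels), np.asarray(attribute)
    return sum(
        (attribute == v).mean() * gini(labels[attribute == v])
        for v in np.unique(attribute)
    )
```

The optimal splitting attribute a* is then simply the candidate attribute with the smallest gini_index.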

Discretization of Continuous Variable
The continuous attributes in a decision tree need to be discretized. The dichotomy method is used for node splitting of decision trees. The main idea of the method is to sort the values of a continuous variable, find its maximum and minimum, and set multiple breakpoints between them. Each breakpoint divides the dataset into two subsets, and the quality of the split produced by each breakpoint is evaluated. In the CART decision tree, the candidate breakpoints are evaluated with the Gini value, and the breakpoint yielding the purest split is selected as the splitting point.
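A sketch of the dichotomy search for one continuous attribute, scored with the Gini value as in the CART section above (function names are illustrative):

```python
import numpy as np

def best_breakpoint(values, labels):
    """Sort the attribute values, take the midpoint between every
    pair of consecutive distinct values as a candidate breakpoint,
    and keep the breakpoint whose two subsets have the smallest
    weighted Gini value."""
    values, labels = np.asarray(values, float), np.asarray(labels)
    order = np.argsort(values)
    v, y = values[order], labels[order]

    def gini(ys):
        _, c = np.unique(ys, return_counts=True)
        p = c / c.sum()
        return 1.0 - float(np.sum(p ** 2))

    best_t, best_score = None, np.inf
    for i in range(len(v) - 1):
        if v[i] == v[i + 1]:
            continue                 # no breakpoint between ties
        t = (v[i] + v[i + 1]) / 2
        left, right = y[: i + 1], y[i + 1:]
        score = (len(left) * gini(left)
                 + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score
```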

Bootstrap Random Sampling
Bootstrap random sampling algorithm is used to obtain different training datasets for training base classifiers.
The mathematical model of bootstrap is as follows. Assuming there are n different data points {x_1, x_2, x_3, ..., x_n} in dataset D, if data are drawn from D with replacement n times to form a new set D*, then the probability that D* does not contain x_i (i = 1, 2, ..., n) is (1 − 1/n)^n. When n → ∞:

lim_{n→∞} (1 − 1/n)^n = 1/e ≈ 0.368. (8)

Equation (8) indicates that approximately 36.8% of the original data are not extracted in each sampling round. These data are called out-of-bag (OOB) data.
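The 36.8% figure is easy to confirm empirically (a quick Monte Carlo check, not from the paper):

```python
import numpy as np

# Draw n items from {0, ..., n-1} with replacement; the fraction of
# items never drawn approximates (1 - 1/n)^n -> 1/e ~ 0.368.
rng = np.random.default_rng(0)
n = 10000
resample = rng.integers(0, n, size=n)     # one bootstrap resample
oob_fraction = 1 - len(np.unique(resample)) / n
print(round(oob_fraction, 3))             # close to 0.368
```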

OOB Error Estimate
OOB data do not participate in fitting the model; however, they can be used to test its generalization ability. It has been proven that the error calculated on the OOB data, called the OOB error, is an unbiased estimate of the true error of the random forest [52]. Therefore, the OOB error can be used to evaluate the accuracy of the random forest algorithm.
The performance of the generated random forest can be tested with the OOB data. The principle of OOB is shown in Table 1. In the first column, x_i represents an input sample and y_i the classification label corresponding to x_i. In the first row, T_i represents a decision tree constructed by the RF. "Y" indicates that the sample participates in the construction of the corresponding decision tree, and "N" indicates that it does not. It can be seen from Table 1 that (x_1, y_1) was not used in the construction of T_1, T_2, and T_3, so (x_1, y_1) is OOB data for the decision trees T_1, T_2, and T_3. After the RF model is trained, its performance can be tested on the OOB dataset, and the test result is the OOB error. In addition, there is a relationship between the number of decision trees and the OOB error; for a given dataset, this relationship can be used to find the optimal number of decision trees in the RF.

Suppose the random forest consists of k decision trees, the OOB dataset is O, and the OOB data of each decision tree are O_i (i = 1, 2, ..., k). Bringing the OOB data into the corresponding decision trees for classification, let X_i (i = 1, 2, ..., k) denote the number of misclassifications of each decision tree. The OOB error is then calculated from:

OOB error = (Σ_{i=1}^{k} X_i) / (Σ_{i=1}^{k} |O_i|). (9)
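For comparison, scikit-learn's random forest exposes the same OOB estimate (a usage sketch on synthetic data, not the paper's dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Each sample is scored only by the trees whose bootstrap resample
# did not contain it, matching the bookkeeping of Table 1.
X, y = make_classification(n_samples=600, n_features=10,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=100, oob_score=True,
                            random_state=0).fit(X, y)
oob_error = 1 - rf.oob_score_
print(round(oob_error, 3))
```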

Random Forest
Random forest is a set of tree classifiers {h(x, θ_k), k = 1, 2, . . . , n}, where the meta-classifier h(x, θ_k) is a classification and regression tree built by the CART algorithm, θ_k is an independent, identically distributed random vector that determines the growth of each decision tree, and x is the input vector of the classifier.
A schematic diagram of the random forest algorithm is shown in Figure 3.
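The ensemble described above maps directly onto standard library implementations; the sketch below uses scikit-learn on synthetic data as an illustration (the dataset and parameter values are stand-ins, not the authors' configuration).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for an unbalanced user-side dataset (85% PD, 15% ND).
X, y = make_classification(n_samples=500, n_features=20,
                           weights=[0.85, 0.15], random_state=0)

# oob_score=True scores each sample on the trees for which it was out-of-bag.
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)
print(f"OOB accuracy: {rf.oob_score_:.3f}")
```

The attribute `oob_score_` is exactly the complement of the OOB error of Equation (9) in pooled form, evaluated without a separate validation set.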
Combined with the proposed oversampling method, the specific electricity theft detection steps are as follows:
(1) The unbalanced user-side dataset M is oversampled by K-SMOTE to obtain dataset M'.
(2) Divide dataset M' into the training set Tr and test set Te of the random forest.
(3) Set the initial number of decision trees, nTree. (4) Use the bootstrap method to select training data for each decision tree. The total number of features in M is K; select n features randomly, where

n = √K. (10)

Then, use the CART algorithm to generate the unpruned decision trees.
(5) Input the test set Te into each trained decision tree, and determine the classification result according to the voting result of the decision trees:

H(Te_i) = MV{h_t(Te_i)},

where Te_i (i = 1, 2, . . . , k) represents each element in the test set, MV represents the majority vote, and h_t(Te_i) represents the classification result of element Te_i in decision tree t. (6) The current OOB error is calculated according to Equation (9). If the OOB error converges, go to Step (7); if not, update the number of decision trees nTree according to Equation (11) and return to Step (4). (7) Output the classification result.
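The majority vote in Step (5) can be sketched in a few lines (the per-tree labels below are hypothetical; with `Counter`, ties resolve to the first-counted label):

```python
from collections import Counter

def majority_vote(tree_predictions):
    """MV over the per-tree labels h_t(Te_i) for one test element Te_i."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical per-tree outputs for one user: 1 = electricity theft (ND), 0 = normal (PD).
print(majority_vote([0, 1, 1, 0, 1]))  # 1
```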
Based on the above theory and steps, the proposed electricity theft detection process is shown in Figure 4.

Simulation Results
In order to verify the accuracy and effectiveness of the proposed detection method, three models, back-propagation neural network (BPN), support vector machine (SVM), and RF were established. The parameters of models are as follows.
(1) The numbers of neurons in the input, hidden, and output layers of the BPN were 20, 40, and 1, respectively. The learning rate was 0.3, the momentum term was 0.3, the batch size was 100, and the maximum number of iterations was 50 [53]. (2) The kernel function of the SVM was a radial basis function (RBF), the kernel parameter g was 0.07, and the penalty factor coefficients c of PD and ND were 1 and 0.01, respectively [30].
This paper selected the short-term load data of 50 urban electricity users from 15 March 2018 to 16 May 2018 in Hebei province, China. The dataset included six data types (peak, flat, and valley active power; power factor; voltage; and current) and three user types (industrial, commercial, and residential). Data were sampled at intervals of 30 minutes through smart meters. The unbalance ratio of the dataset was 16.47%.


Evaluation Indexes
After the classification of unbalanced data, all test samples fall into four cases: TN (true negative), TP (true positive), FP (false positive), and FN (false negative). These indicators constitute a confusion matrix, as shown in Table 2. The confusion matrix is a way to evaluate model performance, where each row corresponds to the category to which an object actually belongs and each column represents the category predicted by the model. FP is the type I error and FN is the type II error. From the confusion matrix, multiple evaluation indexes can be derived.
(1) Accuracy (ACC): ACC is the ratio of the number of correct classifications to the total number of samples. The higher the value of ACC, the better the performance of the detection algorithm. Mathematically, ACC is defined as:

ACC = (TP + TN) / (TP + TN + FP + FN).

(2) True Positive Rate (TPR): TPR describes the sensitivity of the detection model to PD. The higher the value of TPR, the better the performance of the detection algorithm. TPR is defined as:

TPR = TP / (TP + FN).

(3) False Positive Rate (FPR): FPR refers to the proportion of data that actually belong to ND but are wrongly judged as PD by the detection algorithm. FPR is defined as:

FPR = FP / (FP + TN).

(4) True Negative Rate (TNR): TNR describes the sensitivity of the detection model to ND, which is defined as:

TNR = TN / (TN + FP).

(5) G-mean index: The G-mean index is used to evaluate classifier performance [54]. A large G-mean reveals better classification performance. The value of G-mean is the square root of the product of the accuracies on PD and ND, so it can reasonably evaluate the overall classification performance on an unbalanced dataset:

G-mean = √(TPR × TNR).

(6) Receiver operating characteristic (ROC) curve and area under the ROC curve (AUC): The ROC curve was originally created to test the performance of radar [55]. The ROC curve describes the relationship between the relative growth of FPR and TPR in the confusion matrix. For the values output by a binary classification model, the closer the ROC curve is to the point (0, 1), the better the classification performance. The area under the ROC curve (AUC) is an index to evaluate the performance of the detection algorithm from the ROC curve; an AUC value of 1 corresponds to an ideal detection algorithm.
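The indexes above follow directly from the four confusion-matrix counts; a minimal sketch (the counts in the example are hypothetical):

```python
def metrics(tp, tn, fp, fn):
    """Evaluation indexes (1)-(5) from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    tpr = tp / (tp + fn)            # sensitivity to PD
    fpr = fp / (fp + tn)            # ND wrongly judged as PD
    tnr = tn / (tn + fp)            # sensitivity to ND
    g_mean = (tpr * tnr) ** 0.5
    return {"ACC": acc, "TPR": tpr, "FPR": fpr, "TNR": tnr, "G-mean": g_mean}

# Hypothetical counts: 90 PD and 80 ND correct, 20 ND and 10 PD misclassified.
print(metrics(tp=90, tn=80, fp=20, fn=10))
```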

Unbalanced Processing of User-Side Data
Random oversampling, SMOTE, and K-SMOTE were used to oversample the datasets, and the results are shown in Figure 5, in which the black circles represent normal users, the red asterisks represent electricity theft users, and the blue boxes represent the data generated by oversampling.

In addition, Table 3 shows the repetition rate between the original data and the artificial data generated by the several oversampling algorithms. It can be observed from Figure 5 that the result of the random oversampling algorithm included a large amount of duplicated data, while some data were never selected. From Table 3, the data repetition rate of random oversampling was 95.02%, which indicates that the oversampling effect was not ideal. The data repetition rate of SMOTE was 30.5%. As can be seen from Figure 5, the data generated by SMOTE were scattered among other data and introduced noise points; the problem of data overlap still existed and could not be ignored. K-SMOTE generates data near the cluster center and uses representative points to limit the boundaries of the generated data, thereby avoiding the introduction of noise. Data generated by K-SMOTE generally follow the original distribution. Further, as shown in Table 3, its data repetition rate was only 15.84%.
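One way to read the K-SMOTE interpolation step is sketched below. It is a simplification under stated assumptions: a single cluster center (the minority mean, standing in for a K-means center) and linear interpolation between a minority point and that center; the paper's exact interpolation rule and choice of representative points may differ.

```python
import random

def k_smote(minority, n_new, seed=0):
    """Generate synthetic minority samples toward a cluster center (sketch).

    `minority` is a list of feature vectors (lists of floats). Each synthetic
    point lies on the segment between a random minority point and the center,
    which keeps new data inside the minority region and limits noise.
    """
    rng = random.Random(seed)
    dim = len(minority[0])
    center = [sum(p[j] for p in minority) / len(minority) for j in range(dim)]
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([p[j] + t * (center[j] - p[j]) for j in range(dim)])
    return synthetic

new_pts = k_smote([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0]], n_new=5)
print(len(new_pts))  # 5
```

Because every synthetic point is a convex combination of a minority point and the center, it cannot fall outside the minority cluster's bounding region, which is the boundary-limiting behavior described above.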

Determination of the Number of Decision Trees
The number of decision trees is relevant to the accuracy of the algorithm. In this paper, 80% of the user data were set to form a training set and 20% to form a test set. The optimal number of decision trees can be determined by minimizing the OOB error. The relationship between the OOB error and the number of decision trees, nTree, is shown in Figure 6.

It can be observed that when the number of decision trees was larger than 368, the OOB error almost converged to its minimum level. If the number of decision trees was too small, the accuracy was low; on the other hand, adding more decision trees did not improve the accuracy further while increasing the computational burden. Therefore, the number of decision trees was set to 368.
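The convergence check behind Figure 6 can be sketched with scikit-learn's `oob_score_` on synthetic data (an illustration only; 368 trees was the value found for this paper's dataset, not a general rule, and the tree counts below are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in dataset; the paper's user-side data are not public.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

# OOB error for increasing forest sizes; choose the size where it levels off.
oob_errors = {}
for n_trees in (25, 50, 100, 200, 400):
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                random_state=0).fit(X, y)
    oob_errors[n_trees] = 1 - rf.oob_score_

print(oob_errors)
```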

Detection Results of RF
The above-mentioned electricity users' dataset, both processed and not processed by K-SMOTE oversampling, was detected by RF. In order to make the simulation results more convincing and avoid randomness, three independent tests were carried out for each detection. The ACC values of the test data are listed in Table 4, and the ROC curves are shown in Figures 7 and 8, in which the three differently colored curves (red, green, and blue) represent the ROC curves of the three independent tests. According to these results, when K-SMOTE was not used for unbalanced data processing, the mean ACC of RF was 85.53%, while the mean ACC of RF after K-SMOTE was 94.53%. In addition, the ROC curve of RF with K-SMOTE was obviously closer to the point (0, 1) than that of the RF algorithm without K-SMOTE; that is, the area under the former ROC curve was larger.
Moreover, the AUC index of the former was obviously better than that of the latter, which shows that it is necessary to use K-SMOTE to deal with unbalanced data before the detection of electricity theft behaviors. In addition, the detection performance of RF itself was also ideal.


Comparison of Detection Performance of Different Algorithms
The electricity users' data processed by K-SMOTE were tested by BPN and SVM. Again, in order to make the simulation results more convincing, the same dataset was used and three independent tests were performed. The testing results are shown in Table 5 and Figures 9 and 10, in which the three differently colored curves (red, green, and blue) represent the ROC curves of the three independent tests. It can be concluded from the above test results that: (1) Without K-SMOTE, the ACC and AUC values of the RF detection method were relatively low.
However, with K-SMOTE, the ACC and AUC values of all three detection methods were obviously improved, increasing by about 10%. This indicates that an unbalanced dataset affects the accuracy of the detection algorithm, and that K-SMOTE plays an effective role in improving machine learning accuracy.
(2) The electricity user data processed by K-SMOTE were tested by BPN and SVM. The mean ACC values of SVM and BPN were 71.26% and 84.87%, respectively, and the mean AUC values of SVM and BPN were 0.7236 and 0.8716, respectively. These indexes were lower than the ACC and AUC of RF, which were 94.53% and 0.9513, respectively. Thus, the performance of RF was superior to that of SVM and BPN.

Conclusions
In order to better adapt to the rapid development of the power grid, and aiming at handling the unbalanced dataset on the user side while improving the efficiency and accuracy of electricity theft detection algorithms, this paper proposed a method based on K-SMOTE and an RF classifier for detecting electricity theft. The main conclusions can be summarized as follows: (1) K-SMOTE was proposed to avoid the influence of unbalanced data on the accuracy of the classifier.
(2) The RF classifier, which was suitable for the nature of the user-side dataset, was used to detect electricity theft. The decision trees in RF classifier could work in parallel, which improved the detection efficiency and reduced the computational time.
(3) Compared with the conventional detection methods, the proposed method featured higher accuracy and stronger stability.
The method proposed in this paper can provide reliable targets for manual inspection, thereby reducing nontechnical losses in power systems and, hence, improving system reliability and security.

Conflicts of Interest:
The authors declare no conflict of interest.