BiGRU-CNN Neural Network Applied to Electric Energy Theft Detection

Abstract: This paper assesses the potential of the BiGRU-CNN artificial neural network as an electric power theft detection tool. The network combines layers of two different architectures, the bidirectional gated recurrent unit and the convolutional neural network. Used as a classification model, such a tool can help energy sector companies make decisions regarding theft detection. The BiGRU-CNN artificial neural network singles out consumer units suspected of fraud for later manual inspection. The proposed artificial neural network was programmed in Python, using the Keras package. BiGRU-CNN was the best detection model when compared to multilayer perceptron, recurrent neural network, gated recurrent unit, and long short-term memory networks. Several tests were carried out using data from an actual electricity supplier, showing the effectiveness of the proposed approach. The metric values obtained for its classifications were 0.929 for accuracy, 0.885 for precision, 0.801 for recall, 0.841 for F1-score, and 0.966 for area under the receiver operating characteristic curve.


Introduction
The economic progress of developing countries directly relates to the use of electricity by manufacturing industries. Therefore, the lack of this essential resource significantly impacts the economy at large [1][2][3]. There might be numerous reasons behind the shortage of electricity availability; the causes are classified as technical and non-technical losses [4]. Technical losses naturally occur due to irradiation and to electrical energy dissipation during its transmission and distribution, which entails losses in dielectrics and especially in electrical conductors by the Joule effect [5]. Non-technical losses, on the other hand, are defined as any energy consumed or any unbilled service due to the failure of measuring equipment or its fraudulent manipulation. These losses are caused by breakdown or illegal handling at the consumer's premises. Non-technical losses are very difficult to predict [6].
For electricity suppliers, the main cause of non-technical losses is the illegal use of electricity by fraudulent customers [7]. This problem has long been one of the main concerns in the energy system management sector, for it can unbalance demand and supply, causing energy network regulation problems and, consequently, drastic profit losses [8]. Theft detection in electricity networks is thus essential to avoid economic loss and mitigate safety risks. However, conventional methods rely primarily on human verification or specific measuring equipment, which is slow, expensive, and inefficient [9]. A considerable number of modeling techniques for detecting fraud in electrical energy consumption help overcome these obstacles [10]. Among them, classification-based detection is one of the most used approaches. This type of technique distinguishes abnormal energy use patterns from normal consumption patterns in a test sample containing both normal class and fraudulent class examples [11]. Some algorithms that perform this technique are: k-nearest neighbors [12,13], Support Vector Machine [14,15], Random Forest [16,17], Gradient Boosting [18,19], and Ensemble Learning [20,21].
Classification algorithms such as k-nearest neighbors, support vector machine, decision tree, and logistic regression are well established in electricity-related applications as well as in other research areas [22][23][24]. However, most of them rely on manual feature extraction, which requires human intervention, and achieve low electricity theft detection accuracy [25]. It is important to emphasize that all of the above algorithms disregard the sequential nature of the data, assuming that observations are time-independent [26]. In the real world, however, the opposite holds, given the dynamic behavior of electricity consumption [9]. To address these limitations, [27] proposed a deep recurrent neural network for electricity theft detection that can effectively pinpoint cyber-attacks in smart grids. Applied to energy problems, this model exploits the time series nature of customers' electricity consumption by implementing a recurrent neural network with the gated recurrent unit (GRU) architecture, thus improving detection performance and producing better simulation results than other classic methods.
Reference [28] added a non-dominated sorting genetic algorithm to tune the hyperparameters of the GRU network, which exploits the time series nature of power consumption readings, thereby improving detection performance over classic algorithms.
Recurrent neural networks of GRU architecture can be combined with other architectures to form hybrid models of electric power fraud detection. The authors of [29] proposed a deep hybrid neural network model based on the combination of GRU and Convolutional Neural Network (CNN) networks and the Particle Swarm Optimization (PSO) algorithm, using users' real-time electricity consumption as data. Feature selection and extraction are performed by the CNN network, which reduces the dimensionality and redundancy present in the time series. The classification of consumption patterns as normal or fraudulent is done by the GRU network with the PSO algorithm. The simulation results show that the proposed model outperforms existing techniques in terms of energy theft detection. Additionally, the proposed model is more robust and accurate than existing classification methods. Reference [30] first used the bidirectional gated recurrent unit (BiGRU) to classify a consumer as honest or fraudulent, using real-time historical series. The experiments showed that this model surpassed traditional classification techniques.
In view of the possibility of using classification algorithms to detect electricity consumption fraud, the present work aims to improve electricity theft detection with a model, called BiGRU-CNN, based on different layers of artificial neural networks. Most of the time, fraud detection is carried out manually, which requires the energy companies' employees to collect information on the energy consumption of each user. In many cases, this procedure is not efficient. For this reason, the Artificial Intelligence methods proposed in this work become an important alternative solution to the problem of fraud detection, since they allow an efficient exploration of the large amount of information available in the databases of the electric power companies. This constitutes the main contribution of our paper.
More accurate classification models can cut costs and add revenue for energy sector companies. The proposed BiGRU-CNN classifier, built from layers of distinct architectures, was thus compared with the classic artificial neural networks multilayer perceptron (MLP), recurrent neural network (RNN), long short-term memory (LSTM), and GRU to check whether its energy theft classifications are more accurate. In this case, the historical series of electrical energy demand of several consumers of the respective company were used as input to the fraud detection neural models.
This paper is structured as follows: Section 2 includes the theoretical framework used in the paper. Section 3 is the proposed methodology that includes the data used in this research work, the data pre-processing, the neural networks used for transforming the time series into a supervised machine learning problem, and the comparison of metrics. Section 4 corresponds to the experiments performed during this research work. Section 5 concludes and highlights the most important aspects of the paper.

Theoretical Framework
The recurrent neural network (RNN) is an artificial neural network that connects adjacent temporal nodes and thereby introduces the concept of time into the predictive model, making it suitable for processing time series [31]. However, the conventional RNN architecture is susceptible to interference from adjacent time periods, giving rise to the problem of vanishing error flow [32]. One alternative to overcome this is the GRU architecture, which can be regarded as a streamlined variant of the LSTM [33]. Generally, both GRU and LSTM networks mitigate the vanishing gradient problem through their multiplicative gates. However, GRU networks converge and update their internal weights more efficiently during training, and their internal gate structure is more succinct than that of the LSTM network [32].
A typical unit or cell of the GRU architecture is built from two gates, called the reset gate and the update gate [34]. The reset gate filters out information from previous hidden states that is no longer relevant [31]; the lower its value, the greater the amount of information ignored [33]. The update gate, on the other hand, determines the amount of information to be transferred to the output [31]; the higher its value, the more information contained in the previous state is used [33]. Figure 1, adapted from [35], shows the structural diagram of a GRU neural network cell governed by Equations (1)-(4), where z_t is the update gate, ρ is the sigmoid activation function, w represents the weights for each input, r_t is the reset gate, h̃_t is the candidate hidden state of the current hidden node, h_t is the current hidden state, x_t is the current input of the artificial neural network, h_{t−1} is the hidden state of the previous time instant, and u represents the weights for the hidden state of the previous time instant [35].
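As an illustration of the gate mechanics described above, the following is a minimal NumPy sketch of one GRU time step. It follows the convention stated in the text, in which a higher update gate value keeps more of the previous state; other references swap the roles of z_t and 1 − z_t. The bias terms, the parameter names (`w_z`, `u_z`, and so on), and the random initialization are assumptions made for this example and are not taken from the original model.

```python
import numpy as np

def sigmoid(a):
    # Logistic function, written as rho in the text.
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, params):
    """One GRU time step; a larger update gate keeps more of h_prev."""
    # Update gate z_t and reset gate r_t, both in (0, 1).
    z_t = sigmoid(params["w_z"] @ x_t + params["u_z"] @ h_prev + params["b_z"])
    r_t = sigmoid(params["w_r"] @ x_t + params["u_r"] @ h_prev + params["b_r"])
    # Candidate state: the reset gate scales how much of h_prev is visible.
    h_cand = np.tanh(params["w_h"] @ x_t + params["u_h"] @ (r_t * h_prev) + params["b_h"])
    # Convex combination of the previous state and the candidate state.
    return z_t * h_prev + (1.0 - z_t) * h_cand

def init_params(n_in, n_hidden, seed=0):
    # Hypothetical small random weights, for illustration only.
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("z", "r", "h"):
        params["w_" + gate] = rng.normal(scale=0.1, size=(n_hidden, n_in))
        params["u_" + gate] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
        params["b_" + gate] = np.zeros(n_hidden)
    return params

def run_gru(series, params, n_hidden):
    # Feed a univariate series one value at a time and collect hidden states.
    h = np.zeros(n_hidden)
    states = []
    for value in series:
        h = gru_step(np.array([value]), h, params)
        states.append(h)
    return np.stack(states)
```

Calling `run_gru([0.1, 0.5, 0.9], init_params(1, 4), 4)` returns one hidden state per time step. Because each new state is a convex combination of the previous state and a tanh output, every component stays strictly between -1 and 1.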
For many sequential modeling tasks, it is useful to access both past and future information. However, the standard GRU neural network processes temporal sequences chronologically and, therefore, is unable to obtain future context information [36]. The bidirectional GRU (Bi-GRU), on the other hand, can perform this operation, since it consists of two standard GRUs that process the input sequence in opposite directions (chronological and anti-chronological), whose outputs are subsequently merged into a single variable [37]. This enables the model to explore past and future information. The latter, in turn, can provide more efficient predictive results [36].
Unlike recurrent networks of GRU and Bi-GRU architecture, the CNN is a type of feedforward network that has no cyclic connections and keeps no memory of previous inputs [34]. Compared to traditional classification methods, the CNN can not only map more complex nonlinear relationships, but it also generalizes well [38]. Aside from classification, CNN networks are widely used for feature extraction through the kernel, which is, in a nutshell, a filter or matrix that slides over the input to perform the convolution operation and produce a feature map; different kernels generate different feature maps, and all of these are merged to produce the convolution layer output [39]. CNN architecture neural networks are composed of convolutional layers, pooling layers, and fully connected layers, where the convolutional and pooling layers are responsible for extracting the characteristics of the electrical energy theft curves [38].
Figure 3, adapted from [40], presents these layers organized in a generic way to compose the CNN network, and Equations (10) and (11) define their behavior, where x_i is the input of the i-th convolution layer, y_i is the output of the i-th convolution layer, y is the output of the i-th max-pooling layer, f_i is the activation function and, finally, b_i and w_i are, respectively, the offset vector and the weights of the i-th convolution layer [38].
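To make the convolution and pooling operations concrete, the sketch below applies a one-dimensional convolution of the assumed form y_i = f_i(w_i * x_i + b_i), followed by max pooling, to a short sequence. The "valid" (no padding) convolution, the ReLU activation, the pool size of 2, and the function names are assumptions chosen for illustration; the source does not state these details.

```python
import numpy as np

def conv1d_valid(x, kernels, bias):
    """1D convolution without padding, followed by ReLU.

    x:       array of shape (steps, channels)
    kernels: array of shape (n_filters, kernel_size, channels)
    bias:    array of shape (n_filters,)
    Returns an array of shape (steps - kernel_size + 1, n_filters).
    """
    n_filters, kernel_size, _ = kernels.shape
    n_out = x.shape[0] - kernel_size + 1
    out = np.empty((n_out, n_filters))
    for i in range(n_out):
        window = x[i:i + kernel_size]  # slice of shape (kernel_size, channels)
        # Each filter yields one number per window position (one feature map).
        out[i] = np.tensordot(kernels, window, axes=([1, 2], [0, 1])) + bias
    return np.maximum(out, 0.0)

def max_pool1d(y, pool_size=2):
    # Keep the largest value in each non-overlapping block of pool_size steps.
    n_blocks = y.shape[0] // pool_size
    trimmed = y[:n_blocks * pool_size]
    return trimmed.reshape(n_blocks, pool_size, y.shape[1]).max(axis=1)

# Toy input: 12 time steps, 1 channel; 8 filters of size 6, all weights set to 1.
x = np.arange(12.0).reshape(12, 1)
feature_maps = conv1d_valid(x, np.ones((8, 6, 1)), np.zeros(8))
pooled = max_pool1d(feature_maps)
```

With these toy values, `feature_maps` has shape (7, 8) and `pooled` has shape (3, 8): the convolution shortens the sequence by kernel_size − 1 steps and pooling halves what remains.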

Proposed Methodology
The present study proposes a BiGRU-CNN electric energy theft detection model constructed using a Bi-GRU layer followed by a CNN layer. The input data set of consumer energy demand historical series was arranged to feed the Bi-GRU layer, which then processes it to extract long-term time dependencies. These time-dependent characteristics, which are represented by two hidden state vectors that carry past and future information, were introduced into the CNN layer so that significant local relationships are captured through the convolution and pooling layers.
After this procedure, the data were structured in several dimensions, which were collapsed by the flatten layer into a single dimension before being introduced into the fully connected layer that labels electricity consumers as fraudulent or honest. The structure of the proposed model is shown in Figure 4.
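The flow of data through the layers described above can be followed by tracking tensor shapes. The sketch below is a hypothetical shape walk-through, not the authors' code. The kernel size of 6 and the 8 filters are the values reported in the Neural Networks subsection, while the 24 monthly readings, the 10 GRU units per direction, the unpadded convolution, and the pool size of 2 are assumptions made only for this example.

```python
def bigru_cnn_shapes(steps, gru_units=10, filters=8, kernel_size=6,
                     pool_size=2, dense_units=10):
    """Return (layer name, output shape) pairs for one input series."""
    conv_steps = steps - kernel_size + 1  # unpadded convolution shortens the series
    if conv_steps < pool_size:
        raise ValueError("series too short for the chosen kernel and pool size")
    pooled_steps = conv_steps // pool_size
    return [
        ("input", (steps, 1)),                       # one reading per time step
        ("bi_gru", (steps, 2 * gru_units)),          # forward and backward states, concatenated
        ("conv1d", (conv_steps, filters)),           # one feature map per filter
        ("max_pool", (pooled_steps, filters)),
        ("flatten", (pooled_steps * filters,)),      # back to one dimension
        ("dense", (dense_units,)),
        ("output", (1,)),                            # fraud score for the consumer
    ]

for name, shape in bigru_cnn_shapes(24):
    print(f"{name:>8}: {shape}")
```

For a 24-step series under these assumptions, the Bi-GRU layer outputs a 24 x 20 array, the convolution reduces it to 19 x 8, pooling to 9 x 8, and the flatten layer passes 72 values to the fully connected layers.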

Data
The database used in this work, also used in the work of [41], was provided by a Colombian electricity supplier that cannot be disclosed due to confidentiality reasons. The data encompass the actual electricity consumption of 462,433 users, measured monthly in kWh. To complement consumer information, the company also provided a database with manual reviews carried out on all registered customers, as well as anomalies found at the time of such reviews. During the manual inspections, several abnormal consumption patterns were detected, the main causes being clandestine spliced wires, bored meters, a connection made ahead of the measuring box, and measuring boxes without a security seal. It is worth mentioning that other anomalies were found during checks, although most of them are related to electric power theft. Because users' electrical energy consumption patterns are already labeled as fraudulent or normal, the neural networks are trained through supervised learning. The classification provided by the models is compared with the actual consumer class, making it possible to ascertain with greater accuracy and reliability whether the proposed model is able to correctly label a customer as fraudulent or honest. Most works on this topic found in the literature lack actual consumer labels, given how complex, laborious, expensive, and time-consuming manual fraud checks are, as [41] shows. To overcome this obstacle, such works usually consider all consumers honest and create fictitious fraud data to conduct supervised training. Both situations seemingly lead to unreliable classification models due to the lack of vital data.

Data Pre-Processing
Initially, the database was cleaned to eliminate incomplete data records and remove information irrelevant to theft detection from the users' consumption curves. This pre-processing step reduced the data from 462,433 to 314,023 users. It is worth mentioning that most of the incomplete data came from "new" customers, people relocating (new rents), or new homes, so it would have been necessary to develop and implement very specific algorithms to fill the missing values. Those customers would present a very atypical load growth during their first years before reaching a steady state. One way to fill the missing data would have been to use existing "similar" information from other customers; however, this would probably have populated our data with suppositions. Fortunately, the data set was big enough that, even after removing that chunk of data, the statistical impact was very low, so it was decided to simply discard the incomplete records.
The reduced database was loaded into Python code running on Google Colab. After the program read the users' electrical energy consumption data, the MinMaxScaler function of the sklearn pre-processing package was used to normalize the data before feeding it into the neural networks. Normalization was needed because energy consumption data vary considerably, potentially affecting the algorithm's performance during training and thus providing misleading classifications. Equation (12) shows how data normalization was performed by the MinMaxScaler function. In this case, x is the observation of the electrical demand at a time instant.
The pre-processed dataset contains 314,023 consumers. Of these, 240,774 (76.674%) do not steal electrical energy and are classified as "label 0". The remaining 73,249 consumers (23.326%) do, and they are classified as "label 1". The pre-processed dataset was split to create the training and test samples. The training sample comprises 80% of all consumers and helps the artificial neural networks adjust their respective internal parameters during training. These parameters are evaluated on the test sample, represented by the remaining 20% of total consumers, to check whether they are effective in classifying theft occurrence (label 1) or normal electricity consumption (label 0). The 80-20% ratio for training and test samples is common in similar studies, as seen in [42]. To avoid biased neural network results, the training and test samples have the same proportion of normal and fraudulent consumers as the pre-processed dataset, i.e., 76.674% and 23.326%. Figure 5 illustrates these samples, as well as their compositions.
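As a sketch of the normalization step, the function below rescales each column to the interval [0, 1] using x' = (x − min) / (max − min), which is what MinMaxScaler computes with its default feature range. The array layout (one row per consumer, one column per monthly reading) and the sample values are assumptions for illustration only.

```python
import numpy as np

def min_max_scale(data):
    """Rescale each column of a 2D array to [0, 1]."""
    data = np.asarray(data, dtype=float)
    col_min = data.min(axis=0)
    col_range = data.max(axis=0) - col_min
    # A constant column has zero range; map it to 0 instead of dividing by zero.
    col_range[col_range == 0.0] = 1.0
    return (data - col_min) / col_range

# Hypothetical monthly consumption in kWh: rows are consumers, columns are months.
consumption = [[100.0, 20.0],
               [300.0, 20.0],
               [200.0, 60.0]]
scaled = min_max_scale(consumption)
```

Here the first column is mapped to 0, 1, and 0.5, and the second to 0, 0, and 1, so consumers with very different absolute demand levels end up on a common scale.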
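A split that preserves the class proportions, as described above, is usually called a stratified split. The following standard-library sketch shows one way to obtain it; the function name, the seed, and the toy label list are hypothetical. In practice, scikit-learn's `train_test_split` with its `stratify` argument serves the same purpose.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) with the same class mix in both."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        # Shuffle the indices of one class, then cut off its share for testing.
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy data with roughly the class balance reported in the text (77% / 23%).
labels = [0] * 77 + [1] * 23
train, test = stratified_split(labels)
```

With this toy list, the test sample receives 15 honest and 5 fraudulent consumers, and the training sample the remaining 62 and 18, so both keep approximately the original balance.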

Neural Networks
After pre-processing the users' historical electrical energy consumption data, the time series were transformed into a supervised machine learning problem. In other words, a sequence of input and output pairs was created so that a decision could be made and then compared to the correct output. The internal parameters of the artificial neural networks are modified during training by the Adam (Adaptive Moment Estimation) algorithm so that the rate of correct network classifications is as high as possible. This algorithm was selected since it has proven to be effective in predicting fraudulent electricity consumption [9]. The ten neurons in each of the two intermediate layers of all predictive models use the rectified linear unit (ReLU) [43], Equation (13), as the activation function, while the single neuron in the output layer, trained with the binary cross-entropy loss function, is responsible for classifying a consumer as fraudulent or honest [44]. Regarding the CNN architecture layers of the neural networks, the kernel size was set to 6 and the number of filters was set to 8. All artificial neural networks were trained for 150 epochs with a batch size of 32.
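For reference, the two functions named in this paragraph can be written in a few lines. The sketch below uses the standard definitions, ReLU(x) = max(0, x) and the mean binary cross-entropy over a batch; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero and is not part of the source.

```python
import math

def relu(x):
    # Rectified linear unit: passes positive values, zeroes out negative ones.
    return max(0.0, x)

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# A confident, correct prediction gives a lower loss than an undecided one.
undecided = binary_cross_entropy([1, 0], [0.5, 0.5])
confident = binary_cross_entropy([1, 0], [0.9, 0.1])
```

The undecided prediction yields a loss of ln 2 (about 0.693), while the confident and correct one yields about 0.105, which is the behavior the optimizer exploits when adjusting the weights.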

Comparison of Metrics
After network training, the neural models were fed with the consumer information contained in the test sample to ascertain whether they can provide accurate answers. These predictions were organized in a confusion matrix (Figure 6) to make each neural network's individual performance in classifying fraudulent or normal electricity consumption easier to understand. The confusion matrix illustrated in Figure 6 is composed of four classes, where the ordinate axis represents the desired correct response and the abscissa axis indicates the neural network's prediction. The true positive (TP) class encompasses the correct responses from the network to the event of interest. In this case, the network is right that a power consumption fraud occurred. On the other hand, the false positive (FP) class corresponds to the total of erroneous responses from the network to the event of interest, i.e., the network erroneously predicted a fraud occurrence which was, in fact, normal consumption. The true negative (TN) class comprises the correct classifications performed by the neural network regarding the event of no interest; in this case, the network correctly classifies an honest consumer. Finally, the false negative (FN) class contains cases in which the network indicated no consumer fraud when, in fact, electricity was stolen.
The confusion matrix comprehensively represents the individual performance of each of the prediction models with regard to fraudulent consumer classification. However, comparing the predictive performances of different models from these matrices alone is insufficient. Thus, the following metrics are extracted: Accuracy, Precision, Recall, F1-score, and ROC AUC. Accuracy indicates the fraction of hits of the neural network, that is, correctly classified fraudulent and honest consumers. Precision is the ratio between the correct predictions of fraudulent consumers and all predictions that labeled customers as fraudulent, including those who were not. Recall, also known as sensitivity, is the ratio between the correct predictions of fraudulent consumers and all consumers who stole electricity.
The harmonic mean of the precision and recall metrics is defined as the F1-score. Finally, the ROC AUC is the area under the curve formed by the false positive rate on the horizontal axis and the true positive rate on the vertical axis. Besides being used to calculate the AUC metric, the ROC curve is also used to define an optimal threshold that balances the true positive and false positive rates. Normally, a default threshold is set at 0.5.
All metrics range from 0 to 1: values close to 1 indicate satisfactory results, that is, correct classification of fraudulent consumers, while values approaching 0 indicate poor predictive performance. Equations (14)-(17) define the Accuracy, Precision, Recall, and F1-score metrics, respectively. In this case, TP is true positive, TN is true negative, FP is false positive, and FN is false negative.
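The four metrics follow directly from the confusion matrix counts through their standard definitions: Accuracy = (TP + TN) / (TP + TN + FP + FN), Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F1 = 2 · Precision · Recall / (Precision + Recall). The sketch below computes them; the counts in the example are hypothetical and are not taken from the paper's confusion matrices.

```python
def f1_from(precision, recall):
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }

# Hypothetical counts: 100 fraudulent and 200 honest consumers in the test sample.
metrics = classification_metrics(tp=80, tn=190, fp=10, fn=20)
```

As a consistency check on the reported results, `f1_from(0.885, 0.801)` evaluates to approximately 0.841, which agrees with the F1-score stated in the abstract for the given precision and recall.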

Tests and Results
After training the neural networks on the training samples, the internal parameters were tested to verify their ability to generalize to unseen data, which are contained in the test sample. Figures 7-9 depict the confusion matrices with the individual results of the artificial neural networks. Additionally, to ease the comparison of the different models, Table 1 lists metrics based on each matrix's values. Table 1 also indicates the simulation time required by each network to obtain those metrics.
Table 1 shows that the MLP model performed the worst in correctly classifying a consumer as fraudulent or honest when accounting for the Accuracy, Recall, F1-Score, and ROC AUC metrics. This is explained by the fact that this network is unable to extract the temporal dynamic behavior of the energy demand data from different consumers, since its structure has no feedback connections. Since it does not process the temporality of the users' energy consumption data, the network has fewer parameters to modify during training and, consequently, a lower simulation time than the recurrent networks. The recurrent GRU architecture network, on the other hand, had the worst precision and the best recall. When the standard GRU was used to create the proposed BiGRU-CNN model, the comparison metrics underwent significant changes. This new network, formed from layers of different architectures with the bidirectional mechanism, performed best in 4 of the 5 metrics, namely: Accuracy, Precision, F1-Score, and ROC AUC.
These results demonstrate that the modifications applied to the standard GRU have considerably increased its ability to correctly classify a user's consumption as fraudulent or honest. Furthermore, they show the superiority of the proposed model over the other neural network models. Its simulation time was longer than that of the others, given the larger number of parameters modified during training.
Figure 10 shows the ROC curves of all neural models used to calculate the AUC metric in Table 1, as well as the point on each curve that best trades off a higher true positive rate against a lower false positive rate. With these coordinates, it is possible to determine the G-mean, which is later used to find the optimal threshold, responsible for improving the networks' classification power. The G-mean is given by the square root of the product of the true positive rate (TPR) and the true negative rate (TNR); the higher its value, the better the classifier's predictive ability [17]. Once the optimal threshold is obtained from the G-mean, the network can be retrained by setting this new value as the threshold of the output activation that defines whether an energy consumption is fraudulent or not, improving the prediction performance of the network. Table 2 shows the G-mean associated with its ideal threshold.
Analyzing the ROC curves constructed from the neural networks' classifications of the users' type of electricity consumption, it is apparent that the MLP network curve is the closest to the curve that indicates an inefficient classification model. Regardless of the chosen threshold, its classification performance will always be the worst when compared to the other models, which are capable of processing temporal autocorrelations of electric energy consumption. When observing the recurrent networks, it is easy to see that their performances are similar, and the superiority of the BiGRU-CNN model is indicated by its greater distance from the curve of a model without classification ability.
Big data applications will be coming to the power system, bringing large benefits, especially at the distribution level; however, the smart metering and communications infrastructure necessary to implement those kinds of algorithms is still far away on the horizon for most utilities across the world. In the meantime, utilities facing the need for automated theft detection today have to rely on their current data, which is basically limited to monthly energy billing information and manual inspections by sample. The tool presented in this paper will help narrow the sample inspections, reducing the overall cost of manual labor and increasing the return on investment of theft detection programs. Nonetheless, if more data were available, for example private information from smart metering infrastructure and public data such as technical and commercial data from other utilities, local socio-economic data, and credit scores, among others, future iterations of this kind of algorithm would benefit the power utility industry overall, and thus the service offered to customers.
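The threshold selection based on the G-mean, as described above, can be sketched as follows. Since TNR = 1 − FPR, each point of the ROC curve gives G-mean = sqrt(TPR · (1 − FPR)), and the threshold with the largest G-mean is chosen. The ROC points in the example are hypothetical; in practice they would come from a routine such as scikit-learn's `roc_curve`, which returns the false positive rates, true positive rates, and thresholds.

```python
import math

def best_gmean_threshold(fpr, tpr, thresholds):
    """Return (g_mean, threshold) for the ROC point with the largest G-mean."""
    best = None
    for f, t, thr in zip(fpr, tpr, thresholds):
        g_mean = math.sqrt(t * (1.0 - f))  # TNR equals 1 - FPR
        if best is None or g_mean > best[0]:
            best = (g_mean, thr)
    return best

# Hypothetical ROC points, ordered from the strictest to the loosest threshold.
fpr = [0.0, 0.1, 0.3, 1.0]
tpr = [0.0, 0.8, 0.9, 1.0]
thresholds = [1.0, 0.6, 0.4, 0.0]
g_mean, threshold = best_gmean_threshold(fpr, tpr, thresholds)
```

For these illustrative points, the second one wins with a G-mean of about 0.849 at a threshold of 0.6, so the example classifier would flag a consumer as fraudulent when its output score is at least 0.6 rather than the default 0.5.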

Conclusions
The present work proposed the BiGRU-CNN model to classify electricity users as fraudulent or honest based on their consumption patterns. This classification is intended to help energy sector companies decide whether to carry out manual inspections of electricity consuming units.
The experimental results showed that feeding a Bi-GRU layer with the historical series to extract its long-term temporal correlations, and then introducing these time characteristics into a CNN layer so that local trends can be captured, proved to be efficient when comparing Accuracy, Precision, F1-Score, and ROC AUC metrics with MLP, RNN, GRU, and LSTM networks. To ensure that the proposed BiGRU-CNN model is effectively superior to other consumer electricity theft classification models, future work should be carried out altering the hyperparameters of neural networks, as well as the time series of consumers who feed them.