Detection of Non-Technical Losses using SOSTLink and Bidirectional Gated Recurrent Unit to Secure Smart Meters

Energy consumption is increasing exponentially with the increase in electronic gadgets. Losses occur during generation, transmission, and distribution. The growing energy demand leads to an increase in electricity theft (ET) on the distribution side. Data analysis is the process of assessing data using different analytical and statistical tools to extract useful information. Fluctuations in energy consumption patterns indicate electricity theft. Utilities bear losses of millions of dollars every year. Hardware-based solutions are considered to be the best; however, the deployment cost of these solutions is high. Software-based solutions are data-driven and cost-effective, but they require big data for analysis together with artificial intelligence and machine learning techniques. Several solutions have been proposed in existing studies; however, low detection performance and a high false positive rate are the major issues. In this paper, we employ, for the first time, a bidirectional Gated Recurrent Unit for ET detection and classification using real time-series data. We also propose a new scheme, which is a combination of the oversampling technique Synthetic Minority Oversampling TEchnique (SMOTE) and the undersampling technique Tomek Link: the “Smote Over Sampling Tomek Link (SOSTLink) sampling technique”. Kernel Principal Component Analysis is used for feature extraction. In order to evaluate the proposed model’s performance, five performance metrics are used: precision, recall, F1-score, Root Mean Square Error (RMSE), and the receiver operating characteristic curve. Experiments show that our proposed model outperforms the state-of-the-art techniques: logistic regression


Introduction
In the modern world, electricity utilization is increasing day by day. It is broadly categorized into six main areas: residential, industrial, commercial, traction, agriculture, and other activities [1]. More than 65% of energy is consumed by residential regions [2]. The traditional grid is being replaced by the smart grid because of its limitations, i.e., one-way communication, manual monitoring and restoration, central distribution with few sensors, etc. [3], whereas in the smart grid, information flows in two ways between the utility and the consumer [4,5]. The smart grid also helps utilities to produce electricity according to the customer's needs [6][7][8].
Electricity losses take place during generation, transmission, and distribution. There are two types of losses: technical and non-technical [9]. Technical losses are caused by internal actions in the electrical system, for example, a problem in a transformer or an issue in the transmission lines [10]. Non-Technical Losses (NTL) are caused by external actions, for example, unknown and incorrect flow of electricity, inaccurate meter readings, non-payment of bills by customers, and errors in maintaining database records [11]. Electricity theft is one of the major causes of NTL. There are different types of electricity theft, including meter tampering or bypassing, smart meter hacking, etc. [12]. Electricity theft causes different types of issues, such as revenue loss, electricity surging, and heavy load on electrical systems [13]. According to Zheng et al. [14], one hundred million Canadian dollars are lost every year as a result of electricity theft. The lost power would be enough to supply 77,000 users for a year.
We have mapped problems to proposed solutions as shown in Table 1. Recently, many authors have proposed different approaches to solve these issues, which are broadly classified into three main categories: Artificial Intelligence (AI) and Machine Learning (ML)-based, state-based, and game theory-based systems. State-based approaches observe the structure in which information is collected from different resources. However, implementing such a model requires the additional cost of hardware devices. Moreover, on-site inspection is used to detect electricity theft, and it is not possible to inspect each user within a short period of time. In game theory-based approaches, there is a game between the utility and the electricity thief [15]. However, these approaches have a high False Positive Rate (FPR) and a low detection rate. The most challenging problem in game theory-based solutions is to define the rules and the interaction between players. On the other hand, the main focus of machine learning and artificial intelligence-based systems is to analyze the patterns of real time-series data. These systems extract useful information from a dataset by analyzing electricity consumption patterns [16]. Any deviation or change in the consumption patterns may indicate electricity fraud or illegal action [17,18]. Additional hardware devices are required to detect theft, and these devices have high maintenance cost. Domain experts are required for data analysis and final decision-making. Therefore, there is a need to develop an automated electricity theft detection method to overcome these issues [14]. Figure 1 shows the normal and abnormal consumption of energy in two months (i.e., August and September 2016).
The main contributions of this research are summarized as follows. •

Literature Review
A detailed survey of existing studies is presented in this section. Zheng et al. [14] have proposed a wide and deep Convolutional Neural Network (CNN) to capture the periodicity in the State Grid Corporation of China (SGCC) dataset. However, the time complexity of the wide and deep CNN model is very high because it is a hybrid model. The number of non-fraudulent customers is greater than the number of fraudulent customers, which causes the class imbalance problem. The authors of [19] have addressed the issue of class imbalance using large-scale data. For this purpose, they proposed Random undersampling Boosting (RusBoost) and Maximal Overlap Discrete Wavelet Packet Transform (MODWPT) for classification using real smart meter data from commercial and industrial zones [19]. However, the limitations of random undersampling are the underfitting problem, biased selection of samples, and removal of useful information from the majority class.
A metaheuristic technique, namely, the Binary Black Hole Algorithm (BBHA), is used to select the most representative features using real time-series data collected from a Brazilian agency in [20]. However, challenges in BBHA include being stuck in local minima and class imbalance problem [21]. The authors have proposed a Clustering technique by Fast Search and Find of Density Peaks (CFSFDP) combined with Maximum Information Coefficient (MIC) based on the method in [22]. They use an Irish dataset of real smart meter project [23]. The dataset consists of residential, small and medium size enterprises with 5000 customers within 500 days. However, observer meters are installed for smart meter security. The installation and maintenance costs of hardware resources are very high.
The authors of [17] have combined K-means clustering and a Deep Neural Network (DNN) to secure the smart meter. This combined approach is used to detect anomalies in the normal electricity usage of Irish data. Razavi et al. [18] have proposed a finite mixture clustering model and genetic programming to discover new characteristics for theft detection. It is applied to the customer behavior trials from 2009-2010 in Ireland. The main concern of the study is feature engineering, rather than accurate classification. However, it has a high FPR, which leads to high on-site inspection cost.
A detailed literature review is presented in Table 2. The authors of [11] have deployed a deep learning methodology in which data is represented as an image. This methodology is specifically designed to accommodate large-scale data. Many machine learning techniques have been applied for Electricity Theft Detection (ETD), including the Long Short-Term Memory (LSTM) method proposed in [24]. LSTM is not only used for single-entity data, but it is also used to learn long-term dependency sequences. The data is taken from electricity load diagrams covering 2011-2014. However, a delay occurs during anomaly detection.
Spirić et al. [25] have proposed a fuzzy logic method to minimize the total loss. This method determines the loss between electricity usage and supply. Fuzzy logic has some limitations: it requires large data for training and an expert team for creating fuzzy rules. This method is time-consuming and complex, and it is not considered an optimal solution. The authors of [7] have proposed a gradient boosting-based method for ETD, which is composed of Extreme Gradient Boosting (XGBoost), categorical boosting, and light boosting, and uses the Irish dataset. These methods consume more memory and time and are unable to handle categorical data. Li et al. [26] have proposed the Smart Energy Theft System (SETS) in smart homes, which is an IoT-based solution for ETD. A peer-to-peer computing-based method named multiple linear regression is used [27].
Coma-Puig et al. [28] have proposed an NTL detection method for an energy utility to observe the loss between generation and distribution. Data is collected from a leading electricity provider in Spain. Viegas et al. [29] have proposed fuzzy Gustafson-Kessel (GK) clustering to detect NTL using a real dataset. Saeed et al. [30] have proposed an ensemble bagged tree-based algorithm to detect NTL. The data is collected from the Multan Electric Power Company (MEPCO) in Pakistan. The bagging algorithm did not perform well because it causes an overfitting problem.
Ghasemi et al. [31] have proposed a Probabilistic Neural Network (PNN) and the Levenberg-Marquardt method to detect two types of electricity theft: first, where an individual consumes a portion of the required energy illegally, and second, where an individual consumes all the required energy illegally. The authors of [32] have proposed Extreme Gradient Boosting (EGB) trees to rank the customers according to their anomalous consumption behavior. Data is collected from commercial and industrial users of Endesa. A hybrid model based on CNN and RF has been proposed by Li et al. [33] to detect NTL in the smart grid. Real time-series data is collected from energy utilities of Ireland and London. Hasan et al. [34] have proposed a hybrid neural network model combining CNN and LSTM, using the publicly available SGCC dataset. Singh et al. [35] have proposed a Principal Component Analysis (PCA) approximation to find electricity theft. Data is collected from a leading Irish center for qualitative data.

Problem Statement
Electricity theft is a serious issue for utilities due to billions of dollars lost annually. Many machine learning approaches have been proposed to detect NTL. However, further research is needed to address some important problems.
Zheng et al. [14] have proposed wide and deep CNN for ETD. However, the execution time is high because it is a hybrid model. FPR is not calculated. Moreover, the imbalanced nature of the data is not considered.
Avila et al. [19] have proposed RusBoost for NTL detection. However, important information is lost due to random undersampling. Moreover, it requires high engagement of experts and execution time.
Buzau et al. [36] have proposed a hybrid model which consists of LSTM and MLP to secure the smart grid from electricity theft. However, this requires high memory to capture anomalies in consumption data. Furthermore, the execution time and FPR of LSTM are very high, which leads to high inspection cost in ETD.

Proposed Model
To solve the aforementioned problems, we propose a model which is a combination of KPCA and BGRU. The proposed system model for ETD is shown in Figure 2 and is based on five steps: data preprocessing using imputation and normalization, handling of imbalanced data using the SOSTLink method, feature extraction using KPCA, classification using bidirectional GRU, and evaluation using performance metrics. The flow chart of the proposed model is shown in Figure 3.

Electricity Consumption Data
The dataset released by SGCC is used in this research, which is publicly available [14].

Data Preprocessing
A data preprocessing step is important because the performance of a model depends not only on the algorithms, but also on the quality of the data. Generally, real time-series data is noisy, inconsistent, and incomplete (missing values), which increases the difficulty of mining useful information. The SGCC dataset contains missing and incorrect values due to various factors such as breakdown of smart meters, unscheduled maintenance of data storage, and unreliable transmission of measurements [14]. Consequently, to attain high performance in NTL detection, many preprocessing techniques have been used in the literature. For that reason, we perform preprocessing using imputation and normalization. To handle missing values, we use a simple imputation method. In this method, empty values are considered as Not a Number (NaN) and their preceding and following values are checked. If both of these values exist, NaN is replaced by their mean; otherwise, the empty field is replaced by zero. We normalize the data by applying MinMax normalization, which scales the data to the range between 0 and 1. The formula of MinMax normalization is as follows:

$$n = \frac{x - t_{min}}{t_{max} - t_{min}} \qquad (1)$$

where $n$ represents the newly generated value, $x$ is the selected cell on which the operation is performed, $t_{min}$ is the minimum value of the column, and $t_{max}$ is the maximum value of the column.
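The two preprocessing rules described above can be illustrated with a minimal NumPy sketch. The function names `impute_neighbors` and `min_max`, the toy readings, and the handling of a constant column are assumptions made for illustration; they are not taken from the original implementation.

```python
import numpy as np

def impute_neighbors(row):
    """Replace each NaN by the mean of its two immediate neighbours
    when both exist in the original row; otherwise replace it by 0."""
    row = np.asarray(row, dtype=float)
    out = row.copy()
    for i in np.flatnonzero(np.isnan(row)):
        has_prev = i > 0 and not np.isnan(row[i - 1])
        has_next = i < len(row) - 1 and not np.isnan(row[i + 1])
        out[i] = (row[i - 1] + row[i + 1]) / 2 if (has_prev and has_next) else 0.0
    return out

def min_max(column):
    """Scale a column to [0, 1] using its own minimum and maximum."""
    column = np.asarray(column, dtype=float)
    t_min, t_max = column.min(), column.max()
    if t_max == t_min:  # assumption: a constant column is mapped to zeros
        return np.zeros_like(column)
    return (column - t_min) / (t_max - t_min)

# Hypothetical readings with isolated and consecutive gaps.
readings = [2.0, np.nan, 4.0, np.nan, np.nan, 10.0]
clean = impute_neighbors(readings)   # isolated gap -> 3.0, consecutive gaps -> 0.0
scaled = min_max(clean)
```

Note that consecutive missing values fall back to zero in this sketch, because neither of them has two valid neighbours.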

Handling Imbalance Data
The class imbalance in the SGCC dataset biases the model towards the majority class. Figure 4 shows that normal users greatly outnumber abnormal users, due to which the model misclassifies the minority class. To address the imbalanced data problem, two methods are used in the literature: cost functions and sampling techniques. In this paper, we use the sampling technique. There are two types of sampling techniques: random undersampling and random oversampling. In random undersampling, some data points from the majority class are discarded until the majority class is equal in size to the minority class. This sampling technique requires less execution time but leads to the loss of important information. In random oversampling, data points from the minority class are replicated randomly, so no information is lost and both majority and minority classes are balanced.
In this paper, we propose SOSTLink, which is a combination of the oversampling technique Synthetic Minority Oversampling TEchnique (SMOTE) and the undersampling technique Tomek Link. SMOTE generates synthetic data points by interpolating between two samples from the minority class. For example, take an instance of the minority class as $(y_1, y_2)$; if its nearest neighbor is chosen as $(y'_1, y'_2)$, then the generated data point $(\hat{y}_1, \hat{y}_2)$ is given by Equation (2):

$$(\hat{y}_1, \hat{y}_2) = (y_1, y_2) + \mathrm{random}(0, 1) \cdot \big((y'_1, y'_2) - (y_1, y_2)\big) \qquad (2)$$

where the random function provides a random number between 0 and 1. By applying SMOTE, minority class data points are generated until the minority class is balanced with the majority class. Figure 5 shows the dataset after applying SMOTE. However, SMOTE does not consider neighboring examples from the other class. This can increase the overlapping of classes, introduce additional noise, and move predictions away from actual residential customers. Moreover, it cannot be applied to high-dimensional data. Tomek Link is used for undersampling: observations of the majority class that lie near the borderline of the minority class are removed [37]. The undersampling steps are as follows.
1. Read input from the dataset.
2. Minority samples are generated from input dataset.
3. Majority samples, which are nearest neighbors of minority observation, are also generated.
4. Combine both observation samples from majority and minority.
5. Delete all majority samples that are the nearest neighbor of minority samples.
6. Now dataset is undersampled as observations from majority class are removed.
In our proposed SOSTLink, samples are generated in the minority class and samples are removed from the majority class as shown in Figure 6. It corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class.
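The combination of SMOTE and Tomek Link described in this section can be sketched in plain NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`smote`, `tomek_link_majority_indices`, `sostlink_sketch`), the choice of `k = 5` neighbours, the 1:1 target ratio, and the toy two-dimensional data are all hypothetical. A Tomek link is taken here in its usual sense, i.e., two samples of different classes that are each other's nearest neighbour.

```python
import numpy as np

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    dist = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbour
    k = min(k, len(minority) - 1)
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, len(minority), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))              # random(0, 1) in the interpolation
    return minority[base] + gap * (minority[chosen] - minority[base])

def tomek_link_majority_indices(X, y, majority_label=0):
    """Indices of majority samples whose mutual nearest neighbour
    belongs to the other class (i.e., that form a Tomek link)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)
    idx = np.arange(len(X))
    is_link = (nearest[nearest] == idx) & (y != y[nearest])
    return idx[is_link & (y == majority_label)]

def sostlink_sketch(X, y, majority_label=0, minority_label=1, k=5, seed=0):
    """Oversample the minority class with SMOTE, then drop majority
    samples that form Tomek links."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    X_min = X[y == minority_label]
    n_new = max(int((y == majority_label).sum()) - len(X_min), 0)
    synthetic = smote(X_min, n_new, k=k, seed=seed)
    X_all = np.vstack([X, synthetic])
    y_all = np.concatenate([y, np.full(len(synthetic), minority_label)])
    drop = tomek_link_majority_indices(X_all, y_all, majority_label)
    keep = np.setdiff1d(np.arange(len(X_all)), drop)
    return X_all[keep], y_all[keep]

# Hypothetical toy data: 40 "normal" users and 8 "theft" users.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, (40, 2)), rng.normal(2.0, 1.0, (8, 2))])
y = np.array([0] * 40 + [1] * 8)
X_bal, y_bal = sostlink_sketch(X, y)
```

The pairwise distance matrices make this sketch quadratic in the number of samples, so it is only suitable for small illustrations; a library implementation would be preferable on a dataset of realistic size.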

Feature Extraction using KPCA
Feature extraction is a process in which the most relevant features are extracted from the original dataset. In the literature, many methods have been proposed for feature extraction. In this research, we use KPCA for feature extraction. It extracts useful information from the entire dataset as much as possible, without losing important information. It is a variant of Principal Component Analysis (PCA) with kernel function. It uses a kernel function to project the dataset into a higher-dimensional feature space, where it is linearly separable. To implement the KPCA algorithm, the following steps are involved.

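• The first step is the choice of the kernel mapping $k(x_m, x_n)$.
• Based on the training data $\{x_n, n = 1, \dots, N\}$, we compute the kernel matrix $K$.
• To get the eigenvalues $\lambda_i$ and eigenvectors $a_i$, solve the eigenvalue problem of $K$.
• For each given data point $x$, obtain its principal components in the feature space:

$$y_i(x) = \sum_{n=1}^{N} a_{in}\, k(x, x_n) \qquad (3)$$

where $a_{in}$ is the $n$-th entry of the eigenvector $a_i$.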
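The KPCA steps listed above can be sketched in NumPy as follows. The RBF kernel, the value of `gamma`, the centering of the kernel matrix, and the function names are assumptions made for illustration; the source does not state which kernel or settings were used.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """k(x_m, x_n) = exp(-gamma * ||x_m - x_n||^2) for all pairs of rows."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kpca_fit(X, n_components, gamma):
    """Build K on the training data and solve its eigenvalue problem."""
    N = len(X)
    K = rbf_kernel(X, X, gamma)
    one = np.full((N, N), 1.0 / N)
    Kc = K - one @ K - K @ one + one @ K @ one     # centre in feature space
    lam, vec = np.linalg.eigh(Kc)                  # ascending eigenvalues
    order = np.argsort(lam)[::-1][:n_components]   # keep the largest ones
    lam, vec = lam[order], vec[:, order]
    alphas = vec / np.sqrt(lam)                    # unit-norm feature-space axes
    return {"X": X, "K": K, "alphas": alphas, "gamma": gamma}

def kpca_transform(model, X_new):
    """Principal components of new points: sum_n a_in * k(x, x_n)."""
    K, N = model["K"], len(model["K"])
    K_new = rbf_kernel(X_new, model["X"], model["gamma"])
    one = np.full((N, N), 1.0 / N)
    one_new = np.full((len(X_new), N), 1.0 / N)
    Kc_new = K_new - one_new @ K - K_new @ one + one_new @ K @ one
    return Kc_new @ model["alphas"]

# Hypothetical data: 20 samples with 5 features, reduced to 2 components.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
model = kpca_fit(X, n_components=2, gamma=0.1)
Z = kpca_transform(model, X)
```

Centering the kernel matrix corresponds to subtracting the mean in the feature space, as ordinary PCA does in the input space; without it the leading component would mostly capture the offset of the data.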

Bidirectional Gated Recurrent Unit for Classification
In traditional neural network and CNN models, weights are updated during backpropagation, due to which the problem of vanishing gradient and exploding gradient occurs. To resolve these issues, LSTM and GRU are used as advanced versions of the Recurrent Neural Network (RNN). However, LSTM has some limitations over GRU. LSTM has more parameters, besides being time-consuming and less efficient. It needs more data for generalization. Therefore, GRU is considered a better classification model as compared to the traditional neural networks, CNN, and LSTM models. Figure 7 shows the overall architecture of GRU. GRU is proposed by Cho et al., which is the advanced version of RNN [38]. It easily learns long-term dependencies and resolves the problem of vanishing gradient [39]. The structure of GRU is slightly different from LSTM. LSTM consists of three gates, whereas GRU consists of two gates. Update and reset are the two gates of GRU. The update gate is the combination of input and forget gate of LSTM, and the reset gate is applied directly to the previous hidden state. Thus, GRU has fewer parameters, faster training process, and requires less data for generalization. For short-term dependencies, the reset gate is activated, and for long-term dependencies, the update gate is used. GRU uses a combination of both gates, so input sequences are passed through the deep network and all the gradients are kept. The relationship between the input and output gates is described in Equations (4)- (7).
where r(t) is the reset gate, u(t) is the update gate, and W is the parameter matrix. σ is the sigmoid function and tanh is the hyperbolic tangent function. BGRU is used in [40] for natural language processing. We take our motivation from the work in [40] and use BGRU in our proposed model, as shown in Figure 8. Bidirectional GRU is a recent variant of the bidirectional RNN. To make a prediction for the current observation, it uses information from both the previous and the subsequent time steps. In GRU, information flows from left to right by computing each value. The output of GRU is passed to BGRU as input. In the final prediction, information also flows from right to left, starting from the final time step and moving to the initial time step. In our proposed model, we use five layers: GRU, BGRU, flatten, dropout, and dense. We use 100 neurons in the GRU layer and 50 neurons in the bidirectional layer. Moreover, the flatten layer is used to convert multidimensional data into one-dimensional data.
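To make the gate mechanics and the bidirectional pass concrete, the following NumPy sketch implements one common formulation of a GRU step and runs it in both directions over a sequence. It is an illustration under assumptions: biases are omitted, the update convention `h = (1 - u) * h_prev + u * h_cand` is one of two conventions found in the literature and may differ from the source's Equations (4)-(7), and the weights are random rather than trained. It does not reproduce the five-layer architecture described above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_gru(input_dim, hidden_dim, rng):
    """Random weights for the update gate, reset gate, and candidate state."""
    shape = (hidden_dim, hidden_dim + input_dim)
    return {name: rng.normal(0.0, 0.1, shape) for name in ("W_u", "W_r", "W_h")}

def gru_step(params, h_prev, x_t):
    concat = np.concatenate([h_prev, x_t])
    u = sigmoid(params["W_u"] @ concat)     # update gate u(t)
    r = sigmoid(params["W_r"] @ concat)     # reset gate r(t)
    h_cand = np.tanh(params["W_h"] @ np.concatenate([r * h_prev, x_t]))
    return (1.0 - u) * h_prev + u * h_cand  # assumed convention, see lead-in

def run_gru(params, sequence, hidden_dim):
    h = np.zeros(hidden_dim)
    states = []
    for x_t in sequence:
        h = gru_step(params, h, x_t)
        states.append(h)
    return np.stack(states)

def run_bidirectional_gru(fwd, bwd, sequence, hidden_dim):
    """Concatenate a left-to-right pass with a right-to-left pass."""
    forward = run_gru(fwd, sequence, hidden_dim)
    backward = run_gru(bwd, sequence[::-1], hidden_dim)[::-1]
    return np.concatenate([forward, backward], axis=1)

# Hypothetical input: 7 daily readings with one feature each.
rng = np.random.default_rng(0)
sequence = rng.random((7, 1))
fwd = init_gru(1, 4, rng)
bwd = init_gru(1, 4, rng)
outputs = run_bidirectional_gru(fwd, bwd, sequence, hidden_dim=4)  # shape (7, 8)
```

At every time step the first half of the output depends only on past readings and the second half only on future readings, which is what allows the classifier to use context from both directions.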

Study of Hyperparameters used for Experiments
The performance of the proposed model depends on its hyperparameters. We have achieved the desired performance by selecting the optimal number of hidden layers. We perform a number of experiments by changing the value of alpha, and we obtain the optimal value of alpha by trial and error. As shown in Figure 4, data is transformed using the Random Oversampling Technique. Table 3 shows the values of the hyperparameters.

Experimental Results
An epoch is one complete pass over the training data and is used to control the training process of BGRU. In our model, we run the program for 25 epochs. After the 17th epoch, the accuracy remains constant. On the training data, accuracy gradually increases and reaches 95%, while on the testing data, the accuracy fluctuates slightly. The dataset has some zero values. At the 4th epoch, the BGRU trains over a batch containing zero values, which causes overfitting. As shown in Figure 9, the accuracy of the proposed BGRU model is 94%. We also calculate the loss of the proposed model. For loss, we conduct 16 iterations. At each step, the loss decreases and reaches a minimum of 0.1 in the training phase. In the testing phase, the minimum loss is less than 0.2. Loss of training and testing data is the same. The selected batch at the 2nd to 4th epoch consists of zero values; therefore, overfitting occurs at the 4th iteration. Figure 10 indicates that the proposed model performs well on training and testing data.

Decision Tree
DT is also used for classification of energy theft. It has the power of decision-making in order to perform NTL detection. It works for complex problems because of its high adaptability.

Support Vector Machine
SVM is a classification method that is used in literature for binary classification. In the literature, many studies have compared their model with SVM.

Logistic Regression
Logistic Regression (LR) is a binary classification method which can be viewed as a single-layer neural network. The sigmoid activation function is used in LR, with output values ranging from 0 to 1.

Random Forest
The building block of Random Forest (RF) is multiple DT. In recent studies, it is used to identify thefts in power distribution. It has achieved better accuracy along with reducing overfitting issue.

Convolutional Neural Network
CNN is used to perform NTL detection. CNN is a multilayered deep learning model suitable for complex problems.

Long Short Term Memory
LSTM is a classifier used for theft detection, which learns temporal correlations from time series. In LSTM, the input is given in a sequence to train the model. In existing studies, many authors compare their classifiers with LSTM.

Multilayer Perceptron-Convolutional Neural Network
The Multilayer Perceptron Convolutional Neural Network (MLP-CNN) is a hybrid model, in which MLP maps input data with output data and CNN is a deep learning model. We compare our results with this model.

Performance Metrics
The objective of NTL detection is to minimize the inspection cost and maximize the electricity theft detection. Performance metrics are computed from the confusion matrix, which is shown in Figure 11. It is used to evaluate the performance of a classifier on test data. This matrix is appropriate when we have a verified number of thefts in a dataset [41]. From this matrix, four possible outcomes are generated: True Positive (T+), False Positive (F+), True Negative (T-), and False Negative (F-). In T+, the classifier correctly detects thieves as fraudsters. In F+, normal users are detected as thieves by the classifier. In T-, normal users are correctly identified by the classifier, whereas in F-, thieves are detected as normal users by the classifier. These outcomes are then used to calculate precision, recall, and F1-score. In this research, we have used precision, F1-score, recall, and the ROC curve [19]. FPR is considered an important performance metric. It measures the fraction of normal users that are classified as thieves, which raises the model's misclassification rate. If FPR is high, the inspection cost is also high. We have calculated the FPR of benchmark models as shown in Figure 13. The proposed model has the lowest FPR of 0.06, whereas SVM has the highest FPR. The ROC curve is another performance metric for classification problems. It tells us how confidently our model differentiates between normal and theft users. Figure 13 shows the ROC curve of the proposed BGRU model. The ROC score (area under the curve) of the BGRU model is 0.86. The ROC score improved greatly by applying the SOSTLink method.
We also calculate the RMSE of our proposed model and of the comparison techniques. RMSE gives relatively high weight to large errors, and it is very useful when large errors are undesirable. It takes the square root of the average of the squared errors. The formula of RMSE is as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$$

where $y_i$ is the actual sample, $\hat{y}_i$ is the predicted sample, and $N$ is the number of samples. RMSE thus calculates the distance between the actual sample and the predicted sample. The RMSE of BGRU is 0.044, which is the lowest among the existing deep learning models, as shown in Figure 14; the RMSE of RF is the highest at 0.400.
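The metrics derived from the four confusion-matrix outcomes can be computed as in the following sketch, which uses the standard definitions of precision, recall, F1-score, and FPR. The function name and the counts in the example are hypothetical and are not results from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts.
    tp: thefts flagged as theft      fp: normal users flagged as theft
    tn: normal users kept as normal  fn: thefts missed as normal"""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}

# Hypothetical counts for 50 theft users and 100 normal users.
metrics = classification_metrics(tp=40, fp=10, tn=90, fn=10)
# precision = 0.8, recall = 0.8, f1 = 0.8, fpr = 0.1
```

In this toy case an FPR of 0.1 means that 10 of the 100 normal users would be sent for an unnecessary on-site inspection, which is why a low FPR translates directly into lower inspection cost.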
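As a small illustration of why RMSE penalizes large errors, the sketch below computes it with NumPy for two hypothetical prediction vectors that have the same mean absolute error. The function name and the numbers are illustrative assumptions, not values from the experiments.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Square root of the mean squared difference between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

actual = [0.0, 0.0]
spread_errors = rmse(actual, [0.5, 0.5])  # two medium errors
single_error = rmse(actual, [0.0, 1.0])   # one large error, same mean absolute error
```

Here `spread_errors` is 0.5, whereas `single_error` is about 0.707: the single large error is weighted more heavily even though the average absolute deviation is identical.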

Conclusions
The number of electronic gadgets is growing rapidly, and because of this, the demand for electricity is increasing day by day. Losses occur during generation, transmission, and distribution. In the literature, many studies have been proposed to deal with non-technical losses. However, there is still a need to improve FPR and to find a better balancing technique to achieve good results. In this paper, we first handle missing values by an imputation method and normalize the data by applying MinMax normalization. Second, we propose the SOSTLink sampling technique, which is a hybrid of two sampling techniques, SMOTE and Tomek Link, for balancing the imbalanced data. Finally, we use bidirectional GRU for NTL classification by analyzing the electricity consumption patterns of consumers. In order to evaluate the model performance, we use five performance evaluation metrics on the real electricity consumption dataset of SGCC. The dataset consists of customer identification number, flag, and features. There are 1035 features, which are the daily consumption of electricity. We compare the proposed system model with other existing techniques such as SVM, RF, LR, LSTM, CNN, and MLP-CNN, and show that our BGRU outperforms these techniques.
In future work, we will integrate other methods with BGRU to yield better results. Moreover, we will apply the BGRU model in other areas such as bank fraud and other theft detection problems.
Author Contributions: H.G. and N.J. proposed and implemented the main idea. I.U. and A.M.Q. wrote the simulation section. M.K.A. and G.P.J. organized and refined the manuscript. All authors worked together and responded to the honorable reviewers' comments. All authors have read and agreed to the final version of the manuscript.