Electricity Theft Detection in Smart Grid Systems: A CNN-LSTM Based Approach

Abstract: Among an electricity provider's non-technical losses, electricity theft has the most severe and dangerous effects. Fraudulent electricity consumption decreases the supply quality, increases generation load, causes legitimate consumers to pay excessive electricity bills, and affects the overall economy. The adoption of smart grids can significantly reduce this loss through data analysis techniques. The smart grid infrastructure generates a massive amount of data, including the power consumption of individual users. Utilizing this data, machine learning and deep learning techniques can accurately identify electricity theft users. In this paper, an electricity theft detection system is proposed based on a combination of a convolutional neural network (CNN) and a long short-term memory (LSTM) architecture. CNN is a widely used technique that automates feature extraction and the classification process. Since the power consumption signature is time-series data, we were led to build a CNN-based LSTM (CNN-LSTM) model for smart grid data classification. In this work, a novel data pre-processing algorithm was also implemented to compute the missing instances in the dataset, based on the local values relative to the missing data point. Furthermore, in this dataset, the count of electricity theft users was relatively low, which could have made the model inefficient at identifying theft users. This class imbalance scenario was addressed through synthetic data generation. Finally, the results obtained indicate the proposed scheme can classify both the majority class (normal users) and the minority class (electricity theft users) with good accuracy.


Introduction
Because of the high cost of acquiring energy, as well as the limited amount of energy resources, efficient and effective use of energy resources is a very important aspect of social and economic development for any country. The smart grid has become a key solution for making the best use of future energy monitoring. The smart grid system can be described as an entire electricity network consisting of the power system infrastructure and computers to manage and monitor energy usage, along with an intelligent monitoring system that tracks the usage pattern and mode of action of all consumers connected to the system [1]. The smart grid provides utilities and customers the facility to monitor, control, and predict energy use by integrating modern digital equipment with the existing electrical system. In this system, the collector device delivers usage readings to the operational center via the internet, and the power transmission company performs the billing process based on these readings. At the same time, the operation center collects user readings from neighborhood customers' periodic updates through a wireless network. The main target is to reduce losses due to energy wastage and provide viable, cost-effective, and secure electricity supplies [2]. The device that performs usage reporting is known as a smart meter; it is a computerized version of the traditional meter. The processor, nonvolatile storage, and communication facilities, along with the ability to support widespread customer energy generation, make smart meters an important part of smart grid systems.
Today, electric power loss has become one of the most conspicuous issues affecting both conventional power grids and smart grids. Statistics show that transmission and distribution losses increased from 11% to 16% between 1980 and 2000 [3]. Electricity losses vary from country to country: the losses in the USA, Russia, Brazil, and India were 6%, 10%, 16%, and 18% of their total energy production, respectively [4]. The difference between the energy produced in a system and the metered energy delivered to users is known as the power loss. Smart meters in smart grids play a prominent role in determining the amount of electricity loss. Advanced energy meters obtain information from the consumers' load devices and measure the consumption of energy at hourly intervals. The energy meter provides additional information to the utility company and the system operator for better monitoring and billing, and provides two-way communication between utility companies and consumers [5]. It is also possible to limit the maximum amount of electricity consumption, as well as to terminate and re-connect the supply of electricity from any remote place [6].
Electricity loss is mainly classified into two categories, namely technical loss (TL) and non-technical loss (NTL). TL occurs because of the Joule effect in power lines and transformer losses during the transportation of electricity [7]. The calculation of TL is quite complex, making it difficult to locate the point of loss and estimate the amount of energy dissipated. TL cannot be stopped completely, but it can be reduced by applying modification techniques throughout the system. NTL can be defined simply as the difference between the total loss and TL [8]. The main causes of NTL are billing delays and irregularities, energy theft, faulty energy meters, fraud, and unpaid bills [9]. Many researchers mention a different class of loss, named gratis, which occurs when electricity is provided free of charge [10]. In recent times, cyber tampering has become a way to malevolently change the consumption data patterns of the smart meter, which decreases the bill for the users [11]. Though such electricity loss is caused by a small number of consumers, it decreases the profitability and energy efficiency of the utility companies, resulting in increased costs for all users and creating problems such as load-shedding, disruption of industrial routines, and inflation. NTL costs a huge amount of money for both developed and developing countries such as the USA, UK, Brazil, Malaysia, and India [12,13]. For example, these losses in the power sectors of India and Brazil cost approximately US $44.5 and US $3 billion per year, respectively [14]. So, for these countries, the prevention of electricity theft is a major challenge [15] when working to strengthen the economy. Real-time fraud detection is the only way to get rid of this problem.
Among the various types of NTL, electricity theft viciously damages the revenue of power sector companies, causing a significant loss of energy resources and damaging the economy of any country. To mitigate this problem, several techniques have been implemented through prolific research to identify and abate the theft issue. Researchers categorize energy theft detection (ETD) systems into three basic groups: state-based, game theory-based, and classification-based detection [16]. The use of upgraded devices and sensors in state-based detection results in higher accuracy in ETD, as proposed in earlier studies [9,17,18]. The main limitations of applying this detection system are vulnerability, the higher cost of hardware devices, and maintenance of the devices. Cardenas et al. presented a game theory-based detection to find an optimal solution, based on formulating various potential strategies [19]. Computing the utility function among distributors, regulators, and thieves is the biggest challenge in this process. The daily electricity consumption patterns of users can be analyzed by using machine learning algorithms to establish classification models, which include decision trees, random forests (RF), support vector machines (SVM) [20], neural networks (NN) [21], and so on.
The periodic description of power consumption plays a major role because the electricity consumption pattern of each user is very significant in the field of ETD. Based on the evaluation of smart meter data, many researchers have proposed techniques to identify fraudulent data. For example, Gue et al. [11] formed a three-level framework, Giani et al. [22] presented a phasor measurement unit (PMU)-based security system to counter cyber-attacks, and Najmeddine et al. [23] developed a matrix pencil approach for theft detection solutions. Along with that, rough set and rule-based models were also applied to detect NTLs [20,24,25]. Conventional electricity theft detection has been addressed with statistical techniques [26,27], comparison of abnormal and normal meter readings, fuzzy networks, and rough sets [28,29]. In a very recent paper [30], Bernat Coma-Puig and Josep Carmona utilized the XGBoost, LightGBM, and CatBoost learning algorithms to detect NTL. In recent times, the new idea of smart grids has brought a new era in unraveling electricity theft; in most cases, data from smart meters is used for further analysis.
As the amount of data held by power utility companies is very large in most cases, machine learning-based classification has received significant attention in recent times. The daily consumption data is used to find theft patterns while preserving the privacy of consumers [16]. In [31-33], SVM was applied to detect anomalies and irregularities in the collected data through clustering and classification. In many algorithms, clustering is applied as a primary or secondary step, which makes this technique appropriate for modeling and identification of any energy consumption profile. Beyond SVM alone, techniques such as one-class SVM, optimum path forest, and the C4.5 decision tree were combined in [34]; other works combined the decision tree with SVM [20], fuzzy logic with SVM [13], the genetic algorithm-support vector machine (GA-SVM) [35,36], and the extreme learning machine (ELM) and online sequential ELM (OS-ELM) [37-39] to achieve higher accuracy in theft detection.
The extension and availability of the internet increase the concern of cyber-attacks in the smart grid system. With the help of the Merkle hash tree concept, Wei et al. [35] proposed a security framework. In [40], the smart grid metering infrastructure was linked with security threats to make data more confidential and secure. Unsupervised methods such as fuzzy classification were performed with the computation of the Euclidean distance to the cluster center in [41]. In [42], a wavelet-based technique was applied to analyze features to identify fraudulent consumers using an artificial neural network (ANN)-based approach. Another technique in [21] showed better fraud detection efficiency in smart grid systems, where an ANN is incorporated with an SVM to construct a hybrid scheme. Load profiles have become an alternative and cost-effective approach to identify fraudulent consumers [37]. Researchers have worked with different pattern recognition methods, which have been used as load profiling tools based on the locally recorded pattern [43-45].
In addition to the classical machine learning approaches, deep learning approaches have achieved huge success in areas like image classification and computer vision [46], and speech recognition [47]. Due to their ability to handle and control huge amounts of data and to automate the feature extraction and classification process, deep learning techniques are utilized to build models that work with smart meter data originating from smart grids. A wide and deep CNN structure was proposed in [7] to detect electricity theft in smart grid environments. Hybrid deep learning techniques have been utilized in recent times for load forecasting. A combination of CNN and LSTM structures was proposed in [48], where the model was used for short-term load forecasting; the proposed model performed quite well in comparison to other approaches. Another similar model was suggested in [49] for similar purposes. This CNN-LSTM hybrid structure is also used in predicting electricity price [50] and household power consumption [51]. In all of these applications, the CNN-LSTM model exhibited very good performance. However, in [50,51], the electricity price was predicted using raw data without any preprocessing. The quality of data can be affected by various aspects of the sensors, such as sensor power scale, noise level, and so on, so it is essential to apply an appropriate preprocessing technique to achieve generalized performance. These studies used hybrid structures like the CNN-LSTM model on electricity consumption datasets to develop regression models. In this study, we combined the hybrid CNN-LSTM model with a preprocessing technique on the electricity consumption signature dataset to solve a classification problem. This encouraged us to exploit this hybrid structure to detect electricity theft by analyzing the irregular and abnormal consumption patterns of consumers.
The remainder of this paper is organized as follows. In Section 2, the overall methodology is presented, including the data preprocessing technique and the proposed CNN-LSTM model. The experimental results, with robustness and reliability analysis using numerous evaluation metrics, are presented in Section 3. Finally, we conclude this paper in Section 4.

Materials and Methods
This work was intended to identify electricity theft from the power consumption patterns of users, utilizing a CNN-LSTM-based deep learning technique. The classifier model was trained in a supervised manner on a dataset consisting of daily power consumption data of both normal and fraudulent users. First, the data was prepared by a data preprocessing algorithm to train the model. The preprocessing step also involved synthetic data generation for better performance. At the next stage, the hyperparameters of the proposed model were tuned, and finally, the optimized model was evaluated on the test data. The overall methodology is depicted in Figure 1.



Electricity Theft Data
The study was carried out on a collection of real electricity consumption data of consumers, which was made available by the State Grid Corporation of China (http://www.sgcc.com.cn/). Table 1 presents the metadata information about the dataset. The dataset is composed of electricity consumption signatures of 9655 consumers over 1 year. Our primary observation revealed that a normal user and an abnormal user generate different patterns of electricity consumption. Figure 2 displays the fortnightly consumption of two consumers; one is a normal user and the other is an electricity thief. The consumption trend indicates that an abnormal or electricity theft user has a pattern that fluctuates more than that of a normal user. Electricity consumption data is generally acquired through smart meters or various sensors located at the user end. The data is then aggregated to a central location through a data communication network. In this scenario, there is a possibility of smart meter failure, sensor malfunctioning, or faults in data transmission and the storage server. It is therefore inherent that missing or erroneous data will be present in electricity consumption datasets, and in this dataset, we found numerous missing values. If those missing instances are simply discarded, the size of the dataset shrinks considerably, making reliable analysis difficult. To avoid downsizing the dataset, we proposed a data preprocessing algorithm, with which we filled in the missing values.

The electricity theft dataset is also a typical example of an imbalanced dataset, a dataset where the instances of one class are significantly fewer than those of the other class. The distribution of the two classes, the normal users and the theft users, is presented in Figure 3. The distribution shows that the number of theft users is very small compared to that of the normal users. This is called a class imbalance problem. The consequences of using such an imbalanced dataset to train a model are described in the results section: the resulting model can successfully classify only the majority class. This class imbalance problem was addressed by generating synthetic data to increase the minority class. Then, the classification model was trained with the balanced dataset.

Data Preprocessing
The data preprocessing task was divided into two parts. First, the missing data was computed using a preprocessing algorithm; then, synthetic data points were generated to solve the class imbalance problem.



Computing the Missing Values
Our proposed preprocessing algorithm utilized the local average value of the consumed power to calculate the missing values, following the preprocessing method used in [7]. If there was a missing or NaN value at position x_i, the value was computed from the local average of the entries surrounding the missing point. The parameter U_k holds a binary value obtained by thresholding the value at entry k. We experimentally found that the local values have an equal probability of occurrence, and the value of P_k was chosen as 0.10. One special case needed to be tackled, in which there was a continuous run of NaN entries. Such cases were handled by inserting the row average into those entries before the preprocessing was performed.
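The exact imputation equations did not survive extraction here, but the described procedure can be sketched as follows: fill each NaN from the average of the valid values local to the missing point, falling back to the row average for a continuous run of NaNs. The symmetric window width is an assumption, and the U_k/P_k thresholding step is omitted since its exact form is not recoverable from this excerpt:

```python
import math

def row_average(row):
    """Mean of the non-NaN entries in a row of daily consumption values."""
    valid = [v for v in row if not math.isnan(v)]
    return sum(valid) / len(valid) if valid else 0.0

def impute_row(row, window=1):
    """Fill NaN entries with the local average of valid neighbours.

    The paper computes each missing value from values local to the
    missing point; the window width here is an assumption, and the
    U_k/P_k thresholding rule is not reproduced.
    """
    filled = list(row)
    # A continuous run of NaNs has no valid neighbours; as described in
    # the text, fall back to the row average for those entries.
    fallback = row_average(row)
    for i, v in enumerate(row):
        if math.isnan(v):
            lo, hi = max(0, i - window), min(len(row), i + window + 1)
            neigh = [row[j] for j in range(lo, hi)
                     if j != i and not math.isnan(row[j])]
            filled[i] = sum(neigh) / len(neigh) if neigh else fallback
    return filled

nan = float("nan")
print(impute_row([2.0, nan, 4.0, nan, nan, nan, 6.0]))
```

An isolated gap is filled from its immediate neighbours, while the middle of a long NaN run receives the row average.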

Generation of Synthetic Data Points
In this dataset, the number of fraudulent observations is significantly lower than that of non-fraudulent observations, as depicted in Figure 3. A classification model created from such a dataset could be biased towards the majority class: though the model may show good accuracy, it is prone to misclassify the minority class. This class imbalance problem can be counteracted using one of two major approaches: the cost function-based approach and the sampling-based approach. In this paper, we utilized the sampling-based approach. Sampling-based approaches perform under-sampling or over-sampling on the imbalanced dataset to reduce the disparity between the amounts of data in the two classes. Under-sampling randomly discards majority class entries to reduce the number of majority class instances. This technique shrinks the size of the dataset, which is advantageous from a computation perspective, but the random removal may discard potentially significant information, and the remaining data may not properly represent the population. The resulting model may therefore produce less accurate results on the test data.
On the other hand, the over-sampling technique replicates the minority class to increase the number of minority instances. Though no information is lost in this approach, the model developed is likely to suffer from overfitting due to the replication of data points. By generating synthetic data rather than just replicating the minority class to balance the dataset, the overfitting problem can be avoided. In this paper, we used the synthetic minority over-sampling technique (SMOTE) to generate synthetic data from the minority instances [52]. SMOTE introduces synthetic data points along the line segments joining a minority class instance to any or all of its k nearest minority neighbors in the feature space. If x = (x_1, x_2) is an instance of the minority class and its chosen nearest neighbor is x' = (x'_1, x'_2), then a new data point is synthesized as x_new = x + random(0, 1) · Δ, where Δ = (x'_1 − x_1, x'_2 − x_2) and random(0, 1) provides a random number between 0 and 1.
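The interpolation step above can be sketched in a few lines; the k-nearest-neighbour search that selects x' is omitted here for brevity:

```python
import random

def smote_point(x, x_nn):
    """Synthesize one SMOTE sample between a minority instance x and
    one of its nearest minority neighbours x_nn:
        x_new = x + r * (x_nn - x),  r ~ Uniform(0, 1)
    """
    r = random.random()
    return [xi + r * (ni - xi) for xi, ni in zip(x, x_nn)]

random.seed(0)
x, x_nn = [1.0, 2.0], [3.0, 6.0]
x_new = smote_point(x, x_nn)
# x_new lies on the line segment joining x and x_nn.
print(x_new)
```

In practice, a library implementation such as imbalanced-learn's `SMOTE` class handles the neighbour search and the resampling ratio for the whole dataset.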
Using the SMOTE algorithm, the number of minority class instances was increased to balance the theft dataset. The distribution of the two classes after utilizing SMOTE is shown in Figure 4, where the counts of normal users and theft users were made equal.


Proposed CNN-LSTM Architecture for Smart Grid Data Classification
In this work, the integration of a convolutional neural network (CNN) and long short-term memory (LSTM) was utilized to solve a classification task. The CNN has an automatic feature extraction ability from the given dataset, and the LSTM performs better in the case of sequential data. The combination of both has been investigated in different applications, such as text extraction from images or video, sentiment analysis, and natural language processing. In this paper, a CNN-LSTM model was used to solve a binary classification problem. We utilized 7 hidden layers, where the first 4 hidden layers performed convolutional operations. Each of the convolutional layers consisted of twenty feature maps. The remaining hidden layers performed the LSTM operation; the first, second, and third LSTM layers consisted of 10, 5, and 100 neurons, respectively.
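As a rough illustration of how such a stack processes a consumption sequence, the helper below traces the sequence length through four valid 1-D convolutions before it reaches the LSTM layers. The 365-step input (one reading per day over the year), kernel size 3, stride 1, and zero padding are assumptions for illustration; the excerpt does not state the actual kernel sizes:

```python
def conv1d_out(length, kernel=3, stride=1, padding=0):
    """Output length of a 1-D convolution: floor((L + 2p - k) / s) + 1."""
    return (length + 2 * padding - kernel) // stride + 1

# One year of daily consumption -> 365-step sequence (assumed input size).
length = 365
for _ in range(4):           # four convolutional layers, 20 feature maps each
    length = conv1d_out(length)
print(length)                # -> 357, the sequence length handed to the LSTM stack
```

Each valid convolution with a width-3 kernel trims one step from each end, so the 365-step input shrinks by 8 steps before the 10/5/100-unit LSTM layers.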

CNN Model
A CNN is a subclass of neural network proposed by Y. LeCun et al. [53], inspired by the working principle of the human visual cortex for object recognition. CNNs were designed for identifying objects, as well as their classes, in an image. A CNN differs from conventional machine learning algorithms in the context of feature extraction, where the CNN extracts features through a number of stacked layers.
Generally, a CNN architecture consists of several convolution layers and pooling layers, followed by one or more fully connected (FC) layers. The convolutional layer is the principal building block of a CNN. Convolution is a mathematical operation that acts upon two sets of information; in the discrete case it can be written as (f * g)[n] = Σ_m f[m] g[n − m]. In the case of CNNs, the two sets of information are the input data and a convolution filter, also called the kernel. The convolutional operation is performed by sliding the kernel over the entire input, which produces a feature map. In practice, different filters are utilized to perform multiple convolutions to produce distinct feature maps. These feature maps are finally integrated to formulate the final output of the convolution layer.
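The sliding-kernel operation can be sketched on a toy 1-D input (valid mode only, no padding):

```python
def conv1d(signal, kernel):
    """Valid-mode 1-D convolution: flip the kernel, slide it over the
    input, and sum the elementwise products to build the feature map."""
    k = list(reversed(kernel))   # CNN layers usually skip this flip (cross-correlation)
    n = len(signal) - len(k) + 1
    return [sum(signal[i + j] * k[j] for j in range(len(k))) for i in range(n)]

print(conv1d([1, 2, 3, 4], [1, 0, -1]))  # -> [2, 2]
```

With the difference kernel [1, 0, −1], each feature-map entry measures the local change in the signal, which is exactly the kind of local pattern a learned kernel picks up.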
Activation functions are used after the convolution operation to introduce non-linearity to the model. Different activation functions such as the linear function, sigmoid, and tanh are used, but the rectified linear unit (ReLU) was used in the proposed CNN since it trains the model faster and helps the weight optimization approach a near-global optimum. The ReLU activation function is defined as ReLU(x) = max(0, x). The pooling layer appears next to the convolution layer. This layer down-samples each feature map to reduce its dimensions, which in turn reduces overfitting and training time. Max pooling, which simply selects the maximum value in the pooling window, is widely used in CNNs.
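Both operations are elementwise and cheap; a minimal sketch with toy values:

```python
def relu(x):
    """ReLU(x) = max(0, x): zero out negatives, pass positives through."""
    return max(0.0, x)

def max_pool(feature_map, window=2):
    """Non-overlapping max pooling: keep the largest value per window."""
    return [max(feature_map[i:i + window])
            for i in range(0, len(feature_map) - window + 1, window)]

fm = [relu(v) for v in [-1.0, 2.0, -3.0, 4.0]]  # -> [0.0, 2.0, 0.0, 4.0]
print(max_pool(fm))                              # -> [2.0, 4.0]
```

Note how pooling halves the feature-map length while preserving its strongest activations.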
The FC layer is essentially a fully connected artificial neural network. In a nutshell, in a CNN, the convolution and pooling layers extract low-level features such as edges, lines, ears, eyes, and legs, and the fully connected layer performs the classification task based on these features. The activation function used in this final classification layer is typically a SoftMax function, which assigns to each class a probability value such that the probabilities add up to 1. The SoftMax function is defined as SoftMax(φ)_j = exp(φ_j) / Σ_k exp(φ_k). If the weight matrix is denoted as W and the feature matrix by X, then φ in the above equation is generalized as φ = W · X.
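The SoftMax definition above translates directly into code; subtracting the maximum logit first is a standard numerical-stability trick that leaves the result unchanged:

```python
import math

def softmax(phi):
    """softmax(phi)_j = exp(phi_j) / sum_k exp(phi_k); outputs sum to 1.
    Subtracting max(phi) first keeps exp() from overflowing."""
    m = max(phi)
    exps = [math.exp(p - m) for p in phi]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))  # the class probabilities add up to 1
```

The largest logit always maps to the largest probability, which is why the predicted class is simply the argmax of the SoftMax output.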

LSTM
Long short-term memory (LSTM) networks are a special class of recurrent neural networks (RNN) designed to avoid the short-term memory problem of RNNs.LSTMs are capable of remembering and propagating significant information from the initial stages of the network towards the final stage.In this work, we used the fundamental structure of an LSTM, as shown in Figure 5.An LSTM has a similar repetitive structure to that of an RNN, but the modules have different internal components, as seen in Figure 5.The important part of an LSTM is the cell state, which conveys information along the chain.The information in the cell state is dropped or modified by several units called gates.An LSTM module consists of three gates, the forget gate, the input gate, and the output gate.The forget gate consists of a sigmoid layer that takes the previous hidden state (ℎ ) and the current input ( ) to produce an output between 0 and 1.This layer actually decides what information should be kept or discarded.A zero value means to forget the previous information and one means to keep the previous information.The forget gate output is written as Later, the forget gate utilizes a sigmoid function and a tanh function to decide what information is going to be added in the cell state.Both the functions take ℎ and  as the input.The output of the sigmoid determines whether the current information is important or not, whereas the tanh function regulates the network by squashing the value between −1 and +1.Finally, both the outputs are multiplied.

𝑖 = 𝜎(𝑊 • ℎ , 𝑥 + 𝑏 )
(10) With the output from the forget gate and input gate, the information in the cell state is updated.It is done through pointwise multiplication of the current cell state and the forget gate output.If  is 0, the multiplication will also result in zero, which means total dropout of the previous value.Otherwise, if  is 1, it is retained.Later, the pointwise addition updates the cell state as In the final stage, the output gate determines the final output.This output also acts as the next hidden state, ℎ .In this gate, a sigmoid function takes ℎ and  as the input and the current cell state  is passed through a tanh function.Then, the sigmoid output and the tanh output are multiplied to determine what information the hidden layer is going to carry.
Therefore, our proposed wide and deep CNN-LSTM model, depicted in Figure 6, in which the CNN layers precede the LSTM layers, is highly efficient and robust when used for smart grid data classification.

Results and Discussion
In this section, we verify the efficacy of the proposed CNN-LSTM model, with the novel missing-data treatment, for identifying fraudulent users in the benchmark smart grid dataset. To ensure the reliability and robustness of the experimental analysis, we used the performance metrics described below.

Performance Metrics
In this work, the model performance is evaluated using several performance metrics. They are briefly described below.

Binary Cross-Entropy
Cross-entropy is a loss function widely used for classification problems. When P_i denotes the true probability distribution and Q_i the probability predicted by a machine learning model for class i, the cross-entropy is given by

H(P, Q) = −Σ_i P_i log(Q_i)

The cross-entropy decreases toward zero as the prediction becomes more and more accurate. When the classification model is used to classify only two classes, the loss calculation involves only two probabilities. This is called binary cross-entropy (BCE). Mathematically, it is defined as

BCE = −[y log(ŷ) + (1 − y) log(1 − ŷ)]

where y is the true label and ŷ is the predicted probability. In this work, the BCE was utilized as the cost function to determine how well the model performed on the training and test datasets.
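The BCE computation can be sketched in NumPy as follows. This is a minimal sketch; the clipping constant `eps` is our own guard against log(0) and the sample probabilities are invented for illustration:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over a batch; eps guards against log(0)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred)
                    + (1.0 - y_true) * np.log(1.0 - y_pred))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
# Confident, mostly correct predictions vs. near-random predictions.
good = binary_cross_entropy(y_true, np.array([0.9, 0.1, 0.8, 0.2]))
bad = binary_cross_entropy(y_true, np.array([0.4, 0.6, 0.5, 0.5]))
```

As the text notes, the loss shrinks toward zero as the predictions approach the true labels, so `good` is smaller than `bad`.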

Matthews Correlation Coefficient (MCC)
This metric is primarily designed for examining the performance of binary classification problems. The MCC is a single number extracted from the parameters of a confusion matrix. A confusion matrix, presented in Figure 7, is a technique for computing the accuracy of a classification model.
For binary classification, the Matthews correlation coefficient is given by

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

The value of the MCC ranges between −1 and 1, with 1 being a perfect prediction and −1 a completely incorrect prediction.

F-measure
The F-measure, or F1 score, is a more useful measure than accuracy in cases where the dataset has an imbalanced class distribution. In such a scenario, high accuracy does not imply a robust model. The F1 score is defined as

F1 = 2 × (Precision × Recall) / (Precision + Recall)

where precision and recall are expressed as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

It can be seen that the F1 score encompasses both false positives and false negatives, which makes it a more useful measure than accuracy for any imbalanced dataset.
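The MCC and F1 formulas above can be computed directly from the four confusion-matrix counts. The counts below are hypothetical, chosen only to illustrate the arithmetic:

```python
import math

def scores(tp, fp, fn, tn):
    """F1 and MCC from the four confusion-matrix counts (positive = theft)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return f1, mcc

# Hypothetical imbalanced test set: few theft users, many normal users.
f1, mcc = scores(tp=40, fp=10, fn=20, tn=930)
```

With these counts, precision is 0.8 and recall is about 0.667, so the F1 score is about 0.727 and the MCC about 0.715, despite an overall accuracy near 97%, illustrating why these metrics are preferred on imbalanced data.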

Experimental Results
Initially, we trained the model with the weekly, fortnightly, and monthly consumption patterns. The results are presented in Figure 8a–d. The plots in Figure 8 show that the parameter values remained almost constant, with a slight upward trend. This trend encouraged the use of the whole dataset for training the model.
Here, we present the performance comparison of the proposed model on the raw dataset and on the transformed dataset after preprocessing and synthetic data augmentation. Table 2 summarizes the performance parameters in these two cases. In both cases, we set the training ratio to 80%. Before the synthetic data were added to the dataset, the number of fraud users was small compared with the number of true users. The overall accuracy is almost the same in both cases. However, due to the class imbalance, the model could not classify the fraud users very successfully, which is evident from the values of precision, recall, and F1 score. On the other hand, the inclusion of synthetic data in the dataset improved the model's ability to classify the fraud users.


Figure 9 shows three other performance parameters plotted against the number of epochs. As the number of epochs increased, the training loss decreased and the training accuracy increased, but the validation loss increased and the validation accuracy showed a downward pattern. This is a case of overfitting, as the model did not generalize to the unseen data. Another metric widely accepted for binary classification is the Matthews correlation coefficient (MCC), typically used in cases of imbalanced datasets. A value of +1 indicates a perfect prediction, but, as with the patterns of loss and accuracy, the MCC also decreases for the test set. As shown in Table 2, although the overall accuracies were very similar before and after SMOTE, the graphs reveal contradictory results for the per-class accuracies of normal users and theft users without SMOTE. This is because, even though the overall accuracy is good, the model does not classify the test data well due to the imbalanced dataset. After implementing SMOTE to generate synthetic data, the model was trained using the new dataset. The training performance for the balanced dataset is shown in Figure 10.
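The balancing step can be illustrated with a simplified sketch of the interpolation idea behind SMOTE: each synthetic sample is placed on the line segment between a minority-class sample and one of its k nearest minority-class neighbours. This is our own simplified sketch, not the exact algorithm or implementation used in this work:

```python
import numpy as np

def smote_like(X_min, n_new, k=3, rng=None):
    """Generate n_new synthetic minority samples by interpolating each
    chosen sample toward one of its k nearest minority-class neighbours
    (simplified SMOTE-style sketch)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Euclidean distances from sample i to all other minority samples.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf                     # exclude the sample itself
        nn = np.argsort(d)[:k]            # indices of k nearest neighbours
        j = rng.choice(nn)
        gap = rng.random()                # interpolation factor in [0, 1)
        synth.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synth)

# Tiny hypothetical minority class (theft users) in a 2-D feature space.
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2]])
X_new = smote_like(X_min, n_new=8)
```

Because each synthetic point is a convex combination of two real minority samples, all generated points stay within the bounding box of the original minority class rather than being arbitrary noise.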
In this case, the same parameters do not show divergent patterns for the train and test sets. The test accuracy remains unchanged after 400 epochs, reaching an accuracy rate of 89% at 500 epochs. Further, the test loss remains nearly constant after the same number of epochs. The MCC for the test set also attained a value of 0.8, which is indicative of the good prediction capability of the model. The confusion matrices for the two cases are compared in Figure 11, where the impact of increasing the number of instances in the dataset is clearly reflected. In the first scenario, the model had a severe performance loss when classifying fraud users, because only a small fraction of the instances in the initial dataset were for electricity thieves. After processing, those instances were increased, and the model was able to classify the minority class (the theft users) considerably well. In every instance, 80% of the data were used for training and the remaining 20% were used as validation data.
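The confusion-matrix counts discussed here can be tallied directly from the predicted and true labels. The labels below are invented for illustration, with 1 denoting the theft (positive) class:

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """TP, FP, FN, TN for binary labels, treating 1 as the theft class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    return tp, fp, fn, tn

y_true = [1, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)  # → (2, 1, 1, 4)
```

These four counts are exactly the inputs to the precision, recall, F1, and MCC formulas given earlier.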

Performance Comparison
We trained a logistic regression model, the basic model for the binary classification problem, with our dataset. The support vector machine (SVM) algorithm, another popular classification algorithm used in several studies for electricity theft detection, was also used to build a classification model with the same dataset. The results are presented in Table 3.

Different machine learning and deep learning algorithms have been investigated for electricity theft detection on various datasets. However, this dataset has not been studied using other classification algorithms. Therefore, we present a comparison with several other works, based on different datasets, that performed electricity theft detection. The comparison is presented in Table 4.

Figure 1. Block diagram of the proposed electricity theft detection system.


Figure 3. Distribution of normal and theft users in the dataset.


Figure 4. Distribution of normal and theft users after applying the synthetic minority over-sampling technique (SMOTE).


Figure 11. Confusion matrices for the imbalanced and balanced datasets.


Table 1. Metadata information of the electricity theft dataset.


Table 2. Model performance on the class-imbalanced and balanced datasets.
