Data-Driven Approaches for Energy Theft Detection: A Comprehensive Review

Abstract: The transition to smart grids has served to transform traditional power systems into data-driven power systems. The purpose of this transition is to enable effective energy management and system reliability through an analysis that is centered on energy information. However, energy theft caused by vulnerabilities in the data collected from smart meters is emerging as a primary threat to the stability and profitability of power systems. Therefore, various methodologies have been proposed for energy theft detection (ETD).


Introduction
The development of the smart grid has revolutionized the operation and management of energy systems [1]. The smart grid has been recognized as a future solution for smart energy monitoring as it is aimed at stabilizing and optimizing complex energy networks. Among these digital transformations in the energy management system, the capability to analyze comprehensive energy information is essential. The collection and analysis of energy information plays an important role in improving energy management by enabling the understanding of energy usage patterns, optimizing energy distribution, and minimizing energy loss. Furthermore, it can enhance the reliability and stability of energy systems by detecting potential problems and implementing appropriate measures to prevent system failures.
The transformation of energy utilities into smart grids is being made possible by the introduction of metering systems due to the widespread adoption of smart meters. The large amount of energy information collected from smart meters serves to provide various data-driven services to individuals, enterprises, and utilities. These services include demand forecasting, flexibility forecasting, impedance estimation, phase grouping, remote switching, and hosting capacity [2]. However, several concerns regarding data vulnerability have surfaced as energy management systems increasingly evolve into data-driven systems [3]. Energy information is susceptible to hacking through unauthorized access, thus posing significant threats to consumer security and privacy [4]. One primary threat from data vulnerability is energy theft, which can lead to substantial financial losses for distribution system operators (DSOs) and consumers [3,4]. The methods used in energy theft typically aim to reduce electricity bills or illicitly take financial benefits from the power grid, which damages the infrastructure of the energy system and results in significant revenue losses for utilities. According to the Northeast Group [5], energy theft incurs a massive global cost of USD 96 billion annually. In the USA, Progress Energy Incorporated reported a 5% increase in such incidents [6].
In order to build data-driven systems, big data techniques need to be considered. Big data characteristics can be summarized by the 5 Vs: velocity, volume, value, variety, and veracity [7]. These aspects are crucial in the analysis and processing of large datasets, such as those used in data-driven approaches. Therefore, integrating artificial intelligence (AI) with big data enables the extraction of valuable insights and the enhancement of predictive models, thereby improving applications such as data-driven energy theft detection (ETD).
In order to prevent energy theft from causing losses in the smart grid, various methodologies for detecting energy theft have been researched. ETD has mainly been classified into hardware-based and data-driven methods [8]. Hardware-based ETD has been proposed to detect the energy theft that occurs through the physical manipulation of mechanical meters in traditional power grids. Data-driven ETD has been proposed to detect suspected energy theft based on energy usage and power flow patterns. With the development of smart grids, the focus has shifted to data-driven ETD because energy theft primarily occurs through data manipulation rather than the physical manipulation of meters [9].
In Figure 1, the technical elements in designing ETD using a data-driven approach are shown. The design of data-driven ETD involves several steps, such as data collection, data processing, malicious behavior modeling, and developing an intelligent algorithm that can detect energy theft. Most of the prior research has relied on various methodologies, including machine learning and deep learning, to achieve the guaranteed performance of data-driven ETD. A deep learning-based ETD model may perform better on datasets with a balanced distribution, but its performance may degrade on datasets with an imbalanced data distribution [10]. ETD with an imbalanced dataset can be biased, suitable only for specific cases, and pose challenges to model scalability.
Generative AI models are being introduced in ETD using augmentation with training datasets to handle the limitations of energy theft datasets [3,11–13]. Unlike traditional AI models, generative AI can be utilized to detect anomalies in energy consumption patterns by analyzing vast volumes of data. Moreover, generative AI can help to analyze data distribution and extract features across large spaces and high dimensions related to energy theft [14,15]. Generative AI-based ETD can help reconstruct an energy theft dataset in which a practical energy theft occurs. Even though generative AI can be expected to bring various improvements to ETD, there is still a lack of research.
In Table 1, a comparison of previous survey papers on data-driven ETD is presented. While existing review papers focus on specific aspects of data-driven ETD, such as methodologies [16–20], smart grid components [21], consumer privacy [22], and the scale of energy usage [23], there has not been an in-depth analysis of data-driven ETD that emphasizes the limitations of energy theft datasets. This survey addresses these limitations and categorizes AI modeling based on the types of energy theft datasets. Additionally, this paper provides insights into designing data-driven ETD using state-of-the-art generative AI, offering a comprehensive perspective for future research.
In this paper, a comprehensive analysis of more than 80 research papers was carried out. The distribution of these publications is summarized in Figure 2a. Figure 2b shows the yearly distribution of the analyzed papers, highlighting the increasing interest in data-driven approaches for ETD and various AI algorithms in recent years. The investigated papers outlined definitions of energy theft type, limitations of datasets, and methodological perspectives for designing data-driven ETD.
In Figure 3, a comprehensive categorization of the proposed data-driven approaches for ETD is shown to address the challenges and to refine the methodologies. The diverse aspects of ETD are represented, including types of energy theft such as meter tampering, meter malfunctioning, cyber-attacks, feeder tapping, and billing irregularities. Additionally, various dataset issues are highlighted, such as high-dimensional data, imbalances, inaccurate readings, and missing labels. To address these issues, data-driven methodologies are categorized into supervised learning, semi-supervised learning, and generative AI.
The structure of this paper is as follows. Section 2 offers a review of energy theft methods used in energy management systems. Section 3 discusses the limitations of the datasets used in the design of data-driven ETD and technical methods. Section 4 reviews conventional AI and generative AI methods for detecting energy theft. Section 5 discusses the challenges and opportunities of ETD using generative AI, and the conclusions are presented in Section 6.

Energy Theft in Energy Management System
In an energy management system, losses are categorized into technical and non-technical losses [23]. Technical losses are caused by the energy dissipated in the conductors used in transmission, sub-transmission, and distribution lines. These losses are inherent to transmitting electricity over long distances and through various network components. Reducing these losses is crucial for improving grid efficiency and sustainability, thus requiring investments in efficient network equipment, advancements in the design and planning of transmission and distribution networks, and deploying intelligent technologies to monitor and manage electricity flow effectively. Non-technical losses are caused by external factors in the energy system, including administrative inefficiencies, theft, and billing errors, which are rampant with malicious and illegal activities. One of the significant contributors to non-technical losses, energy theft, is characterized by the unauthorized and illegal consumption of electricity. Energy theft leads to economic losses that are incurred by distributors and providers along with potential hazards, deterioration of service, and safety issues [24,25].
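The split between the two loss categories can be written as a simple energy balance. The following is a minimal sketch; the symbols are introduced here for illustration only and are not taken from the cited works.

```latex
E_{\mathrm{loss}} = E_{\mathrm{supplied}} - E_{\mathrm{billed}},
\qquad
E_{\mathrm{NTL}} = E_{\mathrm{loss}} - E_{\mathrm{TL}}
```

Here, $E_{\mathrm{TL}}$ denotes the technical losses dissipated in the network and $E_{\mathrm{NTL}}$ the non-technical losses, of which energy theft is one contributor alongside administrative inefficiencies and billing errors.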
Energy theft encompasses the various malicious practices that individuals employ to consume energy illegally or to reduce their billed energy consumption. These methods are broadly categorized based on whether they involve tampering with metering devices or bypassing metering systems [23]. For users connected to a medium-voltage network, meter tampering is carried out by shunting the secondary winding of measuring current transformers and by equipping electronic energy meters with external boards to alter measurements. For low-voltage network users, energy theft is performed by bypassing the energy meter through illicit connections, directly tapping energy from the distribution network, and tampering with energy meters using external magnets to interfere with meter operations.

Types of Energy Theft
The key types of energy theft may encompass meter tampering and malfunctioning, feeder tapping, billing irregularities, and cyber-attacks [16,24–26]. The approaches to designing ETD by energy theft types are shown in Figure 4.

Meter Tampering and Malfunctioning to Achieve Energy Theft
Meter tampering and malfunctioning are characterized by deliberate alteration or damage being inflicted upon electricity meters to reduce recorded energy consumption [24]. The efficiency of meters is compromised by the insertion of strong magnets, leading to incomplete consumption recording. Various methods are employed to manipulate meter readings, including physical interference, such as attaching magnets to meters, slowing mechanical rotation, injecting substances to block the meter mechanism, and altering the internal wiring and components. The flow of all the current through the meter is prevented by shorting the ends, thereby failing to record the full energy consumption. Moreover, the amount of current measured is modified by altering the voltage wires of the electricity meter or the insulation on the secondary side wires.

Feeder Tapping for Energy Theft
Feeder tapping is commonly observed in areas where physical access to distribution lines is relatively uncomplicated and where monitoring is sparse [24]. Illegal or unauthorized connections are made by directly connecting to the transmission line. The illegal connections to utility feeder lines are characterized by individuals bypassing the energy meter. Single-phase usage from a three-phase supply is employed to record zero voltage consumption and null energy consumption. In terms of safety issues, severe risks are posed by feeder tapping not only to the perpetrators themselves, but also to neighbors and utility workers. These risks are exacerbated by the unauthorized and unprotected wiring involved in such connections.

Billing Irregularities for Energy Theft
Billing alterations are made through illicit payments to utility officials, leading to the recording of incorrect meter readings. Billing irregularities are characterized by various malicious activities associated with the manipulation of billing records [27]. These activities often involve collusion with utility employees, where meter readings or account records are altered to reduce charges. In addition, it may include hacking into the utility's billing software to change consumption data or the unauthorized use of another customer's identity to evade payment.

Cyber-Attacks for Energy Theft
With the rise of smart grids, cyber-attacks have become an increasingly prevalent method of energy theft [16,18,21]. These attacks typically involve hacking into the smart grid network to manipulate the data transmitted from smart meters to the utility provider. This includes injecting false data, which is achieved by manipulating the system to bypass existing detection methods, leading to erroneous data measurement. Such illegal activities result in economic losses and compromise the integrity and reliability of a power supply.

Hardware-Based Energy Theft Detection
Recent progress has been made in detecting energy theft by integrating advanced metering infrastructure (AMI) and machine learning algorithms within a smart grid [28,29]. The main goals of these methods are classified into prediction, monitoring, detection and localization, classification, and resolution, with algorithms being applied accordingly [23].
In a hardware-based approach, unusual or unauthorized energy access within an energy metering and distribution system can be monitored and detected using physical devices and sensors [8]. In order to monitor the real-time energy flow throughout a distribution network, the detection process is carried out by utilizing data from AMI, the phasor measurement unit (PMU), and an intelligent electronic device (IED) [30]. Additionally, energy theft is effectively identified and pinpointed by employing the Astar derivative algorithm and geographical information system (GIS) applications. In [31], advanced multi-sensor fusion technologies, which are based on micro-inertial measurement units (MIMUs) and intelligent pipeline inspection gauges (PIGs), were comprehensively reviewed to enhance the detection and localization of potential theft or leakage points. In [32], a novel method for hardware-based ETD, combining clustering and the local outlier factor (LOF), was proposed to effectively identify the energy theft within an AMI system.
However, the limitations of hardware-based methods include high implementation costs, frequent maintenance, and the inability to adapt quickly to new theft techniques [22]. These methods can be limited by their dependency on physical infrastructure. Therefore, data-driven ETD methods are regarded as promising solutions for their adaptability, scalability, and efficiency.

Data-Driven Energy Theft Detection
Data-driven ETD methods, which rely increasingly on AI techniques, have been emphasized as a means of enhancing detection capabilities without the high costs associated with hardware-based approaches [19].
Data-driven ETD methods can be divided into four categories: data mining, state- and network-based, game theory, and machine learning methods [18]. Within state- and network-based methods, AMI systems play a crucial role in real-time monitoring and detection, offering detailed data and patterns on consumption that can be examined for ETD. In data mining methods, support vector machine (SVM), k-nearest neighbor (KNN), neural networks, and clustering algorithms are predominantly utilized to analyze consumption patterns and to identify irregularities suggestive of energy theft [14]. Game theory methods are employed to model consumer behavior as game players, where their decisions impact the utility losses and gains. By analyzing these interactions, utilities can predict and detect malicious behaviors. In order to identify and mitigate cyber-attacks, in which malicious users manipulate smart meter data to misreport higher energy production, machine learning-based ETD models have been reviewed [33]. These models utilize historical solar irradiance, temperature data, and smart meter readings to detect anomalies that indicate potential electricity theft. A pattern-based and context-aware approach for ETD has been proposed by combining dynamic time warping (DTW) and KNN [34]. This method provides a robust ETD framework to detect and manage electricity theft more effectively by considering variations in electricity usage that align with human activities and seasonal changes.
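To make the DTW and KNN combination more concrete, the following minimal sketch classifies a load profile by majority vote among its nearest labeled profiles under the DTW distance. It is an illustration only and not the method of [34]; the function names `dtw_distance` and `knn_predict` and the toy profiles are assumptions introduced here.

```python
from math import inf


def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def knn_predict(train_profiles, train_labels, query, k=3):
    """Majority vote among the k training profiles closest to the query under DTW."""
    ranked = sorted(
        zip(train_profiles, train_labels),
        key=lambda pair: dtw_distance(pair[0], query),
    )
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)


# Toy data (hypothetical): normal users show a daily peak, theft profiles stay flat and low.
normal = [[1, 2, 4, 5, 3, 1], [1, 1, 3, 5, 4, 2], [2, 3, 5, 4, 2, 1]]
theft = [[0, 0, 1, 1, 0, 0], [0, 1, 1, 0, 0, 0], [1, 0, 0, 1, 1, 0]]
profiles = normal + theft
labels = ["normal"] * 3 + ["theft"] * 3

prediction = knn_predict(profiles, labels, [0, 0, 0, 1, 1, 0], k=3)
```

Because DTW allows elastic alignment along the time axis, two profiles whose peaks are shifted by a few intervals can still be treated as similar, which is one reason this distance suits usage that follows human activity.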
In [21], data-driven ETD methods were emphasized in the aim to prevent financial losses and ensure the reliability and safety of energy distribution networks. These methods encompass supervised and semi-supervised learning-based detection techniques. Moreover, a generative AI-based ETD method was proposed to obtain a balanced dataset for improving data generation and classification performances [11,35]. In [35], a time series generative model and a hybrid, multi-time scale neural network model were developed to capture and analyze consumption patterns at different time scales effectively. However, in cyber-attack-based energy theft, the areas of consumption and generation can be distinguished, with each targeting different functions, thus necessitating the design of tailored models [36,37]. Cyber-attack functions applied in the consumption domain aim to reduce the electric charges for malicious customers; meanwhile, in the generation domain, the goal is to supply more energy to the grid. To provide better detection performance against cyber energy theft, data-driven ETD methods that capture complex patterns and temporal correlations within generation profiles, data characteristics, and supervisory control and data acquisition (SCADA) meter readings need to be developed.

Dataset Issues with Data-Driven ETD
The energy theft data used for ETD refer to energy consumption data, which contain data on normal and malicious users. The energy consumption data are time series data measured from smart meters. The energy theft data, measured at various time intervals depending on the requirements of the detection model, can be summarized by four main characteristics.


High-dimensional data: Given that energy theft data may be influenced by the interaction of numerous variables, including time, weather, and consumer behavior, they can exhibit high-dimensional properties [38]. The high-dimensional data can increase the model complexity and the required computational resources. Therefore, research using energy theft datasets should explore approaches to reduce computational complexity while reflecting high-dimensional characteristics.


Imbalanced dataset: The datasets used for ETD are often imbalanced, with instances of energy theft being significantly outnumbered by normal energy usage instances [11]. This imbalance can be ascribed to the challenge of obtaining empirical data on incidents of energy theft and to the relatively brief duration of energy theft compared with benign energy consumption patterns. An imbalanced dataset poses challenges such as biased learning and overfitting for numerous data-driven algorithms that operate under the assumption of an equitable distribution of classes. Therefore, adequate data preprocessing or advanced algorithms may be required to mitigate these challenges.


Inaccurate readings: Energy theft data may include inaccuracies and errors from data collection procedures or malfunctioning meters. These discrepancies have the potential to impact the effectiveness of detection models. In order to mitigate the problems caused by inconsistencies, a preprocessing measure could be required before model training is performed.


Absence of labels: The instances of energy theft may not be accurately labeled, resulting in a label deficiency. This absence of labels can be ascribed to various factors, such as the difficulties in obtaining accurate labels and the time-consuming, costly nature of the labeling process. Such challenges are significant in training supervised machine learning models, which depend on labeled data for learning and making predictions.

Comprehension of these properties may be crucial for developing efficient models to detect energy theft.
Several types of datasets can be used to design effective data-driven ETD. First, the State Grid Corporation of China (SGCC) dataset is the most popular dataset used in energy theft research because it offers detailed electricity consumption data with labeled information on energy theft. The Irish Commission for Energy Regulation (CER) dataset, derived from smart metering trials, provides valuable insights into consumer energy behavior and its impact on energy usage. The Electricity Load Diagrams (ELD) dataset is a time-series dataset commonly used for forecasting tasks, encompassing consumer and prosumer information. The Ausgrid Solar Dataset, gathered by Ausgrid in Australia, focuses on electricity consumption and solar power production, offering valuable insights for energy management tasks. Finally, the Uruguayan Electric Company (UTE) dataset provides detailed household electricity consumption data, and it is useful for research on energy consumption patterns and intelligent energy utilization in smart cities. Table 2 shows a summary of the datasets utilized in the research on energy theft.

Mitigation for Data Imbalance Problem
In order to address the imbalance problem in energy theft data, several techniques have been developed. Resampling methods can be used to create a balanced class distribution. Balance can be achieved through over-sampling, where instances from the minority class are replicated, and under-sampling, where instances from the majority class are removed. The random over-sampling (ROS) technique is a simple and effective method where minority class instances are randomly duplicated to balance the dataset [53]. In addition, the synthetic minority over-sampling technique (SMOTE) has been employed to enhance model generalizability by interpolating between the existing instances to generate more diverse samples [54–58].
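The difference between duplicating and interpolating minority samples can be illustrated with a short sketch. The code below is a simplified, SMOTE-style interpolation written for illustration, not a reference implementation of [53] or [54–58]; the function names, the toy theft samples, and the neighbor count `k` are assumptions.

```python
import random


def random_oversample(minority, target_size, rng):
    """ROS: duplicate randomly chosen minority samples until target_size is reached."""
    extra = [list(rng.choice(minority)) for _ in range(target_size - len(minority))]
    return [list(s) for s in minority] + extra


def smote_like(minority, n_new, k, rng):
    """SMOTE-style sampling: new = x + gap * (neighbor - x), with gap drawn from [0, 1)."""

    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        others = [s for s in minority if s is not x]
        # Restrict interpolation partners to the k nearest minority neighbors of x.
        neighbors = sorted(others, key=lambda s: sq_dist(x, s))[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


rng = random.Random(0)
# Hypothetical minority (theft) feature vectors.
theft_samples = [[0.1, 0.2, 0.1], [0.2, 0.1, 0.3], [0.0, 0.3, 0.2], [0.3, 0.2, 0.0]]

balanced = random_oversample(theft_samples, 10, rng)
synthetic = smote_like(theft_samples, 6, k=2, rng=rng)
```

ROS only repeats existing points, which can encourage overfitting to them, whereas the interpolated samples fall on line segments between neighboring minority points and therefore add some diversity.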
The adaptive synthetic (ADASYN) sampling technique [59], which is a more advanced technique derived from the SMOTE, generates minority class samples to address data imbalance. Unlike the SMOTE, the ADASYN technique can dynamically adjust the sample creation ratio based on the proximity of each minority class sample to the boundary of majority class samples. In [59], the ADASYN technique was used to design an anomaly transformer (AT) model for ETD. In [60], ADASYN-SGWO was used, which is a method that is designed to alleviate the class imbalance problem by using the ADASYN method for oversampling and the stochastic universal sampling-based grey wolf optimizer (SSO-GWO) for under-sampling [61].

Correction for the Inaccurate Readings Problem
In order to resolve the problem of missing values in an energy theft dataset, various interpolation methods have been employed, and they can be categorized into two techniques: linear [62] and polynomial [63,64] methods. Based on the simple linear algorithm, zero and average values can be used to replace missing values to recover the energy consumption data over a period [62]. Polynomial interpolation may be effective when the derivatives between data points tend to follow certain polynomial expressions [63,64]. Interpolation can be beneficial when there are only a few missing values, but it may lead to substantial information losses in scenarios with abundant missing values. A Bayesian ridge regression-based iterative interpolation method, which estimates missing values using a probabilistic approach, has been proposed to mitigate information loss [65]. Furthermore, the relationship between missing values and energy theft has been explored, with missing data being used as a feature to enhance detection performance [66]. However, these conventional methods are limited in scope and do not comprehensively solve all the issues arising from the inaccurate data that come from malfunctioning smart meters. Therefore, advanced methodologies, such as machine learning and deep learning, may be required to address the problems arising from limited storage capacity, disconnected communication, and extreme weather conditions.
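As a rough illustration of the simple replacement strategies mentioned above, the sketch below fills missing readings (represented as `None`) either with the mean of the observed values or by linear interpolation between the nearest observed neighbors. The function names and the example series are assumptions made for this sketch and are not taken from [62].

```python
def fill_with_mean(series):
    """Replace None entries with the mean of the observed readings."""
    observed = [v for v in series if v is not None]
    if not observed:
        raise ValueError("series has no observed readings")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in series]


def fill_linear(series):
    """Linearly interpolate interior gaps; edge gaps copy the nearest observed value."""
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no observed readings")
    filled = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = series[left] + weight * (series[right] - series[left])
    return filled


# Hypothetical daily readings in kWh with three missing entries.
readings = [2.0, None, None, 5.0, 4.0, None]
mean_filled = fill_with_mean(readings)
linear_filled = fill_linear(readings)
```

Both strategies become unreliable when long runs of readings are missing, which is consistent with the information-loss concern raised above.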

Adversary Modeling for Deficiency Problem
Malicious behavior in power grid systems can lead to operational instability, financial losses, safety risks, data integrity issues, and regulatory problems. The economic stability and functionality of energy distribution networks may be vulnerable to such behaviors, making their consideration crucial in modeling ETD. Although data-driven methods for ETD using datasets with malicious data have been proposed, some datasets still lack this type of data. In order to address the deficiency of malicious data, various studies have generated synthetic data using attack models or functions that replicate malicious behavior. The robustness and accuracy of ETD systems can be significantly enhanced by training models with these synthetic datasets.

Energy Theft in Power Generation
In power generation, energy theft is characterized by manipulating the data relevant to power production facilities or distributed energy resources (DERs). Numerous studies have been conducted on methods for injecting theft into benign data or utilizing data injected into datasets. Several malicious behaviors have been proposed to manipulate photovoltaic systems for modeling photovoltaic electricity theft [67]. According to [67], photovoltaic electricity theft can include voltage boosting, current boosting, altering electrical supply connections, and using solar array simulators. In [68], the performance of ETD was evaluated by integrating synthetic anomalies simulating energy theft into a real dataset. Herein, these synthetic anomalies encompass full scaling theft, which elevates all data points, and partial scaling theft, which involves adjusting readings to a specific threshold. The authors in [36] developed cyber-attack functions that manipulate the benign data from distributed generation smart meters to simulate electricity theft by malicious customers. A benign dataset was constructed by simulating an IEEE 123-bus test system using practical load and irradiance data, and malicious data were developed by applying the proposed cyber-attack functions. Considering the expansion of renewable energy in the future, modeling new types of energy theft in power generation will be required.
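The idea of injecting synthetic generation-side theft into benign data can be sketched as follows. This is one possible reading of full and partial scaling, written for illustration: the exact definitions used in [68] are not given here, so the scaling factor, the threshold rule (only readings above the threshold are inflated), and the function names are all assumptions.

```python
def full_scaling(generation, factor):
    """Inflate every reported generation reading by the same factor (> 1)."""
    if factor <= 1.0:
        raise ValueError("factor must exceed 1 to over-report generation")
    return [g * factor for g in generation]


def partial_scaling(generation, factor, threshold):
    """Inflate only the readings above a threshold (assumed interpretation)."""
    if factor <= 1.0:
        raise ValueError("factor must exceed 1 to over-report generation")
    return [g * factor if g > threshold else g for g in generation]


# Hypothetical benign photovoltaic output over six intervals (kW).
benign = [0.0, 1.0, 3.0, 4.0, 2.0, 0.0]

full_attack = full_scaling(benign, 1.5)
partial_attack = partial_scaling(benign, 1.5, threshold=2.5)
```

Pairing each benign profile with its manipulated copy yields labeled examples for training or evaluating a detector when real malicious generation data are unavailable.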

Energy Theft in Power Utility
In power utilities, energy theft can involve manipulating the data associated with the transmission and distribution of electrical power. Investigations have focused on methods that involve injecting supply data-oriented theft into benign data or utilizing data that have been injected into datasets. In [69], false data injection attacks on the state estimation of power grids were proposed. The proposed false data injection attacks demonstrated their capability to generate malicious data in state estimation using a standard IEEE test. The authors in [69] performed model simulations using a synthetic dataset that included malicious data to detect stealthy false data injection attacks in state estimation.
Power utilities can exhibit substantial and variable energy usage that is attributed to their operation as large-scale energy systems. Due to the complexity of power utilities, there is a lack of research and available datasets to design ETD in power utilities. Nevertheless, it is essential to examine the modeling of energy theft from the perspective of power utilities due to the potential for significant financial losses compared to other types of energy theft.

Energy Theft in Energy Consumers
With respect to energy consumers, energy theft typically manifests when consumers or prosumers report less energy consumption than they actually used in order to diminish their financial obligations. There have been studies on methods to inject demand data-oriented theft into benign datasets or to utilize data injected into datasets. Attack models have been proposed to generate malicious data by manipulating smart meter readings in AMI systems [70]. Energy consumption data can be manipulated with the proposed attack models by drastically reducing recorded consumption or changing load profiles. A different attack model has been proposed to simulate malicious behavior such as meter tampering, bypassing, and malfunctioning meters. According to [71], synthetic data can be employed as a dataset for testing an intermediate monitor meter (IMM)-based power distribution network model. In [72], six theft generation functions were proposed to generate real-time malicious data and to evaluate the performance of a gradient-boosting-based energy theft detector using these synthetic data.
The majority of ETD models have been examined from the viewpoint of energy consumers. Nonetheless, it is anticipated that different forms of intelligent energy theft will emerge in the future, making traditional energy theft modeling approaches inadequate for modeling precise ETD.
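A consumer-side attack model of the kind described above can be sketched with two simple manipulations: reducing all recorded consumption by a constant fraction, and reporting zero during a chosen window, which alters the load profile. These are hypothetical functions written for illustration and are not the specific attack models of [70] or the six functions of [72].

```python
def scale_down(readings, alpha):
    """Report only a fraction alpha (0 <= alpha <= 1) of the true consumption."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [r * alpha for r in readings]


def zero_window(readings, start, end):
    """Report zero consumption for interval indices start <= i < end."""
    return [0.0 if start <= i < end else r for i, r in enumerate(readings)]


def make_labeled_dataset(benign_profiles, attack):
    """Pair each benign profile (label 0) with an attacked copy (label 1)."""
    data = []
    for profile in benign_profiles:
        data.append((list(profile), 0))
        data.append((attack(profile), 1))
    return data


# Hypothetical benign load profiles (kWh per interval).
benign_profiles = [[2.0, 4.0, 6.0, 4.0], [1.0, 3.0, 5.0, 3.0]]
dataset = make_labeled_dataset(benign_profiles, lambda p: scale_down(p, 0.5))
windowed = zero_window(benign_profiles[0], 1, 3)
```

The resulting list interleaves benign and manipulated profiles with binary labels, which is the form a supervised detector would typically consume.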

Methodologies for Implementing Data-Driven ETD
This section introduces several methods for implementing ETD, including supervised learning-based, semi-supervised learning-based, and generative AI-based ETD approaches to improve data efficiency and model performance.

Supervised Learning-Based Approaches for ETD
Various methodologies have been proposed for ETD based on supervised learning [8,29,39,43,73–76]. When implementing extreme learning machines, support vector machine (SVM)-based models have achieved 70% accuracy [29]. To overcome the problem of numerous false positives with SVM models, a novel method combining SVMs with a decision tree (DT) has been proposed [76]. While detection accuracy has been substantially enhanced by integrating these two methods, the DT is prone to overfit specific patterns, which may reduce its effectiveness in identifying previously unseen attacks.
Deep learning methods have been employed as supervised learning approaches to enhance performance against unseen attacks compared with manually engineered feature extraction [8,40,43,73–75]. In [8,43], a structure combining multiple layers of convolutional neural networks (CNNs) with fully connected layers was employed to extract features from energy theft data, thereby achieving higher accuracy than traditional machine learning methods. Several models have been proposed by employing recurrent neural network (RNN) structures to extract temporal data and enhance classification performance [74,75]. Hyperparameter optimization with RNNs has been demonstrated to offer superior classification performance compared to conventional SVM models [74]. Subsequent research [75] has shown that efficient time series feature extraction and ETD can be achieved using the gated recurrent unit (GRU).
Methods have been introduced that combine the structures of CNNs and RNNs to improve classification performance by combining their separate feature extractions [39,73]. A hybrid deep learning model that cascades a CNN with an RNN-based long short-term memory (LSTM) network was shown to enhance feature extraction compared with using a CNN or an LSTM independently [39]. However, this architecture may have an inherent limitation in feature extraction capabilities as it relies on transferring features from the CNN to the LSTM. The ConvLSTM architecture has been introduced to address this limitation by replacing the matrix multiplications inside the LSTM with convolution operations. This results in a model that efficiently captures cyclical patterns and extracts local features.
Figure 5 depicts the representative structures of supervised learning methods for ETD. Figure 5a shows the CNN process with two separate convolutional layers to extract spatial features. In Figure 5b, the capability of RNN for processing data sequences is described by incorporating the contextual information from previous inputs. The processing for CNN with LSTM to extract integrated features is described in Figure 5c.

Semi-Supervised Learning-Based Approaches for ETD
As mentioned in the previous subsection, several data-driven approaches for ETD have been proposed based on supervised learning. However, the uneven distribution of labeled data in ETD can degrade model performance through biased learning. As alternatives, unsupervised learning methods have been proposed [77,78]. These conventional approaches identify abnormal behavior by using clustering techniques in combination with the maximum information coefficient (MIC) [77] or by using density-based spatial clustering [78]. However, conventional unsupervised learning approaches may struggle with high-dimensional, noisy data. Consequently, semi-supervised learning approaches have been developed to address the limitations of both supervised and unsupervised methods.
Semi-supervised learning methods leverage the benefits of both supervised and unsupervised learning for efficient ETD. A transductive SVM (TSVM) method [79] has been utilized as a semi-supervised learning method for ETD, but it may encounter scaling challenges when confronted with large volumes of data. In deep learning-based semi-supervised learning for ETD, two primary approaches can be employed: (1) augmenting data by assigning pseudo-labels [80,81] and (2) integrating supervised and unsupervised learning [82,83]. In the first approach, a model assigns pseudo-labels to unlabeled data, reducing overfitting and improving model generalization [80,81]. In the second approach, networks trained on unsupervised and supervised learning tasks are used to reconstruct load profiles and differentiate between classes [82,83].
The structures of semi-supervised learning methods are described in Figure 6. Figure 6a depicts a framework in which a trained teacher model predicts labels for unlabeled data, generating pseudo-labels that augment the data for ETD; the framework then trains the classifier on these augmented data. Figure 6b describes a model integration approach that combines unsupervised learning using an autoencoder, similarity learning through a Siamese network, and supervised learning using labeled data.
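The teacher-based pseudo-labeling idea can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the method of [80,81]: the teacher is a simple nearest-centroid classifier, the confidence score is a softmax over negative distances, and the threshold of 0.7 and all function names are hypothetical choices.

```python
import math


def centroid(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fit_centroids(features, labels):
    """Nearest-centroid model: one mean vector per class."""
    classes = sorted(set(labels))
    return {c: centroid([f for f, y in zip(features, labels) if y == c])
            for c in classes}


def predict_with_confidence(model, x):
    """Return (label, confidence) using a softmax over negative distances."""
    scores = {c: math.exp(-distance(x, mu)) for c, mu in model.items()}
    label = max(scores, key=scores.get)
    return label, scores[label] / sum(scores.values())


def pseudo_label(teacher, unlabeled, threshold=0.7):
    """Keep only the unlabeled samples that the teacher labels confidently."""
    accepted = []
    for x in unlabeled:
        label, confidence = predict_with_confidence(teacher, x)
        if confidence >= threshold:
            accepted.append((x, label))
    return accepted


# Toy data (assumed): two features per customer, 0 = normal, 1 = theft.
labeled_x = [[1.0, 1.1], [0.9, 1.0], [0.1, 0.2], [0.2, 0.1]]
labeled_y = [0, 0, 1, 1]
unlabeled_x = [[1.0, 1.0], [0.15, 0.15], [0.55, 0.6]]

teacher = fit_centroids(labeled_x, labeled_y)
extra = pseudo_label(teacher, unlabeled_x)

# The student is trained on the labeled data plus the pseudo-labeled samples.
student = fit_centroids(labeled_x + [x for x, _ in extra],
                        labeled_y + [y for _, y in extra])
```

The ambiguous third sample lies between the two centroids and is rejected, which illustrates why a confidence threshold is commonly used: low-confidence pseudo-labels would otherwise inject label noise into the augmented training set.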

Generative AI-Based Approaches for ETD
While semi-supervised approaches have shown improvements on specific datasets, their effectiveness in practical systems may be compromised by complex theft methods that are not represented in the training data. In addition, when a specific label is extremely scarce, the correlation in the augmented data may become excessively high, requiring further adjustments to enhance model performance. Generative AI-based approaches can be used to improve model generalization. However, their application in ETD has been relatively limited [12]. These techniques can generate new data by analyzing patterns in the existing data, and they may effectively address unseen attacks when applied to ETD.
Generative AI methods for ETD can be classified into probabilistic, direct distribution approximation, and diffusion-based methods. Probabilistic methods, such as the variational autoencoder (VAE) [11,12], can extract information from a latent space within the data. VAEs have been utilized as a data augmentation method in ETD [11,12] since they offer more reliable training than GANs. In [12], a conditional VAE (CVAE) was applied, demonstrating the capability to generate samples resembling the original power curve from only a few samples, without assuming a probability distribution of the power curve. Furthermore, it has been confirmed that combining VAEs and GANs [11] can improve classification accuracy compared to applying VAE- or GAN-based methods separately.
Unlike VAEs, which assume a prior distribution such as a standard Gaussian distribution, direct distribution approximation techniques leverage generative adversarial networks (GANs) to produce new data. A cooperative training GAN (CT-GAN) method has been proposed to address the challenge of obtaining labeled data for ETD [13]. The CT-GAN method can enhance training stability by training two discriminators. This approach improves the generation of labeled sample data and increases the accuracy of semi-supervised classification. According to simulation results [13], CT-GAN substantially enhances generalization capabilities.
Diffusion-based methods [4] focus on learning complex data distributions and generating new data by progressively reducing noise. The diffusion method has been proposed to capture intricate patterns in data sequences by first introducing noise to the data samples [4]. The reverse process is then employed to gradually eliminate the noise and generate new samples, permitting the neural network to acquire information about the data distribution. It has been demonstrated that diffusion-based detectors can outperform alternative methods, such as LSTM- or AE-based methods, in identifying diverse user patterns.
Generative AI-based ETD can be considered a promising solution for effectively addressing the 5 Vs of big data, but it presents several challenges. Generative AI models can be trained on large volumes of data to detect specific patterns that are indicative of energy theft. However, managing and processing such vast amounts of data requires significant computational resources and efficient data-handling techniques to avoid performance bottlenecks. Additionally, when various types of data are available, such as time-series data, customer profiles, and transaction records, a generative AI model can learn comprehensively from this diverse information. Nonetheless, the integration and normalization of heterogeneous data sources can be challenging and may require advanced data fusion to ensure compatibility and reliability. The data from smart grids may be noisy or incomplete, posing a challenge to maintaining data integrity. Therefore, advanced data processing and validation techniques are necessary to mitigate the impact of poor data quality on detection performance. The goal of processing big data in generative AI-based ETD is to extract valuable insights that can lead to actionable outcomes. Generative AI-based ETD models with high-value data can uncover hidden patterns and anomalies that conventional methods might miss. In Table 3, the 5 Vs of big data in generative AI-based ETD are summarized.
The structures of generative AI methods for ETD are detailed in Figure 7. According to Figure 7a, the data generated by the VAE model can be used to improve the learning performance of a separate ETD classifier module. Conversely, Figure 7b,c show that the classifier is trained concurrently with the generative model in the GAN and diffusion-based structures. Additionally, LSTM algorithms are employed to extract temporal features from energy theft data and improve the performance of denoising diffusion probabilistic models (DDPM).
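For reference, the standard training objective of a conditional VAE can be written as the evidence lower bound below. This is the textbook formulation and not necessarily the exact loss used in [12]; the symbols are assumptions introduced here for illustration: $x$ is a load profile, $y$ its class label, $z$ the latent variable, $q_\phi$ the encoder, $p_\theta$ the decoder, and $p(z \mid y)$ the prior (often taken as a standard Gaussian).

```latex
\mathcal{L}(\theta, \phi; x, y)
  = \mathbb{E}_{q_\phi(z \mid x, y)}\bigl[\log p_\theta(x \mid z, y)\bigr]
  - D_{\mathrm{KL}}\bigl(q_\phi(z \mid x, y) \,\|\, p(z \mid y)\bigr)
```

The first term rewards accurate reconstruction of the load profile, while the second keeps the encoder's latent distribution close to the prior, so that new profiles of a chosen class can be generated by sampling $z$ from the prior and decoding it together with $y$.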
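The forward (noising) half of a diffusion model can be sketched briefly. The example below assumes the commonly used DDPM formulation with a linear noise schedule, where a noisy sample is drawn in closed form as x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps. The schedule endpoints, step count, and function names are illustrative assumptions and are not taken from [4].

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Noise variances that grow linearly from beta_start to beta_end."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha(betas):
    """a_bar_t: running product of (1 - beta_s) for s up to t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_sample(x0, t, alpha_bars, rng):
    """Draw x_t directly from x_0 and return it together with the noise used."""
    a_bar = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a_bar) * v + math.sqrt(1.0 - a_bar) * e
           for v, e in zip(x0, eps)]
    return x_t, eps


# Hypothetical usage: a short, normalized load profile noised at two steps.
rng = random.Random(0)
alpha_bars = cumulative_alpha(linear_beta_schedule(1000))
profile = [0.3, 0.5, 0.4, 0.8, 0.6]
slightly_noisy, _ = noise_sample(profile, 0, alpha_bars, rng)
almost_pure_noise, _ = noise_sample(profile, 999, alpha_bars, rng)
```

In a full model, a neural network would be trained to predict `eps` from `x_t` and `t`, and the reverse process would use that prediction to remove noise step by step; that part is omitted here.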
Table 4 summarizes the background and limitations of the proposed models, with a focus on methods for ETD model design.

Open Issues and Future Research Directions
Given the complex nature of energy theft datasets, current supervised and semi-supervised learning-based models face limitations in effectively analyzing such data characteristics. In order to address this challenge, recent research has explored the potential of generative AI-based ETD. Several significant advantages are expected from the implementation of generative AI-based ETD. Firstly, generative AI-based ETD can analyze large volumes of data. This ability allows generative AI to detect anomalies in energy consumption patterns and identify instances of energy theft or illegal energy acquisition. Secondly, the insights provided by generative AI into energy consumption patterns can enhance energy management systems, facilitating the optimization of energy usage and minimizing energy loss. These capabilities are instrumental in detecting and preventing energy theft effectively. Lastly, generative AI-based ETD contributes to improved system reliability by mitigating the risk of system failures, reducing downtime, and ensuring a reliable power supply for customers. Despite these advantages, generative AI-based ETD is still in the early stages of development. Therefore, open issues and research directions focusing on generative AI in ETD are discussed in this section.

Handling Imbalanced Data
By generating synthetic instances of the minority class, generative AI models such as GANs and VAEs have shown promise in handling imbalanced datasets. However, challenges remain because the generated instances cannot completely replace real instances. For instance, ensuring the quality and diversity of the generated samples and avoiding overfitting to the minority class are critical issues. Furthermore, combining real and synthetic samples in various proportions may be advantageous for enhancing the diversity of the training samples and the performance of the detectors. Therefore, future research could focus on developing novel generative AI models or improving existing ones to generate synthetic samples and efficiently handle imbalanced data in ETD.
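The idea of mixing real and synthetic minority samples in a chosen proportion can be sketched as follows. This is a minimal illustration under stated assumptions: a simple interpolation between real theft samples (in the spirit of SMOTE) stands in for a trained generative model such as a GAN or VAE, and the function names and the `synthetic_share` parameter are hypothetical.

```python
import random


def interpolate_synthetic(minority, count, rng):
    """Stand-in generator: interpolate between two random real minority samples.

    A trained generative model would replace this function in practice.
    Requires at least two real minority samples.
    """
    synthetic = []
    for _ in range(count):
        a, b = rng.sample(minority, 2)
        lam = rng.random()
        synthetic.append([x + lam * (y - x) for x, y in zip(a, b)])
    return synthetic


def mix_minority(real_theft, synthetic_share, rng):
    """Return theft samples in which `synthetic_share` of the total is synthetic.

    synthetic_share must lie in [0, 1); 0 keeps only the real samples.
    """
    if not 0.0 <= synthetic_share < 1.0:
        raise ValueError("synthetic_share must be in [0, 1)")
    n_real = len(real_theft)
    n_synthetic = round(n_real * synthetic_share / (1.0 - synthetic_share))
    return real_theft + interpolate_synthetic(real_theft, n_synthetic, rng)


# Hypothetical usage: compare training sets with different synthetic shares.
rng = random.Random(0)
real_theft = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.3], [0.05, 0.25]]
for share in (0.0, 0.5, 0.75):
    theft_samples = mix_minority(real_theft, share, rng)
    print(share, len(theft_samples))
```

Sweeping `synthetic_share` and evaluating the detector on held-out real data is one simple way to examine how much synthetic data helps before it starts to dominate the minority class.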

Incorporating Time-Series Analysis with Data Features
Generative AI has been employed in time-series analysis over the past few years to capture temporal patterns in time-series data. However, accurately modeling and generating time-series data remains challenging due to its high dimensionality, noise, and complex temporal dependencies. Transformers with self-supervised representation learning ability are promising candidate models for such challenging tasks. Future research could explore advanced generative AI using transformers for time-series data, as well as their application in ETD.

Dealing with High-Dimensional Data
Deep learning-based generative AI models have recently exhibited the capacity to handle high-dimensional data. However, training these models can be computationally expensive and may require large amounts of data. Future work could investigate efficient training methods for generative AI models in ETD, such as meta learning and transfer learning [85], and dimensionality reduction techniques, such as singular value decomposition and diffusion maps [86].

Addressing Noise and Errors
Generative AI models can be utilized to remove noise from data and rectify errors. Generative AI employing unsupervised or semi-supervised learning may denoise energy theft data and correct its errors. However, ensuring that denoising or error correction does not discard important information is challenging. Future research should aim to develop robust generative AI that can handle noise and errors in energy theft data by using unsupervised and semi-supervised learning approaches.

Exploiting Characteristic Variables
Generative AI models can potentially learn and generate characteristic variables of energy consumption records. However, ensuring that these generated variables are meaningful and useful for ETD is challenging. Information bottleneck-based approaches are a promising way to overcome this challenge since they can enhance domain generalization and improve the performance of generative AI. Future work could focus on exploiting these variables by combining generative AI models with an information bottleneck-based approach for more accurate ETD.

Overcoming a Lack of Labels
Non-adversarial generative AI, particularly diffusion models, has recently gained significant interest. Diffusion models typically involve a forward process that gradually corrupts the input by adding noise and a reverse process that reconstructs it step by step in order to learn the distribution of the latent representation of the input. These models can be more stable, and they can model small datasets more effectively. Diffusion models can therefore be employed to overcome the lack of labels in ETD. However, ensuring that these models can effectively learn from unlabeled data and make accurate predictions is a challenge. Research on ETD models that exploit the properties of non-adversarial generative AI could be an area of future work for generating reliable datasets in ETD.

Integration of Energy Consumption and Multimodal Data
The future of generative AI models for detecting energy theft will likely involve an important shift toward using several types of data sources. Generative AI models can enhance the comprehension of complex patterns by integrating energy consumption data with textual, auditory, visual, and sensor-based information. Besides smart meter and energy consumption data, other types of data, such as climatic data and metadata, may be exploited in detecting energy theft. Climatic data encompass variables such as temperature, wind speed, and humidity, which can be provided in textual, audio, or visual formats. Metadata may provide hints by indicating the device type, customer features, and regional characteristics. By adopting a multimodal approach, ETD methods can improve their ability to identify subtle anomalies and forecast instances accurately. This capability can be achieved by utilizing the combined strengths of diverse data sources, which also automates the validation process. Future research should aim to improve the performance of ETD by integrating energy consumption values with diverse data sources.

Large Language Models for ETD
Large language models (LLMs), an advancement of generative AI models, have recently attracted attention in a variety of fields due to their significant potential for analyzing comprehensive datasets to find patterns, forecast future events, and detect abnormal behavior across different domains [87]. For example, anomaly detection systems can discover uncommon access patterns that could indicate a cybersecurity breach. Likewise, in ETD, LLMs may identify energy thieves in energy systems. Future work could explore applying LLMs in ETD to achieve optimal performance.

Conclusions
In this paper, a comprehensive review of data-driven approaches for ETD is presented, covering datasets, preprocessing, adversary modeling, and detection algorithms. The review also underscores the need for data-driven ETD analysis in terms of the 5 Vs of big data: velocity, volume, value, variety, and veracity. In this regard, the limitations of energy theft datasets and the previous studies that attempt to overcome them were analyzed and systematically organized. Then, various detection methods, including supervised learning, semi-supervised learning, and generative AI-based approaches, were analyzed with a view to implementing effective ETD models. The limitations of existing data-driven ETD, such as supervised learning and semi-supervised learning, were analyzed, and the potential of generative AI to revolutionize ETD was highlighted. Finally, this paper suggests future research directions for applying generative AI to addressing imbalanced data, incorporating time-series analysis with data features, dealing with high-dimensional data, addressing noise and errors, exploiting characteristic variables, overcoming the lack of labels, utilizing LLMs for ETD, and integrating energy consumption data with multimodal data.

Figure 1. Overall design process of data-driven ETD.

Figure 2. Number of publications per (a) journal/conference and (b) year.

Figure 3. Overview and categorization of data-driven approaches for ETD.

Figure 4. Approaches for designing ETD categorized by energy theft types.

Table 1. Comparison of data-driven approaches for ETD with related survey papers.

Table 2. Overview of the dataset used in the field of energy theft.

Table 3. The 5 Vs of big data, and the issues of generative AI-based ETD.

Table 4. The main existing works using data-driven ETD approaches.