1. Introduction
Driven by the unabated global demand for electrical energy, conventional power grid infrastructures have undergone continual upgrades to accommodate changing requirements, such as the high penetration of renewable energy. The emergence of smart electrical power grid technologies has delivered the desired enhancements, offering improved efficiency, reliability, and sustainability. Such far-reaching transformations invariably present challenges; a primary issue facing smart grids is the detection and management of faults. Intrinsic equipment malfunctions and failures, together with stealthy cyber-attacks, pose significant risks to the uninterrupted operability of smart grids, calling for grids equipped with robust fault detection to guarantee uninterrupted service. The complex nature of smart grids, characterized by the integration of diverse energy sources, advanced communication networks, and intelligent control systems, makes fault detection especially challenging. In contrast to traditional power grids, smart grids rely on the continuous exchange of real-time data processed by advanced algorithms to optimize energy flow, track the health of the system, and react to dynamic shifts in energy supply and demand. Due to this increased complexity, fault occurrences have a greater impact and must be detected and mitigated quickly to prevent widespread disruptions.
Traditional fault detection methods rely on large labeled datasets, rule-based thresholds, and predefined fault models. However, particularly in a rapidly changing grid landscape, these approaches frequently lack adaptability to evolving fault conditions. Furthermore, the efficacy of classic machine learning (ML) algorithms in identifying novel fault states is limited by their reliance on historical data. Modern power systems are becoming increasingly complicated, necessitating a move towards more sophisticated fault detection techniques that make use of state-of-the-art artificial intelligence (AI) methodology. Deep learning-based architectures, especially autoencoders, have shown promise in addressing these constraints. Because autoencoders are unsupervised models, they learn the underlying data representations instead of relying only on labeled fault cases, which gives them stronger generalization capabilities.
Furthermore, the incorporation of Generative AI for the creation of synthetic data has shown promise in augmenting scarce real-world datasets, resolving issues associated with data scarcity, and improving model resilience. Generative AI can enhance fault detection model training and testing by producing high-quality synthetic datasets, guaranteeing the models’ effectiveness in a variety of unexpected and varied fault scenarios.
This study investigates a two-pronged strategy for the detection of smart grid faults. First, a thorough comparison of several machine learning models, both supervised and unsupervised, is performed to evaluate how well they identify symmetrical and unsymmetrical faults. Among them, a fault detection system based on autoencoders is presented that exhibits enhanced precision in detecting unseen faults. Second, to assess the effect of synthetic data on model performance, a conventional dataset (D1) is contrasted with a synthetic dataset (D2) produced using generative artificial intelligence. To improve smart grid resilience and ultimately arrive at a more robust and adaptable fault detection framework, the study emphasizes the value of sophisticated machine learning architectures and high-quality synthetic datasets.
For smart grid systems to remain operationally reliable and sustainable, fault detection and categorization are essential. Traditional fault detection techniques frequently lack the accuracy and flexibility needed for contemporary power networks. Fault detection, classification, and mitigation techniques have greatly improved with the integration of machine learning (ML) and artificial intelligence (AI) models. With an emphasis on machine learning, deep learning, and generative AI approaches, this review of the literature examines current approaches and developments in fault detection within smart grids.
Generation, transmission, and distribution are among the operational scenarios in which smart grids are susceptible to a variety of problems. Ref. [
1] offers a thorough analysis of smart grid failure types and characterization, including advanced metering infrastructure (AMI), communication systems, cyberattack detection, and real-time monitoring. Economic issues that affect consumers and service providers must be taken into consideration by fault detection systems.
ML-based methods for smart grid fault detection have been the subject of numerous investigations. Ref. [
2] suggested a Recurrent Neural Network (RNN) model that uses voltage and current measurements to identify arc and pole-to-pole faults in electric vehicle (EV) charging systems. For validation, accuracy and F1-score were used. In a similar vein, ref. [
3] addressed consumer resistance to the deployment of smart meters, highlighting the function of monitoring and security applications in smart grids.
Concerns about voltage instability have grown as EVs and photovoltaic (PV) systems are integrated into power grids. Ref. [
4] proposed integrating smart meters as a way to address undetected voltage variations. In [
5], a 9-bus distribution system using the Fault Detection, Isolation, and Restoration (FDIR) technique was used to illustrate the self-healing potential of smart grids. Ref. [
6] explored how plug-in hybrid EVs affected grid stability and suggested coordinated charging to lessen disruptions. Rapid charging and vehicle-to-grid (V2G) capabilities were also examined in the study. Furthermore, ref. [
7] improved grid protection, especially in microgrid settings, by introducing a revolutionary anti-islanding strategy based on Support Vector Machines (SVM).
In microgrids, traditional protection strategies frequently fail, requiring ML-based alternatives. Ref. [
8] introduced a hybrid machine learning technique that combines Gaussian regression for localization and prediction with SVM for fault identification. The model was put to the test in conditions of load fluctuation and distributed generation (DG) penetration. By utilizing networked intelligent agents to maximize communication, control, and protection, multi-agent systems (MASs) have been investigated as a means of enhancing smart grid operations. The effectiveness of MASs in controlling power system functions, such as transmission switching, relaying, and plant control, is demonstrated by a study [
9] examining MAS-based smart grid control. However, while an MAS enhances decision-making in smart grids, its direct application to real-time fault detection and classification remains an area for further exploration.
The emergence of sensor faults in power systems demands prompt identification. Ref. [
10] utilized the Unknown Input Observer to ensure accuracy in fault detection under uncertain conditions, including renewable energy fluctuations and load variations. Ref. [
11] examined fault detection via the JADE platform using a multi-agent architecture, integrating EV batteries with PV systems for effective restoration.
Artificial Neural Networks (ANNs) have been widely employed for fault detection, classification, and location identification. Ref. [
12] utilized ANNs for detecting transmission line faults, leveraging their ability to handle nonlinear system volatilities. A model incorporating Empirical Wavelet Transform (EWT) and cyclic entropy for fault detection in EV inverters was proposed in [
13], effectively mitigating non-Gaussian white noise effects. A fault detection model utilizing Matching Pursuit Decomposition (MPD) and Hidden Markov Models (HMM) was presented in [
14]. The study demonstrated superior accuracy by grouping voltage and frequency characteristics through multiple algorithms. Similarly, ref. [
15] employed discrete wavelet transform to classify transmission line faults within a Simulink simulation environment.
Sparse autoencoders have been leveraged for overvoltage detection and classification, as demonstrated in [
16], which achieved feature extraction and dimensionality reduction without manual feature engineering. Ref. [
17] introduced an Isolation Forest-based anomaly detection scheme for smart grids, focusing on anomalies in current, voltage, and power consumption using real-world data. K-nearest neighbor (KNN) algorithms were employed for fault location in [
18], reducing errors induced by multiple factors and improving localization precision. Ref. [
19] compared the performance of three ML algorithms for fault detection, whereas [
20] proposed an Extreme Learning Machine (ELM)-based approach for fault detection in extensive EV charging stations, validated through Simulink simulations. As discussed in the Introduction, the complexity of smart grids, with their diverse energy sources, communication networks, and real-time data exchange, makes robust condition monitoring, fault detection and identification, and intelligent system frameworks indispensable to guarantee reliable operation [
21]. Foundational studies in mechanical, structural, and energy domains demonstrate how intelligent monitoring approaches can significantly enhance resilience and predictive capabilities, providing a strong basis for their application in smart grid environments [
22].
A comparative overview of recent studies is provided in
Table 1, which summarizes the key contributions and limitations of various approaches to smart grid fault detection and control. As shown in the table, traditional methods (e.g., threshold- or rule-based) often lack adaptability under diverse operating conditions, while machine learning-based techniques such as SVM, Random Forest, and Autoencoders demonstrate higher flexibility and accuracy. Moreover, recent works emphasize the role of synthetic data generation (e.g., via GANs) to overcome data scarcity, highlighting the importance of integrating generative models into the fault detection framework.
Traditional ML models often require high-quality training data, which is expensive and labor-intensive to obtain. Generative Adversarial Networks (GANs) have emerged as a promising solution for synthetic dataset generation. Ref. [
24] demonstrated the efficacy of GANs in producing synthetic load profiles that closely resemble empirical data, enhancing the accuracy of fault detection. Combining synthetic datasets with empirical data can markedly reduce electricity-consumption forecasting errors and strengthen risk management assessments. To explore this possibility, a GAN-based model that takes individual electricity consumption data as input can be used to generate synthetic data. We study one-dimensional time series, and numerical tests on an empirical dataset verify that GANs can produce synthetic data that closely resembles the real data.
To improve fault detection in power systems and smart grids, this study investigated the application of both conventional and generative techniques. Using two different datasets—one created using conventional techniques (D1) and the other using generative artificial intelligence (D2)—a comparison of supervised and unsupervised machine learning models was conducted. The study emphasizes how generative AI can produce artificial data that closely mimics real conditions, improving the precision and dependability of fault identification. We also investigate the role of autoencoders as unsupervised models. This work demonstrates how GAN-generated synthetic datasets can improve ML-based load models, increasing the robustness of smart grid applications in identifying random and unseen failures.
This work offers the following novel contributions.
Two-Dataset Method: For a thorough evaluation of fault detection performance, the research uses both synthetic datasets produced using generative artificial intelligence (D2) and traditionally maintained datasets (D1).
Superiority of the Autoencoder: a thorough assessment of the Autoencoder's capacity, as an unsupervised model, to identify faults outside its training dataset, hence improving the resilience of the smart grid.
This study makes contributions both to theory and practice. From a theoretical standpoint, it advances the integration of supervised and unsupervised machine learning approaches with generative models, demonstrating how synthetic data can overcome limitations of conventional datasets and improve generalization to unseen fault conditions. From a practical perspective, the proposed framework validates the effectiveness of GAN-generated data for smart grid fault detection, providing a scalable and adaptable solution that enhances grid resilience under diverse operating conditions. Together, these contributions highlight the dual value of the work: extending the methodological foundations of AI-based fault detection while offering actionable insights for real-world power system applications.
The overall structure of the paper is outlined as follows.
Section 2 presents the system model and methodology, including the simulation setup, dataset preparation, and machine learning models considered.
Section 3 describes the data collection from the smart distribution grid and the generation of synthetic datasets.
Section 4 provides the comparative analysis of supervised and unsupervised learning models under both conventional and random fault conditions.
Section 5 introduces the GAN-based framework for synthetic data generation and compares classifier performance using real and synthetic datasets. Finally,
Section 6 summarizes the key findings, discusses their implications for smart grid resilience, and highlights directions for future research.
2. System Model and Methodology
The methodology adopted in this research integrates detailed power system simulations with advanced machine learning and generative modeling techniques. The overall approach includes (i) defining the research problem and objectives, (ii) constructing and simulating the grid model, (iii) preparing both real and synthetic datasets, (iv) developing and tuning fault detection models, and (v) evaluating the models through multiple performance indicators.
The central research problem is the reliable detection of diverse fault types in distribution grids that integrate renewable resources. Traditional protection schemes are limited in adaptability under stochastic operating conditions, motivating the use of data-driven methods. The objectives of this work are (a) to benchmark conventional and AI-based fault detection methods, (b) to assess the contribution of synthetic data generated through Generative Adversarial Networks (GANs), and (c) to evaluate detection accuracy across operating conditions with variable distributed generation (DG).
The study is conducted on the IEEE 9-bus test system, implemented in MATLAB/Simulink. The network consists of multiple feeders and distributed generators modeled as constant power sources. To capture operational diversity, scenarios with varying levels of photovoltaic and wind generation are simulated. Faults considered include single line-to-ground, line-to-line, double line-to-ground, and three-phase faults. Each case records bus voltages, currents, and relay signals for subsequent feature extraction.
The dataset combines simulated fault cases with synthetic samples generated using GANs. The synthetic augmentation ensures balanced class distribution and addresses the limited availability of rare fault types. All features are normalized using min–max scaling. The dataset is divided into training and testing subsets using an 80:20 split, with five-fold cross-validation applied on the training set to enhance generalization.
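A minimal sketch of this preparation pipeline in Python with scikit-learn is shown below; the variable names and the use of a stratified split are assumptions, since the paper does not specify them.

```python
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import MinMaxScaler

def prepare_data(X, y, seed=42):
    """Min-max scale the features and create the 80:20 train/test split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=seed)
    scaler = MinMaxScaler()
    X_train = scaler.fit_transform(X_train)  # fit scaling on training data only
    X_test = scaler.transform(X_test)        # reuse the same scaling for testing
    return X_train, X_test, y_train, y_test

# Five-fold cross-validation applied on the training set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```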
Four machine learning approaches are developed and tuned: Support Vector Machine (SVM with RBF kernel), Random Forest (200 trees, depth optimized by grid search), K-Nearest Neighbors (k = 5), and an autoencoder-based anomaly detector. Hyperparameters such as kernel width, learning rate, and maximum depth are optimized through grid search. The models are implemented in Python using the Scikit-Learn and TensorFlow libraries.
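The following sketch illustrates how these models and the grid search could be set up in scikit-learn; the candidate hyperparameter grids are illustrative assumptions, as the paper reports only the selected values.

```python
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

models = {
    # RBF-kernel SVM with C and gamma (kernel width) tuned by grid search
    "svm": GridSearchCV(SVC(kernel="rbf"),
                        {"C": [1, 10, 100], "gamma": ["scale", 0.1, 0.01]}, cv=5),
    # Random Forest with 200 trees; depth optimized by grid search
    "rf": GridSearchCV(RandomForestClassifier(n_estimators=200),
                       {"max_depth": [5, 10, 20, None]}, cv=5),
    # K-Nearest Neighbors with k = 5
    "knn": KNeighborsClassifier(n_neighbors=5),
}
# The autoencoder-based anomaly detector is built in TensorFlow/Keras
# (see the sketch in Section 4.1.4).
```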
The models are evaluated using accuracy, precision, recall, F1-score, and the Area Under the ROC Curve (AUC). Confusion matrices are analyzed to identify misclassification patterns among different fault types. This ensures both overall performance and class-wise robustness are quantified.
The complete research pipeline is summarized in
Figure 1. It illustrates the process from grid simulation, through synthetic data generation and preprocessing, to model training and evaluation.
The proposed methodology integrates system-level simulations, synthetic data augmentation, and advanced machine learning models to enable robust fault detection in a 9-bus distribution network. By simulating diverse fault scenarios and enriching the dataset with GAN-generated samples, the approach ensures adequate coverage of both common and rare operating conditions. The application of supervised and unsupervised learning models, coupled with rigorous training–testing protocols and cross-validation, provides a transparent and reproducible framework. This methodological design establishes a solid foundation for the comparative analysis presented in the following sections, enabling a fair evaluation of model accuracy, generalization capability, and resilience under stochastic distributed generation conditions.
4. Comparative Analysis
Four different machine learning algorithms—KNN, SVM, Random Forest, and Autoencoder—were applied in a case study on the dataset. The dataset we generated through simulation was used to train the machine learning models, and each model was examined. Accuracy score, loss, precision, recall, F1-score, the Receiver Operating Characteristic (ROC) curve, and the Area Under the Curve (AUC) were the primary metrics used to analyze each algorithm. The AUC denotes the area under the ROC curve; a value of 0.5 corresponds to random guessing, and values approaching 1 indicate strong discrimination. The ROC curve is constructed from the True Positive Rate (TPR) and the False Positive Rate (FPR), which are defined below. To identify the best algorithm for fault detection, the advantages and disadvantages of each algorithm were examined.
True Positive Rate is given as:

$$\mathrm{TPR} = \frac{TP}{TP + FN}$$

False Positive Rate is given as:

$$\mathrm{FPR} = \frac{FP}{FP + TN}$$
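As a quick illustration, both rates can be computed directly from the entries of a binary confusion matrix; the sketch below (assuming labels 0 = normal, 1 = fault) mirrors the definitions above.

```python
from sklearn.metrics import confusion_matrix

def tpr_fpr(y_true, y_pred):
    """Compute TPR and FPR from the binary confusion matrix."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    tpr = tp / (tp + fn)  # True Positive Rate
    fpr = fp / (fp + tn)  # False Positive Rate
    return tpr, fpr
```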
To thoroughly assess the above-mentioned machine learning algorithms, it is crucial to consider factors such as computational efficiency, interpretability, and adaptability to diverse fault types. The simplicity and transparency of KNN render it an ideal foundational option, in contexts where interpretability of fault classification and detection processes is essential. The capability of SVM to manage non-linear relationships and high-dimensional data makes it a formidable option for intricate fault situations. Random Forest, with its ensemble approach, excels in capturing intricate patterns and providing robust predictions, while Autoencoder, leveraging neural network capabilities, proves effective in learning complex representations and detecting anomalies. The specific requirements of the power system and the nature of the problems detected ultimately determine which of these approaches is best.
4.1. Case 1: Training and Testing with Identical Fault Simulation Data
In the first case, both the training and testing datasets are obtained directly from the same pool of simulated fault data generated in the 9-bus Simulink model. Faults of different types and at different locations are simulated, and the dataset is randomly divided into training (80%) and testing (20%) subsets. This case provides a baseline evaluation of the models’ ability to learn fault signatures when the training and testing distributions are identical. Although it offers insight into model accuracy under controlled conditions, it does not reflect real-world variations where unseen operating conditions or new data distributions may occur.
4.1.1. K-Nearest Neighbor (KNN)
KNN is an instance-based, non-parametric learning method that uses the proximity principle. For a given dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ represents the input features and $y_i$ denotes the class label, KNN uses the majority class of the k nearest neighbors to predict the class label for a new input $x$. The classification rule is:

$$\hat{y} = \arg\max_{c} \sum_{i \in \mathcal{N}_k(x)} \mathbb{1}(y_i = c)$$

Here, $\hat{y}$ denotes the predicted class label, $k$ is the number of neighbors taken into account, $\mathcal{N}_k(x)$ is the set of the $k$ nearest neighbors of $x$, and $\mathbb{1}(\cdot)$ is the indicator function. KNN is a good option for preliminary research because it is easy to use and computationally efficient.
To validate the algorithm, the Accuracy score was computed and an accuracy of 99% was achieved as shown in
Figure 5. The classification report in
Table 6 gives the precision, recall, and F1 score of the model for both normal conditions and fault-induced cases. This algorithm produces the second-lowest value for the loss function, as shown in
Figure 6, and the AUC, shown in
Figure 7, is 0.9975, indicating highly efficient performance.
For this study, the KNN classifier was implemented with k = 5 using the Euclidean distance metric. The dataset was divided into 80% training and 20% testing, with random shuffling applied to avoid bias. A five-fold cross-validation strategy was employed to improve generalization and reduce overfitting.
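A minimal sketch of this KNN configuration follows, assuming the preprocessed splits from Section 2 (X_train, y_train, X_test, and y_test are placeholder names).

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
cv_scores = cross_val_score(knn, X_train, y_train, cv=5)  # five-fold CV
knn.fit(X_train, y_train)
test_accuracy = knn.score(X_test, y_test)  # accuracy on the 20% hold-out set
```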
4.1.2. Support Vector Machine (SVM)
SVM is a discriminative model that excels at binary classification tasks. Given a set of training data $\{(x_i, y_i)\}_{i=1}^{N}$, where $y_i \in \{-1, +1\}$, SVM finds a hyperplane that maximally separates the classes. The decision function is defined as:

$$f(x) = \operatorname{sign}(w \cdot x + b)$$

The parameters $w$ and $b$ in this equation are obtained during the training phase, and $w \cdot x$ denotes the dot product of the vectors $w$ and $x$. SVM is a flexible option for power system failure detection because of its capacity to manage high-dimensional data and non-linear decision boundaries.
Support Vector Machine is a supervised learning algorithm primarily used for classification tasks, and in this study, it was employed to classify data into two categories: “Normal” (no fault) and “With Fault” (fault present). Two variants of SVM were implemented—Support Vector Classification (SVC) and Nu-Support Vector Classification (NuSVC). Both can function as either binary or multi-class classifiers, but a binary classification approach was used in this study. The key difference between SVC and NuSVC lies in how they regulate support vectors: SVC uses the regularization parameter C, which controls the trade-off between maximizing the margin and minimizing classification errors, whereas NuSVC replaces C with $\nu$ (nu), a parameter that controls the proportion of support vectors and margin errors, making it more flexible in handling imbalanced datasets. The SVM classifier was trained using a radial basis function (RBF) kernel, with the regularization parameter $C$ and kernel coefficient $\gamma$ selected by grid search. An 80/20 train–test split with five-fold cross-validation was applied for consistent evaluation.
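A sketch contrasting the two variants is given below; the nu value is an illustrative assumption, and the C and gamma values stand in for the grid-searched settings.

```python
from sklearn.svm import SVC, NuSVC

svc = SVC(kernel="rbf", C=10.0, gamma="scale")      # margin/error trade-off via C
nusvc = NuSVC(kernel="rbf", nu=0.1, gamma="scale")  # nu bounds the fraction of
                                                    # margin errors / support vectors
for name, clf in [("SVC", svc), ("NuSVC", nusvc)]:
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))
```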
From the simulation results, SVC outperformed NuSVC in terms of accuracy and loss value, indicating better classification performance. However, despite its structured classification approach, SVM had the lowest accuracy and Area Under the Curve (AUC) among all the machine learning algorithms evaluated. The AUC, which measures classification quality, confirmed that SVM performed the worst in
Table 7, making it the least preferred model for fault detection in the proposed system. Due to its lower accuracy and weaker classification ability in this fault detection context, SVM is not the most suitable algorithm for the given dataset. While it remains a powerful classifier in certain applications, its performance in detecting faults in smart grids was suboptimal compared to other machine learning methods analyzed in this study.
4.1.3. Random Forest
Several decision trees are combined in the ensemble learning technique Random Forest to produce reliable predictions. A bootstrap sample of the data is used to train each tree, and the trees' diversity is increased through random feature selection. The final prediction aggregates the individual tree predictions by majority vote:

$$\hat{y} = \operatorname{mode}\{T_1(x; \Theta_1), \ldots, T_B(x; \Theta_B)\}$$

where $B$ is the number of trees, $\Theta_b$ represents the set of parameters defining the $b$-th decision tree, and $x$ represents the features of the data. When it comes to managing intricate relationships in data, preventing overfitting, and producing accurate predictions, Random Forest shines. The Random Forest model was implemented with 100 estimators (trees) and a maximum depth of 10. As with the other models, an 80/20 train–test split and five-fold cross-validation were adopted to ensure robustness and reduce bias.
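A sketch matching the configuration reported here (100 trees, maximum depth 10); the remaining arguments are scikit-learn defaults.

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf.fit(X_train, y_train)
fault_scores = rf.predict_proba(X_test)[:, 1]  # class probabilities for ROC/AUC
```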
The Random Forest classifier achieved the highest accuracy among all the algorithms compared in this investigation. With the lowest loss value and an AUC of 1.0, the Random Forest classifier was the best ML algorithm among the supervised learning models examined in this study, giving exemplary results as shown in
Figure 5.
Figure 5. Accuracy comparison of different machine learning algorithms.
4.1.4. Autoencoder
An Autoencoder is built on a neural network architecture designed for unsupervised learning and is particularly effective for anomaly detection. Comprising an encoder and a decoder, it maps input data to a latent representation and reconstructs the input from this representation. Anomaly detection is achieved by measuring the reconstruction error, defined as:

$$E(x) = \| x - \hat{x} \|^{2}$$

In this equation, $x$ represents the input data, $\hat{x}$ its reconstruction, and the reconstruction error $E(x)$ quantifies the dissimilarity between the input data and its reconstructed form. Autoencoders are adept at capturing complex patterns in the data, making them suitable for power system fault detection. The Autoencoder used here is a fully connected unsupervised neural network trained using only clean input data, which also serves as its target output. The Autoencoder is of interest in this study for two reasons: it reduces the dimensionality of the data, and it learns the correlations between feature vectors, i.e., Autoencoders are not attack-specific. The Autoencoder gave an accuracy of 98%, as shown in
Figure 5, which is lower compared to other algorithms. The Autoencoder was configured with three hidden layers using ReLU activation, a latent dimension of 16, and trained with the Adam optimizer at a learning rate of 0.001. The model was trained exclusively on normal operating data, with evaluation based on an 80/20 train–test division and five-fold cross-validation.
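A hedged Keras sketch of this configuration follows (three ReLU hidden layers, latent size 16, Adam at 0.001, trained on normal samples only); the encoder/decoder widths, epoch count, and the 99th-percentile threshold are assumptions not stated in the paper.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

n_features = X_train.shape[1]
autoencoder = models.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(16, activation="relu"),   # latent representation (dimension 16)
    layers.Dense(64, activation="relu"),
    layers.Dense(n_features, activation="linear"),
])
autoencoder.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                    loss="mse")

X_normal = X_train[y_train == 0]           # train exclusively on clean data
autoencoder.fit(X_normal, X_normal, epochs=50, batch_size=32, verbose=0)

# Flag samples whose reconstruction error exceeds a threshold set on normal data
err_normal = np.mean((X_normal - autoencoder.predict(X_normal)) ** 2, axis=1)
threshold = np.percentile(err_normal, 99)
err_test = np.mean((X_test - autoencoder.predict(X_test)) ** 2, axis=1)
y_pred = (err_test > threshold).astype(int)
```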
The results indicate that Autoencoder and Random Forest are the most reliable algorithms for fault detection in smart distribution grids. Autoencoder’s ability to minimize reconstruction error and Random Forest’s high accuracy and low loss values make them suitable for ensuring grid resilience. KNN, while simple and interpretable, and SVM, although effective for specific scenarios, do not perform as well as Autoencoder and Random Forest in this context.
The results demonstrate that supervised learning algorithms, such as KNN, SVM, and Random Forest, are effective models for fault detection in a smart grid when trained on labeled datasets containing both normal and faulty conditions. However, their performance is highly dependent on the nature of the test data, particularly when it closely resembles the training data. In contrast, unsupervised learning models like Autoencoders operate differently; they are trained exclusively on normal (clean) data and detect anomalies based on deviations from learned patterns. This distinction is critical when evaluating fault detection capabilities, as supervised models may exhibit high accuracy when tested on faults they have been explicitly trained on but struggle with unseen or random faults—fault conditions that were not included in the training dataset.
Random faults, such as high impedance faults, intermittent faults, or cyber-induced disturbances, introduce unpredictable variations that test a model’s ability to generalize beyond predefined faults. To ensure a fair comparison, the study evaluates how each model performs when tested with such unknown faults. This highlights a key limitation of supervised models, which tend to be fault-specific, whereas Autoencoders generalize better by identifying deviations from normal behavior. Therefore, for robust fault detection in smart grids, where unforeseen faults are common, Autoencoders offer a significant advantage in ensuring grid resilience.
The comparative analysis of machine learning algorithms for fault detection in transmission lines indicates that the Random Forest model is the most reliable, exhibiting consistently low log loss (
Figure 6) and high AUC values (
Figure 7). The SVM model also demonstrates excellent performance, particularly in capturing non-linear relationships and managing high-dimensional data (
Figure 5 and
Figure 6). Although KNN is a simpler approach, it achieves competitive classification accuracy and retains interpretability, making it suitable for baseline fault detection applications (
Figure 5 and
Figure 6). The Autoencoder, while effective for unsupervised anomaly detection, shows higher log loss (
Figure 6) and comparatively lower robustness in terms of AUC (
Figure 8). Overall, Random Forest and SVM emerge as the most effective models, offering a favorable balance of accuracy, adaptability, and interpretability across diverse fault conditions.
Figure 6. Log loss for case 1.
Figure 7. AUC of supervised learning algorithms.
Figure 8. AUC of Autoencoders.
4.2. Case 2: Training with Fault Simulation Data and Testing with a Random Fault
Accuracies of the supervised algorithms drop drastically when tested for a random fault; comparison of the accuracies is presented in
Figure 9. The supervised algorithms are inefficient at detecting a fault that lies outside the scope of their training data, which indicates that they succeed only at detecting the specific faults they were trained on. Therefore, the supervised algorithms in this case study (SVM, KNN, and Random Forest) are fault-specific in nature. This characteristic is the least desirable in the context of a smart grid, which is vulnerable to diverse kinds of faults because of the extensive integration of information and communication technologies and intelligent systems.
Autoencoders under the same condition retain an accuracy of 98%, similar to that of the previous condition. This can be attributed to the training process of Autoencoders, wherein the algorithm, trained only on clean data, learns the correlations between the data entries. Autoencoders can therefore detect any fault, since faulty data stands out from the clean data and fails to reproduce the correlation structure learned during training. As a result, Autoencoders are not fault-specific, which makes them an optimal choice for fault detection in smart grids and helps ensure enhanced grid stability and resilience. The log losses for the two cases are given in
Table 8.
The comparative analysis was carried out on a workstation equipped with an Intel Core i7 processor (3.2 GHz), 16 GB RAM, and an NVIDIA GTX 1660 GPU. All simulations of the distribution grid and fault scenarios were implemented in MATLAB/Simulink (R2022a). The machine learning models were developed in Python (version 3.9), utilizing the scikit-learn library for Support Vector Machine (SVM), Random Forest (RF), and K-Nearest Neighbors (KNN), and TensorFlow/Keras for the Autoencoder.
5. GAN-Based Algorithm for Generation of Synthetic Data in Fault Prediction
Building directly on the findings of the comparative analysis in Section 4, this section introduces a Generative Adversarial Network (GAN)-based approach for generating synthetic data (D2). The purpose of this extension is to address the limitations identified in the comparative analysis by augmenting the training dataset with diverse and realistic fault scenarios. By comparing the performance of machine learning models under both D1 and D2, we verify the effectiveness of GAN-generated data in enhancing fault detection accuracy and robustness.
The Generative Adversarial Network (GAN)-based algorithm is designed to generate synthetic data for fault prediction, serving as a critical tool for improving the reliability and efficiency of industrial systems. Because obtaining labeled data for ML models is arduous due to constraints such as privacy issues, this paper proposes a method that uses GANs to create synthetic data points that closely mimic the underlying distribution of the original dataset.
The synthetic data generation process using a GAN begins with preprocessing the dataset by cleaning, normalizing, and selecting relevant features, focusing on the extraction of numerical features for fault prediction. Once these features are selected, the GAN is trained to model the data as a combination of several Gaussian distributions, characterized by their means and covariances. After training, the GAN generates synthetic data by sampling from the learned distributions, ensuring the generated data replicates the original dataset's underlying patterns. For categorical features, the algorithm assumes independence and samples based on their observed distributions, thereby preserving the categorical characteristics. Finally, the sampled numerical and categorical features are combined to create a synthetic dataset that closely mirrors the original data distribution. This synthetic dataset is valuable for training ML models in situations where labeled data is limited or privacy concerns restrict access to openly accessible real-world datasets, thus enhancing the performance of fault prediction systems [
26].
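As a minimal illustration of the Gaussian-mixture-style sampling described above, the sketch below uses scikit-learn's GaussianMixture as a stand-in generator for the numerical features and draws categorical features independently from their observed frequencies; the function, column names, and component count are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

def synthesize(df, numeric_cols, categorical_cols, n_samples, n_components=5):
    """Fit Gaussians (means and covariances) to the numeric features and resample;
    categoricals are sampled independently from their empirical distributions."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm.fit(df[numeric_cols].values)
    X_num, _ = gmm.sample(n_samples)
    synth = pd.DataFrame(X_num, columns=numeric_cols)
    for col in categorical_cols:
        freq = df[col].value_counts(normalize=True)
        synth[col] = np.random.choice(freq.index, size=n_samples, p=freq.values)
    return synth
```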
The creation of accurate machine learning models depends on access to sufficient labeled data; however, obtaining a significant quantity of labeled data is frequently difficult because of practical limitations. Synthetic data generation offers an alternative, either by augmenting existing datasets or by producing new ones that closely resemble the original data. In this research, we focus on generating synthetic data for fault-type prediction using GANs.
In fields where data is expensive or hard to obtain, GANs have become a potent foundation for creating realistic synthetic data. A GAN consists of two neural networks, a generator $G$ and a discriminator $D$, trained concurrently in a minimax game. The generator $G(z)$, where $z$ is a noise vector drawn from a prior $p_z(z)$, learns to map from the latent space to the data distribution $p_{\mathrm{data}}$. The discriminator $D(x)$ learns to distinguish between real data samples $x \sim p_{\mathrm{data}}(x)$ and generated samples $G(z)$. The objective function of a GAN is given by:

$$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
In this game, the discriminator seeks to maximize the objective, whereas the generator seeks to minimize it. The adversarial interplay between the two networks pushes the generator to create data that is nearly indistinguishable from the real data, increasing the model's ability to produce high-quality synthetic data for a range of uses.
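A compact TensorFlow sketch of this adversarial game is given below; the network sizes, noise dimension, feature count, and learning rates are illustrative assumptions, not the paper's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

latent_dim, n_features = 32, 6  # assumed noise and feature dimensions

G = models.Sequential([layers.Input(shape=(latent_dim,)),
                       layers.Dense(64, activation="relu"),
                       layers.Dense(n_features)])
D = models.Sequential([layers.Input(shape=(n_features,)),
                       layers.Dense(64, activation="relu"),
                       layers.Dense(1, activation="sigmoid")])

bce = tf.keras.losses.BinaryCrossentropy()
g_opt, d_opt = tf.keras.optimizers.Adam(1e-4), tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(x_real):
    z = tf.random.normal([tf.shape(x_real)[0], latent_dim])
    with tf.GradientTape() as gt, tf.GradientTape() as dt:
        x_fake = G(z, training=True)
        d_real, d_fake = D(x_real, training=True), D(x_fake, training=True)
        # Discriminator maximizes log D(x) + log(1 - D(G(z)))
        d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
        # Generator loss (non-saturating form of min log(1 - D(G(z))))
        g_loss = bce(tf.ones_like(d_fake), d_fake)
    d_opt.apply_gradients(zip(dt.gradient(d_loss, D.trainable_variables),
                              D.trainable_variables))
    g_opt.apply_gradients(zip(gt.gradient(g_loss, G.trainable_variables),
                              G.trainable_variables))
```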
This study investigated the use of both conventional and generative techniques to improve fault detection in power systems and smart grids. It uses two datasets—a synthetic dataset produced using generative artificial intelligence (D2) and a traditionally curated dataset (D1)—to compare supervised and unsupervised machine learning methods. More precisely, Dataset D1 consists of experimental data gathered using traditional techniques and thus corresponds to the conventional notion of a dataset. The study demonstrates that the synthetic data greatly increases fault detection accuracy and reliability by closely simulating real-world conditions.
As shown in
Figure 10, the process begins with the acquisition of two datasets: (i) Dataset D1 obtained from conventional simulation of grid faults and (ii) Dataset D2 generated using a GAN to mimic real operating conditions. Both datasets undergo preprocessing steps such as normalization and feature selection. The processed data are then used to train machine learning models (Random Forest, SVM, KNN, and Autoencoder). In addition, a voting-based ensemble classifier is applied to combine predictions from individual models. Evaluation metrics, including Accuracy, Precision, Recall, F1-score, log loss, and ROC–AUC, are calculated. Finally, the performance of the models trained on D1 and D2 is compared to validate the effectiveness of synthetic data in improving fault detection accuracy and robustness.
The results show how generative AI can produce high-quality datasets that improve fault detection processes, providing a more dependable and scalable solution for industrial applications (Figure 11).
To ensure fair comparability, both the conventional dataset D1 and the GAN-augmented synthetic dataset D2 are tied to the same Simulink grid model and the same set of fault locations. Dataset D2 is generated by training the GAN on the RMS voltage and current features derived from D1 and then sampling additional points that preserve the joint distribution of these features under the same grid topology and operating ranges. Consequently, all model training and testing using D2 remains grounded in the identical physical grid representation used for D1. The purpose of our approach is to anticipate fault types using synthetic data, which is crucial when addressing privacy issues or the scarcity of relevant, high-quality data. To illustrate this, we build a fault prediction system using ensemble learning with a Voting Classifier, relying on standard libraries such as pandas, NumPy, and scikit-learn for data preprocessing, model training, and evaluation. Our approach involves setting up several base classifiers, including Random Forest Classifier, Logistic Regression, Stochastic Gradient Descent (SGD) Classifier, and K-Nearest Neighbors (KNN) Classifier, and then creating a Voting Classifier that combines the predictions of these base classifiers through a hard voting strategy, as sketched below. The ensemble model is trained on the available dataset and used to predict fault types on the generated synthetic dataset, as described in the subsequent sections.
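A minimal sketch of this ensemble follows; the dataset variables (X_train_d1, y_train_d1, X_d2) are placeholder names for the D1 training data and the synthetic D2 features.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.neighbors import KNeighborsClassifier

ensemble = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100)),
                ("lr", LogisticRegression(max_iter=1000)),
                ("sgd", SGDClassifier()),
                ("knn", KNeighborsClassifier(n_neighbors=5))],
    voting="hard")                          # hard voting: majority of class labels

ensemble.fit(X_train_d1, y_train_d1)        # train on the available (D1) data
y_pred_synth = ensemble.predict(X_d2)       # predict fault types on synthetic D2
```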
5.1. Comparing Classifier Performance: Synthetic vs. Real Data
In fault detection tasks, synthetic data generated from Gaussian models offers valuable perspectives. When comparing the accuracy of classifiers trained on synthetic data rather than real-world data, noteworthy differences emerge. According to the examination, the Boosting Classifier (BC) [
27] and the Random Forest Classifier (RFC) [
28] both exhibit the best accuracy on synthetic data, with RFC slightly beating BC. According to these results, fault identification via synthetic data can be accomplished successfully while preserving competitive classifier performance [
29].
The findings in
Figure 11 show that some classifiers successfully use synthetic data to improve the accuracy of fault detection. Interestingly, when trained on synthetic data instead of real data, the KNN [
30] classifier shows a substantial increase in accuracy. On the other hand, classifiers such as Gaussian Naive Bayes (GNB) [
31] and Stochastic Gradient Descent (SGD) exhibit reduced accuracy on synthetic data. This implies that the synthetic data produced by Gaussian models might not accurately capture certain subtleties or features present in the general data, resulting in a decline in the models’ performance.
For many classifiers, the general increase in accuracy with synthetic data is apparent; nevertheless, additional research is necessary to understand the underlying dynamics behind these variations. Comparing the distribution of synthetic data to that of general data, spotting biases or deviations, and improving the synthetic data creation procedure to better match the features of the original dataset could all be part of this analysis. Furthermore, adjusting the Gaussian model parameters or investigating different synthetic data-generating techniques may improve the value of synthetic data in fault detection tasks.
The bar graph compares the accuracy of various classifiers trained on synthetic and on real data, with the synthetic data results displayed on the left and the general data results on the right. Each bar reflects the accuracy achieved by a particular classifier, and the performance gap between the two types of datasets provides insight into how well synthetic data generation strategies improve classifier accuracy for fault prediction tasks.
5.2. Impact of Synthetic Data Proportion on Model Performance
To further evaluate the effectiveness of GAN-generated data, we conducted an analysis by varying the proportion of synthetic data combined with real data during model training. Four scenarios were considered: 25% synthetic data, 50% synthetic data, 75% synthetic data, and 100% synthetic data. The evaluation was performed using Accuracy, Precision, Recall, F1-score, and ROC–AUC metrics.
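One way to run this mixing experiment is sketched below; the variable names and the Random Forest base model are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Vary the synthetic share of a fixed-size training set and score on a
# held-out real test set (X_real_train, X_synth, etc. are placeholders).
for frac in [0.25, 0.50, 0.75, 1.00]:
    n_total = len(X_real_train)
    n_synth = int(frac * n_total)
    n_real = n_total - n_synth
    X_mix = np.vstack([X_real_train[:n_real], X_synth[:n_synth]])
    y_mix = np.concatenate([y_real_train[:n_real], y_synth[:n_synth]])
    clf = RandomForestClassifier(n_estimators=100).fit(X_mix, y_mix)
    acc = accuracy_score(y_real_test, clf.predict(X_real_test))
    print(f"{frac:.0%} synthetic -> accuracy {acc:.3f}")
```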
The results in
Table 9 indicate that introducing synthetic data improves model performance when used in moderate proportions. In particular, adding 50–75% synthetic data yields the highest improvement in Accuracy and AUC. However, when the training dataset consists entirely of synthetic data (100%), a slight decline in performance is observed compared to the mixed scenarios. This suggests that GAN-generated data are highly effective for augmenting real data, but the best results are obtained when synthetic and real datasets are combined.
These findings validate the effectiveness of the data generated by GAN and highlight the importance of balancing synthetic and real data when developing robust fault detection models.