A Comparative Study of Dimensionality Reduction Methods for Accurate and Efficient Inverter Fault Detection in Grid-Connected Solar Photovoltaic Systems

Tufail, Shahid; Sarwat, Arif I.

doi:10.3390/electronics14142916


by Shahid Tufail ¹,² and Arif I. Sarwat ¹,*

¹ Department of Electrical and Computer Engineering, Florida International University, Miami, FL 33174, USA
² NextNRG LLC., Miami Beach, FL 33139, USA
* Author to whom correspondence should be addressed.

Electronics 2025, 14(14), 2916; https://doi.org/10.3390/electronics14142916

Submission received: 30 May 2025 / Revised: 9 July 2025 / Accepted: 17 July 2025 / Published: 21 July 2025

(This article belongs to the Special Issue Machine Learning Applications in Predictive Monitoring of Power Grid Stability and Resiliency Enhancement)


Abstract

The continuous, effective operation of grid-connected photovoltaic (GCPV) systems depends on dependable inverter fault detection. Early, precise fault diagnosis improves overall system dependability, lowers maintenance costs, and reduces downtime. Although computational efficiency remains a challenge, particularly in resource-limited contexts, machine learning (ML)-based fault detection offers promising gains in accuracy and responsiveness. By reducing data complexity and enabling faster, more effective fault diagnosis, dimensionality reduction methods play a vital role. Using dimensionality reduction and ML techniques, this work explores inverter fault detection in GCPV systems. Photovoltaic inverter operational data were preprocessed and normalized. Next, dimensionality reduction using Principal Component Analysis (PCA) and autoencoder-based feature extraction was explored. Four classifiers were trained: Random Forest (RF), logistic regression (LR), decision tree (DT), and K-Nearest Neighbors (KNN). Trained on the whole standardized dataset, the RF model produced the highest accuracy of 99.87%, capturing complicated feature interactions but requiring substantial processing resources and a training time of 36.47 s. The LR model showed lower accuracy but a much shorter training time than the other models. Further, PCA greatly lowered computing demands, especially improving inference speed for LR and KNN. Autoencoder-derived features maintained a high accuracy of 99.23% across all models.

Keywords:

machine learning; artificial intelligence; fault detection; principal component analysis; autoencoders; support vector machine; classification; supervised training; random forest; decision tree

1. Introduction

The global shift towards renewable energy sources, driven by the urgent need to mitigate environmental impacts and ensure sustainable energy generation, has significantly increased the deployment of solar photovoltaic (PV) systems. At the core of these PV systems are solar inverters, critical devices responsible for converting direct current (DC) produced by solar panels into alternating current (AC) compatible with the electric grid [1]. Due to their crucial role, the operational reliability and performance efficiency of solar inverters directly influence the overall system effectiveness, financial return, and grid stability [2]. However, these inverters are susceptible to various faults, including open-circuit, short-circuit, insulation degradation, overheating, and grid synchronization issues, potentially reducing system efficiency, compromising safety, and resulting in substantial financial losses [3,4,5].

Detecting inverter faults rapidly and accurately is a vital task that ensures seamless operation, system reliability, and long-term profitability. Multiple factors, including weather, operating conditions, and manufacturing defects, can significantly impact the reliability of the inverters [6].

With advancements in hardware storage technologies and the declining cost of high-speed data storage and IoT devices [7], both the volume and velocity of data have increased substantially. This growth has created new opportunities for machine learning (ML) approaches, as faster data writing and reading enables faster processing, real-time analytics, and the development of more sophisticated models capable of handling large-scale, dynamic datasets. ML algorithms have found widespread applications in various domains, including cybersecurity, anomaly detection, and load forecasting [3,8,9,10,11,12,13,14,15,16,17,18]. The availability of such large-scale data has enabled the development of predictive models, contributing to significant progress in these emerging areas.

Recent studies highlight the efficacy of ML-based fault detection in PV systems [19,20]. For example, recognizing patterns in power system data enables supervised learning models for cyberattack detection [21,22]. Additionally, advanced anomaly detection methods, such as stacked autoencoders and PCA [21,23,24], effectively identify deviations from normal operating conditions, enabling early detection of False Data Injection Attacks (FDIAs) in smart grids.

Figure 1 presents a hierarchical tree structure for fault detection in solar inverters. It categorizes faults into electrical, mechanical, and environmental types. Fault detection methods are divided into traditional methods (rule-based monitoring, statistical analysis) and intelligent methods (machine learning and deep learning-based approaches). Data acquisition involves sensor-based collection, SCADA & IoT monitoring, and synthetic data generation. Performance evaluation is classified into classification metrics and regression metrics. The intelligent methods include supervised and unsupervised learning, CNNs for image detection, LSTMs for time series [25], multi-layer perceptrons, and autoencoders with generative adversarial networks (GANs) for anomaly detection [26,27].

1.1. Related Work

Extensive research conducted over the past decade (2015–2025) has consistently demonstrated that ML methods effectively detect and diagnose inverter faults in grid-connected solar systems, frequently achieving accuracies exceeding 95%. For instance, a recent study proposed an improved SE-ResNet18 (Squeeze-and-Excitation Residual Network) [28], which used techniques such as Conditional Variational Autoencoders (CVAE) and signal denoising via Wavelet Packet Decomposition (WPD). In that study, accuracy reached 98.18% on the original dataset and improved to 100% after the dataset was augmented.

Deep learning models are very accurate, but deploying them in the real world comes with challenges. Most importantly, real-time detection is difficult because these models need substantial computing power, including fast processing hardware and long training times [29]. Deep learning models are also regarded as black boxes that are not interpretable [30,31]: their complicated, multi-layer structures make it hard to determine why a particular fault prediction was made. This limitation is especially problematic in critical systems where transparency, regulatory compliance, and explainability are important. In contrast, alternative ML methods such as RF classifiers offer simpler, more interpretable, and computationally efficient solutions [31]. These ensemble-based models employ DTs that inherently facilitate transparency through explicit rule-based structures, providing insights into feature importance and decision-making criteria. A recent application of RF classifiers in inverter fault detection [32] showed an accuracy of up to 99%, supporting the reliability and usability of tree-based algorithms in scenarios where model interpretability and resource efficiency are prioritized.

A 2023 study [33] developed a hybrid AI strategy to improve fault detection in PV arrays and inverters. It predicted AC power with a regression model to identify inverter failures, using Elman neural networks (ENN), boosted tree algorithms (BTA), multi-layer perceptrons (MLP), and Gaussian process regression (GPR). The authors obtained high accuracy using real-world datasets containing operational characteristics such as daily energy production, ambient and module temperatures, solar radiation, and DC and AC power measurements. The optimized GPR model (GPR-M4) achieved low errors, with a mean absolute percentage error (MAPE) of 3.9% and a mean absolute error (MAE) of 0.002 for inverter faults, and a similar result for PV array faults (MAPE of 0.091 and negligible MAE).

Another study in 2024 [34] investigated ML-based methods for monitoring and classifying faults in solar photovoltaic (PV) inverters. Using real operational data from two solar plants (140 kWp and 590 kWp), the authors applied supervised learning algorithms, including fine, medium, and coarse DT models. The fine tree algorithm achieved the highest accuracy (up to 98.4%) in classifying common faults such as grid voltage abnormalities and output overloads. A semi-supervised VAE-based method [35] detects PV system faults by identifying latent-space deviations using 1SVM, iForest, EE, and LOF algorithms.

Recent studies [36,37] focus on fault detection in grid-connected PV systems operating under MPPT and IPPT modes. Both used high-frequency data covering seven realistic fault scenarios. The approach in [36] combines PCA, KDE, and KLD with an adaptive mechanism to account for environmental variations. Although it achieved a false alarm rate below one percent and a fast processing time, that study did not report standard classification metrics. In contrast, ref. [37] used the GPVS-Faults dataset and evaluated supervised ML models including RF, LR, and NB. RF achieved an F1-score of 0.96, while LR slightly outperformed it in accuracy but required more training time. The PV array voltage was found to be the most important predictor.

An advanced fault diagnosis method for photovoltaic (PV) systems using cascaded multilevel H-bridge inverters was proposed in [19]. A two-stage classification framework using PCA and Support Vector Machine was developed to distinguish two groups of similar open-circuit faults in power switching devices (IGBTs). They tested 37 fault scenarios—one normal, eight single IGBT faults, and twenty-eight double IGBT faults. Each fault type had 200 samples with 10,000 inverter output voltage sampling points under different solar irradiance and temperature conditions. After PCA reduced dimensionality, SVM classified fault types. A second PCA-SVM classification distinguished similar faults that were initially difficult to classify. The proposed PCA-SVM secondary classification strategy outperformed traditional methods like PCA-SVM (94.59%) and PCA-ELM (89.0%) with a diagnostic accuracy of 99.95%. By eliminating ambiguities between similar fault groups, the method improved classification accuracy.

These findings show that ML improves inverter fault detection. However, given the large volume of sensor data, dimensionality reduction must be balanced against model accuracy and complexity. A balanced dataset is also needed to find meaningful patterns in the faulty classes. SMOTE, a simple oversampling technique, lacks diversity, making GAN-based approaches a viable alternative for imbalanced datasets. PCA reduces dimensionality in feature engineering, but stacked autoencoders should also be considered. Real-time deployment requires critical analysis of model training and inference times.

The proposed study integrates autoencoder-based nonlinear feature extraction with conventional ML classifiers, unlike previous studies that focused on deep learning classifiers or anomaly detection frameworks using autoencoders or VAEs. We also provide a benchmarking framework that compares the accuracy, AUC, training/inference time, and confusion matrix metrics of multiple classifiers after dimensionality reduction (via PCA and autoencoders), making it ideal for real-time and embedded fault detection systems. This structured analysis addresses the accuracy-computational feasibility trade-offs in resource-constrained environments, which are often overlooked in the literature.

1.2. Objective

The objective of this study is to develop an efficient inverter fault detection framework for grid-connected photovoltaic systems using dimensionality reduction and ML classifiers. We employ PCA and autoencoders to reduce the high-dimensional sensor data while preserving critical fault-related features. Scree analysis is performed to determine the optimal number of principal components (PCs). The dataset is trained using multiple classifiers, including DT, RF, KNN and LR, on both the full feature set and reduced feature sets (8 PCs from PCA and 8 features from autoencoders). The study also evaluates model performance across different dimensionality settings.

The key contributions of this paper include:

Comprehensive Classifier Benchmarking: Conducted a comparative analysis of multiple ML classifiers (RF, DT, LR, KNN) for inverter fault detection, highlighting trade-offs in accuracy, inference time, and deployment feasibility.
Evaluation of Dimensionality Reduction Techniques: Assessed the impact of PCA and autoencoders on classifier performance, including feature space reduction, optimal component selection, and preservation of critical fault signatures.
Real-Time Suitability Analysis: Investigated model training and inference times across different feature sets to identify configurations suitable for real-time deployment in resource-constrained PV systems.

2. System Design, Data Acquisition, and Methodology

This section provides an overview of the experimental setup and data used in this study. It begins with a discussion of inverter topologies commonly found in grid-connected PV systems, along with their typical fault mechanisms. This context establishes the relevance of the faults investigated. Following this, we describe the laboratory-based GCPVS used to generate fault data, the instrumentation involved, and the structure of the dataset. This information forms the basis for the ML-based fault classification framework proposed in subsequent sections.

2.1. Inverter Topologies and Fault Mechanisms in Grid-Connected PV Systems

Grid-connected photovoltaic (PV) systems commonly use three types of inverter topologies: string inverters, central inverters, and microinverters [38]. These topologies differ in architecture, scale of deployment, and associated failure modes as shown in Figure 2.

2.2. GCPVS for Inverter/IGBT Fault Analysis

String inverters are widely used in residential and small commercial applications. They connect multiple PV modules in series (forming a string), with each string feeding a single inverter. String inverters are prone to faults such as maximum power point tracking (MPPT) failures, DC-link capacitor degradation, IGBT open/short circuit faults, and ground leakage issues [39,40].

Central inverters, used in large utility-scale installations, aggregate power from multiple strings or arrays. Their larger size and centralized architecture make them susceptible to cooling system failures, control board malfunctions, synchronization issues with the grid, and bulk capacitor degradation. IGBT-related faults and DC overvoltage or undervoltage issues are also frequent [39,41].

Microinverters operate at the individual panel level, providing module-level conversion. While they offer enhanced fault isolation, common issues include communication loss, islanding faults, overheating due to enclosure design, and intermittent MPPT failures [42].

In this study, the IGBT open-circuit, IGBT short-circuit, open-switch, and normal operating conditions modeled in the benchmark dataset are representative of faults typically seen in both string and central inverters. These faults directly impact parameters such as input DC voltage ($V_{dc}$), input PV current ($I_{pv}$), and three-phase output currents ($I_{a}$, $I_{b}$, $I_{c}$), which are used as features in our ML framework. Thus, the classification tasks and fault detection methods explored in this paper are well-aligned with real-world inverter configurations used in grid-connected PV systems.

The dataset utilized in this study is derived from experimental fault scenarios in grid-connected photovoltaic (PV) systems operating under both Maximum Power Point Tracking (MPPT) and Intermediate Power Point Tracking (IPPT) modes. The dataset, known as GPVS-Faults, was obtained from laboratory experiments that systematically introduced faults into a PV system, including inverter faults and Insulated Gate Bipolar Transistor (IGBT) failures, among other fault types [43]. This study validates a fault detection method using a lab-implemented GCPVS, evaluating its performance under experimental conditions in two operational modes: MPPT and IPPT. The objective is to assess the method’s effectiveness in accurately identifying Inverter/IGBT faults under these modes.

The GCPV system employed a Programmable DC Power Supply Chroma 62150H-1000S (1000 V/15 A/15 kW) with Solar Array Simulator software to simulate PV array outputs under varying irradiance (Gi) and PV cell temperature (Tc) conditions [39]. The Chroma 62150H-1000S enabled the emulation of crystalline, multi-crystalline, and thin-film PV arrays with distinct fill factors. A Programmable AC Source Chroma 61511 (0–300 V, 15 Hz–1.5 kHz, 12 kVA) replicated AC grid conditions and captured critical PV inverter dynamics [36]. The AC load ensured system protection during the intentional introduction of Inverter/IGBT faults into the GCPVS, maintaining experimental safety and reliability.

Data acquisition and control algorithms were implemented using dSPACE 1104 hardware with MATLAB/Simulink’s RTI [39]. Voltage Oriented Control (VOC) and Space Vector Pulse Width Modulation (SVPWM) regulated active/reactive power, while a Phase-Locked Loop (PLL) synchronized the inverter output with the grid. A Particle Swarm Optimization (PSO)-based controller switched between operating modes depending on the available power.

2.3. Grid-Connected PV System Fault Description

This study specifically addresses inverter-related faults, particularly Insulated-Gate Bipolar Transistor (IGBT) failures within the implemented GCPVS. Real inverter fault data were generated by deliberately failing one of the six IGBTs and were collected to rigorously test and validate fault detection algorithms. Fault scenarios were manually introduced during multiple independent trials lasting 10–15 s, with faults injected around the 7th or 8th second of each trial. The data were sampled every 100 microseconds (µs), ensuring high-resolution measurements for both faulty and fault-free conditions. These experimentally generated faults provide reliable, realistic data suitable for training and validating fault detection algorithms aimed at inverter protection and predictive maintenance tasks. Additional scenario descriptions are provided in [43]. Table 1 presents a complete list of the features in the dataset. Figure 3 presents correlations between the input features and the target variable for inverter fault detection. The features most strongly correlated with the target are Vpv (positive correlation of 0.72) and Ipv (negative correlation of −0.51), indicating that these variables significantly impact fault prediction. Features like Iabc also exhibit a moderate negative correlation (−0.49).
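The feature-target correlation analysis described above can be illustrated with a minimal sketch. It assumes a Pearson correlation against a binary fault label, which the text does not state explicitly. The data are synthetic: the column names `Vpv` and `Ipv` mirror the dataset features, but the values, effect sizes, and the `noise` column are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
target = rng.integers(0, 2, size=n)  # 0 = healthy, 1 = fault (synthetic labels)

# Synthetic features; the shifts under fault are illustrative assumptions only.
features = {
    "Vpv": 400 + 30 * target + rng.normal(0, 20, n),  # rises under fault
    "Ipv": 5 - 1.0 * target + rng.normal(0, 1, n),    # falls under fault
    "noise": rng.normal(0, 1, n),                     # unrelated to the label
}

# Pearson correlation of each feature with the binary target.
corr = {name: float(np.corrcoef(col, target)[0, 1]) for name, col in features.items()}

# Rank features by absolute correlation, strongest first.
for name, r in sorted(corr.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: r = {r:+.3f}")
```

The sign of each coefficient indicates whether a feature tends to increase or decrease when a fault is present, which is how the positive Vpv and negative Ipv correlations in Figure 3 can be read.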

2.4. Methodology

This study presents a comprehensive methodology for evaluating and comparing the effectiveness of various ML algorithms in the classification of inverter faults. The dataset contains 272,727 records with 15 features, including the target. Of these records, 143,715 (52.7%) are healthy instances and 129,012 (47.3%) are fault instances. The methodology, summarized in Figure 4 and Algorithm 1, covers data preprocessing, dimensionality reduction, and model evaluation.

Algorithm 1 Inverter fault diagnosis.

Step 1: Dataset Loading
    Load dataset D containing inverter operational data from the data repository.
    Log dataset dimensions and structure.
Step 2: Exploratory Data Analysis (EDA)
    Analyze data summary, descriptive statistics, and check for missing values.
    Visualize data distribution and correlation between features.
Step 3: Data Preprocessing
    Standardize input features using StandardScaler.
Step 4: Feature Importance Analysis
    Train an RF classifier on standardized data.
    Compute and plot feature importance scores.
Step 5: Dimensionality Reduction using PCA
    Apply PCA to standardized data.
    Reduce dimensionality to 8 principal components.
Step 6: Autoencoder Training
    Define the autoencoder neural network architecture:
        Input layer matching input feature dimensions.
        Hidden layers with ReLU activation functions and Batch Normalization.
        Bottleneck layer of 8 dimensions.
    Train the autoencoder for 30 epochs with batch size 32.
    Extract encoded features from the trained autoencoder for classification.
Step 7: Feature Importance Analysis
    Train an RF classifier on standardized original features.
    Compute and visualize feature importance scores.
Step 8: Model Training and Evaluation
    Split dataset into training and test sets (80:20 ratio).
    Split training data into training and validation sets (80:20 ratio).
    for each classifier in {LR, DT, RF, KNN} do
        Train classifier on:
            Original standardized features,
            PCA-transformed features,
            Autoencoder-encoded features.
        Evaluate models using accuracy, confusion matrices, classification reports, ROC curves, and ROC-AUC scores.
    end for
Step 9: Results and Comparison
    Summarize and compare classifier performance metrics in tabular form.
    Visualize ROC curves for model comparison.
Step 10: Model Selection and Saving
    Identify the best-performing model based on evaluation metrics.
    Save the selected model for future deployment.

The initial phase involved loading the dataset for preprocessing and initial data analysis. The dataset comprises a series of features and a binary target variable indicating the presence or absence of inverter faults, as shown in Table 1. Following data loading, the independent features and the target variable were separated, facilitating independent analysis and preprocessing. The independent numerical variables in the dataset were then standardized using the StandardScaler from scikit-learn v1.6.1. Scaling is a crucial step: by giving every variable zero mean and unit variance, it ensures that each feature contributes on a comparable scale to model training.
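As a minimal sketch of the standardization step, the snippet below applies scikit-learn's `StandardScaler` to a small synthetic matrix. The array `X` and its column scales are placeholders, not GPVS-Faults data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (rows = samples, columns = features).
rng = np.random.default_rng(0)
X = rng.normal(loc=[400.0, 5.0, 0.0], scale=[20.0, 1.5, 3.0], size=(1000, 3))

scaler = StandardScaler()
X_std = scaler.fit_transform(X)  # z = (x - mean) / std, applied column by column

print(X_std.mean(axis=0).round(6))  # each column is centred at (numerically) zero
print(X_std.std(axis=0).round(6))   # each column has unit standard deviation
```

After scaling, columns that originally differed by orders of magnitude (for example a voltage and a current) have the same spread, which matters for distance-based models such as KNN and for gradient-based training of the autoencoder.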

In the next step, the standardized data was split into training, validation, and testing subsets by applying the train-test split strategy twice. In the first split, the dataset was divided into an 80% training set and a 20% testing set. The training subset was then split again into training and validation subsets, comprising 80% and 20% of the initial training data, respectively. This multi-step splitting allowed for effective model training, hyperparameter tuning, and unbiased evaluation on unseen test data, and helped guard against overfitting.
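The two-stage split can be sketched with scikit-learn's `train_test_split` applied to row indices. This is an illustration only: the `random_state` value is an assumption, and the text does not say whether the split was stratified, so no stratification is used here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_records = 272_727         # dataset size reported in the paper
idx = np.arange(n_records)  # placeholder for the standardized rows

# First split: 80% training pool, 20% held-out test set.
idx_pool, idx_test = train_test_split(idx, test_size=0.2, random_state=42)

# Second split: 80% of the pool for training, 20% for validation.
idx_train, idx_val = train_test_split(idx_pool, test_size=0.2, random_state=42)

print(len(idx_train), len(idx_val), len(idx_test))
```

Overall this gives roughly 64% of the records for training, 16% for validation, and 20% for testing. With these settings the test split holds 54,546 rows, which equals the sum of the TP, TN, FP, and FN counts reported for RF with all features in Section 3.1.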

During the training phase, four commonly applied ML classifiers were employed: RF, LR, DT, and KNN. These models were selected for their different types of learning approaches—ensemble-based, linear, tree-based, and instance-based, respectively. This provides a wide range of comparisons. Initial training was performed using the entire standardized feature set.

Following the first phase of training, a feature importance analysis was performed to assess the contribution of each feature to correctly classifying the target label, as shown in Figure 5. We also used PCA to reduce dimensionality and improve computational efficiency. The PCA transformation reduced the feature space to eight principal components while retaining as much of the data variance, and therefore as much relevant information, as possible. The resulting derived feature set was then used to retrain the ML models. The PCA-derived features followed the same train-test split structure for consistency in comparison.
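The PCA step can be sketched as follows: fit a full PCA to inspect the explained-variance (scree) profile, then keep eight components. The data are synthetic. Fourteen input columns are assumed because the dataset has 15 features including the target, and the latent structure used to generate them is purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 14 correlated features built from 5 latent signals plus noise.
latent = rng.normal(size=(2000, 5))
X = latent @ rng.normal(size=(5, 14)) + 0.1 * rng.normal(size=(2000, 14))
X_std = StandardScaler().fit_transform(X)

# Full PCA: the per-component variance ratios are what a scree plot displays.
scree = PCA().fit(X_std).explained_variance_ratio_
cumulative = np.cumsum(scree)

# Reduced representation with eight principal components, as in the study.
pca = PCA(n_components=8)
X_pca = pca.fit_transform(X_std)

print(X_pca.shape)
print(cumulative[:8].round(4))  # variance retained by the first 1..8 components
```

In a scree analysis, the number of components is usually chosen where the cumulative curve flattens, so that additional components add little variance.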

Moreover, an autoencoder-based feature extraction method was introduced to explore a deep learning-based dimensionality reduction technique. The autoencoder architecture comprised an input layer matching the dimensionality of the standardized feature set and an intermediate dense layer with 32 neurons and a ReLU activation function, followed by a Batch Normalization layer to stabilize learning. This was followed by an encoding layer that compressed the features to eight dimensions. The decoder mirrored the encoding pathway and aimed to reconstruct the original input. The autoencoder was trained for 30 epochs with a batch size of 32 and a mean-squared error loss function. Upon training completion, the encoded features were extracted and used as inputs for retraining the selected ML classifiers.
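A minimal Keras sketch of an autoencoder with this shape is given below. It is not the authors' code: the Adam optimizer, the ReLU activation on the bottleneck, the Batch Normalization layer in the decoder, and the linear output layer are assumptions, and the input is random placeholder data with 14 columns.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def build_autoencoder(n_features: int, bottleneck: int = 8):
    """Return (autoencoder, encoder); both models share the encoder weights."""
    inputs = keras.Input(shape=(n_features,))
    x = layers.Dense(32, activation="relu")(inputs)
    x = layers.BatchNormalization()(x)
    code = layers.Dense(bottleneck, activation="relu", name="bottleneck")(x)

    # Decoder mirrors the encoder and reconstructs the standardized input.
    x = layers.Dense(32, activation="relu")(code)
    x = layers.BatchNormalization()(x)
    outputs = layers.Dense(n_features, activation="linear")(x)

    autoencoder = keras.Model(inputs, outputs)
    autoencoder.compile(optimizer="adam", loss="mse")  # optimizer is an assumption
    encoder = keras.Model(inputs, code)
    return autoencoder, encoder


rng = np.random.default_rng(0)
X_std = rng.normal(size=(512, 14)).astype("float32")  # placeholder data

autoencoder, encoder = build_autoencoder(n_features=X_std.shape[1])
# The network is trained to reproduce its own input (reconstruction objective).
history = autoencoder.fit(X_std, X_std, epochs=30, batch_size=32, verbose=0)

# The eight-dimensional codes are the features passed to the classifiers.
X_ae = encoder.predict(X_std, verbose=0)
print(X_ae.shape)
```

Building the encoder as a second model over the same layers means no weights are copied: once the autoencoder is trained, `encoder.predict` directly yields the compressed features.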

For each feature representation (standardized features, PCA-transformed features selected using the scree plot analysis shown in Figure 6, and autoencoder-extracted features), four ML models (RF, LR, DT, and KNN) were systematically trained and evaluated. The RF classifier, with 100 estimators, leveraged ensemble learning to mitigate overfitting and capture complex feature interactions. LR, configured for binary classification with a maximum iteration limit set to ensure convergence, provided insights into linear relationships within the data. DT classification offered intuitive model interpretability, and KNN, with k = 5, utilized proximity-based classification to identify patterns in the transformed feature spaces. The packages used in this work include numpy, pandas, scikit-learn, tensorflow, tabulate, and matplotlib.
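The benchmarking loop can be sketched as below, recording training time, prediction time, and test accuracy per classifier. The data come from `make_classification` as a stand-in for the inverter features. The `max_iter` value, the random seeds, and all other hyperparameters left at scikit-learn defaults are assumptions; only the 100 estimators for RF and k = 5 for KNN come from the text.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem standing in for the inverter data (14 inputs).
X, y = make_classification(n_samples=3000, n_features=14, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

models = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "LR": LogisticRegression(max_iter=1000),  # iteration cap is an assumption
    "DT": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

results = {}
for name, model in models.items():
    t0 = time.perf_counter()
    model.fit(X_train, y_train)
    train_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    acc = model.score(X_test, y_test)  # predicts on the test split, then scores
    predict_s = time.perf_counter() - t0

    results[name] = {"train_s": train_s, "predict_s": predict_s, "accuracy": acc}

for name, row in results.items():
    print(f"{name}: acc={row['accuracy']:.4f} "
          f"train={row['train_s']:.3f}s predict={row['predict_s']:.3f}s")
```

Running the same loop on each feature representation (full, PCA, autoencoder) gives directly comparable rows. KNN does almost no work during fitting and most of its work at prediction time, which is why its inference cost is the one most sensitive to the number of input dimensions.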

Key performance indicators included accuracy, confusion matrices, classification reports detailing precision, recall, and F1-scores, and the computation of the Receiver Operating Characteristic (ROC) curve along with the Area Under the Curve (AUC) score. The ROC analysis was particularly insightful, providing a comprehensive view of the trade-off between true positive rates and false positive rates across varying classification thresholds.
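To make the ROC and AUC computation concrete, the sketch below uses scikit-learn's `roc_curve` and `roc_auc_score` on a handful of hypothetical labels and fault-probability scores. The values are invented for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels (1 = fault) and predicted fault probabilities (hypothetical values).
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.65, 0.30])

# Each threshold yields one (FPR, TPR) point on the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
print(f"AUC = {auc:.4f}")
```

The AUC equals the fraction of (fault, healthy) pairs in which the fault sample receives the higher score, so it summarizes ranking quality independently of any single threshold.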

Furthermore, the autoencoder employed for feature extraction was validated by analyzing training convergence across epochs to ensure the stability of the encoding process. The training of the autoencoder spanned 30 epochs with a batch size of 32, with the optimizer iteratively adjusting the model parameters to minimize reconstruction error. Figure 7 shows the variation of the loss with respect to epochs during autoencoder training.

Detailed visualizations enhanced the interpretability of the results. Confusion matrices were generated for each model, clearly illustrating the distinction between true positives, false positives, true negatives, and false negatives. These visual tools were instrumental in diagnosing the strengths and limitations of each model with respect to classification accuracy and types of errors made.

In addition, the ROC curves for each model provided graphical representations of their discriminative capabilities, plotting sensitivity (true positive rate) against the false positive rate (1 − specificity). These curves, supplemented by the numerical AUC scores, facilitated straightforward comparisons among models, underscoring their relative performance and predictive capabilities in binary inverter fault classification.

3. Results and Discussion

The comparative study of dimensionality reduction methods for inverter fault detection in grid-connected solar photovoltaic (PV) systems yielded a comprehensive evaluation of various ML models, both with and without dimensionality reduction techniques. The experimental results are summarized in Table 2 and Table 3, which present the performance metrics of the models in terms of training time, prediction time, area under the curve (AUC) score, accuracy across training, validation, and test sets, as well as detailed classification metrics including TP, TN, FP, FN, accuracy, precision, recall, and F1 score. These metrics collectively provide insight into the effectiveness and efficiency of the four classifiers (RF, LR, DT, and KNN) when applied to the original feature set (All Features), PCA-reduced features, and autoencoder (AE)-reduced features. Figure 6 shows the ROC curves of the ML models for the three types of input features: original standardized features, PCA-reduced features, and autoencoder-derived features. The figure shows that the RF model with all features demonstrated the best classification ability, with an AUC of almost 1.0. Other high-performing models include KNN with all features (AUC of 0.9988), RF with PCA (0.9994), RF with autoencoder features (0.9993), and KNN with PCA and AE features (AUCs > 0.997).

3.1. Model Performance with All Features

The baseline performance of the models utilizing the full feature set demonstrated high predictive accuracy and robustness across all classifiers. As shown in Table 2, the RF model with all features achieved the highest test accuracy of 0.99, with a training time of 36.47 s and a prediction time of 0.39 s. The AUC score of 0.99 further corroborates its excellent discriminative ability. Detailed results in Table 3 indicate an accuracy of 0.9987, with a precision of 0.9992, recall of 0.9980, and F1 score of 0.9986. The model correctly identified 25,860 TP and 28,614 TN instances, with only 21 FP and 51 FN. The confusion matrices for all models in Figure 8 highlight the superior performance of RF in minimizing misclassifications.
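As a worked check, the reported metrics for RF with all features can be recomputed from the four confusion-matrix counts quoted above using the standard definitions. Only the counts are taken from the text; the variable names are illustrative.

```python
# Confusion-matrix counts reported for RF with all features.
tp, tn, fp, fn = 25_860, 28_614, 21, 51

total = tp + tn + fp + fn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # share of predicted faults that are real faults
recall = tp / (tp + fn)     # share of real faults that were detected
f1 = 2 * precision * recall / (precision + recall)

print(total)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

The recomputed values round to 0.9987, 0.9992, 0.9980, and 0.9986, matching the figures in the text, and the counts sum to 54,546 test samples.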

The LR model with all features, while computationally efficient with a training time of 0.81 s and a prediction time of 0.006 s, exhibited a slightly lower test accuracy of 0.97. Its AUC score remained high at 0.99, but the classification metrics in Table 3 reveal a drop in performance, with an accuracy of 0.9776, precision of 0.9815, recall of 0.9707, and F1 score of 0.9761. The rise in FP from 21 (RF) to 474 and in FN from 51 to 759 suggests that LR struggled to capture the full complexity of the data compared to RF.

The DT with all features model performed comparably to RF, with a test accuracy of 0.99, a training time of 3.82 s, and a prediction time of 0.008 s. Its AUC score of 0.99 and classification metrics (accuracy: 0.9963, precision: 0.9961, recall: 0.9961, F1 score: 0.9961) indicate strong performance, though it recorded slightly higher FP (101) and FN (100) than RF with all features.

The KNN with all features model, despite its high test accuracy of 0.99 and AUC score of 0.99, incurred a significant computational cost during prediction, with a time of 21.815 s. This is likely due to the distance computation required for all features in the original high-dimensional space. Its classification metrics (accuracy: 0.9951, precision: 0.9955, recall: 0.9942, F1 score: 0.9948) were slightly lower than RF and DT, with 116 FP and 151 FN.

3.2. Model Performance with PCA

Applying PCA as a dimensionality reduction technique resulted in varied impacts on model performance. The PCA-based RF model maintained a high test accuracy of 0.99 and an AUC score of 0.99, though its training time increased to 57.20 s, likely due to the additional computational overhead of the PCA transformation. Prediction time rose to 0.750 s, reflecting a trade-off between dimensionality reduction and inference speed. Table 3 shows an accuracy of 0.9900, precision of 0.9905, recall of 0.9885, and F1 score of 0.9895, with 246 FP and 297 FN, indicating a slight decline in classification performance compared to RF with all features.

The PCA-based LR model exhibited the most significant reduction in performance, with a test accuracy of 0.92 and an AUC score of 0.98. Its training and prediction times were notably low (0.11 s and 0.002 s, respectively), making it the most computationally efficient model in this category. However, its classification metrics (accuracy: 0.9277, precision: 0.9217, recall: 0.9265, F1 score: 0.9241) reflect a substantial increase in FP (2039) and FN (1904), suggesting that PCA may have discarded features necessary for effective fault detection with LR.

The PCA-based DT model retained a test accuracy of 0.98 and an AUC score of 0.98, with a training time of 4.68 s and a prediction time of 0.008 s. Its classification metrics (accuracy: 0.9811, precision: 0.9810, recall: 0.9792, F1 score: 0.9801) indicate a modest decline from DT with all features, with 492 FP and 538 FN, suggesting that PCA preserved most of the DT’s discriminative power.

The KNN based PCA model benefited significantly from dimensionality reduction, reducing its prediction time from 21.815 s to 9.129 s while maintaining a test accuracy of 0.99 and an AUC score of 0.99. Its classification metrics (accuracy: 0.9928, precision: 0.9927, recall: 0.9920, F1 score: 0.9924) declined only slightly compared to KNN with all features (accuracy: 0.9951), with FP rising from 116 to 189 and FN from 151 to 206, demonstrating that PCA reduced computational complexity at a marginal cost in accuracy.

3.3. Model Performance with Autoencoder

The use of an autoencoder (AE) for dimensionality reduction produced results that were generally competitive with PCA, with some notable differences. The RF based AE model achieved a test accuracy of 0.98 and an AUC score of 1.00, with a training time of 49.54 s and a prediction time of 0.379 s. Its classification metrics (accuracy: 0.9893, precision: 0.9901, recall: 0.9873, F1 score: 0.9887) indicate robust performance, though slightly below RF with all features, with 257 FP and 328 FN.

The LR based AE model showed a test accuracy of 0.94 and an AUC score of 0.98, with exceptionally low training and prediction times (0.12 s and 0.000 s as reported, respectively). However, its classification metrics (accuracy: 0.9492, precision: 0.9504, recall: 0.9422, F1 score: 0.9463) reflect a higher error rate than LR with all features, with 1273 FP and 1499 FN. These errors are nonetheless fewer than those of LR based PCA (2039 FP and 1904 FN), suggesting that the AE-derived features retained more of the information LR relies on than the PCA components did.

The DT based AE model recorded a test accuracy of 0.97 and an AUC score of 0.97, with a training time of 2.80 s and a prediction time of 0.000 s. Its classification metrics (accuracy: 0.9790, precision: 0.9784, recall: 0.9774, F1 score: 0.9779) indicate a slight decline from DT with all features, with 559 FP and 586 FN, reflecting a minor loss of discriminative capability.

The KNN based AE model achieved a test accuracy of 0.99 and an AUC score of 0.99, with a training time of 0.15 s and a prediction time of 2.200 s. Its classification metrics (accuracy: 0.9923, precision: 0.9921, recall: 0.9917, F1 score: 0.9919) were comparable to KNN based PCA, with 205 FP and 216 FN, demonstrating that the AE effectively reduced dimensionality while preserving KNN’s performance.

3.4. Comparative Analysis

Across all configurations, RF consistently outperformed other models in terms of accuracy, precision, recall, and F1 score, particularly when using the full feature set. Dimensionality reduction with PCA and AE maintained high accuracy for RF and KNN, though with minor trade-offs in classification performance and increased training times. LR exhibited the greatest sensitivity to dimensionality reduction, with significant drops in accuracy and increases in misclassifications, suggesting that it relies heavily on the original feature space. DT showed moderate resilience to both PCA and AE, with slight reductions in performance metrics.
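
The sensitivity of each classifier to dimensionality reduction can be quantified as the loss in accuracy relative to the full feature set. The sketch below does this arithmetic on the accuracy values from Table 3; the helper `accuracy_drop` is an illustrative name, not part of the original study.

```python
# Accuracy values as reported in Table 3.
accuracy = {
    "RF":  {"all": 0.9987, "pca": 0.9900, "ae": 0.9893},
    "LR":  {"all": 0.9776, "pca": 0.9277, "ae": 0.9492},
    "DT":  {"all": 0.9963, "pca": 0.9811, "ae": 0.9790},
    "KNN": {"all": 0.9951, "pca": 0.9928, "ae": 0.9923},
}


def accuracy_drop(scores):
    """Return the loss in accuracy (percentage points) for each reduced feature set."""
    return {
        name: round((scores["all"] - scores[name]) * 100, 2)
        for name in ("pca", "ae")
    }


drops = {model: accuracy_drop(scores) for model, scores in accuracy.items()}
for model, drop in drops.items():
    print(f"{model}: PCA -{drop['pca']:.2f} pp, AE -{drop['ae']:.2f} pp")

# The model with the largest single drop is the most sensitive to reduction.
most_sensitive = max(drops, key=lambda m: max(drops[m].values()))
print("Most sensitive:", most_sensitive)
```

On these figures LR loses about 4.99 percentage points with PCA and 2.84 with AE, whereas KNN loses less than 0.3 points in either case.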

In terms of efficiency, PCA and AE significantly reduced prediction times for KNN, making it a more practical choice for real-time applications despite its high computational cost in the original feature space. LR remained the fastest model across all configurations, though its lower accuracy limits its suitability for critical fault detection tasks. RF, while computationally intensive during training, offered a balanced trade-off between accuracy and inference speed, particularly with AE.
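
The reduction in KNN prediction time can be expressed as a speedup ratio. This short sketch uses only the prediction times from Table 2; the `speedup` helper is an illustrative name.

```python
# KNN prediction times in seconds, as reported in Table 2.
knn_prediction_time = {"all": 21.815, "pca": 9.129, "ae": 2.200}


def speedup(baseline, reduced):
    """Ratio of baseline to reduced prediction time (values above 1 mean faster)."""
    return baseline / reduced


for name in ("pca", "ae"):
    ratio = speedup(knn_prediction_time["all"], knn_prediction_time[name])
    print(f"KNN with {name.upper()}: {ratio:.1f}x faster prediction")
```

This corresponds to roughly a 2.4-fold speedup with PCA and a 9.9-fold speedup with AE-derived features.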

3.5. Comparison with Prior Work Using the GPVS-Faults Dataset

This study investigates the problem of accurately and efficiently detecting inverter faults in grid-connected photovoltaic systems through the application of dimensionality reduction techniques. As summarized in Table 4, recent literature utilizing the GPVS-Faults dataset has explored a range of methodologies. Ref. [35] employed a semi-supervised approach based on variational autoencoders combined with anomaly detection algorithms, achieving competitive AUC values; however, the absence of supervised classifier comparisons and limited model interpretability constrain its broader applicability. Ref. [44] developed a two-tier framework using Extra Trees with explainable artificial intelligence, which demonstrated high classification accuracy, yet did not incorporate dimensionality reduction, potentially affecting computational scalability. Ref. [45] proposed a hybrid model combining Modified Independent Component Analysis with RFs to address class imbalance, yielding high predictive accuracy, though the complexity of the ICA component may limit its suitability for real-time deployment.

In response to these limitations, the present work introduces a supervised learning framework that systematically benchmarks four classifiers (RF, DT, KNN, and LR) under both PCA-derived and autoencoder-derived feature sets. Among the evaluated models, RF achieved the highest accuracy and AUC with sub-second prediction time, demonstrating strong potential for real-time applications. Furthermore, autoencoder-derived features preserved more of the accuracy of computationally lightweight classifiers such as LR than PCA did (94.92% versus 92.77%). These results highlight the proposed method’s capacity to balance predictive accuracy, interpretability, and operational efficiency, contributing a novel comparative perspective to the existing body of work on PV inverter fault detection.

3.6. Strategies to Mitigate Prediction Errors

Despite the high classification accuracy achieved by the proposed models (e.g., RF with 99.87% accuracy), discrepancies between predicted and actual fault states, shown as FP and FN in Table 3, make it essential to adopt strategies that enhance system reliability.

First, classification thresholds can be adjusted to favor recall over precision in safety-critical environments, as illustrated by the ROC curves in Figure 9. Second, a hybrid ensemble strategy combining RF and DT predictions through majority voting can reduce misclassification by exploiting the models’ complementary strengths.
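
The two strategies above can be sketched in a few lines of plain Python. The example assumes each classifier outputs a fault probability per sample; the toy scores, the threshold values, and the tie-breaking rule (ties count as fault, to favor recall) are illustrative assumptions rather than settings from the paper.

```python
def predict_with_threshold(fault_probabilities, threshold=0.5):
    """Label a sample as fault (1) when its predicted fault probability meets the threshold."""
    return [1 if p >= threshold else 0 for p in fault_probabilities]


def recall_precision(y_true, y_pred):
    """Return (recall, precision) for binary labels where 1 denotes a fault."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


def majority_vote(*model_predictions):
    """Combine per-model 0/1 predictions; ties are resolved as fault (1) to favor recall."""
    combined = []
    for votes in zip(*model_predictions):
        combined.append(1 if 2 * sum(votes) >= len(votes) else 0)
    return combined


# Toy labels and scores (illustrative only, not from the paper).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.35, 0.4, 0.2, 0.1]

default = recall_precision(y_true, predict_with_threshold(scores, 0.5))
lowered = recall_precision(y_true, predict_with_threshold(scores, 0.3))
print("threshold 0.5 -> recall, precision:", default)
print("threshold 0.3 -> recall, precision:", lowered)

# Hypothetical RF and DT outputs for three samples.
print(majority_vote([1, 0, 1], [1, 1, 0]))
```

In this toy case, lowering the threshold from 0.5 to 0.3 raises recall from 2/3 to 1.0 while precision falls from 1.0 to 0.75, which is the trade-off a safety-critical deployment would accept. With only two voters, the tie rule decides every disagreement, so adding a third classifier may be preferable in practice.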

Furthermore, low-confidence predictions can be flagged for manual verification, while periodic model retraining on updated operational data can mitigate concept drift. Finally, integrating the ML-based detection pipeline with rule-based heuristics (e.g., real-time monitoring of critical parameters such as $V_{dc}$) provides an additional layer of validation, thereby minimizing the operational impact of classification errors.
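
One possible way to combine confidence flagging with a rule-based check on the DC-link voltage is sketched below. The voltage limits, the confidence margin, and the function name `triage` are hypothetical placeholders chosen for illustration; the paper does not specify them.

```python
def triage(fault_probability, v_dc, v_dc_limits=(350.0, 450.0), confidence_margin=0.2):
    """Combine a classifier score with a rule-based DC-link voltage check.

    v_dc_limits and confidence_margin are hypothetical placeholders,
    not values taken from the paper.
    """
    low, high = v_dc_limits
    rule_violation = not (low <= v_dc <= high)
    ml_fault = fault_probability >= 0.5
    uncertain = abs(fault_probability - 0.5) < confidence_margin

    if uncertain:
        return "manual_review"   # classifier score too close to the decision boundary
    if ml_fault:
        return "fault"           # confident fault prediction
    if rule_violation:
        return "manual_review"   # classifier says normal, but V_dc is out of range
    return "normal"


print(triage(0.95, 400.0))
print(triage(0.55, 400.0))
print(triage(0.05, 500.0))
print(triage(0.05, 400.0))
```

The rule-based layer here only escalates cases the classifier would otherwise pass as normal, so it can catch false negatives without overriding confident fault predictions.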

3.7. Strengths, Limitations, and Future Directions

This study presents a comprehensive evaluation of dimensionality reduction methods, including principal component analysis and autoencoders, combined with multiple classifiers for inverter fault detection using the GPVS-Faults dataset. One key strength of the proposed framework is its suitability for real-time deployment, supported by low inference latency and reliance on standard inverter measurements such as $V_{dc}$, $I_{pv}$, and $I_{abc}$. Additionally, the comparative benchmarking of model accuracy, training and inference time, and classification metrics provides a transparent and reproducible foundation for future studies.

However, certain limitations should be acknowledged. The current analysis focuses on binary classification of fault and non-fault states, which may not fully capture the complexity of real-world inverter behavior. Furthermore, the experimental validation is limited to a controlled dataset, and its generalizability across diverse inverter types, environmental conditions, and grid configurations remains to be explored. Although autoencoders improved model efficiency, their latent representations are less interpretable compared to more transparent methods such as DTs or feature ranking techniques.

Future research may extend this work by incorporating multi-class fault classification, validating models on field data from different PV installations, and exploring explainable ML approaches. In addition, real-time deployment on edge computing platforms such as Raspberry Pi or embedded controllers, along with adaptive retraining mechanisms to handle data drift, could enhance the robustness and applicability of the proposed method.

4. Conclusions

This study comprehensively explored the effectiveness of four ML algorithms (RF, LR, DT, and KNN) in diagnosing inverter faults using binary-labeled inverter data. The performance was assessed using three distinct feature sets: original standardized features, PCA-derived features, and AE-derived features, employing accuracy, training and inference times, and AUC scores as evaluation metrics.

RF consistently outperformed other models across all feature extraction techniques, achieving exceptional accuracy and near-perfect AUC scores. Specifically, RF with the complete original feature set reached an impressive accuracy of 99.87% and an AUC score of approximately 0.99, emphasizing its superior capacity to model complex interactions between features. However, this high performance comes at the cost of increased computational demands during training. Despite longer training durations (ranging up to 57.20 s with PCA), its inference speed remained practical for real-world applications.

LR exhibited the fastest computational performance, with minimal training times (0.11–0.81 s) and near-instantaneous inference. While LR with original features retained high accuracy (approximately 97.76%), its performance decreased notably after dimensionality reduction with PCA (accuracy dropping to 92.77%). Interestingly, AE-derived features recovered part of this loss, raising LR’s accuracy to 94.92% and demonstrating the value of autoencoder-based nonlinear transformations.

DTs provided a strong balance between computational efficiency and accuracy. DT models trained on original features maintained excellent accuracy (99.63%) and rapid inference speed, making them especially suitable for real-time prediction environments. Their performance slightly diminished when PCA or AE-derived features were used, highlighting DT’s sensitivity to transformed features.

KNN demonstrated robust accuracy (approximately 99%) but experienced substantial computational overhead during inference due to intensive distance computations. Dimensionality reduction significantly improved its inference time, particularly with autoencoder-extracted features, which dropped inference duration dramatically (from 21.82 s to 2.20 s), indicating that AE-derived features effectively preserve essential data characteristics for proximity-based classifiers.

The analysis underscores the value of dimensionality reduction techniques, particularly autoencoders, for balancing model performance with computational efficiency, which is essential for deploying predictive systems in resource-constrained environments. Ultimately, model selection should align explicitly with operational priorities: RF is optimal when accuracy is paramount; DT and KNN with AE-derived features are ideal for real-time inference; and LR with AE-derived features is recommended for applications demanding extreme computational efficiency with acceptable accuracy. This study focuses on binary fault classification of inverters; future work will extend the methodology to a multi-class fault detection framework that captures different fault types, including weather-induced anomalies, equipment failure, and failure severity. Additionally, the proposed method is computationally lightweight and relies on standard inverter measurements such as $V_{dc}$, $I_{pv}$, and $I_{abc}$, making it suitable for real-time deployment. From our analysis, the RF and DT models demonstrated inference times suitable for real-time applications. The trained model can be deployed on a single-board computer, which receives live sensor input, performs inference locally, and raises alerts based on the output, without requiring major changes to existing PV system hardware.
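
The deployment flow described above (receive a reading, standardize it, run local inference, raise an alert) can be outlined as follows. This is only a sketch under stated assumptions: the scaler statistics, the three-feature ordering, and `placeholder_predict` are hypothetical stand-ins for the fitted scaler and the trained RF or DT model, and sensor acquisition is replaced by a fixed list of readings.

```python
def standardize(sample, means, stds):
    """Apply the z-score scaling fitted on the training data."""
    return [(x - m) / s for x, m, s in zip(sample, means, stds)]


def monitor(readings, predict, means, stds, on_alert):
    """Run local inference on each incoming reading and raise an alert on a predicted fault."""
    alerts = 0
    for index, sample in enumerate(readings):
        if predict(standardize(sample, means, stds)) == 1:
            on_alert(index, sample)
            alerts += 1
    return alerts


# Hypothetical scaler statistics for (V_dc, I_pv, I_abc); real values come from training.
means = [400.0, 5.0, 3.0]
stds = [10.0, 1.0, 0.5]


def placeholder_predict(z):
    """Stand-in for a trained classifier: flags a fault when scaled V_dc falls below -2."""
    return 1 if z[0] < -2.0 else 0


# Fixed readings stand in for live sensor input.
readings = [[401.0, 5.1, 3.0], [370.0, 4.9, 3.1], [399.0, 5.0, 2.9]]
alert_log = []
n_alerts = monitor(readings, placeholder_predict, means, stds,
                   on_alert=lambda i, s: alert_log.append(i))
print(n_alerts, alert_log)
```

On a real device, `readings` would be an iterator over sensor samples and `on_alert` would trigger whatever notification channel the installation uses.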

Author Contributions

Conceptualization, S.T.; Funding acquisition, Methodology, S.T.; Investigation, S.T.; Resources, S.T.; Supervision, A.I.S.; Writing—original draft preparation, S.T.; Writing—review and editing, S.T. and A.I.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data is publicly available at [43].

Conflicts of Interest

Author Shahid Tufail was employed by the company NextNRG LLC. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Pastuszak, J.; Węgierek, P. Photovoltaic Cell Generations and Current Research Directions for Their Development. Materials 2022, 15, 5542.
2. Hacke, P.; Lokanath, S.; Williams, P.; Vasan, A.; Sochor, P.; TamizhMani, G.; Shinohara, H.; Kurtz, S. A status review of photovoltaic power conversion equipment reliability, safety, and quality assurance protocols. Renew. Sustain. Energy Rev. 2018, 82, 1097–1112.
3. Roy, S.; Tufail, S.; Tariq, M.; Sarwat, A. Photovoltaic Inverter Failure Mechanism Estimation Using Unsupervised Machine Learning and Reliability Assessment. IEEE Trans. Reliab. 2024, 73, 1418–1432.
4. Hassan, Y.B.; Orabi, M.; Gaafar, M.A. Failures causes analysis of grid-tie photovoltaic inverters based on faults signatures analysis (FCA-B-FSA). Solar Energy 2023, 262, 111831.
5. Li, T.; Tao, S.; Zhang, R.; Liu, Z.; Ma, L.; Sun, J.; Sun, Y. Reliability Evaluation of Photovoltaic System Considering Inverter Thermal Characteristics. Electronics 2021, 10, 1763.
6. Roy, S.; Tufail, S.; Riggs, H.; Tariq, M.; Sarwat, A. An Alert-Ambient Enrolled Deep Learning Model for Current Reliability Prediction of Weather Impacted Photovoltaic Inverter. In Proceedings of the 2023 IEEE Industry Applications Society Annual Meeting (IAS), Nashville, TN, USA, 29 October–2 November 2023; pp. 1–6.
7. Riggs, H.; Tufail, S.; Parvez, I.; Sarwat, A. Survey of Solid State Drives, Characteristics, Technology, and Applications. In Proceedings of the 2020 SoutheastCon, Raleigh, NC, USA, 28–29 March 2020; pp. 1–6.
8. Tufail, S.; Tariq, M.; Batool, S.; Sarwat, A. Comparative Analysis Between Feedforward Neural Network and CNN-LSTM Neural Network To Predict Household Electrical Energy Consumption. In Proceedings of the 2023 3rd International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME), Tenerife, Spain, 19–21 July 2023; pp. 1–6.
9. Sarwat, A.; McCluskey, P.; Mazumder, S.K.; Russell, M.; Roy, S.; Tufail, S.; Dharmasena, S.; Stevenson, A. Reliability Assessment of Grid Connected Solar Inverters in 1.4 MW PV Plant from Anomalous Classified Real Field Data. In Proceedings of the 2022 North American Power Symposium (NAPS), Salt Lake City, UT, USA, 9–11 October 2022; pp. 1–6.
10. Tufail, S.; Batool, S.; Sarwat, A.I. A Comparative Study Of Binary Class Logistic Regression and Shallow Neural Network For DDoS Attack Prediction. In Proceedings of the SoutheastCon 2022, Mobile, AL, USA, 26 March–3 April 2022; pp. 310–315.
11. Riggs, H.; Tufail, S.; Khan, M.; Parvez, I.; Sarwat, A.I. Detection of False Data Injection of PV Production. In Proceedings of the 2021 IEEE Green Technologies Conference (GreenTech), Denver, CO, USA, 7–9 April 2021; pp. 7–12.
12. Tufail, S.; Batool, S.; Sarwat, A.I. False Data Injection Impact Analysis In AI-Based Smart Grid. In Proceedings of the SoutheastCon 2021, Atlanta, GA, USA, 10–13 March 2021; pp. 1–7.
13. Riggs, H.; Tufail, S.; Parvez, I.; Tariq, M.; Khan, M.A.; Amir, A.; Vuda, K.V.; Sarwat, A.I. Impact, Vulnerabilities, and Mitigation Strategies for Cyber-Secure Critical Infrastructure. Sensors 2023, 23, 4060.
14. Sharma, S.; Chen, Z. A Systematic Study of Adversarial Attacks Against Network Intrusion Detection Systems. Electronics 2024, 13, 5030.
15. Oancea, B.; Simionescu, M. Gross Domestic Product Forecasting: Harnessing Machine Learning for Accurate Economic Predictions in a Univariate Setting. Electronics 2024, 13, 4918.
16. Khan, R.; Saeed, U.; Koo, I. FedLSTM: A Federated Learning Framework for Sensor Fault Detection in Wireless Sensor Networks. Electronics 2024, 13, 4907.
17. Das, S.; Gangwani, P.; Upadhyay, H. Integration of Machine Learning with Cybersecurity: Applications and Challenges. In Artificial Intelligence in Cyber Security: Theories and Applications; Bhardwaj, T., Upadhyay, H., Sharma, T.K., Fernandes, S.L., Eds.; Springer International Publishing: Cham, Switzerland, 2023; pp. 67–81.
18. Gangwani, D.; Gangwani, P. Applications of Machine Learning and Artificial Intelligence in Intelligent Transportation System: A Review. In Proceedings of the Applications of Artificial Intelligence and Machine Learning; Choudhary, A., Agrawal, A.P., Logeswaran, R., Unhelkar, B., Eds.; Springer: Singapore, 2021; pp. 203–216.
19. Yuan, W.; Wang, T.; Diallo, D. A Secondary Classification Fault Diagnosis Strategy Based on PCA-SVM for Cascaded Photovoltaic Grid-connected Inverter. In Proceedings of the IECON 2019—45th Annual Conference of the IEEE Industrial Electronics Society, Lisbon, Portugal, 14–17 October 2019; Volume 1, pp. 5986–5991.
20. Fahim, S.R.; Sarker, S.K.; Das, S.K.; Islam, M.R.; Kouzani, A.Z.; Mahmud, M.A.P. Three-Phase Inverter Faults Diagnosis Using Unsupervised Sparse Auto-Encoder. In Proceedings of the 2020 IEEE International Conference on Applied Superconductivity and Electromagnetic Devices (ASEMD), Tianjin, China, 16–18 October 2020; pp. 1–2.
21. Tufail, S.; Iqbal, H.; Tariq, M.; Sarwat, A.I. A Hybrid Machine Learning-Based Framework for Data Injection Attack Detection in Smart Grids Using PCA and Stacked Autoencoders. IEEE Access 2025, 13, 33783–33798.
22. Almalaq, A.; Albadran, S.; Mohamed, M.A. Deep Machine Learning Model-Based Cyber-Attacks Detection in Smart Power Systems. Mathematics 2022, 10, 2574.
23. Ayad, A.G.; El-Gayar, M.M.; Hikal, N.A.; Sakr, N.A. Efficient Real-Time Anomaly Detection in IoT Networks Using One-Class Autoencoder and Deep Neural Network. Electronics 2025, 14, 104.
24. Luo, C.; Zhou, Z.; Jiang, R.; Zheng, B. Attentional Convolutional Neural Network Based on Distinction Enhancement and Information Fusion for FDIA Detection in Power Systems. Electronics 2024, 13, 4862.
25. Hu, D.; Zhang, C.; Yang, T.; Chen, G. Anomaly Detection of Power Plant Equipment Using Long Short-Term Memory Based Autoencoder Neural Network. Sensors 2020, 20, 6164.
26. Malik, A.; Haque, A.; Kurukuru, V.B.; Khan, M.A.; Blaabjerg, F. Overview of fault detection approaches for grid connected photovoltaic inverters. e-Prime—Adv. Electr. Eng. Electron. Energy 2022, 2, 100035.
27. Soni, J.; Gangwani, P.; Sirigineedi, S.; Joshi, S.; Prabakar, N.; Upadhyay, H.; Kulkarni, S.A. Deep Learning Approach for Detection of Fraudulent Credit Card Transactions. In Artificial Intelligence in Cyber Security: Theories and Applications; Bhardwaj, T., Upadhyay, H., Sharma, T.K., Fernandes, S.L., Eds.; Springer International Publishing: Cham, Switzerland, 2023; pp. 125–138.
28. Fu, Y.; Ji, Y.; Meng, G.; Chen, W.; Bai, X. Three-Phase Inverter Fault Diagnosis Based on an Improved Deep Residual Network. Electronics 2023, 12, 3460.
29. Putro, M.D.; Kurnianggoro, L.; Jo, K.H. High Performance and Efficient Real-Time Face Detector on Central Processing Unit Based on Convolutional Neural Network. IEEE Trans. Ind. Inform. 2021, 17, 4449–4457.
30. Rahmani, S.; Amjady, N.; Shah, R. Application of Deep Learning Algorithms for Scenario Analysis of Renewable Energy-Integrated Power Systems: A Critical Review. Electronics 2025, 14, 2150.
31. Li, X.; Xiong, H.; Li, X.; Wu, X.; Zhang, X.; Liu, J.; Bian, J.; Dou, D. Interpretable deep learning: Interpretation, interpretability, trustworthiness, and beyond. Knowl. Inf. Syst. 2022, 64, 3197–3234.
32. Amiri, A.F.; Oudira, H.; Chouder, A.; Kichou, S. Faults detection and diagnosis of PV systems based on machine learning approach using random forest classifier. Energy Convers. Manag. 2024, 301, 118076.
33. Abubakar, A.; Jibril, M.M.; Almeida, C.F.M.; Gemignani, M.; Yahya, M.N.; Abba, S.I. A Novel Hybrid Optimization Approach for Fault Detection in Photovoltaic Arrays and Inverters Using AI and Statistical Learning Techniques: A Focus on Sustainable Environment. Processes 2023, 11, 2549.
34. Pereira, F.; Silva, C. Machine learning for monitoring and classification in inverters from solar photovoltaic energy plants. Solar Compass 2024, 9, 100066.
35. Harrou, F.; Dairi, A.; Taghezouit, B.; Khaldi, B.; Sun, Y. Automatic fault detection in grid-connected photovoltaic systems via variational autoencoder-based monitoring. Energy Convers. Manag. 2024, 314, 118665.
36. Bakdi, A.; Bounoua, W.; Guichi, A.; Mekhilef, S. Real-time fault detection in PV systems under MPPT using PMU and high-frequency multi-sensor data through online PCA-KDE-based multivariate KL divergence. Int. J. Electr. Power Energy Syst. 2021, 125, 106457.
37. Darville, J.; Runsewe, T.; Yavuz, A.; Celik, N. Machine Learning Based Simulation for Fault Detection in Microgrids. In Proceedings of the 2022 Winter Simulation Conference (WSC), Singapore, 11–14 December 2022; pp. 701–712.
38. Singh, B.P.; Goyal, S.K.; Siddiqui, S.A.; Kumar, P. A Study and Comprehensive Overview of Inverter Topologies for Grid-Connected Photovoltaic Systems (PVS). In Proceedings of the Intelligent Computing Techniques for Smart Energy Systems; Kalam, A., Niazi, K.R., Soni, A., Siddiqui, S.A., Mundra, A., Eds.; Springer: Singapore, 2020; pp. 1009–1017.
39. Kolantla, D.; Mikkili, S.; Pendem, S.R.; Desai, A.A. Critical review on various inverter topologies for PV system architectures. IET Renew. Power Gener. 2020, 14, 3418–3438.
40. Zuñiga-Reyes, M.A.; Robles-Ocampo, J.B.; Sevilla-Camacho, P.Y.; Rodríguez-Reséndiz, J.; Lastres-Danguillecourt, O.; Conde-Díaz, J.E. Photovoltaic Failure Detection Based on String-Inverter Voltage and Current Signals. IEEE Access 2021, 9, 39939–39954.
41. Gunda, T.; Hackett, S.; Kraus, L.; Downs, C.; Jones, R.; McNalley, C.; Bolen, M.; Walker, A. A Machine Learning Evaluation of Maintenance Records for Common Failure Modes in PV Inverters. IEEE Access 2020, 8, 211610–211620.
42. Vairavasundaram, I.; Varadarajan, V.; Pavankumar, P.J.; Kanagavel, R.K.; Ravi, L.; Vairavasundaram, S. A Review on Small Power Rating PV Inverter Topologies and Smart PV Inverters. Electronics 2021, 10, 1296.
43. Bakdi, A.; Guichi, A.; Mekhilef, S.; Bounoua, W. GPVS-Faults: Experimental Data for Fault Scenarios in Grid-Connected PV Systems Under MPPT and IPPT Modes. 2020. Available online: https://data.mendeley.com/datasets/n76t439f65/1 (accessed on 21 December 2023).
44. Noura, H.N.; Allal, Z.; Salman, O.; Chahine, K. Explainable artificial intelligence of tree-based algorithms for fault detection and diagnosis in grid-connected photovoltaic systems. Eng. Appl. Artif. Intell. 2025, 139, 109503.
45. Yang, N.C.; Ismail, H. Robust Intelligent Learning Algorithm Using Random Forest and Modified-Independent Component Analysis for PV Fault Detection: In Case of Imbalanced Data. IEEE Access 2022, 10, 41119–41130.

Figure 1. Solar inverter fault detection and analysis.

Figure 2. Inverter topologies and fault mechanisms in grid-connected PV systems.

Figure 3. Correlation analysis.

Figure 4. Methodology summary.

Figure 5. Feature importance plot.

Figure 6. PCA analysis.

Figure 7. Loss plot of autoencoders.

Figure 8. Confusion matrix for all models.

Figure 9. ROC curve.

Table 1. Description of features in the GPVS-Faults dataset.

Feature	Description
Time (s)	Timestamp of real measurement, sampled at $T_{s}$ = 9.9989 µs.
Ipv (A)	PV array current measurement.
Vpv (V)	PV array voltage measurement.
Vdc (V)	DC link voltage measurement between PV array and inverter.
ia (A)	Phase A current measurement at inverter output.
ib (A)	Phase B current measurement at inverter output.
ic (A)	Phase C current measurement at inverter output.
va (V)	Phase A voltage measurement.
vb (V)	Phase B voltage measurement.
vc (V)	Phase C voltage measurement.
Iabc (A)	Positive-sequence estimated current magnitude.
If (Hz)	Positive-sequence estimated current frequency.
Vabc (V)	Positive-sequence estimated voltage magnitude.
Vf (Hz)	Positive-sequence estimated voltage frequency.
Target	Inverter failure status.

Table 2. Model performance summary.

No.	Model	Training Time (s)	Prediction Time (s)	AUC Score	Train Accuracy	Val Accuracy	Test Accuracy
1	RF_All_Features	36.47	0.39	0.99	1	0.99	0.99
2	LR_All_Features	0.81	0.006	0.99	0.98	0.97	0.97
3	DT_All_Features	3.82	0.008	0.99	1	0.99	0.99
4	KNN_All_Features	0.62	21.815	0.99	0.99	0.99	0.99
5	RF_PCA	57.20	0.750	0.99	1	0.98	0.99
6	LR_PCA	0.11	0.002	0.98	0.92	0.92	0.92
7	DT_PCA	4.68	0.008	0.98	1	0.98	0.98
8	KNN_PCA	0.31	9.129	0.99	0.99	0.99	0.99
9	RF_AE	49.54	0.379	1	1	0.98	0.98
10	LR_AE	0.12	0.000	0.98	0.94	0.94	0.94
11	DT_AE	2.80	0.000	0.97	1	0.97	0.97
12	KNN_AE	0.15	2.200	0.99	0.99	0.99	0.99

Table 3. Result summary.

Model	True Positive (TP)	True Negative (TN)	False Positive (FP)	False Negative (FN)	Accuracy	Precision	Recall	F1 Score
RF_All_Features	25,860	28,614	21	51	0.9987	0.9992	0.9980	0.9986
LR_All_Features	25,152	28,631	474	759	0.9776	0.9815	0.9707	0.9761
DT_All_Features	25,811	28,234	101	100	0.9963	0.9961	0.9961	0.9961
KNN_All_Features	25,760	28,519	116	151	0.9951	0.9955	0.9942	0.9948
RF_PCA	25,614	28,289	246	297	0.9900	0.9905	0.9885	0.9895
LR_PCA	24,007	26,596	2039	1904	0.9277	0.9217	0.9265	0.9241
DT_PCA	25,373	28,143	492	538	0.9811	0.9810	0.9792	0.9801
KNN_PCA	25,705	28,446	189	206	0.9928	0.9927	0.9920	0.9924
RF_AE	25,583	28,378	257	328	0.9893	0.9901	0.9873	0.9887
LR_AE	24,412	27,362	1273	1499	0.9492	0.9504	0.9422	0.9463
DT_AE	25,325	28,076	559	586	0.9790	0.9784	0.9774	0.9779
KNN_AE	25,695	28,430	205	216	0.9923	0.9921	0.9917	0.9919

Table 4. Comparative summary of studies using the GPVS-Faults Dataset.

Study	Methodology	Dimensionality Reduction	Classifiers/Techniques	Performance Metrics	Dataset Reliability	Key Findings	Implications for Fault Detection
[35]	Semi-supervised anomaly detection using VAE with outlier detection algorithms	VAE latent space	Isolation Forest, EE, LOF, One-Class SVM	MPPT: 92.90% (VAE-EE), IPPT: 93.10% (VAE-LOF), AUC-based	GPVS-Faults, high-frequency (100 µs), realistic fault scenarios	VAE extracts latent features effectively; high AUC in both MPPT and IPPT modes	Robust for semi-supervised settings; lacks classifier benchmarking; limited interpretability
[44]	Two-tier tree-based XAI framework	None explicitly	Extra Trees (binary + multiclass)	99.5% (detection), 98.7% (diagnosis)	GPVS-Faults, multi-sensor, realistic fault scenarios	Extra Trees best among tree models; XAI improves transparency	High accuracy and interpretability, but lacks dimensionality reduction
[45]	RF with Modified ICA (RF-MICA)	Modified ICA (MICA)	RF	99.88% (Scenario 1), 99.43% (Scenario 2)	GPVS-Faults, imbalanced data addressed via SMOTE and undersampling	MICA improves accuracy; RF handles imbalance well	Excellent accuracy; MICA complexity may limit real-time use
Proposed Work	Supervised learning with dimensionality reduction	PCA (8 PCs), autoencoder (8 features)	RF, LR, DT, KNN	RF: 99.87%, AUC 0.99; KNN_PCA: 99.28%, LR_AE: 94.92%	GPVS-Faults, high-resolution (100 µs), balanced dataset	RF excels in accuracy; AE better than PCA for LR; low latency	Suitable for real-time deployment; balances accuracy and computational efficiency


© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

