1. Introduction
Photovoltaic (PV) systems have undergone exponential growth in Brazil over the past five years, driven by the increasing global demand for sustainable energy sources and the urgent need to mitigate the environmental impacts associated with fossil fuel consumption [1]. In this context, photovoltaic power plants play a critical role in the national energy matrix. Their operational efficiency and financial viability, however, are highly dependent on continuous monitoring and the resolution of technical challenges that may arise during their lifecycle [2].
In response to these challenges, technical standards have been developed to guide Operation and Maintenance (O&M) strategies. In the Brazilian context, the ABNT NBR 16.274:2014 standard highlights the importance of commissioning procedures, recommending systematic inspections and performance tests before and after system energization. Commissioning encompasses a set of strategic actions aimed at verifying compliance with installation requirements, identifying component defects, mitigating operational risks, and ensuring the proper configuration of system parameters. This process also strengthens early fault detection and prevention practices, which are essential for ensuring the reliability, safety, and long-term performance of PV plants [1].
Fault diagnosis in PV systems is conventionally performed through the analysis of current–voltage (I–V) and power–voltage (P–V) characteristics, which are fundamental for identifying anomalies in individual modules or string arrays. Critical parameters such as short-circuit current ($I_{sc}$), open-circuit voltage ($V_{oc}$), maximum power point current ($I_{max}$), maximum power point voltage ($V_{max}$), and maximum power output ($P_{max}$) are essential indicators for assessing module integrity and system performance. Deviation from expected values may signal underlying electrical or environmental faults, which, if left unaddressed, can reduce energy output, increase financial losses, and pose safety hazards such as fire risks [3].
Although numerous studies have explored fault detection in photovoltaic (PV) systems, many still fall short in effectively integrating these methods into the commissioning phase. Persistent challenges include the lack of standardized fault classification frameworks, the influence of variable environmental conditions, and the limited generalizability of proposed techniques across diverse PV plant configurations—particularly in addressing electrical, physical, and environmental faults.
Additionally, there remains a noticeable gap in the literature regarding the direct correlation between proposed fault detection methods and real-world practices employed in field analysis. During the commissioning process, I–V and P–V curve acquisition is a standard diagnostic practice for identifying anomalies. However, this procedure is often time-intensive, demands specialized technical expertise, and is susceptible to inaccuracies due to equipment variability and human handling errors.
In response to these limitations, the present study proposes the integration of advanced Condition Monitoring (CM) algorithms and multiclass fault classification techniques into the commissioning workflow of PV systems. This approach is designed to augment conventional methods—such as I–V and P–V curve analysis—by leveraging machine learning models capable of accurately identifying electrical, physical, and environmental faults based on historical operational data.
Incorporating such techniques during the commissioning phase yields several benefits: it accelerates anomaly detection, reduces diagnostic efforts and associated labor costs, and enhances system reliability from the outset.
A comprehensive review of the main contributions to the literature on fault detection in photovoltaic (PV) systems is presented below, with particular emphasis on methodologies employing machine learning techniques.
For example, ref. [1] presented a fault taxonomy focused on electrical failures occurring during commissioning and operation. Their study emphasized the importance of fault categorization for improving diagnostics and optimizing O&M planning, while highlighting the lack of standardization and the need for detailed studies that account for environmental conditions such as storms and temperature fluctuations.
In a complementary effort, ref. [4] proposed a failure modeling approach using MATLAB/Simulink to simulate scenarios involving short circuits, open circuits, and inverter malfunctions on both the AC and DC sides. Although valuable for fault characterization, the study did not extend its models to support generalization across different module types or consider early-stage fault detection strategies.
A more specific contribution was provided by [5], who developed a methodology to correct the short-circuit current ($I_{sc}$) measured during cold commissioning procedures, accounting for soiling effects. By employing the Soiling Ratio (SRatio), a comparative metric between $I_{sc}$ values from clean and dirty modules, the study improved diagnostic accuracy and mitigated false fault interpretations in plants located in diverse Brazilian climates.
Recent advancements in machine learning have introduced new possibilities for PV fault detection. For instance, ref. [6] applied Convolutional Neural Networks (CNNs) to detect and classify faults in real time, achieving a classification accuracy of 97.64% using voltage, current, temperature, and irradiance data. Although the model demonstrated high effectiveness, it also revealed limitations, including computational complexity and the requirement for large datasets to improve generalization and robustness.
In [7], the authors investigated the use of artificial neural networks for early fault detection in photovoltaic systems, employing simulated voltage, current, and irradiance data generated in the MATLAB/Simulink environment. The methodology was based on a feedforward neural network trained using backpropagation and tested across different fault types. The study demonstrated that the model was capable of rapidly identifying anomalies with good accuracy. However, the main limitation lies in the absence of cross-validation and comparative analysis with other approaches, which hinders a more comprehensive assessment of the proposed method’s relative effectiveness.
In [8], the authors proposed a fault detection and classification system for photovoltaic systems using machine learning (ML) algorithms applied to simulated data under two distinct operating modes. Among the tested models, XGBoost achieved the highest accuracy (99%) after hyperparameter tuning. The main contribution lies in the high precision attained in identifying seven different fault types. However, the study does not account for gradual degradation effects or transient conditions, which are commonly observed in real-world scenarios.
Ref. [9] proposed a fault detection and classification approach for photovoltaic systems by simulating complex failures on the direct current (DC) side, including intra-string line-to-line faults, inter-string faults, and open-circuit conditions, in a laboratory-scale experimental plant. The study employed Support Vector Machine (SVM) and XGBoost classifiers, both optimized using the Bee Algorithm (BA) and Particle Swarm Optimization (PSO). The best performance was achieved by the BA-XGBoost model, which reached an accuracy of 87.56%. Despite this accuracy, the approach required a longer computation time and faced challenges in distinguishing inter-string faults, due to their resemblance to normal operating conditions.
Finally, ref. [10] investigated the optimization of machine learning algorithms for fault detection and diagnosis in photovoltaic systems, using only inverter data under Maximum Power Point Tracking (MPPT) and Limited Power Point Tracking (LPPT) conditions. Based on 2.2 million simulated measurements, models such as Bagged Trees and Neural Networks exhibited high accuracy, with Bagged Trees achieving 92.2% and Wide Neural Networks reaching 92.0%. The proposed approach proved effective in identifying several fault types, including partial shading and Insulated Gate Bipolar Transistor (IGBT) failures, although certain anomalies, such as grid-related faults (F3), remained more challenging. Limitations included the exclusive use of controlled laboratory data, high computational cost for some models, and the lack of validation under real-world and variable climate conditions.
Moreover, few studies employ multiclass classification techniques—such as One-Versus-One (OVO) and One-Versus-Rest (OVR)—that simultaneously address electrical, physical, and environmental fault types while establishing correlations with practical applications, such as the field commissioning process.
In this context, the main contributions of this article are summarized as follows:
Comprehensive multiclass framework. We propose a data-driven methodology able to distinguish more than two DC-side fault types in photovoltaic (PV) arrays, surpassing the binary classifiers that dominate the literature.
High-accuracy machine-learning model. Using One-Versus-One (OVO) and One-Versus-Rest (OVR) strategies, the classifier achieved 100% accuracy on 704 single-string samples and 2480 three-string samples; for the single-string configuration, the lowest observed accuracy was 99.03% (OVO, 1024 samples).
Robustness under realistic conditions. The approach was validated over wide irradiance and temperature ranges and under several environmental perturbations, demonstrating resilience and field applicability.
Field-oriented dataset generation. We detail a repeatable procedure to build labelled data sets from I–V and P–V curves during commissioning. This allows operators to use the same measurements for immediate fault diagnostics and for long-term O&M analytics.
The article is structured as follows:
Section 2 introduces the main fault categories (electrical, physical, and environmental) commonly affecting PV systems on the DC side, and discusses their implications during the commissioning process. Section 3 describes the implementation of the PV system model in MATLAB/Simulink, detailing system parameters, fault injection mechanisms, and data acquisition protocols. In Section 4, the proposed fault detection and classification framework is presented, including its architecture and feature processing strategy. Section 5 discusses the simulation results and evaluates the performance of the classification approach under multiple fault conditions. Finally, conclusions and future research directions are provided in Section 6.
4. Proposed Fault Classification Method
This section presents the proposed methodology for classifying faults in photovoltaic (PV) systems. It begins with the process of database creation, including the extraction of relevant electrical features from I–V and P–V curves. It then addresses data preprocessing procedures such as normalization and labeling. Next, it outlines the multiclass classification techniques employed, with emphasis on the One-Versus-One (OVO) and One-Versus-Rest (OVR) strategies. Finally, the evaluation metrics used to validate the classification model are discussed.
4.1. Database Creation
A structured methodology was developed to extract electrical parameters from I–V and P–V curves, thereby forming the basis of the dataset. The key features extracted include the short-circuit current ($I_{sc}$), open-circuit voltage ($V_{oc}$), current at maximum power point ($I_{max}$), voltage at maximum power point ($V_{max}$), and maximum power output ($P_{max}$), as illustrated in Figure 6.
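As a minimal sketch of how these five features can be read from a sampled I–V curve (this is not the authors' MATLAB routine), the Python snippet below assumes the curve is given as two equal-length lists that include a point at zero voltage and a point at zero current. The function name `extract_features` and the toy curve values are hypothetical.

```python
def extract_features(voltage, current):
    """Return Isc, Voc, Imax, Vmax and Pmax from a sampled I-V curve.

    Assumption: the samples include (or come very close to) the V = 0 and
    I = 0 points. Measured curves would normally need interpolation.
    """
    if not voltage or len(voltage) != len(current):
        raise ValueError("voltage and current must be non-empty and equal in length")
    n = len(voltage)
    power = [v * i for v, i in zip(voltage, current)]      # P-V curve samples
    k_mpp = max(range(n), key=lambda k: power[k])          # index of maximum power
    k_sc = min(range(n), key=lambda k: abs(voltage[k]))    # sample closest to V = 0
    k_oc = min(range(n), key=lambda k: abs(current[k]))    # sample closest to I = 0
    return {
        "Isc": current[k_sc],
        "Voc": voltage[k_oc],
        "Imax": current[k_mpp],
        "Vmax": voltage[k_mpp],
        "Pmax": power[k_mpp],
    }


# Hypothetical six-point curve, for illustration only.
voltage = [0.0, 10.0, 20.0, 30.0, 35.0, 40.0]
current = [9.0, 8.9, 8.7, 8.0, 5.0, 0.0]
features = extract_features(voltage, current)
print(features)
```

Taking the maximum of the sampled power values returns the global maximum, which matters for shaded curves that show more than one local power peak.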
The extraction was performed using a custom MATLAB algorithm, depicted in the flowchart shown in Figure 7.
The algorithm first initializes key parameters such as temperature and defines the structure for storing extracted features. A for loop iterates over temperature values from 20 °C to 35 °C in 1 °C increments. Within each temperature iteration, irradiance values are varied between 700 W/m² and 1000 W/m², with step sizes of 10 W/m², 20 W/m², and 30 W/m², ensuring a rich and diverse dataset.
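The nested sweep can be sketched in a few lines of Python. The sketch assumes that each irradiance step size defines its own sweep over the full 700 to 1000 W/m² range with both end points included; the text does not state this explicitly. The names `operating_points` and `points_per_step` are hypothetical.

```python
from itertools import product

# 20 °C to 35 °C inclusive in 1 °C increments -> 16 temperature values.
TEMPERATURES_C = range(20, 36)


def operating_points(irradiance_step):
    """List every (temperature, irradiance) pair for one irradiance step size."""
    irradiances = range(700, 1001, irradiance_step)  # W/m², end point included
    return list(product(TEMPERATURES_C, irradiances))


points_per_step = {step: len(operating_points(step)) for step in (10, 20, 30)}
print(points_per_step)  # {10: 496, 20: 256, 30: 176}
```

Under this assumption, the finest step produces 496 operating points per simulated condition and the coarsest produces 176.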
At the end of the process, as outlined in the flowchart in Figure 7, an Excel-based dataset is generated containing the key features extracted from the I–V and P–V curves, as described in Figure 6, under varying irradiance and temperature levels. Additionally, the dataset includes records for different fault conditions, such as short-circuit, open-circuit, partial shading, connector fault, and normal operating conditions.
Table 3 presents an example of the dataset structure, considering a fixed temperature of 20 °C, irradiance variation in 20 W/m² increments, and the main extracted features: open-circuit voltage ($V_{oc}$), short-circuit current ($I_{sc}$), maximum power ($P_{max}$), current at maximum power point ($I_{max}$), and voltage at maximum power point ($V_{max}$). The implementation methods for all fault types are described in Section 3.3.
The entire feature extraction process for the dataset was performed on an Acer Aspire 5 notebook, equipped with a 10th-generation Intel Core i5 processor. MATLAB was used to execute the script described in Figure 7, with support from the Simulink environment. The simulated configurations for one and three PV strings are represented in Figure 3 and Figure 4, respectively.
Table 4 presents the total processing time required for dataset generation, using as examples the datasets with 1984 and 2480 samples for the single-string and three-string configurations, respectively.
4.2. Data Preprocessing
The complete dataset comprised 3712 samples for the single-string configuration and 4640 samples for the three-string configuration. Specifically, the single-string model used subsets of 704, 1024, and 1984 samples, while the three-string model used 880, 1280, and 2480 samples. In each case, the data were randomly partitioned into training (70%) and testing (30%) subsets.
The purpose of this data partitioning approach was to evaluate model performance across different dataset sizes, assess the model’s generalization capability under conditions that approximate real-world scenarios, and verify model stability—that is, whether the model delivers consistent results even with varying data volumes.
Each sample was labeled based on the operational conditions simulated in the PV circuit. In the single-string system, the following labels were assigned: 0 (normal), 1 (shading), 2 (short circuit), and 3 (connector failure). For the three-string configuration, an additional label 4 was used to represent open-circuit faults.
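The subset sizes and the 70/30 partition can be checked with a short Python sketch. It rests on an inference from the reported numbers rather than on anything the text states: that every (temperature, irradiance) operating point is simulated once per label, with one irradiance step size (30, 20, or 10 W/m²) per subset. The helper names and the random seed are hypothetical.

```python
import random

TEMPERATURES_C = range(20, 36)  # 16 values, as in the sweep of Section 4.1
LABELS_SINGLE = {0: "normal", 1: "shading", 2: "short circuit", 3: "connector failure"}
LABELS_THREE = {**LABELS_SINGLE, 4: "open circuit"}


def subset_size(irradiance_step, labels):
    """Samples in one subset, assuming one sample per operating point and label."""
    n_points = len(TEMPERATURES_C) * len(range(700, 1001, irradiance_step))
    return n_points * len(labels)


def split_indices(n_samples, train_fraction=0.7, seed=0):
    """Random, non-stratified partition of sample indices (the seed is arbitrary)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(train_fraction * n_samples)
    return indices[:n_train], indices[n_train:]


single = [subset_size(step, LABELS_SINGLE) for step in (30, 20, 10)]
three = [subset_size(step, LABELS_THREE) for step in (30, 20, 10)]
print(single, sum(single))  # [704, 1024, 1984] 3712
print(three, sum(three))    # [880, 1280, 2480] 4640

train_idx, test_idx = split_indices(single[0])
print(len(train_idx), len(test_idx))  # 493 211
```

The computed sizes match the subset sizes and totals quoted above, which supports the assumed reading but does not confirm it. Whether the original split was stratified by label is not stated.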
All features were normalized using the MinMaxScaler technique to rescale the feature values to a [0, 1] range:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
where $x$ represents the original value, and $x_{\min}$ and $x_{\max}$ denote the minimum and maximum values of that feature within the dataset, respectively. This procedure prevented variables with different scales from disproportionately affecting the model.
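A minimal pure-Python sketch of this rescaling for a single feature column is shown below. The function name `min_max_scale` and the sample values are hypothetical, and mapping a constant column to zeros is a choice made here to avoid division by zero.

```python
def min_max_scale(column):
    """Rescale one feature column to the [0, 1] range."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant feature: there is no spread to rescale, so return zeros.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical short-circuit current values in amperes.
isc_values = [6.0, 7.5, 8.25, 9.0]
scaled = min_max_scale(isc_values)
print(scaled)  # [0.0, 0.5, 0.75, 1.0]
```

In a train/test workflow, the minimum and maximum are commonly computed on the training subset only and then reused for the test subset; the text does not say which convention was followed.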
4.3. Multiclass Classification
The classification problem was addressed using two widely adopted multiclass strategies, One-Versus-Rest (OVR) and One-Versus-One (OVO), which were originally applied to multiclass classification with Support Vector Machines (SVMs) by [14] and discussed in [15].
Both techniques were implemented using the gradient-boosting algorithm developed by [16], due to its robustness and high accuracy in supervised learning tasks.
4.3.1. OVR Multiclass Technique
In the OVR strategy, a separate binary classifier is trained for each class to distinguish it from all other classes. For a three-class problem ($C_1$, $C_2$, $C_3$), the following classifiers are trained:
$f_1$: Classifies $C_1$ vs. $C_2$ and $C_3$;
$f_2$: Classifies $C_2$ vs. $C_1$ and $C_3$;
$f_3$: Classifies $C_3$ vs. $C_1$ and $C_2$.
Each classifier computes a score:
$$f_k(x) = \sum_{m=1}^{M} \eta \, h_m^{(k)}(x)$$
where $M$ is the total number of decision trees, $\eta$ is the learning rate, and $h_m^{(k)}(x)$ is the output of the $m$-th tree in classifier $f_k$.
Prediction Scoring: Each classifier returns a confidence score for its prediction: $f_1(x)$; $f_2(x)$; $f_3(x)$.
From the obtained scores, the probability of the input $x$ belonging to a specific class $k$ is determined using the following equation:
$$P(y = k \mid x) = \frac{1}{1 + e^{-f_k(x)}}$$
A logistic (sigmoid) function is used to transform the score into a probability for class $k$.
The final predicted class is the one associated with the highest probability, determined by the following equation:
$$\hat{y} = \arg\max_{k} P(y = k \mid x)$$
where $P(y = k \mid x)$ denotes the predicted probability of the input sample $x$ belonging to class $k$.
Thus, $x$ is assigned to the class whose classifier returns the highest score, since the sigmoid is monotonic and the highest score also gives the highest probability. In this study, the gradient-boosting algorithm was used for each subproblem generated by the OVR approach due to its high efficiency in supervised learning.
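The OVR decision rule can be illustrated with a short Python sketch that takes the raw scores of the per-class binary classifiers as given. The score values are hypothetical and are not taken from the study; the labels follow the single-string coding of Section 4.2.

```python
import math


def sigmoid(score):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-score))


def ovr_predict(scores):
    """Pick the class whose one-versus-rest classifier gives the highest probability.

    `scores` maps each class label to the raw score of its binary classifier.
    """
    probabilities = {label: sigmoid(s) for label, s in scores.items()}
    predicted = max(probabilities, key=probabilities.get)
    return predicted, probabilities


# Hypothetical raw scores: 0 normal, 1 shading, 2 short circuit, 3 connector failure.
scores = {0: -1.2, 1: 2.3, 2: 0.4, 3: -0.7}
predicted, probabilities = ovr_predict(scores)
print(predicted)  # 1
```

Because each probability comes from an independent binary classifier, the values need not sum to one across classes; only their ranking is used for the final decision.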
4.3.2. OVO Multiclass Technique
In the OVO strategy, a binary classifier is trained for each unique pair of classes. For three classes ($C_1$, $C_2$, $C_3$), the following classifiers are trained:
Trained Classifiers:
$f_{12}$: Classifies between $C_1$ and $C_2$.
$f_{13}$: Classifies between $C_1$ and $C_3$.
$f_{23}$: Classifies between $C_2$ and $C_3$.
Each classifier is trained using the gradient-boosting algorithm, which uses multiple decision trees to produce a prediction score. The score for each binary classifier $f_{ij}$ is computed as follows:
$$f_{ij}(x) = \sum_{m=1}^{M} \eta \, h_m^{(ij)}(x)$$
where
$M$ is the total number of decision trees;
$\eta$ is the learning rate;
$h_m^{(ij)}(x)$ is the output of the $m$-th tree in classifier $f_{ij}$.
The raw score is transformed into a probability using the sigmoid function:
$$P_{ij}(x) = \frac{1}{1 + e^{-f_{ij}(x)}}$$
The class prediction is then made as follows:
$$\hat{y}_{ij} = \begin{cases} C_i, & P_{ij}(x) \geq 0.5 \\ C_j, & \text{otherwise} \end{cases}$$
Each classifier votes for one class. The number of votes received by class $C_k$ is
$$V_k = \sum_{i<j} \mathbb{1}\left(\hat{y}_{ij} = C_k\right)$$
where $\mathbb{1}(\cdot)$ is the indicator function, which returns 1 if the condition is true, and 0 otherwise.
The final predicted class is the one with the most votes:
$$\hat{y} = \arg\max_{k} V_k$$
With three pairwise classifiers, three votes are cast in total, and the class that receives the highest number of votes among all classifiers is selected as the final predicted class. Similarly to the OVR approach, the gradient-boosting algorithm was used to solve each binary subproblem, ensuring model robustness.
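The OVO voting scheme can be sketched in Python as follows. The tree outputs, the learning rate of 0.1, the 0.5 decision threshold, and the tie-breaking rule (lowest label wins) are all assumptions made for illustration; positive scores are taken to favour the first class of each pair.

```python
import math
from itertools import combinations


def boosted_score(tree_outputs, learning_rate):
    """Raw score: learning rate times the sum of the individual tree outputs."""
    return learning_rate * sum(tree_outputs)


def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))


def ovo_predict(classes, pair_tree_outputs, learning_rate=0.1):
    """Majority vote over all pairwise classifiers.

    `pair_tree_outputs[(i, j)]` holds the tree outputs of the classifier that
    separates class i from class j.
    """
    votes = {c: 0 for c in classes}
    for i, j in combinations(classes, 2):
        p_i = sigmoid(boosted_score(pair_tree_outputs[(i, j)], learning_rate))
        votes[i if p_i >= 0.5 else j] += 1
    # max() keeps the first maximum, so ties go to the earliest class listed.
    predicted = max(classes, key=lambda c: votes[c])
    return predicted, votes


# Hypothetical tree outputs for a three-class problem.
classes = [0, 1, 2]
pair_tree_outputs = {
    (0, 1): [-0.5, -0.8, -0.2],
    (0, 2): [0.6, 0.3, 0.1],
    (1, 2): [0.9, 0.4, 0.2],
}
predicted, votes = ovo_predict(classes, pair_tree_outputs)
print(predicted, votes)  # 1 {0: 1, 1: 2, 2: 0}
```

For K classes this scheme trains K(K-1)/2 classifiers, so the four-label single-string case needs six and the five-label three-string case needs ten.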
4.4. Evaluation Metrics
The model’s performance was evaluated using four metrics: accuracy, precision, sensitivity (recall), and the confusion matrix.
4.4.1. Accuracy
Accuracy measures the proportion of correct predictions relative to the total number of predictions made:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where:
TP (True Positives): Correctly classified positive instances.
TN (True Negatives): Correctly classified negative instances.
FP (False Positives): Negative instances incorrectly classified as positive.
FN (False Negatives): Positive instances incorrectly classified as negative.
4.4.2. Precision
Precision quantifies the proportion of correctly predicted positive instances relative to the total number of positive predictions made by the model:
$$\text{Precision} = \frac{TP}{TP + FP}$$
4.4.3. Sensitivity
Sensitivity, also known as recall, measures the proportion of actual positive instances that are correctly identified:
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
4.4.4. Confusion Matrix
The confusion matrix (Table 6) provides an overview of the model’s performance by comparing correct and incorrect predictions for each class.
These metrics offer a comprehensive evaluation of the model’s performance, serving as essential indicators for assessing its effectiveness, particularly in scenarios with imbalanced class distributions.
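As an illustration of how these metrics follow from a multiclass confusion matrix, the Python sketch below treats each class in turn as the positive class and all others as negative. Rows are assumed to hold true labels and columns predicted labels. The matrix values are hypothetical and do not reproduce the results of this study.

```python
def per_class_metrics(matrix):
    """Overall accuracy plus per-class precision and recall.

    `matrix[t][p]` counts samples with true label t and predicted label p.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[k][k] for k in range(n)) / total
    precision, recall = [], []
    for k in range(n):
        tp = matrix[k][k]
        fp = sum(matrix[t][k] for t in range(n)) - tp  # predicted k, true label differs
        fn = sum(matrix[k]) - tp                       # true k, predicted label differs
        precision.append(tp / (tp + fp) if tp + fp else 0.0)
        recall.append(tp / (tp + fn) if tp + fn else 0.0)
    return accuracy, precision, recall


# Hypothetical four-class confusion matrix with 50 samples per class.
confusion = [
    [50, 0, 0, 0],
    [1, 48, 1, 0],
    [0, 0, 50, 0],
    [0, 2, 0, 48],
]
accuracy, precision, recall = per_class_metrics(confusion)
print(accuracy)   # overall fraction of correct predictions
print(precision)  # one value per class
print(recall)     # one value per class
```

In the multiclass case the overall accuracy is the sum of the diagonal divided by the total number of samples, while precision and recall are reported per class.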
6. Conclusions
This study analyzed and classified faults in a photovoltaic power plant using the MATLAB/Simulink environment, focusing on the direct current (DC) side. The implementation of robust fault detection and classification methodologies proved essential for optimizing the commissioning, operation, and maintenance (O&M) processes, thereby minimizing technical and economic impacts.
The analysis considered the characteristics of an 11.76 kWp photovoltaic power plant located in Pesqueira, Pernambuco, under different environmental conditions. The I–V and P–V curve results revealed distinct patterns for each type of fault, indicating that short-circuit faults are the most severe, while connector faults, although initially less impactful, can evolve into critical issues if not detected early.
Shading faults resulted in multiple Maximum Power Points, complicating system operation, while open-circuit faults caused significant reductions in current and power output. The classification tests, considering three data subsets, achieved average accuracies exceeding 99.83% and 99.62% with the One-Versus-Rest (OVR) and One-Versus-One (OVO) techniques, respectively, using a single string. For three strings, the results were 99.44% (OVR) and 98.94% (OVO), validating the effectiveness of the proposed method and demonstrating a low error rate across the evaluation metrics.
Unlike previous studies, this work emphasizes the importance of integrating fault detection and classification systems during the commissioning phase, incorporating environmental variability and expanding the practical applicability of multiclass machine learning techniques in the photovoltaic sector.
For future work, it is recommended to validate the methodology using real-world data from operational systems, implement it in the field to assess its robustness under actual conditions, and develop solutions for real-time fault detection, enabling faster and more efficient preventive responses in the context of solar power plant operation and maintenance.