Article

Euclidean Distance-Based Tree Algorithm for Fault Detection and Diagnosis in Photovoltaic Systems

1 Laboratoire des Systèmes Electriques et Télécommande, Faculté de Technologie, Université Blida 1, BP 270, Blida 09000, Algeria
2 Electrical Engineering Laboratory (LGE), University Mohamed Boudiaf of M’sila, BP 166, M’sila 28000, Algeria
3 Department of Electronic Engineering, Universitat Politècnica de Catalunya, 08034 Barcelona, Spain
* Author to whom correspondence should be addressed.
Energies 2025, 18(7), 1773; https://doi.org/10.3390/en18071773
Submission received: 17 February 2025 / Revised: 18 March 2025 / Accepted: 27 March 2025 / Published: 1 April 2025
(This article belongs to the Section A2: Solar Energy and Photovoltaic Systems)

Abstract
In this paper, a new methodology for fault detection and diagnosis in photovoltaic systems is proposed. This method employs a novel Euclidean distance-based tree algorithm to classify the considered faults. Unlike the decision tree, which requires the Gini index to split the data, this algorithm relies mainly on computing distances between an arbitrary point in the space and the entire dataset. Then, the minimum and maximum distances of each class are extracted and arranged in ascending order. The proposed methodology requires four attributes: solar irradiance, temperature, and the coordinates of the maximum power point (Impp, Vmpp). The developed procedure for fault detection and diagnosis is implemented and applied to classify a dataset comprising seven distinct classes: normal operation, string disconnection, short circuit of three modules, short circuit of ten modules, and three cases of string disconnection with 25%, 50%, and 75% partial shading. The obtained results demonstrate the high efficiency and effectiveness of the proposed methodology, with a classification accuracy reaching 97.33%. A comparative study between the developed fault detection and diagnosis methodology and the Support Vector Machine, Decision Tree, Random Forest, and K-Nearest Neighbors algorithms is conducted. The proposed procedure shows high performance against the other algorithms in terms of accuracy, precision, recall, and F1-score.

1. Introduction

The worldwide demand for electrical energy continues to increase, and the governments of different nations must face several challenges to respond to this demand effectively. The first challenge is to provide energy for the growing proportion of the world’s population [1,2,3], while the second lies in producing this energy without causing environmental pollution or climatic problems such as global warming [4,5,6].
One way to reduce greenhouse gas emissions is to use renewable energy, such as wind and solar energy. The wind and the sun provide virtually unlimited amounts of energy without generating greenhouse gases, unlike fossil-fuel-burning electric power stations. Solar energy allows electricity to be produced by photovoltaic panels or solar thermal power stations, thanks to the sunlight captured by solar panels. Solar energy is clean, does not emit any greenhouse gases, and its source, the sun, is free, inexhaustible, and available everywhere in the world. Several countries around the globe are already at the forefront of renewable energy technologies and generate a large part of their electricity from photovoltaic systems (PVSs).
Like any other industrial process, a PVS can be subjected, during its operation, to various faults and anomalies, leading to a drop in the performance of the system and even to its total unavailability. These faults insidiously reduce the productivity of the installation [7] and generate additional maintenance costs to restore the system to normal conditions; hence the importance of having a system to detect and diagnose faults in a photovoltaic installation, which contributes to raising production efficiency and reducing maintenance time and cost [8].
Many research contributions have been made over the past decades to developing methods and algorithms for detecting and diagnosing faults in PV systems [9,10,11]. According to references [12,13], these algorithms can be classified into three distinct categories, whereas in reference [14], they are grouped into six categories. In this brief review, the first categorization is adopted.
The first category encompasses all algorithms that use mathematical analysis and signal processing. Methods within this category heavily rely on the information extracted solely from the I–V characteristic, whether it pertains to a photovoltaic module (PVM), a PV string, or a photovoltaic array (PVA). Time Domain Reflectometry (TDR) is a technique utilized for identifying faulty photovoltaic modules within a photovoltaic array [15,16]. It has been employed to detect open circuits in grid-connected photovoltaic systems (GCPV) [17,18]. Earth Capacitance Measurement (ECM) has been added to the TDR to identify the PV module disconnected from the PV string [18]. Reference [18] has demonstrated the applicability of the ECM algorithm in PV strings made of both silicon and amorphous silicon.
The second category comprises several algorithms characterized by two main phases: the detection phase, utilizing a PV model, and the subsequent diagnosis phase, employing various methods, such as artificial intelligence [19,20,21,22]. These algorithms detect faults by comparing the measured values extracted from the considered PV generator with the simulated values from the PV model. The residual signal derived from this comparison can be utilized to detect degradation faults [23] as well as various cases of line-to-line faults [24].
The third category encompasses artificial intelligence and machine learning algorithms, including support vector machine (SVM) [25,26,27,28,29], decision tree (DT) [7,30], random forest (RF) [31,32,33], K-nearest neighbors (KNN) [34,35,36], and artificial neural network (ANN) [37,38,39] algorithms. Reference [25] provides a comparison of efficiency and execution time among various multiclass strategies utilizing SVM, such as one vs. all (OVA), adaptive directed acyclic graph (ADAG), and decision-directed acyclic graph (DDAG). The goal of SVM classification is to categorize data into four classes: module short circuit, inverse bypass diode module, shunted bypass module, and shadowing effect in a module. The OVA strategy demonstrated significant superiority over the others in terms of efficiency, achieving an 88.33% accuracy rate. In reference [26], the SVM algorithm was employed to detect series faults of 10%, 50%, 70%, and 90% under sunny, cloudy, and rainy weather conditions. The recorded accuracies were 88.3%, 91.5%, and 75.3%, respectively. In reference [27], both the CPA and SVM algorithms were utilized to identify four operating states: normal, open circuit, short circuit, and partial shading. The authors concluded that with k = 6 (the number of dimensions), the algorithm achieved an accuracy rate of 100%. In reference [28], the authors used the SVM algorithm to detect faults such as open circuit, short circuit, and lack of solar radiation. The algorithm requires four inputs: the short-circuit current Isc, the open-circuit voltage Voc, and the coordinates of the maximum power point, Impp and Vmpp. The algorithm’s efficiency and accuracy were enhanced by employing k-fold cross-validation. The drawback of the mentioned algorithms is that their authors relied solely on accuracy as the evaluation criterion, whereas employing additional metrics such as precision and recall could offer a more comprehensive assessment of the algorithms.
In [7], a novel approach based on the DT algorithm is presented. This approach comprises two models: the first model detects faults, while the second model diagnoses four different fault types: short-circuit, string, line-to-line, and fault-free states. The accuracy rates for the first and second models are 99.86% and 99.80%, respectively. Notably, although a confusion matrix was calculated, the precision and recall metrics were not evaluated. The utilization of the random forest algorithm for fault detection and classification in PV systems is highlighted in [31]. The method introduced in this study requires the current from each string in the PV array, along with the PV array voltage, as features. The algorithm successfully detects and diagnoses four different faults: degradation, partial shading, line-to-line, and short-circuit faults. The authors employed the grid-search method to optimize the random forest parameters. To evaluate this method, experimental and simulation samples were used. The accuracy of this method reached 99%. Another method was developed based on RF to detect and diagnose faults in photovoltaic systems [32]. A set of criteria was used to evaluate the method, namely computation time, accuracy, and F1-score. In [34], a modified KNN algorithm was proposed and applied to photovoltaic systems for fault detection and diagnosis. The main modification made by the researchers in this work is to facilitate the selection of the appropriate K value in addition to the distance function. This modification greatly contributed to increasing the classification speed. Moreover, in [35], an interesting technique that uses the KNN algorithm was developed to detect multiple faults, including line-to-line and partial shading faults. Remarkably, this method relies only on data from the datasheet and achieves an accuracy rate of 99%. Another model was developed based on the combination of KNN and the Exponentially Weighted Moving Average (EWMA) [36].
The KNN aims to detect faults on the DC side of the PV system, while the EWMA works to diagnose those faults. Researchers in [37] built a two-stage classifier: the first stage was a model of the PV system to detect faults, while the second stage was devoted to diagnosis, in which two artificial neural networks were used to identify eight different faults. In [38], another approach based on artificial neural networks for fault detection and diagnosis in PV systems was introduced. The authors utilized an ANN with a radial basis function (RBF) architecture, relying on two features: generated power and solar irradiance. The achieved accuracy in this study was 97.9% for the 2.2 kW PV system and 97% for the 4.16 kW PV system.
In this work, a novel fault detection and diagnosis algorithm is developed for the DC side of PV systems. This methodology is based on an innovative tree algorithm that relies mainly on calculating Euclidean distances to effectively detect faults when they occur. First, the algorithm classifies the data into two classes: all the distances between a random point in space and the entire dataset are calculated, and then, in each class, the minimum and maximum distances are extracted. After that, all the distances are arranged in ascending order, revealing one of five possible cases. Based on the identified case, the data are classified. The algorithm needs four features to function properly: solar irradiance, temperature, and the current and voltage at the maximum power point. This algorithm has been applied to seven different classes: normal operation, string disconnection, short circuit of 3 and of 10 modules, and three other classes of string disconnection with partial shading of 25%, 50%, and 75%. The efficiency and effectiveness of the methodology are clearly demonstrated by the accuracy rate achieved, which exceeds 97%. To further evaluate the algorithm, a comparative study was conducted between the proposed algorithm and several well-known algorithms (support vector machine, K-nearest neighbors, decision tree, and random forest). The comparison results show a clear superiority of the developed algorithm in terms of accuracy, precision, recall, and F1-score.
This paper is organized as follows: Section 2 is dedicated to introducing the developed algorithm, while Section 3 explains the database used and its different categories. Section 4 reveals the classification strategy followed in this work. In Section 5, the results obtained are presented and discussed. The last section presents a summary of the work carried out in this research paper.

2. Proposed Euclidean-Based Decision Tree Classification Algorithm

Despite the similarities between the proposed algorithm and the decision trees in their data splitting approach, the key distinction lies in using the Euclidean distance for partitioning data instead of the Gini index.
Initially, a training dataset comprising values of N features for each of the two classes (class 0 and class 1) is created. Then, the following steps are performed:
(a)
Choose an arbitrary point $(x_1, x_2, \ldots, x_N)$ in an N-dimensional space.
(b)
Using Equations (1) and (2), compute the Euclidean distances between the chosen point and all samples within the training dataset for each respective class:
$dist_i^0 = \sqrt{(x_1 - x_{i1}^0)^2 + (x_2 - x_{i2}^0)^2 + \cdots + (x_N - x_{iN}^0)^2}, \quad i = 1, 2, \ldots, n$ (1)
$dist_i^1 = \sqrt{(x_1 - x_{i1}^1)^2 + (x_2 - x_{i2}^1)^2 + \cdots + (x_N - x_{iN}^1)^2}, \quad i = 1, 2, \ldots, m$ (2)
where $(x_{i1}^0, x_{i2}^0, \ldots, x_{iN}^0)$ and $(x_{i1}^1, x_{i2}^1, \ldots, x_{iN}^1)$ represent the ith samples of class 0 and class 1, respectively; n denotes the number of samples in class 0, while m denotes the number of samples in class 1. At this point, two vectors are obtained: dist0 and dist1. dist0 contains the distances computed between the arbitrary point and all training data belonging to class 0, while dist1 contains the distances computed between the arbitrary point and all training data belonging to class 1.
(c)
Determine the minimum and maximum distances for each class:
$min^0 = \min_{i=1,2,\ldots,n}(dist_i^0)$
$max^0 = \max_{i=1,2,\ldots,n}(dist_i^0)$
$min^1 = \min_{i=1,2,\ldots,m}(dist_i^1)$
$max^1 = \max_{i=1,2,\ldots,m}(dist_i^1)$
The minimum and maximum values (min0, max0, min1, max1), together with the coordinates of the arbitrary point, constitute the algorithm parameters needed in the testing phase.
(d)
Merge the two vectors dist0 and dist1 into one vector, and arrange it in ascending order. One of the following five cases may then arise:
  • Case 1: min0 < min1 < max0 < max1
    Figure 1 shows a graphical representation of the first case.
    - Training samples having distances within the interval [min0, min1[ belong to class 0 (pure data in class 0).
    - Training samples having distances within the interval ]max0, max1] belong to class 1 (pure data in class 1).
    - Training samples having distances within the interval [min1, max0] cannot be classified; therefore, another random point must be chosen for their classification.
  • Case 2: min1 < min0 < max1 < max0
    Figure 2 shows a graphical representation of the second case.
    - Training samples having distances within the interval [min1, min0[ belong to class 1 (pure data in class 1).
    - Training samples having distances within the interval ]max1, max0] belong to class 0 (pure data in class 0).
    - Training samples having distances within the interval [min0, max1] cannot be classified; therefore, another random point must be chosen for their classification.
  • Case 3: min0 < min1 < max1 < max0
    Figure 3 shows a graphical representation of the third case.
    - Training samples having distances within the interval [min0, min1[ or ]max1, max0] belong to class 0.
    - Training samples having distances within the interval [min1, max1] cannot be classified; therefore, another random point must be chosen for their classification.
  • Case 4: min1 < min0 < max0 < max1
    Figure 4 shows a graphical representation of the fourth case.
    - Training samples having distances within the interval [min1, min0[ or ]max0, max1] belong to class 1.
    - Training samples having distances within the interval [min0, max0] cannot be classified; therefore, another random point must be chosen for their classification.
  • Case 5: min0 < max0 < min1 < max1 or min1 < max1 < min0 < max0
    Figure 5 shows a graphical representation of the fifth case.
    - Training samples having distances within the interval [min0, max0] belong to class 0.
    - Training samples having distances within the interval [min1, max1] belong to class 1.
(e)
If the case that occurred in the previous step is case 1, 2, 3, or 4:
- Choose another random point $(x_1, x_2, \ldots, x_N)$.
- Using Equations (1) and (2), compute the Euclidean distances between the chosen point and the unclassified samples within the training dataset for each respective class.
- Go to step (c).
(f)
The algorithm iterates through steps (c) to (e) until all data are classified (case 5) or the stopping criterion is met. It employs early stopping as its stopping criterion to effectively mitigate overfitting without compromising the accuracy of the algorithm [40,41,42].
To address the overfitting issue, the difference between the test accuracy and the training accuracy is calculated. This difference should remain small (less than 3%, for example); if it exceeds this threshold, the training process is halted.
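As an illustration, steps (a) through (d) and the identification of the five cases can be sketched in Python. This is an illustrative reading of the procedure, not the authors' implementation; the function and variable names are our own:

```python
import numpy as np

def case_of(point, X0, X1):
    """Perform steps (a)-(d) for one arbitrary point: compute the two
    distance vectors (Equations (1) and (2)), extract (min0, max0,
    min1, max1), and identify which of the five orderings holds."""
    d0 = np.linalg.norm(X0 - point, axis=1)   # distances to class-0 samples
    d1 = np.linalg.norm(X1 - point, axis=1)   # distances to class-1 samples
    min0, max0 = d0.min(), d0.max()
    min1, max1 = d1.min(), d1.max()
    if max0 < min1 or max1 < min0:
        case = 5                  # ranges disjoint: all samples classified
    elif min0 < min1 and max0 < max1:
        case = 1                  # min0 < min1 < max0 < max1
    elif min1 < min0 and max1 < max0:
        case = 2                  # min1 < min0 < max1 < max0
    elif min0 < min1:
        case = 3                  # class-1 range nested in class-0 range
    else:
        case = 4                  # class-0 range nested in class-1 range
    return case, (min0, max0, min1, max1)
```

Here `X0` and `X1` are the class-0 and class-1 training matrices (one sample per row), and the returned tuple holds the node parameters kept for the testing phase.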
Figure 6 provides a graphical illustration of the proposed algorithm depicting a given possible situation.
The flowchart of the algorithm is given in Figure 7.
Algorithm 1 presents the pseudo-code of the proposed algorithm.
Algorithm 1. Pseudo-code of the proposed algorithm
STEP (a): Generate a random point.
STEP (b): Using Equations (1) and (2), calculate the distances dist_i^0 and dist_j^1 (i = 1, 2, ..., n; j = 1, 2, ..., m).
STEP (c): Find min0, min1, max0, max1, the minimal and maximal distances of each class.
STEP (d): Store the computed distances in a vector named dist and sort it in ascending order.
STEP (e):
j = 1 (the counter for unclassified data)
If min0 < min1 < max0 < max1                              (case 1)
    For i = 1 to k (k = n + m)
        If min0 ≤ dist(i) < min1
            The point associated with dist(i) belongs to class 0
        Elseif max0 < dist(i) ≤ max1
            The point associated with dist(i) belongs to class 1
        Else (min1 ≤ dist(i) ≤ max0)
            Unclassified(j,:) = trainingset(i,:)
            Increment j
        End if
    End for
    trainingset = Unclassified
    If the stopping criterion is not met
        Choose a new arbitrary point
        Calculate the distances dist_i^0 and dist_j^1 for the unclassified data
        Go to step (c)
    Else
        Go to step (f)
    End if
Elseif min1 < min0 < max1 < max0                          (case 2)
    For i = 1 to k
        If dist(i) < min0
            The point associated with dist(i) belongs to class 1
        Elseif max1 < dist(i)
            The point associated with dist(i) belongs to class 0
        Else (min0 ≤ dist(i) ≤ max1)
            Unclassified(j,:) = trainingset(i,:)
            Increment j
        End if
    End for
    trainingset = Unclassified
    If the stopping criterion is not met
        Choose a new arbitrary point
        Calculate the distances dist_i^0 and dist_j^1 for the unclassified data
        Go to step (c)
    Else
        Go to step (f)
    End if
Elseif min0 < min1 < max1 < max0                          (case 3)
    For i = 1 to k
        If dist(i) < min1 or max1 < dist(i)
            The point associated with dist(i) belongs to class 0
        Else (min1 ≤ dist(i) ≤ max1)
            Unclassified(j,:) = trainingset(i,:)
            Increment j
        End if
    End for
    trainingset = Unclassified
    If the stopping criterion is not met
        Choose a new arbitrary point
        Calculate the distances dist_i^0 and dist_j^1 for the unclassified data
        Go to step (c)
    Else
        Go to step (f)
    End if
Elseif min1 < min0 < max0 < max1                          (case 4)
    For i = 1 to k
        If dist(i) < min0 or max0 < dist(i)
            The point associated with dist(i) belongs to class 1
        Else (min0 ≤ dist(i) ≤ max0)
            Unclassified(j,:) = trainingset(i,:)
            Increment j
        End if
    End for
    trainingset = Unclassified
    If the stopping criterion is not met
        Choose a new arbitrary point
        Calculate the distances dist_i^0 and dist_j^1 for the unclassified data
        Go to step (c)
    Else
        Go to step (f)
    End if
Else ((min0 < max0 < min1 < max1) or (min1 < max1 < min0 < max0))   (case 5)
    For i = 1 to k
        If min0 ≤ dist(i) ≤ max0
            The point associated with dist(i) belongs to class 0
        Elseif min1 ≤ dist(i) ≤ max1
            The point associated with dist(i) belongs to class 1
        End if
    End for
    Go to step (f)
End if
STEP (f): End (all data are classified or the stopping criterion is met).
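As a complement to the pseudo-code, the training loop can be sketched in Python. This is an illustrative reading of Algorithm 1, not the authors' code; it exploits the fact that in cases 1 to 4 the unclassified interval is always [max(min0, min1), min(max0, max1)], which allows a single compact branch:

```python
import numpy as np

def train_tree(X0, X1, max_iter=50, seed=0):
    """Sketch of the training loop in Algorithm 1 (illustrative names).
    Each iteration stores one node: the random point and its
    (min0, max0, min1, max1), the parameters kept for the testing
    phase. Samples whose distances fall in the pure intervals are
    classified; the overlap is carried to the next iteration."""
    rng = np.random.default_rng(seed)
    lo = np.minimum(X0.min(axis=0), X1.min(axis=0))
    hi = np.maximum(X0.max(axis=0), X1.max(axis=0))
    nodes = []
    for _ in range(max_iter):                  # early stopping would also halt here
        if len(X0) == 0 or len(X1) == 0:
            break                              # one class exhausted: done
        p = rng.uniform(lo, hi)                # step (a): arbitrary point
        d0 = np.linalg.norm(X0 - p, axis=1)    # step (b): Equations (1) and (2)
        d1 = np.linalg.norm(X1 - p, axis=1)
        m0, M0 = d0.min(), d0.max()            # step (c)
        m1, M1 = d1.min(), d1.max()
        nodes.append((p, m0, M0, m1, M1))
        lo_ov, hi_ov = max(m0, m1), min(M0, M1)  # unclassified interval (cases 1-4)
        if lo_ov > hi_ov:                      # case 5: ranges disjoint, all classified
            break
        # keep only the samples whose distances fall inside the overlap
        X0 = X0[(d0 >= lo_ov) & (d0 <= hi_ov)]
        X1 = X1[(d1 >= lo_ov) & (d1 <= hi_ov)]
    return nodes
```

At test time, a sample would be routed through the stored nodes in order, using the same interval tests to assign its class.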

3. Dataset Description

The PV array used to generate the dataset, for both healthy and faulty states, consists of two parallel strings. Each string comprises fifteen series-connected Isofoton PV modules (106 W, 12 V), modeling a realistic photovoltaic (PV) system located at a research center in Bouzareah, Algeria. The Simulink/MATLAB 2015a platform is utilized to simulate the current (Impp) and voltage (Vmpp) at the maximum power point of this PV array under both healthy and faulty states, considering various values of cell temperature (T) and irradiance (G). Through this simulation, 753 samples are generated for each of the considered classes, consisting of the four physical quantities (T, G, Impp, Vmpp). In this study, besides the normal operating state, six faulty states are considered. These states and their corresponding labels are given in Table 1. Environmental factors such as dust accumulation, adverse weather conditions, and snowfall often lead to partial shading faults. The algorithm’s ability to detect and classify such faults has been rigorously tested (refer to Table 1 for the examined faults). The results demonstrate a high degree of accuracy in diagnosing these issues. However, module aging, which gradually degrades PV system performance, was not included in this study. Future research will incorporate this factor to enhance fault detection capabilities further.
Before fitting the classifier, the dataset needs to be preprocessed. Three steps are involved: attribute normalization, where the input data are scaled to preserve consistency in computing distances; data structuring, by adding a label to each data point to create a labeled dataset; and filtering out outlier data.
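The three preprocessing steps can be sketched as follows. This is an illustrative Python version; the z-score rule used for outlier filtering, as well as the function and parameter names, are our assumptions, since the text does not specify the filtering criterion:

```python
import numpy as np

def preprocess(X, label, eps=3.0):
    """Sketch of the three preprocessing steps (illustrative names).
    X holds one sample per row with columns (T, G, Impp, Vmpp)."""
    # 1) Attribute normalization: zero mean, unit variance per feature,
    #    so that no feature dominates the Euclidean distances.
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    sigma[sigma == 0] = 1.0            # guard against constant features
    Xn = (X - mu) / sigma
    # 2) Outlier filtering: drop samples farther than `eps` standard
    #    deviations in any feature (an assumed z-score rule).
    Xn = Xn[(np.abs(Xn) <= eps).all(axis=1)]
    # 3) Data structuring: append the class label as the last column.
    y = np.full((len(Xn), 1), float(label))
    return np.hstack([Xn, y])
```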
The proposed algorithm is an excellent candidate for implementation in large-scale PV farms. Its reliance on a few key attributes ensures rapid fault detection, making it well suited for real-time monitoring and diagnosis in large-scale installations.
As shown in Figure 8, utilizing the Impp as a feature makes it possible to distinguish between three faults: string disconnection, string disconnection with 50% shading, and string disconnection with 75% shading. Meanwhile, in Figure 9, it appears that the Vmpp feature can be used to classify faults such as string disconnection with 25% shading, short circuits of three modules, and short circuits of ten modules. To detect the healthy state class, both Impp and Vmpp features must be used simultaneously.
In fact, the proposed algorithm was tested on a dataset containing four classes. Subsequently, it was generalized to a more complex dataset with seven classes, as presented in this study.

4. Fault Detection and Diagnosis Methodology

The flowchart of the classification strategy used is shown in Figure 10. For the algorithm to function effectively, the multi-class dataset must be adapted into a bi-class dataset by isolating one class at a time. Therefore, six classifiers must be designed for this purpose.
The first, the second, and the third classifiers isolate classes 0, 1, and 2 from the rest of the classes, respectively. Then, the fourth classifier isolates class 4 from classes 3, 5, and 6. The fifth classifier separates class 3 from classes 5 and 6. Finally, the sixth classifier distinguishes between classes 5 and 6.
Each classifier is designed based on the classification algorithm described previously and uses the four specified features (T, G, Impp, and Vmpp).
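The cascade described above can be sketched as follows (illustrative Python, not the authors' code; `classifiers` stands for the six trained binary classifiers, listed in the isolation order given in the text):

```python
def diagnose(sample, classifiers):
    """Sketch of the one-class-at-a-time cascade of Figure 10.
    `classifiers` is a list of six binary predictors; classifier k
    returns True when the sample belongs to the class it isolates.
    Isolation order: classes 0, 1, 2, then 4, then 3; the sixth
    classifier separates class 5 from class 6."""
    order = [0, 1, 2, 4, 3, 5]
    for clf, label in zip(classifiers, order):
        if clf(sample):
            return label
    return 6  # rejected by all six classifiers: class 6
```

Each binary predictor would itself be built from the Euclidean distance-based tree described in Section 2, operating on the four features (T, G, Impp, Vmpp).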

5. Results and Discussion

The confusion matrix is a well-known and important mathematical tool in the field of machine learning for evaluating algorithms. The elements of this matrix play a role in calculating the accuracy, precision, and recall metrics. This matrix has two rows and two columns, as illustrated in Table 2.
The four metrics are computed as follows:
  • Accuracy: the proportion of data that the algorithm classifies correctly. It is given by
$Accuracy = \dfrac{TP + TN}{TP + FN + FP + TN} \times 100$
  • Precision: the proportion of correctly predicted positive data to the total data predicted as positive. It is given by
$Precision = \dfrac{TP}{TP + FP} \times 100$
  • Recall: the proportion of data actually in class 1 that the classifier correctly predicts to be in class 1. It is given by
$Recall = \dfrac{TP}{TP + FN} \times 100$
  • F1-score: the harmonic mean of precision and recall. It is given by
$F1\text{-}score = \dfrac{2 \times Precision \times Recall}{Precision + Recall}$
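As a minimal helper (illustrative names), the four metrics can be computed directly from the confusion-matrix entries:

```python
def metrics(TP, FN, FP, TN):
    """Compute the four evaluation metrics (as percentages, following
    the equations above) from the confusion-matrix entries."""
    accuracy = 100 * (TP + TN) / (TP + FN + FP + TN)
    precision = 100 * TP / (TP + FP)
    recall = 100 * TP / (TP + FN)
    f1_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1_score
```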

5.1. Training the Fault Detection and Diagnosis Model Using the Proposed Algorithm

Like any other statistical learning algorithm, the proposed algorithm first needs to be trained using a training dataset. Following training, its performance is evaluated on a separate testing set. The dataset is partitioned into two subsets: the training set comprises 87% of the global dataset, while the testing set encompasses the remaining 13%. As mentioned earlier, six classifiers are necessary to detect and diagnose the specified faults. To mitigate overfitting effectively without compromising the algorithm’s accuracy, the early stopping criterion is employed to halt the training process of each classifier.
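The 87%/13% partition and the early-stopping check can be sketched as follows (illustrative Python; the function names and the use of a fixed random seed are our assumptions):

```python
import numpy as np

def split_dataset(data, train_frac=0.87, seed=0):
    """Shuffle the labeled dataset and split it into the 87%/13%
    train/test partition described above."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_train = int(round(train_frac * len(data)))
    return data[idx[:n_train]], data[idx[n_train:]]

def overfit_stop(train_acc, test_acc, tol=0.03):
    """Early-stopping rule from Section 2: halt training when the gap
    between training and test accuracy exceeds the tolerance (3%)."""
    return abs(train_acc - test_acc) > tol
```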
The accuracy metric for each classifier is calculated at every iteration and illustrated in Figure 11, Figure 12, Figure 13, Figure 14, Figure 15 and Figure 16. As can be seen, for all classifiers, the accuracy value increases over iterations. Classifiers 1 to 6 of the trained model require 23, 9, 4, 6, 16, and 17 steps, respectively, to separate a class from the other classes.

5.2. Evaluating the Performance of the Obtained Model Using the Proposed Algorithm

The performance of the proposed approach is evaluated using the average values of accuracy, precision, and recall. The higher these values, the better the performance of the proposed approach, and vice versa. The confusion matrices and the values of the three metrics are calculated from the test dataset. The results are presented in Table 3 and Table 4, respectively.
Table 3 presents the confusion matrix values for each classifier within the resulting model. These values were used to calculate the precision, accuracy, and recall measures for all six classifiers and are presented in Table 4. The last row of Table 4 shows the average values of the three measures, which represent the measures of the resulting model.
Two additional train–test variants were conducted, and the results are presented in Table 5 and Table 6.
Table 5 and Table 6 highlight the accuracy for the fault detection and diagnosis model across all classifiers under varied training and testing scenarios.

5.3. Comparative Study of Various Machine Learning Algorithms

In this comparative study, the fault detection and diagnosis model depicted in the flowchart of Figure 10 is constructed using various statistical methods, namely, the SVM algorithm [27], the DT algorithm [8,31], the RF algorithm [32,33,34], and the KNN algorithm [35,36,37].
The confusion matrices for the obtained model using the aforementioned algorithms are provided in Table 7, while Table 8 presents the values for accuracy, precision, recall, F1-score, and execution time, along with the average values of these metrics.
Table 9 collects all the average values of the three metrics for each of the proposed algorithms, as well as the SVM, DT, RF, and KNN algorithms.
From the table, it can be seen that the performance of the proposed algorithm is superior to the rest of the algorithms in terms of accuracy, precision, recall, and F1-score. Although the proposed algorithm is slower in fault detection compared to other algorithms, it remains suitable for industrial application and real-time operation.
Figure 17, Figure 18, Figure 19, Figure 20 and Figure 21 display the fault detection and diagnosis results using the proposed algorithm-based model and those based on the SVM, DT, RF, and KNN algorithms, respectively. It can be seen from these figures that the smallest number of incorrectly classified data is obtained in the case of both the RF algorithm-based model and the proposed algorithm-based model. The models fail to correctly classify all data due to data overlap and overfitting issues.

6. Conclusions

In this work, an enhanced approach was proposed for identifying and diagnosing PV array faults. A comparative study was conducted between the proposed algorithm-based model and models based on four statistical learning algorithms: SVM, DT, RF, and KNN. Unlike the decision tree algorithm, which uses the Gini index to split the data into two classes, the proposed algorithm calculates Euclidean distances between an arbitrary point and the dataset samples. It then uses the minimal and maximal distances to separate the samples belonging to each class.
In this study, four features, namely, cell temperature, irradiance, and the current and voltage at the maximum power point, were utilized. The proposed methodology effectively distinguishes the normal operating condition from other abnormal states, achieving a classification accuracy of 97%. The comparative investigation demonstrated that the proposed approach outperformed the other methods considered in this work in terms of accuracy, precision, recall, and F1-score.
By increasing the number of classifiers, the proposed technique can be easily extended to encompass additional faults.

Author Contributions

Conceptualization, Y.M.; methodology, K.K. and A.C.; software, Y.M. and A.A.; validation, Y.M., K.K. and A.C.; formal analysis, K.K.; investigation, K.K. and A.A.; resources, Y.M., K.K. and S.S.; data curation, Y.M.; writing—original draft preparation, Y.M.; writing—review and editing, K.K., A.C. and S.S.; visualization, Y.M. and A.A.; supervision, K.K., A.C. and S.S.; project administration, K.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data are not available.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ADAG: Adaptive directed acyclic graph
ANN: Artificial neural network
DDAG: Decision-directed acyclic graph
DT: Decision tree
FDD: Fault detection and diagnosis
FN: False negative
FP: False positive
G: Irradiance
GCPV: Grid-connected photovoltaic
Impp: Current at the maximum power point
Isc: Short-circuit current
KNN: K-nearest neighbors
OVA: One vs. all
PV: Photovoltaic
PVM: Photovoltaic module
PVS: Photovoltaic system
RF: Random forest
SVM: Support vector machine
T: Temperature
TDR: Time domain reflectometry
TN: True negative
TP: True positive
Vmpp: Voltage at the maximum power point
Voc: Open-circuit voltage

Figure 1. Data splitting for case 1.
Figure 2. Data splitting for case 2.
Figure 3. Data splitting for case 3.
Figure 4. Data splitting for case 4.
Figure 5. Data splitting for case 5.
Figure 6. Graphical illustration of the proposed algorithm.
Figure 7. Flowchart of the proposed algorithm.
Figure 8. Impp for various operating states of the PV array.
Figure 9. Vmpp for various operating states of the PV array.
Figure 10. Fault detection and diagnosis flowchart.
Figure 11. Evolution of accuracy for the first classifier.
Figure 12. Evolution of accuracy as a function of iterations in the second classifier.
Figure 13. Evolution of accuracy as a function of iterations in the third classifier.
Figure 14. Evolution of accuracy as a function of iterations in the fourth classifier.
Figure 15. Evolution of accuracy as a function of iterations in the fifth classifier.
Figure 16. Evolution of accuracy as a function of iterations in the sixth classifier.
Figure 17. Fault detection and diagnosis results using the proposed algorithm-based model.
Figure 18. Fault detection and diagnosis results using the SVM algorithm-based model.
Figure 19. Fault detection and diagnosis results using the DT algorithm-based model.
Figure 20. Fault detection and diagnosis results using the RF algorithm-based model.
Figure 21. Fault detection and diagnosis results using the KNN algorithm-based model.
Table 1. Operating states and their labels.
Class Name | Label
Normal operation | Class 0
Short circuit of three modules | Class 1
Short circuit of ten modules | Class 2
String disconnection | Class 3
String disconnection with 25% of partial shading | Class 4
String disconnection with 50% of partial shading | Class 5
String disconnection with 75% of partial shading | Class 6
Table 2. Confusion matrix used to evaluate the algorithm.
Real Class | Predicted Class 0 | Predicted Class 1
Class 0 | TP | FN
Class 1 | FP | TN
TP (True Positive): the number of class 0 samples that the algorithm correctly assigns to class 0; FN (False Negative): the number of class 0 samples that the algorithm assigns to class 1; FP (False Positive): the number of class 1 samples that the algorithm assigns to class 0; TN (True Negative): the number of class 1 samples that the algorithm correctly assigns to class 1.
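The four quantities in Table 2 determine all of the metrics reported in the following tables. The standard definitions can be written as a small helper (the function name is illustrative, not from the paper):

```python
def metrics(tp, fn, fp, tn):
    """Binary-classification metrics from the confusion matrix of Table 2."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)   # fraction of all samples classified correctly
    precision = tp / (tp + fp)                   # purity of the predicted positive class
    recall = tp / (tp + fn)                      # coverage of the real positive class
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
    return accuracy, precision, recall, f1
```

Each binary classifier in the cascade is scored this way, and the reported averages are the arithmetic means over the six classifiers.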
Table 3. Confusion matrices for the obtained model.
 | TP | FN | FP | TN
Classifier 1 | 1107 | 19 | 29 | 168
Classifier 2 | 947 | 3 | 4 | 178
Classifier 3 | 762 | 1 | 3 | 183
Classifier 4 | 570 | 3 | 1 | 190
Classifier 5 | 354 | 13 | 10 | 179
Classifier 6 | 177 | 20 | 5 | 155
Table 4. Metric values for the obtained model.
 | Accuracy (%) | Precision (%) | Recall (%) | Execution Time (s) | F1-Score (%)
Classifier 1 | 97 | 93 | 85 | 0.98 | 98
Classifier 2 | 99 | 98 | 98 | 0.79 | 100
Classifier 3 | 100 | 99 | 98 | 0.51 | 100
Classifier 4 | 99 | 98 | 99 | 0.31 | 100
Classifier 5 | 96 | 93 | 95 | 0.21 | 98
Classifier 6 | 93 | 89 | 97 | 0.06 | 93
Average values | 97.33 | 95 | 95.33 | 0.47 | 98.2
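The "Average values" row of Table 4 is the arithmetic mean over the six classifiers. As a quick check, with the per-classifier values copied from the table (small differences from the printed row come from rounding):

```python
# Per-classifier values from Table 4
acc  = [97, 99, 100, 99, 96, 93]
prec = [93, 98, 99, 98, 93, 89]
rec  = [85, 98, 98, 99, 95, 97]

mean = lambda v: sum(v) / len(v)
print(round(mean(acc), 2), round(mean(prec), 2), round(mean(rec), 2))
# → 97.33 95.0 95.33
```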
Table 5. Metric values for the obtained model with a second train–test variant.
 | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
Classifier 1 | 98 | 99 | 99 | 99
Classifier 2 | 99 | 100 | 100 | 100
Classifier 3 | 100 | 100 | 100 | 100
Classifier 4 | 100 | 100 | 100 | 100
Classifier 5 | 97 | 97 | 99 | 98
Classifier 6 | 96 | 97 | 96 | 96
Average values | 98.33 | 99 | 99 | 99
Table 6. Metric values for the obtained model with a third train–test variant.
 | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
Classifier 1 | 98 | 99 | 99 | 99
Classifier 2 | 100 | 100 | 100 | 100
Classifier 3 | 100 | 100 | 100 | 100
Classifier 4 | 99 | 100 | 100 | 100
Classifier 5 | 99 | 99 | 99 | 99
Classifier 6 | 99 | 98 | 99 | 98
Average values | 99 | 99.33 | 99.5 | 99.33
Table 7. Confusion matrices for the obtained model using the four algorithms.
 | SVM (TP/FN/FP/TN) | DT (TP/FN/FP/TN) | RF (TP/FN/FP/TN) | KNN (TP/FN/FP/TN)
Cl 1 | 1104/15/165/34 | 1113/6/23/176 | 1119/8/15/176 | 1107/12/155/44
Cl 2 | 1086/0/122/61 | 950/3/3/180 | 950/1/1/180 | 1078/1/1/182
Cl 3 | 1021/0/103/84 | 765/2/0/186 | 764/0/1/186 | 892/0/0/187
Cl 4 | 927/0/107/90 | 566/3/0/196 | 568/1/0/196 | 696/0/0/196
Cl 5 | 796/57/131/50 | 364/10/7/185 | 371/4/3/185 | 508/1/167/20
Cl 6 | 631/112/84/90 | 163291722021573617520210573201296
Table 8. Metric values for the obtained model using the four algorithms.
 | SVM | DT | RF | KNN
Classifier 1
Accuracy (%) | 86 | 98 | 99 | 87 (K = 7)
Precision (%) | 87 | 98 | 99 | 88
Recall (%) | 99 | 99 | 100 | 99
F1-score (%) | 92 | 99 | 99 | 93
Execution time (s) | 0.28 | 0.02 | 0.008 | 0.21 (K = 1)
Classifier 2
Accuracy (%) | 90 | 99 | 100 | 100
Precision (%) | 90 | 100 | 100 | 100
Recall (%) | 100 | 100 | 100 | 100
F1-score (%) | 93 | 91 | 100 | 100
Execution time (s) | 0.29 | 0.03 | 0.05 | 0.17
Classifier 3
Accuracy (%) | 91 | 100 | 100 | 100 (K = 3)
Precision (%) | 91 | 100 | 100 | 100
Recall (%) | 100 | 100 | 100 | 100
F1-score (%) | 100 | 100 | 100 | 100
Execution time (s) | 0.05 | 0.004 | 0.07 | 0.15
Classifier 4
Accuracy (%) | 90 | 100 | 100 | 100 (K = 2)
Precision (%) | 90 | 100 | 100 | 100
Recall (%) | 100 | 99 | 100 | 100
F1-score (%) | 87 | 98 | 100 | 100
Execution time (s) | 0.14 | 0.003 | 0.05 | 0.12
Classifier 5
Accuracy (%) | 82 | 97 | 99 | 76 (K = 35)
Precision (%) | 86 | 98 | 99 | 75
Recall (%) | 93 | 97 | 99 | 100
F1-score (%) | 86 | 98 | 99 | 86
Execution time (s) | 0.16 | 0.003 | 0.06 | 0.11
Classifier 6
Accuracy (%) | 78 | 64 | 63 | 59 (K = 1)
Precision (%) | 88 | 87 | 85 | 80
Recall (%) | 84 | 54 | 53 | 60
F1-score (%) | 51 | 78 | 65 | 68
Execution time (s) | 0.16 | 0.16 | 0.06 | 0.1
Average values
Accuracy (%) | 85.83 | 93 | 93.5 | 87
Precision (%) | 88.65 | 97.16 | 97.16 | 90.5
Recall (%) | 96 | 91.50 | 92 | 93.16
F1-score (%) | 83.83 | 94 | 94 | 91.16
Execution time (s) | 0.18 | 0.006 | 0.06 | 0.14
Table 9. Metric average values.
 | Accuracy (%) | Precision (%) | Recall (%) | Execution Time (s) | F1-Score (%)
The proposed algorithm | 97.33 | 98.66 | 97.5 | 0.47 | 98.2
SVM | 85.33 | 88.65 | 96 | 0.18 | 83.83
DT | 93 | 97.16 | 91.50 | 0.006 | 94
RF | 93.50 | 97.16 | 92 | 0.06 | 94
KNN | 87 | 90.50 | 93.16 | 0.14 | 91.16
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
