1. Introduction
The electrical grid faces new challenges arising from the expansion of renewable energy sources, the proliferation of electric vehicles, and the incorporation of storage systems. At the same time, the growing demand for high-quality electrical energy in industrial and domestic applications underscores the importance of accurate monitoring and diagnosis of power grid disturbances, which has generated growing interest in grid improvement studies in recent years.
Power quality refers to the purity and stability of electrical current [1]. A high-quality power supply is essential to ensure the efficiency, safety, and durability of electrical and electronic equipment. Power Quality Disturbances (PQDs), such as voltage sags, harmonics, or transients, can cause equipment failures, increase operating costs, and reduce energy efficiency. Therefore, the development of automated and efficient classification systems for the detection and categorization of these anomalies is crucial [2,3,4].
In this context, Machine Learning (ML) and Deep Learning (DL) methodologies have become powerful tools for the analysis and classification of complex signals [5,6,7,8,9]. Their ability to extract complex patterns from large datasets makes them ideal candidates for solving PQD classification challenges. However, existing studies in this field often present key limitations. Many high-performance models, particularly end-to-end learning architectures, act as “black boxes”, lacking the interpretability needed in engineering applications where understanding the physical phenomena is essential. Moreover, the reliance on purely real-world datasets poses a significant practical challenge, as large collections of labeled PQDs covering different noise levels are extremely scarce and difficult to obtain.
This study directly addresses these limitations by providing a fundamental cross-platform benchmark of various ML and DL algorithms. Our approach applies and compares a diverse set of models using controlled hybrid datasets, which combine a validated real signal with synthetic perturbations and Gaussian noise to ensure reproducibility and systematic evaluation. The input features for classification include higher-order statistical (HOS) estimators, frequency-domain features, and features derived from the envelope and the derivative of the signal.
Specifically, this work makes the following original contributions:
It presents a direct comparative analysis of the classification capabilities of several ML models (Support Vector Machines, Decision Trees, Random Forest, k-Nearest Neighbors, Gradient Boosting) and a DL model (Dense Neural Networks) across a wide range of noise conditions, providing a valuable performance benchmark.
It examines the impact of different noise levels on model accuracy, focusing specifically on the robustness and noise resilience of each algorithm.
It analyzes and contrasts the practical implications of implementing these models on two widely used platforms, MATLAB [10] and Python [11,12], with special attention to their suitability for real PQD classification tasks.
The rest of the paper is structured as follows. Section 2 describes the signal processing methodology, including data generation, the addition of Gaussian noise, and feature extraction. Section 3 provides a brief description of the ML and DL models employed. Section 4 and Section 5 present a critical analysis and a discussion of the experimental results, which also suggest future research directions. Finally, Section 6 concludes the article.
2. Materials and Methods
2.1. Data Generation
The main power quality disturbances addressed in this study are defined according to the IEEE 1159-2019 standard [13]. Figure 1 illustrates these disturbances, which were generated as 1-s signals from an ideal function. The specific parameters of the samples used in the classifier are detailed in Table 1.
For our evaluation methodology, we opted for a hybrid dataset because purely real-world datasets containing a wide variety of labeled PQs and covering a broad range of Signal-to-Noise Ratios (SNRs) are extremely scarce and difficult to obtain. This approach, which combines a validated, anomaly-free real-world signal with synthetic disturbances and Gaussian noise, allowed us to create a repeatable and verifiable testbed. This was crucial for our main objective: to evaluate the robustness of various machine learning and deep learning models under precisely defined conditions.
To generate the power quality disturbances, we began by acquiring three hours of undisturbed reference signals from the database of the University of Cadiz. After rigorous validation to confirm the absence of anomalies, we injected disturbances at random instants and of random types, strictly following the parameters of the IEEE standards. This procedure ensured the generation of an identical number of disturbed and undisturbed signals, thereby guaranteeing a balanced and robust dataset for comparative analysis.
2.2. Random Noise
To enhance the robustness of the PQD classifier, random Gaussian noise was added to the samples before feature extraction. The noise was introduced with Signal-to-Noise Ratios (SNRs) ranging from 40 dB to 1 dB.
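As an illustration of this step, the following minimal NumPy sketch scales white Gaussian noise to a requested SNR before adding it to a clean signal (the function name and the 20 dB example are ours, not part of the original pipeline):

```python
import numpy as np

def add_gaussian_noise(signal: np.ndarray, snr_db: float) -> np.ndarray:
    """Add white Gaussian noise so the result has the requested SNR (in dB)."""
    signal_power = np.mean(signal ** 2)                  # mean power of the clean signal
    noise_power = signal_power / (10 ** (snr_db / 10))   # noise power for the target SNR
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

# Example: a 1-s, 50 Hz sine sampled at 10 kHz, corrupted at 20 dB SNR
t = np.arange(0, 1, 1 / 10_000)
clean = np.sin(2 * np.pi * 50 * t)
noisy = add_gaussian_noise(clean, snr_db=20)
```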
2.3. Features Extracted from PQDs
To optimize the classification of PQDs, this experiment incorporated a set of multidomain features. Key information was extracted from the time, frequency, envelope, and derivative domains of the signals. This integration of various features enriched the model with a multidimensional data representation, significantly improving its ability to accurately identify and classify PQDs.
2.3.1. Time Domain Features
Higher-Order Statistics (HOS)
Statistical features have been widely used over the years to detect trends in the behavior of signals of different natures. Higher-order statistics (HOS) have gained popularity due to their effectiveness when signals exhibit non-Gaussian noise. Higher-order cumulants are mathematical tools that allow the analysis of properties of signals that do not follow a normal (Gaussian) distribution. In signal processing, the following formula is commonly used to calculate the $r$-th order cumulants of a set of signals $\{x_1, x_2, \ldots, x_r\}$ [14,15]:

$$\mathrm{Cum}(x_1, \ldots, x_r) = \sum_{(s_1, \ldots, s_p)} (-1)^{p-1} (p-1)! \, E\Big[\prod_{i \in s_1} x_i\Big] \cdots E\Big[\prod_{i \in s_p} x_i\Big]$$

where the summation extends over all possible partitions $(s_1, \ldots, s_p)$, $p = 1, \ldots, r$, of the set of indices $(1, \ldots, r)$, $E$ denotes the expected value, and $s_1, \ldots, s_p$ are the sets that make up the partition.
are the sets that make up the partition. For a stationary signal
of order
r, the
r-th order cumulant is defined as:
For the zero-mean signal $x(t)$, the second-, third-, and fourth-order cumulants are expressed by:

$$C_{2,x}(\tau) = E\big[x(t)\,x(t+\tau)\big]$$
$$C_{3,x}(\tau_1, \tau_2) = E\big[x(t)\,x(t+\tau_1)\,x(t+\tau_2)\big]$$
$$C_{4,x}(\tau_1, \tau_2, \tau_3) = E\big[x(t)\,x(t+\tau_1)\,x(t+\tau_2)\,x(t+\tau_3)\big] - C_{2,x}(\tau_1)\,C_{2,x}(\tau_2-\tau_3) - C_{2,x}(\tau_2)\,C_{2,x}(\tau_3-\tau_1) - C_{2,x}(\tau_3)\,C_{2,x}(\tau_1-\tau_2)$$
If $\tau_1 = \tau_2 = \tau_3 = 0$, the equations above simplify to:

$$C_{2,x} = E\big[x^2(t)\big], \qquad C_{3,x} = E\big[x^3(t)\big], \qquad C_{4,x} = E\big[x^4(t)\big] - 3\big(E\big[x^2(t)\big]\big)^2$$

These last expressions represent the variance, skewness, and kurtosis, respectively. Normalized skewness and kurtosis can be defined as $\gamma_3 = C_{3,x}/C_{2,x}^{3/2}$ and $\gamma_4 = C_{4,x}/C_{2,x}^{2}$, respectively. Normalizing these measures is useful, since they remain unchanged regardless of the signal’s scale or shift.
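A compact sketch of how these zero-lag cumulants translate into features, using the standard estimators above (the function name is ours):

```python
import numpy as np

def hos_features(x: np.ndarray) -> tuple[float, float, float]:
    """Zero-lag cumulants of a zero-mean signal: variance,
    normalized skewness, and normalized kurtosis."""
    x = x - np.mean(x)                    # enforce the zero-mean assumption
    c2 = np.mean(x ** 2)                  # second-order cumulant (variance)
    c3 = np.mean(x ** 3)                  # third-order cumulant
    c4 = np.mean(x ** 4) - 3 * c2 ** 2    # fourth-order cumulant
    return c2, c3 / c2 ** 1.5, c4 / c2 ** 2
```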
Total Harmonic Distortion (THD)
Total harmonic distortion is defined as the ratio between the Root Mean Square (RMS) value of the harmonic components and the RMS value of the fundamental component of the signal. This definition applies universally to both current and voltage signals [16]:

$$THD_I = \frac{\sqrt{\sum_{n=2}^{\infty} I_n^2}}{I_1}, \qquad THD_V = \frac{\sqrt{\sum_{n=2}^{\infty} V_n^2}}{V_1}$$

where $THD_I$ and $THD_V$ are the total harmonic distortions for current and voltage, respectively. The terms $I_n$ and $V_n$ denote the current and voltage components of the $n$-th harmonic, while $I_1$ and $V_1$ represent their corresponding fundamental components.
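A hedged sketch of a THD estimator consistent with this definition. It reads the harmonic magnitudes from FFT bins, which is exact only when the analysis window spans an integer number of fundamental cycles; the 20-harmonic cutoff is our assumption:

```python
import numpy as np

def thd(signal: np.ndarray, fs: float, f0: float = 50.0, n_harmonics: int = 20) -> float:
    """THD = RMS of harmonics 2..n divided by RMS of the fundamental."""
    spectrum = np.abs(np.fft.rfft(signal))
    bin_f0 = int(round(f0 * len(signal) / fs))      # FFT bin of the fundamental
    fundamental = spectrum[bin_f0]
    harmonic_bins = [k * bin_f0 for k in range(2, n_harmonics + 1)
                     if k * bin_f0 < len(spectrum)]  # guard against the Nyquist limit
    harmonics = spectrum[harmonic_bins]
    return np.sqrt(np.sum(harmonics ** 2)) / fundamental
```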
Crest Factor
The crest factor, defined as the ratio between the peak value of a waveform and its RMS value, is a crucial parameter for analyzing low-frequency signals. It is calculated to quantify the influence of low frequencies on a signal. For an ideal sinusoidal waveform, the crest factor is
[
17].
where
and
represent the peak and
values of the signal, respectively, and
N is the number of samples.
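For completeness, a minimal sketch of the crest factor computation implied by this definition (the function name is ours):

```python
import numpy as np

def crest_factor(signal: np.ndarray) -> float:
    """Peak value over RMS value; approximately 1.414 (sqrt(2)) for a pure sine."""
    v_peak = np.max(np.abs(signal))
    v_rms = np.sqrt(np.mean(signal ** 2))
    return v_peak / v_rms
```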
2.3.2. Frequency Domain Features
The frequency domain, accessed through the Fourier Transform, offers a critical representation of a signal as the composition of its sinusoidal components. The features derived from this domain are particularly effective for analyzing disturbances whose defining characteristics are more pronounced in their frequency content [18].
Frequency Magnitudes
The Fast Fourier Transform (FFT) converts a signal from the time domain into a representation in the frequency domain. The resulting complex coefficients reveal the amplitude and phase of each frequency component present in the original signal. In a 50 Hz power grid, harmonic disturbances are characterized by the presence of energy at integer multiples of the fundamental frequency. Consequently, the detection and quantification of these disturbances can be achieved by identifying and measuring the magnitudes of these harmonic peaks in the frequency spectrum.
Frequency Band Energy
Band energy offers a complementary approach to the analysis of individual frequencies, as it aggregates the total energy within specific frequency segments, which are subsequently normalized to obtain proportional contributions. This metric is especially effective for identifying transients in high-frequency bands and flickers in low-frequency bands, given their characteristic energy concentrations in these regions.
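The sketch below illustrates both ideas: harmonic magnitudes at multiples of the 50 Hz fundamental, and band energies normalized to proportional contributions. The band edges are illustrative assumptions, since the paper does not list them here:

```python
import numpy as np

def spectral_features(signal: np.ndarray, fs: float, f0: float = 50.0):
    """Harmonic magnitudes at multiples of f0 plus normalized band energies."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1 / fs)

    # Magnitudes at the first few integer multiples of the fundamental
    bin_f0 = int(round(f0 * len(signal) / fs))
    harmonic_mags = spectrum[[k * bin_f0 for k in range(1, 8)]]

    # Total energy inside each band, normalized to proportional contributions
    bands = [(0, 100), (100, 1000), (1000, fs / 2)]   # low / mid / high (assumed edges)
    energy = spectrum ** 2
    band_energy = np.array([energy[(freqs >= lo) & (freqs < hi)].sum()
                            for lo, hi in bands])
    return harmonic_mags, band_energy / band_energy.sum()
```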
THD in Frequency Domain
Parallel to the analyses in the time domain, the THD computed in the frequency domain ($THD_F$) provides a comprehensive measure of waveform distortion by summing the magnitudes of its harmonic components. A high $THD_F$ value indicates a strong presence of harmonics in the signal, allowing for effective detection and quantitative evaluation of these specific disturbances.
2.3.3. Wavelet Features
The wavelet transform is frequently used in PQD classification due to its ability to provide multiresolution analysis, which facilitates the simultaneous extraction of features from the temporal and frequency domains of the signal [19]. In this research, feature extraction was performed using the PyWavelets library [20] and the ‘db4’ wavelet, generating a set of 10 features: the mean and standard deviation of the wavelet coefficients at each decomposition level.
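A minimal PyWavelets sketch consistent with this description. The 4-level decomposition is our inference: ‘db4’ with level=4 yields five coefficient arrays, and hence the 10 features mentioned in the text:

```python
import numpy as np
import pywt

def wavelet_features(signal: np.ndarray) -> np.ndarray:
    """Mean and standard deviation of the 'db4' coefficients at each level.
    A 4-level decomposition returns 5 arrays (cA4, cD4, cD3, cD2, cD1),
    giving 2 x 5 = 10 features."""
    coeffs = pywt.wavedec(signal, 'db4', level=4)
    return np.array([f(c) for c in coeffs for f in (np.mean, np.std)])
```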
2.3.4. Envelope Features
The envelope of a signal provides a representation of its amplitude variation over time. A widely used technique to calculate this envelope is the Hilbert transform, which converts a real signal into its analytic counterpart. The Hilbert transform $\mathcal{H}\{x\}(t)$ [18] of a real function $x(t)$ is defined as:

$$\mathcal{H}\{x\}(t) = \frac{1}{\pi}\,\mathrm{p.v.}\int_{-\infty}^{\infty} \frac{x(\tau)}{t - \tau}\, d\tau$$

The envelope is then obtained as the magnitude of the analytic signal, $e(t) = \left| x(t) + j\,\mathcal{H}\{x\}(t) \right|$.
Envelope features are valuable for analyzing variations in signal amplitude.
Mean of the Envelope
The mean of the signal envelope quantifies its average amplitude in a given sample. For slow variations, this metric can effectively indicate the presence of voltage sags or swells.
Standard Deviation of the Envelope
The standard deviation of the envelope quantifies the magnitude of the amplitude variations around its mean. This characteristic is especially useful for measuring flicker, as its repeated characteristic fluctuations directly result in a high standard deviation.
Maximum of the Envelope
By capturing the peak amplitude of the signal, the maximum value of the envelope serves as an effective feature for identifying disturbances such as flickers and voltage swells.
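A minimal SciPy sketch of these three envelope features, using scipy.signal.hilbert to build the analytic signal (the function name is ours):

```python
import numpy as np
from scipy.signal import hilbert

def envelope_features(signal: np.ndarray) -> tuple[float, float, float]:
    """Mean, standard deviation, and maximum of the Hilbert envelope."""
    envelope = np.abs(hilbert(signal))   # magnitude of the analytic signal
    return envelope.mean(), envelope.std(), envelope.max()
```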
2.3.5. Derivative Features
The derivative of a signal, which serves as a measure of the instantaneous rate of change, is fundamental for identifying rapid and directional changes. It is therefore an effective feature for detecting transient events. The derivative of a continuous signal $x(t)$ is defined as:

$$\frac{dx(t)}{dt} = \lim_{\Delta t \to 0} \frac{x(t + \Delta t) - x(t)}{\Delta t}$$

In the case of electrical signals, the data are collected discretely, in this case with a sampling frequency of 10 kHz, so the derivative is approximated by finite differences.
Maximum Derivative
This characteristic quantifies the fastest and most pronounced change within a signal. A high value serves as a reliable indicator of transients, which are by definition abrupt, large-magnitude changes in the waveform.
Mean of the Absolute Derivative
Although the maximum derivative indicates the most extreme rate of change at a single point, the mean of the absolute derivative provides a more comprehensive measure by calculating the average rate of change over the entire segment of the signal. This makes it especially effective for characterizing events that involve widespread and rapid fluctuations, such as impulses and high-frequency noise.
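A short sketch of both derivative features using a first-difference approximation at the stated 10 kHz sampling rate (the function name is ours):

```python
import numpy as np

def derivative_features(signal: np.ndarray, fs: float = 10_000.0) -> tuple[float, float]:
    """Maximum absolute derivative and mean absolute derivative,
    approximated by first differences scaled by the sampling rate."""
    dx = np.diff(signal) * fs            # finite-difference derivative estimate
    return np.max(np.abs(dx)), np.mean(np.abs(dx))
```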
3. Power Quality Disturbance Classifiers
Figure 2 provides a generalized algorithm detailing the multi-step process involved in the classification of power quality (PQ) events. This study discusses various algorithms and approaches based on machine learning and deep learning (AI), focusing on their application for the classification and detection of disturbances within a power system.
3.1. Machine Learning-Based Classifiers
For the classification task, we selected a set of representative and high-performing Machine Learning (ML) models: Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), K-Nearest Neighbours (kNN), and Gradient Boosting (GB). These algorithms are well-suited for feature-based data and serve as a robust benchmark due to their proven effectiveness across a wide range of classification problems.
3.1.1. Support Vector Machine (SVM)
An SVM model was utilized in this study for PQD classification, as it is well-suited to handling high-dimensional and nonlinear data. To optimize its performance, the key hyperparameters, C and gamma, were tuned to find the best balance between a smooth decision boundary and accurate classification of the training data [17].
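A hedged scikit-learn sketch of this tuning step. The candidate grid values, split ratio, and synthetic stand-in data are our placeholders; the hyperparameters actually used in the paper are those reported in Table 2:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the extracted multi-domain PQD feature matrix
X, y = make_classification(n_samples=600, n_features=30, n_informative=10,
                           n_classes=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)  # split ratio is a placeholder

# Candidate C/gamma values are placeholders, not the tuned values of Table 2
param_grid = {'svc__C': [0.1, 1, 10, 100],
              'svc__gamma': ['scale', 0.01, 0.1, 1]}
search = GridSearchCV(make_pipeline(StandardScaler(), SVC(kernel='rbf')),
                      param_grid, cv=5)  # k-fold CV on the training split only
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```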
3.1.2. Decision Tree (DT)
A DT algorithm uses a recursive process to classify data [21]. It works in four main steps: first, it selects the best feature to split the dataset using a criterion such as the Gini index. Second, it splits the data into subsets based on that feature. Third, it repeats this process to grow the tree until a stopping condition is met. Finally, it “prunes” the tree by cutting branches to prevent the model from overfitting.
3.1.3. Random Forest (RF)
By combining the outputs of multiple independent decision trees, the RF algorithm improves overall predictive performance and reduces the risk of overfitting. The process involves three main steps: first, it creates multiple training datasets through bootstrapping (random sampling with replacement). Next, it trains a separate decision tree on each of these subsets, using a random subset of features to ensure diversity among the trees. Finally, the model aggregates the predictions of all individual trees, either averaging them for regression or using a majority vote for classification, to produce the final, more robust output [22].
3.1.4. K-Nearest Neighbors (kNN)
Operating on the principle that similar data points are close to each other, the kNN algorithm is a non-parametric classifier [23]. It assigns a label to a new data point based on the majority class of its nearest neighbors in the training set, often determined by the Euclidean distance. This makes kNN a robust choice for datasets with complex, non-linear relationships, particularly when the dataset is not excessively large.
3.1.5. Gradient Boosting (GB)
The GB algorithm is a powerful and flexible method for building a predictive model by combining a sequence of simple “weak” learners, typically regression trees. At each step, a new tree is trained to correct the residual errors of the existing ensemble. This is achieved by fitting the new tree to the negative gradient of a chosen differentiable loss function, effectively performing a form of gradient descent in function space to minimize the overall prediction error [24].
For multi-class classification, a separate regression tree is trained for each class to minimize the loss. Binary classification is a simpler case where only a single regression tree is used per stage.
3.2. Deep Learning-Based Classifiers
For the DL comparison, a Dense Neural Network (DNN) was selected as the foundational architecture due to its ability to learn complex patterns directly from the input features, providing a crucial contrast to the traditional ML models.
The exclusion of other architectures, such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) like LSTMs, was based on the scope of this study. Our methodology relies on a multi-domain feature extraction process, transforming the raw time-series signal into a feature vector. The chosen algorithms are well-suited for this feature-based, tabular data format. In contrast, CNNs are primarily designed to handle spatial hierarchies (like images) and local patterns in raw time-series data, while LSTMs are specialized for sequential data where the order is a critical factor.
Dense Neural Networks (DNNs)
Dense Neural Networks feature a fully connected architecture in which each neuron in a given layer is connected to every neuron in the preceding layer, allowing for extensive information transfer (Figure 3). This structure is particularly effective for problems that require significant learning capacity, such as natural language processing, image recognition, and signal classification [25].
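A minimal Keras sketch of such a fully connected architecture. The layer widths, dropout rate, and optimizer are illustrative assumptions, not the tuned configuration reported in Table 2:

```python
import tensorflow as tf
from tensorflow.keras import layers

n_features, n_classes = 30, 10  # placeholders for feature-vector length and PQD classes

model = tf.keras.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(128, activation='relu'),   # fully connected hidden layers
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.2),                    # regularization against overfitting
    layers.Dense(n_classes, activation='softmax'),  # one output per PQD class
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=50)
```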
4. Results
The experiments were carried out on a robust computational setup consisting of an Apple M2 Pro chip with a 12-core CPU (8 performance and 4 efficiency cores), a 19-core GPU, a 16-core Neural Engine, and 32 GB of unified memory. This hardware provided the significant resources required for the analysis. Leveraging both MATLAB and Python, this study facilitated a direct comparison of the latest features and optimizations for algorithm development and data analysis across the two platforms.
The classifier’s robustness was assessed by evaluating its performance at Signal-to-Noise Ratios (SNRs) ranging from 40 dB to 1 dB. This analysis aims to determine the specific threshold at which the model’s accuracy significantly degrades in noisy environments.
The dataset was carefully partitioned to ensure a fair and reliable evaluation.
For all Machine Learning (ML) models, we adopted a standard train/test split. The training set was used for both model fitting and hyperparameter tuning, performed with k-fold cross-validation to prevent data leakage. The testing set was held out and used exclusively for the final, unbiased evaluation of the model’s performance on unseen data.
For the Deep Learning (DL) models, a three-way split was implemented to account for the need for continuous performance monitoring. The dataset was divided into a training set, a validation set, and an independent testing set. The validation set was used to fine-tune the model’s hyperparameters and prevent overfitting during the training process, while the testing set provided an unbiased final evaluation of the model’s generalizability. Additionally, to ensure a fair evaluation and prevent class bias, all classes in the dataset contained an equal number of samples before the split.
The final performance of all models was then evaluated only on the independent test set, which was not used in any part of the training or validation process. By holding out this final test set, our results provide a reliable and unbiased measure of the models’ ability to generalize to unseen data.
Model performance was evaluated using several key metrics to provide a comprehensive assessment. Accuracy measures the overall percentage of correct predictions, but it can be misleading on imbalanced datasets. Recall (or sensitivity) quantifies the model’s ability to correctly identify all positive instances, which is crucial in applications where missing a positive case is costly. Finally, the F1-score provides a single, balanced metric that harmonizes precision and recall, offering a more reliable measure of performance, particularly on imbalanced datasets.
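These metrics can be obtained in a few lines with scikit-learn; the label arrays below are placeholders standing in for the held-out test labels and the classifier outputs:

```python
from sklearn.metrics import accuracy_score, classification_report

y_test = [0, 1, 2, 2, 1, 0]   # placeholder true labels
y_pred = [0, 1, 2, 1, 1, 0]   # placeholder classifier outputs
print(accuracy_score(y_test, y_pred))         # overall accuracy
print(classification_report(y_test, y_pred))  # per-class precision, recall, F1-score
```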
As part of our methodology, each algorithm was rigorously tuned in both MATLAB and Python to achieve its maximum performance. The specific hyperparameters that yielded the best results for each model are detailed in Table 2, providing a clear record for the precise replication of our findings.
The results presented in Table 3 and Figure 4 demonstrate that all evaluated classifiers exhibit a high level of robustness to noise. This is a significant finding, as noise can easily blur the boundaries between different types of power quality disturbances, making accurate classification challenging. Our results indicate that well-designed Machine Learning models are a viable and effective solution for classification tasks in noisy environments.
As shown in Table 4, the performance of the classifiers evaluated in MATLAB is comparable to that of their Python counterparts. While MATLAB achieved slightly better results in some cases, this marginal difference can be attributed to the distinct optimization algorithms used in MATLAB’s toolboxes compared to scikit-learn in Python. Both experiments were conducted using the same sample set, ensuring a fair comparison.
The Gradient Boosting experiment necessitated different implementations in MATLAB and Python. The Python experiment employed the HistGradientBoostingClassifier [26], while MATLAB’s lack of a direct equivalent led to the selection of the AdaBoostM2 algorithm from its documentation. Although these algorithms differ in their core mechanisms, AdaBoostM2 was chosen as the most suitable alternative for a comparative analysis.
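For reference, a minimal usage sketch of the scikit-learn estimator named above, on synthetic stand-in data (default hyperparameters; the tuned values are those of Table 2):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=600, n_features=30, n_informative=10,
                           n_classes=5, random_state=0)  # stand-in features
clf = HistGradientBoostingClassifier(random_state=0)     # defaults only
clf.fit(X, y)
print(clf.score(X, y))
```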
In this experiment, signal preprocessing and feature extraction were the most computationally intensive stages. The specific execution times for the subsequent classification and validation phases are detailed in Table 5 for Python and Table 6 for MATLAB.
By far, the most challenging and time-consuming task was the fine-tuning of the SVM hyperparameters. The iterative process required to find the optimal configuration demanded a significant amount of computational time. It is important to note that the total test time reported in the tables includes the time for signal preprocessing, feature extraction, and the final classification execution.
The MATLAB script’s reported training time explicitly includes an extensive hyperparameter search using cross-validation. This computationally intensive process, which is integrated into the model training pipeline in MATLAB, is the main source of its longer execution time.
In contrast, our Python implementation used a different approach. Based on preliminary experiments that consistently showed optimal performance with a specific set of hyperparameters, we chose to hardcode these values. This decision allowed us to exclude the time-consuming hyperparameter search from the final reported training time for the Python script, thereby providing a more direct comparison of the base model training performance itself.
This methodological choice to hardcode hyperparameters in Python does not affect the validity of the final performance metrics reported for the Python SVM.
5. Discussion
The results indicate that both Python and MATLAB are effective platforms for PQD classification. While MATLAB demonstrates slightly better performance for certain classifiers at higher SNRs, Python’s DNN shows superior noise robustness at 1 dB. This suggests that the best choice of platform depends on the specific noise characteristics of the deployment environment and the preferred model architecture.
Across all models, a critical inflection point for performance degradation was observed at approximately 10 dB. This is attributed to the classifiers’ training range, which spanned from 40 dB down to 10 dB: the classifiers did not perform as well on samples with SNRs below 10 dB because they had not been sufficiently exposed to such conditions during training. The combination of features extracted from multiple domains (higher-order statistics, wavelet transforms, envelope analysis, and derivatives) was instrumental in enhancing the models’ ability to distinguish between different types of disturbances.
The significant distinction between our work and many high-performing studies in the literature lies in our methodology. Many papers, particularly those employing Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs), adopt an end-to-end classification approach. These models act as a “black box”, extracting features directly from the raw signal, which can lead to high accuracy but lacks interpretability.
In contrast, our study utilizes a handcrafted feature extraction method, which provides greater insight into the physical characteristics of the disturbances that the model is learning. This approach offers significant practical benefits, including a substantial reduction in the computational resources and time required to obtain the parameters and train the models.
While a direct quantitative comparison is challenging due to these methodological differences, our results are highly competitive. For instance, the accuracy achieved by the Python DNN at 1 dB SNR demonstrates a level of noise robustness that is on par with, and in some cases surpasses, the performance of more complex end-to-end systems reported in the literature, particularly under high-noise conditions. Our findings thus provide a valuable contribution by showing that a computationally efficient and interpretable feature-based approach can achieve comparable, high-performance results.
Future research will focus on developing a simulated smart grid to conduct more realistic classifier testing. Additional PQDs will be incorporated into the dataset, and further research will be conducted to identify optimal hyperparameters and features to improve performance in highly noisy environments.
6. Conclusions
This study demonstrates that both Machine Learning (ML) and Deep Learning (DL) models can achieve highly reliable Power Quality Disturbance (PQD) classification in noisy environments. The research confirms the viability of these models for real-world applications, with consistently high accuracies for Signal-to-Noise Ratios (SNRs) of 10 dB or higher.
At extreme noise levels, the performance differences become particularly significant. While some ML models, such as Python’s Gradient Boosting and Random Forest, remained robust at 1 dB SNR, the Python-based Dense Neural Network (DNN) emerged as the most resilient classifier, surpassing the other models by 5–7 percentage points and the MATLAB DNN by nearly 7 points at 1 dB SNR.
This numerical superiority has direct and crucial engineering implications. Maintaining high classification accuracy under severely noisy conditions is critical for industrial grids, as it significantly reduces the risk of misclassification and enhances system reliability. Furthermore, the Python DNN’s high efficiency, with an inference time of approximately 18–21 s, makes it exceptionally suitable for near real-time monitoring applications where rapid response is mandatory.
The findings underscore two critical points. First, effective PQD classification in adverse conditions depends on meticulous feature engineering and robust preprocessing strategies. Second, there is no single universally optimal model. The ideal classifier must be carefully selected based on a trade-off between the expected noise level and computational constraints. Our results suggest that while MATLAB implementations remain competitive at higher SNRs, Python combined with a DNN offers a superior balance of noise resilience and computational efficiency, making it the most compelling choice for practical and robust PQD applications.