Article

Power Outage Prediction on Overhead Power Lines on the Basis of Their Technical Parameters: Machine Learning Approach

1 Laboratories of Power Supply, Electrical Equipment and Renewable Energy, Federal Scientific Agroengineering Center VIM, 109428 Moscow, Russia
2 Laboratories of Electrical, Thermal Technologies and Energy Saving, Federal Scientific Agroengineering Center VIM, 109428 Moscow, Russia
3 Independent Researcher, Houston, TX 77077, USA
4 Independent Researcher, 196191 St. Petersburg, Russia
* Author to whom correspondence should be addressed.
Energies 2025, 18(18), 5034; https://doi.org/10.3390/en18185034
Submission received: 7 August 2025 / Revised: 25 August 2025 / Accepted: 19 September 2025 / Published: 22 September 2025

Abstract

In this study, data on the characteristics of high-voltage overhead power lines were used in a classification task to predict power supply outages by means of supervised machine learning. To select the optimal features for outage prediction, an Exploratory Data Analysis of power line parameters was carried out using statistical and correlational methods. Five classifiers were considered as machine learning algorithms: Support Vector Machine, Logistic Regression, Random Forest, and two gradient-boosting algorithms over decision trees, the LightGBM Classifier and the CatBoost Classifier. To automate the process of data conversion and eliminate the possibility of data leakage, a Pipeline and a Column Transformer (a builder of heterogeneous features) were applied; data for the models were prepared using One-Hot Encoding and standardization techniques. The data were divided into training and validation samples through cross-validation with stratified splitting. The hyperparameters of the classifiers were tuned using optimization methods: randomized and exhaustive search over specified parameter values. The results of the study demonstrate the potential for predicting power failures on 110 kV overhead power lines based on data on their parameters, as can be seen from the quality metrics of the tuned classifiers. The best outage prediction quality was achieved by the Logistic Regression model, with a ROC AUC of 0.78 and an AUC-PR of 0.68. In the final phase of the research, the influence of power line parameters on failure probability was analyzed using the embedded method for determining the feature importance of various models, including estimation of the vector of regression coefficients. This allowed the numerical impact of power line parameters on power supply outages to be evaluated.

1. Introduction

Electrical energy is distributed to consumers via a sophisticated network of supports, transformers, overhead and underground power lines at various voltages, linear equipment of various kinds, and other equipment. The power supply system (PSS) reliability, that is, the probability of an uninterrupted supply of electrical energy to consumers, increases with the use of new, innovative solutions, including modern means of protection, monitoring, and management of electrical networks [1,2]. Despite the overall high reliability of PSSs, failures on power transmission lines (PTLs) are a relatively common occurrence, accounting for 35–50% of all failures in power supply systems with voltages of 35–750 kV [3]. The large territorial extent of power lines and their vulnerability to weather-related events are factors contributing to this volume of outages [4,5]. While adverse weather conditions are the most common cause of failure, other factors must also be considered, such as wear and tear and the aging of the electrical network infrastructure [6].
The analysis of data on power outages allows us to determine the factors that have the greatest impact on failure probability. There are many works devoted to the statistical analysis of failures on power lines, examining, for example, situations in various countries such as the Russian Federation [7,8,9], Estonia [10], Iraq [11], and Bulgaria [12]. These studies aim to determine the causes of power line outages and identify primary strategies for reducing their frequency. Statistical analysis is also used in work [13] to examine power line failures that occur as a result of icing conditions. In work [14], in addition to statistical data processing, a regression analysis of the time series of PSS failures was carried out, which made it possible to determine the primary causes of emergency outages; parameters and quality criteria were also determined for trend inverse models of the average monthly failure frequency.
Although some works on the statistical analysis of emergency outages have identified the causes of failures, indicating particular failed elements, they have not considered the influence of element types on the probability of power line failures. Predictive modeling based on machine learning (ML) techniques is one of the advanced approaches to calculating the failure probability [15,16]. For example, in work [17], the use of a three-dimensional Support Vector Machine (SVM) is proposed for predicting outages of power system components; in work [18], the same method is used to locate damage in the system by measuring the magnitude and angle of the incident voltage at the primary substation of the distribution system. Article [19] presents an approach to identifying equipment faults in distribution systems, which solves a binary classification problem separating outages into two categories: equipment failures and failures unrelated to the equipment. Three algorithms are used as classifiers: a decision tree, Logistic Regression, and a naive Bayes classifier. The application of artificial neural networks is presented in [20] for multi-class fault classification in power supply systems based on the current and voltage values on all power transmission lines (PTLs). In [21], the authors successfully applied a Graph Neural Network model for predicting cascading failures in power systems, evaluated on the IEEE 39-bus and 118-bus test systems. In study [22], the authors apply a Natural Language Processing (NLP) approach to predict the time required for power supply restoration, utilizing a recurrent neural network (RNN) to analyze historical outage reports and repair logs.
A significant number of studies focus on applying machine learning methods to predicting outages in power distribution networks during adverse weather conditions [23,24], in particular hurricanes, tropical storms, rain, and windstorms, as well as coastal floods, drawing on databases from the USA [24], France [25], Puerto Rico [26], and China [27]. To determine the number of outages in the event of a typhoon, work [28] applies a Random Forest algorithm to a multi-class classification problem, work [29] applies a gradient-boosting algorithm, and work [27] applies a two-level ensemble whose final level is XGBoost gradient boosting. Furthermore, research [30] proposes an approach for predicting outages in distribution systems caused by environmental factors through the use of deep neural networks with independent blocks combined into an ensemble. In a related study, the authors of [17] utilize a three-dimensional Support Vector Machine (SVM) to predict power grid component failures based on three features: component deterioration, distance from an extreme event, and the event's intensity. A Convolutional Neural Network-Long Short-Term Memory architecture is proposed in research [31] to predict power outage areas during extreme weather events.
Other studies investigate specific causes of power outages, such as those related to trees and animals, by leveraging weather data. For instance, the authors in [32] employ a combination of nonlinear machine learning regression and a time series model to identify the primary causes of vegetation-related outages. Their analysis uses outage data from Duke Energy (USA), geographical information, and weather forecasts. Similarly, research [33] analyzes faults caused by falling trees using a Logistic Regression on data related to weather conditions, seasons, the time of day, and activated protective devices. In a study on outages caused by wildlife, researchers in [34] apply AdaBoost algorithms (AdaBoost.RT and AdaBoost+) to predict such events in overhead distribution systems based on weather conditions and the time of year.
Despite the extensive existing literature on power outage prediction, no studies have explored the prediction of PTL failures using data on the lines’ own technical parameters. This represents a critical and unaddressed area, demonstrating the relevance of the conducted research. The purpose of this research is to develop a machine learning model that can solve the classification problem of predicting potential outages on power transmission lines based on their technical specifications.

2. Methodology and Materials

2.1. Materials for Research

This study is based on data on power outages in electrical networks in the Orel region, Russian Federation. In the previous study [35], the data under consideration were analyzed by methods of mathematical statistics, which included processing missing values, removing duplicates, creating synthetic parameters (including a target feature), identifying outliers and anomalies, and choosing the best features for ML models. The end product was a table with 9 features, including the target feature, and 395 objects.
The features listed in the table are as follows.
(a) Three characteristics with categorical values:
-
Outage fact (target feature);
-
Conductor, type, and cross-section;
-
PTL relation to transit (indicates whether the power transmission line is a transit line or not).
(b) Six characteristics with quantitative values:
-
Condition index, % (PTL Technical Condition Score: A rating assigned by maintenance personnel, ranging from 0 (poor) to 100 (excellent));
-
Overhead PTL length, km;
-
Overexploitation, d.q. (this indicator shows whether a PTL has exceeded its standard service life of 35 years. A value greater than 1 indicates that the actual service life exceeds the normative period, while a value less than 1 indicates it does not);
-
Reinforced concrete supports, % (the ratio of reinforced concrete supports to the total number of supports);
-
PTL length through the forest, % (the ratio of the total PTL length in forest areas to the overall length);
-
PTL length in populated areas, % (the ratio of the total PTL length in populated areas to the overall length).

2.2. Research Methodology

The study began with checking the quality of the prepared data using exploratory data analysis methods. This included statistical analysis of the distributions of the quantitative and categorical variables with respect to the target feature, as well as a study of the correlation between variables using the phi correlation coefficient ϕk.
Since the goal of the work is to predict the probability of PTL failure based on its parameters, binary classification is the task for machine learning algorithms. Therefore, five ML models based on the following algorithms were selected as classifiers:
  • Support Vector Machine (SVM);
  • Logistic Regression (LR);
  • Random Forest Classifier (RFC);
  • Gradient-boosting algorithms over decision trees: LightGBM Classifier and CatBoost Classifier.
The selection of classifier hyperparameters was carried out using the method of random parameter optimization (RandomizedSearchCV), and the best parameters found were refined by the GridSearchCV grid search algorithm. The data were divided into training and validation samples through cross-validation with stratified splitting. Data preparation for the ML models was carried out using One-Hot Encoding for the categorical variables and standardization (Standard Scaler) for the quantitative ones. To automate the process of data conversion and model training, as well as to eliminate the possibility of data leakage, a pipeline and a heterogeneous feature builder, Column Transformer, were used. The selection of the best trained ML model was based on the ROC AUC quality metric. In addition to this metric, the quality of the models was assessed using additional metrics [36], such as AUC-PR, Accuracy, Precision, Recall, and the F1-score. At the last stage, the influence of the power line parameters on failure probability was analyzed using the embedded method of determining feature importance in various models, including estimation of the vector of regression coefficients.
In this research, data processing and analysis were carried out in the Python programming language (ver. 3.13.7) in the Jupyter notebook (ver. 7.0.6) development environment of the Anaconda Python software package. To work with tabular data, the Pandas library (ver. 2.1.4) was used, for mathematical processing of data arrays—NumPy (ver. 1.26.3), for data visualization—Matplotlib (ver. 3.8.0) and Seaborn (ver. 0.12.2), and for correlation analysis—Phik (ver. 0.12.4). Scikit-learn library (ver. 1.2.2) was used to build learning algorithms, transform data suitable for ML tasks, and work with the main classical machine learning models. In addition to it, gradient boosting frameworks from the LightGBM (ver. 4.1.0) and CatBoost (ver. 1.2.2) libraries were used. To work with an unbalanced data set, tools from the Imbalanced-learn library were applied.

3. Results

The development of a machine learning model capable of performing a given task involves properly selecting and preprocessing training data, selecting the most appropriate machine learning algorithm, and fine-tuning the model’s hyperparameters to ensure optimal performance through model testing [37]. Each of these stages is an important link in building a quality model in accordance with the metric being measured.

3.1. Exploratory Data Analysis

Before starting to develop the machine learning models, it is necessary to study the prepared data. Figure 1 depicts histograms for the categorical features and distribution density graphs for the quantitative ones; the latter are constructed with a kernel density estimate [38,39] using the Gaussian kernel as the weight function [40].
The following conclusions can be drawn from the given graphs:
  • The histograms of the categorical features show different value distributions for power transmission lines with and without failures, which indicates their impact on the target variable. The low cardinality of the categorical features (four values for the feature “Conductor, type, section” (Figure 1b) and two for “PTL relation to transit” (Figure 1c)) must also be noted. Accordingly, when preparing data for the training of ML models, One-Hot Encoding is considered the most suitable technique for encoding the categorical variables.
  • The density graphs of the quantitative feature distributions differ in shape between power lines with and without outages, so it can be assumed that these features influence the target variable.
  • There is some imbalance in the target attribute “Outage fact” (Figure 1a). There are 163 electrical lines that have failed and 232 that have not; the difference is almost 20%. Class imbalance can cause problems when training machine learning models that are non-probabilistic, such as the SVM (Support Vector Machine), or when solving multi-class classification problems.
The method of correlation analysis was applied to determine the degree of relationship between the variables. The phi correlation coefficient ϕk was used as the correlation measure, since it enables the analysis of both quantitative and categorical variables [41]. The degree of correlation with this method lies in the range 0…1, where 0 means the absence of a relationship between the characteristics, and 1 indicates its maximum degree [42]. The correlation matrix obtained by computing the coefficient ϕk between the variables under investigation is displayed as a heat map in Figure 2.
The correlation analysis showed that
-
The replacement of the absolute numbers of reinforced concrete (RC) and metal supports, as well as the absolute lengths through forest and populated areas, with relative values made it possible to cope with the multicollinearity of these features (a problem discussed at the previous stage of the study [35]), which should have a beneficial effect on the training of ML models;
-
The target attribute “Outage fact” has a moderate but sufficient correlation with the variables under consideration. The highest correlation is shown by the feature “Overhead PTL length” (0.63) and the lowest by the categorical feature “Conductor, type, section” (0.11);
-
A fairly strong correlation (0.85) was revealed between PTL service life and the fact of whether the power line is in transit or not. This issue was discussed at the previous stage of preparing data for machine learning [35], in which it was concluded that transit PTLs, according to the statistics presented, have a longer service life than non-transit lines. Finally, it was decided to leave both characteristics, since these parameters reflect completely different values.

3.2. Algorithm for Training and Tuning Hyperparameters of ML Models

The algorithm for training machine learning models with the method of random parameter optimization is shown in Figure 3. In accordance with the algorithm, a randomized set of model hyperparameters is selected from the distribution over possible parameter values (the parameter grid) and initialized for training (block 4). The number of hyperparameter variations for the RandomizedSearchCV method was chosen to be 100 iterations (the “n_iter” value in block 2 of Figure 3). In the same block, the number of folds for cross-validation is also selected (the “cv” value in block 2 of Figure 3). The value of “scor_iter” (the value of the model's quality metric) is initialized to 0 and is used later to find the best model (blocks 11, 12, and 13). As soon as the number of iterations exceeds the “n_iter” value (block 3), the selection of hyperparameters stops, and the ML model with the hyperparameters corresponding to the best quality metric is produced (block 14).
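The search procedure described above can be sketched with scikit-learn's RandomizedSearchCV, which implements the same loop internally. The synthetic data set, parameter grid, and n_iter value below are illustrative stand-ins, not the study's actual data or Table 1 values:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in for the PTL data set (395 objects, binary target).
rng = np.random.default_rng(0)
X = rng.normal(size=(395, 6))
y = (X[:, 0] + rng.normal(scale=0.5, size=395) > 0).astype(int)

param_distributions = {"C": np.logspace(-3, 3, 50)}  # hypothetical grid

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=param_distributions,
    n_iter=20,                       # "n_iter" in block 2 of Figure 3
    cv=StratifiedKFold(n_splits=5),  # "cv": stratified 5-fold cross-validation
    scoring="roc_auc",               # quality metric compared in blocks 10-13
    random_state=0,
)
search.fit(X, y)  # best_estimator_ corresponds to block 14 of the flowchart
```

The best parameters found this way could then be refined with GridSearchCV over a narrower grid, as the methodology section describes.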

3.3. Feature Encoding and Scaling

A focused and thorough approach to data preparation is critical to the development and application of machine learning models, including neural networks. The success of the ML model training process largely depends on the quality and suitability of the data fed into the models. An important part of the processing was performed in the previous stage [35]: it included the handling of missing values and duplicates, as well as the identification of suitable features for the given classification task. It now remains to convert the categorical variables to numeric representations and to normalize the quantitative variables by scaling or centering their numeric constituents to promote convergence.
As noted above, categorical variables have low cardinality, so it was decided to use the One-Hot Encoding method to encode these variables. To normalize quantitative variables, the data standardization method was used (block 7 of Figure 3). To automate the process of data conversion and model training, as well as to eliminate the possibility of data leakage, a pipeline and a heterogeneous feature builder Column Transformer were used.
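A minimal sketch of this preprocessing step with scikit-learn's ColumnTransformer; the frame below uses hypothetical column names and values that only mimic the paper's features:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny illustrative frame; names mimic (not reproduce) the paper's features.
df = pd.DataFrame({
    "conductor_type": ["AC-70", "AC-95", "AC-70", "AC-120"],
    "transit": ["yes", "no", "yes", "no"],
    "length_km": [12.3, 48.0, 7.5, 30.2],
    "condition_index": [80, 55, 90, 60],
})

# One-Hot Encoding for the low-cardinality categorical columns,
# standardization for the quantitative ones.
preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["conductor_type", "transit"]),
    ("num", StandardScaler(), ["length_km", "condition_index"]),
])

X = preprocessor.fit_transform(df)
# 3 conductor categories + 2 transit categories + 2 numeric columns = 7 columns
```

Wrapped in a Pipeline, the transformer is fitted only on training folds, which is what prevents data leakage during cross-validation.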
It was decided to use two methods to combat the class imbalances of the target feature: the method of weighting classes (CW) implemented inside models from the Scikit-learn library, and the synthetic minority over-sampling technique for the nominal and continuous (SMOTE-NC) method, implemented through the pipeline of the Imblearn library to automate the process and eliminate the issue of feature leakage. Every model was trained using both techniques, and the outcomes were then shown in the final table.
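The class-weighting (CW) option can be sketched as follows. The data are synthetic, with the same 232/163 class split as the study's; the SMOTE-NC variant is omitted here, since it relies on the Imbalanced-learn pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Imbalance similar to the paper's data: 232 "no outage" vs. 163 "outage".
rng = np.random.default_rng(1)
X = rng.normal(size=(395, 4))
y = np.array([0] * 232 + [1] * 163)
X[y == 1] += 0.8  # shift the minority class so it is learnable

# class_weight="balanced" reweights samples inversely to class frequency,
# i.e., weight(c) = n_samples / (n_classes * n_samples_in_c).
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
weights = {0: 395 / (2 * 232), 1: 395 / (2 * 163)}  # what "balanced" computes
```

The minority (outage) class receives the larger weight, so misclassifying it costs more during training.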

3.4. Splitting the Data Set into Training and Test Samples

When developing machine learning models, it is also critical to validate model performance on independent data, that is, data the model has not seen before. For this purpose, a test sample is set aside from the source data, usually amounting to 10–20% of it. This approach allows for identifying and, accordingly, mitigating the main problem in training ML models: overfitting, a phenomenon in which trained models learn the answers of the training set but fail to capture patterns in unseen data.
In the research, it was decided to separate the test sample in the amount of 20% and apply the cross-validation method to the training sample (in Figure 3, blocks 5–9 and 10), which implies dividing the data into several parts (folds). Each fold at its training stage should act as a validation sample, the rest as a training one. In our case, we used a 5-fold cross-validation with stratified division, guaranteeing the same class ratio in all samples, which is especially important for unbalanced data. Thus, the number of training iterations on one set of model hyperparameters depends on the number of folds (the “cv” value, block 5 of Figure 3). Then, according to the algorithm, as soon as training was carried out on all folds, the average value of the quality metric was calculated (block 10) and compared with the best metric in the condition (block 11).
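A sketch of this splitting scheme, using a synthetic stand-in for the 395-object data set:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(395, 5))
y = np.array([0] * 232 + [1] * 163)  # class ratio taken from the paper

# 20% hold-out test set, stratified so both splits keep the class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 5-fold stratified cross-validation on the training part
# (Figure 3, blocks 5-10): each fold serves once as the validation sample.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X_tr, y_tr):
    ratio = y_tr[val_idx].mean()  # minority share stays near 163/395 per fold
```

With 395 objects, the 20% hold-out yields 79 test objects, matching the test-set size mentioned in Section 4.1.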

3.5. ML Models and Hyperparameter Grids

3.5.1. Support Vector Machine

The Support Vector Machine (SVM) is a supervised learning algorithm used for classification and regression analysis [43]. The SVM algorithm works by transforming the input data into a high-dimensional space using a kernel function, then finding the hyperplane that best separates the data points into different classes. The main hyperparameters on which the training of an SVM model depends are the regularization coefficient C, the type of kernel used, and, accordingly, its coefficient. The grid of values for these hyperparameters is presented in Table 1.
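An illustrative SVC fit with the three hyperparameters named above; the data and parameter values are placeholders, not those from Table 1:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 2).astype(int)  # non-linearly separable classes

# C is the regularization coefficient, "rbf" the kernel type,
# and gamma the kernel coefficient (all placeholder values).
clf = SVC(C=1.0, kernel="rbf", gamma="scale").fit(X, y)
```

The RBF kernel lets the model separate the radially defined classes above, which no linear hyperplane in the original space could.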

3.5.2. Logistic Regression

Regression modeling is one of the most popular statistical approaches for determining the relationships between a target variable and a set of independent predictors. Regression models are divided into logistic and linear, but a linear regression cannot be used to predict a dichotomous (binary) variable, since such a model outputs continuous values, including negative ones [44]. Therefore, to predict the probability of a binary value, an extended version of linear regression is used: Logistic Regression (LR), which predicts the probability distribution of an event (yes/no or 1/0) through the logit link function. During training, the “saga” optimization algorithm and the “Elastic-Net” regularization type were selected, which together allow both kinds of regularization (l1 and l2) to be used simultaneously. The ratio between the two regularization types was controlled by the hyperparameter “l1_ratio” (Table 1). The regularization strength C was also adjusted.
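A minimal sketch of this configuration in scikit-learn; the l1_ratio and C values are illustrative, not the tuned ones:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 6))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=300) > 0).astype(int)

# "saga" is the scikit-learn solver that supports the elastic-net penalty,
# which mixes l1 and l2; l1_ratio=0.5 is an even blend (a hypothetical value).
clf = LogisticRegression(
    solver="saga", penalty="elasticnet", l1_ratio=0.5, C=1.0, max_iter=5000
).fit(X, y)
```

l1_ratio=1 would reduce to pure l1 (sparse coefficients) and l1_ratio=0 to pure l2, so sweeping this parameter explores the whole family between the two.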

3.5.3. Random Forest

A Random Forest is one of the most widely used machine learning algorithms, applied in both regression and classification problems [45]. The Random Forest algorithm is based on the ensemble learning method: its results are constructed by combining the independent forecasts of the forest, that is, of many decision trees trained on samples obtained using the bootstrap method [46]. In the case of a classification problem, the final answer of the model is based on the voting of the decision trees [47]. It is known that the more decision trees there are, the higher (in most cases) the model quality is; therefore, up to 1000 trees were used when training the Random Forest. The depth of the decision trees also affects prediction quality (generally, the deeper the trees, the higher the quality), so training was carried out with depths from 1 to 21. In addition, all splitting criteria, the limit on the number of objects in leaves, and the minimum number of objects required for a split were also examined (Table 1).
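An illustrative Random Forest configuration drawn from the ranges above; the specific values are placeholders, not Table 1's:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hyperparameters from the ranges discussed in the text (illustrative values):
# up to 1000 trees, tree depth 1-21, split criterion, and leaf-size limits.
clf = RandomForestClassifier(
    n_estimators=500,        # number of trees in the forest
    max_depth=10,            # depth within the 1-21 range examined
    criterion="gini",        # one of the splitting criteria considered
    min_samples_leaf=2,      # limit on the number of objects in leaves
    min_samples_split=4,     # minimum objects required to perform a split
    random_state=0,
).fit(X, y)
```

Each of these names maps directly onto a row of the hyperparameter grid in Table 1, so the same object can be passed to RandomizedSearchCV for tuning.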

3.5.4. Gradient Boosting Algorithms LightGBM and CatBoost

The method of gradient boosting over decision trees is also used, the essence of which is to train a certain number of models (in our case, decision trees), taking into account the errors obtained on previous models. Gradient boosting allows for building an additive function in the form of a sum of decision trees iteratively by analogy with the gradient descent method [48]. Thus, this approach enables achieving a higher prediction accuracy.
As gradient boosting algorithms, the LightGBM Classifier and CatBoost Classifier were used [49]. CatBoost is an open-source gradient boosting library introduced by Yandex in 2017 for supervised machine learning, containing two innovations: ordered target statistics and ordered boosting [50]. A distinctive feature of CatBoost is its ability to work with heterogeneous data sets containing different data types. CatBoost is used for regression, classification, and ranking problems. LightGBM is a gradient boosting library introduced by Microsoft DMTK, also in 2017. Due to its speed and high performance [51,52], this model is widely used in solving regression, classification, and other ML problems. Like CatBoost, LightGBM has built-in support for encoding categorical variables.
Gradient-boosting algorithms overfit quite quickly, so hyperparameters must be selected carefully to obtain a high-quality model. The number of decision trees, their depth, and the number of leaves (terminal nodes) are important parameters, as they are for a Random Forest model. Nonetheless, while for a Random Forest a large tree depth in most cases has a positive effect on prediction quality, for gradient boosting a moderate depth is recommended to achieve a balance between learning and generalization. Other important parameters are the learning rate, which is the degree of contribution of each tree to the model's prediction, the type of boosting algorithm (gbdt, dart, or goss), and the degree of regularization.
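The parameters named above could be organized into search grids along the following lines; all value ranges are hypothetical, not those of Table 1:

```python
# Hypothetical hyperparameter grids echoing the parameters named in the text.
lgbm_grid = {
    "n_estimators": [100, 300, 500],       # number of trees
    "max_depth": [3, 5, 7],                # moderate depth, as recommended
    "num_leaves": [15, 31, 63],            # terminal nodes per tree
    "learning_rate": [0.01, 0.05, 0.1],    # contribution of each tree
    "boosting_type": ["gbdt", "dart", "goss"],
    "reg_lambda": [0.0, 0.1, 1.0],         # degree of l2 regularization
}
catboost_grid = {
    "iterations": [100, 300, 500],         # number of trees
    "depth": [3, 5, 7],                    # moderate depth
    "learning_rate": [0.01, 0.05, 0.1],
    "l2_leaf_reg": [1.0, 3.0, 9.0],        # degree of regularization
}
```

Note the deliberately shallow depth ranges compared with the 1–21 range used for the Random Forest: boosting compensates for weak individual trees by adding them sequentially.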

3.6. ML Model Quality Assessment

The efficiency of predicting a dichotomous variable in ML models is assessed by various metrics, calculated on the basis of the confusion matrix (error matrix). The confusion matrix is a classification of model prediction results into true positive ones (TP), true negative ones (TN), false positive ones (FP), and false negative ones (FN), as shown in Figure 4.
Since there is a slight class imbalance in the data under study, the quality of the model was assessed using the ROC AUC metric, which is robust to this problem. The ROC AUC metric is the area under the ROC curve (receiver operating characteristic curve), which displays the relationship between the numbers of correctly and incorrectly classified answers as the threshold of the decision rule is varied [53]. ROC AUC effectively captures the trade-off between the true positive rate and the false positive rate, which is critical for our task, where missing a true event (a high false negative rate) is as important as avoiding excessive false alarms (a high false positive rate). The metric values range from 0 to 1, where 1 indicates a high-quality model, 0.5 corresponds to random prediction, and 0 to a fully inverted prediction, that is, the model's predictions are opposite to the true values. The true positive rate (TPR) and false positive rate (FPR) are found by Equations (1) and (2).
TPR = TP / (TP + FN) (1)
FPR = FP / (FP + TN) (2)
In addition to ROC AUC, which also served as the scoring function when tuning the models, additional quality metrics were calculated to assess model performance [36], such as the following:
-
Accuracy—proportion of correct answers:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
-
Precision—proportion of true positive predictions among all positive predictions of the model:
Precision = TP / (TP + FP)
-
Recall—proportion of true positive predictions among all positive cases:
Recall = TP / (TP + FN)
-
F1-score: harmonic mean of Recall and Precision, taking a value from 0 to 1 and allowing model quality to be assessed on an unbalanced data set:
F1-score = 2 × Precision × Recall / (Precision + Recall)
-
AUC-PR: area under the PR curve, which displays the relationship between Precision and Recall. Unlike ROC AUC, AUC-PR is sensitive to class imbalance.
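A worked example of the definitions above on a made-up confusion matrix; the counts are illustrative, not the paper's results:

```python
# Hypothetical confusion-matrix counts (not the study's actual numbers).
TP, TN, FP, FN = 60, 180, 52, 19

tpr = TP / (TP + FN)                        # Equation (1), equals Recall
fpr = FP / (FP + TN)                        # Equation (2)
accuracy = (TP + TN) / (TP + TN + FP + FN)  # proportion of correct answers
precision = TP / (TP + FP)                  # true positives among predicted positives
recall = tpr                                # true positives among actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
```

For these counts precision ≈ 0.536 while recall ≈ 0.759, and the F1-score ≈ 0.628 sits between them, closer to the smaller value, which is exactly why it is informative on unbalanced data.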

4. Discussion

4.1. Model Training Results

The results of model training are summarized in Table 2, sorted by the ROC AUC metric. Table 3 displays the hyperparameters of the top-performing models. As an adequacy check, a constant DummyClassifier model was added alongside the considered models, with a “uniform” classification strategy that generates predictions randomly with equal probability for each class.
Experiments were also carried out combining the two methods for handling class imbalance, CW and SMOTE-NC. For all the models analyzed, this combined approach did not improve performance compared with using each method separately. Although further tuning of the combination might yield better results, these findings were not included in Table 2 to avoid visual clutter.
According to Table 2, all trained ML models showed results substantially higher than the predictions of the constant model. The best method for dealing with the unbalanced sample turned out to be class weighting. According to the ROC AUC metric, the best model was the Logistic Regression with class weighting (0.779), a result that reflects the simplicity of the model and its good fit to the data. This model also showed the best AUC-PR (0.68), which indicates successful prediction of the target feature under the conditions of an unbalanced sample. The model was evaluated on the test set and showed an excellent result, an ROC AUC of 0.84. The metric value turned out to be higher than on the validation set, which indicates the absence of overfitting; however, such a discrepancy between the metrics on different samples points to insufficient data (there were only 79 objects in the test set). The small-sample problem is clearly visible in the ROC and PR curves, which have sharp “breaks” due to insufficient data (Figure 5).
Despite this, the model copes with the task, and the graphical display of the ROC curve (Figure 5a) shows that a high result can be achieved by changing the classification threshold. For example, if the model's prediction must capture 100% of the power lines on which failures are observed (no false negatives, i.e., no type II errors), then, according to the ROC curve (Figure 5a), up to 60% of power lines without failures will also be assigned to this class (false positives, i.e., type I errors). Conversely, if prediction accuracy for both classes is desired, then, at a classification threshold of approximately 0.4, the TPR reaches 0.8 while the FPR is only 0.2.
The PR curve (Figure 5b) shows that Precision stabilizes at 0.72…0.75 for Recall between 0.35 and 0.78, corresponding to classification thresholds of 0.45…0.58. Thus, knowing these metrics at particular classification thresholds makes it possible to regulate the number of errors of each type. The optimal threshold for deployment must be chosen based on the specific requirements of the end application.
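The threshold analysis described above can be reproduced with scikit-learn's `roc_curve`. In this sketch, the labels and scores are synthetic stand-ins for the model's predicted outage probabilities, not the study's data:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic labels and scores: lines with failures (1) tend to score higher.
rng = np.random.default_rng(0)
y_true = np.array([0] * 60 + [1] * 40)
y_score = np.concatenate([rng.beta(2, 5, 60), rng.beta(5, 2, 40)])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Price of catching every failure: the FPR at the lowest threshold with TPR = 1.
full_recall_fpr = fpr[tpr >= 1.0].min()

# A balanced operating point via Youden's J statistic (TPR - FPR).
best = int(np.argmax(tpr - fpr))
print(f"full-recall FPR = {full_recall_fpr:.2f}, "
      f"balanced threshold = {thresholds[best]:.2f} "
      f"(TPR = {tpr[best]:.2f}, FPR = {fpr[best]:.2f})")
```

Sweeping the threshold in this way is exactly how the operating points cited in the text (e.g., TPR 0.8 at FPR 0.2) are read off the curve.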

4.2. Feature Importance Analysis by Embedded Methods

In Section 3.1 of this paper, the correlation between features was analyzed using the ϕk correlation coefficient, which belongs to the statistical (filter) feature selection methods. During this analysis, the degree of influence of the PTL parameters on the target variable “Outage fact” was determined, and the features were processed accordingly to achieve the best correlation. At the same time, many ML models themselves determine the degree of influence of features on the final prediction and optimize that influence to obtain the best result. Analyzing such data allows for an even better understanding of feature importance in the given task.
Feature selection based on a model’s own parameters is known as the embedded method for assessing feature importance [54]. For regression models, this method consists of analyzing the (regularized) weight coefficient of each feature, as shown for distribution network fault diagnosis in North Carolina (USA) with the LASSO/ALASSO feature selection method [55]. For models based on decision trees, the appropriate method is the analysis of importance indicators for each feature, calculated at the training stage using the Gini criterion. Figure 6 shows the feature importances for the best ML models trained in this article: Logistic Regression and CatBoost Classifier.
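Extracting both kinds of embedded importances can be sketched as follows; the data here are synthetic, and scikit-learn's RandomForestClassifier stands in for CatBoost (the actual models and data are those described in the article):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the PTL feature matrix.
X, y = make_classification(n_samples=400, n_features=6, n_informative=3, random_state=0)
X = StandardScaler().fit_transform(X)  # weights are comparable only on scaled features

# Embedded method for a linear model: the signed weight coefficient per feature.
lr = LogisticRegression(max_iter=1000).fit(X, y)
lr_weights = lr.coef_.ravel()

# Embedded method for a tree ensemble: Gini-based importances accumulated
# over all splits during training, normalized to sum to one.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
gini_importance = rf.feature_importances_
```

Note the interpretive difference exploited below: linear weights carry a sign (direct or inverse relationship with the target), whereas Gini importances are non-negative shares of the total.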
As expected, the greatest contribution to the prediction of both models is made by the feature “Length of PTL overhead sections, km”; for the CatBoost Classifier, the influence of this feature reaches 48%. This once again confirms that the number of failures on a power line grows with its length. For this reason, the failure flow parameter (the number of failures per unit length of power line) is frequently used to evaluate the effectiveness of energy supply organizations.
The features describing the length of power lines in forested and populated areas were converted into relative units of the total line length, so their influence on the model results turned out to be less significant. Nevertheless, an interesting finding is the inverse relationship between power line failures and the relative length in populated areas (the LR weight coefficient is −0.27, with an importance of 3% for the CatBoost Classifier); that is, the longer the PTL section in a populated area, the lower the failure probability. This can likely be attributed to the enhanced control and maintenance of power lines in populated areas, which is due to their critical social importance and potential risks to the public. In turn, the CatBoost Classifier (Figure 6b) assigns a 3% importance to the PTL length through the forest, while the corresponding LR weight coefficient is only 0.02; this small value indicates that tree and shrub vegetation is well cleared in the protective zone of 110 kV power lines, so its influence on failures is not evident.
A previous EDA study [35] found that the number of supports strongly correlates with the length of power lines, so the current research used a synthetic feature reflecting the share of reinforced concrete (RC) supports in the total number of supports (RC plus metal). It was revealed that a higher share of RC supports relative to metal supports entails an increased probability of failures on power lines: the importance of this feature in the CatBoost Classifier is 15.5%, and the weight coefficient of the logistic model is 0.2.
Whether a line is a transit line strongly influences the probability of outages on 110 kV power lines. This feature is categorical and, in the data preparation algorithm proposed in this work, was processed by the One-Hot Encoding method, which produced two features, the fact of PTL transit and the fact of PTL non-transit, each of which influences the CatBoost prediction with an importance of almost 10%. The weight coefficients of the Logistic Regression also confirm the strong influence of this feature on predicting the power outage fact: the fact of transit has a direct relationship with the target variable (weight coefficient of 0.49), while the fact of non-transit has an inverse relationship (weight coefficient of −0.59).
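The expansion of the transit attribute into two binary features can be illustrated as follows; the column values below are hypothetical, but the encoder behaves as described in the article:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column: is the PTL a transit line?
transit = np.array([["transit"], ["non-transit"], ["transit"], ["transit"]])

# One-Hot Encoding expands the single category into two binary columns:
# the fact of PTL non-transit and the fact of PTL transit.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(transit).toarray()
print(list(encoder.categories_[0]))  # ['non-transit', 'transit']
print(encoded)                       # exactly one 1 per row, in the matching column
```

Because the two derived columns are mutually exclusive, a positive weight on one and a negative weight on the other (as with 0.49 and −0.59 above) describe the same underlying effect from both sides.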
The service life expressed through the “Overexploitation” feature affects the target variable rather weakly (the feature importance is 3.3%; the weight coefficient is 0.04). While this may seem counterintuitive, it can reflect the reality of maintenance practices for this specific infrastructure. The 110 kV PTLs are critical components of the power supply network. Consequently, they receive significant and consistent maintenance, which helps mitigate the effects of aging. This rigorous maintenance regimen likely prevents any significant correlation between the age of the line and its vulnerability to outages.
The influence of the PTL condition index on the target variable is rather unintuitive. One might logically assume that the worse the condition of the line, the more probable a failure. However, this is not observed in the feature importance analysis: the feature contributes only 3.8% to the CatBoost Classifier prediction, while the LR weight coefficient is −0.04. Such a weak influence can be explained by the artificial adjustment of this parameter in the reporting documents of electric grid companies, depending on the need for scheduled maintenance.
Considering the importance of the conductor type for the model prediction, it should be noted that the weight coefficients for conductors of the AC-185, AC-120, and AC-95 types are equal to zero, which indicates the absence of their influence on power line failures. However, there is a noticeable negative dependence between the AC-150 conductor and the target variable (−0.24). This can be explained by the small sample and the particular characteristics of the electrical networks in the Orel region: AC-type conductors with cross-sections of 150 and 185 are more often transit lines (in 78% and 80% of cases, respectively), while conductors with a cross-section of 150 are transit in only 56% of cases.

5. Conclusions

As part of this study, it was proposed to solve the classification task of predicting power supply outages on 110 kV power lines based on the technical parameters of the lines themselves by means of supervised machine learning. Five classifiers were considered as ML algorithms: Support Vector Machine, Logistic Regression, Random Forest, LightGBM Classifier, and CatBoost Classifier. Data for the models were prepared using One-Hot Encoding and standardization for categorical and quantitative variables, respectively. To automate data conversion and eliminate the possibility of data leakage, Pipeline and Column Transformers were applied. The hyperparameters of the classifiers were tuned using randomized and exhaustive search over specified parameter values.
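A minimal sketch of this workflow (a ColumnTransformer for heterogeneous features inside a Pipeline, tuned by randomized search with stratified cross-validation) is given below; the toy DataFrame only imitates the structure of the PTL data and is not the study's data set:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data imitating the PTL table: one categorical and two numeric features.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "transit": rng.choice(["transit", "non-transit"], 120),
    "length_km": rng.uniform(1, 60, 120),
    "rc_supports": rng.uniform(0, 1, 120),
})
y = (df["length_km"] + 20 * df["rc_supports"] + rng.normal(0, 8, 120) > 45).astype(int)

# OHE for the categorical column and standardization for the numeric columns;
# fitting both inside the Pipeline within each CV fold prevents data leakage.
prep = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["transit"]),
    ("num", StandardScaler(), ["length_km", "rc_supports"]),
])
pipe = Pipeline([
    ("prep", prep),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Randomized search over the regularization strength with stratified folds.
search = RandomizedSearchCV(
    pipe, {"clf__C": np.arange(0.1, 5.1, 0.1)}, n_iter=5,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="roc_auc", random_state=0,
).fit(df, y)
print(search.best_params_, round(search.best_score_, 3))
```

Because the preprocessors are fitted separately on each training fold, no statistics from the validation fold leak into the transformation, which is the leakage-avoidance property emphasized above.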
According to the ROC AUC metric, the best quality was obtained by Logistic Regression with class weighting to combat class imbalance. The model showed a result of 0.78, consistent with the simplicity of the model and the quality of its fit. This model also produced the highest AUC-PR metric (0.68), demonstrating that the target feature could be successfully predicted even on an unbalanced sample. Additionally, the Logistic Regression performed well on the test sample, with a ROC AUC of 0.84, which supports the absence of overfitting. However, the marked excess of the quality metrics on the test sample relative to the validation sample points to a data shortage issue.
The use of embedded methods for assessing feature importance in the Logistic Regression and CatBoost Classifier made it possible to evaluate the degree of influence of PTL parameters on the occurrence of power outages. As anticipated, the feature “Length of overhead power line sections” contributes the most to the prediction of both models (its importance in the CatBoost Classifier prediction reaches 48%). The PTL length in populated areas shows an inverse relationship with the target variable; that is, the greater the PTL length in such areas, the less probable a failure. At the same time, the PTL length through the forest correlates rather weakly with the failure fact. In terms of support types, it was revealed that an increased proportion of reinforced concrete supports relative to metal supports results in a higher risk of power line failures: the importance of this feature in the CatBoost Classifier is 15.5%, while the weight coefficient of the logistic model is 0.2. The transit fact of the PTL has a significant impact on the probability of outages (the importance of this feature in the CatBoost Classifier is almost 10%): the transit fact has a direct relationship with the target variable (the LR weight coefficient is 0.49), while the non-transit fact has an inverse relationship (the weight coefficient is −0.59). The service life and condition of power lines have a negligible impact on the target variable. This feature analysis can serve as a valuable starting point for guiding future power network reconstruction and maintenance efforts.
The results of this study showed the possibility of predicting power outages on 110 kV power lines based on data on the technical parameters of the lines themselves, as can be seen from the derived quality metrics of the ML models. Nevertheless, due to the limited data set (395 objects), it was not possible to achieve a fully consistent result: the notable discrepancy between the test and validation metrics indicates that the metric estimates themselves are unstable on samples of this size. Therefore, future research should focus on mitigating this limitation by expanding the data set to include power lines from other regions and/or incorporating data from additional time periods. This expansion is crucial for building more robust and generalizable models and moving toward a practical, deployable solution for outage prediction.

Author Contributions

Conceptualization, V.B.; methodology, V.B.; software, V.B. and A.D.; validation, V.B., D.B., A.D. and R.K.; formal analysis, V.B. and D.B.; investigation, V.B. and D.B.; resources, V.B., D.B. and R.K.; data curation, V.B., D.B., A.D. and R.K.; writing—original draft preparation, V.B., D.B. and R.K.; writing—review and editing, V.B., D.B., A.D. and R.K.; visualization, V.B., D.B. and A.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the findings of this study are available upon reasonable request from the corresponding author. The code used for the research is publicly available in the GitHub repository at [https://github.com/VadZhen/Agro_and_energy_science_projects/tree/main/enery_outages (accessed on 18 September 2025)].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Long, L. Research on Status Information Monitoring of Power Equipment Based on Internet of Things. Energy Rep. 2022, 8, 281–286. [Google Scholar] [CrossRef]
  2. Sun, B.; Jing, R.; Zeng, Y.; Li, Y.; Chen, J.; Liang, G. Distributed Optimal Dispatching Method for Smart Distribution Network Considering Effective Interaction of Source-Network-Load-Storage Flexible Resources. Energy Rep. 2023, 9, 148–162. [Google Scholar] [CrossRef]
  3. Bazan, T.V.; Galaburda, Y.V.; Iselenok, E.B. Analysis of outages of 35-750 kV overhead lines. In Current Problems of Energy. Electrical Power Systems; BNTU: Minsk, Republic of Belarus, 2020; pp. 114–116. [Google Scholar]
  4. Yang, L.; Teh, J. Review on Vulnerability Analysis of Power Distribution Network. Electr. Power Syst. Res. 2023, 224, 109741. [Google Scholar] [CrossRef]
  5. Shakiba, F.M.; Shojaee, M.; Azizi, S.M.; Zhou, M. Real-Time Sensing and Fault Diagnosis for Transmission Lines. Int. J. Netw. Dyn. Intell. 2022, 1, 36–47. [Google Scholar] [CrossRef]
  6. Latka, M.; Hadaj, P. Technical and Statistical Analysis of the Failure of Overhead Lines and Its Impact on Evaluating the Quality of the Power Supply. In Proceedings of the 2016 Progress in Applied Electrical Engineering, PAEE 2016, Koscielisko-Zakopane, Poland, 26 June–1 July 2016. [Google Scholar] [CrossRef]
  7. Ildiryakov, S.R.; Vafin, S.I. Statistical Analysis of Voltage Failures in the Power Supply System of JSC Kazanorgsintez. Energy Probl. 2011, 3, 73–81. [Google Scholar]
  8. Lanin, A.V.; Polkovskaya, M.N.; Yakupov, A.A. Statistical Analysis of Emergency Outages in 10 kV Electrical Networks. Curr. Issues Agric. Sci. 2019, 30, 45–52. [Google Scholar]
  9. Vinogradov, A.; Vasiliev, A.; Bolshev, V.; Semenov, A.; Borodin, M. Time Factor for Determination of Power Supply System Efficiency of Rural Consumers. In Handbook of Research on Renewable Energy and Electric Resources for Sustainable Rural Development; Kharchenko, V., Vasant, P., Eds.; IGI Global: Hershey, PA, USA, 2018; pp. 394–420. ISBN 9781522538677. [Google Scholar]
  10. Zahraoui, Y.; Korõtko, T.; Rosin, A.; Mekhilef, S.; Seyedmahmoudian, M.; Stojcevski, A.; Alhamrouni, I. AI Applications to Enhance Resilience in Power Systems and Microgrids—A Review. Sustainability 2024, 16, 4959. [Google Scholar] [CrossRef]
  11. Salman, H.M.; Pasupuleti, J.; Sabry, A.H. Review on Causes of Power Outages and Their Occurrence: Mitigation Strategies. Sustainability 2023, 15, 15001. [Google Scholar] [CrossRef]
  12. Ivanova, M.; Dimitrova, R.; Filipov, A. Analysis of Power Outages and Human Errors in the Operation of Equipment in Power Grids. In Proceedings of the 2020 12th Electrical Engineering Faculty Conference, BulEF 2020, Varna, Bulgaria, 9–12 September 2020. [Google Scholar] [CrossRef]
  13. Ratushnyak, V.S.; Ilyin, E.S.; Vakhrusheva, O.Y. Statistical Analysis of Emergency Power Outages due to Ice Formation on Power Line Wires on the Territory of the Russian Federation. Young Sci. Sib. Electron. Sci. Mag. 2018, 1, 107–113. [Google Scholar]
  14. Sbitnev, E.A.; Zhuzhin, M.S. Accident Analysis of 0.38 kV Rural Electric Networks of the Nizhny Novgorod Power System. Bull. NGIEI 2020, 11, 36–47. [Google Scholar]
  15. Sood, S. Power Outage Prediction Using Machine Learning Technique. In Proceedings of the 2023 International Conference on Power Energy, Environment & Intelligent Control (PEEIC), Greater Noida, India, 19–23 December 2023; pp. 78–80. [Google Scholar] [CrossRef]
  16. Haleem Medattil Ibrahim, A.; Sadanandan, S.K.; Ghaoud, T.; Subramaniam Rajkumar, V.; Sharma, M. Incipient Fault Detection in Power Distribution Networks: Review, Analysis, Challenges, and Future Directions. IEEE Access 2024, 12, 112822–112838. [Google Scholar] [CrossRef]
  17. Eskandarpour, R.; Khodaei, A. Leveraging Accuracy-Uncertainty Tradeoff in SVM to Achieve Highly Accurate Outage Predictions. IEEE Trans. Power Syst. 2018, 33, 1139–1141. [Google Scholar] [CrossRef]
  18. Gururajapathy, S.S.; Mokhlis, H.; Illias, H.A.B.; Abu Bakar, A.H.; Awalin, L.J. Fault Location in an Unbalanced Distribution System Using Support Vector Classification and Regression Analysis. IEEJ Trans. Electr. Electron. Eng. 2018, 13, 237–245. [Google Scholar] [CrossRef]
  19. Doostan, M.; Chowdhury, B.H. Power Distribution System Equipment Failure Identification Using Machine Learning Algorithms. In Proceedings of the IEEE Power and Energy Society General Meeting, Chicago, IL, USA, 16–20 July 2017; pp. 1–5. [Google Scholar] [CrossRef]
  20. Warlyani, P.; Jain, A.; Thoke, A.S.; Patel, R.N. Fault Classification and Faulty Section Identification in Teed Transmission Circuits Using ANN. Int. J. Comput. Electr. Eng. 2011, 3, 807–811. [Google Scholar] [CrossRef]
  21. Bhaila, K.; Wu, X. Cascading Failure Prediction in Power Grid Using Node and Edge Attributed Graph Neural Networks. In Proceedings of the International Joint Conference on Neural Networks, Yokohama, Japan, 30 June–5 July 2024. [Google Scholar] [CrossRef]
  22. Jaech, A.; Zhang, B.; Ostendorf, M.; Kirschen, D.S. Real-Time Prediction of the Duration of Distribution System Outages. IEEE Trans. Power Syst. 2019, 34, 773–781. [Google Scholar] [CrossRef]
  23. Alqudah, M.; Obradovic, Z. Enhancing Weather-Related Outage Prediction and Precursor Discovery Through Attention-Based Multi-Level Modeling. IEEE Access 2023, 11, 94840–94851. [Google Scholar] [CrossRef]
  24. Allen, M.; Fernandez, S.; Omitaomu, O.; Walker, K. Application of Hybrid Geo-Spatially Granular Fragility Curves to Improve Power Outage Predictions. J. Geogr. Nat. Disasters 2014, 4, 2167–2587. [Google Scholar]
  25. Lair, W.; Michel, G.; Meyer, F.; Chapert, M.; Decroix, H. Windy Smart Grid; Forecasting the Impact of Storms on the Power System. In Book of Extended Abstracts for the 32nd European Safety and Reliability Conference; Research Publishing Services: Singapore, 2022; pp. 905–912. [Google Scholar]
  26. Montoya-Rincon, J.P.; Azad, S.; Pokhrel, R.; Ghandehari, M.; Jensen, M.P.; Gonzalez, J.E. On the Use of Satellite Nightlights for Power Outages Prediction. IEEE Access 2022, 10, 16729–16739. [Google Scholar] [CrossRef]
  27. Hou, H.; Chen, X.; Li, M.; Zhu, L.; Huang, Y.; Yu, J. Prediction of User Outage under Typhoon Disaster Based on Multi-Algorithm Stacking Integration. Int. J. Electr. Power Energy Syst. 2021, 131, 107123. [Google Scholar] [CrossRef]
  28. Li, M.; Hou, H.; Yu, J.; Geng, H.; Zhu, L.; Huang, Y.; Li, X. Prediction of Power Outage Quantity of Distribution Network Users under Typhoon Disaster Based on Random Forest and Important Variables. Math. Probl. Eng. 2021, 2021, 6682242. [Google Scholar] [CrossRef]
  29. Taylor, W.O.; Cerrai, D.; Wanik, D.; Koukoula, M.; Anagnostou, E.N. Community Power Outage Prediction Modeling for the Eastern United States. Energy Rep. 2023, 10, 4148–4169. [Google Scholar] [CrossRef]
  30. Das, S.; Kankanala, P.; Pahwa, A. Outage Estimation in Electric Power Distribution Systems Using a Neural Network Ensemble. Energies 2021, 14, 4797. [Google Scholar] [CrossRef]
  31. Huang, W.; Zhang, W.; Chen, Q.; Feng, B.; Li, X. Prediction Algorithm for Power Outage Areas of Affected Customers Based on CNN-LSTM. IEEE Access 2024, 12, 15007–15015. [Google Scholar] [CrossRef]
  32. Doostan, M.; Sohrabi, R.; Chowdhury, B. A Data-Driven Approach for Predicting Vegetation-Related Outages in Power Distribution Systems. Int. Trans. Electr. Energy Syst. 2020, 30, e12154. [Google Scholar] [CrossRef]
  33. Xu, L.; Chow, M.Y.; Taylor, L.S. Data Mining and Analysis of Tree-Caused Faults in Power Distribution Systems. In Proceedings of the 2006 IEEE PES Power Systems Conference and Exposition, PSCE 2006—Proceedings, Atlanta, GA, USA, 29 October–1 November 2006; pp. 1221–1227. [Google Scholar] [CrossRef]
  34. Kankanala, P.; Pahwa, A.; Das, S. Estimating Animal-Related Outages on Overhead Distribution Feeders Using Boosting. IFAC-PapersOnLine 2015, 48, 270–275. [Google Scholar] [CrossRef]
  35. Bolshev, V.E.; Vinogradova, A.V. Analysis of the Influence of 110 kV Power Line Parameters on the Probability of Their Failures. J. Sib. Fed. Univ. Eng. Technol. 2024, 17, 758–776. [Google Scholar]
  36. Chokr, B.; Chatti, N.; Charki, A.; Lemenand, T.; Hammoud, M. Feature Extraction-Reduction and Machine Learning for Fault Diagnosis in PV Panels. Sol. Energy 2023, 262, 111918. [Google Scholar] [CrossRef]
  37. Maraden, Y.; Wibisono, G.; Nugraha, I.G.D.; Sudiarto, B.; Jufri, F.H.; Kazutaka; Prabuwono, A.S. Enhancing Electricity Theft Detection through K-Nearest Neighbors and Logistic Regression Algorithms with Synthetic Minority Oversampling Technique: A Case Study on State Electricity Company (PLN) Customer Data. Energies 2023, 16, 5405. [Google Scholar] [CrossRef]
  38. Vorontsov, K.V. Mathematical Methods of Teaching by Precedents (Theory of Machine Learning); MachineLearning: Moscow, Russia, 2011. [Google Scholar]
  39. Boutaba, R.; Salahuddin, M.A.; Limam, N.; Ayoubi, S.; Shahriar, N.; Estrada-Solano, F.; Caicedo, O.M. A Comprehensive Survey on Machine Learning for Networking: Evolution, Applications and Research Opportunities. J. Internet Serv. Appl. 2018, 9, 16. [Google Scholar] [CrossRef]
  40. Vershinin, D.S.; Branishti, V.V. Calculation of the Optimal Blurry Parameter for Rosenblatt–Parzen Estimation with Gaussian Kernel. Curr. Probl. Aviat. Astronaut. 2021, 2, 109–111. [Google Scholar]
  41. Baak, M.; Koopman, R.; Snoek, H.; Klous, S. A New Correlation Coefficient between Categorical, Ordinal and Interval Variables with Pearson Characteristics. Comput. Stat. Data Anal. 2020, 152, 107043. [Google Scholar] [CrossRef]
  42. Barnard, G.A. Introduction to Pearson (1900) On the Criterion That a Given System of Deviations from the Probable in the Case of a Correlated System of Variables Is Such That It Can Be Reasonably Supposed to Have Arisen from Random Sampling. In Breakthroughs in Statistics: Methodology and Distribution; Springer: New York, NY, USA, 1992; pp. 1–10. [Google Scholar] [CrossRef]
  43. Jang, H.S.; Bae, K.Y.; Park, H.S.; Sung, D.K. Solar Power Prediction Based on Satellite Images and Support Vector Machine. IEEE Trans. Sustain. Energy 2016, 7, 1255–1263. [Google Scholar] [CrossRef]
  44. Das, A. Logistic Regression. In Encyclopedia of Quality of Life and Well-Being Research; Springer: Cham, Switzerland, 2021. [Google Scholar] [CrossRef]
  45. Jogunuri, S.; Josh, F.T.; Stonier, A.A.; Peter, G.; Jayaraj, J.; Jaganathan, S.; Jency Joseph, J.; Ganji, V. Random Forest Machine Learning Algorithm Based Seasonal Multi-Step Ahead Short-Term Solar Photovoltaic Power Output Forecasting. IET Renew. Power Gener. 2024, 19, e12921. [Google Scholar] [CrossRef]
  46. Villegas-Mier, C.G.; Rodriguez-Resendiz, J.; Álvarez-Alvarado, J.M.; Jiménez-Hernández, H.; Odry, Á. Optimized Random Forest for Solar Radiation Prediction Using Sunshine Hours. Micromachines 2022, 13, 1406. [Google Scholar] [CrossRef] [PubMed]
  47. Druzhkov, P.N.; Zolotykh, N.Y.; Polovinkin, A.N. Implementation of a Parallel Prediction Algorithm in the Gradient Boosting Method of Decision Trees; Mathematical Modeling and Programming Series; Bulletin of the South Ural State University: Chelyabinsk, Russia, 2011; pp. 82–89. [Google Scholar]
  48. Salakhutdinova, K.I.; Lebedev, I.S.; Krivtsova, I.E. Algorithm for Gradient Boosting of Decision Trees in the Software Identification Problem. Sci. Tech. Bull. Inf. Technol. Mech. Opt. 2018, 18, 1016–1022. [Google Scholar]
  49. Hancock, J.T.; Khoshgoftaar, T.M. CatBoost for Big Data: An Interdisciplinary Review. J. Big Data 2020, 7, 94. [Google Scholar] [CrossRef] [PubMed]
  50. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased Boosting with Categorical Features. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2018; Volume 31. [Google Scholar]
  51. Rufo, D.D.; Debelee, T.G.; Ibenthal, A.; Negera, W.G. Diagnosis of Diabetes Mellitus Using Gradient Boosting Machine (LightGBM). Diagnostics 2021, 11, 1714. [Google Scholar] [CrossRef]
  52. Singh, N.K.; Fukushima, T.; Nagahara, M. Gradient Boosting Approach to Predict Energy-Saving Awareness of Households in Kitakyushu. Energies 2023, 16, 5998. [Google Scholar] [CrossRef]
  53. Nusinovici, S.; Tham, Y.C.; Chak Yan, M.Y.; Wei Ting, D.S.; Li, J.; Sabanayagam, C.; Wong, T.Y.; Cheng, C.Y. Logistic Regression Was as Good as Machine Learning for Predicting Major Chronic Diseases. J. Clin. Epidemiol. 2020, 122, 56–69. [Google Scholar] [CrossRef]
  54. Kudryavtseva, A.S. Application of Embedded Selection Methods to Optimize the Referential Choice Model. Comput. Linguist. Intellect. Technol. 2017, 16, 1–9. [Google Scholar]
  55. Cai, Y.; Chow, M.Y.; Lu, W.; Li, L. Statistical Feature Selection from Massive Data in Distribution Fault Diagnosis. IEEE Trans. Power Syst. 2010, 25, 642–648. [Google Scholar] [CrossRef]
Figure 1. Distribution of PTL parameter values in the context of the target variable. (a) Outage fact; (b) conductor, type, and cross-section; (c) PTL relation to transit; (d) condition index, %; (e) overhead PTL length, km; (f) overexploitation, d.q.; (g) reinforced concrete supports, %; (h) PTL length through the forest, %; and (i) PTL length in populated areas, %.
Figure 2. Heat map of the correlation matrix ϕk between PTL parameters.
Figure 3. Algorithm for tuning ML model hyperparameters.
Figure 4. Confusion matrix.
Figure 5. ROC and PR curves on the test set for the best Logistic Regression model: (a) ROC curve; (b) Precision–Recall curve.
Figure 6. Importance of features by embedded methods: (a) Logistic Regression; (b) CatBoost Classifier.
Table 1. ML model hyperparameter grid.

| ML Model | Hyperparameter | Hyperparameter Definition | Hyperparameter Values |
|---|---|---|---|
| SVM | C | Regularization strength | 0…10, step 0.5 |
| | kernel | Kernel type | ‘rbf’, ‘poly’, ‘sigmoid’ |
| | gamma | Kernel coefficient | 0…1, step 0.01 |
| Logistic Regression | solver | Optimization algorithm | “saga” |
| | penalty | Norm of penalty | “elasticnet” |
| | l1_ratio | Relation between l1 and l2 regularizations | 0…1, step 0.1 |
| | C | Regularization strength | 0…5, step 0.1 |
| Random Forest Classifier | n_estimators | Number of decision trees | 10…1000, step 100 |
| | min_samples_split | Minimum number of samples required to split an internal node | 2…50, step 10 |
| | min_samples_leaf | Minimum number of samples required to be at a leaf node | 2…50, step 10 |
| | max_depth | Maximum tree depth | 1…21, step 1 |
| | criterion | Split criterion | “gini”, “entropy”, “log_loss” |
| LGBM Classifier | learning_rate | Boosting learning rate | 0.0001, 0.001, 0.01 |
| | max_depth | Maximum tree depth | 1…21, step 1 |
| | n_estimators | Number of decision trees | 10…1000, step 10 |
| | num_leaves | Maximum tree leaves | 2…50, step 1 |
| | boosting_type | Boosting algorithm | “gbdt”, “dart”, “goss” |
| | reg_alpha | L1 regularization coefficient | 0…1, step 0.1 |
| | reg_lambda | L2 regularization coefficient | 0…1, step 0.1 |
| CatBoost Classifier | depth | Maximum tree depth | 1…10, step 1 |
| | learning_rate | Boosting learning rate | 0.0001, 0.001, 0.01 |
| | iterations | Number of decision trees | 10…1000, step 10 |
| | l2_leaf_reg | L2 regularization coefficient | 1…15, step 1 |
| | max_leaves | Maximum tree leaves | 2…50, step 1 |
Table 2. Quality metrics of models.

| ML Model | Accuracy | Recall | Precision | F1 | AUC-PR | ROC-AUC |
|---|---|---|---|---|---|---|
| Support Vector Machine (CW) | 0.687 | 0.685 | 0.606 | 0.642 | 0.672 | 0.768 |
| Support Vector Machine (SMOTE-NC) | 0.687 | 0.67 | 0.61 | 0.637 | 0.671 | 0.761 |
| Logistic Regression (CW) | 0.69 | 0.685 | 0.611 | 0.644 | 0.683 | 0.779 |
| Logistic Regression (SMOTE-NC) | 0.69 | 0.678 | 0.612 | 0.641 | 0.676 | 0.772 |
| Random Forest Classifier (CW) | 0.687 | 0.693 | 0.607 | 0.644 | 0.661 | 0.772 |
| Random Forest Classifier (SMOTE-NC) | 0.671 | 0.67 | 0.592 | 0.626 | 0.67 | 0.761 |
| LightGBM Classifier (CW) | 0.715 | 0.762 | 0.631 | 0.687 | 0.647 | 0.771 |
| LightGBM Classifier (SMOTE-NC) | 0.69 | 0.686 | 0.609 | 0.642 | 0.652 | 0.772 |
| CatBoost Classifier (CW) | 0.693 | 0.708 | 0.618 | 0.657 | 0.655 | 0.776 |
| CatBoost Classifier (SMOTE-NC) | 0.712 | 0.692 | 0.641 | 0.664 | 0.64 | 0.771 |
| Dummy Model | 0.519 | 0.523 | 0.431 | 0.472 | 0.411 | 0.5 |
Table 3. Hyperparameters of best models.

| ML Model | Hyperparameters |
|---|---|
| Support Vector Machine | kernel: ‘rbf’; gamma: 0.04; C: 9.0; class_weight: ‘balanced’ |
| Support Vector Machine (SMOTE) | kernel: ‘sigmoid’; gamma: 0.03; C: 6.0 |
| Logistic Regression | penalty: ‘elasticnet’; solver: ‘saga’; l1_ratio: 0.8; C: 1; class_weight: ‘balanced’ |
| Logistic Regression (SMOTE) | penalty: ‘elasticnet’; solver: ‘saga’; l1_ratio: 0.6; C: 1 |
| Random Forest Classifier | n_estimators: 300; min_samples_split: 12; min_samples_leaf: 32; max_depth: 12; criterion: ‘log_loss’; class_weight: ‘balanced’ |
| Random Forest Classifier (SMOTE) | n_estimators: 100; min_samples_split: 32; min_samples_leaf: 22; max_depth: 1 |
| LightGBM Classifier | reg_lambda: 0.2; reg_alpha: 0.6; num_leaves: 31; n_estimators: 490; max_depth: 1; learning_rate: 0.01; boosting_type: ‘gbdt’; class_weight: ‘balanced’ |
| LightGBM Classifier (SMOTE) | reg_lambda: 0.7; reg_alpha: 0.9; num_leaves: 47; n_estimators: 865; max_depth: 15; learning_rate: 0.001; boosting_type: ‘goss’ |
| CatBoost Classifier | max_leaves: 16; learning_rate: 0.001; l2_leaf_reg: 14; iterations: 720; depth: 4; class_weight: ‘balanced’ |
| CatBoost Classifier (SMOTE) | max_leaves: 16; learning_rate: 0.001; l2_leaf_reg: 14; iterations: 720; depth: 4 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Bol’shev, V.; Budnikov, D.; Dzeikalo, A.; Korolev, R. Power Outage Prediction on Overhead Power Lines on the Basis of Their Technical Parameters: Machine Learning Approach. Energies 2025, 18, 5034. https://doi.org/10.3390/en18185034

