Data-Driven Prediction Method for Power Grid State Subjected to Heavy-Rain Hazards

Abstract: This study presents a machine learning-based method for predicting the power grid state subjected to heavy-rain hazards. Machine learning models can recognize key knowledge from a dataset without any preliminary knowledge about the dataset. Hence, machine learning methods have been utilized for solving power grid-related problems. Two sets of historical data were used herein: local weather data and power grid outage data. First, we investigated the heavy-rain-related outage distribution and analyzed the correlated characteristics between weather and outages to characterize the heavy-rain events. The analysis results show that multiple weather effects are significant in causing power outages, even under heavy-rain conditions. Furthermore, this study proposes a cost-sensitive prediction method using a support vector machine (SVM) model. The accuracy of the model was improved by applying a cost-sensitive learning algorithm to the SVM model, which was subsequently used to predict the state of the grid. The developed model was evaluated using G-mean values. The proposed method was verified via actual data of a heavy-rain event that occurred in South Korea.


Introduction
The electrical power grid constitutes a vital infrastructural component and serves as an essential foundation for modern life. Severe weather events have been considered a major cause of enormous power-outage events. Moreover, the frequency and intensity of severe weather events have increased over recent years [1][2][3][4][5]. Due to climate change, the frequency, intensity, and duration of extreme weather are expected to increase further, resulting in large-scale power outages [6] and motivating this study on grid resilience, defined as the ability to withstand and rapidly recover from a severe weather event.
A key driver for improving grid resilience is to predict the state of the grid [7]. The grid state demonstrates how the grid, or its components, performs with respect to severe weather events, and it can determine the operating conditions of the grid. Identification of the grid state is an essential factor in mitigating the impact of severe weather, preparing the emergency response, and managing the restoration crew within acceptable time and cost limits. The grid is typically designed to operate under a certain weather intensity. If the weather intensity increases beyond the design criteria, the probability of a power outage increases. Therefore, the grid state can be represented as either a normal or outage state, given the weather intensity. Modeling the grid state is very complex; therefore, the formulation of the relationship and its solution is difficult to determine.
Data-driven methods can be utilized to model the complex relationship for the grid state; these methods are time efficient and the components do not require a physical model. Data-driven methods, such as machine learning (ML) models, can recognize key knowledge from a dataset without any prior knowledge about the dataset. Thus, ML methods have been utilized to solve power grid-related problems. ML applications include fault identification and detection, based on the support vector machine or neural network [8,9], and risk analysis [10]. A more prevalent application of ML in the power grid field is in relation to forecasting. Several ML-based forecasting methods have been introduced for short-term load forecasting [11,12], renewable power forecasting [13,14], electricity price forecasting [15], and power grid damage forecasting [16][17][18][19][20], which utilize regression models and tree-based regression. A probabilistic approach based on the Tobit regression model to forecast power outage was proposed by [21]. A Bayesian network-based storm-induced power outage forecasting model for an electrical grid was presented in [22]. Predictive analytics for extreme weather events were presented in [23]. A power system contingency forecasting tool under extreme weather conditions was presented in [24]. These authors proposed a risk-based security method to predict the risky contingencies on the power system.
This study presents an ML-based method for predicting the power grid state subjected to heavy-rain hazards. Two sets of historical data were used: Local weather data and power grid outage data. First, we investigated the heavy-rain-related outage distribution and analyzed the correlated characteristics between weather and outages in order to characterize the heavy-rain events. The analysis results show that multiple weather effects are significant in causing power outages, even under heavy-rain conditions. Furthermore, this study proposes a cost-sensitive prediction method using the support vector machine (SVM) model. The accuracy of the model was improved by applying a cost-sensitive learning algorithm to the SVM model, which was subsequently used to predict the state of the grid. The developed model was evaluated using G-mean values. The proposed method was verified using the actual data of a heavy-rain event that occurred in South Korea.
This study is organized as follows: Section 2 describes the related work, Section 3 describes the used data, Section 4 describes the characteristics of the heavy-rain event, Section 5 presents the method and materials for the development of the power grid state prediction model, Section 6 presents results of the actual data, and Section 7 concludes this study.

Related Work
Several researchers have studied grid resilience with respect to severe weather events. A linear regression model-based prediction approach was presented in [16,17], which attempted to forecast the number of damaged components caused by strong wind or storm events. Various statistical methods, such as regression trees and multivariate regression, were used to predict power outage durations during a hurricane [18]. Liu et al. [19,20] developed a linear model-based power outage prediction method using hurricane variables and environmental variables at a given location. Han et al. [25,26] presented a power-outage estimation model using an approach similar to that of Liu et al. [19,20], but with additional explanatory variables. Guikema et al. [27] introduced a generalized hurricane outage prediction model by considering additional explanatory variables, such as land cover, soil moisture, and hurricane indicators. Recently, a power grid component state prediction method for hurricane events was presented in [28,29]. Logistic regression and support vector machines were used to model the power component state during the hurricane.
Most previous studies focused on hurricane events, and their developed models relied on several grid vulnerability variables and specific hurricane indicators. However, not all power companies have detailed data about power outages caused by weather events, and it can be expensive to gather extensive datasets. Recent studies tried to predict the grid component state during hurricane events based on a few weather variables. However, these studies used artificially generated samples rather than real power outage data. Therefore, the previous studies have limited scalability to other weather events and real-world applications.
The objective of this study was to characterize the power outage caused by heavy rain and to predict grid states during a heavy-rain event based on real data. To be scalable and applicable in reality, the common weather variables were used to develop a prediction model. Therefore, it can be easily extended to other disruptive weather events.

Description of Data
Heavy-rain hazards can be determined by certain weather conditions. For instance, when the precipitation over 6 h is expected to be more than 70 mm, the Korea Meteorological Administration (KMA) issues a special weather report to warn of heavy-rain-related hazards.
The supervised ML approach generally uses previous historical data to build its prediction model. This study considers the SVM approach to predict the grid states using common weather variables. Thus, two sets of historical data, local weather and outage data, were used to analyze the power outage. All the data ranged from January 2008 to March 2018.
In order to evaluate the present power grid state over the subsequent hours, the model required a numerical weather prediction (NWP) forecast. Therefore, this study considered publicly available weather data provided by the KMA, which could be associated with South Korea's NWP system. NWP forecasts were transformed into a grid state via a machine learning model that was trained by the historical data to learn the dependence of the output on explanatory variables.

Weather Station Data
The local weather data were obtained from automatic weather stations (AWSs); the weather variables were recorded at least every hour. The collected data consisted of accumulated rainfall (mm), 1 min average wind speed (m/s), 10 min average wind speed (m/s), wind direction (°), and humidity (%). For this study, the accumulated rainfall and 10 min average wind speed were used to analyze the power outage. These weather variables are the expected cause of power outages under heavy-rain conditions. Furthermore, when several AWSs were active in a region, this study used the maximum/minimum weather values, as it is the extreme weather conditions that are associated with power outages.
The standard criteria for issuing advisory or warning information concerning extreme weather conditions in South Korea was investigated to extract the heavy-rain data. The KMA issues a special weather report whenever a sudden and significant change in weather conditions is observed. The standard criteria for heavy rain are divided into the advisory level and warning level. The advisory level means that the precipitation for 6 h is expected to be more than 70 mm or the precipitation for 12 h to be over 110 mm. The warning level means that the precipitation for 6 h is expected to be more than 110 mm or the precipitation for 12 h to be over 180 mm [30]. In this paper, the standard criteria for heavy rain were used to extract the heavy rain condition data to predict the grid state.
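As a minimal sketch, the advisory/warning criteria above can be encoded as a small helper function (the function name and return labels are our own; the thresholds are the KMA values quoted in the text):

```python
def rain_alert_level(rain_6h_mm, rain_12h_mm):
    """Classify precipitation against the KMA heavy-rain criteria.

    Warning:  > 110 mm over 6 h or > 180 mm over 12 h.
    Advisory: >  70 mm over 6 h or > 110 mm over 12 h.
    """
    if rain_6h_mm > 110 or rain_12h_mm > 180:
        return "warning"
    if rain_6h_mm > 70 or rain_12h_mm > 110:
        return "advisory"
    return "none"
```

For example, 80 mm over 6 h triggers an advisory, while 120 mm over 6 h triggers a warning.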

Outage Data
The electrical power company in South Korea records the data regarding power outage events in its branch offices. The outage information includes the details of the head/branch office, time of failure, failure component, cause of failure, number of affected customers, the time required for restoration, and type of failure. The outage is related to the failure components; the location of the failures is not exactly specified in the system topology. Despite this limitation, a wider geographical area managed by each branch office can be evaluated from the data.

Integration of Weather and Outage Data
Since the local weather data are collected by a different institute than the outage data, the AWS data must be mapped to the outage data to analyze the relationship between them. Furthermore, the KMA provides a detailed regional short-term weather (STW) forecast every 3 h at dense grid points by gridding the whole country into 5 × 5 km intervals. The grid state prediction is performed using STWs; therefore, the available datasets, with different data frames and time periods, were integrated into 3-h segments to develop the power grid state prediction model. This data frame was developed to utilize weather forecast data and allows the proposed method to be extended to other disruptive weather events.
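The integration step can be sketched with pandas: hourly AWS records are aggregated into the 3-h segments of the STW forecast grid (the column names and sample values below are hypothetical):

```python
import pandas as pd

# Hypothetical hourly AWS records for one branch-office area.
weather = pd.DataFrame(
    {"rainfall_mm": [2.0, 10.5, 30.0, 12.0, 0.5, 0.0],
     "wind_10min_ms": [3.1, 4.8, 7.2, 6.0, 2.2, 1.9]},
    index=pd.date_range("2016-10-05 00:00", periods=6, freq="h"),
)

# Aggregate into 3-h segments: rainfall accumulates over the window,
# wind speed is summarized by its maximum.
seg = weather.resample("3h").agg(
    {"rainfall_mm": "sum", "wind_10min_ms": "max"})
```

Outage records time-stamped within the same 3-h window can then be joined to `seg` on the segment index.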

Outage Pattern
The U.S. Department of Energy (DOE) tracked the outage distribution for each power outage caused by major hurricane events [31]. This research reported that the number of outage customers rose within a few hours of the event, suggesting that power outages caused by weather events may follow a general pattern.
Therefore, we can expect that the outage distribution caused by heavy rain may follow a general pattern, similar to hurricane events. Figure 1 illustrates the normalized damage distribution for several heavy-rain events. The heavy-rain event durations and damage counts were normalized to between 0 and 1 using their maximum values, so that events with different durations and damage counts can be compared over the same axis range. Consequently, Figure 1 shows that the peak number of damages often occurred close to the end of the heavy-rain event. In the case of a hurricane, the system experiences a strong weather attack early in the event. Conversely, heavy rain consistently hits the system with a relatively weak attack, causing the peak outage to occur near the end of the event. Therefore, it can be concluded that power outage events may occur due to prolonged stress caused by accumulated rainfall under heavy-rain conditions.
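The normalization used for Figure 1 amounts to dividing each event's time axis and damage counts by their maxima; a sketch with made-up numbers:

```python
import numpy as np

def normalize_event(times_h, damages):
    """Scale event time and damage counts to [0, 1] by their maxima,
    so events of different duration and severity share one axis range."""
    t = np.asarray(times_h, dtype=float)
    d = np.asarray(damages, dtype=float)
    return t / t.max(), d / d.max()

# Hypothetical 5-h event with the peak damage at the end.
t_n, d_n = normalize_event([1, 2, 3, 4, 5], [0, 2, 3, 8, 10])
```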

Skewness Distribution of Outage Data
The scarcity of available data on the power grid remains one of the predominant challenges of studying grid resilience. The occurrence of heavy-rain events is relatively rare and does not always cause grid damage. Most points on the scatter plot shown in Figure 1 lie on the x-axis, i.e., no outage cases. The figure indicates that the grid outage data is not normally distributed, which represents the skewness of the data distribution. Therefore, the grid outage data has an unbalanced label ratio and may lead to a deterioration in the performance of the machine learning model.
Due to these issues, a new method is required to capture the inherently complex characteristics of the imbalanced data and derive knowledge from the data. Innovative research regarding imbalanced learning issues includes oversampling, undersampling, and cost-sensitive learning methods [32]. Although the application of these techniques can help improve the classification results, various empirical studies have been introduced asserting that cost-sensitive learning is superior to sampling methods in certain datasets [33]. To achieve an accurate grid state prediction method, this study considered the SVM model based on a cost-sensitive method for evaluating the grid state under heavy-rain conditions.

Correlated Characteristic between Weather and Outage
Most power components are exposed to weather events that can cause unexpected power outages. Power components are designed to endure normal-intensity weather; however, if the weather intensity exceeds a component's design threshold, a power outage occurs. In the outage data, both wind and rain events are recorded as causes of power outages, and hence it is difficult to specify the exact type of event that induces a given outage. Moreover, heavy-rain events generate a large amount of precipitation and sometimes occur simultaneously with strong wind conditions. Therefore, both types of events must be considered as causes of power outages under heavy-rain conditions. No-outage cases were predominantly recorded in the historical data, indicating that the distribution of the outage data was significantly skewed. In other words, the dataset has zero-inflated characteristics, since power outage events occur with low probability even under heavy-rain conditions. Therefore, the probability density distribution of this dataset is right-skewed rather than normal. A log-transformation was performed to improve interpretability and to reduce the skewness of the data. The zero values of outage customers were not included when analyzing the relationship between weather and outages. Figures 2 and 3 show that with higher accumulated rainfall and wind speed, the number of outage customers tended to be larger. The Pearson correlation coefficients of rainfall and wind speed with the number of outage customers were calculated as 0.3145 and 0.2354, respectively. The color bar represents the probability density, estimated via the kernel density estimation method, and the red line represents the regression line of the scatter plot. These figures show that major outages occurred with a low probability and minor outages occurred with a relatively high probability.
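The log-transformation and correlation analysis can be reproduced on synthetic data (the numbers below are illustrative stand-ins, not the paper's dataset, whose coefficients were 0.3145 and 0.2354):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for 3-h segments with at least one outage
# (zero-outage rows are excluded, as in the analysis above).
rain = rng.uniform(5, 150, 300)                          # accumulated rainfall, mm
customers = np.exp(0.01 * rain + rng.normal(0, 1, 300))  # outage customers

# Log-transform the right-skewed outage counts, then compute the
# Pearson correlation with rainfall.
log_cust = np.log(customers)
r = np.corrcoef(rain, log_cust)[0, 1]
```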
To further verify the relationship between the variables, a multivariate regression model was developed using the two weather variables as inputs. Table 1 shows the R-squared values of models with various combinations of explanatory variables. A larger R-squared value indicates that the model captures a larger portion of the variance. Although all models have low R-squared values, this analysis suggests that wind speed could help to predict power outages even under heavy-rain conditions. Moreover, the model with both wind and rain has the highest R-squared value, accounting for 8.21% of the variance. This suggests that multiple weather effects could play an important role in causing power outage events.
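The R-squared comparison of Table 1 can be sketched with nested linear models on hypothetical data; on a training fit, adding a variable can only keep or raise R-squared:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 500

# Hypothetical weather/outage segments (not the paper's data).
rain = rng.uniform(0, 150, n)
wind = rng.uniform(0, 20, n)
outage = 0.02 * rain + 0.1 * wind + rng.normal(0, 3, n)

X_rain = rain.reshape(-1, 1)
X_both = np.column_stack([rain, wind])

# In-sample R-squared for each combination of explanatory variables.
r2_rain = LinearRegression().fit(X_rain, outage).score(X_rain, outage)
r2_both = LinearRegression().fit(X_both, outage).score(X_both, outage)
```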

Method and Materials
The grid state can be represented by the normal state or outage state. In this study, we considered the problem of binary classification for predicting the state of the grid, as follows.
Define a binary event O as follows: O = 1 if at least one outage occurs in the region of interest during a given time segment (outage state), and O = −1 otherwise (normal state). A data-driven method can be used to solve this binary classification problem. SVM models have been widely used in this field, although their performance depends on the dataset. The SVM always guarantees a global optimum and is less prone to overfitting than neural network models, which are also widely used [34]. Moreover, the SVM is well suited to small and medium-sized datasets. Severe weather events and power outage events occur with low probability; therefore, the dataset used in this study is relatively small compared to those of other application fields, such as image recognition and natural language processing. For these reasons, this study utilizes the SVM approach, even though the SVM does not always guarantee the most suitable model.

SVM
SVM is a general approach to supervised learning for classification via the definition of a separating hyperplane, which allows discrimination between classes. It is widely used for binary classification problems. The optimal hyperplane is the hyperplane with the maximal margin between the classes, which decreases the probability of misclassification and increases the generalization of the model. The following paragraphs briefly describe the SVM model in three-dimensional space for a binary classification problem. The theoretical details of the SVM can be found in the literature [35].
Consider n observations x_i ∈ R^3, i = 1, . . . , n, with corresponding target values t_1, . . . , t_n, where t_i ∈ {−1, 1}. New data points x are classified according to the sign of y(x) = w^T x + b, where w is the weight vector defining a direction perpendicular to the decision boundary hyperplane, while b is a bias that moves the hyperplane parallel to itself. The optimal decision boundary given by the support vectors can be found by solving the following optimization problem:

min_{w, b, ξ} (1/2)||w||^2 + C Σ_{i=1}^{n} ξ_i, subject to t_i(w^T x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0 for all i,

where ξ_i is a slack variable that regularizes the weight of misclassified data points inside the margin, and C is a penalty coefficient that controls the trade-off between the margin and the classification error. In this formulation, increasing C toward ∞ places more weight on the slack variables ξ_i, and the optimization attempts a stricter separation, whereas reducing C toward 0 reduces the importance of misclassification. This 'soft-margin' formulation can be re-formulated using the Lagrange multipliers α_i, as follows:

max_α Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j t_i t_j x_i^T x_j, subject to 0 ≤ α_i ≤ C and Σ_i α_i t_i = 0.

The solution of the soft-margin problem is usually found via sequential minimal optimization (SMO) [36], which iteratively selects and optimizes two Lagrange multipliers until convergence at an optimal solution. Figure 4 shows the support vectors and hyperplane separating outages from no outages based on the associated weather variables. Finding the best hyperplane assumes that the data are linearly separable; however, typical data often have non-linear characteristics. The kernel trick can be used to map the non-linear input space into a linearly separable feature space. In this study, we applied the radial basis function (RBF) kernel, k(x_i, x_j), which replaces the dot product of the inputs in another space:

k(x_i, x_j) = exp(−||x_i − x_j||^2 / (2σ^2)),

where σ^2 represents the kernel parameter.
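A minimal soft-margin RBF SVM can be fitted with scikit-learn's `SVC` (where `gamma` plays the role of 1/(2σ^2)); the toy weather features below are synthetic, not the paper's data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)

# Toy two-class features: [accumulated rainfall, 10 min wind speed].
X_no = rng.normal([30.0, 4.0], [15.0, 2.0], size=(80, 2))     # no outage
X_out = rng.normal([110.0, 10.0], [15.0, 2.0], size=(80, 2))  # outage
X = np.vstack([X_no, X_out])
y = np.array([0] * 80 + [1] * 80)

# Soft-margin SVM with the RBF kernel; C trades margin width vs. slack.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
pred = clf.predict([[120.0, 11.0], [10.0, 2.0]])
```

With well-separated classes such as these, the first test point (heavy rain, strong wind) falls on the outage side of the boundary and the second on the normal side.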
The best kernel should be found experimentally, adjusting the kernel parameters via a search method to minimize the model error. Furthermore, the performance of the general SVM algorithm tends to be lower on a dataset with an imbalanced class ratio, as the separating boundary tends toward the minority class to obtain better overall classification accuracy. Therefore, in order to obtain an effective classification model, the separating boundary needs to be pushed toward the majority class and away from the minority class. This is realized by penalizing the bias of the separating boundary toward the majority class while maximizing the marginal distance. We define a cost matrix, C, associated with the training observations. For the minority class, the associated misclassification cost, C(i, j), is relatively large compared to that of the majority class. The misclassification cost penalizes the bias of the separating boundary toward the majority class; therefore, the separating boundary is pushed toward the majority class. Accordingly, the degree to which the separating boundary is pushed must be determined. The element C(i, j) of this matrix is the cost of classifying an observation into class j when the true class is i, and the diagonal elements C(i, i) must be 0, as follows:

C = [0, 1; c, 0],

where c > 1 is the cost of misidentifying an outage as no outage. Cost is relative; by increasing the cost of misclassifying the minority class, the separating boundary's preference for the majority class is penalized and the minority class is preferred.
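In scikit-learn, an asymmetric cost of this kind can be approximated through `class_weight`, which rescales the penalty C per class (a sketch on synthetic imbalanced data, with the outage class as the minority):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)

# Imbalanced toy set: many normal segments, few outage segments.
X_norm = rng.normal([40.0, 5.0], [25.0, 3.0], size=(190, 2))
X_out = rng.normal([90.0, 9.0], [25.0, 3.0], size=(10, 2))
X = np.vstack([X_norm, X_out])
y = np.array([0] * 190 + [1] * 10)

# class_weight multiplies C per class, playing the role of the
# misclassification cost c > 1 for missing an outage.
c = 11  # cost value used here purely for illustration
clf = SVC(kernel="rbf", C=1.0, gamma="scale",
          class_weight={0: 1.0, 1: c}).fit(X, y)

# Recall on the minority (outage) class of the training set.
tp_rate = clf.predict(X[y == 1]).mean()
```

The larger weight on the minority class pushes the boundary toward the majority class, raising the outage-class recall at the expense of some false alarms.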

Evaluation Metric
Classification accuracy can be measured by the number of correct predictions out of all the predictions. However, the accuracy may fail to adequately measure the performance of the classifier due to the presence of imbalanced data. To obtain more accurate classifier information, this study adopts the G-mean evaluation metric, which represents the geometric mean of the accuracies of each class. The G-mean ranges from 0 to 1, where 1 indicates a perfect classifier.
For binary classification problems, the G-mean represents the square root of the product of the positive-class accuracy and the negative-class accuracy. It can be mathematically expressed as follows:

G-mean = sqrt( (TP / (TP + FN)) × (TN / (TN + FP)) ),

where TP, TN, FP, and FN represent the true positives, true negatives, false positives, and false negatives, respectively.
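The metric is straightforward to compute from a confusion matrix; as a check, the class accuracies reported later in this paper (72.83% and 70.93%) yield a G-mean close to 0.719 (the counts below are chosen to reproduce those rates, not taken from the paper's confusion matrix):

```python
import numpy as np

def g_mean(tp, tn, fp, fn):
    """Geometric mean of the true-positive and true-negative rates."""
    tpr = tp / (tp + fn)  # positive-class (outage) accuracy
    tnr = tn / (tn + fp)  # negative-class (normal) accuracy
    return np.sqrt(tpr * tnr)

# Counts chosen to give 72.83% / 70.93% class accuracies.
gm = g_mean(tp=7283, tn=7093, fp=2907, fn=2717)
```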

Results
The data range used for this study was from 2008 through to 2018; the historical dataset up to mid-2016 was used to develop the prediction model, and the remaining historical dataset was used to evaluate the performance of the prediction model. Furthermore, the parameters (kernel parameter (σ^2) and penalty coefficient (C)) of the SVM model were tuned using Bayesian optimization. In total, 30 trials were conducted to find the best combination of the parameters with the minimum 5-fold cross-validation classification error. A search for C, log-scaled in the range between 10^−3 and 10^3, and σ^2, log-scaled in the range between 10^−3 and 10^3, was performed to find the best parameters of the SVM. The G-mean values were evaluated for each of the 10 available explanatory variables to identify the key explanatory variables, as shown in Table 2. The complexity of the model may lead to overfitting and a computational burden, and inadequate variables can decrease the performance of the model. Thus, in this study, variables were selected by adding them sequentially, and the model with the highest performance was selected as the final model. Moreover, the degree of misclassification cost was determined to obtain an appropriate separating boundary. First, this study applied the reference cost ratio of 0.618:1 (the golden ratio) between the minority class t_j and the majority class t_i. Twenty percent of the historical dataset was used to evaluate the performance of the proposed method. Based on the cost matrix C, the SVM models with combinations of weather variables were developed, and their G-mean values are summarized in Table 3. Based on the average accumulated rainfall, x_7, other variables were added sequentially. In Table 3, the G-mean values tend to decrease with the addition of the wind direction-related variables, x_5, x_6.
This implies that wind direction is not a substantial factor for power grid outages subjected to heavy-rain hazards. Conversely, the G-mean values tend to increase with the addition of the wind speed-related variables, x_1, x_2, x_3. This means that wind speed has a substantial impact on the power grid even under heavy-rain conditions. Consequently, the SVM model with x_1, x_2, x_3, x_7, x_9 recorded the highest G-mean value. Although its margin over the other models is small, it is consistently superior.
Table 3. G-mean value with a combination of variables.
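The hyperparameter tuning described in this section (a log-scaled search over C and σ^2, scored by 5-fold cross-validation) can be sketched as follows; a random search over the same ranges stands in here for the Bayesian optimization used in the paper, and the data are synthetic:

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy separable labels

# 30 trials over C and gamma (gamma ~ 1/(2*sigma^2)), log-scaled
# in [1e-3, 1e3], scored by 5-fold cross-validated accuracy.
search = RandomizedSearchCV(
    SVC(kernel="rbf"),
    {"C": loguniform(1e-3, 1e3), "gamma": loguniform(1e-3, 1e3)},
    n_iter=30, cv=5, random_state=0)
search.fit(X, y)
best = search.best_params_
```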

To further improve the SVM model performance and analyze the effect of the misclassification cost, the costs were explored with the variables x_1, x_2, x_3, x_7, x_9. A grid search over the misclassification costs c ∈ {1, 1.5, 2, . . . , 11, 11.5} was performed to find the optimal separating boundary that properly classifies the majority and minority classes. Figure 5 shows the G-mean values against the cost values. As shown in this figure, when the cost value was 1, the G-mean value was 0. This corresponds to the general SVM model without the cost matrix, whose separating boundary is placed toward the minority class to obtain better overall classification accuracy. Increasing the misclassification cost pushes the separating boundary toward the majority class and away from the minority class, which is realized by penalizing the bias of the separating boundary toward the majority class while maximizing the marginal distance. According to the misclassification cost, a biased separating boundary can thus be obtained; the appropriate result is the G-mean value at which the true positive rate and true negative rate are similar. The G-mean value for a misclassification cost of 11 was the maximum, and it also had the highest true positive rate and true negative rate in the grid search range. Consequently, the highest G-mean value was recorded as 0.7188 with a misclassification cost of 11, which is superior to the best model shown in Table 3. Therefore, this study used the SVM model with a cost of 11 to demonstrate the overall model performance. Table 4 shows the confusion matrix of the SVM model with the variables x_1, x_2, x_3, x_7, x_9 and a cost of 11, and Table 5 shows the SVM models' performance.
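The cost grid search can be sketched by refitting the weighted SVM for each candidate cost and keeping the one with the highest G-mean (synthetic data; a proper run would score a held-out set rather than the training set):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)

# Imbalanced toy data standing in for the historical segments.
X = np.vstack([rng.normal(0.0, 1.0, (180, 2)),
               rng.normal(1.5, 1.0, (20, 2))])
y = np.array([0] * 180 + [1] * 20)

def g_mean_of(model, X, y):
    p = model.predict(X)
    tpr = ((p == 1) & (y == 1)).sum() / (y == 1).sum()
    tnr = ((p == 0) & (y == 0)).sum() / (y == 0).sum()
    return np.sqrt(tpr * tnr)

# Grid of misclassification costs c in {1, 1.5, ..., 11.5};
# the minority class gets weight c, the majority class weight 1.
costs = np.arange(1.0, 12.0, 0.5)
scores = {c: g_mean_of(
              SVC(kernel="rbf", class_weight={0: 1.0, 1: c}).fit(X, y), X, y)
          for c in costs}
best_c = max(scores, key=scores.get)
```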
The other typical classification models (SVM with the linear and polynomial kernels, multilayer perceptron, decision tree, random forest, and logistic regression) were also trained and tested in the same fashion (i.e., their hyperparameters, together with the misclassification cost, were searched and selected to obtain the best performance). The results show that the normal state prediction accuracy was 70.93%, the outage prediction accuracy was 72.83%, and the SVM with the RBF kernel had an overall prediction accuracy with a G-mean of 0.7188. In comparison, the random forest classified with a G-mean of 0.7000, the best among the models other than the SVM. The results indicate that the decision boundary obtained using the cost-sensitive SVM could adequately predict the grid state for future heavy-rain events. The predicted results are shown based on administrative areas, since the branch office areas are close to the administrative areas and this is a convenient method of displaying the spatial distribution of the results. To show the spatial distribution of the proposed method, it was used to predict a new heavy-rain event in October 2016. Figure 6 shows the study area. The actual heavy-rain event lasted 15 h; however, for simplicity, the prediction results show only the final nine hours of heavy rain. Figure 7 shows the spatial distribution of the actual and predicted grid states. The black boundary represents the administrative areas with no heavy-rain event, and the red and white boundaries represent the outage and normal grid states, respectively. As shown in Figure 7, the proposed method demonstrated adequate prediction performance; however, it is not a perfect classifier. This may be due to a significant number of normal grid states being mispredicted as outage grid states. Nevertheless, the proposed method predicts a grid state similar to the actual grid state, and the overall predicted grid states follow a spatial pattern similar to the actual grid states.
Hence, it can be concluded that the proposed method can be utilized to mitigate the impact of heavy rain by issuing an alarm for high-risk areas in advance.

Conclusions
This study presents a machine learning-based method for predicting the power grid state subjected to heavy-rain hazards. This study used two sets of data: Local weather data and power distribution grid damage data, from January 2008 to March 2018. To understand the heavy-rain event, we investigated the heavy-rain-related outage distribution and analyzed the correlated characteristics between weather and outages. The results show that the effect of wind speed is significant in causing power outages, even under heavy-rain conditions. Furthermore, this study proposed a grid state prediction method using a cost-sensitive support vector machine (SVM) model. The penalty and kernel parameters of SVM were optimized using Bayesian optimization. The accuracy of the model was improved by applying a cost-sensitive learning algorithm to the SVM model, which was subsequently used to predict the grid state of a region managed by a branch office. The developed models were evaluated using the G-mean values. The proposed method was verified using the actual data of heavy-rain events that occurred in South Korea. The results show that the SVM model with cost-sensitive effects outperforms the general SVM model.
The method proposed in this study is useful in predicting the grid state during heavy-rain events. This study is expected to provide insight into heavy-rain events, as well as damage prediction. Moreover, the proposed method can be extended to other disruptive events, since it relies only on the simple weather variables provided by the weather forecast system. However, the conditions of power components and the weather change over time. Thus, future work is needed to develop grid state prediction models that generalize and reflect changes in grid conditions over time.