Outage Estimation in Electric Power Distribution Systems Using a Neural Network Ensemble

Outages in an overhead power distribution system are caused by multiple environmental factors, such as weather, trees, and animal activity. Since they form a major portion of the outages, the ability to accurately estimate these outages is a significant step towards enhancing the reliability of power distribution systems. Earlier research with statistical models, neural networks, and committee machines to estimate weather-related and animal-related outages has reported some success. In this paper, a deep neural network ensemble model for outage estimation is proposed. The entire input space is partitioned, with a distinct neural network in the ensemble performing outage estimation in each partition. A novel algorithm is proposed to train the neural networks in the ensemble while simultaneously partitioning the input space in a suitable manner. The proposed approach has been compared with earlier approaches for outage estimation in four U.S. cities. The results suggest that the proposed method significantly improves the estimates of outages caused by wind and lightning in power distribution systems. A comparative analysis with a previously published model for animal-related outages further establishes the overall effectiveness of the deep neural network ensemble.


Introduction
The reliability of an electric power distribution system indicates its ability to deliver power to customers without interruptions. Due to their exposure to various hazards, overhead feeders, which are commonly used in distribution systems, contribute the majority of service interruptions to customers. Environmental factors such as weather, trees, and animals can be extremely hazardous to overhead feeders [1]. High wind and lightning not only damage overhead lines directly, but also break trees, which in turn can damage nearby feeders. Squirrels cause outages by creating short circuits across the insulators on overhead feeders. Although complete prevention is not possible, proper design and maintenance can help reduce weather- and animal-related outages to acceptable levels. Utilities keep records of outages and, based on these records, determine system upgrades to address the areas with more outages. Most of the measures used today are retroactive in nature, where preventive decisions are based entirely on past experience. Forecasting failures proactively from weather patterns can help utilities deal with outages much more effectively. Additionally, the normalization of reliability indices to weather gives them a better justification for the outages [1].
Since weather-related and animal-related outages are highly random, predicting their occurrence is a challenging task. Some basic approaches to model the causal relationships for outage prediction are based on statistical regression [2][3][4]. Simple neural networks that have been proposed to predict weather-related outages [5] have shown better performance than statistical models. However, as outages follow a very non-uniform distribution, there is a paucity of high-outage samples in the training data. As a result, these neural networks tend to underestimate outages on high-outage days.

Outage and Weather Data
Outages due to weather and animals occur randomly, and the probability of their occurrence increases during adverse conditions. For example, during stormy conditions with high wind and lightning, the probability of outages increases. Similarly, the probability of outages due to animals increases under conditions that promote higher animal activity. Previous research showed that the aggregation of outages in space and time is necessary to obtain a meaningful causal relationship [1,7,8]. The aggregation of outages over a day for wind and lightning, and over a week for animals, provided the best results.
An electric utility provided the recorded outage data for multiple years for the four cities considered for the research. Detailed weather information for these years was obtained from the local weather stations. The maximum daily 5-s wind gust speed measured in miles/hour was chosen to study wind effects because it has high correlation with the other measures of wind speeds. The absolute values of all the lightning strokes in kA, including the first stroke and the flashes in the defined area around the feeders, for each day of the study were added to find the aggregate of the lightning strokes for each of these days. Note that the earlier models [2,3,5] considered the lightning strokes recorded within 400 m of the distribution feeders. In contrast, the present study considers lightning strikes within 500 m. Furthermore, while only those days with lightning strikes were included earlier, in this approach, days without lightning are also included. The inputs to the model are the normalized wind and lightning data. The raw data from 2005 to 2009 was processed to identify outages caused by wind and lightning, and they were aggregated to create a database of daily outages.
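The daily lightning aggregation described above can be sketched as follows. The record layout and function name are assumptions for illustration, not the utility's actual schema; the two properties taken from the text are that absolute stroke magnitudes (kA) are summed per day and that zero-lightning days are retained.

```python
from collections import defaultdict

def aggregate_daily_lightning(strokes, study_days):
    """Sum absolute stroke magnitudes (kA) per day.

    `strokes` is a list of (day, kA) records for strikes within the
    defined area around the feeders; days of the study with no strikes
    still appear in the output with a total of 0.0.
    """
    totals = defaultdict(float)
    for day, ka in strokes:
        totals[day] += abs(ka)  # absolute value of every stroke
    # include zero-lightning days, as the present approach requires
    return {day: totals.get(day, 0.0) for day in study_days}
```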
Outages due to squirrels in the distribution systems of the same four cities over the period from 1998 to 2007 were considered to study animal-caused outages. The evaluation of squirrels' yearly life cycle and animal-caused outage data showed that squirrels cause the most damage in fair weather, i.e., on days with temperatures between 40 °F and 85 °F and no other weather activity. Further, the behavioral patterns of squirrels are different in each month of the year, and thus the month has a considerable impact on squirrel-related outages. Hence, the months were grouped into three groups based on the level of squirrel activity: Low (January, February, and March), Medium (April, July, August, and December), and High (May, June, September, October, and November). These are classified as Month Types 1, 2, and 3. The number of fair-weather days in each week was counted, and the weeks were classified into three levels: Low (zero fair-weather days), Medium (one to three fair-weather days), and High (four or more fair-weather days in the week). In addition, the outages due to animals in the previous week were considered as an input to the model. They were classified into two categories (Low and High). The cutoff for the High outage level in the previous week is set at the 70th percentile, which means that weeks with more outages than 70% of the weeks during the study are defined as High. The outages due to animals were aggregated to create a database of weekly outages.
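The categorical encodings above can be sketched as follows. The function names are illustrative; the month groupings and level cutoffs come from the text, while the 70th-percentile value itself would be computed from the outage data.

```python
def month_type(month):
    """Map a month (1-12) to squirrel-activity Month Type 1/2/3."""
    low, medium = {1, 2, 3}, {4, 7, 8, 12}
    if month in low:
        return 1
    if month in medium:
        return 2
    return 3  # May, Jun, Sep, Oct, Nov

def fair_day_level(n_fair_days):
    """Classify fair-weather days in a week: Low / Medium / High."""
    if n_fair_days == 0:
        return "Low"
    return "Medium" if n_fair_days <= 3 else "High"

def prev_week_level(prev_outages, p70_cutoff):
    """Previous-week outages: High above the 70th-percentile cutoff."""
    return "High" if prev_outages > p70_cutoff else "Low"
```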

Deep Neural Network Ensemble Approach
It is well-known that combining the outputs of an ensemble of neural networks (hereafter referred to as sub-networks) improves the overall performance [22]. In one such approach, the sensitivity of each sub-network to the output is used during training [23]. Another ensemble method decomposes the input space into Voronoi partitions (i.e., nonintersecting, convex partitions whose union is the input space) and associates a distinct sub-network for each partition. In [24], fuzzy C-means clustering is applied to partition the input space. In [25], a Boltzmann parameter has been used to reduce the degree of randomness of the genetic algorithm operators used in training. In another method [26], this parameter is steadily lowered but then reset to a high value to allow the neural networks to escape local minima.
In this research, the input space is indirectly partitioned by means of a softmax output neuron that also incorporates a Boltzmann parameter, which is initialized to a high value to allow the faster training of all sub-networks in the ensemble. As training progresses, it is steadily lowered so that the Voronoi partition boundaries become increasingly more pronounced. In the limiting case, when the Boltzmann parameter approaches zero, these boundaries are well-defined. This allows the output neuron to function as a switch, picking the output of the appropriate sub-network as the ensemble output. This arrangement also helps in avoiding overfitting [27].
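A minimal sketch of such a temperature-controlled softmax weighting (the function name is illustrative) shows both limiting behaviors: near-uniform weights at high T, and a hard switch that picks the lowest-error sub-network as T approaches zero.

```python
import math

def ensemble_weights(errors, T):
    """Boltzmann-weighted (softmax) ensemble weights.

    High T -> nearly uniform weights over the sub-networks;
    T -> 0 -> a hard switch selecting the smallest error.
    """
    exps = [math.exp(-e / T) for e in errors]
    Z = sum(exps)  # partition function normalizes the weights
    return [v / Z for v in exps]
```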

Layout of the Deep Neural Network Ensemble
The schematic in Figure 1 shows the layout of the deep neural network ensemble proposed in this research. The ensemble consists of a set of K sub-networks, where each is a D × H × 1 neural network, D and H being the number of neurons in the input and hidden layers. Each hidden neuron realizes a sigmoidal saturating nonlinearity, while the output neuron is a weighted summing unit. The weights (including biases) of sub-network k ∈ K are collectively represented in terms of a vector, w^(k) ∈ ℝ^((D+2)H+1). A training sample consists of the input vector x_i ∈ X ⊂ ℝ^D_+ and a scalar desired output, y_i ∈ [0, y_max], where i ∈ {1, . . . , |N|}, N is the training dataset, X is the input metric space, and y_max is the maximum possible failure experienced in the grid. The output of sub-network k corresponding to the ith sample is denoted as y_i^(k).

Supervised Learning in Sub-Networks
The sub-network k is incrementally trained using weighted stochastic gradient descent as,

w^(k) ← w^(k) − η β_i^(k) ∇_{w^(k)} (y_i^(k) − y_i)² − η ξ ∇_{w^(k)} R(w^(k)).    (1)

In the above expression, η is the learning rate and the quantities β_i^(k) ∈ (0, 1) will be referred to as ensemble weights (not to be confused with the sub-network weights w^(k)). The optional second term is meant to regularize the weight vector with the ridge regression function R(·); it can be excluded by setting the regularization parameter ξ to zero. The weight updates in (1) can be algorithmically realized by means of a single step of backpropagation. The ensemble output is obtained as the weighted sum of the sub-network outputs,

ŷ_i = Σ_{k∈K} β_i^(k) y_i^(k).    (2)

It is evident from the above expression that the quantity β_i^(k) determines the weight assigned to sub-network k in the ensemble output. Likewise, the update rule in (1) shows that each sub-network's own weight vector w^(k) is incremented in proportion to β_i^(k). Each sub-network can be viewed as implementing some neural network function f_NN(·), so that

y_i^(k) = f_NN(x_i; w^(k)).    (4)

Evaluating this function requires a forward pass through the sub-network. The expected absolute error over the entire dataset has an upper bound ε, so that E[|y_i − ŷ_i|] ≤ ε. The universal approximation theorem (cf. [28]) shows that as long as K is sufficiently large, then with proper training the upper bound ε can be made as small as possible, regardless of the probability distribution of the sample points in X.
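The weighted update rule can be sketched as follows. A linear stand-in model ŷ = w · x replaces the sigmoidal sub-network so that the gradient stays a single line; this simplification and the function name are assumptions for illustration, not the paper's implementation.

```python
def weighted_sgd_step(w, x, y, beta, eta, xi=0.0):
    """One weighted stochastic-gradient step on a linear stand-in
    sub-network y_hat = w . x.

    The squared-error gradient is scaled by the ensemble weight
    `beta`; `xi` adds the optional ridge-regularization term.
    """
    y_hat = sum(wi * xj for wi, xj in zip(w, x))
    grad = [2.0 * (y_hat - y) * xj for xj in x]  # d/dw (y_hat - y)^2
    # gradient step plus ridge shrinkage (gradient of xi * ||w||^2)
    return [wi - eta * beta * g - eta * xi * 2.0 * wi
            for wi, g in zip(w, grad)]
```

Note that a sub-network with ensemble weight beta = 0 is left untouched by the sample, exactly as the update rule prescribes.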
With T denoting the Boltzmann parameter, the ensemble weights β_i^(k) are determined according to the expression below:

β_i^(k) = exp(−δ_i^(k)/T) / Z.    (5)

The denominator Z in the above expression is the partition function,

Z = Σ_{k∈K} exp(−δ_i^(k)/T).    (6)

The purpose of the partition function is to normalize the ensemble weights so that Σ_{k∈K} β_i^(k) = 1. The quantity δ_i^(k) is the normalized error. The error reflects the amount by which a sub-network estimate, y_i^(k), deviates from the corresponding desired value, y_i. However, due to the wide range of the desired output values y_i (with y_i = 0 being the most common, and higher values occurring increasingly infrequently), the absolute difference |y_i^(k) − y_i| between the two quantities is normalized by dividing it by 1 + y_i, so that δ_i^(k) can be interpreted as the fractional error. The 1 is included in the denominator to ensure that δ_i^(k) stays finite when y_i = 0. In accordance with these observations,

δ_i^(k) = |y_i^(k) − y_i| / (1 + y_i).    (7)

The expression in (5) shows that larger weights are assigned to sub-networks with lower δ_i^(k), so that they are trained more with those samples for which they are more accurate. The role of the Boltzmann parameter T is addressed in the next subsection.
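The normalized error can be sketched directly from its definition (the function name is illustrative):

```python
def fraction_error(y_hat, y):
    """Normalized error: |y_hat - y| / (1 + y).

    The 1 in the denominator keeps the error finite when y = 0,
    the most common desired value in the outage data.
    """
    return abs(y_hat - y) / (1.0 + y)
```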

Partitioning Using Unsupervised Learning
From (5) and (6), it can be seen that all sub-networks in the ensemble are weighted equally when T is high,

β_i^(k) → 1/|K| as T → ∞.    (8)

Accordingly, the Boltzmann parameter T is initialized to a sufficiently high value, T_∞, so that all training samples in the dataset N are equally utilized while training each sub-network. From another standpoint, T = T_∞ corresponds to the situation where the input space is not decomposed into Voronoi partitions. Conversely, when T → 0, the ensemble weights become binary valued, being equal to unity only for the sub-network k whose output y_i^(k) is closest to its desired value y_i, and zero for all others. More specifically,

β_i^(k) → 1 if k = argmin_{k'∈K} δ_i^(k'), and β_i^(k) → 0 otherwise, as T → 0.    (9)

In this situation, the input space X is partitioned into a set of Voronoi regions X^(k), k ∈ K, such that X^(k) ∩ X^(k') = ∅ whenever k ≠ k', and ∪_{k∈K} X^(k) = X. If the input x_i ∈ X^(k) for some k, only the corresponding sub-network k that yields the most accurate output is subjected to further training. The softmax output neuron functions as a switch, selecting the appropriate sub-network for input x_i, which is hereafter referred to as the 'winner' sub-network. Intermediate values of T have the effect of partitioning X into Voronoi regions with 'fuzzy' boundaries as in [24]. At the end of each training epoch, the Boltzmann parameter is geometrically decreased by a factor γ ∈ (0, 1) (referred to as the 'cooling' rate) such that,

T ← γT.    (10)

The rate γ is kept sufficiently close to unity to prevent the premature partitioning of X. Upon the termination of the gradient descent training procedure, each sample in N can be placed into its own Voronoi partitioned subset, N^(k),

N^(k) = { i ∈ N : k = argmin_{k'∈K} δ_i^(k') }.    (11)

The convexity of Voronoi partitions is well-established. Consequently, their centroids c^(k) ∈ X^(k) can be empirically estimated in the following manner,

c^(k) = (1/|N^(k)|) Σ_{i∈N^(k)} x_i.    (12)

The weights and centroids are the sole parameters that fully define the ensemble Ω,

Ω = { (w^(k), c^(k)) : k ∈ K }.    (13)

The other quantities, including the ensemble weights, are no longer required after the ensemble has been fully trained.
This is because the outage estimate y for an unknown input x ∉ N is obtained by comparing the distances from x to the centroids c^(k) contained in Ω, and then using the sub-network with the smallest distance for the estimation. In other words,

y = f_NN(x; w^(k*)),    (14)

where k* is the winning sub-network, i.e.,

k* = argmin_{k∈K} ‖x − c^(k)‖.    (15)

The steps involved in the training algorithm, as well as in estimating the outage for an unknown input x, are outlined in Figure 2. Initially, the sub-network weights w^(k) are assigned small random values and T is set to its maximum, T_∞. Each iteration of the outer loop comprises a training epoch, where all samples in the dataset N are visited in a random order. In an epoch, each sub-network k ∈ K of the ensemble undergoes training in the following manner.
At first, the sub-network output y_i^(k) is obtained through a forward pass (4). This is used to compute the normalized error δ_i^(k), as in (7). To implement (6), the partition function Z is reset to 0 and is incremented in a stepwise manner. Next, the ensemble weights β_i^(k) are computed as per (5), following which the sub-network weight vector w^(k) is incremented as shown in (1). Note that the regularization term in (1) is not shown in Figure 2. At the end of each epoch, T is reduced as in (10). The convergence condition is satisfied either when the Boltzmann parameter T acquires a very small value, or when the estimated average absolute error saturates at a steady value, indicating that the stochastic gradient descent process has reached a minimum. Upon exiting the iterative process, the samples in N are placed in their own partitions, N^(k), as shown in (11). The last two steps of the training algorithm involve computing the centroids c^(k) and the ensemble Ω in accordance with (12) and (13). The fully trained ensemble can now be used to obtain a reliable estimate y of the outages from an arbitrary input x ∉ N, as described in (14) and (15).
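The overall training loop and the subsequent nearest-centroid estimation can be sketched as follows. Linear stand-in sub-networks replace the D × H × 1 sigmoidal networks to keep the sketch short, so this is an illustration of the control flow of Figure 2 under that assumption, not the paper's implementation; all function and variable names are illustrative.

```python
import math
import random

def train_dnne(data, K, eta=0.2, gamma=0.95, T_inf=5.0, epochs=100):
    """Skeleton of the ensemble training loop (linear stand-ins)."""
    rng = random.Random(0)
    dim = len(data[0][0])
    W = [[rng.uniform(-0.01, 0.01) for _ in range(dim)] for _ in range(K)]
    T = T_inf
    for _ in range(epochs):
        rng.shuffle(data)                                   # one epoch over N
        for x, y in data:
            outs = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
            delta = [abs(o - y) / (1.0 + y) for o in outs]  # errors, as in (7)
            exps = [math.exp(-d / T) for d in delta]
            Z = sum(exps)                                   # partition function (6)
            beta = [e / Z for e in exps]                    # ensemble weights (5)
            for k in range(K):                              # weighted step, as in (1)
                g = 2.0 * (outs[k] - y)
                W[k] = [wi - eta * beta[k] * g * xi
                        for wi, xi in zip(W[k], x)]
        T *= gamma                                          # cooling, as in (10)
    # place samples into their partitions and compute centroids
    parts = [[] for _ in range(K)]
    for x, y in data:
        outs = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
        k = min(range(K), key=lambda j: abs(outs[j] - y))   # winner, as in (11)
        parts[k].append(x)
    cents = [[sum(c) / len(p) for c in zip(*p)] if p else None
             for p in parts]                                # centroids, as in (12)
    return W, cents                                         # the ensemble, as in (13)

def predict(x, W, cents):
    """Estimate via the sub-network with the nearest centroid."""
    ks = [k for k in range(len(W)) if cents[k] is not None]
    k = min(ks, key=lambda j: sum((a - b) ** 2
                                  for a, b in zip(x, cents[j])))
    return sum(wi * xi for wi, xi in zip(W[k], x))
```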

Gaussian Distribution Assumption
In this model, it is assumed that the probability of a sample under sub-network k follows a Gaussian distribution with some constants C and σ,

Pr[x_i | k] = C exp(−‖x_i − µ^(k)‖² / (2σ²)).    (16)

It can readily be shown that the maximum likelihood estimate y, given any input x, is obtained using the expressions provided earlier in (14) and (15). Let us define an assignment mapping α : X → K. For the sake of conciseness, given any sample x_i ∈ N, let α(x_i) be written as α_i. The joint probability of the dataset N is given by,

Pr[N] = ∏_{i∈N} C exp(−‖x_i − µ^(α_i)‖² / (2σ²)).    (17)

In order to maximize Pr[N], the summation Σ_{i∈N} ‖x_i − µ^(α_i)‖² must be minimized. It follows that the optimal assignment α* and Gaussian centers µ*^(k) are defined according to the expression below,

(α*, µ*) = argmin_{α, µ} Σ_{i∈N} ‖x_i − µ^(α_i)‖².    (18)

From (18), the optimal assignment can be determined in a straightforward manner, since for each sample,

α*_i = argmin_{k∈K} ‖x_i − µ^(k)‖.    (19)

It can be observed that (19), shown above, is identical to the expression in (15), provided µ^(k) = c^(k).
Once the samples x_i have been mapped to their respective partitions α*_i, we turn our attention to obtaining the optimal locations of the Gaussian means, µ*^(k). Let us define the partitions N^(k) = {x_i | i ∈ N, k = α*_i}, so that the summation in (18) can be expressed as,

Σ_{k∈K} Σ_{i∈N^(k)} ‖x_i − µ^(k)‖².    (20)

It is evident that the inner summations Σ_{i∈N^(k)} ‖x_i − µ^(k)‖² can be minimized separately for each k. It follows that µ*^(k) = argmin_µ Σ_{i∈N^(k)} ‖x_i − µ‖². It can be readily established using a few algebraic manipulations that the optimal Gaussian location is,

µ*^(k) = (1/|N^(k)|) Σ_{i∈N^(k)} x_i.    (21)

The right-hand side of (21) is identical to that in (12), providing the rationale for selecting the winning sub-network in (15).
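The claim that the empirical mean minimizes the within-partition sum of squared distances, as stated in (21), can be checked numerically with a small sketch (helper names are illustrative):

```python
def sum_sq_dist(points, mu):
    """Sum over points of the squared Euclidean distance to mu."""
    return sum(sum((p - m) ** 2 for p, m in zip(pt, mu)) for pt in points)

def centroid(points):
    """Empirical mean of the points: the optimal Gaussian location."""
    n = len(points)
    return [sum(c) / n for c in zip(*points)]
```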

Markov Random Field Viewpoint
A Markov random field model interpretation [29] best illustrates the influence of the Boltzmann parameter T. Let each sample input x_i ∈ N in the dataset be treated as a vertex of a graph G ≡ (N, E), where the set of edges is E ⊂ N × N. The graph G is undirected, so that (i, j) ∈ E ⇔ (j, i) ∈ E. Furthermore, it is assumed that the vertices in G are divided into |K| maximal cliques, N^(k), k ∈ K. All vertices in each N^(k) are assumed to be connected to one another, so that i, j ∈ N^(k) ⇒ (i, j) ∈ E. A clique potential function Ψ : K → ℝ_+ can be defined in the following manner,

Ψ(k) = Σ_{i∈N^(k)} ‖x_i − c^(k)‖².    (22)

The energy of the system is the sum of all clique potentials,

E = Σ_{k∈K} Ψ(k).    (23)

From the Hammersley-Clifford theorem [30], at any given value of T, the joint probability of all outages in N is given by the Gibbs distribution,

Pr[N] ∝ exp(−E/T).    (24)

It is trivial to establish from (22) and (23) that E = Σ_{k∈K} Σ_{i∈N^(k)} ‖x_i − c^(k)‖². Using this identity, the joint probability in (24) reduces to,

Pr[N] ∝ exp(−(1/T) Σ_{k∈K} Σ_{i∈N^(k)} ‖x_i − c^(k)‖²).    (25)

It is evident that the expressions for the joint probability Pr[N] in (25) and (16) are equivalent with the appropriate choice of centroids and of σ as a function of T. Steadily lowering the Boltzmann parameter T as in (10) serves to increase this probability. This behavior is analogous to that observed in spin glass models in statistical physics.

Results
The four cities involved in this study are hereafter referred to as A, B, C, and D, labeled according to their population sizes. Thus, city A is the least populous city, whereas city D is the most populous. The available data were divided into training and test samples, separately for each city. In the following discussion, N could represent any of the four cities and either the training or the test dataset, depending on the context.

Performance Metrics
In order to analyze the performance of the proposed deep neural network ensemble (DNNE), the following metrics were used.
(i) Mean Absolute Error: MAE = (1/|N|) Σ_{i∈N} |y_i − ŷ_i|, where ŷ_i is the estimated outage. It must be noted that since larger cities are expected to experience more outages, the MAE increases with city size, regardless of the model used.
(ii) Mean Squared Error: MSE = (1/|N|) Σ_{i∈N} (y_i − ŷ_i)². For the same reason as above, the MSE for city A will be the smallest, and that of city D, the highest.
(iii) Slope: This is the slope S of the best linear fit between the estimated and actual outages that passes through the origin. Larger values of S indicate better performance, with S = 1 being the limiting case of 100% accuracy.
(iv) Coefficient of Correlation: R = Σ_{i∈N} (ŷ_i − µ_ŷ)(y_i − µ_y) / √(Σ_{i∈N} (ŷ_i − µ_ŷ)² Σ_{i∈N} (y_i − µ_y)²), where µ_ŷ and µ_y are the average values of the estimated and actual outages. Larger values of R indicate better performance, with R = 1 being the limiting case of 100% accuracy.
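Under the reading above, the four metrics can be sketched as follows. The direction of the origin-constrained fit (estimated regressed on actual) is an assumption, since the text does not specify it.

```python
def perf_metrics(y_true, y_est):
    """MAE, MSE, origin-constrained slope S, and correlation R."""
    n = len(y_true)
    mae = sum(abs(a - b) for a, b in zip(y_est, y_true)) / n
    mse = sum((a - b) ** 2 for a, b in zip(y_est, y_true)) / n
    # least-squares line through the origin: S = sum(est*act)/sum(act^2)
    S = (sum(a * b for a, b in zip(y_est, y_true))
         / sum(b * b for b in y_true))
    mu_e, mu_t = sum(y_est) / n, sum(y_true) / n
    cov = sum((a - mu_e) * (b - mu_t) for a, b in zip(y_est, y_true))
    sd_e = sum((a - mu_e) ** 2 for a in y_est) ** 0.5
    sd_t = sum((b - mu_t) ** 2 for b in y_true) ** 0.5
    R = cov / (sd_e * sd_t)
    return mae, mse, S, R
```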

Weather-Related Outage Prediction
The daily outages due to wind and lightning, from the beginning of 2005 until the last day of 2011, were obtained from the utility company. The training dataset comprised daily samples from the period 2005-2009, while the test dataset had daily samples from two years, 2010 and 2011. The parameter values were η = 0.2, γ = 0.95, and T_∞ = 5. The weights w^(k) were initialized to very small random values, and a maximum of 100 epochs was allowed, which was enough to ensure convergence while avoiding overfitting. As overfitting was not encountered, regularization was not used (ξ = 0).
Early experiments with first- and second-order polynomial regression strongly suggested that statistical methods may not be well-suited for outage estimation tasks. For conciseness, these results are not reported in this section. Classical neural networks consistently outperformed them in all of the above metrics, corroborating earlier conclusions drawn with weather-related outages. Accordingly, neural networks and AdaBoost+ are the only two alternative methods adopted here for comparison with the proposed DNNE. A cardinality |K| = 4 was found to be adequate for the purpose, with larger ensembles providing insignificant gains while increasing the computational burden.
Each sample (indexed i) contained the total daily lightning strikes L_i, the daily maximum wind gust speed W_i, and the total outages y_i due to wind and lightning for the day. Although the numerical range of L_i for each city was very large, the distribution was highly skewed towards lower values. Hence, log(1 + L_i) and W_i were the two inputs to each model. This also allowed the inclusion of days with zero lightning in the analysis.
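The input preparation described above can be sketched as follows, assuming each quantity is normalized by its maximum over the dataset; the paper states that the inputs are normalized but does not give the exact constants, so the normalization here is an assumption.

```python
import math

def weather_inputs(lightning_total, gust_mph, l_max, w_max):
    """Model inputs: log-compressed daily lightning total and wind
    gust, each scaled to [0, 1] by an assumed dataset maximum.

    log1p(L) = log(1 + L) keeps zero-lightning days representable.
    """
    return (math.log1p(lightning_total) / math.log1p(l_max),
            gust_mph / w_max)
```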
Initial investigations were carried out to draw insights into the convergence properties of the DNNE. Figure 3 shows the behavior of the learning algorithm as the Boltzmann parameter T, which is initialized to a high value of T_∞, progressively decreases at the end of each training epoch. Figure 3 (left) shows how the MSE (scaled up by a factor |K|) drops steadily with decreasing T. Figure 3 (right) shows the evolution of the quantities β_i^(k). They are initialized to identical values of 1/|K| as in (8), so that the outputs of all sub-networks are weighted equally, as seen in (2). It can be observed that for each input x_i, the weight β_i^(k) steadily increases for only a single sub-network k in K, while simultaneously decreasing for the remaining ones. Upon termination, when the MSE saturates to its asymptotic value, the values of β are differentiated enough to assign the samples to their eventual partitions. More interestingly, for a few initial training epochs, the β_i^(k) do not show any perceptible differences. The onset of differences can be observed beyond a certain critical threshold (log T^(−1) ≈ −1.4). This observation is consistent with spin glass models in statistical physics, where there exists a critical temperature below which particle spins within each domain align in the same direction. Although only the results of city D are shown in Figure 3, those of the other cities followed very similar patterns. The final assignment of the samples to their respective partitions is shown in Figure 4, separately for all four cities. It can be observed that the distributions of lightning strikes are heavily skewed towards L_i = 0. Motivated by this observation, the task of weather-related outage estimation is extended so that one deep network ensemble is trained to handle samples with L_i = 0, and another, samples with L_i > 0. This approach is abbreviated as DNNE-H (i.e., DNNE-Hybrid). Table 1 shows the performances of the neural network, AdaBoost+, DNNE, and DNNE-H.
For each metric, the best performance is highlighted in bold. It can be clearly seen that in all cases, DNNE and DNNE-H outperformed the other models. Furthermore, in most cases, the performance of DNNE-H was marginally better than that of DNNE. Figures 5 and 6 show the scatter plots of the observed and estimated outages obtained using all four models. Values of one for the regression coefficient (R) and the slope of the regression line (S) imply perfect prediction. The results illustrate the overall superior performance of the proposed approach in comparison to previous methods, with both R and S being closer to 1 in DNNE and DNNE-H.
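The DNNE-H routing can be sketched as a simple dispatch on the day's lightning total; the function name is illustrative, and the two models are arbitrary callables standing in for trained ensembles.

```python
def dnneh_estimate(lightning_total, x, model_zero, model_nonzero):
    """DNNE-H: one ensemble handles zero-lightning days, another
    handles days with lightning; route the input accordingly."""
    model = model_zero if lightning_total == 0 else model_nonzero
    return model(x)
```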

Animal-Related Outage Prediction
To investigate the effectiveness of the DNNE for animal-related outage estimation, further simulations were carried out using animal-related data, which involved the weekly outages of the four cities, A, B, C, and D. Weekly outages during the nine-year period between 1998 and 2006 were used as the training dataset; the test dataset comprised the outages that occurred in a single year, 2007. The sample inputs were 3 × 1 vectors consisting of the number of fair-weather days of the week F_i, the month type M_i, and the outages y_{i−1} of the immediately preceding week. The DNNE sub-networks were of size 3 × 5 × 1 and the parameters were kept at η = 0.2, γ = 0.95, ξ = 0, and T_∞ = 10. The initial weights w^(k) were very small random quantities. A cardinality |K| = 4 was used, and a maximum of 150 epochs was allowed. Table 2 shows the performances of a simple neural network, AdaBoost+, and the proposed DNNE in terms of the four metrics. It can be seen that the DNNE consistently outperformed the other models for the training datasets of cities B, C, and D. In city A, although AdaBoost+ yielded a value of R closer to unity than the DNNE's, the difference was too small (0.0051) to be of significance. The test data produced very similar patterns, with AdaBoost+ being marginally better in terms of R (0.0164) in city C. The only other anomalous pattern was in city D's value of S, where the neural network outperformed both AdaBoost+ and DNNE. This may be attributed to the large degree of randomness present in the datasets.

Conclusions
The objective of this paper is to propose a novel deep neural network approach for estimating weather-related outages in electric power distribution systems. The proposed method relies on a divide-and-conquer strategy that partitions the entire input space into non-overlapping partitions, and a hybrid training algorithm combining supervised and unsupervised learning, which under certain assumptions avoids overfitting. The results obtained for outages caused by wind and lightning show that the accuracy of the proposed model is a marked improvement over that of other models. In addition, the results indicate that separately estimating the outages of days with zero and nonzero lightning yielded a further (albeit marginal) improvement in prediction accuracy. Similarly, the results for animal-related outages with the proposed model show improvement over previous models. While the proposed models show promising results, they still underestimate the outages caused by wind, lightning, and animals. More research is needed to further improve modeling for the estimation of these outages in power distribution systems.
The results presented in the paper are specific to the selected locations. Geographic features, climate, and other local factors can influence the results. However, the focus of the paper is to present a methodology, which can be applied at any location with some modifications.

Nomenclature
Frequently used symbols and their associated meanings are provided in Table 3 below.

Funding: This research was funded by the National Science Foundation, grant number ECCS-0926020, and the APC was funded by the Electrical and Computer Engineering Department, Kansas State University.