The results of the analyses are summarized in this section. The first subsection provides a few examples of the statistical characterization (descriptive statistics) of the feeder population. The second subsection provides some insights into the approach used to select the parameters used for classifying and clustering the feeders. The results of the classification and of the clustering are summarized in the last two subsections.
3.1. Statistical Analysis of the Feeders
Before starting the analyses on classification and clustering, the population of feeders has been explored and analyzed with different statistical methods. In this section, only a few examples are provided. A more detailed analysis has been presented in [23].
Since the R/X ratio of distribution feeders significantly affects the potential of reactive power-based voltage control, a closer look at the R/X distribution has been taken.
Figure 2 shows the distribution of the feeder R/X ratio at the end node for the scenario “uniform” (see Section 2.1). For each bar of the histogram (length class), the share of voltage-constrained feeders is shown in blue, the share of feeders that are both voltage- and current-constrained in magenta, and the share of current-constrained feeders in red.
This R/X ratio plays an important role in the effectiveness of reactive power control for limiting the voltage rise, as shown in Equation (4) [29] (for one generator connected at the end of the feeder):

ΔU = P·(R − X·tan ϕ)/Un  (4)

where ΔU is the voltage rise caused by the power infeed, Un the nominal voltage, R and X the feeder resistance and reactance, P the injected active power, and ϕ the displacement angle. By consuming reactive power (increasing ϕ), the voltage rise can be partly compensated: the lower the R/X ratio, the more effective the control.
This figure shows that the R/X ratio is almost always above 1, which is in accordance with the usual assumption that LV feeders have a “large” R/X ratio. Only about 21% of the feeders have an R/X ratio below 2.4, which allows a compensation of the voltage rise by 20% with cos ϕ = 0.90. The large peak at an R/X ratio of 2.6 corresponds to the most common cable type, Al 150 mm². Another aspect to consider in this context is that the reactive power consumption leads to an increase of the current, which might turn originally voltage-constrained feeders (without reactive power-based voltage control) into current-constrained ones (with control).
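The 20% figure can be checked directly: to first order, the compensated share of the voltage rise is tan ϕ divided by the R/X ratio. A minimal sketch (not from the paper; function name is illustrative):

```python
import math

def compensation_share(r_over_x, cos_phi):
    """Share of the voltage rise compensated by consuming reactive power
    at displacement factor cos_phi, for a feeder with the given R/X ratio
    (first-order: compensated share = tan(phi) / (R/X))."""
    tan_phi = math.tan(math.acos(cos_phi))
    return tan_phi / r_over_x

# R/X = 2.4 with cos(phi) = 0.90 gives roughly 20% compensation
print(round(compensation_share(2.4, 0.90), 3))  # 0.202
```

The same expression also makes the qualitative statement explicit: for a fixed cos ϕ, a lower R/X ratio yields a larger compensated share.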
The increase of current due to the reactive power flows caused by the voltage control can be estimated with Equations (5) and (6) ((5) is obtained by neglecting the voltage drop which is in quadrature with the voltage [30], and (6) is derived from a first-order approximation of the current, considering small voltage variations):

ΔUmax = P0·R/Un = PQ·(R − X·tan ϕ)/Un  (5)

IQ/I0 ≈ PQ/(P0·cos ϕ) = 1/(cos ϕ·(1 − tan ϕ·X/R))  (6)

where ΔUmax is the maximum allowed voltage rise (3% for LV feeders according to [5]), defining the hosting capacity for voltage-constrained feeders, R and X are the feeder resistance and reactance, P0 and PQ the hosting capacity without and with reactive power control, I0 and IQ the corresponding feeder currents, and Un the nominal voltage. Putting these simple equations in relation with the R/X statistics shown in Figure 2 leads to the following conclusion: about 50% of the feeders have an R/X ratio below 3, which results in an increase of the current by a factor of 1.33 (+33%). This means that these feeders could only fully benefit from a reactive power-based voltage control if the maximum loading (without control) is below 1/1.33 = 0.75 p.u.
The remainder of this section is devoted to the main objective of this study: performing and analyzing feeder classification and clustering methods.
3.2. Parameter Selection and Data Reduction
Before performing the clustering analysis and classification (see Section 2.1), several data exploration techniques have been used to get a better understanding of the relations between the parameters intended to be used for clustering and classification. Although the number of variables (feeder parameters) available for the analysis does not require the selection of a subset (i.e., the use of data reduction/feature selection techniques), the proximity between these variables has been analyzed in a first step. For this, three methods have been used:
In order to investigate the correlation between parameters, the Spearman correlation has been computed for the whole data set (all feeders), and a threshold of 0.70 has been used to identify significantly correlated variables.
Figure 3 shows the correlation between parameters (only the lower triangle is shown). Red crosses indicate a high correlation (>0.70) and blue crosses a low correlation (≤0.70). The set of parameters with a correlation lower than the considered threshold of 0.70 is indicated with blue squares: three “poorly correlated parameters”, plus two more which can be taken out of the group of correlated parameters.
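The thresholding step can be sketched in a few lines (the paper's analysis used Matlab; the data and variable names below are purely illustrative). Spearman correlation is the Pearson correlation of the ranks:

```python
import math

def _ranks(values):
    """1-based ranks; ties get the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation (Pearson correlation of the ranks)."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical feeder parameters: a length-like variable, a monotone companion
# (perfectly rank-correlated) and an unrelated one
length = [0.2, 1.5, 0.8, 2.3, 0.5, 1.1]
r_sum  = [0.1, 0.9, 0.5, 1.4, 0.3, 0.7]
noise  = [5.0, 1.0, 4.0, 2.0, 6.0, 3.0]

THRESHOLD = 0.70
print(spearman(length, r_sum) > THRESHOLD)  # True  (correlated pair)
print(spearman(length, noise) > THRESHOLD)  # False (poorly correlated)
```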
Finally, a principal component analysis has been performed in order to validate the variable selection and allow a graphical representation of the clustering result.
Principal component analysis consists of an orthonormal transformation into components which are linear combinations of the variables. The components are selected so as to maximize the variance explained by the first components. In this case, the first two principal components lead to a ratio of explained variance of about 66%. These first two components have the following main parameters (see Figure 4): component 1 is dominated by the parameter Rsum (and its correlated parameters, such as the three distance metrics), while component 2 is dominated by the parameters km/load and ANON (and ADTN). The first component therefore mainly relates to the feeder length and impedance, and the second component to the structural properties (different metrics related to load density).
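The PCA step can be sketched via a singular value decomposition of the standardized data (the data below is a synthetic stand-in, not the feeder data set; with two underlying factors, the first two components dominate, mimicking the structure described above):

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-in: 5 observed parameters driven by 2 underlying factors
# (e.g., a "length/impedance" factor and a "load density" factor) plus noise
n = 1000
factors = rng.normal(size=(n, 2))
X = factors @ rng.normal(size=(2, 5)) + 0.3 * rng.normal(size=(n, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize before PCA

# PCA via SVD of the standardized data
U, s, Vt = np.linalg.svd(X, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)        # explained variance ratio
scores = X @ Vt.T                          # principal component scores

print("explained by first two components:", round(explained[:2].sum(), 2))
```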
The results from all these analyses are coherent and confirm the parameter selection. The final set of “poorly correlated variables” has been determined by selecting, within each group of correlated parameters, the parameter with the highest variance. These parameters are:
In_max
In_avg
km/load
ANON
Rsum
3.3. Classification of LV Feeders
As explained in Section 2.2, the data set of about 24,000 LV feeders has been classified, using as explanatory variables the feeder parameters mentioned in Section 2.1 (Table 2) and as category the following characteristics:
Among the different supervised machine-learning techniques, such as neural networks or discriminant analysis, classification trees have been chosen for their simplicity and interpretability. The software package Matlab has been used for this purpose.
In a first step, a fully-grown classification tree (without restriction on the tree depth, i.e., a deep tree) has been built. It uses all 12 available parameters (or predictors). Before looking at the tree properties, the predictor importance can be evaluated. It quantifies the contribution of each variable (predictor) to the splits of the tree. Figure 5 shows the predictor importance for the fully-grown tree: the most important variable is kWm, followed by the parameters kWOhm, FeederLength, In_avg and LastBusDist (about 10 times less important).
Deep classification trees (such as the one obtained from this first attempt, the fully-grown tree) are known to be prone to overfitting, which means that the good fit obtained on a training set cannot be reproduced on a different set (testing set). In such cases, the tree has memorized the training set instead of learning the general data structure.
Several alternatives are possible in order to avoid overfitting. One possibility is to constrain the tree-building process by specifying a maximum number of splits or a minimum leaf size. The drawback of this method is that the constraints must be set from the beginning, i.e., before having a good understanding of the data. Another widely used possibility is to prune the tree (merge leaves) in order to reduce its complexity [31]: this is known as cost-complexity pruning. In order to evaluate the performance of a classifier, two generic indicators are widely used: the resubstitution error and the cross-validation error. The first is simply evaluated by counting the misclassified observations on the whole data set, while the second requires separating the data set into a training set and a testing set (usually with the proportion 90%/10%). The advantage of using the cross-validation error is that over-fitting can be detected.
Figure 6 shows the misclassification errors (resubstitution and cross-validation) as a function of the pruning level: a low number of splits (only two) is enough to obtain the best achievable classification performance (any increase in the number of splits does not reduce the cross-validation error and might lead to over-fitting).
This figure also shows that a low classification error (cross-validation error) can be achieved (about 3.4%). This result should, however, be carefully interpreted. Indeed, the data set is heavily unbalanced (skewed), with about 90% of the observations falling in one category (voltage-constrained feeders) and about 10% in the other (current-constrained feeders); see Section 3.1 [23]. This means that a trivial classifier always predicting the majority class would already lead to a rather low misclassification error (10%). To account for this, several options are possible. The first is to partition the data set (training and testing sets) so as to have a balanced proportion of both classes. The second is to adjust the misclassification costs and force the classification to be “equally good” for both categories. In this work, the second option has been selected, with cost weights reflecting the ratio between voltage-constrained and current-constrained feeders (about 90/10). When doing so, the corrected cross-validation error increases to 15.3%.
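The cost-weighting option can be sketched with scikit-learn, where `class_weight` plays the role of the misclassification-cost ratio (the paper used Matlab; the skewed data below is synthetic and only illustrates the mechanism):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Skewed synthetic set: roughly 90% class "U", 10% class "I"
n = 2000
X = rng.normal(size=(n, 2))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=n) > 1.3, "I", "U")

# Errors on the rare class "I" are penalized ~9 times more,
# mirroring the ~90/10 class ratio
tree = DecisionTreeClassifier(max_depth=3, class_weight={"I": 9.0, "U": 1.0},
                              random_state=0)
errors = 1.0 - cross_val_score(tree, X, y, cv=10)
print("cross-validation error:", round(errors.mean(), 3))
```

As in the paper, the weighted error is higher than the unweighted one, because the classifier is no longer rewarded for simply favoring the majority class.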
The final classification tree obtained from this analysis (specifying misclassification costs to “balance” the data set, and pruning to obtain the best compromise between complexity and accuracy) is shown in Figure 7.
In the considered application (classification of LV feeders into voltage-constrained and current-constrained feeders for network planning purposes), misclassification does not have the same impact for both categories (voltage- and current-constrained feeders).
Indeed, one of the options to extend the hosting capacity is to implement a reactive power-based voltage control. As explained in the introduction, this type of control allows reducing the voltage rise caused by the infeed from distributed energy resources, at the cost of an increase of the current resulting from the additional reactive power flow. Since the current is not observed with the considered voltage control concepts, the deployment of such solutions should be limited to feeders which are actually voltage-constrained and not current-constrained. For this reason, a heavily unsymmetrical cost function has been introduced to force the classifier to avoid misclassification of current-constrained feeders. With this cost function, none of the current-constrained feeders is classified as voltage-constrained: there is no misclassified feeder I→U (true class = I, predicted class = U). This is reached at the cost of a significantly higher overall misclassification, as seen in Table 3. In order to look into this, the confusion matrix can be used (Table 3): the left part of this table shows the confusion matrix of the pruned tree with “balanced” misclassification costs (i.e., reflecting the ratio between classes). The ratio of problematic misclassified feeders (I→U) is 3.3%. In order to bring this ratio down to 0, high (I→U) “selective” misclassification costs are specified. As a side effect, the ratio of (U→I) misclassified feeders increases strongly, from 11.4% to 53.8% (right part of the table).
To further analyze the classifier performance, different indicators have been computed:
Accuracy: probability of a correct classification among the data set (Equation (7))
Sensitivity: the ability to classify correctly I-constrained feeders among the I-constrained feeders (Equation (8))
Specificity: the ability to classify correctly U-constrained feeders among the U-constrained feeders (Equation (9))
False positive rate (false alarm rate): the rate of U-constrained feeders which have been classified as I-constrained feeders (Equation (10)):

Accuracy = (|I→I| + |U→U|)/|F|  (7)
Sensitivity = |I→I|/|FI|  (8)
Specificity = |U→U|/|FU|  (9)
False positive rate = |U→I|/|FU|  (10)

where |·| stands for the number of elements in the corresponding subset, F is the whole set of feeders, and FI and FU the subsets of I- and U-constrained feeders.
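The four indicators of Equations (7)–(10) follow directly from the confusion matrix counts; a minimal sketch (function name and the example counts are illustrative, not the paper's confusion matrix):

```python
def classifier_indicators(conf):
    """Indicators of Equations (7)-(10) from a 2x2 confusion matrix given as
    conf[(true_class, predicted_class)] counts; "I" is the positive class."""
    n_i = conf[("I", "I")] + conf[("I", "U")]   # |FI|: all I-constrained feeders
    n_u = conf[("U", "U")] + conf[("U", "I")]   # |FU|: all U-constrained feeders
    return {
        "accuracy": (conf[("I", "I")] + conf[("U", "U")]) / (n_i + n_u),
        "sensitivity": conf[("I", "I")] / n_i,
        "specificity": conf[("U", "U")] / n_u,
        "false_alarm_rate": conf[("U", "I")] / n_u,
    }

# Illustrative counts only
example = {("I", "I"): 95, ("I", "U"): 5, ("U", "U"): 810, ("U", "I"): 90}
print(classifier_indicators(example))
```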
Besides the indicators which are commonly derived from the confusion matrix (first three indicators), the fourth one plays an important role in this study as previously explained (reactive power-based voltage control should not be implemented in I-constrained feeders).
Table 3 shows that using high costs for the misclassification of I-constrained feeders allows reaching 100% sensitivity at the expense of a rather high false alarm rate (53.8%), which represents a strong loss of potential. In fact, a trade-off between sensitivity and false alarm rate must be found. In order to visualize this, Receiver Operating Characteristic (ROC) graphs can be used [32]. A ROC curve visualizes the performance of classifiers, and in particular the trade-off between sensitivity (y-axis) and false alarm rate (x-axis). A random classifier would have the diagonal (0, 0)–(1, 1) as ROC curve, while a perfect classifier would follow the y- and then the x-axis: (0, 0)–(0, 1)–(1, 1).
Figure 8 shows the ROC curves obtained by varying the ratio of the misclassification costs between FI and FU between 10⁻⁴ and 10⁴. For each misclassification cost, a classification tree has been grown and pruned as previously explained, and the ROC curve has been built. This figure shows that a sensitivity above 90% can be reached with a rather low false alarm rate (about 5%). However, in order to reach a sensitivity of 100%, the false alarm rate increases to about 54%. These ROC curves are suitable for a decision-making process implemented in the frame of probabilistic network planning, by specifying a risk level.
These results might be interpreted as a poor performance of the classifier (high false alarm rate). In fact, using very unsymmetrical misclassification costs (very high costs for I→U errors) forces the classifier to exclude many U-constrained feeders because of only a few I-constrained feeders, which lie in a region of the variable space dominated by U-constrained feeders. This effect has indeed been observed for the feeder data set.
Figure 9 shows the scatter plot for the whole data set, projected on the first two principal components obtained by principal component analysis (PCA), which are dominated by the equivalent sum impedance Rsum (first component) and the parameters km/load and ANON (second component); see Section 3.2.
In this figure, the observations (feeders) are colored according to the constraint (blue for voltage-constrained and red for current-constrained feeders). Current-constrained feeders seem to be located close to the lower left corner (small Rsum and small km/load or ANON), while voltage-constrained feeders are found further from the origin. However, a careful look at the partial distributions (left and lower parts of the figure) shows a strong overlap in the region close to the origin, in which most of the feeders are found. This confirms the difficulty of discriminating between both classes (here on the sole basis of the first two principal components). In fact, the decision tree does identify the I-constrained feeders, but at the cost of excluding numerous U-constrained feeders. Excluding the latter from the data set is, however, not a valid approach, since these feeders are not outliers.
Finally, a careful look at the misclassified feeders U→I, which necessarily lead to a loss of potential, shows that a great share of these feeders (70%) have a loading greater than 70%, as visible in Figure 10. In this figure, the right axis is for the probability density function (pdf, bars) and the left axis for the cumulative distribution function (cdf, curve).
These feeders are voltage-constrained but have a rather high loading, which means that they would probably turn out to be current-constrained once reactive power-based voltage control is implemented, due to the increased active and reactive power flows (see Section 3.1). This is confirmed by analyzing the hosting capacity values obtained from the scenario with reactive power-based voltage control (see Section 2.1). As a result, only about one third of the misclassified feeders U→I remain voltage-constrained when implementing a reactive power-based voltage control. This means that the actual reduction of the potential due to the U→I misclassification (“false alarm”) drops from 53.8% to about 18%.
In conclusion, a decision tree classifier has been trained on the data set in a way that avoids over-fitting. Besides its generic performance, which can be evaluated by cross-validation, unsymmetrical misclassification costs have been introduced to avoid the problematic errors (falsely predicted U-constrained feeders). The side effect of reaching a high sensitivity (close to 100%) is a rather high false alarm rate, which represents a loss of potential in terms of feeders that could benefit from voltage control (the main application here). However, this loss of potential is limited, since most of the affected feeders are in fact close to experiencing overloading when implementing reactive power control to increase the hosting capacity.
3.4. Clustering of LV Feeders
As explained in Section 1 and Section 2.2.2, all the studies mentioned in Section 1 are based on clustering analysis (i.e., the process of grouping a set of observations into clusters). In these studies, the results of the clustering analysis have been analyzed and validated through an internal validation, whose purpose was in most cases to support the decision on the number of clusters to be used. In this paper, a clustering analysis has been conducted in a similar way as in most of the considered studies (see Table 1). The results of this analysis are analyzed through an external validation, since the information on class membership (voltage- or current-constrained feeders) is available.
The feeders have been clustered with k-means clustering (using the squared Euclidean distance), which is, as mentioned in Section 1 (Table 1), the most widely used clustering method.
The most important parameters of the clustering analysis which can have a significant impact on the results are:
Variables used
Number of clusters used
As in the previous studies, the variables used in the clustering analysis have been selected based on several analyses (e.g., correlation analysis, PCA). However, this variable selection remains, to some extent, subject to subjective decisions and is hard to fully justify. For this reason, the number of variables and the variables themselves have been varied. Regarding the number of clusters, there is no universal method to determine the “optimal” or “appropriate” number of clusters. Instead, there is a number of established methods that support the decision to some extent.
In the studies reviewed in Section 1, different metrics have been used to quantify the clustering performance and select the “optimal” or “appropriate” number of clusters: R2 (coefficient of determination) [16], sum of squared errors [10,12,33], silhouette value [10], cubic clustering criterion [17] and Calinski-Harabasz criterion [19].
In this study, the following two metrics have been evaluated: the silhouette value and the normalized sum of squared errors (nSSE).
The silhouette coefficient for each observation (here for each feeder i) is a measure of how similar that observation is to observations in its own cluster, compared to observations in other clusters:

s(i) = (b(i) − a(i))/max(a(i), b(i))  (11)

where a(i) is the average distance from feeder i to all other feeders in the cluster to which it belongs, and b(i) is the minimum, over all the clusters not containing feeder i, of the average distance between feeder i and all the feeders in the considered cluster. The silhouette value lies between −1 and +1. A high silhouette value indicates that observation i is well-matched to its own cluster and poorly-matched to neighboring clusters.
The normalized sum of squared errors is the sum of the squared errors (here, the distances between observations and centroids (cluster centers)), normalized to the error obtained for a single cluster.
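Both metrics can be sketched with scikit-learn's k-means (squared Euclidean distance) and silhouette implementation; the data below is an illustrative stand-in, not the feeder data set:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(7)
# Illustrative stand-in for the standardized feeder parameters
X = rng.normal(size=(600, 5))

sse_one = float(np.sum((X - X.mean(axis=0)) ** 2))  # SSE for a single cluster
metrics = {}
for k in (3, 4, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    nsse = km.inertia_ / sse_one        # normalized sum of squared errors
    sil = silhouette_score(X, km.labels_)
    metrics[k] = (nsse, sil)
    print(f"k={k}: nSSE={nsse:.2f}, silhouette={sil:.2f}")
```

As in the paper, the nSSE necessarily decreases with k, while the silhouette value need not behave monotonically.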
Figure 11 shows the clusters obtained by setting the number of clusters to 3, 4 or 5. The left part shows the clusters on a scatter plot, using the first two principal components to allow an easy visualization (as in [17,19]). The obtained normalized sum of squared errors (nSSE) is given on top of each scatter plot. The right part of the figure shows the corresponding silhouette plots: the silhouette value is computed for each single observation and shown in a sorted way for each cluster (the average value over all the clusters is given above the silhouette plots).
This figure shows that, as expected, the normalized sum of squared errors decreases when the number of clusters increases (0.38 for three clusters and 0.24 for five clusters). However, the silhouette value does not show such a monotonic behavior: the best (highest) silhouette value among these three cases is obtained for four clusters (0.56). This means that, in this case and for this metric (silhouette), increasing the number of clusters does not necessarily lead to a better clustering result (in terms of how close observations are within a cluster and how far they are from other clusters).
The cluster plots also show that most of the observations are very close to each other (continuous distribution of feeders) and that there does not seem to be any “natural” cluster structure in the feeder population. The clustering seems to divide the areas with a high density of observations into sectors of roughly equal size.
Figure 12 shows the impact of varying the number of clusters (between 1 and 20) on the two metrics used to quantify the clustering performance (note that the silhouette value is only defined for at least two clusters).
This figure shows that, as expected and previously observed, the normalized sum of squared errors (nSSE) decreases monotonically when increasing the number of clusters. A usual way of selecting the “optimal” number of clusters is the elbow criterion. In this case, the nSSE curve does not exhibit a clear elbow shape, and a visual selection of the “optimal” number of clusters is questionable (this has also been observed in previous studies with sufficiently large data sets). A number of clusters between six and 16 (the cluster number range of most previous studies, see Table 1) could be somehow justified. However, the silhouette curve does not exhibit a monotonic behavior and even shows that the highest clustering performance according to this metric is reached for only two clusters. These analyses confirm the known difficulty of interpreting clustering results and selecting the “optimal” or “appropriate” number of clusters.
Finally, the results of the clustering and their suitability for the considered problem (reflecting the behavior of feeders in terms of hosting capacity) have been analyzed through an external validation. For this, the classes which had also been used in the classification are used to measure how close the clustering is to the predetermined classes (voltage- or current-constrained feeders). The result of this external validation for a clustering with four clusters is shown in Figure 13 (projection on the first two principal components).
The left part shows the scatter plot (similar to Figure 11, with different coloring), and the right part shows, for each cluster (the wide bars are colored as the clusters), the share of voltage-constrained (blue) and current-constrained (red) feeders. This share is in addition indicated as a numerical value, which allows evaluating the performance of the clustering for the considered problem (identifying feeders according to their class). Indeed, this share can be interpreted as a “partial purity” measurement. A common metric used to quantify the clustering performance with an external validation is the (global) purity, given by Equation (12):
purity(Ω, C) = (1/N) · Σ_{ω∈Ω} max_{c∈C} |ω ∩ c|  (12)

where purity(Ω, C) is the (global) purity of the clusters (against the classes), c the classes belonging to the set of classes C (here voltage- or current-constrained feeders), ω the clusters belonging to the set of clusters Ω, and N the number of observations (feeders). In our case, the “partial purity” values (for each cluster) have been considered.
The more dissymmetric the share (i.e., the purer the cluster), the better the clustering is able to discriminate between both classes. For example, the fourth cluster (purple) almost only contains voltage-constrained feeders (99.8% of the feeders in this cluster are voltage-constrained). In contrast, cluster 1 (blue) mostly contains current-constrained feeders, with, however, a significantly lower level of purity. The reader should note that these two clusters (with the highest purity levels for both classes) do not share any border.
Following the same reasoning as for the classification (considering that current-constrained feeders should be identified with the highest possible confidence), the most interesting cluster is the fourth one, which has the highest purity level (the only cluster with a purity level greater than 99%). This cluster contains, however, only about 16% of the whole feeder population (or about 18% of the population of voltage-constrained feeders). This means that this clustering leads to a rather poor result: its ability to discriminate between the two classes (voltage- and current-constrained feeders) is low, even when accepting a “risk” (here 0.2%, due to the “impurity”).
By following the same conservative approach as for the classification (considering only feeders which are (almost) surely voltage-constrained), the deployment potential of reactive power-based voltage control would be significantly lower than for the classification: the share of voltage-constrained feeders which would be safely identified as such is about 46% for the decision tree-based classification (see Section 3.3), and only about 18% for the clustering (a loss of potential of 54% and 82%, respectively).