Enhancing Load Stratification in Power Distribution Systems Through Clustering Algorithms: A Practical Study

Williams Mendoza-Vitonera; Xavier Serrano-Guerrero; María-Fernanda Cabrera; John Enriquez-Loja; Antonio Barragán-Escandón

doi:10.3390/en18195314

¹ Energy Transition Research Group (GITE), Universidad Politécnica Salesiana, Calle Vieja 12-30 y Elia Liut, Cuenca 010102, Ecuador

² Department of Electrical Engineering, Electronics and Telecommunications, Universidad de Cuenca, Cuenca 010107, Ecuador

* Authors to whom correspondence should be addressed.

Energies 2025, 18(19), 5314; https://doi.org/10.3390/en18195314

This article belongs to the Special Issue Advanced Machine Learning and Data Analysis Technologies in Modern Energy Systems

Abstract

Accurate load profile identification is crucial for effective and sustainable power system planning. This study proposes a characterization methodology based on clustering techniques applied to consumption data from medium- and low-voltage users, as well as distribution transformers from an electric utility. Three algorithms, K-means, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), and Gaussian Mixture Models (GMM), were implemented and compared in terms of their ability to form representative strata using variables such as observation count, projected energy, load factor (LF), and characteristic power levels. The methodology includes data cleaning, normalization, dimensionality reduction, and quality metric analysis to ensure cluster consistency. Results were benchmarked against a prior study conducted by Empresa Eléctrica Regional Centro Sur C.A. (EERCS). Among the evaluated algorithms, GMM demonstrated superior performance in modeling irregular consumption patterns and probabilistically assigning observations, resulting in more coherent and representative segmentations. The resulting clusters exhibited an average LF of 58.82%, indicating balanced demand distribution and operational consistency across the groups. Compared to alternative clustering techniques, GMM demonstrated advantages in capturing heterogeneous consumption patterns, adapting to irregular load behaviors, and identifying emerging user segments such as induction-cooking households. These characteristics arise from its probabilistic nature, which provides greater flexibility in cluster formation and robustness in the presence of variability. Therefore, the findings highlight the suitability of GMM for real-world applications where representativeness, efficiency, and cluster stability are essential. The proposed methodology supports improved transformer sizing, more precise technical loss assessments, and better demand forecasting. Periodic application and integration with predictive models and smart grid technologies are recommended to enhance strategic and operational decision-making, ultimately supporting the transition toward smarter and more resilient power distribution systems.

Keywords:

cluster; daily load profiles; load stratification; strata

1. Introduction

The generation, transmission, distribution, and commercialization of electricity are currently based on extensive databases of measurement data. Analyzing this information is essential for the characterization of consumers; however, this task can become very complex in the absence of appropriate analytical tools []. Traditional methods face significant challenges related to computational load, particularly in the case of power quality measurements, due to the sheer volume of data involved. Recent studies indicate that the most effective strategy to address these challenges is the application of clustering methods, which allow a more efficient structuring of information and enhance the interpretability of the data [,,].

Data clustering is a set of methodologies used for automatic partitioning of large datasets into well-defined groups that are internally homogeneous and mutually heterogeneous []. Despite significant advancements achieved in recent years, data clustering continues to face challenges in producing satisfactory results. Several studies have proposed the incorporation of preprocessing techniques and additional methods aimed at optimizing algorithm performance and more effectively capturing intrinsic structures within the data [,,,,,].

One of the initial strategies used to address challenges such as handling large and small datasets, objects with a high number of attributes or heterogeneous attribute types, and complex datasets and patterns is the application of Machine Learning (ML) techniques [,,]. This approach focuses on the development of algorithms and models that allow computational systems to learn from data and enhance their performance in specific tasks without requiring explicit programming []. In tasks related to classification and characterization, supervised and unsupervised learning methods are used [,,].

This study employs the K-means, GMM, and DBSCAN algorithms, widely used in the field of data mining, with the objective of characterizing the load curves of consumers of electricity. K-means, one of the most well-known clustering methods, partitions a dataset into k clusters, each associated with a centroid; its goal is to minimize the sum of squared errors within clusters [,,]. In contrast, GMM can be considered a probabilistic generalization of K-means. Instead of assigning each observation to a single group, it estimates the probability distribution of the cluster membership for each observation [,]. DBSCAN, on the other hand, is a spatial clustering algorithm designed to identify high-density regions and group them while robustly handling noise and outliers [,]. This method expands dense regions to form clusters without requiring the number of clusters to be specified in advance []. Previous studies have explored improvements to the performance of the clustering algorithm through hybrid approaches that combine multiple techniques to compensate for their respective weaknesses. However, a complementary alternative that may offer significant enhancements lies in the incorporation of a preprocessing stage. This step would optimize the input data of the algorithm, potentially resulting in greater homogeneity within the generated clusters. It is therefore of interest to compare and analyze whether the addition of a preprocessing stage yields a more substantial improvement than algorithmic hybridization or whether the combination of techniques remains superior.
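The contrast between the hard assignment of K-means and the probabilistic assignment of GMM described above can be written compactly. The notation below is an assumption introduced here for illustration and does not appear in the original text: x_i denotes an observation, C_k a cluster, \mu_k a centroid or component mean, \pi_k a mixing weight, and \Sigma_k a component covariance.

```latex
% K-means: each observation belongs to exactly one cluster C_k;
% the objective is the within-cluster sum of squared errors.
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2

% GMM: each observation receives a membership probability for every component k.
\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}
                   {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)},
\qquad \sum_{k=1}^{K} \gamma_{ik} = 1
```

An observation lying between two groups therefore keeps a split membership under GMM, whereas K-means forces it into the nearest centroid.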

Tambunan, Barus, Hartono, and Alam present an analytical method aimed at extracting information through clustering, studying load behavior to gain insights into the operation of a power plant. They analyze the data using a K-means clustering simulation, optimally grouping the data into three clusters, which are evaluated using silhouette scores and interpreted as high, medium, and low load levels. The silhouette coefficient (SC) assesses the quality of clustering by comparing the average distance between a data point and the other points in its own cluster with its average distance to the points of the nearest neighboring cluster, serving as an indicator of how well the point fits within its assigned group [].
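For reference, the standard definition of the silhouette coefficient for a single observation i is sketched below. The symbols a(i) and b(i) are the usual textbook notation, not taken from the cited work: a(i) is the mean distance from i to the other points of its own cluster, and b(i) is the smallest mean distance from i to the points of any other cluster.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
```

Values close to 1 indicate that the point sits well inside its cluster, values near 0 indicate a point on the boundary between two clusters, and negative values suggest a likely misassignment.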

Wang, Jia, and Chen propose a load curve clustering approach based on the combination of Singular Value Decomposition (SVD) and the K-means algorithm []. In their study, eight characteristic indices are extracted from the load curves, and SVD is subsequently applied to reduce dimensionality and capture the main consumption features on the user side, thus enhancing the effectiveness of the clustering process. Although promising results are obtained, computational efficiency is not analyzed as an essential aspect, particularly when dealing with large-scale datasets. Furthermore, the study is limited exclusively to the use of K-means, without comparison with other unsupervised classification techniques, which restricts the performance assessment and prevents the generalization of the results.

Ciardullo and Quaglino conducted a systematic comparison of the K-means, K-medoids, DBSCAN, and Expectation-Maximization (EM) algorithms, with the aim of identifying their strengths and weaknesses in clustering tasks []. To evaluate the effectiveness of these categorization strategies in detecting patterns that reflect structures present in real-world scenarios, the authors performed an analysis based on controlled simulations. Four simulated scenarios were considered, each consisting of ten continuous quantitative variables and four distinct populations. However, the analysis was performed using raw data without the application of any prior preprocessing to clean, normalize, or adequately transform the information. This limitation restricts the full potential of the algorithms used. Therefore, it is recommended to apply preprocessing steps such as data cleaning, outlier detection, and normalization. These measures would improve classification accuracy, optimize the performance of the algorithms used, and enable a more accurate assessment of their advantages and limitations.

Previous studies have focused on determining which algorithm or combination of techniques best fits specific datasets to achieve homogeneous clustering. In contrast, the methodology proposed in this study evaluates which clustering technique produces the most coherent segmentation, using the strata defined by EERCS as the validation reference. This update is essential because it enables the development of a practical tool applicable in real-world contexts for an electric power distribution company. The results obtained can be implemented for transformer sizing, technical loss analysis, and load management. Consequently, this methodology becomes a valuable tool for the daily work of electrical engineers, particularly in planning, designing, and evaluating projected medium- and low-voltage power networks.

Furthermore, the proposed methodology provides a comprehensive evaluation of projected energy, LF, maximum power of the strata, as well as the maximum, minimum, and average power of the load profile. This facilitates the identification of the most appropriate clustering technique for the database. As a result, the proposed methodology can be applied to define the strata used by EERCS.

In this study, we propose a systematic methodology for load stratification in power distribution systems, structured in three main stages. First, raw consumption data from low- and medium-voltage customers are preprocessed by cleaning, normalization, and dimensionality reduction to ensure quality and consistency. Second, clustering algorithms (K-means, DBSCAN, and GMM) are applied and compared using both statistical metrics and validation against the strata defined by EERCS. Finally, the resulting clusters are analyzed in terms of representativeness, load factor, and energy distribution, enabling the identification of the most suitable technique for practical applications such as transformer sizing, demand forecasting, and technical loss assessment. This stepwise approach provides a clear framework that integrates data management, algorithm evaluation, and practical validation, ensuring both methodological rigor and engineering applicability.

The structure of this document is as follows. Section 2 presents the background supporting the research. Section 3 describes the methodology and details the proposed approach. Section 4 presents the results obtained from the database analysis. Section 5 provides a discussion and interpretation of the results. Section 6 outlines the main conclusions of the study. Finally, Section 7 offers recommendations for future work.

2. Background

In Ecuador, the electricity tariffs are based on the tariff schedule issued by ARCERNNR in 2022 [], as shown in Table 1.

Table 1. Electric tariff.

The strata are defined according to energy consumption and the associated usage type, which can be residential, commercial, industrial, or other (see Table 2) [].

Table 2. Strata by consumer group.

3. Methodology

This study proposes an applied methodology structured as a systematic process of data collection, analysis, and interpretation, with the aim of ensuring robust and reproducible results. The methodology consists of three stages, with the flowchart of the proposed methodology presented in Figure 1. In the first stage, data acquisition and preprocessing are performed. The second stage involves data export, during which various clustering techniques are evaluated and applied. Finally, the third stage encompasses classification and characterization, including a quantitative and qualitative analysis of the results obtained.

Figure 1. Methodology flowchart.

3.1. Stage 1: Data Acquisition and Processing

3.1.1. Data Acquisition of Low and Medium Voltage Customers

The data used for the analysis and characterization of profiles correspond to the low- and medium-voltage quality measurements of EERCS. These records include measurements taken from residential, commercial, and industrial customers, sampled every 10 min on all seven days of the week.

The dataset used in this study spans from March 2015 to May 2023, comprising 65,806 records from 21,601 customers, each with 144 daily time points at 10 min intervals. Data were collected in the Sierra, Coast, and Amazon regions of Ecuador, ensuring both temporal depth and regional diversity. This wide coverage provides a robust basis for clustering analysis and the identification of representative load profiles.

3.1.2. Data Validation

In cases where the data contain incomplete measurements, missing values, or negative active power values, such information is discarded. This step ensures that only complete customer records are used for the analysis.
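The validation rule above can be sketched as a small filter. This is a minimal illustration, not the authors' implementation: the function names and the dictionary layout (customer id mapped to a list of active power readings) are assumptions, and the expected length of 144 points is taken from the 10 min sampling described for the dataset.

```python
import math

EXPECTED_POINTS = 144  # one day at 10 min resolution, as described for the dataset


def is_valid_profile(profile):
    """Return True only for complete daily records with non-negative active power."""
    if len(profile) != EXPECTED_POINTS:
        return False  # incomplete measurement
    for value in profile:
        if value is None or math.isnan(value) or value < 0:
            return False  # missing value or negative active power
    return True


def validate(records):
    """Keep only the customer records that pass every check (hypothetical layout)."""
    return {cid: prof for cid, prof in records.items() if is_valid_profile(prof)}
```

A record is discarded as a whole as soon as one reading fails, which matches the stated goal of keeping only complete customer records.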

3.1.3. Calculations

The total active power P [kW] and the total reactive power Q [kVAr] are obtained by summing the average power measured in each of the phases a, b, and c:

P\,[\mathrm{kW}] = P_{\mathrm{mean},a} + P_{\mathrm{mean},b} + P_{\mathrm{mean},c}

(1)

Q\,[\mathrm{kVAr}] = Q_{\mathrm{mean},a} + Q_{\mathrm{mean},b} + Q_{\mathrm{mean},c}

(2)

Once the power values have been summed, the average is calculated by dividing the totals by the number of customers. Similarly, the maximum values and the values per unit are computed.
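The calculations of this subsection can be sketched in a few lines. The function names are hypothetical, and the per-unit base is assumed here to be the peak of the averaged profile, since the text does not state which base is used.

```python
def total_power(phase_a, phase_b, phase_c):
    """Sum the per-phase mean power at each time step, as in Equations (1) and (2)."""
    return [a + b + c for a, b, c in zip(phase_a, phase_b, phase_c)]


def average_profile(customer_profiles):
    """Average the total-power profiles over the number of customers."""
    n = len(customer_profiles)
    return [sum(values) / n for values in zip(*customer_profiles)]


def per_unit(profile):
    """Scale a profile by its own maximum (assumed base for the per-unit values)."""
    peak = max(profile)
    return [p / peak for p in profile]


# Two hypothetical customers with two time steps each
customer_1 = total_power([1.0, 2.0], [1.0, 2.0], [1.0, 2.0])  # [3.0, 6.0]
customer_2 = [1.0, 2.0]
mean_profile = average_profile([customer_1, customer_2])      # [2.0, 4.0]
print(per_unit(mean_profile))                                  # [0.5, 1.0]
```

The same functions apply unchanged to reactive power, since Equation (2) has the same structure as Equation (1).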

3.1.4. Information Filtering

User data is filtered according to the requirements of the program. In the case study, filtering is performed based on load, region, weekday, holidays (with holidays treated as Sundays), and consumption group.
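A minimal sketch of such a filter follows. The record layout (dictionaries with "region", "group", and "day" keys), the three day-type labels, and the example holiday are assumptions made for illustration; only the rule that holidays count as Sundays comes from the text.

```python
from datetime import date


def day_type(day, holidays):
    """Classify a date as 'weekday', 'saturday', or 'sunday'."""
    if day in holidays:
        return "sunday"  # holidays are treated as Sundays
    weekday = day.weekday()  # Monday = 0 ... Sunday = 6
    if weekday == 6:
        return "sunday"
    if weekday == 5:
        return "saturday"
    return "weekday"


def filter_records(records, region=None, group=None, kind=None, holidays=frozenset()):
    """Keep the records matching every criterion that is not None."""
    selected = []
    for rec in records:
        if region is not None and rec["region"] != region:
            continue
        if group is not None and rec["group"] != group:
            continue
        if kind is not None and day_type(rec["day"], holidays) != kind:
            continue
        selected.append(rec)
    return selected


# Hypothetical holiday set and records
holidays = {date(2023, 5, 1)}
records = [
    {"region": "Sierra", "group": "residential", "day": date(2023, 5, 1)},
    {"region": "Sierra", "group": "residential", "day": date(2023, 5, 3)},
    {"region": "Amazon", "group": "other", "day": date(2023, 5, 7)},
]
sundays = filter_records(records, kind="sunday", holidays=holidays)
```

Leaving a criterion as None disables it, so the same function serves for filtering by region only, by consumption group only, or by any combination.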

3.2. Stage 2: Data Management System Export and Evaluation of Clustering Methods

To establish communication between Python 3.11 and MySQL 8.0.31, the required packages are installed using the command pip install mysql-connector. Subsequently, the necessary library is imported using import mysql.connector. For a successful connection, it is essential to provide the corresponding parameters: host name, user, password, and database name.
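The connection step can be sketched as follows. The helper names, credentials, and query are placeholders, not values from the study. Note that Oracle's official driver is published on PyPI as mysql-connector-python; it provides the same mysql.connector module used here.

```python
def connection_params(host, user, password, database):
    """Collect the four parameters required by mysql.connector.connect()."""
    return {"host": host, "user": user, "password": password, "database": database}


def fetch_rows(params, query):
    """Run a read-only query and return all rows as a list of tuples."""
    import mysql.connector  # imported lazily so the module loads without the driver

    conn = mysql.connector.connect(**params)
    try:
        cursor = conn.cursor()
        cursor.execute(query)
        return cursor.fetchall()
    finally:
        conn.close()  # always release the connection, even if the query fails


# Hypothetical credentials and database name
params = connection_params("localhost", "reader", "secret", "measurements")
# rows = fetch_rows(params, "SELECT 1")  # requires a running MySQL server
```

Closing the connection in a finally block keeps the database from accumulating idle sessions when many exports are run in sequence.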

3.2.1. Optimal Number of Clusters

Selecting an appropriate number of groups is a critical step to ensure that load profiles are properly represented and that each group captures consistent and meaningful consumption patterns. An inadequate choice may lead to overfitting or result in poorly representative segmentations. To address this problem, the elbow method [] was applied, which identifies the point at which increasing the number of clusters no longer yields significant improvements in cluster quality. This procedure is illustrated in Figure 2 and Figure 3.

Figure 2. Optimal number of clusters for the GMM and K-means algorithms.

Figure 3. Optimal number of clusters for the DBSCAN algorithm.
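The elbow procedure described in this subsection can be sketched with scikit-learn. This is an illustration under stated assumptions, not the authors' code: the data are synthetic, the helper names are hypothetical, and the curve is built from the average distance between each observation and its nearest K-means centroid.

```python
import numpy as np
from sklearn.cluster import KMeans


def mean_centroid_distance(X, k, seed=0):
    """Average distance from each observation to its nearest K-means centroid."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    # transform() returns the distance of every sample to every centroid
    return model.transform(X).min(axis=1).mean()


def elbow_curve(X, k_values):
    """Evaluate the average distance for each candidate number of clusters."""
    return [mean_centroid_distance(X, k) for k in k_values]


# Synthetic stand-in for normalized features: three well-separated groups
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 8.66]])
X = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

k_values = range(1, 7)
curve = elbow_curve(X, k_values)
for k, distance in zip(k_values, curve):
    print(k, round(float(distance), 3))
```

On data like these the curve drops sharply up to three clusters and flattens afterwards; the elbow is read at the point where adding a cluster stops producing a meaningful reduction.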

3.2.2. Analysis of Clustering Methods for Characterization

Based on the prior analysis of the K-means, GMM, and DBSCAN clustering algorithms, an examination was conducted to determine which of these methods provides the most accurate characterization and grouping of customer data. These approaches were applied to the RD1 stratum with the aim of identifying different consumption profiles and evaluating how the observations (that is, customer consumption data) are distributed within each generated cluster. Based on the results obtained, the methods are compared to determine which technique offers the best performance.
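How the three algorithms treat the same observations can be illustrated with scikit-learn on synthetic data. The two features, the parameter values (eps, min_samples, number of components), and the data are assumptions chosen for the example; they are not the settings used in the study.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.mixture import GaussianMixture

# Two compact synthetic groups plus one observation halfway between them
rng = np.random.default_rng(1)
low = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(40, 2))
high = rng.normal(loc=[3.0, 3.0], scale=0.1, size=(40, 2))
ambiguous = np.array([[1.5, 1.5]])
X = np.vstack([low, high, ambiguous])

# K-means: every observation is forced into exactly one cluster
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# GMM: every observation receives a membership probability per component
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
gmm_proba = gmm.predict_proba(X)

# DBSCAN: observations outside any dense region are labeled -1 (noise)
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

print("K-means label of the in-between point:", kmeans_labels[-1])
print("GMM membership of the in-between point:", gmm_proba[-1].round(3))
print("DBSCAN label of the in-between point:", dbscan_labels[-1])
```

The in-between observation shows the practical difference: K-means assigns it a hard label, GMM reports its membership probabilities, and DBSCAN sets it aside as noise.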

3.2.3. Selection of Methods to Apply

Through a comparative analysis of the K-means, DBSCAN, and GMM methods, it was determined which algorithm best adapts to the classification and characterization of customers in a satisfactory manner. The selection was based on the inherent advantages and limitations of each technique, evaluated within the specific context of the study. As a result, both the DBSCAN and the GMM demonstrated favorable performance.

3.3. Stage 3: Classification and Characterization

3.3.1. Customer Classification and Characterization

The DBSCAN method is implemented following a preliminary classification based on the strata defined in Table 1, as this approach proves to be more effective in detecting predominant patterns within the clusters. However, the GMM method is applied using a simpler categorization, as it does not require a stratification-based approach. In this case, users are grouped according to their level of consumption, incorporating an additional category for residential users with induction cooktops, to distinguish their specific consumption patterns.

3.3.2. Analysis

By visualizing the load profiles obtained for each stratum with the different methods, together with the load factor (LF) given by Equation (3), the energy consumption, and the number of observations per group, it is possible to compare their effectiveness. This enables the identification of the most suitable method, ultimately leading to an accurate and satisfactory classification and characterization of the customers.

LF = \frac{P_{\mathrm{mean}}}{P_{\mathrm{max}}}

(3)
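Equation (3) and the energy of a daily profile can be computed as sketched below. The function names are hypothetical; the 10 min step follows the sampling interval stated for the dataset, and the energy is approximated as the sum of the power readings multiplied by the step duration in hours.

```python
def load_factor(profile_kw):
    """Ratio of mean power to maximum power of a load profile, Equation (3)."""
    mean_power = sum(profile_kw) / len(profile_kw)
    return mean_power / max(profile_kw)


def daily_energy_kwh(profile_kw, step_minutes=10):
    """Approximate energy as the sum of power readings times the step length in hours."""
    return sum(profile_kw) * step_minutes / 60


# Hypothetical profile: 1 kW for half of the day, 3 kW for the other half
profile = [1.0] * 72 + [3.0] * 72
print(load_factor(profile))        # mean 2 kW over peak 3 kW
print(daily_energy_kwh(profile))   # 2 kW average over 24 h
```

Multiplying the result of load_factor by 100 gives the percentage form in which the load factor is reported in the results.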

4. Results

Once the methodology was established, it was applied to the data obtained from the measurement campaigns provided by EERCS, which include information from 2707 customers at medium and low voltage levels, encompassing clients from both the Sierra and Amazon regions.

4.1. Test Methods

The K-means, GMM, and DBSCAN methods were applied to the RD1 stratum, which corresponds to residential customers with consumption between 0 and 60 kWh. Figure 4 presents the representative load profiles of each group generated by these methods; the depicted load profiles correspond to the centroid of each grouping. The results obtained, detailed in Table 3, include information such as the number of clusters generated by each method and the total energy associated with each average profile.

K-means and GMM: It is essential to determine the appropriate number of clusters to ensure reliable segmentation. For this purpose, the elbow method was applied, as illustrated in Figure 2. The figure represents the average distance between the observations and their corresponding centroids as a function of the number of clusters.
DBSCAN: The elbow method was also used in this case, based on the relationship between the distance among observations and the number of clusters, as illustrated in Figure 3. The optimal number of clusters was established at 5, as no significant improvement was observed when testing different values around this point.

Figure 4. Load profiles of the various clusters generated by the methods in stratum RD1.

Table 3. Comparison of applied methods.

Figure 5 presents a comparison of the load profiles corresponding to the most representative group within the RD1 stratum, as identified by each of the applied methods. The selection of these clusters is based on the identification of the predominant consumption pattern within the stratum, using the profiles previously presented in Figure 4 as a reference.

Figure 5. Load profiles for stratum RD1 obtained from the application of the three methods.

To determine which of the applied methods is most appropriate, both the number of samples and the energy associated with the load profile of each cluster are considered, as summarized in Table 3. The analysis focuses on the load profile of the most representative group in the RD1 stratum for each method, as shown in Figure 5. In addition, the percentage of representation of the stratum is included, which corresponds to the proportion of observations contained in the most representative cluster relative to the total analyzed by each method. This indicator allows for an assessment of each technique’s effectiveness in characterizing the RD1 stratum.

As shown in Table 3, the K-means and GMM methods generated the same number of clusters; however, the distribution of observations differed significantly. DBSCAN tended to concentrate most consumption profiles into a single dominant cluster, and the remaining clusters mainly captured outliers and irregular cases, which limited their usefulness for detailed analysis. K-means produced clusters with higher average energy values, but the distribution of observations between groups was less consistent. In contrast, GMM provided a more balanced segmentation, generating clusters with clear differentiation in load profiles and better representativeness of real consumption behaviors, making it the most suitable method overall.

4.2. Evaluation of the Results Obtained by the Methods Applied and Results of EERCS

A comparative analysis between the applied methods and the data collected in the EERCS study titled “Research and Characterization of Load” is used to determine which of the employed techniques offers the more effective classification and characterization. To this end, Table 4 presents the results obtained by each method, facilitating the establishment of well-founded quantitative conclusions.

Table 4. Summary of results.

Table 4 presents the results corresponding to strata formation using the DBSCAN and GMM methods, as well as the EERCS study. DBSCAN generates a total of 16 strata, the same number as the EERCS classification, while GMM produces 17 strata. However, the distribution obtained through DBSCAN is notably unbalanced, concentrating most of the observations in only two consumption groups. In contrast, both the GMM and the EERCS study show a more homogeneous distribution of strata, with the EERCS study exhibiting the best balance in group formation, although it should be noted that it includes the smallest number of observations. GMM, for its part, groups the largest number of observations and achieves a satisfactory classification of its customers.

It is important to note that within the residential consumer group, an additional category is included compared to the EERCS study, corresponding to residential users with induction cooktops.

Table 5 includes a column that presents the percentage of observation reduction, representing the proportion of data contained in the strata identified as removable relative to the total number of observations within the residential tariff category. The justification for the removal of these strata is detailed in the following paragraphs. Furthermore, the same table reports the LF, calculated as the ratio of the average power to the maximum power of the load profile, multiplied by 100.

Table 5. Comparison of residential strata.

Based on the analysis of Table 5 and Figure 6, the strata generated within the residential consumption group were examined using DBSCAN, GMM, and the EERCS study. To ensure a fair comparison among the methods, one stratum was removed in each case. For the GMM method, the first stratum was discarded because it exhibited an energy level too low to be considered representative. Similarly, one stratum was removed in each of the other two methods. In the DBSCAN method, five strata were initially generated; however, strata RD2 and RD3 did not show significant differences between them. Compared with the GMM results, the observations contained within these two strata were distributed across other groups, reinforcing their low differentiation.

Figure 6. Load profiles of residential strata.

Among the three methods evaluated, GMM exhibits the most favorable classification results. As shown in Table 4, both the GMM and the EERCS study show a similar strata distribution in each tariff. Furthermore, according to Table 5, GMM generates residential strata with clearly differentiated energy consumption levels, avoiding the formation of groups with very similar characteristics, which contributes to a more precise and meaningful classification.

Figure 6 shows the load profiles corresponding to the final strata of the residential consumption group. It is observed that both the DBSCAN and GMM methods achieve remarkable stability in the waveform shapes, demonstrating the internal coherence of the formed clusters. This stability supports the effectiveness of both clustering methods, as they display consistent profiles comparable to those obtained in the EERCS study.

When exclusive consumption profiles are compared through the GMM and DBSCAN methods, a similarity in behavioral patterns is identified. However, in terms of energy consumption, GMM performs better in all its strata, reflecting a greater ability to differentiate consumption levels. In contrast, DBSCAN does not achieve satisfactory results in representing strata corresponding to low consumption levels.

Figure 7 presents a comparison of the load profiles generated by the three methods, selecting a representative stratum from each consumption group. The profiles show some similarity in the overall behavior within each group, but notable differences are observed in the energy levels between the methods, especially in the higher consumption strata, where the discrepancy is more pronounced. As shown in Table 5, the EERCS and GMM methods exhibit greater diversity in the number of strata generated for industrial and other groups, reflecting clearer variations in energy levels between the different strata.

Figure 7. Load profiles of different consumer groups.

4.3. Analysis of GMM Results

Once GMM was identified as the method with the best classification and characterization performance, a detailed analysis of its strata and load profiles was performed, taking into account the Sierra and Amazon regions corresponding to the service area of EERCS.

Figure 8 shows that the Sierra region exhibits a more stable load behavior, which translates into a better LF. In contrast, the Amazon region has a smaller number of observations, which limits the formation of strata and restricts the classification to only two consumption groups: residential and others.

Figure 8. Residential load profiles for the Sierra and Amazon regions.

Figure 9 illustrates the behavior of a representative stratum of each consumption group on different types of days. Residential profiles are observed to maintain a similar shape throughout the week; however, on nonworking days, particularly Saturdays, there is a noticeable decrease in energy consumption. This pattern is repeated in all consumption groups, with the exception of the industrial sector, where the reduction in consumption occurs only on Sundays. Unlike the residential group, the other sectors exhibit significant variations in the shape of the load profile depending on the type of day.

Figure 9. Load profiles of consumer groups on different types of days.

4.4. Assessment of the Method for Sizing Transformer Stations

Table 6 presents the data corresponding to the projected energy, the number of customers, and the LF of transformer No. 31240, rated at 37.5 kVA, which serves a total of 71 customers. In addition, the table details the different strata generated based on the individual electric energy consumption (EEC) of each customer connected to the transformer. It is evident that the sum of customers across the various strata matches the total number served by the transformer, enabling a more detailed characterization of the EEC behavior.

Table 6. Data on transformer 31240 and its types of customers.

Figure 10 shows the load profiles resulting from the application of the GMM model to the customers served by the analyzed transformer. Four distinct groups are identified, each corresponding to a stratum with a characteristic load profile. The sum of the load profiles of these strata corresponds to the total load profile of the transformer, demonstrating the consistency of the segmentation process.

Figure 10. Load profiles of the different types of customers of transformer 31240.

Figure 11 presents a comparison between the load profile resulting from the sum of all segmented customers and the measured load profile of transformer No. 31240. A high degree of similarity is observed between the two, both in the shape of the curve and in the energy consumption, which validates the consistency of the segmentation process and the accuracy of the applied method.

Figure 11. Load profile resulting from customers and load profile of transformer 31240.
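The comparison between the aggregated customer profile and the measured transformer profile can be sketched as follows. This is an illustrative reconstruction under stated assumptions: each stratum is represented by a mean per-customer profile multiplied by its number of customers, and the mean absolute percentage error is a similarity measure chosen for this example, not one reported in the study. All numbers are made up.

```python
def aggregate_profile(strata):
    """Sum n_customers * mean_profile over all strata, time step by time step."""
    length = len(strata[0][1])
    total = [0.0] * length
    for n_customers, mean_profile_kw in strata:
        for t, power in enumerate(mean_profile_kw):
            total[t] += n_customers * power
    return total


def mape(estimated, measured):
    """Mean absolute percentage error between two profiles of equal length."""
    errors = [abs(e - m) / m for e, m in zip(estimated, measured)]
    return 100 * sum(errors) / len(errors)


# Hypothetical strata: (number of customers, mean per-customer profile in kW)
strata = [(2, [1.0, 2.0]), (3, [0.5, 1.0])]
estimated = aggregate_profile(strata)   # [3.5, 7.0]
measured = [3.5, 7.0]                   # hypothetical transformer measurement
print(mape(estimated, measured))        # 0.0 for a perfect match
```

A low error in both shape and total energy is what supports using the stratum profiles, rather than individual measurements, when sizing a transformer for a projected group of customers.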

Therefore, the use of the GMM model enables a more detailed analysis of the behavior of customers connected to a transformer. By decomposing the overall load profile into representative groups, it facilitates more informed decision-making regarding transformer sizing and opens the possibility of optimizing the network through the redistribution of customers to other transformers when necessary.

5. Discussion and Analysis of Results

The GMM model proposed in this study demonstrates better performance compared to the EERCS study, titled “Load Research and Characterization: Stratified Random Sampling” []. This improvement is attributed to the ability of GMM to identify and model irregular patterns or atypical behaviors present in the data, enabling more precise segmentation of customers within specific consumption groups. Although the LFs and profiles obtained show similarities to those reported in the EERCS study, the GMM profiles exhibit greater stability in their shape, contributing to a more robust characterization that is valuable for energy management purposes [].

As shown in Table 7, the energy ranges of the DBSCAN method and the EERCS study are identical, since DBSCAN was applied directly to the strata previously defined by EERCS. In contrast, GMM generated its own strata and energy ranges, demonstrating a greater ability to adapt to different consumption profiles. As evidenced in Table 4, GMM appropriately determines the number of strata per consumption group, and Table 7 shows how the resulting energy ranges reflect the various behavioral patterns within each group with greater precision. This confirms that GMM outperforms both DBSCAN and the previous EERCS study.

Table 7. Residential strata.

GMM offers quantifiable advantages over DBSCAN and the EERCS study. It achieves a high average density of observations per cluster (499 users/cluster), reflecting efficient segmentation and avoiding overfragmentation of the data.

The approach presented in [] is less effective for the current case study. Its focus is on the characterization of a single customer with a high-density dataset, which allows the capture of consistent behaviors but is not representative for analyses involving multiple customers with more limited time series. In our case, weekly observations are available for each customer, introducing uncertainty regarding whether the observed behavior reflects their typical annual consumption. Furthermore, the presence of outliers related to specific circumstances during the measurement period further complicates the analysis. Applying the method from [] in this context leads to inefficient segmentation, as unnecessary clusters are formed due to the absence of mechanisms to filter out atypical observations. However, the method could be valuable if applied to a detailed individual-level analysis.

It is important to note that extreme values were not removed during preprocessing. Although these observations can affect the homogeneity of the clusters, they also reflect real and atypical consumption behaviors, such as households with induction cooktops or users with irregular operating schedules. Completely discarding them could lead to the loss of valuable information for utility planning.

On the other hand, the approach presented in [] proposes the use of K-means combined with a validation threshold to assess the suitability of the generated clusters, which is theoretically efficient. However, when applied to datasets with highly irregular profiles, as in this study, it tends to over-segment the strata, affecting the coherence of the analysis. In contrast, GMM more effectively groups these irregular profiles, as its probabilistic nature allows for better distribution of observations and minimizes the creation of unnecessary clusters, thus providing a more robust classification.
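The difference between a probabilistic and a hard assignment can be sketched with a one-dimensional Gaussian mixture. The weights, means, and standard deviations below are hypothetical values chosen only for illustration and are not parameters fitted in this study; the hard assignment uses the nearest mean, as K-means would.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def responsibilities(x, weights, means, stds):
    """Posterior probability that x belongs to each mixture component."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical two-component mixture over monthly energy [kWh].
WEIGHTS = [0.6, 0.4]
MEANS = [90.0, 170.0]
STDS = [25.0, 40.0]

for energy in (80.0, 130.0, 200.0):
    soft = responsibilities(energy, WEIGHTS, MEANS, STDS)
    # Hard, K-means-style assignment: index of the nearest mean.
    hard = min(range(len(MEANS)), key=lambda k: abs(energy - MEANS[k]))
    print(energy, [round(p, 3) for p in soft], "nearest mean:", hard)
```

An observation midway between the two means receives a split probability instead of being forced into one group, which is the property the discussion above attributes to GMM when profiles are irregular.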

6. Conclusions

When evaluating the different clustering methods, it is concluded that the application of the GMM is particularly suitable for analyzing customers with irregular behaviors within each consumption group. This method leverages the standard deviation as a segmentation criterion, which facilitates the exclusion of outliers and enables a probabilistic assignment of each observation, thereby improving both the flexibility and accuracy of cluster formation.

Thanks to this feature, GMM provides more precise segmentation and more robust classification of consumption profiles, key aspects for efficient energy demand management. However, although the model outperforms other applied methods, it also presents limitations, such as limited differentiation between observations from different geographical regions and a lack of discrimination between certain residential and commercial profiles. This suggests the need to complement its application with additional contextual analysis criteria.

The generated clusters exhibit an average $LF$ of 58.82%, indicating a balanced demand distribution and operational consistency within the groups formed. These results further reinforce the robustness of GMM as a clustering technique suitable for real-world applications, where efficiency, representativeness, and stability in the identified consumption profiles are essential.
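The 58.82% figure matches the unweighted mean of the load factors listed for the five GMM residential strata in Table 5. The sketch below reproduces it and also shows the usual definition of the load factor as average demand divided by peak demand. The four-point demand profile is a hypothetical example, and the assumption that the reported average is an unweighted mean comes from this arithmetic check rather than from an explicit statement in the text.

```python
# Load factors [%] of the five GMM residential strata, copied from Table 5.
GMM_RESIDENTIAL_LF = [42.1, 56.7, 56.7, 65.2, 73.4]


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def load_factor(profile_kw):
    """Load factor [%]: average demand divided by peak demand."""
    return 100.0 * mean(profile_kw) / max(profile_kw)


average_lf = mean(GMM_RESIDENTIAL_LF)
print(f"Average LF of the GMM residential strata: {average_lf:.2f} %")

# Hypothetical demand profile [kW], for illustration only.
example_profile = [1.0, 2.0, 3.0, 2.0]
print(f"LF of the example profile: {load_factor(example_profile):.1f} %")
```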

The results make it possible to identify Electric Energy Consumption (EEC) behavior at the customer level, which gives EERCS a valuable tool for implementing more effective strategies for operational planning, load redistribution, and the promotion of energy efficiency practices.

7. Recommendations

It is recommended to perform an individualized analysis of each customer using at least one year of historical data, in order to achieve a more accurate and refined classification of consumption patterns. Although the overall results may not undergo significant changes, verifying this hypothesis is valuable in demonstrating the stability of the model.

Likewise, it is suggested to expand the database by incorporating information from new customers, with the aim of increasing the representativeness of each stratum and enabling a more thorough analysis of groups with a limited number of observations. Finally, the design of an automated process is proposed to periodically update the consumption records with data from new measurement campaigns, thus ensuring the relevance and precision of the clustering models.

Furthermore, it is recommended that future work explore the influence of geographical variables on clustering performance. In particular, external factors such as climate (temperature, humidity), socioeconomic conditions (income levels, electrification rates), and cultural consumption habits should be incorporated into the analysis. Correlation studies between these variables and load patterns could provide quantitative evidence of the geographic impact on demand behavior, thus improving the robustness and applicability of clustering models for regional power system planning.

Author Contributions

Conceptualization, W.M.-V. and M.-F.C.; methodology, W.M.-V. and M.-F.C.; validation, W.M.-V., M.-F.C., X.S.-G., A.B.-E. and J.E.-L.; formal analysis, X.S.-G. and J.E.-L.; investigation, W.M.-V., M.-F.C., X.S.-G. and J.E.-L.; resources, X.S.-G.; data curation, W.M.-V. and M.-F.C.; Writing—original draft preparation, W.M.-V., X.S.-G., A.B.-E. and M.-F.C.; Writing—review and editing, J.E.-L. and X.S.-G. All authors have read and agreed to the published version of the manuscript.

Funding

This work is part of the “METODOLOGÍA PARA VALORAR LA IMPLEMENTACIÓN DE PLANTAS FOTOVOLTAICAS PARA AUTOCONSUMO” research project. It received support from the Salesian Polytechnic University of Ecuador.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to data security and confidentiality requirements.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ARCERNNR	Agencia de Regulación y Control de Energía y Recursos Naturales no Renovables
EEC	Electric Energy Consumption

References

Mora-Alvarez, M.; Contreras-Ortiz, P.; Serrano-Guerrero, X.; Escrivá-Escrivá, G. Characterization and Classification of Daily Electricity Consumption Profiles: Shape Factors and k-Means Clustering Technique. E3S Web Conf. 2018, 64, 08004.
Soto, P.A.; Castro, J.R.; Reategui, R.M.; Castillo, T.D. Partición de una Red Eléctrica de Distribución Aplicando Algoritmos de Agrupamiento K-means y DBSCAN. Rev. Tec. Energ. 2023, 20, 73–81.
Mahesh, B. Machine Learning Algorithms—A Review. Int. J. Sci. Res. 2020, 9, 381–386.
Bonaccorso, G. Machine Learning Algorithms; Packt Publishing Ltd.: Birmingham, UK, 2017.
Enriquez-Loja, J.; Castillo-Pérez, B.; Serrano-Guerrero, X.; Barragán-Escandón, A. Performance evaluation method for different clustering techniques. Comput. Electr. Eng. 2025, 123, 110132.
Germán, A.; Juan, S.P. Optimización de Los Alimentadores de Media Tensión y Transformadores de Distribución de la S/E 17 Los Cerezos Proyectada por la CENTROSUR. Bachelor’s Thesis, UPS, Cuenca, Ecuador, 2019.
Ullah, A.; Haydarov, K.; Haq, I.U.; Muhammad, K.; Rho, S.; Lee, M.; Baik, S.W. Deep Learning Assisted Buildings Energy Consumption Profiling Using Smart Meter Data. Sensors 2020, 20, 873.
Huang, D.; Wang, C.D.; Peng, H.; Lai, J.; Kwoh, C.K. Enhanced ensemble clustering via fast propagation of cluster-wise similarities. IEEE Trans. Syst. Man Cybern. Syst. 2018, 51, 508–520.
McLoughlin, F.; Duffy, A.; Conlon, M. A Clustering Approach to Domestic Electricity Load Profile Characterisation Using Smart Metering Data. Appl. Energy 2015, 141, 190–199.
Seem, J.E. Pattern recognition algorithm for determining days of the week with similar energy consumption profiles. Energy Build. 2005, 37, 127–139.
Wang, J.; Wang, K.; Jia, R.; Chen, X. Research on Load Clustering Based on Singular Value Decomposition and K-means Clustering Algorithm. In Proceedings of the 2020 Asia Energy and Electrical Engineering Symposium (AEEES), Chengdu, China, 28–31 May 2020; pp. 831–835.
Benavoli, A.; Corani, G.; Demšar, J.; Zaffalon, M. Time for a Change: A Tutorial for Comparing Multiple Classifiers through Bayesian Analysis. J. Mach. Learn. Res. 2017, 18, 136–181.
Li, Y.; Zhang, Y.; Tang, Q.; Huang, W.; Jiang, Y.; Xia, S.T. tk-means: A robust and stable k-means variant. In Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 3120–3124.
Likas, A.; Vlassis, N.; Verbeek, J.J. The global k-means clustering algorithm. Pattern Recognit. 2003, 36, 451–461.
García-Santander, L.; San Martín-Ayala, J.; Ulloa-Vásquez, F.; Carrizo, D.; Esparza, V.; Rohten, J.; Mejias, C. Classification of Behavior Profiles for Non-Residential Customers Considering the Variable of Electrical Energy Consumption: Case Study—SAESA Group SA Company. Energies 2022, 15, 6634.
Rodrigo, J.A. Detección de Anomalías con Gaussian Mixture Model (GMM) y Python. 2020. Available online: https://cienciadedatos.net/documentos/py23-deteccion-anomalias-gmm-python (accessed on 3 February 2023).
Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum Likelihood from Incomplete Data via the EM Algorithm. J. R. Stat. Soc. Ser. B Methodol. 1977, 39, 1–22.
Ohadi, N.; Kamandi, A.; Shabankhah, M.; Fatemi, S.M.; Hosseini, S.M.; Mahmoudi, A. Sw-dbscan: A grid-based dbscan algorithm for large datasets. In Proceedings of the 2020 6th International Conference on Web Research (ICWR), Tehran, Iran, 22–23 April 2020; pp. 139–145.
Pascual, D.; Pla, F.; Sánchez, S. Algoritmos de agrupamiento. In Métodos Informáticos Avanzados; Publicacions de la Universitat Jaume I: Castelló, Spain, 2007; pp. 164–174.
Jebari, S.; Smiti, A.; Louati, A. AF-DBSCAN: An Unsupervised Automatic Fuzzy Clustering Method Based on DBSCAN Approach. In Proceedings of the 2019 IEEE International Work Conference on Bioinspired Intelligence (IWOBI), Budapest, Hungary, 3–5 July 2019; pp. 1–6.
Ciardullo, E.; Quaglino, M. Estudio Comparativo de Métodos de Clasificación no Supervisada en Contextos de Grandes Bases de Datos. 2020. Available online: http://hdl.handle.net/11086/16851 (accessed on 1 July 2025).
Betancourt Vasco, E.E. Estudio y Planteamiento Para Establecer Una Tarifa Horaria en el Pico del Sistema Eléctrico en el Ecuador Como Incentivo de Eficiencia Energética. Bachelor’s Thesis, EPN, Quito, Ecuador, 2012.
Zambrano, S.; Molina, M. Investigación y Caracterización de la Carga Muestreo Aleatorio Por Estratos; Empresa Eléctrica Regional Centro Sur C.A., Departamento de Estudios Técnicos: Cuenca, Ecuador, 2016.
Shi, C.; Wei, B.; Wei, S.; Wang, W.; Liu, H.; Liu, J. A quantitative discriminant method of elbow point for the optimal number of clusters in clustering algorithm. EURASIP J. Wirel. Commun. Netw. 2021, 2021, 31.

Figure 1. Methodology flowchart.

Figure 2. Optimal number of clusters for the GMM and K-means algorithms.

Figure 3. Optimal number of clusters for the DBSCAN algorithm.

Figure 4. Load profiles of the various clusters generated by the methods in stratum RD1.

Figure 5. Load profiles for stratum RD1 obtained from the application of the three methods.

Figure 6. Load profiles of residential strata.

Figure 7. Load profiles of different consumer groups.

Figure 8. Residential load profiles for the Sierra and Amazon regions.

Figure 9. Load profiles of consumer groups on different types of days.

Figure 10. Load profiles of the different types of customers of transformer 31240.

Figure 11. Load profile resulting from customers and load profile of transformer 31240.

Table 1. Electric tariff.

Category	Tariff
General low voltage without demand	RD (Residential); SC (Senior Citizens); SA (Social Assistance); PB (Public Benefit); CO (Commercial); SA (Sports Arena); IA (Industrial Artisanal); OW (Official Entities); WP (Water Pumping)
General low voltage with demand	AB (Social assistance in LV with demand); BB (Public benefit in LV with demand); CB (Commercial LV with demand); B (Industrial LV with demand); MB (Municipal entity LV with demand)
General low voltage with hourly demand metering	A3 (Social assistance LV with hourly demand); B3 (Public benefit LV with hourly demand); C3 (Commercial LV with hourly demand); E3 (Sports facility LV with hourly demand); HH (Artisanal industry with hourly demand)
Public lighting	PL (Public Lighting)
General with demand	WP (Water Pumping); PB (Public Benefit); CD (Commercial); RW (Religious Worship); SF (Sports Facility); ID (Industrial); ME (Municipal Entity); OE (Official Entities)
General with hourly demand	AH (Social assistance with hourly demand); BH (Public benefit with hourly demand); CH (Commercial with hourly demand); SH (Sports facility with hourly demand); IH (Industrial with hourly demand); JH (Industrial with hourly metering and incentives in MV)
High voltage service	KH (Industrial with hourly metering and incentives in HV)

Table 2. Strata by consumer group.

Consumer Group	Lower Limit [kWh]	Upper Limit [kWh]	Stratum
Residential	0	60	1
Residential	60	110	2
Residential	110	180	3
Residential	180	310	4
Residential	310	Upper	5
Commercial	0	290	1
Commercial	290	1235	2
Commercial	1235	Upper	3
Industrial	0	410	1
Industrial	410	2520	2
Industrial	2520	Upper	3
Others	0	405	1
Others	405	1820	2
Others	1820	Upper	3

Table 3. Comparison of applied methods.

Methods	Clusters	E [kWh/Month]	Observations	Strata Representation [%]
K-Means	4	6.84	1647	31.56
K-Means		43.13	1008	
K-Means		46.44	407	
K-Means		62.17	132	
DBSCAN	5	54.59	390	92.38
DBSCAN		21.49	2793	
DBSCAN		46.33	5	
DBSCAN		31.66	3	
DBSCAN		64.05	3	
GMM	4	0.02	436	46.02
GMM		2.73	482	
GMM		13.54	616	
GMM		41.41	1308	

Table 4. Summary of results.

Methods	Observations	Strata	Clusters
DBSCAN	7480	Residential	8
DBSCAN		Commercial	5
DBSCAN		Industrial	2
DBSCAN		Others	1
EERCS	1214	Residential	5
EERCS		Commercial	5
EERCS		Industrial	3
EERCS		Others	3
GMM	8487	Residential	7
GMM		Commercial	4
GMM		Industrial	3
GMM		Others	3

Table 5. Comparison of residential strata.

Methods	Strata	Observations	E [kWh/Month]	LF [%]	Reduction Observations [%]
DBSCAN	RD1	2793	21.5	55.0	33.0
DBSCAN	RD2	1991	80.0	57.2	
DBSCAN	RD3	1662	135.8	62.1	
DBSCAN	RD4	1355	228.3	69.6	
DBSCAN	RD5	662	481.1	74.8	
EERCS	Residential 1	161	60	49.3	24.366
EERCS	Residential 2	212	110	54.2	
EERCS	Residential 3	173	180	58.2	
EERCS	Residential 4	116	310	65.6	
EERCS	Residential 5	48	500	75.2	
GMM	Residential 1	1455	4.5	42.1	16.6
GMM	Residential 2	1919	44.0	56.7	
GMM	Residential 3	2544	92.4	56.7	
GMM	Residential 4	2052	170.9	65.2	
GMM	Residential 5	775	346.5	73.4	

Table 6. Data on transformer 31240 and its types of customers.

Element	E [kWh/month]	Customers	LF [%]
Transformer 31240	9443.85	71	67.948
Residential 1	60	110	2
Residential 3	170.94	23	65.2
Residential 4	346.51	5	73.4
Commercial 1	727.59	3	61.4

Table 7. Residential strata.

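Methods	Consumer Group	Lower Limit [kWh]	Upper Limit [kWh]	Stratum
DBSCAN	Residential	0	60	1
DBSCAN	Residential	60	110	2
DBSCAN	Residential	110	180	3
DBSCAN	Residential	180	310	4
DBSCAN	Residential	310	Upper	5
EERCS	Residential	0	60	1
EERCS	Residential	60	110	2
EERCS	Residential	110	180	3
EERCS	Residential	180	310	4
EERCS	Residential	310	Upper	5
GMM	Residential	0	68	1
GMM	Residential	68	131	2
GMM	Residential	131	258	3
GMM	Residential	258	Upper	4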


© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
