2. Literature Review
The characterization of energy demand in regions lacking Advanced Metering Infrastructure (AMI) presents a unique set of challenges. Most conventional approaches to demand estimation rely on granular, high-frequency data that AMI systems provide. However, in much of Latin America, and especially in Colombia, these technologies remain under-deployed due to a combination of economic, technical, and regulatory factors [3, 4, 6]. Technically, AMI systems require robust and reliable communication infrastructure, which is often lacking or unreliable in rural and mountainous areas of Colombia, particularly in the Andean region. These areas also face significant logistical barriers to installation and maintenance. From a regulatory standpoint, delays in standardization, limited incentives for private utilities to invest in metering upgrades, and insufficient funding programs have all slowed AMI adoption [3, 4, 6]. Therefore, alternative methods capable of functioning under low-data conditions are needed for effective planning and demand management.
This review is structured into three main thematic groups: (i) machine-learning-based and AMI hybrid modeling techniques for demand estimation with limited data, which are the most common techniques in the recent literature; (ii) statistical and stratified sampling strategies used in demand characterization; and (iii) emerging information-theoretic approaches, including the use of entropy and Kullback–Leibler divergence for sample optimization. The last two groups are more closely related to the purposes of this study. Each group is examined with an emphasis on its applicability to low-infrastructure contexts and its relevance to the methodologies proposed in this study.
2.1. Demand Modeling with ML and Hybrid Approaches
Traditionally, demand estimation in regions with limited data availability has been addressed through mathematical modeling techniques. Early efforts used Monte Carlo simulations to estimate residential transformer loads, aiding in grid expansion planning [1]. These models build probabilistic reconstructions of load curves from aggregated or infrequent data. Monte Carlo methods align with the goals of this study in that they enable the composition of a user sample by extracting users from a reconstructed demand curve. However, because the extraction is random, Monte Carlo techniques are less efficient at identifying areas of high diversity for sampling and do not inherently optimize information gathering. The proposed information-theoretic strategies, by contrast, directly quantify uncertainty (via entropy) or divergence from population distributions (via KL divergence) to guide sample placement.
As modeling techniques advanced, more sophisticated tools such as regression, clustering, and wavelet analysis were incorporated to improve accuracy and represent heterogeneity in consumption [
8,
9]. Furthermore, the concept of community networks has been applied to group curves and improves the detection of consumption patterns [
10]. In a study on under-electrified rural regions, hierarchical clustering and geospatial proxies like satellite imagery were used to estimate demand [
11]. This approach emphasizes the value of indirect indicators—a strategy that complements the use of entropy by validating the usefulness of external features for demand estimation.
The integration of AI and machine learning has enhanced these models, enabling more dynamic representations of consumption patterns. For example, clustering methods have been used to disaggregate aggregated loads, while LSTM neural networks and nonparametric regression have been applied to adaptively predict variations linked to new technologies like electric vehicles or solar PV [12, 13]. Bayesian inference has also been used to reconstruct consumption from readings in low-income regions [14]. Despite their promise, these methods are often constrained by the need for large training datasets, which are seldom available in low-infrastructure regions [1].
Hybrid models that combine statistics and machine learning, often leveraging limited smart meter data, have also been proposed to improve demand prediction. One approach integrates MEMD with PSO-optimized SVR and improves the estimation of daily peaks [15]. Another model combines IEMD, ARIMA, and FOA-optimized Wavelet Neural Networks to predict short-term load [16]. A strategy combining SVR with fuzzy systems improves the handling of nonlinear weather conditions and reduces load uncertainty, which increases the accuracy of daily forecasting [17]. These methodologies allow for the inference of detailed consumption patterns and improve energy management in regions where the AMI infrastructure does not have complete coverage. One study combined ARIMA, logistic regression, and neural networks to predict peak load days, which facilitates demand response and reduces costs [18]. Additionally, an intelligent system that uses machine learning to detect anomalies was designed to manage non-technical losses among residential users [19]. Load disaggregation models applied in smart buildings allow the consumption of each device to be identified [20]. Nevertheless, while effective, these methods still depend on at least partial AMI coverage and significant computational resources.
2.2. Stratified and Statistical Sampling in Demand Estimation
Given the limitations of modeling techniques in data-scarce environments, statistical sampling strategies are a practical alternative. Sampling allows for the collection of data from a representative subset of users, reducing monitoring costs while still providing valid inferences about the broader population [
21].
An effective technique for representing different types of users is stratified sampling. This method divides the population into homogeneous subgroups based on criteria such as user type, location, or consumption level. This improves the accuracy of inferences and reduces bias. For example, the TURKSTAT 2019 Household Budget Survey applied stratified cluster sampling to estimate monthly electricity consumption in households in Turkey. This improved the extrapolation of results [
22]. In Colombia, similar approaches have been employed by public utilities to improve the representativeness of manual metering surveys [
5]. Sample size determination in these methods typically relies on Cochran’s formula, accounting for desired confidence levels and margins of error. Some studies have also recommended post-survey validation through follow-up interviews or statistical comparisons with known consumption distributions [
23]. Although effective, stratified sampling methods often lack the mathematical rigor to optimize sampling placement, especially in heterogeneous regions.
To overcome this, smart sampling techniques have been introduced. These approaches utilize clustering algorithms, historical billing data, and optimization metrics to select the most informative users or time periods for measurement. One example is machine-learning-assisted active sampling, which alternates between estimation and data collection based on predictions about unobserved data [
24]. In environmental monitoring, a strategy for selecting sampling sites for micropollutants in rivers is proposed that combines spline interpolation, hierarchical analysis, and geographic information systems to optimize spatial distribution [
25]. In fields like deep metric learning, an approach based on clustering representations in feature spaces has been proposed to select informative samples. This improves convergence by avoiding local minima [
26]. Sampling is also essential in contexts with unbalanced data. Increasing the sample size and balancing classes improves accuracy and predictive power. The choice of random seeds also influences the stability and generalization of the model [
27]. Furthermore, Gibbs sampling in conjunction with GRNN neural networks allows the generation of synthetic samples that preserve the original structure of the data. This improves the representation of the true distribution and the predictive performance in contexts with limited data [
28]. These advances show the variety of methodologies and highlight the importance of adapting each technique to its context. However, once again, these methods are limited by the availability of data in regions with limited measurement infrastructure. Deep-learning-based methods generally improve as the amount of data increases and often outperform other techniques when trained on very large datasets, on the order of a million records or more [29]. Time-series scenario reduction through smart sampling has been used to maintain accuracy in power flow simulations while lowering computational loads [30]. Algorithms like A-DPC apply sampling logic to prioritize customers in response programs based on their consumption patterns, which increases the effectiveness of the strategies [31]. In informal settlements in Kenya, combining stratified sampling with regression has improved consumption estimation [32]. Data-driven load modeling has also improved demand characterization. These techniques capture the diversity and variability of loads in distribution networks and improve simulation and planning [33]. However, none of these studies applies divergence-based optimization or offers the determinism and clarity of a fixed metric such as the entropy proposed in this study.
2.3. Information-Theory Approaches to Sample Optimization
More recently, information-theoretic metrics have been introduced into sampling methodology to improve efficiency and representativeness. These approaches are particularly valuable in regions where AMI is absent and the sampling budget is constrained. Among these metrics are Shannon entropy and Kullback–Leibler (KL) divergence.
Entropy-based sampling identifies users or areas with high diversity in consumption, ensuring that the most informative data is collected [
34]. In semi-rural grids, entropy-maximizing models have been used to reconstruct load curves using limited and coarse data sources [
35]. Entropy-based selection has shown promise in identifying diverse consumption zones even when only billing or categorical data is available [
34]. These studies support the strategy of targeting uncertainty-rich areas—a foundational principle in the methodology proposed in this study.
KL divergence, on the other hand, provides a quantitative measure of how one probability distribution diverges from another. In sampling, it can be used to select subsets whose aggregate consumption profile most closely resembles the full population, minimizing representational bias. This is particularly useful when comparing sample-generated distributions to known or estimated consumption patterns. When combined with clustering or smoothing techniques, KL divergence-based sampling has proven to yield robust results even in data-scarce settings [
36,
37].
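For reference, the two metrics named above have the following standard definitions for discrete distributions. The symbols used here (P for the population distribution, Q for the sample distribution, and k categories indexed by i) are notation assumed for illustration and may differ from the notation used in the methodology of this study; the direction of the divergence (population relative to sample) is likewise an assumption.

```latex
H(P) = -\sum_{i=1}^{k} p_i \log p_i ,
\qquad
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i=1}^{k} p_i \log \frac{p_i}{q_i}
```

H(P) is largest when all categories are equally likely, which is why high-entropy areas are the most diverse. D_KL is non-negative and equals zero only when the two distributions coincide, which is why minimizing it pulls a sample toward the population profile.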
Both entropy and KL divergence can complement stratified methods, adding quantitative rigor to group definitions and sample selection. The methods proposed in this study, Shannon Entropy Sampling and KL Divergence Sampling, build on these principles, offering flexible yet statistically grounded tools for representative sampling in low-infrastructure regions. They also make explicit the comparison between conventional techniques like stratified sampling and more sophisticated methods. While stratified sampling ensures group-level representation, entropy and divergence optimize for information content and distributional fidelity, respectively. The proposed methodology bridges these strategies, enhancing applicability to Colombia’s Andean region, where infrastructure, terrain, and socioeconomics vary significantly.
5. Results of Sampling Strategies
5.1. Determination of Sample Size Across Subregions
Table 1 presents the theoretical sample sizes obtained with Cochran’s Formula (2), broken down by subregion.
Based on the results in Table 1, a 95% confidence level with a 5% margin of error was selected because it yields a moderate sample size. A higher confidence level makes the sample too large, which defeats the initial purpose of achieving a convenient sample size.
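The trade-off between confidence level and sample size can be illustrated with a minimal sketch of Cochran's formula in its common textbook form. This is an assumption: the exact form of Formula (2) used in this study, including whether a finite-population correction is applied per subregion, is not reproduced here. The function name `cochran_sample_size`, the default proportion `p=0.5`, and the optional correction are illustrative choices.

```python
import math
from statistics import NormalDist


def cochran_sample_size(confidence, margin, population=None, p=0.5):
    """Cochran sample size for a proportion (textbook form, assumed here)."""
    # Two-sided critical value, e.g. about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    if population is not None:
        # Finite-population correction: reduces n0 when the population is small.
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


# Sample size grows quickly as the confidence level rises (5% margin of error).
for level in (0.90, 0.95, 0.98, 0.99):
    print(level, cochran_sample_size(level, 0.05))
```

Applying such a function to each subregion separately and summing the results would give a total theoretical size, under the assumptions stated above.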
5.2. Comparison of the Geospatial Distribution of the Samples
This section presents a description and comparison of the spatial distribution obtained with the three sampling approaches applied. The maps shown have been anonymized and are intended solely to illustrate and compare the location of the selected users with the rest of the population.
Figure 4 shows the spatial distribution of the sample based on Shannon entropy. This sample was obtained with a 95% confidence level and a customer/population weighting of 90/10 (the weighting is explained in
Section 5.3).
Figure 4b presents a close-up of a major city where a high concentration of users is observed in high-activity areas. Although there appear to be few selected users, they are actually so clustered that they are not clearly distinguishable on the overall map. To achieve a more uniform distribution, a 98% confidence level and a sample 7.7 times larger would be necessary. The clusters are located in residential neighborhoods of large cities where most consumption is concentrated. The clusters have so many residential users that commercial, official, and provisional users are not shown in this close-up of the city, and the street lighting points are covered by residential user points within the clusters. When analyzing each subregion, we also observe high-density areas near population centers. This reflects the natural pattern of consumption, which is concentrated in a few urban centers.
Figure 4a shows that dividing the problem by subregions disperses the concentrations more evenly throughout the region. This prevents the sample from concentrating in the central area, as would occur if sampling were applied directly to the entire population, because the transformers with the greatest diversity are in the largest cities.
Figure 5 shows the spatial distributions of the consumption-based strategy.
Figure 5a presents the general distribution across the entire region.
Figure 5b shows a close-up of the same city used in
Figure 4b. The main difference is greater geographic dispersion. Even within urban areas, the dense clusters seen with entropy sampling do not form. This occurs because the consumption-based strategy prioritizes diversity. Logarithmic discretization and smoothing are used to include users with atypical consumption patterns. While the entropy sample is concentrated in residential neighborhoods, the consumption sample has a greater presence of industrial users and high-consumption users from strata 5 and 6. This is analyzed in
Section 5.4.
Figure 6 shows the maps of the best sample according to the KL divergence criterion at 95% confidence level. Both
Figure 6a,b show distributions very similar to that of the consumption sample. At first glance, it is not possible to determine how the diversity of customer types in the KL divergence sample differs from that of the consumption sample. The difference between these two approaches is explored further in the consumption and KL divergence results (Section 5.4 and Section 5.5).
In summary, the spatial distribution maps clearly show the differences between the three strategies. The entropy sample tends to concentrate in densely populated areas and reflects the urban consumption pattern. The consumption sample shows greater geographic dispersion and seeks diversity by including atypical users. The KL divergence approach achieves a very similar geographic dispersion to that of the consumption sample. This may increase logistical complexity but improves overall coverage. This spatial comparison complements the analyses in subsequent sections and reinforces the importance of choosing a sampling strategy aligned with the operator’s objectives and the characteristics of the territory.
5.3. Evaluation and Results of the Sample by Shannon Entropy
This section presents the results of applying the Shannon entropy-based sampling strategy. The objective is to construct a sample that maximizes the amount of information contained in the selected users.
The first step was to select the weights between customer type and population type to construct the ranking of distribution transformers according to their entropy. For
Table 2, samples with a 95% confidence level and different combinations of customer/population weights were generated. The proportions of each sample were then calculated by customer type and population type and compared with those of the general population. Based on the differences, a cumulative total deviation was estimated, and the sample with the lowest deviation was selected. The best configuration was the 90/10 weighting.
The selected sample has 1750 users associated with eight distribution transformers (two transformers per subregion). These transformers have a normalized average Shannon entropy of 186.41. By comparison, the general population includes 569,423 users, 20,222 transformers, and their weighted average entropy with 90/10 weights is 12.03. The sample size for a 95% confidence level was determined to be 1525. However, the entropy strategy generates slightly different sizes. This is because transformers are selected to cover or exceed the theoretical size, and all users associated with each transformer are included to avoid losing information. Each transformer has a different number of users. Therefore, variations in size depend on the weighting, which affects the ranking of transformers.
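The selection logic described above (rank transformers by weighted entropy, then take whole transformers until the theoretical size is covered or exceeded) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the score is taken to be a weighted sum of the customer-type and population-type entropies, the normalization behind the reported entropy values is not reproduced, and the names `shannon_entropy`, `select_transformers`, and the toy data are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution of a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def select_transformers(transformers, target_size, w_customer=0.9, w_population=0.1):
    """Rank transformers by weighted entropy and keep whole transformers until
    the cumulative number of users covers or exceeds target_size."""
    def score(users):
        h_customer = shannon_entropy([u["customer_type"] for u in users])
        h_population = shannon_entropy([u["population_type"] for u in users])
        # Assumed scoring rule: weighted sum with the 90/10 weights as defaults.
        return w_customer * h_customer + w_population * h_population

    ranked = sorted(transformers.items(), key=lambda kv: score(kv[1]), reverse=True)
    selected, n_users = [], 0
    for transformer_id, users in ranked:
        if n_users >= target_size:
            break
        selected.append(transformer_id)
        n_users += len(users)  # all users of a selected transformer are kept
    return selected, n_users


# Hypothetical toy data: three transformers with their attached users.
toy = {
    "T1": [{"customer_type": "residential", "population_type": "urban"}] * 4,
    "T2": [
        {"customer_type": "residential", "population_type": "urban"},
        {"customer_type": "commercial", "population_type": "urban"},
        {"customer_type": "industrial", "population_type": "rural"},
    ],
    "T3": [
        {"customer_type": "residential", "population_type": "rural"},
        {"customer_type": "commercial", "population_type": "rural"},
    ],
}
print(select_transformers(toy, target_size=4))
```

Because whole transformers are included, the resulting size can exceed the target, which is consistent with the sample size differing from the theoretical one. The procedure involves no random draw, so the same inputs always produce the same sample.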
Table 3 and
Table 4 show the distribution of users by subregion with a 95% confidence level and 90/10 weights, as well as the average distance between users. The Central subregion is the most urbanized and has the smallest average distance of all the subregions. This is to be expected, since in an urban environment users are located close together in residential neighborhoods that can group several thousand people. This behavior is amplified within the entropy-based sample, which tends to generate dense user clusters in these urban areas, to the point that the average distance in the Central subregion is slightly smaller than the average distance for the entire population of the same subregion. The average distance between users in the Central subregion’s population is 7.41 km, while the average distance for the sample is 6.98 km, a difference of 0.43 km. On the other hand, the overall average distance of the whole sample increases by 11.48 km relative to the whole population. This is explained by the smaller number of users in the sample compared to the total population, which naturally makes the sample more spatially sparse. The formation of clusters that leave gaps between subregions, especially in the mountainous Eastern subregion, also contributes to the increase in the average distance in the sample.
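One way to compute a dispersion metric of this kind is sketched below. It assumes that "average distance between users" means the mean great-circle distance over all pairs of users, which is not confirmed by the text; the haversine formula, the Earth radius constant, and the function names are illustrative choices.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def mean_pairwise_distance_km(points):
    """Average distance over all unordered pairs of points."""
    pairs = list(combinations(points, 2))
    return sum(haversine_km(a, b) for a, b in pairs) / len(pairs)


# Hypothetical coordinates, for illustration only.
print(mean_pairwise_distance_km([(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]))
```

The number of pairs grows quadratically with the number of users, so for a population of several hundred thousand users an exhaustive computation is costly; averaging over a random subset of pairs is one possible approximation.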
Figure 7 highlights an excellent fit with respect to annual consumption. This is achieved despite the fact that the sample is 325 times smaller than the population and does not use consumption as a selection criterion. This indicates that the customer type and population type categories contain relevant information about consumption patterns. This result is important because it shows that the sample can describe the demand of the original population.
Figure 8 shows that the sample under-represents users in stratum 3 and does not include strata 4, 5, and 6. A bias that excludes high-income users is evident. A strong overrepresentation of users in stratum 1 is also observed. Correcting this bias would require increasing the confidence level, which would imply a significant increase in the sample size. The consumption-based and KL-divergence methodologies take a consumption-based approach that seeks to alleviate this bias by giving more weight to atypical users.
Figure 9 shows that the sample has a slightly lower proportion of rural users and a higher proportion of urban users. Even so, the overall fit for the population type category is good. This confirms that the 90/10 balance is adequate for this dimension.
Table 5 provides the correlation coefficients between the three sample distributions and those of the original population. The results for the consumption distribution are again noteworthy, with a coefficient of 1. For the customer type and population distributions, values greater than 0.8 were obtained, reflecting moderately high representativeness. All distributions obtained a p-value less than 0.05, indicating significant relationships.
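A comparison of this kind can be sketched by correlating the sample proportions with the population proportions category by category. The sketch assumes a Pearson coefficient, since the type of coefficient is not specified in this section, and it omits the p-value. The proportions below are hypothetical and are not taken from Table 5.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Hypothetical proportions by population type (urban, rural, populated centers).
population = [0.60, 0.30, 0.10]
sample = [0.65, 0.27, 0.08]
print(round(pearson_r(population, sample), 4))
```

With only a handful of categories, a high coefficient can coexist with a visible gap in an individual category, so the coefficient is best read together with the distribution figures.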
In summary, the entropy sampling strategy allows for the construction of samples with high informational diversity without using random sampling based on consumption as a direct criterion. Thanks to this, the approach provides deterministic results. Clusters are identified in densely populated areas. These excellently reflect the distribution of consumption and demand. The weighting between customer type and population makes it possible to maintain representativeness in key groups such as rural areas and low-income users. However, the sample tends to under-represent users with atypical consumption, such as industrial users or high-income users. Overall, it is a solid alternative for obtaining compact, information-rich, and logistically manageable samples.
5.4. Evaluation and Results of the Sample by Consumption
This section presents the results of applying the energy consumption-based sampling strategy. The objective is to construct a diverse sample that represents different consumption levels, highlighting less frequent patterns.
A 95% confidence level results in a sample of 1534 users.
Table 6 and
Table 7 show the differences between the sample and the population by category and subregion. The differences are generally low, indicating a good fit and adequate representativeness.
Table 6 uses the average distance between users as a measure of dispersion. This distance is greater than that of the original population in the North, South, and Central subregions. In the South subregion, the difference is 3.9 km. This confirms that the consumption sample is more dispersed. This dispersion ensures geographic coverage and representativeness even in small populations.
Since this strategy seeks to reconstruct the consumption curve, the analysis by subregion shows a good fit with respect to the actual consumption distribution. This is consistent with the high 95% confidence level. However,
Figure 10 presents a counterintuitive result. When the four subsamples are combined to form the overall sample, the fit with respect to the consumption distribution of the entire population is poor. This may be due to the randomness that the Monte Carlo method introduces into the sample: there is no guarantee that any given random seed will offer the best results, and one seed may yield a better or worse sample than another when extracting users. This is a significant shortcoming because the sample fails to meet the primary objective of fitting the consumption curve, which represents the population’s energy demand. The strategy based on KL divergence seeks to address the main weaknesses of the consumption-only strategy.
Figure 11 shows the inclusion of users from strata 4, 5, and 6, who were absent or under-represented in the entropy sample. It is also observed that users in stratum 1 are no longer as overrepresented. There is a moderate increase in commercial and industrial users thanks to the use of smoothing, which favors less frequent groups.
Figure 12 shows a slight overrepresentation of users from rural areas and populated centers and a moderate under-representation of urban users. The fit is good for all three population types.
Table 8 shows the correlation coefficient values of the sample compared with those of the original population. The coefficients for the distribution of customer type and population are high, above 0.95, which reinforces what was previously described in
Figure 11 and
Figure 12. Likewise, the correlation coefficient confirms the poor fit of this sample with the consumption distribution, obtaining a negative coefficient of −0.1587. All the p-values indicate significant relationships.
5.5. Evaluation and Results of the Sample by KL Divergence
This section presents the results of the selection strategy based on minimizing the KL divergence. Starting from an initial random sample, an iterative optimization is applied. The configuration with the lowest divergence from the original distribution is selected. This results in a representative sample in terms of energy consumption, customer type, population type, and geographic location.
A 95% confidence level was used. This level offers a good balance between sample size and representativeness. The final sample includes 1534 users. Each subregion was optimized independently. For each, multiple samples with different random seeds were generated, and their KL divergence was calculated. The sample with the lowest divergence was chosen. This process reduces the bias that random seeds can introduce. Ten thousand different samples were generated. This high number increases the probability of finding a configuration that reduces bias and improves representativeness. The optimized sample achieved a KL divergence of 0.045. This value is 40.76 times smaller than the worst case, which was 1.834.
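The seed search described above can be sketched as a best-of-N selection. This is a simplified illustration under stated assumptions: a single categorical attribute is compared, whereas the strategy in this study also involves consumption, population type, and location; additive smoothing with a constant `alpha` stands in for the smoothing of this study; and the function names and toy population are hypothetical.

```python
import math
import random
from collections import Counter


def distribution(labels, categories, alpha=1.0):
    """Additively smoothed category proportions (alpha is an assumed constant)."""
    counts = Counter(labels)
    total = len(labels) + alpha * len(categories)
    return [(counts[c] + alpha) / total for c in categories]


def kl_divergence(p, q):
    """D_KL(P || Q) in nats for two discrete distributions over the same categories."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def best_sample(population, size, n_seeds, categories):
    """Draw one random sample per seed and keep the one closest to the population."""
    p = distribution(population, categories)
    best = None
    for seed in range(n_seeds):
        sample = random.Random(seed).sample(population, size)
        d = kl_divergence(p, distribution(sample, categories))
        if best is None or d < best[0]:
            best = (d, seed, sample)
    return best


# Hypothetical toy population of customer types.
categories = ["residential", "commercial", "industrial"]
population = ["residential"] * 800 + ["commercial"] * 150 + ["industrial"] * 50
d_best, seed_best, sample = best_sample(population, size=100, n_seeds=200,
                                        categories=categories)
d_first = kl_divergence(distribution(population, categories),
                        distribution(random.Random(0).sample(population, 100),
                                     categories))
print(d_best <= d_first)  # the best of many seeds is never worse than seed 0
```

Running such a search independently per subregion mirrors the description above. The cost grows linearly with the number of seeds, which is the computational burden associated with this strategy.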
Figure 13 shows several sampling distributions (in purple) compared to the population (in gray). The 10,000 generated samples were ordered by their KL divergence, and five equidistant samples were selected, from worst to best; the last corresponds to the optimal sample. Read from top to bottom, the iterations show how the fit by customer type improves. The first two iterations achieve a notable improvement. From the third iteration onwards, the improvement is significantly smaller. This indicates rapid convergence toward stable configurations with similar results. The example is presented for the Central subregion. The final optimized sample has a fit very close to that of the population in this subregion. This behavior was also observed in the other subregions during the optimization process.
Table 9 and
Table 10 show similar behavior to that observed in the consumption strategy. Proportions by category are presented, segmented by subregion. The fits are very good, and better representativeness of stratum 3 users is achieved. The entropy and consumption samples had limitations in achieving this level of representativeness.
Table 10 reports the average distance between users. This distance increases by 14.31 km compared to the average distance for the general population. This increase is explained by the geographic conditions of the region, which create separation between the four subsamples, and by the small number of users in the sample. The differences are small when analyzing each subregion separately. The Central subregion shows only a 20-m difference. Even with this metric, the divergence sample maintains a good fit.
Figure 14 shows a significant improvement in the consumption distribution compared to the consumption sample presented in
Figure 10. This strategy overcomes the main weakness of the previous approach. The integration of the divergence criterion evaluated by customer type also improves the sample’s representativeness of actual consumption patterns. The consumption distribution in
Figure 14 presents a less pronounced peak and longer tails than the original distribution. This is due to the smoothing applied in Equation (4). Despite these effects, the sample curves retain a high similarity to the original curves. This allows the sample to adequately represent demand.
Figure 15 and
Figure 16 show a robust fit to the distributions by customer type and population type. These figures confirm that the sample includes both common and atypical categories. The strategy achieves a highly representative sample by including consumption and characterization categories within the selection criterion.
Table 11 shows excellent values for the distributions by customer type and population, both exceeding 0.99. Furthermore, a considerable improvement in the correlation coefficient for the consumption distribution is observed in the KL sample, in contrast to the consumption-only sample. A value of 0.8954 is high and reflects a good fit with respect to the consumption curve.
In summary, the results of the KL divergence-optimized strategy show a substantial improvement in representativeness compared to the original population. This is evident in the fit of the consumption curve and the distributions by user type. The integration of multiple criteria allows for the construction of a balanced and diverse sample that clearly reflects the population structure. Although geographic dispersion increases when subsamples are consolidated, within each subregion, the average distances remain close to those of the population. This strategy represents a robust and accurate alternative for studies that require high fidelity to real distributions.
6. Discussion
This section analyzes the advantages, limitations, and recommended applications of each of the proposed sampling strategies. The flexibility, statistical quality, and geographic behavior of the samples are compared. This analysis allows for identifying which methodology best suits different operational, budgetary, and planning objectives. This is especially relevant in developing regions where a balance between representativeness and logistical efficiency is required.
The strategy based on KL divergence retains the main characteristics of consumption sampling, particularly its geographic dispersion. However, it improves the fit to the distributions by user type and to the consumption distribution. This is evident when comparing Figure 10, Figure 11, and Figure 12 with Figure 14, Figure 15, and Figure 16, and when comparing the correlation coefficients in Table 8 and Table 11. Minimizing the KL divergence allows for greater representativeness by customer type and reduces the bias caused by suboptimal random seeds. The KL-divergence strategy maintains the advantages of the consumption-based approach and improves its results. Its only disadvantage is the greater computational burden of the iterative process that searches for the optimal sample, which increases the time required to obtain the results. Although the consumption-based strategy can be useful in rapid exploratory analyses, the KL-divergence strategy is more advisable for studies requiring high precision. For these reasons, the analysis of results will focus on comparing the Shannon entropy and KL-divergence approaches.
Firstly,
Table 12 is presented to summarize the advantages and disadvantages of the methods and their specific applications. For this summary, four main evaluation criteria were used. The selection of the criteria—logistical ease and representativeness, methodological robustness, interpretability of results, and flexibility and adaptability—responds to both the operational realities of low-infrastructure regions and the rigor required for demand characterization. Logistical ease and representativeness are essential to ensure that samples not only reflect the true structure of the population but can also be implemented efficiently in areas with limited connectivity, difficult terrain, or budget constraints. Methodological robustness ensures that the results are consistent, reliable, and reproducible, even when faced with variations in data quality, parameter settings, or initial conditions, thereby reinforcing the validity and applicability of the study in real-world scenarios. Interpretability of results is critical for transparent communication and justification of sampling decisions, enabling stakeholders and decision-makers—often without technical backgrounds—to understand, trust, and adopt the findings. Finally, flexibility and adaptability determine the capacity of a methodology to adjust to changing objectives, constraints, or operational environments without losing effectiveness, enhancing its practical value and scalability for different contexts, objectives, and territorial scales.
The results are then compared and discussed in light of the findings reported by other authors.
At a general level, both the entropy-based and the KL-divergence-based strategies generate samples that fit the general population distribution well. The strategy based on divergence offers a better fit to user type distributions and greater inclusion of outliers. The entropy-based strategy provides a better fit to the consumption distribution. The main difference between them lies in the way the samples are geographically distributed within the region.
The entropy-based sampling strategy tends to form clusters of users in specific areas, especially in urban areas. Instead of covering the entire city homogeneously, the sample is concentrated in sectors such as entire neighborhoods. This pattern has been observed in recent studies using entropy analysis to identify consumption patterns in urban environments. For example, a study on water demand applied entropy and time series clustering to identify distinct residential patterns. This work showed how some areas exhibit more homogeneous consumption behaviors than others [
44]. In another study on electricity demand, entropy-based metrics were used to maximize information diversity with small samples. This allowed for optimized characterization and management of residential consumption [
45].
The ability of this approach to generate clusters is useful beyond the electricity sector. In contexts with logistical or budgetary constraints, it allows for the selection of a few clustered users that closely represent the population. This optimizes resource use. In environmental studies, entropy-based methods have been effective in selecting observation points that maximize the collected information, even in complex urban settings. Maximum entropy modeling has been used to select sampling sites in environmental networks. This allowed for capturing heterogeneity with few observation points [
46]. Entropy criteria have also been used to optimize sensor networks in cities. These criteria allowed for the detection of hazardous emission sources with few sensors [
47]. In water networks, entropy metrics helped locate a limited number of sensors that detect leaks with high efficiency [
48]. In industrial contexts, an entropy-based strategy was used to estimate emission sources in chemical plants and optimize sensor placement to capture variability in operating conditions [
49]. These approaches demonstrate the usefulness of entropy as a tool for designing efficient and representative monitoring networks in contexts with operational constraints.
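As an illustration of how an entropy criterion can rank candidate clusters, the following sketch computes the Shannon entropy of the user-type mix in each cluster and selects the most informative one. It is a simplified example under stated assumptions, not the selection procedure used in this study: the cluster names, the user types, and the helper `most_informative_cluster` are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution of `labels`."""
    counts = Counter(labels)
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def most_informative_cluster(clusters):
    """Return the name of the cluster whose category mix has the highest entropy."""
    return max(clusters, key=lambda name: shannon_entropy(clusters[name]))


# Hypothetical candidate clusters (illustrative values only).
candidate_clusters = {
    "neighborhood_A": ["residential"] * 10,
    "neighborhood_B": ["residential"] * 4 + ["commercial"] * 3 + ["industrial"] * 3,
}

for name, users in candidate_clusters.items():
    print(name, round(shannon_entropy(users), 3))
print("selected:", most_informative_cluster(candidate_clusters))
```

A cluster containing a single user type carries zero entropy, whereas a cluster with a mixed composition scores higher, which is why a geographically compact but heterogeneous area can stand in for a larger population.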
The ability of the entropy-based strategy to form clusters that represent complex patterns makes it an effective tool for decision support in heterogeneous environments.
The KL-divergence-based sampling strategy, in contrast, relies on consumption information and offers advantages in contexts where diversity in energy behavior is a priority. This strategy allows for the identification of geographic variability and atypical profiles. It is especially useful in tariff studies, where understanding how different consumption levels affect costs and subsidies is key. Research has shown that tariff designs impact socioeconomic groups differently. These studies highlight the need to adjust tariffs based on variations in consumption to distribute costs more fairly [
50]. The effects of time-of-use tariffs on commercial and industrial users have also been analyzed. This research shows how these structures significantly modify costs for these sectors [
51]. These findings support the use of the KL-divergence-based sampling strategy to capture consumption diversity and to inform more inclusive tariff policies. The strategy allows for the inclusion of users who are often under-represented, such as those in small towns, in neighborhoods belonging to the lowest and highest social strata, and in rural areas. Recent studies highlight the importance of considering spatial variability in sampling processes. This consideration improves representativeness and reduces bias in distribution models [
52]. The KL-divergence-based strategy generates samples with broad territorial coverage. This allows characterizing regions with high socioeconomic disparity and a great diversity of consumption profiles. However, this broad coverage entails greater logistical challenges and higher operating costs, especially if the implementation of sophisticated infrastructures such as AMI is sought. Although representativeness improves, the costs and complexity of collection can limit its use in projects with budgetary or logistical constraints [
53]. The divergence-based strategy is therefore best suited to studies of moderate territorial scope where statistical representativeness is a priority.
The KL-divergence-based approach also applies to sectors such as drinking water and basic sanitation. In these contexts, consumption patterns vary by urban and rural areas and by user type. Sampling that prioritizes diversity allows for more precise identification of the behavior of minority groups. This facilitates the design of more targeted and equitable cross-subsidies or conservation strategies. Studies in Brazil show that increasing block tariffs (IBT) can generate regressive subsidies if not adjusted to local socioeconomic conditions. These studies highlight the importance of tariffs that reflect consumption diversity to improve equity in water access [
54,
55]. In Chile, a water distribution scheme complemented by subsidies was implemented in underdeveloped regions. This policy supported customers living in poverty and demonstrated the need for tariff structures that recognize differences between residential and non-residential users [
56]. In the manufacturing industry, geographic diversification of the customer base has been useful for improving inventory allocation efficiency, especially during periods of economic downturn. A study of manufacturing companies in the United States found that having a geographically diverse customer base allows for more efficient inventory sales and allocation in times of economic crisis [
57]. This result suggests that diverse samples can be drawn from a population and used as market targets. KL divergence has also been useful in artificial intelligence applications, such as in sampling schemes for non-autoregressive language models. In a study of natural language models, KL divergence was used to improve the quality and consistency of text generation by balancing diversity and fidelity [
58]. These applications across different sectors highlight the versatility of the KL-divergence-based and consumption-based sampling strategies in improving representativeness, efficiency, and sustainability in diverse contexts.
The overall analysis concludes that no single strategy is superior in all respects. Each approach offers distinct advantages depending on the required balance between accuracy, diversity, coverage, and practical feasibility. The entropy-based strategy offers the best fit to the consumption curve. It is the most appropriate option when there are logistical constraints, as it concentrates the sampled users in specific areas. The KL-divergence-based strategy excels at capturing the diversity of profiles across the region and is the most robust in terms of representativeness. It is the most comprehensive option when statistical accuracy and territorial coverage are the priority, provided the necessary operational resources are available.
7. Conclusions
This study addressed the hypothesis that it is possible to construct representative and operationally viable energy demand samples in regions without AMI, through the integration of entropy-based, consumption-based, or divergence-minimizing sampling strategies. The results obtained confirm this and provide guidance for low-infrastructure contexts in Colombia.
Considering the objective of offering flexible tools for demand characterization in contexts with limited infrastructure, the KL-divergence strategy is the best option for the grid operator. This strategy yields near-perfect correlation with the customer type (r = 0.9913, p < 0.001) and population type (r = 0.9999, p = 0.0067) distributions, while maintaining good correlation with the consumption distribution (r = 0.8954, p < 0.001). By selecting the best sample across multiple seeds, it reduces sampling bias and supports representativeness in both consumption and user categories. Despite higher logistical costs from spatial dispersion, it balances statistical accuracy and territorial coverage, with a KL divergence of 0.045 and an average user distance difference of 14.31 km. For these reasons, it is recommended when characterization quality is a priority.
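For readers who wish to reproduce this kind of check, the following sketch shows how a Pearson correlation between population and sample category shares can be computed. The share values are hypothetical and are not the study's data. The sketch returns only the coefficient r; the p-values reported above would require an additional significance test that is not shown here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical shares of four customer types (illustrative values only).
population_shares = [0.70, 0.20, 0.08, 0.02]
sample_shares = [0.68, 0.21, 0.09, 0.02]

print(f"r = {pearson_r(population_shares, sample_shares):.4f}")
```

With only a handful of categories, r can be very close to 1 even when individual shares differ by a few percentage points, so the coefficient is best read alongside the KL divergence rather than in isolation.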
The entropy strategy achieved perfect correlation between entropy and consumption distribution (r = 1.0000, p < 0.001) without using consumption as a criterion. It is highly efficient logistically, thanks to geographic clustering. This lowers costs in remote areas, with an average distance difference of 11.48 km. Its limitation is poor representativeness of atypical profiles, reducing its value for comprehensive studies. It is best suited when logistical efficiency is prioritized, especially if complemented by including categories with greater informational diversity.
In terms of robustness and adaptability, entropy and KL-divergence strategies provide reliable but distinct approaches. Entropy sampling, based on categorical variables, is less affected by measurement errors or missing data, fitting limited-infrastructure contexts. KL divergence is more versatile, handling continuous and categorical variables while preserving multiple population dimensions. It quickly converges to similar solutions across random seeds. This ensures consistency in the strategy results. Both are highly interpretable, using familiar variables and clear metrics that aid communication and support institutional adoption, particularly in public-sector planning requiring technical justification for territorial targeting.
The consumption-based method includes atypical profiles and broad geographic coverage but is more sensitive to random seed bias and shows weaker correlation with consumption than other methods. It is better for quick exploratory analyses, while the KL-divergence approach is preferable for more rigorous studies.
There is no universally optimal strategy; each balances logistical constraints, representativeness, and analytical capacity differently. This work provides a comparative framework to help grid operators select and adapt sampling methods to their conditions and objectives.
7.1. Recommendations for Decision-Makers
For grid operators and policy-makers without AMI, the KL-divergence strategy is most suitable when precision and broad representation are priorities. Its ability to capture both consumption patterns and multiple qualitative categories of the population makes it valuable for planning, tariff design, and policy evaluation, although the spatial dispersion of its samples can reduce logistical efficiency. In contrast, the entropy-based strategy fits contexts prioritizing cost-efficiency and logistics, such as remote areas, by clustering users to reduce travel, installation, and maintenance costs.
7.2. Limitations
While the study confirms the feasibility of building representative samples from conventional metering data, several limitations must be noted. The analysis uses only monthly consumption records, missing finer temporal variability such as daily profiles or short-term peaks that affect grid operations. Without AMI-based validation, the strategies cannot yet be benchmarked against real-time behavior, limiting forecast precision. Results also reflect the Colombian case study’s socio-economic mix, grid topology, and rural–urban structure, which may not generalize elsewhere without adaptation. Moreover, the models exclude seasonal effects, demand shifts, or behavioral changes from policy measures. These factors underscore the need for context-specific calibration before broader application.
7.3. Future Research Directions
Future research should integrate geospatial optimization to minimize travel and logistics while preserving representativeness. Testing the strategies in other developing regions will clarify transferability and scalability. Synthetic high-frequency data or partial AMI could enable validation under more detailed consumption dynamics. Future work could also examine hybrid optimization methods that merge entropy metrics and divergence, adjusting the category weights and evaluating which categories yield the lowest divergence in the resulting sample. A multi-objective framework may yield adaptable, resource-efficient sampling solutions aligned with both budgetary and analytical needs.